Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
whoareyou v1 — WASM provider service
Date: 2026-06-05
Status: Approved design
Summary
whoareyou becomes a long-running, async HTTP service that looks up Swedish phone numbers by aggregating scraping providers. Providers are WASM components (WASM Component Model, WASI p2) loaded from disk at startup. The host does all network fetching; components are pure functions from number → requests and responses → parsed entries. The existing CLI is retired.
V1 scope: code only (runs via cargo run, components on disk). One provider:
hitta.se. No container image, no k8s manifests, no upload/enable-disable UI —
those come later.
Decisions log
| Decision | Choice |
|---|---|
| Form factor | Long-running service + HTTP API; CLI retired |
| Deployment target | Self-hosted on own infra (later); v1 is code-only |
| API surface | Single lookup endpoint (+ /healthz as daemon necessity) |
| Provider mechanism | WASM components, fixed set loaded at startup; upload/enable/disable is future work |
| Host/guest boundary | Host fetches; components are pure (no network/fs access) |
| WASM plumbing | Component Model: wasmtime + WIT + wit-bindgen (chosen over Extism and raw-module ABI) |
| Cache | In-memory TTL (moka), parsed results, 24h, key provider:number |
| V1 providers | hitta.se only |
| Old code | CLI, TOML/YAML definitions, bincode cache, orphaned probes: deleted. Hitta parse logic + fixtures: ported |
Workspace layout & toolchain
Cargo workspace, edition 2024:
whoareyou/
├── Cargo.toml            # workspace
├── wit/
│   └── provider.wit      # the provider contract (single source of truth)
├── crates/
│   ├── server/           # bin: axum HTTP service + wasmtime host
│   └── providers/
│       └── hitta/        # cdylib → wasm32-wasip2 component
├── fixtures/             # existing HTML fixtures, reused for component tests
└── justfile              # build components → build server (two-step build)
- Host stack: tokio, axum 0.8, reqwest (current, async), moka 0.12, wasmtime 45 (component-model), thiserror, tracing.
- Provider stack: wit-bindgen in a cdylib crate, built with plain cargo build --target wasm32-wasip2 --release. No cargo-component (stale; the tier-2 wasip2 target makes it unnecessary).
- Deleted: src/main.rs (CLI), src/definition.rs, src/context.rs (bincode cache), definitions/, the four orphaned probe modules, _build.rs.
The WIT contract
wit/provider.wit:
package whoareyou:provider@0.1.0;

interface lookup {
    record provider-info {
        name: string,       // e.g. "hitta.se"; key in API response + cache
        version: string,
    }

    record request {
        url: string,
    }

    record response {
        status: u16,
        body: string,
    }

    record comment {
        timestamp: option<s64>,   // unix epoch seconds, UTC
        title: option<string>,
        message: string,
    }

    record entry {
        messages: list<string>,
        history: list<string>,
        comments: list<comment>,
    }

    variant lookup-error {
        no-data,                // fetched fine, nothing on the page
        parse-failed(string),   // page structure changed; scraper rot signal
    }

    metadata: func() -> provider-info;
    requests: func(number: string) -> list<request>;
    parse: func(number: string, responses: list<response>) -> result<entry, lookup-error>;
}

world provider {
    export lookup;
}
Design points:
- Pure exports, no imports. Components cannot reach network or filesystem. Sandboxed, trivially testable, host owns all I/O policy.
- timestamp: option<s64> replaces the old Date enum. Components normalize site-local dates to UTC epoch seconds (date-only → 00:00:00); option because some sites omit dates. The host never parses dates.
- lookup-error separates no-data from parse-failed. The old Result<Entry, ()> conflated "nothing there" with "scraper broke".
- requests returns a list for single-round fan-out (e.g. two URL formats). No sequential multi-step flows in v1; if a future provider needs fetch→token→fetch, add an optional host-fetch import to the world then.
- Package is versioned (@0.1.0); the future provider-upload feature hangs version negotiation off this.
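The date-only rule in the design points can be illustrated with a small std-only sketch. The names date_to_epoch_utc and days_from_civil are hypothetical and not part of the ported code. The sketch assumes ISO YYYY-MM-DD input and treats the date as already being in UTC; a real component would still have to account for the site-local offset.

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date
/// (the widely used "days from civil" algorithm).
fn days_from_civil(year: i64, month: u32, day: u32) -> i64 {
    // Treat the year as starting in March so leap days fall at year end.
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400);
    let year_of_era = y.rem_euclid(400); // 0..=399
    let shifted_month = (i64::from(month) + 9) % 12; // March = 0
    let day_of_year = (153 * shifted_month + 2) / 5 + i64::from(day) - 1;
    let day_of_era = year_of_era * 365 + year_of_era / 4 - year_of_era / 100 + day_of_year;
    era * 146_097 + day_of_era - 719_468
}

/// Hypothetical helper: "YYYY-MM-DD" -> epoch seconds at 00:00:00 UTC.
/// Returns None for anything that does not parse, matching `option<s64>`.
/// Only a coarse range check is done; month lengths are not validated.
fn date_to_epoch_utc(date: &str) -> Option<i64> {
    let mut parts = date.trim().splitn(3, '-');
    let year: i64 = parts.next()?.parse().ok()?;
    let month: u32 = parts.next()?.parse().ok()?;
    let day: u32 = parts.next()?.parse().ok()?;
    if !(1..=12).contains(&month) || !(1..=31).contains(&day) {
        return None;
    }
    Some(days_from_civil(year, month, day) * 86_400)
}
```

Under these assumptions date_to_epoch_utc("2019-01-17") yields 1547683200, the midnight that starts the same UTC day as the 1547746162 timestamp in the API example.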
The host service
Startup
- Load config from env (WHOAREYOU_ prefix): listen addr, components dir, cache TTL, fetch timeout.
- Scan components/*.wasm, compile each with wasmtime, call metadata() once; fail fast on components that don't satisfy the WIT world. Log the loaded provider set.
- Serve HTTP.
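The config-loading step could look roughly like the sketch below. The variable names after the WHOAREYOU_ prefix, the struct, and every default except the 24h cache TTL are assumptions made for illustration; the design only fixes the prefix and the four settings.

```rust
use std::time::Duration;

/// Hypothetical config struct. Field names and defaults (other than the
/// 24h cache TTL from the decisions log) are assumptions.
#[derive(Debug, PartialEq)]
struct Config {
    listen_addr: String,
    components_dir: String,
    cache_ttl: Duration,
    fetch_timeout: Duration,
}

impl Config {
    /// Build the config from a lookup function so tests never have to
    /// mutate the process environment. The service would call
    /// `Config::from_lookup(|key| std::env::var(key).ok())`.
    fn from_lookup(get: impl Fn(&str) -> Option<String>) -> Result<Config, String> {
        // Parse an optional "whole seconds" variable, falling back to a default.
        let secs = |key: &str, default: u64| -> Result<Duration, String> {
            match get(key) {
                None => Ok(Duration::from_secs(default)),
                Some(raw) => raw
                    .parse::<u64>()
                    .map(Duration::from_secs)
                    .map_err(|e| format!("{key}: invalid seconds value {raw:?}: {e}")),
            }
        };
        Ok(Config {
            listen_addr: get("WHOAREYOU_LISTEN_ADDR").unwrap_or_else(|| "127.0.0.1:8080".into()),
            components_dir: get("WHOAREYOU_COMPONENTS_DIR").unwrap_or_else(|| "components".into()),
            cache_ttl: secs("WHOAREYOU_CACHE_TTL_SECS", 24 * 60 * 60)?,
            fetch_timeout: secs("WHOAREYOU_FETCH_TIMEOUT_SECS", 10)?,
        })
    }
}
```

Returning an error for a malformed value, rather than silently using the default, keeps the fail-fast behaviour of startup consistent with how bad components are handled.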
API
GET /api/v1/number/{number}
GET /healthz
Response shape:
{
  "number": "0700000000",
  "results": {
    "hitta.se": {
      "status": "ok",
      "entry": {
        "messages": [],
        "history": ["42 andra har rapporterat detta nummer"],
        "comments": [
          { "timestamp": 1547746162, "title": null, "message": "Varmsälj från Folksam" }
        ]
      }
    }
  }
}
- Per-provider status: "ok" | "no_data" | "fetch_failed" | "parse_failed". One provider failing never fails the request.
- HTTP 200 whenever the lookup ran; 400 only for invalid number format.
- /healthz is the one deliberate addition to "just lookup": a supervised daemon needs it.
Lookup flow
- Normalize the number (strip spaces and dashes; deliberately minimal, no full E.164 in v1). The normalized form is the number part of the cache key.
- Check moka cache (key provider:number, value: per-provider result, TTL 24h).
- On miss, per provider concurrently: requests(number) → host fetches each URL via reqwest (shared client, timeout, descriptive User-Agent) → parse(number, responses) → cache the result.
- Assemble the JSON response.
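A minimal sketch of the normalization and cache-key steps, using only the standard library. The function names and the length bounds are assumptions; the design specifies only "strip spaces/dashes" and the provider:number key format.

```rust
/// Hypothetical normalizer: strips spaces and dashes, then accepts the
/// result only if it is all ASCII digits. The 5..=15 length bound is an
/// assumption, not part of the design. None maps to HTTP 400.
fn normalize_number(raw: &str) -> Option<String> {
    let cleaned: String = raw.chars().filter(|c| *c != ' ' && *c != '-').collect();
    let valid = (5..=15).contains(&cleaned.len())
        && cleaned.bytes().all(|b| b.is_ascii_digit());
    valid.then_some(cleaned)
}

/// Cache key in the `provider:number` form from the decisions log.
fn cache_key(provider: &str, number: &str) -> String {
    format!("{provider}:{number}")
}
```

Because the key is built from the normalized form, "070-000 00 00" and "0700000000" share one cache entry per provider.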
Wasmtime mechanics
- One Engine + one compiled Component per provider at startup.
- Fresh Store + instance per call: cheap, and no state bleeds between lookups.
- Guest calls run in spawn_blocking (CPU-bound parsing, no async in guest).
- Epoch-based deadline per call (Engine epoch interruption, deadline a few seconds out) so a runaway component can't hang the service. This matters once uploaded third-party components exist.
Errors and logging
- Host-side thiserror enum (FetchFailed, ParseFailed, ComponentTrap, …) mapped to the per-provider status strings.
- tracing structured logs; parse_failed logs at WARN because it means a scraper rotted and needs attention.
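The mapping from host errors to status strings might be sketched as follows. To stay runnable without dependencies it implements Display by hand where the real enum would derive thiserror::Error. The variant payloads, the NoData variant, and the choice to report a trap as parse_failed are all assumptions; the design names only three variants and four status strings.

```rust
use std::fmt;

/// Hypothetical host-side error. Payload types are assumptions.
#[derive(Debug)]
enum HostError {
    NoData,
    FetchFailed(String),
    ParseFailed(String),
    ComponentTrap(String),
}

impl fmt::Display for HostError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            HostError::NoData => write!(f, "no data on page"),
            HostError::FetchFailed(why) => write!(f, "fetch failed: {why}"),
            HostError::ParseFailed(why) => write!(f, "parse failed: {why}"),
            HostError::ComponentTrap(why) => write!(f, "component trapped: {why}"),
        }
    }
}

impl std::error::Error for HostError {}

/// Per-provider status string for the API response.
fn status<T>(result: &Result<T, HostError>) -> &'static str {
    match result {
        Ok(_) => "ok",
        Err(HostError::NoData) => "no_data",
        Err(HostError::FetchFailed(_)) => "fetch_failed",
        Err(HostError::ParseFailed(_)) => "parse_failed",
        // Assumption: the design lists no dedicated status for traps,
        // so this sketch folds them into parse_failed.
        Err(HostError::ComponentTrap(_)) => "parse_failed",
    }
}
```

Keeping the match exhaustive means a new error variant cannot be added without deciding which status string it surfaces as.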
The hitta component
crates/providers/hitta:
- wit-bindgen generates the trait from wit/provider.wit.
- metadata() → { name: "hitta.se", version: <crate version> }.
- requests(number) → ["https://www.hitta.se/vem-ringde/{number}"].
- parse() ports src/probe/hitta.rs: regex out __NEXT_DATA__, serde-deserialize, map comments to epoch timestamps. The old Err(()) paths split: regex miss → parse-failed; JSON ok but no phone_data → no-data.
- First implementation task: verify the 2019 fixtures against today's hitta.se. The site has likely moved off __NEXT_DATA__-in-a-script-tag. Fetch fresh fixtures and update the parser to match reality before anything is declared working.
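The split of the old Err(()) paths can be sketched in plain Rust. This is not the ported parser: plain string search stands in for the regex, a substring check stands in for serde deserialization, and the script-tag shape is the conventional Next.js one, which the current site may no longer use. The function names are hypothetical.

```rust
/// Mirrors the WIT `lookup-error` variant.
#[derive(Debug, PartialEq)]
enum LookupError {
    NoData,
    ParseFailed(String),
}

/// The single request the component asks the host to fetch.
fn requests(number: &str) -> Vec<String> {
    vec![format!("https://www.hitta.se/vem-ringde/{number}")]
}

/// Extract the JSON payload of the `__NEXT_DATA__` script tag.
/// A missing tag means the page structure changed: parse-failed.
fn extract_next_data(html: &str) -> Result<&str, LookupError> {
    let missing = |what: &str| LookupError::ParseFailed(format!("{what} not found"));
    let marker = html
        .find("id=\"__NEXT_DATA__\"")
        .ok_or_else(|| missing("__NEXT_DATA__ script tag"))?;
    let rest = &html[marker..];
    let start = rest.find('>').ok_or_else(|| missing("end of opening tag"))? + 1;
    let body = &rest[start..];
    let end = body.find("</script>").ok_or_else(|| missing("closing script tag"))?;
    Ok(&body[..end])
}

/// Crude stand-in for the serde step: a payload that parses out fine
/// but carries no `phone_data` key counts as no-data.
fn classify(html: &str) -> Result<&str, LookupError> {
    let json = extract_next_data(html)?;
    if json.contains("\"phone_data\"") {
        Ok(json)
    } else {
        Err(LookupError::NoData)
    }
}
```

Because these are plain functions over strings, they can be exercised natively against fixture files without building the component.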
Testing
Three layers, no live network anywhere:
- Component logic, native: parse logic in plain functions; unit tests run natively against fixtures/hitta/*.html with current insta (migrating existing snapshots). No WASM in the loop, so iteration is fast.
- Component contract, in wasmtime: one host-side integration test loads the built .wasm, feeds it a fixture body as a response, asserts on the returned entry. Proves the WIT boundary + build pipeline.
- HTTP layer: axum handler tests with the provider/fetch layer behind a small trait, so API-shape tests need neither network nor WASM.
fetch-fixture (updated for current site URLs) remains the manual tool for
refreshing fixtures.
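One possible shape for the "small trait" in the HTTP-layer tests is sketched here. The trait and type names are hypothetical, and the sketch is synchronous for brevity where the real host would be async on top of reqwest.

```rust
use std::collections::HashMap;

/// What the host hands to a component's `parse`, mirroring the WIT record.
#[derive(Debug, Clone, PartialEq)]
struct Response {
    status: u16,
    body: String,
}

/// Hypothetical seam between the handlers and the network.
trait Fetcher {
    fn fetch(&self, url: &str) -> Result<Response, String>;
}

/// Test double: serves canned fixture bodies and never touches the network.
struct FixtureFetcher {
    pages: HashMap<String, String>,
}

impl Fetcher for FixtureFetcher {
    fn fetch(&self, url: &str) -> Result<Response, String> {
        self.pages
            .get(url)
            .map(|body| Response { status: 200, body: body.clone() })
            .ok_or_else(|| format!("no fixture for {url}"))
    }
}

/// Host side of the flow: fetch every URL the provider asked for, in order.
/// The first fetch error aborts this provider's lookup only.
fn fetch_all(fetcher: &dyn Fetcher, urls: &[String]) -> Result<Vec<Response>, String> {
    urls.iter().map(|url| fetcher.fetch(url)).collect()
}
```

A handler test would construct a FixtureFetcher from files under fixtures/ and assert on the resulting JSON shape, while production code would supply a reqwest-backed implementation of the same trait.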
Future work (explicitly out of v1)
- Container image, k8s/Pithos config, CI to registry
- Upload + enable/disable custom providers (API/UI)
- More providers (audit the four 2019 sites for survivors)
- Host-fetch import in the WIT world for multi-step providers
- Lookup history / persistent cache
- Metrics (Prometheus/OTel)