# whoareyou v1: WASM provider service

**Date:** 2026-06-05

**Status:** Approved design

## Summary

whoareyou becomes a long-running, async HTTP service that looks up Swedish phone numbers by aggregating scraping providers. Providers are **WASM components** (WASM Component Model, WASI p2) loaded from disk at startup. The host does all network fetching; components are pure functions from number → requests and responses → parsed entries. The existing CLI is retired.

V1 scope: code only (runs via `cargo run`, components on disk). One provider: hitta.se. No container image, no k8s manifests, no upload/enable-disable UI; those come later.

## Decisions log

| Decision | Choice |
|---|---|
| Form factor | Long-running service + HTTP API; CLI retired |
| Deployment target | Self-hosted on own infra (later); v1 is code-only |
| API surface | Single lookup endpoint (+ `/healthz` as daemon necessity) |
| Provider mechanism | WASM components, fixed set loaded at startup; upload/enable/disable is future work |
| Host/guest boundary | Host fetches; components are pure (no network/fs access) |
| WASM plumbing | Component Model: wasmtime + WIT + wit-bindgen (chosen over Extism and raw-module ABI) |
| Cache | In-memory TTL (moka), parsed results, 24h, key `provider:number` |
| V1 providers | hitta.se only |
| Old code | CLI, TOML/YAML definitions, bincode cache, orphaned probes: deleted. Hitta parse logic + fixtures: ported |

## Workspace layout & toolchain

Cargo workspace, edition 2024:

```
whoareyou/
├── Cargo.toml            # workspace
├── wit/
│   └── provider.wit      # the provider contract (single source of truth)
├── crates/
│   ├── server/           # bin: axum HTTP service + wasmtime host
│   └── providers/
│       └── hitta/        # cdylib → wasm32-wasip2 component
├── fixtures/             # existing HTML fixtures, reused for component tests
└── justfile              # build components → build server (two-step build)
```

- **Host stack:** tokio, axum 0.8, reqwest (current, async), moka 0.12, wasmtime 45 (`component-model`), thiserror, tracing.
- **Provider stack:** `wit-bindgen` in a `cdylib` crate, built with plain `cargo build --target wasm32-wasip2 --release`. No `cargo-component` (stale; the tier-2 wasip2 target makes it unnecessary).
- **Deleted:** `src/main.rs` (CLI), `src/definition.rs`, `src/context.rs` (bincode cache), `definitions/`, the four orphaned probe modules, `_build.rs`.

## The WIT contract

`wit/provider.wit`:

```wit
package whoareyou:provider@0.1.0;

interface lookup {
    record provider-info {
        name: string,       // e.g. "hitta.se"; key in API response + cache
        version: string,
    }

    record request {
        url: string,
    }

    record response {
        status: u16,
        body: string,
    }

    record comment {
        timestamp: option<u64>,   // unix epoch seconds, UTC
        title: option<string>,
        message: string,
    }

    record entry {
        messages: list<string>,
        history: list<string>,
        comments: list<comment>,
    }

    variant lookup-error {
        no-data,                  // fetched fine, nothing on the page
        parse-failed(string),     // page structure changed; scraper rot signal
    }

    metadata: func() -> provider-info;
    requests: func(number: string) -> list<request>;
    parse: func(number: string, responses: list<response>) -> result<entry, lookup-error>;
}

world provider {
    export lookup;
}
```

Design points:

- **Pure exports, no imports.** Components cannot reach network or filesystem. Sandboxed, trivially testable, host owns all I/O policy.
- **`timestamp: option<u64>`** replaces the old `Date` enum.
  Components normalize site-local dates to UTC epoch seconds (date-only → 00:00:00); `option` because some sites omit dates. The host never parses dates.
- **`lookup-error` separates `no-data` from `parse-failed`**: the old `Result` conflated "nothing there" with "scraper broke".
- **`requests` returns a list** for single-round fan-out (e.g. two URL formats). No sequential multi-step flows in v1; if a future provider needs fetch→token→fetch, add an optional host-fetch import to the world then.
- Package is versioned (`@0.1.0`); future provider-upload feature hangs version negotiation off this.

## The host service

### Startup

1. Load config from env (`WHOAREYOU_` prefix): listen addr, components dir, cache TTL, fetch timeout.
2. Scan `components/*.wasm`, compile each with wasmtime, call `metadata()` once; **fail fast** on components that don't satisfy the WIT world. Log the loaded provider set.
3. Serve HTTP.

### API

```
GET /api/v1/number/{number}
GET /healthz
```

Response shape:

```json
{
  "number": "0700000000",
  "results": {
    "hitta.se": {
      "status": "ok",
      "entry": {
        "messages": [],
        "history": ["42 andra har rapporterat detta nummer"],
        "comments": [
          {
            "timestamp": 1547746162,
            "title": null,
            "message": "Varmsälj från Folksam"
          }
        ]
      }
    }
  }
}
```

- Per-provider `status`: `"ok"` | `"no_data"` | `"fetch_failed"` | `"parse_failed"`. One provider failing never fails the request.
- HTTP 200 whenever the lookup ran; 400 only for invalid number format.
- `/healthz` is the one deliberate addition to "just lookup": a supervised daemon needs it.

### Lookup flow

1. Normalize the number (strip spaces/dashes; minimal, no full E.164 in v1). Normalized form is the cache key.
2. Check moka cache (key `provider:number`, value: per-provider result, TTL 24h).
3. On miss, per provider **concurrently**: `requests(number)` → host fetches each URL via reqwest (shared client, timeout, descriptive User-Agent) → `parse(number, responses)` → cache the result.
4. Assemble the JSON response.
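
The normalization step, the `provider:number` cache key, and the per-provider status strings can be sketched in plain Rust as follows. This is a minimal std-only sketch, not the service's implementation: `normalize_number`, `cache_key`, and `ProviderStatus` are hypothetical names, and the validation rule (digits with an optional leading `+`) is an assumption, since the design only says "strip spaces/dashes" and "400 only for invalid number format".

```rust
/// Hypothetical helper: strip spaces and dashes, then accept only digits with
/// an optional leading '+'. The exact validation rule is an assumption.
/// `None` would correspond to the HTTP 400 case.
fn normalize_number(raw: &str) -> Option<String> {
    let cleaned: String = raw.chars().filter(|c| !matches!(c, ' ' | '-')).collect();
    let digits = cleaned.strip_prefix('+').unwrap_or(cleaned.as_str());
    if digits.is_empty() || !digits.chars().all(|c| c.is_ascii_digit()) {
        return None;
    }
    Some(cleaned)
}

/// Cache key in the `provider:number` form named in the decisions log.
fn cache_key(provider: &str, normalized: &str) -> String {
    format!("{provider}:{normalized}")
}

/// Per-provider outcome; the strings are the `status` values from the API section.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ProviderStatus {
    Ok,
    NoData,
    FetchFailed,
    ParseFailed,
}

impl ProviderStatus {
    fn as_str(self) -> &'static str {
        match self {
            ProviderStatus::Ok => "ok",
            ProviderStatus::NoData => "no_data",
            ProviderStatus::FetchFailed => "fetch_failed",
            ProviderStatus::ParseFailed => "parse_failed",
        }
    }
}
```

With these helpers, `normalize_number("070-000 00 00")` yields `Some("0700000000")`, and `cache_key("hitta.se", "0700000000")` yields `hitta.se:0700000000`. Keeping normalization in one function means the handler and the cache can never disagree on what counts as the same number.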
### Wasmtime mechanics

- One `Engine` + one compiled `Component` per provider at startup.
- Fresh `Store` + instance per call: cheap, and no state bleeds between lookups.
- Guest calls run in `spawn_blocking` (CPU-bound parsing, no async in guest).
- Epoch-based deadline per call (`Engine` epoch interruption, deadline a few seconds out) so a runaway component can't hang the service; this matters once uploaded third-party components exist.

### Errors and logging

- Host-side `thiserror` enum (`FetchFailed`, `ParseFailed`, `ComponentTrap`, …) mapped to the per-provider `status` strings.
- `tracing` structured logs; `parse_failed` logs at WARN, since it means a scraper rotted and needs attention.

## The hitta component

`crates/providers/hitta`:

- `wit-bindgen` generates the trait from `wit/provider.wit`.
- `metadata()` → `{ name: "hitta.se", version: }`.
- `requests(number)` → `["https://www.hitta.se/vem-ringde/{number}"]`.
- `parse()` ports `src/probe/hitta.rs`: regex out `__NEXT_DATA__`, serde-deserialize, map comments to epoch timestamps. The old `Err(())` paths split: regex miss → `parse-failed`; JSON ok but no `phone_data` → `no-data`.
- **First implementation task: verify the 2019 fixtures against today's hitta.se.** The site has likely moved off `__NEXT_DATA__`-in-a-script-tag. Fetch fresh fixtures and update the parser to match reality before anything is declared working.

## Testing

Three layers, no live network anywhere:

1. **Component logic, native:** parse logic in plain functions; unit tests run natively against `fixtures/hitta/*.html` with current insta (migrating existing snapshots). No WASM in the loop, so iteration is fast.
2. **Component contract, in wasmtime:** one host-side integration test loads the built `.wasm`, feeds it a fixture body as a `response`, asserts on the returned `entry`. Proves the WIT boundary + build pipeline.
3. **HTTP layer:** axum handler tests with the provider/fetch layer behind a small trait; API-shape tests need neither network nor WASM.

`fetch-fixture` (updated for current site URLs) remains the manual tool for refreshing fixtures.

## Future work (explicitly out of v1)

- Container image, k8s/Pithos config, CI to registry
- Upload + enable/disable custom providers (API/UI)
- More providers (audit the four 2019 sites for survivors)
- Host-fetch import in the WIT world for multi-step providers
- Lookup history / persistent cache
- Metrics (Prometheus/OTel)
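
As an illustration of the `parse-failed` / `no-data` split described under "The hitta component", here is a minimal std-only sketch of the kind of plain function the native test layer would exercise. It is not the ported parser: `extract_next_data` and `classify` are hypothetical names, plain string search stands in for the regex, and a substring check stands in for the serde step. It also assumes the page still carries a `<script id="__NEXT_DATA__">` tag, which the design itself flags as unverified.

```rust
/// Mirrors the `lookup-error` variant from the WIT contract.
#[derive(Debug, PartialEq)]
enum LookupError {
    NoData,
    ParseFailed(String),
}

/// Hypothetical helper: return the JSON text inside the
/// `<script id="__NEXT_DATA__" ...>` tag. Plain string search is used here
/// in place of the regex mentioned in the design.
fn extract_next_data(html: &str) -> Result<&str, LookupError> {
    let fail = |what: &str| LookupError::ParseFailed(what.to_string());

    let marker = html
        .find("id=\"__NEXT_DATA__\"")
        .ok_or_else(|| fail("__NEXT_DATA__ script tag not found"))?;
    let rest = &html[marker..];
    let open_end = rest
        .find('>')
        .ok_or_else(|| fail("unterminated script tag"))?;
    let body = &rest[open_end + 1..];
    let close = body
        .find("</script>")
        .ok_or_else(|| fail("missing </script>"))?;
    Ok(body[..close].trim())
}

/// Structure missing → `ParseFailed` (scraper rot); structure present but no
/// `phone_data` → `NoData`. The substring check is a stand-in: a real parser
/// would deserialize the JSON and look for the field structurally.
fn classify(html: &str) -> Result<&str, LookupError> {
    let json = extract_next_data(html)?;
    if !json.contains("\"phone_data\"") {
        return Err(LookupError::NoData);
    }
    Ok(json)
}
```

Because both functions take and return plain string slices, they can be tested natively against fixture files with no WASM in the loop; the generated `wit-bindgen` trait implementation would only need to adapt their result to the contract's types.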