From 4093c344beeb721592cf968c0797d95c98826100 Mon Sep 17 00:00:00 2001
From: Anders Olsson
Date: Fri, 5 Jun 2026 14:24:01 +0200
Subject: [PATCH] =?UTF-8?q?docs:=20add=20v1=20design=20spec=20=E2=80=94=20?=
 =?UTF-8?q?WASM=20provider=20service?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 ...2026-06-05-wasm-provider-service-design.md | 232 ++++++++++++++++++
 1 file changed, 232 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-06-05-wasm-provider-service-design.md

diff --git a/docs/superpowers/specs/2026-06-05-wasm-provider-service-design.md b/docs/superpowers/specs/2026-06-05-wasm-provider-service-design.md
new file mode 100644
index 0000000..d02ad72
--- /dev/null
+++ b/docs/superpowers/specs/2026-06-05-wasm-provider-service-design.md
@@ -0,0 +1,232 @@
+# whoareyou v1 — WASM provider service
+
+**Date:** 2026-06-05
+**Status:** Approved design
+
+## Summary
+
+whoareyou becomes a long-running, async HTTP service that looks up Swedish
+phone numbers by aggregating scraping providers. Providers are **WASM
+components** (WASM Component Model, WASI p2) loaded from disk at startup. The
+host does all network fetching; components are pure functions from number →
+requests and responses → parsed entries. The existing CLI is retired.
+
+V1 scope: code only (runs via `cargo run`, components on disk). One provider:
+hitta.se. No container image, no k8s manifests, no upload/enable-disable UI —
+those come later.
+
+## Decisions log
+
+| Decision | Choice |
+|---|---|
+| Form factor | Long-running service + HTTP API; CLI retired |
+| Deployment target | Self-hosted on own infra (later); v1 is code-only |
+| API surface | Single lookup endpoint (+ `/healthz` as daemon necessity) |
+| Provider mechanism | WASM components, fixed set loaded at startup; upload/enable/disable is future work |
+| Host/guest boundary | Host fetches; components are pure (no network/fs access) |
+| WASM plumbing | Component Model: wasmtime + WIT + wit-bindgen (chosen over Extism and raw-module ABI) |
+| Cache | In-memory TTL (moka), parsed results, 24h, key `provider:number` |
+| V1 providers | hitta.se only |
+| Old code | CLI, TOML/YAML definitions, bincode cache, orphaned probes: deleted. Hitta parse logic + fixtures: ported |
+
+## Workspace layout & toolchain
+
+Cargo workspace, edition 2024:
+
+```
+whoareyou/
+├── Cargo.toml            # workspace
+├── wit/
+│   └── provider.wit      # the provider contract (single source of truth)
+├── crates/
+│   ├── server/           # bin: axum HTTP service + wasmtime host
+│   └── providers/
+│       └── hitta/        # cdylib → wasm32-wasip2 component
+├── fixtures/             # existing HTML fixtures, reused for component tests
+└── justfile              # build components → build server (two-step build)
+```
+
+- **Host stack:** tokio, axum 0.8, reqwest (current, async), moka 0.12,
+  wasmtime 45 (`component-model`), thiserror, tracing.
+- **Provider stack:** `wit-bindgen` in a `cdylib` crate, built with plain
+  `cargo build --target wasm32-wasip2 --release`. No `cargo-component`
+  (stale; the tier-2 wasip2 target makes it unnecessary).
+- **Deleted:** `src/main.rs` (CLI), `src/definition.rs`, `src/context.rs`
+  (bincode cache), `definitions/`, the four orphaned probe modules, `_build.rs`.
+
+## The WIT contract
+
+`wit/provider.wit`:
+
+```wit
+package whoareyou:provider@0.1.0;
+
+interface lookup {
+    record provider-info {
+        name: string,               // e.g. "hitta.se" — key in API response + cache
+        version: string,
+    }
+
+    record request {
+        url: string,
+    }
+
+    record response {
+        status: u16,
+        body: string,
+    }
+
+    record comment {
+        timestamp: option,          // unix epoch seconds, UTC
+        title: option<string>,
+        message: string,
+    }
+
+    record entry {
+        messages: list<string>,
+        history: list<string>,
+        comments: list<comment>,
+    }
+
+    variant lookup-error {
+        no-data,                    // fetched fine, nothing on the page
+        parse-failed(string),       // page structure changed — scraper rot signal
+    }
+
+    metadata: func() -> provider-info;
+    requests: func(number: string) -> list<request>;
+    parse: func(number: string, responses: list<response>) -> result<entry, lookup-error>;
+}
+
+world provider {
+    export lookup;
+}
+```
+
+Design points:
+
+- **Pure exports, no imports.** Components cannot reach network or
+  filesystem. Sandboxed, trivially testable, host owns all I/O policy.
+- **`timestamp: option`** replaces the old `Date` enum. Components
+  normalize site-local dates to UTC epoch seconds (date-only → 00:00:00);
+  `option` because some sites omit dates. The host never parses dates.
+- **`lookup-error` separates `no-data` from `parse-failed`** — the old
+  `Result` conflated "nothing there" with "scraper broke".
+- **`requests` returns a list** for single-round fan-out (e.g. two URL
+  formats). No sequential multi-step flows in v1; if a future provider needs
+  fetch→token→fetch, add an optional host-fetch import to the world then.
+- Package is versioned (`@0.1.0`); future provider-upload feature hangs
+  version negotiation off this.
+
+## The host service
+
+### Startup
+
+1. Load config from env (`WHOAREYOU_` prefix): listen addr, components dir,
+   cache TTL, fetch timeout.
+2. Scan `components/*.wasm`, compile each with wasmtime, call `metadata()`
+   once; **fail fast** on components that don't satisfy the WIT world. Log
+   the loaded provider set.
+3. Serve HTTP.
+
+### API
+
+```
+GET /api/v1/number/{number}
+GET /healthz
+```
+
+Response shape:
+
+```json
+{
+  "number": "0700000000",
+  "results": {
+    "hitta.se": {
+      "status": "ok",
+      "entry": {
+        "messages": [],
+        "history": ["42 andra har rapporterat detta nummer"],
+        "comments": [
+          { "timestamp": 1547746162, "title": null, "message": "Varmsälj från Folksam" }
+        ]
+      }
+    }
+  }
+}
+```
+
+- Per-provider `status`: `"ok"` | `"no_data"` | `"fetch_failed"` |
+  `"parse_failed"`. One provider failing never fails the request.
+- HTTP 200 whenever the lookup ran; 400 only for invalid number format.
+- `/healthz` is the one deliberate addition to "just lookup" — a supervised
+  daemon needs it.
+
+### Lookup flow
+
+1. Normalize the number (strip spaces/dashes; minimal — no full E.164 in
+   v1). Normalized form is the cache key.
+2. Check moka cache (key `provider:number`, value: per-provider result,
+   TTL 24h).
+3. On miss, per provider **concurrently**: `requests(number)` → host fetches
+   each URL via reqwest (shared client, timeout, descriptive User-Agent) →
+   `parse(number, responses)` → cache the result.
+4. Assemble the JSON response.
+
+### Wasmtime mechanics
+
+- One `Engine` + one compiled `Component` per provider at startup.
+- Fresh `Store` + instance per call — cheap, and no state bleeds between
+  lookups.
+- Guest calls run in `spawn_blocking` (CPU-bound parsing, no async in guest).
+- Epoch-based deadline per call (`Engine` epoch interruption, deadline a few
+  seconds out) so a runaway component can't hang the service — matters once
+  uploaded third-party components exist.
+
+### Errors and logging
+
+- Host-side `thiserror` enum (`FetchFailed`, `ParseFailed`, `ComponentTrap`,
+  …) mapped to the per-provider `status` strings.
+- `tracing` structured logs; `parse_failed` logs at WARN — it means a
+  scraper rotted and needs attention.
+
+## The hitta component
+
+`crates/providers/hitta`:
+
+- `wit-bindgen` generates the trait from `wit/provider.wit`.
+- `metadata()` → `{ name: "hitta.se", version: }`.
+- `requests(number)` → `["https://www.hitta.se/vem-ringde/{number}"]`.
+- `parse()` ports `src/probe/hitta.rs`: regex out `__NEXT_DATA__`,
+  serde-deserialize, map comments to epoch timestamps. The old `Err(())`
+  paths split: regex miss → `parse-failed`; JSON ok but no `phone_data` →
+  `no-data`.
+- **First implementation task: verify the 2019 fixtures against today's
+  hitta.se.** The site has likely moved off `__NEXT_DATA__`-in-a-script-tag.
+  Fetch fresh fixtures and update the parser to match reality before
+  anything is declared working.
+
+## Testing
+
+Three layers, no live network anywhere:
+
+1. **Component logic, native:** parse logic in plain functions; unit tests
+   run natively against `fixtures/hitta/*.html` with current insta
+   (migrating existing snapshots). No WASM in the loop — fast iteration.
+2. **Component contract, in wasmtime:** one host-side integration test loads
+   the built `.wasm`, feeds it a fixture body as a `response`, asserts on the
+   returned `entry`. Proves the WIT boundary + build pipeline.
+3. **HTTP layer:** axum handler tests with the provider/fetch layer behind a
+   small trait — API-shape tests need neither network nor WASM.
+
+`fetch-fixture` (updated for current site URLs) remains the manual tool for
+refreshing fixtures.
+
+## Future work (explicitly out of v1)
+
+- Container image, k8s/Pithos config, CI to registry
+- Upload + enable/disable custom providers (API/UI)
+- More providers (audit the four 2019 sites for survivors)
+- Host-fetch import in the WIT world for multi-step providers
+- Lookup history / persistent cache
+- Metrics (Prometheus/OTel)
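
Step 1 of the "Lookup flow" section, the `provider:number` cache key, and the per-provider `status` strings from the API section could be sketched roughly as follows, using only the Rust standard library. The names `normalize_number`, `cache_key`, and `Status` are hypothetical and do not come from the patch. Rejecting every character other than digits, spaces, and dashes is also an assumption: the spec only says to strip spaces and dashes and to return 400 for an invalid number format.

```rust
/// Per-provider outcome, mirroring the `status` strings in the API section.
/// Hypothetical type: the spec names the strings, not a Rust enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    NoData,
    FetchFailed,
    ParseFailed,
}

impl Status {
    /// The string placed in the JSON response for this outcome.
    pub fn as_str(self) -> &'static str {
        match self {
            Status::Ok => "ok",
            Status::NoData => "no_data",
            Status::FetchFailed => "fetch_failed",
            Status::ParseFailed => "parse_failed",
        }
    }
}

/// Strip spaces and dashes, then require a non-empty run of ASCII digits.
/// Assumption: anything else (letters, '+', empty input) counts as an
/// invalid number, which the handler would turn into HTTP 400.
pub fn normalize_number(raw: &str) -> Option<String> {
    let stripped: String = raw
        .chars()
        .filter(|c| !matches!(c, ' ' | '-'))
        .collect();
    if !stripped.is_empty() && stripped.chars().all(|c| c.is_ascii_digit()) {
        Some(stripped)
    } else {
        None
    }
}

/// Build the cache key in the `provider:number` form from the decisions log.
pub fn cache_key(provider: &str, number: &str) -> String {
    format!("{provider}:{number}")
}
```

With these definitions, `normalize_number("070-000 00 00")` yields `Some("0700000000")`, and `cache_key("hitta.se", "0700000000")` yields `hitta.se:0700000000`. Keeping normalization in one pure function means the same normalized string feeds both the cache lookup and the `number` argument passed to each component.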