docs: add v1 design spec — WASM provider service

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 14:24:01 +02:00
parent 831d5c4184
commit 4093c344be
@@ -0,0 +1,232 @@
# whoareyou v1 — WASM provider service
**Date:** 2026-06-05
**Status:** Approved design
## Summary
whoareyou becomes a long-running, async HTTP service that looks up Swedish
phone numbers by aggregating scraping providers. Providers are **WASM
components** (WASM Component Model, WASI p2) loaded from disk at startup. The
host does all network fetching; components are pure functions that map a
number → requests, and responses → parsed entries. The existing CLI is retired.
V1 scope: code only (runs via `cargo run`, components on disk). One provider:
hitta.se. No container image, no k8s manifests, no upload/enable-disable UI —
those come later.
## Decisions log
| Decision | Choice |
|---|---|
| Form factor | Long-running service + HTTP API; CLI retired |
| Deployment target | Self-hosted on own infra (later); v1 is code-only |
| API surface | Single lookup endpoint (+ `/healthz` as daemon necessity) |
| Provider mechanism | WASM components, fixed set loaded at startup; upload/enable/disable is future work |
| Host/guest boundary | Host fetches; components are pure (no network/fs access) |
| WASM plumbing | Component Model: wasmtime + WIT + wit-bindgen (chosen over Extism and raw-module ABI) |
| Cache | In-memory TTL (moka), parsed results, 24h, key `provider:number` |
| V1 providers | hitta.se only |
| Old code | CLI, TOML/YAML definitions, bincode cache, orphaned probes: deleted. Hitta parse logic + fixtures: ported |
## Workspace layout & toolchain
Cargo workspace, edition 2024:
```
whoareyou/
├── Cargo.toml            # workspace
├── wit/
│   └── provider.wit      # the provider contract (single source of truth)
├── crates/
│   ├── server/           # bin: axum HTTP service + wasmtime host
│   └── providers/
│       └── hitta/        # cdylib → wasm32-wasip2 component
├── fixtures/             # existing HTML fixtures, reused for component tests
└── justfile              # build components → build server (two-step build)
```
- **Host stack:** tokio, axum 0.8, reqwest (current, async), moka 0.12,
wasmtime 45 (`component-model`), thiserror, tracing.
- **Provider stack:** `wit-bindgen` in a `cdylib` crate, built with plain
`cargo build --target wasm32-wasip2 --release`. No `cargo-component`
(stale; the tier-2 wasip2 target makes it unnecessary).
- **Deleted:** `src/main.rs` (CLI), `src/definition.rs`, `src/context.rs`
(bincode cache), `definitions/`, the four orphaned probe modules, `_build.rs`.
## The WIT contract
`wit/provider.wit`:
```wit
package whoareyou:provider@0.1.0;

interface lookup {
    record provider-info {
        name: string,     // e.g. "hitta.se"; key in API response + cache
        version: string,
    }

    record request {
        url: string,
    }

    record response {
        status: u16,
        body: string,
    }

    record comment {
        timestamp: option<s64>,   // unix epoch seconds, UTC
        title: option<string>,
        message: string,
    }

    record entry {
        messages: list<string>,
        history: list<string>,
        comments: list<comment>,
    }

    variant lookup-error {
        no-data,                  // fetched fine, nothing on the page
        parse-failed(string),     // page structure changed: scraper rot signal
    }

    metadata: func() -> provider-info;
    requests: func(number: string) -> list<request>;
    parse: func(number: string, responses: list<response>) -> result<entry, lookup-error>;
}

world provider {
    export lookup;
}
```
Design points:
- **Pure exports, no imports.** Components cannot reach network or
filesystem. Sandboxed, trivially testable, host owns all I/O policy.
- **`timestamp: option<s64>`** replaces the old `Date` enum. Components
normalize site-local dates to UTC epoch seconds (date-only → 00:00:00);
`option` because some sites omit dates. The host never parses dates.
- **`lookup-error` separates `no-data` from `parse-failed`** — the old
`Result<Entry, ()>` conflated "nothing there" with "scraper broke".
- **`requests` returns a list** for single-round fan-out (e.g. two URL
formats). No sequential multi-step flows in v1; if a future provider needs
fetch→token→fetch, add an optional host-fetch import to the world then.
- The package is versioned (`@0.1.0`); the future provider-upload feature can
hang version negotiation off this.
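The date-only → 00:00:00 normalization can be sketched with std only. This is a minimal illustration, not the component's actual code: the function names are hypothetical, and the sketch treats the calendar date as already being in UTC (converting from site-local time is left out).

```rust
/// Days between 1970-01-01 and the given proleptic Gregorian date
/// (negative before the epoch). Uses the well-known "days from civil"
/// arithmetic, so no date crate is needed inside the component.
fn days_from_civil(year: i64, month: u32, day: u32) -> i64 {
    // Shift the year so that it starts in March; the leap day then comes last.
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400);
    let year_of_era = y.rem_euclid(400); // 0..=399
    let shifted_month = (i64::from(month) + 9) % 12; // March = 0
    let day_of_year = (153 * shifted_month + 2) / 5 + i64::from(day) - 1;
    let day_of_era = year_of_era * 365 + year_of_era / 4 - year_of_era / 100 + day_of_year;
    era * 146_097 + day_of_era - 719_468
}

/// Hypothetical helper: map a date-only value to epoch seconds at 00:00:00.
/// Returns `None` for out-of-range months or days (loose check only), which
/// would surface as `timestamp: none` in the WIT `comment` record.
fn date_to_epoch_seconds(year: i64, month: u32, day: u32) -> Option<i64> {
    if !(1..=12).contains(&month) || !(1..=31).contains(&day) {
        return None;
    }
    Some(days_from_civil(year, month, day) * 86_400)
}
```

Keeping this arithmetic in the component is what lets the host treat `timestamp` as an opaque `s64`.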
## The host service
### Startup
1. Load config from env (`WHOAREYOU_` prefix): listen addr, components dir,
cache TTL, fetch timeout.
2. Scan `components/*.wasm`, compile each with wasmtime, call `metadata()`
once; **fail fast** on components that don't satisfy the WIT world. Log
the loaded provider set.
3. Serve HTTP.
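Step 1 might look like the sketch below. Only the `WHOAREYOU_` prefix and the 24h TTL come from this spec; the variable names, the other defaults, and the choice to express durations in whole seconds are assumptions made for illustration.

```rust
use std::time::Duration;

#[derive(Debug, Clone, PartialEq)]
struct Config {
    listen_addr: String,
    components_dir: String,
    cache_ttl: Duration,
    fetch_timeout: Duration,
}

impl Config {
    /// Build a config from any key → value lookup, so tests do not have to
    /// mutate the process environment.
    fn from_lookup(get: impl Fn(&str) -> Option<String>) -> Result<Config, String> {
        let text = |key: &str, default: &str| get(key).unwrap_or_else(|| default.to_string());
        let seconds = |key: &str, default: u64| -> Result<Duration, String> {
            match get(key) {
                None => Ok(Duration::from_secs(default)),
                Some(raw) => raw
                    .trim()
                    .parse::<u64>()
                    .map(Duration::from_secs)
                    .map_err(|e| format!("{key}: invalid seconds value {raw:?}: {e}")),
            }
        };
        Ok(Config {
            // Variable names and defaults below are hypothetical.
            listen_addr: text("WHOAREYOU_LISTEN_ADDR", "127.0.0.1:8080"),
            components_dir: text("WHOAREYOU_COMPONENTS_DIR", "components"),
            cache_ttl: seconds("WHOAREYOU_CACHE_TTL_SECS", 24 * 60 * 60)?,
            fetch_timeout: seconds("WHOAREYOU_FETCH_TIMEOUT_SECS", 10)?,
        })
    }

    /// Production entry point: read from the real environment.
    fn from_env() -> Result<Config, String> {
        Config::from_lookup(|key| std::env::var(key).ok())
    }
}
```

Returning an error for an unparsable value, rather than silently falling back to the default, matches the fail-fast stance taken for components in step 2.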
### API
```
GET /api/v1/number/{number}
GET /healthz
```
Response shape:
```json
{
  "number": "0700000000",
  "results": {
    "hitta.se": {
      "status": "ok",
      "entry": {
        "messages": [],
        "history": ["42 andra har rapporterat detta nummer"],
        "comments": [
          { "timestamp": 1547746162, "title": null, "message": "Varmsälj från Folksam" }
        ]
      }
    }
  }
}
```
- Per-provider `status`: `"ok"` | `"no_data"` | `"fetch_failed"` |
`"parse_failed"`. One provider failing never fails the request.
- HTTP 200 whenever the lookup ran; 400 only for invalid number format.
- `/healthz` is the one deliberate addition to "just lookup" — a supervised
daemon needs it.
### Lookup flow
1. Normalize the number (strip spaces/dashes; minimal — no full E.164 in
v1). Normalized form is the cache key.
2. Check moka cache (key `provider:number`, value: per-provider result,
TTL 24h).
3. On miss, per provider **concurrently**: `requests(number)` → host fetches
each URL via reqwest (shared client, timeout, descriptive User-Agent) →
`parse(number, responses)` → cache the result.
4. Assemble the JSON response.
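Steps 1 and 2 can be sketched as below. The function names are hypothetical, and the validity rule (non-empty, ASCII digits only after stripping) is an assumption: the spec only says that an invalid format yields HTTP 400.

```rust
/// Strip spaces and dashes, then accept the number only if what remains is
/// a non-empty run of ASCII digits. `None` would map to HTTP 400.
fn normalize_number(raw: &str) -> Option<String> {
    let cleaned: String = raw.chars().filter(|c| !matches!(c, ' ' | '-')).collect();
    let valid = !cleaned.is_empty() && cleaned.chars().all(|c| c.is_ascii_digit());
    valid.then_some(cleaned)
}

/// Cache key in the `provider:number` shape, built from the normalized form
/// so that "070-000 00 00" and "0700000000" share one cache entry.
fn cache_key(provider: &str, normalized: &str) -> String {
    format!("{provider}:{normalized}")
}
```

Under this rule a leading `+` is rejected, which is consistent with "no full E.164 in v1" but would need revisiting if international formats are accepted later.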
### Wasmtime mechanics
- One `Engine` + one compiled `Component` per provider at startup.
- Fresh `Store` + instance per call — cheap, and no state bleeds between
lookups.
- Guest calls run in `spawn_blocking` (CPU-bound parsing, no async in guest).
- Epoch-based deadline per call (`Engine` epoch interruption, deadline a few
seconds out) so a runaway component can't hang the service — matters once
uploaded third-party components exist.
### Errors and logging
- Host-side `thiserror` enum (`FetchFailed`, `ParseFailed`, `ComponentTrap`,
…) mapped to the per-provider `status` strings.
- `tracing` structured logs; `parse_failed` logs at WARN — it means a
scraper rotted and needs attention.
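A rough shape for the error enum and its status mapping, written with std only: the hand-written `Display` impl stands in for what `thiserror` would derive. The variants elided by "…" above are left out, and mapping `ComponentTrap` to `"parse_failed"` is an assumption, since the spec does not say which status a trap reports.

```rust
use std::fmt;

#[derive(Debug)]
enum ProviderError {
    FetchFailed(String),
    ParseFailed(String),
    ComponentTrap(String),
}

// Stand-in for `#[derive(thiserror::Error)]` with `#[error("...")]` attributes.
impl fmt::Display for ProviderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ProviderError::FetchFailed(why) => write!(f, "fetch failed: {why}"),
            ProviderError::ParseFailed(why) => write!(f, "parse failed: {why}"),
            ProviderError::ComponentTrap(why) => write!(f, "component trapped: {why}"),
        }
    }
}

impl std::error::Error for ProviderError {}

/// Map a host-side error to the per-provider `status` string in the API.
fn status_of(err: &ProviderError) -> &'static str {
    match err {
        ProviderError::FetchFailed(_) => "fetch_failed",
        ProviderError::ParseFailed(_) => "parse_failed",
        // Assumption: a trap is reported like a broken scraper.
        ProviderError::ComponentTrap(_) => "parse_failed",
    }
}
```

An exhaustive `match` without a wildcard arm means adding a variant later forces a decision about its status string at compile time.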
## The hitta component
`crates/providers/hitta`:
- `wit-bindgen` generates the trait from `wit/provider.wit`.
- `metadata()` → `{ name: "hitta.se", version: <crate version> }`.
- `requests(number)` → `["https://www.hitta.se/vem-ringde/{number}"]`.
- `parse()` ports `src/probe/hitta.rs`: extract `__NEXT_DATA__` with a regex,
serde-deserialize, map comments to epoch timestamps. The old `Err(())`
paths split: regex miss → `parse-failed`; JSON ok but no `phone_data` →
`no-data`.
- **First implementation task: verify the 2019 fixtures against today's
hitta.se.** The site has likely moved off `__NEXT_DATA__`-in-a-script-tag.
Fetch fresh fixtures and update the parser to match reality before
anything is declared working.
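The `parse-failed` / `no-data` split can be illustrated with a std-only sketch. Several things here are assumptions: the page is taken to carry a `<script id="__NEXT_DATA__" ...>` tag (the shape the 2019 fixtures are expected to have, which may not match today's site), plain substring search replaces the regex, and a substring check for `"phone_data"` stands in for real serde deserialization. All function names are hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum LookupError {
    NoData,
    ParseFailed(String),
}

fn request_url(number: &str) -> String {
    format!("https://www.hitta.se/vem-ringde/{number}")
}

/// Return the JSON text inside the `__NEXT_DATA__` script tag.
/// Any structural miss is a `ParseFailed`: the page no longer looks as expected.
fn extract_next_data(html: &str) -> Result<&str, LookupError> {
    let fail = |what: &str| LookupError::ParseFailed(what.to_string());
    let marker = "id=\"__NEXT_DATA__\"";
    let tag_start = html.find(marker).ok_or_else(|| fail("no __NEXT_DATA__ script tag"))?;
    let after_marker = &html[tag_start + marker.len()..];
    let body_start = after_marker.find('>').ok_or_else(|| fail("unterminated script tag"))? + 1;
    let body = &after_marker[body_start..];
    let body_end = body.find("</script>").ok_or_else(|| fail("missing </script>"))?;
    Ok(body[..body_end].trim())
}

/// Stand-in for the deserialization step: the payload was found, so a
/// missing `phone_data` key means "nothing on the page", not "scraper broke".
fn classify(html: &str) -> Result<&str, LookupError> {
    let json = extract_next_data(html)?;
    if json.contains("\"phone_data\"") {
        Ok(json)
    } else {
        Err(LookupError::NoData)
    }
}
```

In the real component the substring check would be replaced by deserializing into typed structs, with a deserialization error also reported as `parse-failed`.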
## Testing
Three layers, no live network anywhere:
1. **Component logic, native:** parse logic in plain functions; unit tests
run natively against `fixtures/hitta/*.html` with the current `insta` release
(migrating existing snapshots). No WASM in the loop, so iteration stays fast.
2. **Component contract, in wasmtime:** one host-side integration test loads
the built `.wasm`, feeds it a fixture body as a `response`, asserts on the
returned `entry`. Proves the WIT boundary + build pipeline.
3. **HTTP layer:** axum handler tests with the provider/fetch layer behind a
small trait — API-shape tests need neither network nor WASM.
`fetch-fixture` (updated for current site URLs) remains the manual tool for
refreshing fixtures.
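The "small trait" from layer 3 could take roughly the following shape. This is a synchronous, std-only sketch for illustration: the real host is async and reqwest-backed, and the names `Fetcher`, `FixtureFetcher`, and `fetch_all` are hypothetical.

```rust
use std::collections::HashMap;

/// Mirrors the WIT `response` record.
#[derive(Debug, Clone, PartialEq)]
struct Response {
    status: u16,
    body: String,
}

/// The seam between handlers and the network. Production code would
/// implement this over the shared HTTP client.
trait Fetcher {
    fn fetch(&self, url: &str) -> Result<Response, String>;
}

/// Test double: serves canned bodies keyed by URL, never touches the network.
struct FixtureFetcher {
    pages: HashMap<String, String>,
}

impl Fetcher for FixtureFetcher {
    fn fetch(&self, url: &str) -> Result<Response, String> {
        match self.pages.get(url) {
            Some(body) => Ok(Response { status: 200, body: body.clone() }),
            None => Err(format!("no fixture for {url}")),
        }
    }
}

/// Fetch every URL a provider asked for, keeping per-URL failures separate
/// so one bad fetch does not hide the others.
fn fetch_all(fetcher: &dyn Fetcher, urls: &[&str]) -> Vec<Result<Response, String>> {
    urls.iter().map(|url| fetcher.fetch(url)).collect()
}
```

With this seam in place, a handler test can assert on the `"fetch_failed"` path simply by asking for a URL that has no fixture.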
## Future work (explicitly out of v1)
- Container image, k8s/Pithos config, CI to registry
- Upload + enable/disable custom providers (API/UI)
- More providers (audit the four 2019 sites for survivors)
- Host-fetch import in the WIT world for multi-step providers
- Lookup history / persistent cache
- Metrics (Prometheus/OTel)