docs: add v1 design spec — WASM provider service
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,232 @@
# whoareyou v1 — WASM provider service

**Date:** 2026-06-05
**Status:** Approved design

## Summary

whoareyou becomes a long-running, async HTTP service that looks up Swedish
phone numbers by aggregating scraping providers. Providers are **WASM
components** (WASM Component Model, WASI p2) loaded from disk at startup. The
host does all network fetching; components are pure functions that map a
number to requests, and the fetched responses to parsed entries. The existing
CLI is retired.

V1 scope: code only (runs via `cargo run`, components on disk). One provider:
hitta.se. No container image, no k8s manifests, and no upload or
enable/disable UI; those come later.

## Decisions log

| Decision | Choice |
|---|---|
| Form factor | Long-running service + HTTP API; CLI retired |
| Deployment target | Self-hosted on own infra (later); v1 is code-only |
| API surface | Single lookup endpoint (+ `/healthz` as daemon necessity) |
| Provider mechanism | WASM components, fixed set loaded at startup; upload/enable/disable is future work |
| Host/guest boundary | Host fetches; components are pure (no network/fs access) |
| WASM plumbing | Component Model: wasmtime + WIT + wit-bindgen (chosen over Extism and raw-module ABI) |
| Cache | In-memory TTL (moka), parsed results, 24h, key `provider:number` |
| V1 providers | hitta.se only |
| Old code | CLI, TOML/YAML definitions, bincode cache, orphaned probes: deleted. Hitta parse logic + fixtures: ported |

## Workspace layout & toolchain

Cargo workspace, edition 2024:

```
whoareyou/
├── Cargo.toml                  # workspace
├── wit/
│   └── provider.wit            # the provider contract (single source of truth)
├── crates/
│   ├── server/                 # bin: axum HTTP service + wasmtime host
│   └── providers/
│       └── hitta/              # cdylib → wasm32-wasip2 component
├── fixtures/                   # existing HTML fixtures, reused for component tests
└── justfile                    # build components → build server (two-step build)
```

- **Host stack:** tokio, axum 0.8, reqwest (current, async), moka 0.12,
  wasmtime 45 (`component-model`), thiserror, tracing.
- **Provider stack:** `wit-bindgen` in a `cdylib` crate, built with plain
  `cargo build --target wasm32-wasip2 --release`. No `cargo-component`
  (stale; the tier-2 wasip2 target makes it unnecessary).
- **Deleted:** `src/main.rs` (CLI), `src/definition.rs`, `src/context.rs`
  (bincode cache), `definitions/`, the four orphaned probe modules, `_build.rs`.

## The WIT contract

`wit/provider.wit`:

```wit
package whoareyou:provider@0.1.0;

interface lookup {
    record provider-info {
        name: string,    // e.g. "hitta.se"; key in API response + cache
        version: string,
    }

    record request {
        url: string,
    }

    record response {
        status: u16,
        body: string,
    }

    record comment {
        timestamp: option<s64>,   // unix epoch seconds, UTC
        title: option<string>,
        message: string,
    }

    record entry {
        messages: list<string>,
        history: list<string>,
        comments: list<comment>,
    }

    variant lookup-error {
        no-data,                  // fetched fine, nothing on the page
        parse-failed(string),     // page structure changed (scraper rot signal)
    }

    metadata: func() -> provider-info;
    requests: func(number: string) -> list<request>;
    parse: func(number: string, responses: list<response>) -> result<entry, lookup-error>;
}

world provider {
    export lookup;
}
```

Design points:

- **Pure exports, no imports.** Components cannot reach the network or the
  filesystem. They are sandboxed and trivially testable, and the host owns
  all I/O policy.
- **`timestamp: option<s64>`** replaces the old `Date` enum. Components
  normalize site-local dates to UTC epoch seconds (date-only → 00:00:00);
  it is an `option` because some sites omit dates. The host never parses dates.
- **`lookup-error` separates `no-data` from `parse-failed`.** The old
  `Result<Entry, ()>` conflated "nothing there" with "scraper broke".
- **`requests` returns a list** for single-round fan-out (e.g. two URL
  formats). No sequential multi-step flows in v1; if a future provider needs
  fetch→token→fetch, add an optional host-fetch import to the world then.
- **The package is versioned** (`@0.1.0`); the future provider-upload feature
  hangs version negotiation off this.
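
The date-only normalization mentioned in the design points can be sketched in plain Rust. This is a minimal illustration, not the ported parser: it assumes dates arrive as `YYYY-MM-DD` and are treated as UTC midnight (the spec does not say whether Swedish local time should be converted first), and the names `days_from_civil` and `date_only_to_epoch` are hypothetical.

```rust
/// Days between 1970-01-01 and the given proleptic Gregorian date.
/// Standard "days from civil" arithmetic; no external crates needed.
fn days_from_civil(year: i64, month: u32, day: u32) -> i64 {
    // Shift the year so it starts in March; leap days then fall at the end.
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400);
    let year_of_era = y.rem_euclid(400); // 0..=399
    let shifted_month = (i64::from(month) + 9) % 12; // March = 0
    // Days elapsed since March 1 of the shifted year.
    let day_of_year = (153 * shifted_month + 2) / 5 + i64::from(day) - 1;
    let day_of_era = year_of_era * 365 + year_of_era / 4 - year_of_era / 100 + day_of_year;
    era * 146_097 + day_of_era - 719_468
}

/// Hypothetical helper: parse `YYYY-MM-DD` and return epoch seconds at
/// 00:00:00, or `None` if the text is not a plausible date.
/// Assumption: the date is interpreted as UTC, not as Swedish local time.
fn date_only_to_epoch(text: &str) -> Option<i64> {
    let mut parts = text.trim().splitn(3, '-');
    let year: i64 = parts.next()?.parse().ok()?;
    let month: u32 = parts.next()?.parse().ok()?;
    let day: u32 = parts.next()?.parse().ok()?;
    // Coarse range check only; it does not reject e.g. February 30.
    if !(1..=12).contains(&month) || !(1..=31).contains(&day) {
        return None;
    }
    Some(days_from_civil(year, month, day) * 86_400)
}
```

A comment without a usable date would simply carry `timestamp: None`, which is why the WIT field is an `option`.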

## The host service

### Startup

1. Load config from env (`WHOAREYOU_` prefix): listen addr, components dir,
   cache TTL, fetch timeout.
2. Scan `components/*.wasm`, compile each with wasmtime, and call `metadata()`
   once; **fail fast** on components that don't satisfy the WIT world. Log
   the loaded provider set.
3. Serve HTTP.
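
Step 1 above could look roughly like the following sketch. Only the `WHOAREYOU_` prefix and the 24h cache TTL come from this spec; the variable names (`WHOAREYOU_LISTEN_ADDR` and so on), the other defaults, and the `Config` type are assumptions made for illustration.

```rust
use std::net::SocketAddr;
use std::path::PathBuf;
use std::time::Duration;

#[derive(Debug, Clone, PartialEq)]
pub struct Config {
    pub listen_addr: SocketAddr,
    pub components_dir: PathBuf,
    pub cache_ttl: Duration,
    pub fetch_timeout: Duration,
}

impl Config {
    /// Build the config from a lookup function so tests can pass a closure
    /// instead of touching the process environment.
    pub fn from_lookup(get: impl Fn(&str) -> Option<String>) -> Result<Self, String> {
        // Variable names and defaults are assumptions, except the 24h TTL.
        let var = |name: &str| get(&format!("WHOAREYOU_{name}"));
        let secs = |name: &str, default: u64| -> Result<Duration, String> {
            match var(name) {
                Some(raw) => raw
                    .parse::<u64>()
                    .map(Duration::from_secs)
                    .map_err(|e| format!("WHOAREYOU_{name}: {e}")),
                None => Ok(Duration::from_secs(default)),
            }
        };
        let listen_addr = var("LISTEN_ADDR")
            .unwrap_or_else(|| "127.0.0.1:8080".to_string())
            .parse::<SocketAddr>()
            .map_err(|e| format!("WHOAREYOU_LISTEN_ADDR: {e}"))?;
        let components_dir =
            PathBuf::from(var("COMPONENTS_DIR").unwrap_or_else(|| "components".to_string()));
        Ok(Config {
            listen_addr,
            components_dir,
            cache_ttl: secs("CACHE_TTL_SECS", 24 * 60 * 60)?,
            fetch_timeout: secs("FETCH_TIMEOUT_SECS", 10)?,
        })
    }

    /// Read from the real process environment.
    pub fn from_env() -> Result<Self, String> {
        Self::from_lookup(|name| std::env::var(name).ok())
    }
}
```

Returning an error for an unparsable value, rather than silently falling back to the default, matches the fail-fast stance the startup sequence takes for components.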

### API

```
GET /api/v1/number/{number}
GET /healthz
```

Response shape:

```json
{
  "number": "0700000000",
  "results": {
    "hitta.se": {
      "status": "ok",
      "entry": {
        "messages": [],
        "history": ["42 andra har rapporterat detta nummer"],
        "comments": [
          { "timestamp": 1547746162, "title": null, "message": "Varmsälj från Folksam" }
        ]
      }
    }
  }
}
```

- Per-provider `status`: `"ok"` | `"no_data"` | `"fetch_failed"` |
  `"parse_failed"`. One provider failing never fails the request.
- HTTP 200 whenever the lookup ran; 400 only for invalid number format.
- `/healthz` is the one deliberate addition to "just lookup": a supervised
  daemon needs it.

### Lookup flow

1. Normalize the number (strip spaces and dashes; deliberately minimal, no
   full E.164 in v1). The normalized form is the number used in the cache key.
2. Check the moka cache (key `provider:number`, value: per-provider result,
   TTL 24h).
3. On a miss, per provider **concurrently**: `requests(number)` → host fetches
   each URL via reqwest (shared client, timeout, descriptive User-Agent) →
   `parse(number, responses)` → cache the result.
4. Assemble the JSON response.
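
Steps 1 to 3 can be sketched with the standard library alone. The small `TtlCache` below is only a stand-in for moka so the example stays self-contained; it is not thread-safe, and the names `normalize`, `cache_key`, and `get_or_compute` are hypothetical. The fetch and parse stages are represented by the `compute` closure, and the per-provider concurrency is left out.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Step 1: strip spaces and dashes. Deliberately minimal, no E.164 handling.
pub fn normalize(raw: &str) -> String {
    raw.chars().filter(|c| *c != ' ' && *c != '-').collect()
}

/// Cache key in the `provider:number` form from the decisions log.
pub fn cache_key(provider: &str, number: &str) -> String {
    format!("{provider}:{number}")
}

/// Tiny TTL cache standing in for moka (illustration only).
pub struct TtlCache<V> {
    ttl: Duration,
    entries: HashMap<String, (Instant, V)>,
}

impl<V: Clone> TtlCache<V> {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Steps 2 and 3: return a still-fresh cached value, or run `compute`
    /// (requests → fetch → parse in the real service) and store its result.
    /// `now` is passed in so expiry can be tested without sleeping.
    pub fn get_or_compute(
        &mut self,
        key: String,
        now: Instant,
        compute: impl FnOnce() -> V,
    ) -> V {
        if let Some((stored_at, value)) = self.entries.get(&key) {
            if now.duration_since(*stored_at) < self.ttl {
                return value.clone();
            }
        }
        let value = compute();
        self.entries.insert(key, (now, value.clone()));
        value
    }
}
```

Because the key is built from the normalized number, `070-000 00 00` and `0700000000` share one cache entry per provider.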

### Wasmtime mechanics

- One `Engine` + one compiled `Component` per provider at startup.
- Fresh `Store` + instance per call: cheap, and no state bleeds between
  lookups.
- Guest calls run in `spawn_blocking` (CPU-bound parsing, no async in guest).
- Epoch-based deadline per call (`Engine` epoch interruption, deadline a few
  seconds out) so a runaway component can't hang the service. This matters
  once uploaded third-party components exist.

### Errors and logging

- Host-side `thiserror` enum (`FetchFailed`, `ParseFailed`, `ComponentTrap`,
  …) mapped to the per-provider `status` strings.
- `tracing` structured logs; `parse_failed` logs at WARN because it means a
  scraper rotted and needs attention.
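
The mapping from host errors to status strings might look like the sketch below. To stay dependency-free it implements `Display` by hand where the real code would derive it with `thiserror`. Two things here are assumptions and not decided by this spec: the `NoData` variant on the host enum, and reporting a `ComponentTrap` as `parse_failed`.

```rust
use std::fmt;

/// Per-provider status strings from the API section.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    NoData,
    FetchFailed,
    ParseFailed,
}

impl Status {
    pub fn as_str(self) -> &'static str {
        match self {
            Status::Ok => "ok",
            Status::NoData => "no_data",
            Status::FetchFailed => "fetch_failed",
            Status::ParseFailed => "parse_failed",
        }
    }
}

/// Host-side error. Variant payloads are hypothetical.
#[derive(Debug)]
pub enum HostError {
    FetchFailed(String),
    ParseFailed(String),
    ComponentTrap(String),
    NoData, // assumption: mirrors the guest's `no-data` case
}

impl fmt::Display for HostError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            HostError::FetchFailed(why) => write!(f, "fetch failed: {why}"),
            HostError::ParseFailed(why) => write!(f, "parse failed: {why}"),
            HostError::ComponentTrap(why) => write!(f, "component trapped: {why}"),
            HostError::NoData => write!(f, "no data"),
        }
    }
}

impl std::error::Error for HostError {}

impl HostError {
    /// Map an error to the status string exposed in the API.
    pub fn status(&self) -> Status {
        match self {
            HostError::FetchFailed(_) => Status::FetchFailed,
            HostError::NoData => Status::NoData,
            // Assumption: a trap is surfaced like a broken scraper.
            HostError::ParseFailed(_) | HostError::ComponentTrap(_) => Status::ParseFailed,
        }
    }
}
```

Keeping the mapping in one `status()` method means the handler never matches on error variants directly, so adding a variant later forces a decision about its status at compile time.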

## The hitta component

`crates/providers/hitta`:

- `wit-bindgen` generates the trait from `wit/provider.wit`.
- `metadata()` → `{ name: "hitta.se", version: <crate version> }`.
- `requests(number)` → `["https://www.hitta.se/vem-ringde/{number}"]`.
- `parse()` ports `src/probe/hitta.rs`: extract `__NEXT_DATA__` with a regex,
  deserialize it with serde, and map comments to epoch timestamps. The old
  `Err(())` paths split: regex miss → `parse-failed`; JSON ok but no
  `phone_data` → `no-data`.
- **First implementation task: verify the 2019 fixtures against today's
  hitta.se.** The site has likely moved off `__NEXT_DATA__`-in-a-script-tag.
  Fetch fresh fixtures and update the parser to match reality before
  anything is declared working.
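
The split of the old `Err(())` paths can be illustrated as follows. This is a rough sketch under stated assumptions, not the ported parser: a plain string search replaces the regex, a substring check for `"phone_data"` replaces real serde deserialization, and the function names are hypothetical. It also assumes the page still embeds a `<script id="__NEXT_DATA__">` tag, which the last bullet above flags as unverified.

```rust
/// Native mirror of the WIT `lookup-error` variant.
#[derive(Debug, PartialEq)]
pub enum LookupError {
    NoData,
    ParseFailed(String),
}

/// `requests(number)`: the single URL given in the spec.
pub fn request_url(number: &str) -> String {
    format!("https://www.hitta.se/vem-ringde/{number}")
}

/// Return the text inside the `__NEXT_DATA__` script tag, if present.
/// Plain string search stands in for the regex used by the original probe.
fn extract_next_data(html: &str) -> Option<&str> {
    let marker = html.find("id=\"__NEXT_DATA__\"")?;
    let after_marker = &html[marker..];
    let start = after_marker.find('>')? + 1; // end of the opening tag
    let end = after_marker[start..].find("</script>")?;
    Some(&after_marker[start..start + end])
}

/// Classify a page: `Ok(json)` when data is present, otherwise the error
/// that tells "nothing there" apart from "scraper broke".
pub fn classify(html: &str) -> Result<&str, LookupError> {
    let json = extract_next_data(html).ok_or_else(|| {
        LookupError::ParseFailed("__NEXT_DATA__ script not found".to_string())
    })?;
    // Assumption: substring check instead of deserializing the JSON.
    if json.contains("\"phone_data\"") {
        Ok(json)
    } else {
        Err(LookupError::NoData)
    }
}
```

The sketch does not cover malformed JSON; in the real parser a serde failure would presumably also map to `parse-failed`, since it signals the same kind of scraper rot.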

## Testing

Three layers, no live network anywhere:

1. **Component logic, native:** parse logic lives in plain functions; unit
   tests run natively against `fixtures/hitta/*.html` with the current insta
   release (migrating existing snapshots). No WASM in the loop, so iteration
   is fast.
2. **Component contract, in wasmtime:** one host-side integration test loads
   the built `.wasm`, feeds it a fixture body as a `response`, and asserts on
   the returned `entry`. This proves the WIT boundary and the build pipeline.
3. **HTTP layer:** axum handler tests with the provider/fetch layer behind a
   small trait, so API-shape tests need neither network nor WASM.

`fetch-fixture` (updated for current site URLs) remains the manual tool for
refreshing fixtures.
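
The "small trait" in layer 3 could take a shape like the one below. It is a synchronous, std-only sketch: the real handlers would be async axum functions, and the names `LookupBackend`, `FakeBackend`, and `handle_lookup` are hypothetical. The validity rule (non-empty ASCII digits after normalization) is also an assumption, since the spec only says that an invalid format yields 400.

```rust
/// Hypothetical seam between the HTTP handlers and the provider/fetch layer.
pub trait LookupBackend {
    /// Returns `(provider name, status string)` pairs for a normalized number.
    fn lookup(&self, number: &str) -> Vec<(String, &'static str)>;
}

/// Test double: fixed results, no network, no WASM.
pub struct FakeBackend(pub Vec<(String, &'static str)>);

impl LookupBackend for FakeBackend {
    fn lookup(&self, _number: &str) -> Vec<(String, &'static str)> {
        self.0.clone()
    }
}

/// Handler core as a plain function: returns the HTTP status code and the
/// per-provider results that would be serialized into the response body.
pub fn handle_lookup(
    backend: &dyn LookupBackend,
    raw: &str,
) -> (u16, Vec<(String, &'static str)>) {
    // Same minimal normalization as the lookup flow: drop spaces and dashes.
    let number: String = raw.chars().filter(|c| *c != ' ' && *c != '-').collect();
    // Assumption: anything other than non-empty ASCII digits is invalid.
    if number.is_empty() || !number.chars().all(|c| c.is_ascii_digit()) {
        return (400, Vec::new());
    }
    // A provider-level failure is carried in the results, not the HTTP code.
    (200, backend.lookup(&number))
}
```

With a fake backend that reports `parse_failed`, a test can assert that the handler still answers 200, which is the "one provider failing never fails the request" rule from the API section.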

## Future work (explicitly out of v1)

- Container image, k8s/Pithos config, CI to registry
- Upload + enable/disable custom providers (API/UI)
- More providers (audit the four 2019 sites for survivors)
- Host-fetch import in the WIT world for multi-step providers
- Lookup history / persistent cache
- Metrics (Prometheus/OTel)