# whoareyou v1 — WASM provider service

**Date:** 2026-06-05
**Status:** Approved design

## Summary

whoareyou becomes a long-running, async HTTP service that looks up Swedish
phone numbers by aggregating scraping providers. Providers are **WASM
components** (WASM Component Model, WASI p2) loaded from disk at startup. The
host does all network fetching; components are pure functions from number →
requests and responses → parsed entries. The existing CLI is retired.

V1 scope: code only (runs via `cargo run`, components on disk). One provider:
hitta.se. No container image, no k8s manifests, no upload/enable-disable UI —
those come later.

## Decisions log

| Decision | Choice |
|---|---|
| Form factor | Long-running service + HTTP API; CLI retired |
| Deployment target | Self-hosted on own infra (later); v1 is code-only |
| API surface | Single lookup endpoint (+ `/healthz` as daemon necessity) |
| Provider mechanism | WASM components, fixed set loaded at startup; upload/enable/disable is future work |
| Host/guest boundary | Host fetches; components are pure (no network/fs access) |
| WASM plumbing | Component Model: wasmtime + WIT + wit-bindgen (chosen over Extism and raw-module ABI) |
| Cache | In-memory TTL (moka), parsed results, 24h, key `provider:number` |
| V1 providers | hitta.se only |
| Old code | CLI, TOML/YAML definitions, bincode cache, orphaned probes: deleted. Hitta parse logic + fixtures: ported |

## Workspace layout & toolchain

Cargo workspace, edition 2024:

```
whoareyou/
├── Cargo.toml              # workspace
├── wit/
│   └── provider.wit        # the provider contract (single source of truth)
├── crates/
│   ├── server/             # bin: axum HTTP service + wasmtime host
│   └── providers/
│       └── hitta/          # cdylib → wasm32-wasip2 component
├── fixtures/               # existing HTML fixtures, reused for component tests
└── justfile                # build components → build server (two-step build)
```

- **Host stack:** tokio, axum 0.8, reqwest (current, async), moka 0.12,
  wasmtime 45 (`component-model`), thiserror, tracing.
- **Provider stack:** `wit-bindgen` in a `cdylib` crate, built with plain
  `cargo build --target wasm32-wasip2 --release`. No `cargo-component`
  (stale; the tier-2 wasip2 target makes it unnecessary).
- **Deleted:** `src/main.rs` (CLI), `src/definition.rs`, `src/context.rs`
  (bincode cache), `definitions/`, the four orphaned probe modules, `_build.rs`.

## The WIT contract

`wit/provider.wit`:

```wit
package whoareyou:provider@0.1.0;

interface lookup {
    record provider-info {
        name: string,      // e.g. "hitta.se" — key in API response + cache
        version: string,
    }

    record request {
        url: string,
    }

    record response {
        status: u16,
        body: string,
    }

    record comment {
        timestamp: option<s64>,   // unix epoch seconds, UTC
        title: option<string>,
        message: string,
    }

    record entry {
        messages: list<string>,
        history: list<string>,
        comments: list<comment>,
    }

    variant lookup-error {
        no-data,                  // fetched fine, nothing on the page
        parse-failed(string),     // page structure changed — scraper rot signal
    }

    metadata: func() -> provider-info;
    requests: func(number: string) -> list<request>;
    parse: func(number: string, responses: list<response>) -> result<entry, lookup-error>;
}

world provider {
    export lookup;
}
```

Design points:

- **Pure exports, no imports.** Components cannot reach network or
  filesystem. Sandboxed, trivially testable, host owns all I/O policy.
- **`timestamp: option<s64>`** replaces the old `Date` enum. Components
  normalize site-local dates to UTC epoch seconds (date-only → 00:00:00);
  `option` because some sites omit dates. The host never parses dates.
- **`lookup-error` separates `no-data` from `parse-failed`** — the old
  `Result<Entry, ()>` conflated "nothing there" with "scraper broke".
- **`requests` returns a list** for single-round fan-out (e.g. two URL
  formats). No sequential multi-step flows in v1; if a future provider needs
  fetch→token→fetch, add an optional host-fetch import to the world then.
- Package is versioned (`@0.1.0`); future provider-upload feature hangs
  version negotiation off this.

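The date-only rule in the design points (date-only → 00:00:00 UTC) can be sketched inside a component with the standard library alone. This is a minimal sketch, not the component's actual code: the helper names `days_from_civil` and `date_only_to_epoch` and the `YYYY-MM-DD` input format are assumptions made for illustration.

```rust
/// Days since 1970-01-01 for a proleptic Gregorian calendar date.
/// Standard "days from civil" arithmetic; the year is shifted to start
/// in March so that the leap day falls at the end of the shifted year.
fn days_from_civil(year: i64, month: u32, day: u32) -> i64 {
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400); // year of era, 0..=399
    let mp = (month as i64 + 9) % 12; // March = 0, ..., February = 11
    let doy = (153 * mp + 2) / 5 + day as i64 - 1; // day of shifted year
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    era * 146_097 + doe - 719_468
}

/// Hypothetical helper: parse "YYYY-MM-DD" (an assumed input format) and
/// return midnight UTC of that day as epoch seconds. `None` maps onto the
/// `option<s64>` in the WIT record when the date is missing or unreadable.
/// Note: only range-checks month and day; it does not reject e.g. Feb 31.
fn date_only_to_epoch(date: &str) -> Option<i64> {
    let mut parts = date.trim().splitn(3, '-');
    let year: i64 = parts.next()?.parse().ok()?;
    let month: u32 = parts.next()?.parse().ok()?;
    let day: u32 = parts.next()?.parse().ok()?;
    if !(1..=12).contains(&month) || !(1..=31).contains(&day) {
        return None;
    }
    Some(days_from_civil(year, month, day) * 86_400)
}
```

For instance, `date_only_to_epoch("2019-01-17")` yields `Some(1547683200)`, which is midnight UTC on that date. Converting site-local times with a clock component would additionally need the site's UTC offset, which this sketch leaves out.
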
## The host service

### Startup

1. Load config from env (`WHOAREYOU_` prefix): listen addr, components dir,
   cache TTL, fetch timeout.
2. Scan `components/*.wasm`, compile each with wasmtime, call `metadata()`
   once; **fail fast** on components that don't satisfy the WIT world. Log
   the loaded provider set.
3. Serve HTTP.

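Step 1 of the startup sequence might look roughly like the sketch below, which uses only the standard library. The design fixes the `WHOAREYOU_` prefix and the four settings; the exact variable names (`WHOAREYOU_LISTEN_ADDR` and so on), the seconds-based units, and the default listen address and fetch timeout are assumptions of this sketch. The 24h cache default comes from the decisions log.

```rust
use std::collections::HashMap;
use std::time::Duration;

#[derive(Debug, PartialEq)]
struct Config {
    listen_addr: String,
    components_dir: String,
    cache_ttl: Duration,
    fetch_timeout: Duration,
}

impl Config {
    /// Build the config from (key, value) pairs. Taking an iterator instead
    /// of reading the process environment directly keeps this testable.
    fn from_vars<I>(vars: I) -> Result<Self, String>
    where
        I: IntoIterator<Item = (String, String)>,
    {
        // Keep only WHOAREYOU_-prefixed variables, with the prefix stripped.
        let vars: HashMap<String, String> = vars
            .into_iter()
            .filter_map(|(k, v)| k.strip_prefix("WHOAREYOU_").map(|k| (k.to_string(), v)))
            .collect();

        // Parse a whole number of seconds, falling back to a default.
        let secs = |key: &str, default: u64| -> Result<Duration, String> {
            match vars.get(key) {
                Some(raw) => raw
                    .parse::<u64>()
                    .map(Duration::from_secs)
                    .map_err(|e| format!("WHOAREYOU_{key}: {e}")),
                None => Ok(Duration::from_secs(default)),
            }
        };

        Ok(Config {
            // Assumed default; the design does not name one.
            listen_addr: vars
                .get("LISTEN_ADDR")
                .cloned()
                .unwrap_or_else(|| "127.0.0.1:8080".to_string()),
            components_dir: vars
                .get("COMPONENTS_DIR")
                .cloned()
                .unwrap_or_else(|| "components".to_string()),
            cache_ttl: secs("CACHE_TTL_SECS", 24 * 60 * 60)?,
            // Assumed default of 10 seconds.
            fetch_timeout: secs("FETCH_TIMEOUT_SECS", 10)?,
        })
    }
}
```

At startup the host would call `Config::from_vars(std::env::vars())` and exit on `Err`, in line with the fail-fast behaviour described for component loading.
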
### API

```
GET /api/v1/number/{number}
GET /healthz
```

Response shape:

```json
{
  "number": "0700000000",
  "results": {
    "hitta.se": {
      "status": "ok",
      "entry": {
        "messages": [],
        "history": ["42 andra har rapporterat detta nummer"],
        "comments": [
          { "timestamp": 1547746162, "title": null, "message": "Varmsälj från Folksam" }
        ]
      }
    }
  }
}
```

- Per-provider `status`: `"ok"` | `"no_data"` | `"fetch_failed"` |
  `"parse_failed"`. One provider failing never fails the request.
- HTTP 200 whenever the lookup ran; 400 only for invalid number format.
- `/healthz` is the one deliberate addition to "just lookup" — a supervised
  daemon needs it.

### Lookup flow

1. Normalize the number (strip spaces/dashes; minimal — no full E.164 in
   v1). The normalized form is the number part of the cache key.
2. For each provider, check the moka cache (key `provider:number`, value:
   per-provider result, TTL 24h).
3. On miss, per provider **concurrently**: `requests(number)` → host fetches
   each URL via reqwest (shared client, timeout, descriptive User-Agent) →
   `parse(number, responses)` → cache the result.
4. Assemble the JSON response.

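Steps 1 and 2 of the lookup flow can be illustrated with two small pure functions. This is a sketch under stated assumptions: the function names are hypothetical, and the validity rule (digits only after an optional leading `+`, at most 15 digits) is an assumption, since the design only says the format check is minimal and that invalid numbers yield HTTP 400.

```rust
/// Strip spaces and dashes, then apply a minimal format check.
/// Returns `None` for input the handler would answer with HTTP 400.
fn normalize_number(raw: &str) -> Option<String> {
    let cleaned: String = raw
        .chars()
        .filter(|c| !matches!(c, ' ' | '-'))
        .collect();
    // Assumed rule: optional leading '+', then 1 to 15 ASCII digits.
    let digits = cleaned.strip_prefix('+').unwrap_or(&cleaned);
    let valid = !digits.is_empty()
        && digits.len() <= 15
        && digits.chars().all(|c| c.is_ascii_digit());
    valid.then_some(cleaned)
}

/// Cache key in the `provider:number` form named in the decisions log.
fn cache_key(provider: &str, number: &str) -> String {
    format!("{provider}:{number}")
}
```

With these, `normalize_number("070-000 00 00")` gives `Some("0700000000")`, and `cache_key("hitta.se", "0700000000")` gives `"hitta.se:0700000000"`. Keying per provider means one provider's failure or cache expiry does not force a refetch for the others.
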
### Wasmtime mechanics

- One `Engine` + one compiled `Component` per provider at startup.
- Fresh `Store` + instance per call — cheap, and no state bleeds between
  lookups.
- Guest calls run in `spawn_blocking` (CPU-bound parsing, no async in guest).
- Epoch-based deadline per call (`Engine` epoch interruption, deadline a few
  seconds out) so a runaway component can't hang the service — matters once
  uploaded third-party components exist.

### Errors and logging

- Host-side `thiserror` enum (`FetchFailed`, `ParseFailed`, `ComponentTrap`,
  …) mapped to the per-provider `status` strings.
- `tracing` structured logs; `parse_failed` logs at WARN — it means a
  scraper rotted and needs attention.

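The mapping from the host-side error enum to the per-provider `status` strings could be sketched as below. To stay dependency-free, this sketch writes `Display` and `Error` by hand where the design would derive them with `thiserror`. The `NoData` variant, the `String` payloads, and the choice to report `ComponentTrap` as `"parse_failed"` are all assumptions; the design does not say which status a trap maps to.

```rust
use std::fmt;

/// Host-side error for one provider lookup. `FetchFailed`, `ParseFailed`
/// and `ComponentTrap` are named in the design; the rest is assumed.
#[derive(Debug)]
enum ProviderError {
    NoData,
    FetchFailed(String),
    ParseFailed(String),
    ComponentTrap(String),
}

impl ProviderError {
    /// The `status` string used in the JSON response for this error.
    fn status(&self) -> &'static str {
        match self {
            ProviderError::NoData => "no_data",
            ProviderError::FetchFailed(_) => "fetch_failed",
            ProviderError::ParseFailed(_) => "parse_failed",
            // Assumption: a trapped component is surfaced as a parse failure,
            // because the API defines no dedicated status for it.
            ProviderError::ComponentTrap(_) => "parse_failed",
        }
    }
}

impl fmt::Display for ProviderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ProviderError::NoData => write!(f, "no data"),
            ProviderError::FetchFailed(msg) => write!(f, "fetch failed: {msg}"),
            ProviderError::ParseFailed(msg) => write!(f, "parse failed: {msg}"),
            ProviderError::ComponentTrap(msg) => write!(f, "component trapped: {msg}"),
        }
    }
}

impl std::error::Error for ProviderError {}

/// Status string for a finished lookup, successful or not.
fn status_of<T>(result: &Result<T, ProviderError>) -> &'static str {
    match result {
        Ok(_) => "ok",
        Err(err) => err.status(),
    }
}
```

Keeping the mapping in one `match` means the compiler flags any new error variant that has not yet been assigned a status.
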
## The hitta component

`crates/providers/hitta`:

- `wit-bindgen` generates the trait from `wit/provider.wit`.
- `metadata()` → `{ name: "hitta.se", version: <crate version> }`.
- `requests(number)` → `["https://www.hitta.se/vem-ringde/{number}"]`.
- `parse()` ports `src/probe/hitta.rs`: regex out `__NEXT_DATA__`,
  serde-deserialize, map comments to epoch timestamps. The old `Err(())`
  paths split: regex miss → `parse-failed`; JSON ok but no `phone_data` →
  `no-data`.
- **First implementation task: verify the 2019 fixtures against today's
  hitta.se.** The site has likely moved off `__NEXT_DATA__`-in-a-script-tag.
  Fetch fresh fixtures and update the parser to match reality before
  anything is declared working.

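The first half of `parse()` and the `requests()` export can be sketched as plain native functions, which is also the shape the unit tests would exercise. This sketch swaps the regex for plain substring search to stay dependency-free, and it assumes the script tag carries `id="__NEXT_DATA__"`, which must be checked against the real fixtures. The `no-data` branch needs actual JSON deserialization and is left out; `LookupError` here is a hand-written stand-in for the type `wit-bindgen` would generate.

```rust
/// Stand-in for the generated `lookup-error` variant.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum LookupError {
    NoData,
    ParseFailed(String),
}

/// The single URL the hitta component asks the host to fetch.
fn request_urls(number: &str) -> Vec<String> {
    vec![format!("https://www.hitta.se/vem-ringde/{number}")]
}

/// Return the JSON payload inside the `__NEXT_DATA__` script tag.
/// Any structural miss becomes `ParseFailed`, the scraper-rot signal.
fn extract_next_data(html: &str) -> Result<&str, LookupError> {
    let fail = |what: &str| LookupError::ParseFailed(what.to_string());

    // Assumed tag shape: <script id="__NEXT_DATA__" ...>{json}</script>
    let marker = html
        .find("id=\"__NEXT_DATA__\"")
        .ok_or_else(|| fail("no __NEXT_DATA__ script tag"))?;
    let after_marker = &html[marker..];

    // Skip the rest of the opening tag, whatever attributes follow the id.
    let open_end = after_marker
        .find('>')
        .ok_or_else(|| fail("unterminated script tag"))?;
    let body = &after_marker[open_end + 1..];

    let close = body
        .find("</script>")
        .ok_or_else(|| fail("missing </script> after __NEXT_DATA__"))?;
    Ok(body[..close].trim())
}
```

The returned slice would then go to the serde deserialization step, where a well-formed document without `phone_data` maps to `no-data` rather than `parse-failed`.
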
## Testing

Three layers, no live network anywhere:

1. **Component logic, native:** parse logic in plain functions; unit tests
   run natively against `fixtures/hitta/*.html` with the current `insta`
   release (migrating existing snapshots). No WASM in the loop — fast iteration.
2. **Component contract, in wasmtime:** one host-side integration test loads
   the built `.wasm`, feeds it a fixture body as a `response`, asserts on the
   returned `entry`. Proves the WIT boundary + build pipeline.
3. **HTTP layer:** axum handler tests with the provider/fetch layer behind a
   small trait — API-shape tests need neither network nor WASM.

`fetch-fixture` (updated for current site URLs) remains the manual tool for
refreshing fixtures.

## Future work (explicitly out of v1)

- Container image, k8s/Pithos config, CI to registry
- Upload + enable/disable custom providers (API/UI)
- More providers (audit the four 2019 sites for survivors)
- Host-fetch import in the WIT world for multi-step providers
- Lookup history / persistent cache
- Metrics (Prometheus/OTel)