whoareyou v1 — WASM provider service

Date: 2026-06-05
Status: Approved design

Summary

whoareyou becomes a long-running, async HTTP service that looks up Swedish phone numbers by aggregating scraping providers. Providers are WASM components (WASM Component Model, WASI p2) loaded from disk at startup. The host does all network fetching; components are pure functions from number → requests and responses → parsed entries. The existing CLI is retired.

V1 scope: code only (runs via cargo run, components on disk). One provider: hitta.se. No container image, no k8s manifests, no upload/enable-disable UI — those come later.

Decisions log

| Decision | Choice |
|---|---|
| Form factor | Long-running service + HTTP API; CLI retired |
| Deployment target | Self-hosted on own infra (later); v1 is code-only |
| API surface | Single lookup endpoint (+ /healthz as daemon necessity) |
| Provider mechanism | WASM components, fixed set loaded at startup; upload/enable/disable is future work |
| Host/guest boundary | Host fetches; components are pure (no network/fs access) |
| WASM plumbing | Component Model: wasmtime + WIT + wit-bindgen (chosen over Extism and raw-module ABI) |
| Cache | In-memory TTL (moka), parsed results, 24h, key provider:number |
| V1 providers | hitta.se only |
| Old code | CLI, TOML/YAML definitions, bincode cache, orphaned probes: deleted. Hitta parse logic + fixtures: ported |

Workspace layout & toolchain

Cargo workspace, edition 2024:

whoareyou/
├── Cargo.toml              # workspace
├── wit/
│   └── provider.wit        # the provider contract (single source of truth)
├── crates/
│   ├── server/             # bin: axum HTTP service + wasmtime host
│   └── providers/
│       └── hitta/          # cdylib → wasm32-wasip2 component
├── fixtures/               # existing HTML fixtures, reused for component tests
└── justfile                # build components → build server (two-step build)
  • Host stack: tokio, axum 0.8, reqwest (current, async), moka 0.12, wasmtime 45 (component-model), thiserror, tracing.
  • Provider stack: wit-bindgen in a cdylib crate, built with plain cargo build --target wasm32-wasip2 --release. No cargo-component (stale; the tier-2 wasip2 target makes it unnecessary).
  • Deleted: src/main.rs (CLI), src/definition.rs, src/context.rs (bincode cache), definitions/, the four orphaned probe modules, _build.rs.

The WIT contract

wit/provider.wit:

package whoareyou:provider@0.1.0;

interface lookup {
    record provider-info {
        name: string,          // e.g. "hitta.se" — key in API response + cache
        version: string,
    }

    record request {
        url: string,
    }

    record response {
        status: u16,
        body: string,
    }

    record comment {
        timestamp: option<s64>,   // unix epoch seconds, UTC
        title: option<string>,
        message: string,
    }

    record entry {
        messages: list<string>,
        history: list<string>,
        comments: list<comment>,
    }

    variant lookup-error {
        no-data,                  // fetched fine, nothing on the page
        parse-failed(string),     // page structure changed — scraper rot signal
    }

    metadata: func() -> provider-info;
    requests: func(number: string) -> list<request>;
    parse: func(number: string, responses: list<response>) -> result<entry, lookup-error>;
}

world provider {
    export lookup;
}

Design points:

  • Pure exports, no imports. Components cannot reach network or filesystem. Sandboxed, trivially testable, host owns all I/O policy.
  • timestamp: option<s64> replaces the old Date enum. Components normalize site-local dates to UTC epoch seconds (date-only → 00:00:00); option because some sites omit dates. The host never parses dates.
  • lookup-error separates no-data from parse-failed — the old Result<Entry, ()> conflated "nothing there" with "scraper broke".
  • requests returns a list for single-round fan-out (e.g. two URL formats). No sequential multi-step flows in v1; if a future provider needs fetch→token→fetch, add an optional host-fetch import to the world then.
  • Package is versioned (@0.1.0); future provider-upload feature hangs version negotiation off this.
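The date rule above (date-only values become 00:00:00 UTC, unparseable dates become none) can be sketched with the standard library alone. This is a minimal sketch, not the component's actual code: `date_only_to_epoch` and `days_from_civil` are hypothetical helper names, and the input format "YYYY-MM-DD" is an assumption about what a site might emit.

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date
/// (the widely used "days from civil" calculation).
fn days_from_civil(year: i64, month: u32, day: u32) -> i64 {
    // Treat the year as starting in March so the leap day falls at year end.
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400);
    let year_of_era = y.rem_euclid(400);
    let shifted_month = (i64::from(month) + 9) % 12; // March = 0, February = 11
    let day_of_year = (153 * shifted_month + 2) / 5 + i64::from(day) - 1;
    let day_of_era = year_of_era * 365 + year_of_era / 4 - year_of_era / 100 + day_of_year;
    era * 146_097 + day_of_era - 719_468
}

/// Hypothetical helper: "YYYY-MM-DD" to epoch seconds at 00:00:00 UTC.
/// Returns None for malformed input, which would map to `timestamp: none`.
/// Only a coarse range check is done; "2019-02-30" is not rejected.
pub fn date_only_to_epoch(date: &str) -> Option<i64> {
    let mut parts = date.trim().splitn(3, '-');
    let year: i64 = parts.next()?.parse().ok()?;
    let month: u32 = parts.next()?.parse().ok()?;
    let day: u32 = parts.next()?.parse().ok()?;
    if !(1..=12).contains(&month) || !(1..=31).contains(&day) {
        return None;
    }
    Some(days_from_civil(year, month, day) * 86_400)
}
```

Keeping this inside the component means the host only ever sees `option<s64>` and never needs site-specific date knowledge.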

The host service

Startup

  1. Load config from env (WHOAREYOU_ prefix): listen addr, components dir, cache TTL, fetch timeout.
  2. Scan components/*.wasm, compile each with wasmtime, call metadata() once; fail fast on components that don't satisfy the WIT world. Log the loaded provider set.
  3. Serve HTTP.
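Step 1 of the startup sequence could look roughly like the following. This is a sketch under stated assumptions: the variable names (`WHOAREYOU_LISTEN_ADDR`, `WHOAREYOU_COMPONENTS_DIR`, `WHOAREYOU_CACHE_TTL_SECS`, `WHOAREYOU_FETCH_TIMEOUT_SECS`) and all defaults except the 24h TTL are invented for illustration. The lookup is passed in as a closure so the parsing can be tested without touching the process environment.

```rust
use std::net::SocketAddr;
use std::path::PathBuf;
use std::time::Duration;

#[derive(Debug, Clone, PartialEq)]
pub struct Config {
    pub listen_addr: SocketAddr,
    pub components_dir: PathBuf,
    pub cache_ttl: Duration,
    pub fetch_timeout: Duration,
}

impl Config {
    /// Builds the config through `get`, which stands in for `std::env::var(..).ok()`.
    pub fn from_lookup(get: impl Fn(&str) -> Option<String>) -> Result<Self, String> {
        // Every setting shares the WHOAREYOU_ prefix.
        let var = |name: &str| get(format!("WHOAREYOU_{name}").as_str());
        // Durations are given in whole seconds (an assumption of this sketch).
        let secs = |name: &str, default: u64| -> Result<Duration, String> {
            match var(name) {
                Some(raw) => raw
                    .parse::<u64>()
                    .map(Duration::from_secs)
                    .map_err(|e| format!("WHOAREYOU_{name}: {e}")),
                None => Ok(Duration::from_secs(default)),
            }
        };
        let listen_addr = var("LISTEN_ADDR")
            .unwrap_or_else(|| "127.0.0.1:8080".to_string()) // assumed default
            .parse::<SocketAddr>()
            .map_err(|e| format!("WHOAREYOU_LISTEN_ADDR: {e}"))?;
        Ok(Config {
            listen_addr,
            components_dir: PathBuf::from(
                var("COMPONENTS_DIR").unwrap_or_else(|| "components".to_string()),
            ),
            cache_ttl: secs("CACHE_TTL_SECS", 24 * 60 * 60)?, // 24h, per the design
            fetch_timeout: secs("FETCH_TIMEOUT_SECS", 10)?,   // assumed default
        })
    }

    /// Production entry point: read from the real process environment.
    pub fn from_env() -> Result<Self, String> {
        Self::from_lookup(|key| std::env::var(key).ok())
    }
}
```

Returning an error on a malformed value, rather than silently using the default, matches the fail-fast stance taken for components in step 2.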

API

GET /api/v1/number/{number}
GET /healthz

Response shape:

{
  "number": "0700000000",
  "results": {
    "hitta.se": {
      "status": "ok",
      "entry": {
        "messages": [],
        "history": ["42 andra har rapporterat detta nummer"],
        "comments": [
          { "timestamp": 1547746162, "title": null, "message": "Varmsälj från Folksam" }
        ]
      }
    }
  }
}
  • Per-provider status: "ok" | "no_data" | "fetch_failed" | "parse_failed". One provider failing never fails the request.
  • HTTP 200 whenever the lookup ran; 400 only for invalid number format.
  • /healthz is the one deliberate addition to "just lookup" — a supervised daemon needs it.

Lookup flow

  1. Normalize the number (strip spaces/dashes; minimal — no full E.164 in v1). Normalized form is the cache key.
  2. Check moka cache (key provider:number, value: per-provider result, TTL 24h).
  3. On miss, per provider concurrently: requests(number) → host fetches each URL via reqwest (shared client, timeout, descriptive User-Agent) → parse(number, responses) → cache the result.
  4. Assemble the JSON response.
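Steps 1 and 2 of the flow can be sketched as below. Two assumptions are worth flagging: the design only says "strip spaces/dashes", so rejecting anything that is not pure ASCII digits afterwards is a guess at what "invalid number format" means, and `TtlCache` is a tiny standard-library stand-in written for this example, not moka's API. Time is passed in explicitly so expiry is deterministic.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Strips spaces and dashes. Returns None when the remainder is empty or
/// contains anything but ASCII digits (the HTTP 400 case in this sketch).
pub fn normalize_number(raw: &str) -> Option<String> {
    let digits: String = raw.chars().filter(|c| !matches!(c, ' ' | '-')).collect();
    if !digits.is_empty() && digits.chars().all(|c| c.is_ascii_digit()) {
        Some(digits)
    } else {
        None
    }
}

/// Cache key in the `provider:number` form, built from the normalized number.
pub fn cache_key(provider: &str, number: &str) -> String {
    format!("{provider}:{number}")
}

/// Minimal TTL map standing in for moka; entries expire `ttl` after insertion.
pub struct TtlCache<V> {
    ttl: Duration,
    entries: HashMap<String, (Instant, V)>,
}

impl<V: Clone> TtlCache<V> {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Stores `value` together with its expiry instant.
    pub fn insert(&mut self, key: String, value: V, now: Instant) {
        self.entries.insert(key, (now + self.ttl, value));
    }

    /// Returns the value only while `now` is before the stored expiry.
    pub fn get(&self, key: &str, now: Instant) -> Option<V> {
        match self.entries.get(key) {
            Some((expires, value)) if now < *expires => Some(value.clone()),
            _ => None,
        }
    }
}
```

Because the key is per provider, a cached hitta.se result stays valid even if another provider is added or fails on the same number.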

Wasmtime mechanics

  • One Engine + one compiled Component per provider at startup.
  • Fresh Store + instance per call — cheap, and no state bleeds between lookups.
  • Guest calls run in spawn_blocking (CPU-bound parsing, no async in guest).
  • Epoch-based deadline per call (Engine epoch interruption, deadline a few seconds out) so a runaway component can't hang the service — matters once uploaded third-party components exist.

Errors and logging

  • Host-side thiserror enum (FetchFailed, ParseFailed, ComponentTrap, …) mapped to the per-provider status strings.
  • tracing structured logs; parse_failed logs at WARN — it means a scraper rotted and needs attention.
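As a rough illustration of the mapping, here is a hand-written version of the enum using only the standard library; in the service itself thiserror would generate the `Display` impl. The variant payloads are illustrative, and mapping `ComponentTrap` to "parse_failed" is an assumption of this sketch, since the design lists no dedicated status for traps.

```rust
use std::fmt;

/// Host-side failures for one provider lookup. Variant names follow the
/// design; the String payloads are placeholders for richer context.
#[derive(Debug)]
pub enum HostError {
    FetchFailed(String),
    ParseFailed(String),
    ComponentTrap(String),
}

impl HostError {
    /// Status string reported for the provider in the API response.
    pub fn status(&self) -> &'static str {
        match self {
            HostError::FetchFailed(_) => "fetch_failed",
            HostError::ParseFailed(_) => "parse_failed",
            // Assumption: the design does not say which status a trap gets.
            HostError::ComponentTrap(_) => "parse_failed",
        }
    }

    /// parse_failed signals scraper rot, so it is the case logged at WARN.
    pub fn logs_at_warn(&self) -> bool {
        self.status() == "parse_failed"
    }
}

impl fmt::Display for HostError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            HostError::FetchFailed(why) => write!(f, "fetch failed: {why}"),
            HostError::ParseFailed(why) => write!(f, "parse failed: {why}"),
            HostError::ComponentTrap(why) => write!(f, "component trapped: {why}"),
        }
    }
}

impl std::error::Error for HostError {}
```

Keeping the status strings in one `match` means adding a variant forces a decision about how it appears in the API.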

The hitta component

crates/providers/hitta:

  • wit-bindgen generates the trait from wit/provider.wit.
  • metadata() → { name: "hitta.se", version: <crate version> }.
  • requests(number) → ["https://www.hitta.se/vem-ringde/{number}"].
  • parse() ports src/probe/hitta.rs: regex out __NEXT_DATA__, serde-deserialize, map comments to epoch timestamps. The old Err(()) paths split: regex miss → parse-failed; JSON ok but no phone_data → no-data.
  • First implementation task: verify the 2019 fixtures against today's hitta.se. The site has likely moved off __NEXT_DATA__-in-a-script-tag. Fetch fresh fixtures and update the parser to match reality before anything is declared working.
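The pure half of the component might be shaped like this. It is a sketch, not the ported parser: it uses plain string search where the original uses a regex, the `id="__NEXT_DATA__"` attribute form is an assumption about the old page markup, and `extract_next_data` is a hypothetical helper name. If the site has moved off this script tag, only the extraction step would change; the error split stays the same.

```rust
/// Mirror of the WIT `lookup-error` variant, hand-written for this sketch
/// (wit-bindgen would generate the real type).
#[derive(Debug, PartialEq)]
pub enum LookupError {
    NoData,
    ParseFailed(String),
}

/// The single URL the hitta provider asks the host to fetch.
pub fn requests(number: &str) -> Vec<String> {
    vec![format!("https://www.hitta.se/vem-ringde/{number}")]
}

/// Returns the raw JSON text inside the `__NEXT_DATA__` script tag.
/// A missing tag means the page structure changed, hence ParseFailed.
pub fn extract_next_data(html: &str) -> Result<&str, LookupError> {
    let missing = |what: &str| LookupError::ParseFailed(format!("{what} not found"));
    let marker = html
        .find("id=\"__NEXT_DATA__\"")
        .ok_or_else(|| missing("__NEXT_DATA__ script tag"))?;
    let after_marker = &html[marker..];
    // Skip the rest of the opening tag, whatever other attributes it has.
    let open_end = after_marker
        .find('>')
        .ok_or_else(|| missing("end of opening script tag"))?;
    let payload = &after_marker[open_end + 1..];
    let close = payload
        .find("</script>")
        .ok_or_else(|| missing("closing script tag"))?;
    Ok(payload[..close].trim())
}
```

The returned JSON would then go through serde; a payload that deserializes but lacks phone_data would yield `LookupError::NoData` rather than `ParseFailed`.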

Testing

Three layers, no live network anywhere:

  1. Component logic, native: parse logic in plain functions; unit tests run natively against fixtures/hitta/*.html with current insta (migrating existing snapshots). No WASM in the loop — fast iteration.
  2. Component contract, in wasmtime: one host-side integration test loads the built .wasm, feeds it a fixture body as a response, asserts on the returned entry. Proves the WIT boundary + build pipeline.
  3. HTTP layer: axum handler tests with the provider/fetch layer behind a small trait — API-shape tests need neither network nor WASM.
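The "small trait" in layer 3 could be as simple as the following. This is a synchronous sketch for brevity; the real fetch path is async via reqwest, and `Fetch`, `FixtureFetcher`, and `fetch_all` are names invented here. `Response` mirrors the WIT record. Treating any transport error as a failure of the whole batch is also an assumption of this sketch.

```rust
use std::collections::HashMap;

/// Mirror of the WIT `response` record.
#[derive(Debug, Clone, PartialEq)]
pub struct Response {
    pub status: u16,
    pub body: String,
}

/// Seam between the lookup logic and the network.
pub trait Fetch {
    fn fetch(&self, url: &str) -> Result<Response, String>;
}

/// Test double that serves canned bodies instead of hitting the network.
#[derive(Default)]
pub struct FixtureFetcher {
    pages: HashMap<String, String>,
}

impl FixtureFetcher {
    /// Registers a fixture body for `url`.
    pub fn with_page(mut self, url: &str, body: &str) -> Self {
        self.pages.insert(url.to_string(), body.to_string());
        self
    }
}

impl Fetch for FixtureFetcher {
    fn fetch(&self, url: &str) -> Result<Response, String> {
        self.pages
            .get(url)
            .map(|body| Response { status: 200, body: body.clone() })
            .ok_or_else(|| format!("no fixture for {url}"))
    }
}

/// Fetches every URL a provider asked for; the first transport error aborts
/// the batch, which the caller would report as "fetch_failed".
pub fn fetch_all(fetcher: &dyn Fetch, urls: &[String]) -> Result<Vec<Response>, String> {
    urls.iter().map(|url| fetcher.fetch(url)).collect()
}
```

Handler tests can then construct the service with a `FixtureFetcher` and assert on the JSON shape without opening a socket or loading a component.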

fetch-fixture (updated for current site URLs) remains the manual tool for refreshing fixtures.

Future work (explicitly out of v1)

  • Container image, k8s/Pithos config, CI to registry
  • Upload + enable/disable custom providers (API/UI)
  • More providers (audit the four 2019 sites for survivors)
  • Host-fetch import in the WIT world for multi-step providers
  • Lookup history / persistent cache
  • Metrics (Prometheus/OTel)