diff --git a/docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md b/docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md
new file mode 100644
index 0000000..3f4f00b
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md
@@ -0,0 +1,619 @@

# TrueSkill-TT Engine Redesign — Design

**Date:** 2026-04-23
**Status:** Approved (pending implementation plan)

## Summary

Comprehensive redesign of the TrueSkill-TT engine targeting four orthogonal goals:

1. **Performance** — substantially faster offline convergence and incremental online updates.
2. **Accuracy and richer match formats** — support for score margins, free-for-all with partial orders, correlated skills.
3. **Better convergence** — replace ad-hoc capped iteration with a pluggable `Schedule` trait covering all three nested loops.
4. **Better API surface** — typed event description, observer-based progress reporting, generic time axis, structured errors, ergonomic builders.

The design is comprehensive (Approach 1 of three considered) but delivered in five tiers so each step is independently shippable and validated by benchmarks.

## Goals & non-goals

**Goals**

- 10–30× speedup on the offline convergence path for representative workloads (1000+ players, 1000+ events, 30 iterations)
- Order-of-magnitude speedup on incremental "add a single event" workloads
- Pluggable factor graph allowing new factor types without engine changes
- Optional Rayon-backed parallelism on top of `Send + Sync`-correct internals
- Typed, ergonomic public API; replace nested `Vec<Vec<…>>` shapes with `Event` / `Team` / `Member`
- Generic time axis: `Untimed`, `i64`, or user-supplied
- Observer-based progress instead of `verbose: bool` + `println!`
- Structured `Result<_, InferenceError>` at API boundaries

**Non-goals**

- WebAssembly support is not a goal; we may break it if a crate or feature requires it.
- No GPU offload.
- No `no_std` support.
- No persistent format / serde — possible future feature.
- No replacement of the Gaussian/EP approximation itself in this design (the underlying inference math stays the same; we change layout, dispatch, scheduling, and API around it).

## Workload assumptions

Baseline workload that drives perf decisions:

- ~1000+ players
- ~1000+ events total
- ~50–60 events per time slice (per day)
- Both online (incremental adds) and offline (full convergence) are common
- Offline convergence runs frequently

## Section 1 — Core types & traits

The foundation everything else builds on.

### `Gaussian` — natural-parameter storage

Switch storage from `(mu, sigma)` to natural parameters `(pi, tau)` where `pi = sigma⁻²`, `tau = mu · pi`. Multiplication and division dominate the hot path; in nat-params they are direct adds/subs of the components, no `sqrt`. Reads of `mu`/`sigma` become accessor methods (`tau / pi`, `1.0 / pi.sqrt()`). The trade is correct because reads are vanishingly rare compared to writes in EP.

```rust
pub struct Gaussian { pi: f64, tau: f64 }
pub const UNIFORM: Gaussian = Gaussian { pi: 0.0, tau: 0.0 }; // replaces N_INF
```

### `Time` trait

Replaces the bare `i64` time field. Keeps `History` parametric.

```rust
pub trait Time: Copy + Ord + Send + Sync + 'static {
    fn elapsed_to(&self, later: &Self) -> i64;
}
pub struct Untimed; // ZST for the no-time-axis case
impl Time for Untimed { fn elapsed_to(&self, _: &Self) -> i64 { 0 } }
impl Time for i64 { fn elapsed_to(&self, later: &Self) -> i64 { later - self } }
// Optional impls behind feature flags: time::OffsetDateTime, chrono types
```

### `Drift` trait

Generic over `T: Time` so seasonal/calendar-aware drift is possible without going through `i64`.

```rust
pub trait Drift<T: Time>: Copy + Send + Sync {
    fn variance_delta(&self, from: &T, to: &T) -> f64;
}
```

`ConstantDrift(f64)` impl: `from.elapsed_to(to) as f64 * gamma * gamma`.
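The no-sqrt claim above can be made concrete. The following is an illustrative sketch of the natural-parameter representation, not the final API: `from_mu_sigma`, `mul`, and `div` are stand-in names.

```rust
// Sketch of the natural-parameter Gaussian: pi = 1/sigma^2, tau = mu * pi.
// All names here are illustrative, not the crate's final API.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Gaussian { pi: f64, tau: f64 }

pub const UNIFORM: Gaussian = Gaussian { pi: 0.0, tau: 0.0 };

impl Gaussian {
    pub fn from_mu_sigma(mu: f64, sigma: f64) -> Self {
        let pi = 1.0 / (sigma * sigma);
        Gaussian { pi, tau: mu * pi }
    }
    // Reads pay a division / sqrt — acceptable, since reads are rare in EP.
    pub fn mu(&self) -> f64 { self.tau / self.pi }
    pub fn sigma(&self) -> f64 { 1.0 / self.pi.sqrt() }
    // The hot-path operations are component-wise adds/subs: no sqrt anywhere.
    pub fn mul(&self, o: &Gaussian) -> Gaussian {
        Gaussian { pi: self.pi + o.pi, tau: self.tau + o.tau }
    }
    pub fn div(&self, o: &Gaussian) -> Gaussian {
        Gaussian { pi: self.pi - o.pi, tau: self.tau - o.tau }
    }
}
```

Multiplying by `UNIFORM` is the identity, which is exactly why it can replace the old `N_INF` sentinel.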
### `Index` and `KeyTable`

`Index(usize)` is the handle into dense per-`History` `Vec` storage. Public, but intended for power users on hot paths who want to skip the `KeyTable` lookup. The casual API takes `&K`. `KeyTable` (renamed from `IndexMap`, to avoid colliding with the `indexmap` crate's type) maps user keys → `Index`.

### `Observer` trait

Replaces `verbose: bool` + `println!`. Default no-op impls; the user overrides what they need.

```rust
pub trait Observer<T: Time>: Send + Sync {
    fn on_iteration_end(&self, _iter: usize, _max_step: (f64, f64)) {}
    fn on_batch_processed(&self, _time: &T, _idx: usize, _n_events: usize) {}
    fn on_converged(&self, _iters: usize, _final_step: (f64, f64)) {}
}
pub struct NullObserver;
impl<T: Time> Observer<T> for NullObserver {}
```

### Trade-offs

- `Gaussian` natural-param representation: anyone reading `mu`/`sigma` in a hot loop pays a sqrt — but that's the right trade, since hot reads are rare.
- `Time` as a trait (not an enum) keeps it open-ended at zero runtime cost; a defaulted time parameter on `History` keeps the call sites familiar.
- `Observer` is a trait (not a closure) so different sites can have different signatures without losing type safety. `NullObserver` is a ZST.

## Section 2 — Factor graph architecture

The current `Game::likelihoods` is a hand-rolled, hard-coded graph. To unlock richer formats and let us experiment with EP schedules, the graph itself becomes a data structure.

### Variable / Factor model

Variables hold their current Gaussian marginal. Factors hold their outgoing messages to each connected variable and do the local computation. Standard EP: a factor's update is "divide marginal by old outgoing → cavity → apply local approximation → multiply marginal by new outgoing."
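That update cycle can be sketched for a single (factor, variable) edge. This is a hedged illustration with stand-in names (`ep_step`, `approximate`), using a minimal natural-parameter Gaussian; the real engine folds this over every edge of every factor.

```rust
// Minimal natural-parameter Gaussian for the sketch; illustrative only.
#[derive(Clone, Copy)]
struct Gaussian { pi: f64, tau: f64 }
impl Gaussian {
    fn mul(&self, o: &Gaussian) -> Gaussian { Gaussian { pi: self.pi + o.pi, tau: self.tau + o.tau } }
    fn div(&self, o: &Gaussian) -> Gaussian { Gaussian { pi: self.pi - o.pi, tau: self.tau - o.tau } }
}

/// One EP step on one (factor, variable) edge. `approximate` stands in for
/// the factor's local projection of the cavity onto a new outgoing message.
fn ep_step(
    marginal: &mut Gaussian,
    outgoing: &mut Gaussian,
    approximate: impl Fn(&Gaussian) -> Gaussian,
) -> (f64, f64) {
    let cavity = marginal.div(outgoing);     // divide marginal by old outgoing
    let new_out = approximate(&cavity);      // apply local approximation
    let new_marginal = cavity.mul(&new_out); // multiply marginal by new outgoing
    let delta = ((new_marginal.pi - marginal.pi).abs(),
                 (new_marginal.tau - marginal.tau).abs());
    *marginal = new_marginal;
    *outgoing = new_out;
    delta // (|dpi|, |dtau|), the step size a Schedule watches
}
```

Repeating the step with an unchanged local approximation reaches a fixed point, where the returned delta drops to zero — which is what the convergence check in Section 5 keys on.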
```rust
pub trait Factor: Send + Sync {
    fn variables(&self) -> &[VarId];
    fn propagate(&mut self, vars: &mut VarStore) -> (f64, f64); // returns max delta
    fn log_evidence(&self, _vars: &VarStore) -> f64 { 0.0 }
}
```

### Built-in factor catalog

| Factor | Purpose | Status |
|---|---|---|
| `PerformanceFactor` | skill → performance (add β² noise, optional weight) | replaces inline `performance() * weight` |
| `TeamSumFactor` | weighted sum of player perfs → team perf | replaces inline `fold` |
| `RankDiffFactor` | (team_a perf) − (team_b perf) → diff var | currently `team[e].posterior_win() − team[e+1].posterior_lose()` |
| `TruncFactor` | EP truncation: `P(diff > margin)` or `P(|diff| < margin)` for draws | wraps current `v_w` / `approx` |
| `MarginFactor` *(future)* | use observed score margin as soft evidence | enables richer match formats |
| `SynergyFactor` *(future)* | couples teammates' skills | enables different topology |
| `ScoreFactor` *(future)* | continuous outcome (e.g., points scored) | enables score-based outcomes |

The first four together exactly reproduce today's algorithm. The last three are extension slots.

### Game = factor graph + schedule

```rust
pub struct Game<S: Schedule> {
    vars: VarStore,      // SoA: Vec<Gaussian> marginals
    factors: FactorList, // enum dispatch over BuiltinFactor (see Open Questions)
    schedule: S,
}
```

Lean toward **enum dispatch** (`enum BuiltinFactor { Perf(...), Sum(...), RankDiff(...), Trunc(...), ... }`) over `Box<dyn Factor>` for the built-ins:

- avoids per-message vtable overhead in the hottest loop
- keeps factor data inline (no heap indirection)
- still allows user-defined factors via a `BuiltinFactor::Custom(Box<dyn Factor>)` variant

### Schedule trait

Controls iteration order and stopping. Default = current behavior (sweep forward, then backward, until ε or max iters). Pluggable so we can later try damped EP or junction-tree schedules.
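A minimal sketch of the schedule-drives-workload shape, assuming the simplified names `Workload::step` and `EpsilonOrMax` from Section 5 (the real `run` returns a `ScheduleReport`; here it returns a bare tuple to stay self-contained):

```rust
// Illustrative sketch of a pluggable schedule; names and signatures are
// assumptions based on this design, not a finished API.
pub trait Workload {
    /// One full sweep; returns the largest (|dpi|, |dtau|) step taken.
    fn step(&mut self) -> (f64, f64);
}

pub struct EpsilonOrMax { pub eps: f64, pub max: usize }

impl EpsilonOrMax {
    /// Sweep until the step drops below eps, or the iteration cap is hit.
    pub fn run<W: Workload>(&self, w: &mut W) -> (usize, bool) {
        for i in 1..=self.max {
            let (dpi, dtau) = w.step();
            if dpi <= self.eps && dtau <= self.eps {
                return (i, true); // converged
            }
        }
        (self.max, false)
    }
}

// Toy workload whose step size halves each sweep, standing in for EP.
struct Halving(f64);
impl Workload for Halving {
    fn step(&mut self) -> (f64, f64) {
        self.0 *= 0.5;
        (self.0, self.0)
    }
}
```

A damped or residual schedule swaps only the loop body; the workload is untouched, which is the point of the split.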
### High-level constructors

```rust
Game::ranked(teams, results, options) // dominant case
Game::free_for_all(players, ranking)  // FFA with possible ties
Game::custom(builder)                 // power users build their own graph
```

`GameOptions` carries the iteration cap, epsilon, p_draw, and approximation choice. Today these are scattered between method args and module constants.

### Trade-offs

- Enum dispatch over trait objects for built-ins; richer factors drop in via new enum variants.
- Variables and factor messages stored as `Vec`s indexed by `VarId` / edge slot — flat, cache-friendly.
- `Schedule` is a generic parameter (zero-cost); most users get the default; experimentation is open.

### Open question

Whether `enum BuiltinFactor` will feel too closed-world. The `Custom(Box<dyn Factor>)` escape hatch helps, but inner-loop perf for user factors will be slower. Acceptable for now; flagged for future revisit if it becomes a problem.

## Section 3 — Storage layout (SoA + arenas)

### Dense Vec keyed by `Index`

Every `HashMap<K, V>` becomes a `Vec<V>` (or `Vec<Option<V>>` for sparse storage) indexed directly by `Index.0`. The public-facing `KeyTable` continues to map arbitrary keys → `Index`.

### SoA at hot layers, AoS at boundaries

The `Skill` struct stays as a public type for the API (returned from `learning_curves`, etc.), but inside `TimeSlice` we lay it out column-wise:

```rust
struct TimeSliceSkills {
    forward: Vec<Gaussian>, // [n_agents]
    backward: Vec<Gaussian>,
    likelihood: Vec<Gaussian>,
    online: Vec<Gaussian>,
    elapsed: Vec<i64>,
    present: Vec<bool>,
}
```

Within a slice, the inner loops touch one column repeatedly across many events — keeping the column contiguous improves cache utilization and makes the eventual SIMD step (Section 6) straightforward.

`Gaussian` itself stays a single 16-byte struct in the `Vec<Gaussian>`. Splitting it into two parallel `Vec<f64>`s wins for pure SIMD over thousands of Gaussians but loses for the random-access patterns dominant in EP. Revisit if benchmarks demand it.
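The 16-byte claim is checkable, and the column-walk pattern the SoA layout exists for can be sketched in a few lines (names here are illustrative, not the engine's):

```rust
// Two f64 fields, no padding: a Vec<Gaussian> column is densely packed
// 16-byte elements. Names are illustrative stand-ins.
#[derive(Clone, Copy, Default)]
pub struct Gaussian { pi: f64, tau: f64 }

// The kind of column-wise pass TimeSliceSkills is shaped for: one field
// array walked contiguously, rather than hopping across an AoS struct.
fn scale_precisions(column: &mut [Gaussian], factor: f64) {
    for g in column.iter_mut() {
        g.pi *= factor; // sequential walk over 16-byte elements
    }
}
```

Because the whole column is one allocation, this loop is also the natural unit for the SIMD work in Section 6.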
### Arena allocator inside `Game`

Replace per-event allocations with a `ScratchArena` reused across calls.

```rust
pub struct ScratchArena {
    var_buf: Vec<Gaussian>,
    factor_buf: Vec<Gaussian>, // edge messages
    bool_buf: Vec<bool>,
    f64_buf: Vec<f64>,
}
impl ScratchArena {
    fn reset(&mut self); // sets len=0, keeps capacity
    fn alloc_vars(&mut self, n: usize) -> &mut [Gaussian];
}
```

`TimeSlice` owns one `ScratchArena`; each event borrows it for the duration of its `Game` construction and inference. For the parallel-slice story (Section 6), each Rayon task gets its own arena.

### Per-event storage layout

Inside a `TimeSlice`, each event is stored column-wise as well, with `Item` inlined into team-level parallel arrays:

```rust
struct EventStorage {
    teams: SmallVec<[TeamStorage; 4]>,
    outcome: Outcome,
    weights: SmallVec<[SmallVec<[f64; 4]>; 4]>,
    evidence: f64,
}
struct TeamStorage {
    competitors: SmallVec<[Index; 4]>,      // who's on the team
    edge_messages: SmallVec<[Gaussian; 4]>, // outgoing message per slot
    output: f64,
}
```

Iteration over `(competitor, edge_message)` pairs zips two slices — no per-element struct.

### SmallVec for typical shapes

Teams have ≤ ~5 players; games have ≤ ~8 teams. `SmallVec<[T; 8]>` for the per-game team list and `SmallVec<[T; 4]>` for team rosters keeps the common case allocation-free.

### Trade-offs

- A dense `Vec` keyed by `Index` is faster but means agent removal needs tombstones (or just leaves slots present-but-inactive). Acceptable: TrueSkill histories rarely remove players.
- SoA at the `TimeSlice` level only, not at the `History` level. `History` keeps a `Vec<TimeSlice>` because slices are heterogeneous in size.
- One `ScratchArena` per `TimeSlice` keeps the lifetime story simple.

### Open question

The `TimeSliceSkills` sketch above uses (b) **dense + present mask**: one slot per agent in the history, indexed directly by `Index`, with a `present: Vec<bool>` mask for batches the agent didn't participate in.
The alternative is (a) **sparse columnar**: a `Vec<Index>` of present agents and parallel `Vec` columns of length `n_present`, with a separate lookup (binary search or an auxiliary table) to find a given `Index`'s slot.

(b) gives O(1) lookup and SIMD-friendly columns but wastes memory for sparsely populated slices. (a) is leaner per-slice but pays a per-lookup cost in the inner loop. Bench both during T0 and pick. Default proposal: (b), since modern systems are memory-rich and the parallelism story is cleaner.

## Section 4 — API surface

### Typed event description

```rust
pub struct Event<K, T: Time> {
    pub time: T,
    pub teams: SmallVec<[Team<K>; 4]>,
    pub outcome: Outcome,
}

pub struct Team<K> {
    pub members: SmallVec<[Member<K>; 4]>,
}

pub struct Member<K> {
    pub key: K,
    pub weight: f64,             // default 1.0
    pub prior: Option<Gaussian>, // per-event override
}

pub enum Outcome {
    Ranked(SmallVec<[u32; 4]>), // rank per team; equal ranks = tie
    Scored(SmallVec<[f64; 4]>), // continuous score per team (engages MarginFactor)
}
```

`Outcome::winner(0)`, `Outcome::draw()`, `Outcome::ranking([0, 1, 2])` are convenience constructors.

### Builders

```rust
let mut history = History::<&str, i64>::builder()
    .mu(25.0).sigma(25.0 / 3.0).beta(25.0 / 6.0)
    .drift(ConstantDrift(0.03))
    .p_draw(0.10)
    .convergence(ConvergenceOptions { max_iter: 30, epsilon: 1e-6 })
    .observer(LogObserver::default())
    .build();
```

For the no-time case, the time parameter defaults to `Untimed`:

```rust
let mut history = History::<&str>::builder().build();
```

### Three-tier event ingestion

```rust
// 1. Bulk ingestion (high-throughput path)
history.add_events(events_iter)?;

// 2. One-off match (very common in practice)
history.record_winner("alice", "bob", time)?;
history.record_draw("alice", "bob", time)?;

// 3. Builder for irregular shapes
history.event(time)
    .team(["alice", "bob"]).weights([1.0, 0.7])
    .team(["carol"])
    .ranking([1, 0])
    .commit()?;
```

### Convergence & queries

```rust
let report: ConvergenceReport = history.converge()?;

let curve: Vec<(i64, Gaussian)> = history.learning_curve(&"alice");
let all = history.learning_curves();       // HashMap<&K, Vec<(T, Gaussian)>>
let now = history.current_skill(&"alice"); // Option<Gaussian>

let ev = history.log_evidence();
let ev_for = history.log_evidence_for(&["alice", "bob"]);

let q = history.predict_quality(&[&["alice"], &["bob"]]);
let p_win = history.predict_outcome(&[&["alice"], &["bob"]]);
```

### Standalone Game

```rust
let g = Game::ranked(&[&[alice], &[bob]], Outcome::winner(0), &options);
let post = g.posteriors();

// Convenience
let (a, b) = Game::one_v_one(&alice, &bob, Outcome::winner(0));
```

### Errors

Replace `debug_assert!`/`panic!` at the API boundary with `Result`.

```rust
pub enum InferenceError {
    MismatchedShape { kind: &'static str, expected: usize, got: usize },
    InvalidProbability { value: f64 },
    ConvergenceFailed { last_step: (f64, f64), iterations: usize },
    NegativePrecision { pi: f64 },
}
```

Hot inner loops still use `debug_assert!` for invariants the API has already enforced.

### Trade-offs

- Generic over the user's `K`; the engine works in `Index`. Public outputs use `&K`.
- `SmallVec` everywhere on the event-description path.
- Three-tier API so casual users don't drown in types and bulk users still get throughput.
- `Outcome` enum replaces the "lower number wins" `&[f64]` convention.

### Open question

Whether to expose `Index` directly to users via an `intern_key(&K) -> Index` method, letting hot-path callers skip the `KeyTable` lookup on every call. Recommendation: yes — a public `Index` handle plus `history.lookup(&key) -> Option<Index>`. The casual API still takes `&K` everywhere; power users can promote to `Index` when profiling demands.
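The `KeyTable` / `Index` split the open question turns on can be sketched as follows — a hedged illustration with hypothetical method names (`intern`, `lookup`), not the crate's final signatures:

```rust
use std::collections::HashMap;
use std::hash::Hash;

// Dense handle into per-History Vec storage; illustrative stand-in.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct Index(pub usize);

// Maps user keys -> Index once; hot paths then work on bare usize handles.
pub struct KeyTable<K: Eq + Hash + Clone> {
    map: HashMap<K, Index>,
    keys: Vec<K>, // Index -> K, for reporting back to the user
}

impl<K: Eq + Hash + Clone> KeyTable<K> {
    pub fn new() -> Self {
        KeyTable { map: HashMap::new(), keys: Vec::new() }
    }
    /// Returns a stable dense handle, allocating one on first sight.
    pub fn intern(&mut self, key: &K) -> Index {
        if let Some(&ix) = self.map.get(key) {
            return ix;
        }
        let ix = Index(self.keys.len());
        self.keys.push(key.clone());
        self.map.insert(key.clone(), ix);
        ix
    }
    pub fn lookup(&self, key: &K) -> Option<Index> {
        self.map.get(key).copied()
    }
}
```

A caller who interns `"alice"` once can index every per-slice column with `Index.0` directly, which is exactly the promotion the open question recommends exposing.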
## Section 4½ — Naming pass

| Current | New | Rationale |
|---|---|---|
| `History` | `History` (kept) | Matches upstream; reads cleanly. |
| `Batch` | `TimeSlice` | Says what it is: every event sharing one timestamp. |
| `Player` | `Rating` | The struct holds prior/beta/drift — that's a rating configuration. Resolves the `Player`/`Agent` confusion. |
| `Agent` | `Competitor` | Holds dynamic state for someone competing in the history; fits the domain. |
| `Skill` | `Skill` (kept) | Per-time-slice skill estimate; clearer than `BatchSkill`. |
| `Item` | inlined into `TeamStorage` columns (engine) / `Member` (public) | Eliminates the per-element struct in the hot path; gives API users a clear "team member" name. |
| `Game` | `Game` (kept) | `Match` collides with Rust's `match`. |
| `Index` | `Index` (kept) | Internal handle. |
| `IndexMap` | `KeyTable` | Avoids confusion with the `indexmap` crate. |

## Section 5 — Convergence & message scheduling

### Three nested loops, one mechanism

The system has three nested convergence loops:

1. Within-game: EP sweeps over the factor graph
2. Within-time-slice: re-running games as inputs change
3. Cross-history: forward pass then backward pass over all slices

All three implement `Workload`; one `Schedule` impl drives all of them.

```rust
pub trait Schedule {
    fn run<W: Workload>(&self, workload: &mut W) -> ScheduleReport;
}

pub trait Workload {
    fn step(&mut self) -> (f64, f64);
    fn snapshot_evidence(&self) -> f64 { 0.0 }
}

pub struct ScheduleReport {
    pub iterations: usize,
    pub final_step: (f64, f64),
    pub converged: bool,
}
```

### Built-in schedules

| Schedule | Behavior | Use |
|---|---|---|
| `EpsilonOrMax { eps, max }` | Default. Sweep until `(dpi, dtau) ≤ eps` or `max` iters. | All three loops. Replicates current behavior. |
| `Damped { eps, max, alpha }` | Same, but writes `α·new + (1−α)·old`. | Stuck oscillations. |
| `Residual { eps, max }` | Priority queue: re-update the factor with the largest pending delta first. | Faster convergence on uneven graphs. |
| `OneShot` | Exactly one pass, no convergence check. | Online incremental adds. |

### Stopping in natural-param space

Switch from `(|Δmu|, |Δsigma|) ≤ epsilon` to `(|Δpi|, |Δtau|) ≤ (eps_pi, eps_tau)`:

- `mu` and `sigma` are on different scales; one tolerance is wrong for both
- We store in nat-params anyway — checking convergence in mu/sigma costs needless `sqrt`s
- The nat-param delta is the natural geometry of the EP fixed point

The default `EpsilonOrMax::default()` exposes a single `epsilon` for simplicity; an advanced constructor exposes both tolerances.

### Within-game improvements

- Replace the hard cap of 10 iterations with `GameOptions::schedule`, which propagates a `ScheduleReport` upward
- Fast path: graphs with no diff chain (plain 1v1, where one iteration suffices) skip the loop entirely
- FFA / many-team rankings benefit from `Residual`; opt-in

### Within-slice and cross-history improvements

- **No more old/new HashMap snapshotting**: track deltas inline as we write under SoA
- **Per-slice dirty bits**: a `TimeSlice` whose neighbor messages haven't changed since its last full sweep doesn't need to re-run. Track `time_slice.dirty` and skip clean ones during the cross-history sweep. Big win for online-add (the locality case).

### `ConvergenceReport`

```rust
pub struct ConvergenceReport {
    pub iterations: usize,
    pub final_step: (f64, f64),
    pub log_evidence: f64,
    pub converged: bool,
    pub per_iteration_time: SmallVec<[Duration; 32]>,
    pub batches_skipped: usize,
}
```

`Observer` continues to receive per-iteration callbacks for live UI; `ConvergenceReport` is the post-hoc summary.

### Trade-offs

- One `Schedule` trait shared across loops — fewer concepts, more composable.
- Convergence checks in nat-param space — a slightly different exact threshold than today; the tests' epsilons are re-tuned mechanically.
- Dirty-bit skipping changes iteration order vs. today; the fixed point is the same, but iteration counts may shift downward.
- `Residual` and `Damped` are opt-in; default behavior matches today closely.

### Open question

Whether `Schedule::run` should take an optional `Observer` reference. Recommendation: observation lives at a higher layer (`History::converge` calls observer hooks; `Schedule` is purely the loop driver).

## Section 6 — Concurrency & parallelism

### What's parallelizable

| Operation | Parallelism | Strategy |
|---|---|---|
| `History::converge()` (full forward+backward) | Sequential across slices | Within each slice: color-group events in parallel via Rayon |
| `History::add_events(...)` | Sequential append, but ingestion of typed events into `EventStorage` parallelizes trivially | n/a |
| `History::learning_curves()` | Per-key parallel | `into_par_iter()` |
| `History::log_evidence_for(targets)` | Per-batch parallel, reduce sum | `par_iter().map(...).sum()` |
| `Game` inference | Sequential | n/a (too small to amortize Rayon overhead) |

### Within-slice color-group parallelism

When events are added to a slice, partition them into color groups such that events in the same color touch no shared `Index`. Within a color, run events in parallel via Rayon. Across colors, run sequentially. This preserves asynchronous-EP semantics exactly.

Alternative: synchronous EP with a snapshot. All events read from a frozen skill snapshot, write deltas to thread-local buffers, and a barrier merges them. Trivially parallel but weaker per-iteration convergence — needs damping. Available as a `Schedule` impl, opt-in.

### `Send + Sync` requirements

All public traits (`Time`, `Drift`, `Observer`, `Factor`, `Schedule`) require `Send + Sync`. `Observer` impls must be thread-safe (they are called from arbitrary worker threads).
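The color-group partition above amounts to greedy graph coloring on the event/competitor conflict graph. A hedged sketch, with hypothetical names (`color_events`) and bare `usize` agent indices standing in for `Index`:

```rust
/// Greedy coloring: events sharing any competitor index get different
/// colors; events within one color can then run in parallel.
/// Illustrative sketch, not the engine's implementation.
fn color_events(events: &[Vec<usize>]) -> Vec<usize> {
    let n_agents = events.iter().flatten().copied().max().map_or(0, |m| m + 1);
    // used[agent] = colors already claimed by events touching that agent
    let mut used: Vec<Vec<usize>> = vec![Vec::new(); n_agents];
    let mut colors = Vec::with_capacity(events.len());
    for agents in events {
        // smallest color unused by every agent this event touches
        let mut c = 0;
        while agents.iter().any(|&a| used[a].contains(&c)) {
            c += 1;
        }
        for &a in agents {
            used[a].push(c);
        }
        colors.push(c);
    }
    colors
}
```

One linear pass per `add_events`, as the trade-offs below assume; the resulting colors become the sequential outer phases, each of which Rayon fans out internally.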
### Rayon as default-on feature

`rayon` is a default-on feature; with `default-features = false`, the parallel paths fall back to sequential iterators behind `cfg(feature = "rayon")`.

### Expected speedup ballpark

For 1000 players, 60 events/slice × 1000 slices, 30 convergence iterations:

| Source | Estimated speedup vs. today |
|---|---|
| `HashMap` → dense `Vec` | 2–4× |
| Natural-param `Gaussian`, no-sqrt mul/div | 1.5–2× |
| Pre-allocated `ScratchArena` | 1.2–1.5× |
| Color-group parallel events in slice (8 cores) | 2–4× |
| Dirty-bit slice skipping (online add case) | 5–50× |
| **Combined (offline converge)** | ~10–30× |
| **Combined (online add)** | ~50–500× depending on locality |

These are pre-implementation estimates. Each tier is validated with criterion benchmarks.

### Trade-offs

- Color-group parallelism requires up-front graph coloring at ingestion. Cost: linear in events, run once per `add_events`. Cheap.
- Default = asynchronous EP (preserves current semantics). Synchronous is opt-in only.
- The cross-slice sweep stays sequential; no speculative parallel sweeps.
- Rayon is default-on but feature-gated.

### Open question

Whether to expose color-group partitioning to users. Recommendation: hidden by default, with an escape hatch via `add_events_with_partition(...)` for power users who already know their event independence.

## Section 7 — Migration, testing, and delivery plan

The crate is unreleased, so version-bump ceremony doesn't apply. Tiers are a sequencing of work and milestones, not releases.

### Tier sequence

**T0 — Numerical parity (no API change)**

Internal-only. Public surface unchanged.

- Switch `Gaussian` storage to natural parameters `(pi, tau)`. `mu()`/`sigma()` become accessors.
- Replace `HashMap` with dense `Vec<_>` keyed by `Index.0` everywhere.
- Introduce `ScratchArena` inside `Batch` so `Game::new` stops allocating per-event.
- Drop the `panic!` in `mu_sigma`; return a `Result` propagated upward.
**Acceptance:** existing test suite passes (bit-equal where possible, ULP-bounded where natural-param arithmetic shifts a rounding); `cargo bench` shows a ≥3× win on the `batch` benchmark; no API breakage.

**T1 — Factor graph machinery (internal-only)**

- Introduce `Factor`, `VarStore`, `Schedule` as `pub(crate)` types.
- Re-implement `Game::likelihoods()` on top of `BuiltinFactor::{Perf, TeamSum, RankDiff, Trunc}` driven by `EpsilonOrMax`.
- Replace within-game iteration tracking with `ScheduleReport`.

**Acceptance:** existing test suite passes (ULP-bounded); within-game iteration counts unchanged; benchmarks ≥ T0.

**T2 — New API surface (breaking)**

All renames and the new public API land together. No half-renamed intermediate state.

- New types: `Rating`, `TimeSlice`, `Competitor`, `Member`, `Outcome`, `Event`, `KeyTable`.
- `Time` trait introduced; `History<K, T>` is generic over both key and time.
- Three-tier API surface: `record_winner`, `event(...).team(...).commit()`, bulk `add_events(iter)`.
- `Observer` trait + `ConvergenceReport`; `verbose: bool` deleted.
- `panic!`/`debug_assert!` at the API boundary become `Result<_, InferenceError>`.
- Promote `Factor`/`Schedule`/`VarStore` to `pub` under a `factors` module.

**Acceptance:** full test suite rewritten in the new API; equivalence tests prove identical posteriors vs. the old API on the same inputs.

**T3 — Concurrency**

- `Send + Sync` audit and bounds on all public traits.
- Color-group partitioning at `TimeSlice` ingestion.
- `rayon` as a default-on feature with a `#[cfg(feature = "rayon")]` fallback.
- Parallel paths: within-slice color groups, `learning_curves`, `log_evidence_for`.

**Acceptance:** deterministic posteriors across `RAYON_NUM_THREADS={1,2,4,8}`; benchmarks show >2× on 8 cores for offline converge.

**T4 — Richer factor types & schedules**

Each shipped independently after T3.

- `MarginFactor` → enables `Outcome::Scored`.
- `Damped` and `Residual` schedules.
- `SynergyFactor`, `ScoreFactor` → same pattern when wanted.

Each comes with its own benchmark and a worked example in `examples/`.

### Testing strategy

| Layer | Approach |
|---|---|
| **Numerical correctness** | Keep the existing hardcoded golden values from `test_1vs1`, `test_1vs1_draw`, `test_2vs1vs2_mixed`, etc. through T0–T1 unchanged. They are a regression net against the original Python port. |
| **API parity** | T2 adds an `equivalence` test module that runs identical inputs through old vs. new construction and compares posteriors within ULPs. |
| **Property tests** | Add `proptest` for: factor-graph fixed-point invariance under message order, `Outcome` round-trip, `Gaussian` mul/div associativity in nat-params, schedule convergence regardless of starting state. |
| **Determinism** | T3 adds tests that run identical input across multiple Rayon thread counts and assert identical posteriors. |
| **Benchmark gates** | Each tier has a "must not regress" gate vs. the previous tier on the existing `batch` and `gaussian` criterion suites. T0 must beat baseline by ≥3×; T1 ≥ T0; etc. |

### Risk management

- **T0 risk: rounding drift in tests.** Mitigation: where natural-param arithmetic legitimately changes the last ULPs, update goldens *and* simultaneously add a parity test against a snapshot taken from baseline to prove the difference is bounded.
- **T2 risk: API design mistakes.** Mitigation: review the spec and a worked example before implementing; iterate on feedback.
- **T3 risk: subtle race conditions in color-group partitioning.** Mitigation: `loom` tests for the merge step; deterministic-output assertions across thread counts.
- **Cross-tier risk: scope creep.** Each tier has a closed checklist; new ideas go to the next tier's wishlist.

### What we're explicitly *not* doing

- No GPU offload.
- No `no_std` support.
- No serde / persistence in this design.
- No incremental online API beyond `record_winner` / `add_events`.
## Open questions summary

Collected here for the review pass:

1. **`enum BuiltinFactor` extensibility** — may feel too closed-world; revisit if user-defined factors via `Custom(Box<dyn Factor>)` become common.
2. **Sparse vs. dense per-slice skill storage** — default to dense + `present` mask; sparse columnar is the alternative. Decided by T0 benchmarks.
3. **`Index` exposure for hot paths** — expose `intern_key`/`lookup` so power users can promote `&K` to `Index` and skip the `KeyTable` lookup; the casual API still takes `&K` everywhere.
4. **`Schedule::run` and observer wiring** — observation stays at the higher layer (`History::converge` calls observer hooks; `Schedule` is purely the loop driver).
5. **Color-group partition exposure** — hidden by default, with an escape hatch via `add_events_with_partition(...)`.