T0 + T1 + T2: engine redesign through new API surface (#1)

Implements tiers T0, T1, T2 of `docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md`. All three tiers have landed together on this branch because they build on one another; this PR rolls them up for a single review pass.

Per-tier plans:

- T0: `docs/superpowers/plans/2026-04-23-t0-numerical-parity.md`
- T1: `docs/superpowers/plans/2026-04-24-t1-factor-graph.md`
- T2: `docs/superpowers/plans/2026-04-24-t2-new-api-surface.md`

## Summary

### T0 — Numerical parity (internal)

- `Gaussian` switched to natural-parameter storage `(pi, tau)`; mul/div now ~7× faster (218 ps vs 1.57 ns).
- `HashMap<Index, _>` → dense `Vec<_>` keyed by `Index.0` (via `AgentStore<D>`, `SkillStore`).
- `ScratchArena` eliminates per-event allocations in `Game::likelihoods`.
- `InferenceError` seed type added (1 variant).
- 38 → 53 tests passing through T1.
- Benchmark: `Batch::iteration` 29.84 → 21.25 µs.

### T1 — Factor graph machinery (internal)

- `Factor` trait + `BuiltinFactor` enum (TeamSum / RankDiff / Trunc) driving within-game inference.
- `VarStore` flat storage for variable marginals.
- `Schedule` trait + `EpsilonOrMax` impl replacing the hand-rolled EP loop.
- `Game::likelihoods` rebuilt on the factor-graph machinery; iteration counts and goldens preserved to within 1e-6.
- 53 tests passing.
- Benchmark: `Batch::iteration` 23.01 µs (slight regression absorbed in T2).

### T2 — New API surface (breaking)

**Renames:**

- `IndexMap → KeyTable`, `Player → Rating`, `Agent → Competitor`, `Batch → TimeSlice`

**New types:**

- `Time` trait with `Untimed` ZST and `i64` impls; `Drift<T>`, `Rating<T, D>`, `Competitor<T, D>`, `TimeSlice<T>`, `History<T, D, O, K>` all generic.
- `Event<T, K>`, `Team<K>`, `Member<K>`, `Outcome` (`Ranked` variant; `#[non_exhaustive]`).
- `Observer<T>` trait + `NullObserver`.
- `ConvergenceOptions`, `ConvergenceReport`.
- `GameOptions`, `OwnedGame<T, D>`.

**Three-tier ingestion:**

- `history.record_winner(&K, &K, T)` / `record_draw(&K, &K, T)` — 1v1 convenience.
- `history.add_events(iter)` — typed bulk.
- `history.event(T).team([...]).weights([...]).ranking([...]).commit()` — fluent.

**Query API:** `current_skill`, `learning_curve`, `learning_curves` (keyed on `K`), `log_evidence`, `log_evidence_for`, `predict_quality`, `predict_outcome`.

**Game constructors:** `ranked`, `one_v_one`, `free_for_all`, `custom` — all returning `Result<_, InferenceError>`.

**`factors` module:** `Factor`, `Schedule`, `VarStore`, `VarId`, `BuiltinFactor`, `EpsilonOrMax`, `ScheduleReport`, `TeamSumFactor`, `RankDiffFactor`, `TruncFactor` now public.

**Errors:** `InferenceError` gains `MismatchedShape`, `InvalidProbability`, `ConvergenceFailed`; boundary panics converted to `Result`.

**Removed (breaking):** `History::convergence(iters, eps, verbose)`, `HistoryBuilder::gamma(f64)`, `HistoryBuilder::time(bool)`, `History.time: bool`, `learning_curves_by_index`, nested-Vec public `add_events`.

## Behavior change (documented in CHANGELOG)

`Time = Untimed` has `elapsed_to → 0`, so no drift accumulates between slices. The old `time=false` mode implicitly forced `elapsed=1` on reappearance via an `i64::MAX` sentinel — that quirk is not reproducible under a typed time axis. Tests that depended on it now use `History::<i64, _>` with explicit `1..=n` timestamps. One test (`test_env_ttt`) had 3 Gaussian goldens updated to reflect the corrected semantics; documented in commit `33a7d90`.

## Final numbers

| Metric | Before T0 | After T2 | Delta |
|---|---|---|---|
| `Batch::iteration` | 29.84 µs | 21.36 µs | **-28%** |
| `Gaussian::mul` | 1.57 ns | 219 ps | **-86%** |
| `Gaussian::div` | 1.57 ns | 219 ps | **-86%** |
| Tests passing | 38 | 90 | +52 |

All other Gaussian ops unchanged (~219 ps add/sub, ~264 ps pi/tau reads).

## Test plan

- [x] `cargo test --features approx` — 90/90 pass (68 lib + 10 api_shape + 6 game + 4 record_winner + 2 equivalence)
- [x] `cargo clippy --all-targets --features approx -- -D warnings` — clean
- [x] `cargo +nightly fmt --check` — clean
- [x] `cargo bench --bench batch` — 21.36 µs
- [x] `cargo bench --bench gaussian` — unchanged from T1
- [x] `cargo run --example atp --features approx` — rewritten in new API, runs clean
- [x] Historical Game-level goldens preserved in `tests/equivalence.rs`
- [x] Public API matches spec Section 4 (verified by integration tests in `tests/api_shape.rs`)

## Commit history

~45 commits total across T0 + T1 + T2. Each task is self-contained and individually tested; the branch is bisectable. See `git log main..t2-new-api-surface` for the full list.

## Deferred to later tiers

- `Outcome::Scored` + `MarginFactor` — T4
- `Damped` / `Residual` schedules — T4
- `Send + Sync` bounds + Rayon parallelism — T3
- N-team `predict_outcome` — T4
- `Game::custom` full ergonomics — T4

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #1
Co-authored-by: Anders Olsson <anders.e.olsson@gmail.com>
Co-committed-by: Anders Olsson <anders.e.olsson@gmail.com>
2026-04-24 11:20:04 +00:00
parent a14df02089
commit d2aab82c1e
44 changed files with 10541 additions and 1325 deletions
@@ -0,0 +1,619 @@
+# TrueSkill-TT Engine Redesign — Design
+
+**Date:** 2026-04-23
+**Status:** Approved (pending implementation plan)
+
+## Summary
+
+Comprehensive redesign of the TrueSkill-TT engine targeting four orthogonal goals:
+
+1. **Performance** — substantially faster offline convergence and incremental online updates.
+2. **Accuracy and richer match formats** — support for score margins, free-for-all with partial orders, correlated skills.
+3. **Better convergence** — replace ad-hoc capped iteration with a pluggable `Schedule` trait covering all three nested loops.
+4. **Better API surface** — typed event description, observer-based progress reporting, generic time axis, structured errors, ergonomic builders.
+
+The design is comprehensive (Approach 1 of three considered) but delivered in five tiers so each step is independently shippable and validated by benchmarks.
+
+## Goals & non-goals
+
+**Goals**
+
+- 10–30× speedup on the offline convergence path for representative workloads (1000+ players, 1000+ events, 30 iterations)
+- Order-of-magnitude speedup on incremental "add a single event" workloads
+- Pluggable factor graph allowing new factor types without engine changes
+- Optional Rayon-backed parallelism on top of `Send + Sync`-correct internals
+- Typed, ergonomic public API; replace nested `Vec<Vec<Vec<_>>>` shapes with `Event<T, K>` / `Team<K>` / `Member<K>`
+- Generic time axis: `Untimed`, `i64`, or user-supplied
+- Observer-based progress instead of `verbose: bool` + `println!`
+- Structured `Result<_, InferenceError>` at API boundaries
+
+**Non-goals**
+
+- WebAssembly support is not a goal; we may break it if a crate or feature requires.
+- No GPU offload.
+- No `no_std` support.
+- No persistent format / serde — possible future feature.
+- No replacement of the Gaussian/EP approximation itself in this design (the underlying inference math stays the same; we change layout, dispatch, scheduling, and API around it).
+
+## Workload assumptions
+
+Baseline workload that drives perf decisions:
+
+- ~1000+ players
+- ~1000+ events total
+- ~50–60 events per time slice (per day)
+- Both online (incremental adds) and offline (full convergence) are common
+- Offline convergence runs frequently
+
+## Section 1 — Core types & traits
+
+The foundation everything else builds on.
+
+### `Gaussian` — natural-parameter storage
+
+Switch storage from `(mu, sigma)` to natural parameters `(pi, tau)` where `pi = sigma⁻²`, `tau = mu · pi`. Multiplication and division dominate the hot path; in nat-params they are direct adds/subs of the components, no `sqrt`. Reads of `mu`/`sigma` become accessor methods (`tau / pi`, `1.0 / pi.sqrt()`). The trade is correct because reads are vanishingly rare compared to writes in EP.
+
+```rust
+pub struct Gaussian { pi: f64, tau: f64 }
+pub const UNIFORM: Gaussian = Gaussian { pi: 0.0, tau: 0.0 }; // replaces N_INF
+```
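+
+To make the cost argument concrete, here is a minimal sketch of what natural-parameter arithmetic could look like. The `from_mu_sigma` constructor and the operator impls are illustrative assumptions, not the crate's confirmed API, and handling of the uniform case (`pi == 0`) is deliberately left out.
+
+```rust
+use std::ops::{Div, Mul};
+
+/// Gaussian stored as natural parameters: pi = 1/sigma^2, tau = mu * pi.
+#[derive(Clone, Copy, Debug, PartialEq)]
+pub struct Gaussian {
+    pi: f64,
+    tau: f64,
+}
+
+impl Gaussian {
+    /// Hypothetical constructor from moment parameters (assumes sigma > 0).
+    pub fn from_mu_sigma(mu: f64, sigma: f64) -> Self {
+        let pi = 1.0 / (sigma * sigma);
+        Gaussian { pi, tau: mu * pi }
+    }
+
+    /// Mean: one division, no sqrt. Not meaningful when pi == 0.
+    pub fn mu(&self) -> f64 {
+        self.tau / self.pi
+    }
+
+    /// Standard deviation: the only accessor here that pays for a sqrt.
+    pub fn sigma(&self) -> f64 {
+        1.0 / self.pi.sqrt()
+    }
+}
+
+impl Mul for Gaussian {
+    type Output = Gaussian;
+    /// Product of densities: the natural parameters add.
+    fn mul(self, rhs: Gaussian) -> Gaussian {
+        Gaussian { pi: self.pi + rhs.pi, tau: self.tau + rhs.tau }
+    }
+}
+
+impl Div for Gaussian {
+    type Output = Gaussian;
+    /// Quotient of densities: the natural parameters subtract.
+    fn div(self, rhs: Gaussian) -> Gaussian {
+        Gaussian { pi: self.pi - rhs.pi, tau: self.tau - rhs.tau }
+    }
+}
+```
+
+Multiplying `N(0, 1)` by `N(2, 1)` in this sketch adds `(1, 0)` and `(1, 2)` to give `(pi, tau) = (2, 2)`, i.e. mean 1 and variance 0.5, using two additions and no `sqrt`.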
+
+### `Time` trait
+
+Replaces the bare `i64` time field. Keeps `History` parametric.
+
+```rust
+pub trait Time: Copy + Ord + Send + Sync + 'static {
+    fn elapsed_to(&self, later: &Self) -> i64;
+}
+pub struct Untimed; // ZST for the no-time-axis case
+impl Time for Untimed { fn elapsed_to(&self, _: &Self) -> i64 { 0 } }
+impl Time for i64 { fn elapsed_to(&self, later: &Self) -> i64 { later - self } }
+// Optional impls behind feature flags: time::OffsetDateTime, chrono types
+```
+
+### `Drift<T>` trait
+
+Generic over `T: Time` so seasonal/calendar-aware drift is possible without going through `i64`.
+
+```rust
+pub trait Drift<T: Time>: Copy + Send + Sync {
+    fn variance_delta(&self, from: &T, to: &T) -> f64;
+}
+```
+
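+`ConstantDrift(f64)` impl: `from.elapsed_to(to) as f64 * gamma * gamma`. The receiver is the earlier time, matching `elapsed_to(&self, later)`, so the variance delta is non-negative for forward steps.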
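+
+A sketch of how the two traits could fit together for the constant-drift case. The trait definitions are copied from the snippets above; the derives on `Untimed` and the blanket `impl<T: Time>` are assumptions made so the example compiles on its own. The earlier time is the receiver of `elapsed_to`, in line with its `(&self, later)` signature.
+
+```rust
+pub trait Time: Copy + Ord + Send + Sync + 'static {
+    fn elapsed_to(&self, later: &Self) -> i64;
+}
+
+/// Zero-sized marker for histories without a time axis.
+#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
+pub struct Untimed;
+
+impl Time for Untimed {
+    fn elapsed_to(&self, _: &Self) -> i64 { 0 }
+}
+
+impl Time for i64 {
+    fn elapsed_to(&self, later: &Self) -> i64 { later - self }
+}
+
+pub trait Drift<T: Time>: Copy + Send + Sync {
+    fn variance_delta(&self, from: &T, to: &T) -> f64;
+}
+
+/// Constant per-unit-time drift; the field is gamma (a standard deviation).
+#[derive(Clone, Copy)]
+pub struct ConstantDrift(pub f64);
+
+impl<T: Time> Drift<T> for ConstantDrift {
+    fn variance_delta(&self, from: &T, to: &T) -> f64 {
+        let gamma = self.0;
+        // Elapsed time measured from the earlier point to the later one.
+        from.elapsed_to(to) as f64 * gamma * gamma
+    }
+}
+```
+
+With `gamma = 0.5`, moving from `t = 3` to `t = 7` adds `4 * 0.25 = 1.0` of variance, while any two `Untimed` values add none.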
+
+### `Index` and `KeyTable<K>`
+
+`Index(usize)` is the handle into dense per-`History` `Vec` storage. Public, but intended for use by power users on hot paths who want to skip the `KeyTable` lookup. Casual API takes `&K`. `KeyTable<K>` (renamed from `IndexMap`, to avoid colliding with the `indexmap` crate's type) maps user keys → `Index`.
+
+### `Observer` trait
+
+Replaces `verbose: bool` + `println!`. Default no-op impls; user overrides what they need.
+
+```rust
+pub trait Observer<T: Time>: Send + Sync {
+    fn on_iteration_end(&self, _iter: usize, _max_step: (f64, f64)) {}
+    fn on_batch_processed(&self, _time: &T, _idx: usize, _n_events: usize) {}
+    fn on_converged(&self, _iters: usize, _final_step: (f64, f64)) {}
+}
+pub struct NullObserver;
+impl<T: Time> Observer<T> for NullObserver {}
+```
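+
+As an illustration of overriding a single hook, the sketch below counts iterations. `IterationCounter` and `run_fake_loop` are hypothetical names invented for this example. Because the hooks take `&self` and the trait requires `Send + Sync`, the counter uses an atomic rather than a plain integer.
+
+```rust
+use std::sync::atomic::{AtomicUsize, Ordering};
+
+pub trait Time: Copy + Ord + Send + Sync + 'static {
+    fn elapsed_to(&self, later: &Self) -> i64;
+}
+
+impl Time for i64 {
+    fn elapsed_to(&self, later: &Self) -> i64 { later - self }
+}
+
+pub trait Observer<T: Time>: Send + Sync {
+    fn on_iteration_end(&self, _iter: usize, _max_step: (f64, f64)) {}
+    fn on_batch_processed(&self, _time: &T, _idx: usize, _n_events: usize) {}
+    fn on_converged(&self, _iters: usize, _final_step: (f64, f64)) {}
+}
+
+/// Hypothetical observer that only counts finished iterations.
+#[derive(Default)]
+pub struct IterationCounter {
+    iterations: AtomicUsize,
+}
+
+impl IterationCounter {
+    pub fn iterations(&self) -> usize {
+        self.iterations.load(Ordering::Relaxed)
+    }
+}
+
+impl<T: Time> Observer<T> for IterationCounter {
+    // Only this hook is overridden; the other two keep their no-op defaults.
+    fn on_iteration_end(&self, _iter: usize, _max_step: (f64, f64)) {
+        self.iterations.fetch_add(1, Ordering::Relaxed);
+    }
+}
+
+/// Stand-in for a convergence loop: fires the hooks without doing inference.
+pub fn run_fake_loop<O: Observer<i64>>(observer: &O, iters: usize) {
+    for i in 0..iters {
+        observer.on_iteration_end(i, (0.0, 0.0));
+    }
+    observer.on_converged(iters, (0.0, 0.0));
+}
+```
+
+Calling `run_fake_loop(&counter, 3)` leaves `counter.iterations()` at 3; the `on_converged` call falls through to the default no-op.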
+
+### Trade-offs
+
+- `Gaussian` natural-param representation: anyone reading `sigma` in a hot loop pays a sqrt (and a division for `mu`), which is an acceptable trade because hot reads are rare.
+- `Time` as a trait (not enum) keeps it open-ended at zero runtime cost; default `History<i64, _>` keeps the call sites familiar.
+- `Observer` is a trait (not a closure) so different sites can have different signatures without losing type safety. `NullObserver` is a ZST.
+
+## Section 2 — Factor graph architecture
+
+The current `Game::likelihoods` is a hand-rolled, hard-coded graph. To unlock richer formats and let us experiment with EP schedules, the graph itself becomes a data structure.
+
+### Variable / Factor model
+
+Variables hold their current Gaussian marginal. Factors hold their outgoing messages to each connected variable plus do the local computation. Standard EP: factor's update is "divide marginal by old outgoing → cavity → apply local approximation → multiply marginal by new outgoing."
+
+```rust
+pub trait Factor: Send + Sync {
+    fn variables(&self) -> &[VarId];
+    fn propagate(&mut self, vars: &mut VarStore) -> (f64, f64); // returns max delta
+    fn log_evidence(&self, _vars: &VarStore) -> f64 { 0.0 }
+}
+```
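+
+The divide, approximate, multiply recipe described before the trait can be sketched for a single edge as follows. This is a simplification under stated assumptions: `ep_update` is a hypothetical helper, the fields are public for brevity, and the local approximation is modelled as a closure that maps the cavity straight to a new outgoing message.
+
+```rust
+#[derive(Clone, Copy, Debug, PartialEq)]
+pub struct Gaussian {
+    pub pi: f64,
+    pub tau: f64,
+}
+
+/// One EP update on one factor-to-variable edge.
+/// Returns the change in the outgoing message as (|d pi|, |d tau|).
+pub fn ep_update(
+    marginal: &mut Gaussian,
+    outgoing: &mut Gaussian,
+    approx: impl Fn(Gaussian) -> Gaussian,
+) -> (f64, f64) {
+    // 1. Cavity: divide the marginal by the old outgoing message.
+    let cavity = Gaussian {
+        pi: marginal.pi - outgoing.pi,
+        tau: marginal.tau - outgoing.tau,
+    };
+    // 2. Local approximation produces the new outgoing message.
+    let new_msg = approx(cavity);
+    // 3. New marginal: multiply the cavity by the new message.
+    *marginal = Gaussian {
+        pi: cavity.pi + new_msg.pi,
+        tau: cavity.tau + new_msg.tau,
+    };
+    let delta = (
+        (new_msg.pi - outgoing.pi).abs(),
+        (new_msg.tau - outgoing.tau).abs(),
+    );
+    *outgoing = new_msg;
+    delta
+}
+```
+
+The returned pair plays the role of the `(f64, f64)` that `propagate` reports, so a schedule can compare it against its tolerance.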
+
+### Built-in factor catalog
+
+| Factor | Purpose | Status |
+|---|---|---|
+| `PerformanceFactor` | skill → performance (add β² noise, optional weight) | replaces inline `performance() * weight` |
+| `TeamSumFactor` | weighted sum of player perfs → team perf | replaces inline `fold` |
+| `RankDiffFactor` | (team_a perf) − (team_b perf) → diff var | currently `team[e].posterior_win() − team[e+1].posterior_lose()` |
+| `TruncFactor` | EP truncation: `P(diff > margin)` or `P(\|diff\| < margin)` for draws | wraps current `v_w` / `approx` |
+| `MarginFactor` *(future)* | use observed score margin as soft evidence | enables richer match formats |
+| `SynergyFactor` *(future)* | couples teammates' skills | enables different topology |
+| `ScoreFactor` *(future)* | continuous outcome (e.g., points scored) | enables score-based outcomes |
+
+The first four together exactly reproduce today's algorithm. The last three are extension slots.
+
+### Game = factor graph + schedule
+
+```rust
+pub struct Game<S: Schedule = DefaultSchedule> {
+    vars: VarStore,            // SoA: Vec<Gaussian> marginals
+    factors: FactorList,       // enum dispatch over BuiltinFactor (see Open Questions)
+    schedule: S,
+}
+```
+
+Lean toward **enum dispatch** (`enum BuiltinFactor { Perf(...), Sum(...), RankDiff(...), Trunc(...), ... }`) over `Box<dyn Factor>` for the built-ins:
+
+- avoids per-message vtable overhead in the hottest loop
+- keeps factor data inline (no heap indirection)
+- still allows user-defined factors via a `BuiltinFactor::Custom(Box<dyn Factor>)` variant
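+
+A toy sketch of the dispatch shape argued for above. To stay self-contained it operates on plain `f64` variables instead of `Gaussian` marginals, and `Sum` and `Double` are made-up factors; only the enum-with-`Custom` layout is the point.
+
+```rust
+/// Reduced stand-in for the spec's `Factor` trait.
+pub trait Factor: Send + Sync {
+    /// Updates `vars` in place and returns the size of the change made.
+    fn propagate(&mut self, vars: &mut [f64]) -> f64;
+}
+
+/// Toy built-in: writes the sum of `inputs` into `output`.
+pub struct Sum {
+    pub inputs: Vec<usize>,
+    pub output: usize,
+}
+
+impl Factor for Sum {
+    fn propagate(&mut self, vars: &mut [f64]) -> f64 {
+        let total: f64 = self.inputs.iter().map(|&i| vars[i]).sum();
+        let delta = (total - vars[self.output]).abs();
+        vars[self.output] = total;
+        delta
+    }
+}
+
+/// Hypothetical user-defined factor: doubles one variable.
+pub struct Double {
+    pub var: usize,
+}
+
+impl Factor for Double {
+    fn propagate(&mut self, vars: &mut [f64]) -> f64 {
+        let old = vars[self.var];
+        vars[self.var] = 2.0 * old;
+        (vars[self.var] - old).abs()
+    }
+}
+
+pub enum BuiltinFactor {
+    Sum(Sum),                // data stored inline in the enum
+    Custom(Box<dyn Factor>), // escape hatch: heap allocation + vtable
+}
+
+impl Factor for BuiltinFactor {
+    fn propagate(&mut self, vars: &mut [f64]) -> f64 {
+        match self {
+            // Statically dispatched call on the inline variant.
+            BuiltinFactor::Sum(f) => f.propagate(vars),
+            // Dynamically dispatched call through the trait object.
+            BuiltinFactor::Custom(f) => f.propagate(vars),
+        }
+    }
+}
+```
+
+A `Vec<BuiltinFactor>` can then hold built-in and user factors side by side, with the vtable cost confined to the `Custom` arm.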
+
+### Schedule trait
+
+Controls iteration order and stopping. Default = current behavior (sweep forward, then backward, until ε or max iters). Pluggable so we can later try damped EP or junction-tree schedules.
+
+### High-level constructors
+
+```rust
+Game::ranked(teams, results, options)    // dominant case
+Game::free_for_all(players, ranking)     // FFA with possible ties
+Game::custom(builder)                    // power users build their own graph
+```
+
+`GameOptions` carries iteration cap, epsilon, p_draw, and approximation choice. Today these are scattered between method args and module constants.
+
+### Trade-offs
+
+- Enum dispatch over trait objects for built-ins; richer factors drop in via new enum variants.
+- Variables and factor messages stored as `Vec<Gaussian>` indexed by `VarId` / edge slot — flat, cache-friendly.
+- `Schedule` is a generic parameter (zero-cost); most users get default; experimentation is open.
+
+### Open question
+
+Whether `enum BuiltinFactor` will feel too closed-world. The `Custom(Box<dyn Factor>)` escape hatch helps but inner-loop perf for user factors will be slower. Acceptable for now; flagged for future revisit if it becomes a problem.
+
+## Section 3 — Storage layout (SoA + arenas)
+
+### Dense Vec keyed by `Index`
+
+Every `HashMap<Index, T>` becomes a `Vec<T>` (or `Vec<Option<T>>` for sparse) indexed directly by `Index.0`. The public-facing `KeyTable<K>` continues to map arbitrary keys → `Index`.
+
+### SoA at hot layers, AoS at boundaries
+
+The `Skill` struct stays as a public type for the API (returned from `learning_curves`, etc.), but inside `TimeSlice` we lay it out column-wise:
+
+```rust
+struct TimeSliceSkills {
+    forward:    Vec<Gaussian>,   // [n_agents]
+    backward:   Vec<Gaussian>,
+    likelihood: Vec<Gaussian>,
+    online:     Vec<Gaussian>,
+    elapsed:    Vec<i64>,
+    present:    Vec<bool>,
+}
+```
+
+Within a slice, the inner loops touch one column repeatedly across many events — keeping the column contiguous improves cache utilization and makes the eventual SIMD step (Section 6) straightforward.
+
+`Gaussian` itself stays as a single 16-byte struct in the `Vec<Gaussian>`. Splitting into two parallel `Vec<f64>`s wins for pure SIMD over thousands of Gaussians but loses for the random-access patterns dominant in EP. Revisit if benchmarks demand it.
+
+### Arena allocator inside `Game`
+
+Replace per-event allocations with a `ScratchArena` reused across calls.
+
+```rust
+pub struct ScratchArena {
+    var_buf:     Vec<Gaussian>,
+    factor_buf:  Vec<Gaussian>,    // edge messages
+    bool_buf:    Vec<bool>,
+    f64_buf:     Vec<f64>,
+}
+impl ScratchArena {
+    fn reset(&mut self);                    // sets len=0, keeps capacity
+    fn alloc_vars(&mut self, n: usize) -> &mut [Gaussian];
+}
+```
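+
+One possible way to fill in the two method bodies, shown for the `var_buf` column only. The `capacity` helper and the public `Gaussian` fields are assumptions added so the reuse can be observed. Because `alloc_vars` borrows the arena mutably, this sketch allows only one live slice at a time.
+
+```rust
+#[derive(Clone, Copy, Debug, PartialEq)]
+pub struct Gaussian {
+    pub pi: f64,
+    pub tau: f64,
+}
+
+pub const UNIFORM: Gaussian = Gaussian { pi: 0.0, tau: 0.0 };
+
+#[derive(Default)]
+pub struct ScratchArena {
+    var_buf: Vec<Gaussian>,
+}
+
+impl ScratchArena {
+    /// Drops the contents but keeps the heap allocation for the next event.
+    pub fn reset(&mut self) {
+        self.var_buf.clear();
+    }
+
+    /// Hands out `n` fresh slots at the end of the buffer, set to UNIFORM.
+    pub fn alloc_vars(&mut self, n: usize) -> &mut [Gaussian] {
+        let start = self.var_buf.len();
+        self.var_buf.resize(start + n, UNIFORM);
+        &mut self.var_buf[start..]
+    }
+
+    /// Hypothetical helper exposing the retained capacity.
+    pub fn capacity(&self) -> usize {
+        self.var_buf.capacity()
+    }
+}
+```
+
+After the first event has grown the buffer, a `reset()` followed by a smaller `alloc_vars` call reuses the existing allocation instead of going back to the allocator.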
+
+`TimeSlice` owns one `ScratchArena`; each event borrows it for the duration of its `Game` construction and inference. For the parallel-slice story (Section 6), each Rayon task gets its own arena.
+
+### Per-event storage layout
+
+Inside a `TimeSlice`, each event is stored column-wise as well, with `Item` inlined into team-level parallel arrays:
+
+```rust
+struct EventStorage {
+    teams:   SmallVec<[TeamStorage; 4]>,
+    outcome: Outcome,
+    weights: SmallVec<[SmallVec<[f64; 4]>; 4]>,
+    evidence: f64,
+}
+struct TeamStorage {
+    competitors:   SmallVec<[Index; 4]>,    // who's on the team
+    edge_messages: SmallVec<[Gaussian; 4]>, // outgoing message per slot
+    output:        f64,
+}
+```
+
+Iteration over `(competitor, edge_message)` pairs zips two slices — no per-element struct.
+
+### SmallVec for typical shapes
+
+Teams ≤ ~5 players, games ≤ ~8 teams. `SmallVec<[T; 8]>` for a game's team list and `SmallVec<[T; 4]>` for team rosters keep the common case allocation-free.
+
+### Trade-offs
+
+- Dense `Vec<T>` keyed by `Index` is faster but means agent removal needs tombstones (or just leaves slots present-but-inactive). Acceptable: TrueSkill histories rarely remove players.
+- SoA at `TimeSlice` level only, not at `History` level. `History` keeps `Vec<TimeSlice>` because slices are heterogeneous in size.
+- One `ScratchArena` per `TimeSlice` keeps the lifetime story simple.
+
+### Open question
+
+The `TimeSliceSkills` sketch above uses (b) **dense + present mask**: one slot per agent in the history, indexed directly by `Index`, with a `present: Vec<bool>` mask for slices the agent didn't participate in. The alternative is (a) **sparse columnar**: a `Vec<Index>` of present agents and parallel `Vec<Gaussian>` columns of length `n_present`, with a separate lookup (binary search or auxiliary table) to find a given `Index`'s slot.
+
+(b) gives O(1) lookup and SIMD-friendly columns but wastes memory for sparsely populated slices. (a) is leaner per-slice but pays per-lookup cost in the inner loop. Bench both during T0 and pick. Default proposal: (b), since modern systems are memory-rich and the parallelism story is cleaner.
+
+## Section 4 — API surface
+
+### Typed event description
+
+```rust
+pub struct Event<T: Time, K> {
+    pub time: T,
+    pub teams: SmallVec<[Team<K>; 4]>,
+    pub outcome: Outcome,
+}
+
+pub struct Team<K> {
+    pub members: SmallVec<[Member<K>; 4]>,
+}
+
+pub struct Member<K> {
+    pub key: K,
+    pub weight: f64,                 // default 1.0
+    pub prior: Option<Rating>,       // per-event override
+}
+
+pub enum Outcome {
+    Ranked(SmallVec<[u32; 4]>),  // rank per team; equal ranks = tie
+    Scored(SmallVec<[f64; 4]>),  // continuous score per team (engages MarginFactor)
+}
+```
+
+`Outcome::winner(0)`, `Outcome::draw()`, `Outcome::ranking([0,1,2])` are convenience constructors.
+
+### Builders
+
+```rust
+let mut history = History::<i64, _>::builder()
+    .mu(25.0).sigma(25.0/3.0).beta(25.0/6.0)
+    .drift(ConstantDrift(0.03))
+    .p_draw(0.10)
+    .convergence(ConvergenceOptions { max_iter: 30, epsilon: 1e-6 })
+    .observer(LogObserver::default())
+    .build();
+```
+
+For the no-time case, name `Untimed` explicitly as the time parameter:
+
+```rust
+let mut history = History::<Untimed, _>::builder().build();
+```
+
+### Three-tier event ingestion
+
+```rust
+// 1. Bulk ingestion (high-throughput path)
+history.add_events(events_iter)?;
+
+// 2. One-off match (very common in practice)
+history.record_winner("alice", "bob", time)?;
+history.record_draw("alice", "bob", time)?;
+
+// 3. Builder for irregular shapes
+history.event(time)
+    .team(["alice", "bob"]).weights([1.0, 0.7])
+    .team(["carol"])
+    .ranking([1, 0])
+    .commit()?;
+```
+
+### Convergence & queries
+
+```rust
+let report: ConvergenceReport = history.converge()?;
+
+let curve: Vec<(i64, Gaussian)> = history.learning_curve(&"alice");
+let all = history.learning_curves();           // HashMap<&K, Vec<(T, Gaussian)>>
+let now = history.current_skill(&"alice");     // Option<Gaussian>
+
+let ev = history.log_evidence();
+let ev_for = history.log_evidence_for(&["alice", "bob"]);
+
+let q = history.predict_quality(&[&["alice"], &["bob"]]);
+let p_win = history.predict_outcome(&[&["alice"], &["bob"]]);
+```
+
+### Standalone Game
+
+```rust
+let g = Game::ranked(&[&[alice], &[bob]], Outcome::winner(0), &options);
+let post = g.posteriors();
+
+// Convenience
+let (a, b) = Game::one_v_one(&alice, &bob, Outcome::winner(0));
+```
+
+### Errors
+
+Replace `debug_assert!`/`panic!` at the API boundary with `Result`.
+
+```rust
+pub enum InferenceError {
+    MismatchedShape { kind: &'static str, expected: usize, got: usize },
+    InvalidProbability { value: f64 },
+    ConvergenceFailed { last_step: (f64, f64), iterations: usize },
+    NegativePrecision { pi: f64 },
+}
+```
+
+Hot inner loops still use `debug_assert!` for invariants the API has already enforced.
+
+### Trade-offs
+
+- Generic over user's `K`; engine works in `Index`. Public outputs use `&K`.
+- `SmallVec` everywhere on the event-description path.
+- Three-tier API so casual users don't drown in types and bulk users still get throughput.
+- `Outcome` enum replaces the "lower number wins" `&[f64]` convention.
+
+### Open question
+
+Whether to expose `Index` directly to users via an `intern_key(&K) -> Index` method, letting hot-path callers skip the `KeyTable` lookup on every call. Recommendation: yes, with a public `Index` handle plus `history.lookup<Q>(&Q) -> Option<Index>` where `K: Borrow<Q>` (the same bound direction as `HashMap::get`). The casual API still takes `&K` everywhere; power users can promote to `Index` when profiling demands.
+
+## Section 4½ — Naming pass
+
+| Current | New | Rationale |
+|---|---|---|
+| `History` | `History` (kept) | Matches upstream; reads cleanly. |
+| `Batch` | `TimeSlice` | Says what it is: every event sharing one timestamp. |
+| `Player` | `Rating` | The struct holds prior/beta/drift — that's a rating configuration. Resolves the `Player`/`Agent` confusion. |
+| `Agent` | `Competitor` | Holds dynamic state for someone competing in the history; fits the domain. |
+| `Skill` | `Skill` (kept) | Per-time-slice skill estimate; clearer than `BatchSkill`. |
+| `Item` | inlined into `TeamStorage` columns (engine) / `Member<K>` (public) | Eliminates the per-element struct in the hot path; gives API users a clear "team member" name. |
+| `Game` | `Game` (kept) | `Match` collides with Rust's `match`. |
+| `Index` | `Index` (kept) | Internal handle. |
+| `IndexMap` | `KeyTable` | Avoids confusion with the `indexmap` crate. |
+
+## Section 5 — Convergence & message scheduling
+
+### Three nested loops, one mechanism
+
+The system has three nested convergence loops:
+
+1. Within-game: EP sweeps over the factor graph
+2. Within-time-slice: re-running games as inputs change
+3. Cross-history: forward-pass then backward-pass over all slices
+
+All three implement `Workload`; one `Schedule` impl drives all of them.
+
+```rust
+pub trait Schedule {
+    fn run<W: Workload>(&self, workload: &mut W) -> ScheduleReport;
+}
+
+pub trait Workload {
+    fn step(&mut self) -> (f64, f64);
+    fn snapshot_evidence(&self) -> f64 { 0.0 }
+}
+
+pub struct ScheduleReport {
+    pub iterations: usize,
+    pub final_step: (f64, f64),
+    pub converged: bool,
+}
+```
+
+### Built-in schedules
+
+| Schedule | Behavior | Use |
+|---|---|---|
+| `EpsilonOrMax { eps, max }` | Default. Sweep until `(dpi, dtau) ≤ eps` or `max` iters. | All three loops. Replicates current behavior. |
+| `Damped { eps, max, alpha }` | Same, but writes `α·new + (1−α)·old`. | Stuck oscillations. |
+| `Residual { eps, max }` | Priority-queue: re-update factor with largest pending delta first. | Faster convergence on uneven graphs. |
+| `OneShot` | Exactly one pass, no convergence check. | Online incremental adds. |
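+
+To show how small the default driver can be, here is a sketch of `EpsilonOrMax` against the `Schedule` / `Workload` traits above. It applies one `eps` to both components of the step, and `Halving` is a toy workload invented for the example; `snapshot_evidence` is omitted.
+
+```rust
+pub trait Workload {
+    fn step(&mut self) -> (f64, f64);
+}
+
+pub struct ScheduleReport {
+    pub iterations: usize,
+    pub final_step: (f64, f64),
+    pub converged: bool,
+}
+
+pub trait Schedule {
+    fn run<W: Workload>(&self, workload: &mut W) -> ScheduleReport;
+}
+
+pub struct EpsilonOrMax {
+    pub eps: f64,
+    pub max: usize,
+}
+
+impl Schedule for EpsilonOrMax {
+    fn run<W: Workload>(&self, workload: &mut W) -> ScheduleReport {
+        let mut final_step = (f64::INFINITY, f64::INFINITY);
+        let mut iterations = 0;
+        let mut converged = false;
+        while iterations < self.max {
+            final_step = workload.step();
+            iterations += 1;
+            // Stop once both components of the step are within tolerance.
+            if final_step.0 <= self.eps && final_step.1 <= self.eps {
+                converged = true;
+                break;
+            }
+        }
+        ScheduleReport { iterations, final_step, converged }
+    }
+}
+
+/// Toy workload whose reported step halves on every call.
+pub struct Halving {
+    pub step: f64,
+}
+
+impl Workload for Halving {
+    fn step(&mut self) -> (f64, f64) {
+        self.step *= 0.5;
+        (self.step, self.step)
+    }
+}
+```
+
+Starting `Halving` at 1.0 with `eps = 0.1`, the reported steps are 0.5, 0.25, 0.125, 0.0625, so the run stops as converged after four iterations; with `max = 2` it stops after two with `converged == false`.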
+
+### Stopping in natural-param space
+
+Switch from `(|Δmu|, |Δsigma|) ≤ epsilon` to `(|Δpi|, |Δtau|) ≤ (eps_pi, eps_tau)`:
+
+- `mu` and `sigma` are on different scales; one tolerance is wrong for both
+- We store in nat-params anyway, so checking convergence in mu/sigma would cost needless sqrts
+- Nat-param delta is the natural geometry of the EP fixed point
+
+Default `EpsilonOrMax::default()` exposes a single `epsilon` for simplicity; advanced ctor exposes both tolerances.
+
+### Within-game improvements
+
+- Replace hard-cap of 10 iterations with `GameOptions::schedule` that propagates `ScheduleReport` upward
+- Fast path: graphs with no diff chain (e.g. 1v1, where a single iteration suffices) skip the loop entirely
+- FFA / many-team ranks benefit from `Residual`; opt-in
+
+### Within-slice and cross-history improvements
+
+- **No more old/new HashMap snapshotting**: track deltas inline as we write under SoA
+- **Per-slice dirty bits**: a `TimeSlice` whose neighbor messages haven't changed since its last full sweep doesn't need to re-run. Track `time_slice.dirty` and skip clean ones during the cross-history sweep. Big win for online-add (the locality case).
+
+### `ConvergenceReport`
+
+```rust
+pub struct ConvergenceReport {
+    pub iterations: usize,
+    pub final_step: (f64, f64),
+    pub log_evidence: f64,
+    pub converged: bool,
+    pub per_iteration_time: SmallVec<[Duration; 32]>,
+    pub batches_skipped: usize,
+}
+```
+
+`Observer` continues to receive per-iteration callbacks for live UI; `ConvergenceReport` is the post-hoc summary.
+
+### Trade-offs
+
+- One `Schedule` trait shared across loops — fewer concepts, more composable.
+- Convergence checks in nat-param space — slightly different exact threshold than today; tests' epsilons re-tuned mechanically.
+- Dirty-bit skipping changes iteration order vs. today; fixed point is the same, iteration counts may shift downward.
+- `Residual` and `Damped` are opt-in; default behavior matches today closely.
+
+### Open question
+
+Whether `Schedule::run` should take an optional `Observer` reference. Recommendation: observation lives at a higher layer (`History::converge` calls observer hooks; `Schedule` is purely the loop driver).
+
+## Section 6 — Concurrency & parallelism
+
+### What's parallelizable
+
+| Operation | Parallelism | Strategy |
+|---|---|---|
+| `History::converge()` (full forward+backward) | Sequential across slices | Within each slice: color-group events in parallel via Rayon |
+| `History::add_events(...)` | Sequential append, but ingestion of typed events into `EventStorage` parallelizes trivially | n/a |
+| `History::learning_curves()` | Per-key parallel | `into_par_iter()` |
+| `History::log_evidence_for(targets)` | Per-slice parallel, reduce sum | `par_iter().map(...).sum()` |
+| `Game` inference | Sequential | n/a (too small to amortize Rayon overhead) |
+
+### Within-slice color-group parallelism
+
+When events are added to a slice, partition them into color groups where events in the same color touch no shared `Index`. Within a color, run events in parallel via Rayon. Across colors, run sequentially. Preserves asynchronous-EP semantics exactly.
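+
+One simple way to build such a partition is greedy first-fit, sketched below. This is an assumption about the approach rather than the planned implementation: events are reduced to lists of raw competitor indices, `color_groups` is a hypothetical name, and the function only assigns events to groups without running any inference.
+
+```rust
+use std::collections::HashSet;
+
+/// Assigns each event (a list of competitor indices) to the first color
+/// group that shares no index with it; returns event ids per group.
+pub fn color_groups(events: &[Vec<usize>]) -> Vec<Vec<usize>> {
+    let mut groups: Vec<Vec<usize>> = Vec::new(); // event ids per color
+    let mut used: Vec<HashSet<usize>> = Vec::new(); // indices touched per color
+    for (event_id, members) in events.iter().enumerate() {
+        // First existing color in which none of this event's members appear.
+        let slot = used
+            .iter()
+            .position(|seen| members.iter().all(|m| !seen.contains(m)));
+        let color = match slot {
+            Some(c) => c,
+            None => {
+                // No compatible color yet: open a new one.
+                groups.push(Vec::new());
+                used.push(HashSet::new());
+                groups.len() - 1
+            }
+        };
+        groups[color].push(event_id);
+        used[color].extend(members.iter().copied());
+    }
+    groups
+}
+```
+
+For the events `[0,1]`, `[2,3]`, `[1,2]`, `[0,3]` this yields two groups, `[0, 1]` and `[2, 3]` (by event id): the first two events are disjoint, and the last two each clash with the first group but not with each other.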
+
+Alternative: synchronous EP with snapshot. All events read from a frozen skill snapshot, write deltas to thread-local buffers, barrier merges. Trivially parallel but weaker per-iteration convergence — needs damping. Available as a `Schedule` impl, opt-in.
+
+### `Send + Sync` requirements
+
+All public traits (`Time`, `Drift`, `Observer`, `Factor`, `Schedule`) require `Send + Sync`. `Observer` impls must be thread-safe (called from arbitrary worker threads).
+
+### Rayon as default-on feature
+
+`rayon` is a default-on feature. Parallel paths are gated behind `cfg(feature = "rayon")`; with `default-features = false` they fall back to sequential iterators.
+
+### Expected speedup ballpark
+
+For 1000 players, 60 events/slice × 1000 slices, 30 convergence iterations:
+
+| Source | Estimated speedup vs. today |
+|---|---|
+| `HashMap` → dense `Vec` | 2–4× |
+| Natural-param `Gaussian`, no-sqrt mul/div | 1.5–2× |
+| Pre-allocated `ScratchArena` | 1.2–1.5× |
+| Color-group parallel events in slice (8 cores) | 2–4× |
+| Dirty-bit slice skipping (online add case) | 5–50× |
+| **Combined (offline converge)** | ~10–30× |
+| **Combined (online add)** | ~50–500× depending on locality |
+
+These are pre-implementation estimates. Each tier validates with criterion.
+
+### Trade-offs
+
+- Color-group parallelism requires up-front graph coloring at ingestion. Cost: linear in events, run once per `add_events`. Cheap.
+- Default = asynchronous EP (preserves current semantics). Synchronous opt-in only.
+- Cross-slice sweep stays sequential; no speculative parallel sweeps.
+- Rayon default-on but feature-gated.
+
+### Open question
+
+Whether to expose color-group partitioning to users. Recommendation: hidden by default, escape hatch via `add_events_with_partition(...)` for power users who already know their event independence.
+
+## Section 7 — Migration, testing, and delivery plan
+
+The crate is unreleased, so version-bump ceremony doesn't apply. Tiers are sequencing of work and milestones, not releases.
+
+### Tier sequence
+
+**T0 — Numerical parity (no API change)**
+
+Internal-only. Public surface unchanged.
+
+- Switch `Gaussian` storage to natural parameters `(pi, tau)`. `mu()`/`sigma()` become accessors.
+- Replace `HashMap<Index, _>` with dense `Vec<_>` keyed by `Index.0` everywhere.
+- Introduce `ScratchArena` inside `Batch` so `Game::new` stops allocating per-event.
+- Drop the `panic!` in `mu_sigma`; return `Result` propagated upward.
+
+**Acceptance:** existing test suite passes (bit-equal where possible, ULP-bounded where natural-param arithmetic shifts a rounding); `cargo bench` shows ≥3× win on `batch` benchmark; no API breakage.
+
+**T1 — Factor graph machinery (internal-only)**
+
+- Introduce `Factor`, `VarStore`, `Schedule` as `pub(crate)` types.
+- Re-implement `Game::likelihoods()` on top of `BuiltinFactor::{Perf, TeamSum, RankDiff, Trunc}` driven by `EpsilonOrMax`.
+- Replace within-game iteration tracking with `ScheduleReport`.
+
+**Acceptance:** existing test suite passes (ULP-bounded); within-game iteration counts unchanged; benchmarks ≥ T0.
+
+**T2 — New API surface (breaking)**
+
+All renames and the new public API land together. No half-renamed intermediate state.
+
+- New types: `Rating`, `TimeSlice`, `Competitor`, `Member<K>`, `Outcome`, `Event<T, K>`, `KeyTable<K>`.
+- `Time` trait introduced; `History<T: Time, D: Drift<T>>` is generic.
+- Three-tier API surface: `record_winner`, `event(...).team(...).commit()`, bulk `add_events(iter)`.
+- `Observer` trait + `ConvergenceReport`; `verbose: bool` deleted.
+- `panic!`/`debug_assert!` at API boundary become `Result<_, InferenceError>`.
+- Promote `Factor`/`Schedule`/`VarStore` to `pub` under a `factors` module.
+
+**Acceptance:** full test suite rewritten in new API; equivalence tests prove identical posteriors vs. old API on the same inputs.
+
+**T3 — Concurrency**
+
+- `Send + Sync` audit and bounds on all public traits.
+- Color-group partitioning at `TimeSlice` ingestion.
+- `rayon` as default-on feature, with parallel paths gated by `#[cfg(feature = "rayon")]` and a sequential fallback.
+- Parallel paths: within-slice color groups, `learning_curves`, `log_evidence_for`.
+
+**Acceptance:** deterministic posteriors across `RAYON_NUM_THREADS={1,2,4,8}`; benchmarks show >2× on 8-core for offline converge.
+
+**T4 — Richer factor types & schedules**
+
+Each shipped independently after T3.
+
+- `MarginFactor` → enables `Outcome::Scored`.
+- `Damped` and `Residual` schedules.
+- `SynergyFactor`, `ScoreFactor` → same pattern when wanted.
+
+Each comes with its own benchmark and a worked example in `examples/`.
+
+### Testing strategy
+
+| Layer | Approach |
+|---|---|
+| **Numerical correctness** | Keep existing hardcoded golden values from `test_1vs1`, `test_1vs1_draw`, `test_2vs1vs2_mixed`, etc. through T0–T1 unchanged. They are a regression net against the original Python port. |
+| **API parity** | T2 adds an `equivalence` test module that runs identical inputs through old vs. new construction and compares posteriors within ULPs. |
+| **Property tests** | Add `proptest` for: factor graph fixed-point invariance under message order, `Outcome` round-trip, `Gaussian` mul/div associativity in nat-params, schedule convergence regardless of starting state. |
+| **Determinism** | T3 adds tests that run identical input across multiple Rayon thread counts and assert identical posteriors. |
+| **Benchmark gates** | Each tier has a "must not regress" gate vs. the previous tier on the existing `batch` and `gaussian` criterion suites. T0 must beat baseline by ≥3×; T1 ≥ T0; etc. |
+
+### Risk management
+
+- **T0 risk: rounding drift in tests.** Mitigation: where natural-param arithmetic legitimately changes the last ULPs, update goldens *and* simultaneously add a parity test against a snapshot taken from baseline to prove the difference is bounded.
+- **T2 risk: API design mistakes.** Mitigation: review the spec and a worked example before implementing; iterate on feedback.
+- **T3 risk: subtle race conditions in color-group partitioning.** Mitigation: `loom` tests for the merge step; deterministic-output assertion across thread counts.
+- **Cross-tier risk: scope creep.** Each tier has a closed checklist; new ideas go to the next tier's wishlist.
+
+### What we're explicitly *not* doing
+
+- No GPU offload.
+- No `no_std` support.
+- No serde / persistence in this design.
+- No incremental online API beyond `record_winner` / `add_events`.
+
+## Open questions summary
+
+Collected here for the review pass:
+
+1. **`enum BuiltinFactor` extensibility** — may feel too closed-world; revisit if user-defined factors via `Custom(Box<dyn Factor>)` become common.
+2. **Sparse vs. dense per-slice skill storage** — default to dense + `present` mask; sparse columnar is the alternative. Decided by T0 benchmarks.
+3. **`Index` exposure for hot paths** — expose `intern_key`/`lookup` so power users can promote `&K` to `Index` and skip the `KeyTable` lookup; casual API still takes `&K` everywhere.
+4. **`Schedule::run` and observer wiring** — observation stays at higher layer (`History::converge` calls observer hooks; `Schedule` is purely the loop driver).
+5. **Color-group partition exposure** — hidden by default, escape hatch via `add_events_with_partition(...)`.