trueskill-tt/docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md
Anders Olsson d2aab82c1e T0 + T1 + T2: engine redesign through new API surface (#1)
Implements tiers T0, T1, T2 of `docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md`. All three tiers have landed together on this branch because they build on one another; this PR rolls them up for a single review pass.

Per-tier plans:
- T0: `docs/superpowers/plans/2026-04-23-t0-numerical-parity.md`
- T1: `docs/superpowers/plans/2026-04-24-t1-factor-graph.md`
- T2: `docs/superpowers/plans/2026-04-24-t2-new-api-surface.md`

## Summary

### T0 — Numerical parity (internal)

- `Gaussian` switched to natural-parameter storage `(pi, tau)`; mul/div now ~7× faster (218 ps vs 1.57 ns).
- `HashMap<Index, _>` → dense `Vec<_>` keyed by `Index.0` (via `AgentStore<D>`, `SkillStore`).
- `ScratchArena` eliminates per-event allocations in `Game::likelihoods`.
- `InferenceError` seed type added (1 variant).
- 38 → 53 tests passing through T1.
- Benchmark: `Batch::iteration` 29.84 → 21.25 µs.

### T1 — Factor graph machinery (internal)

- `Factor` trait + `BuiltinFactor` enum (TeamSum / RankDiff / Trunc) driving within-game inference.
- `VarStore` flat storage for variable marginals.
- `Schedule` trait + `EpsilonOrMax` impl replacing the hand-rolled EP loop.
- `Game::likelihoods` rebuilt on the factor-graph machinery; iteration counts and goldens preserved to within 1e-6.
- 53 tests passing.
- Benchmark: `Batch::iteration` 23.01 µs (slight regression absorbed in T2).

### T2 — New API surface (breaking)

**Renames:**
- `IndexMap → KeyTable`, `Player → Rating`, `Agent → Competitor`, `Batch → TimeSlice`

**New types:**
- `Time` trait with `Untimed` ZST and `i64` impls; `Drift<T>`, `Rating<T, D>`, `Competitor<T, D>`, `TimeSlice<T>`, `History<T, D, O, K>` all generic.
- `Event<T, K>`, `Team<K>`, `Member<K>`, `Outcome` (`Ranked` variant; `#[non_exhaustive]`).
- `Observer<T>` trait + `NullObserver`.
- `ConvergenceOptions`, `ConvergenceReport`.
- `GameOptions`, `OwnedGame<T, D>`.

**Three-tier ingestion:**
- `history.record_winner(&K, &K, T)` / `record_draw(&K, &K, T)` — 1v1 convenience.
- `history.add_events(iter)` — typed bulk.
- `history.event(T).team([...]).weights([...]).ranking([...]).commit()` — fluent.

**Query API:** `current_skill`, `learning_curve`, `learning_curves` (keyed on `K`), `log_evidence`, `log_evidence_for`, `predict_quality`, `predict_outcome`.

**Game constructors:** `ranked`, `one_v_one`, `free_for_all`, `custom` — all returning `Result<_, InferenceError>`.

**`factors` module:** `Factor`, `Schedule`, `VarStore`, `VarId`, `BuiltinFactor`, `EpsilonOrMax`, `ScheduleReport`, `TeamSumFactor`, `RankDiffFactor`, `TruncFactor` now public.

**Errors:** `InferenceError` gains `MismatchedShape`, `InvalidProbability`, `ConvergenceFailed`; boundary panics converted to `Result`.

**Removed (breaking):** `History::convergence(iters, eps, verbose)`, `HistoryBuilder::gamma(f64)`, `HistoryBuilder::time(bool)`, `History.time: bool`, `learning_curves_by_index`, nested-Vec public `add_events`.

## Behavior change (documented in CHANGELOG)

`Time = Untimed` has `elapsed_to → 0`, so no drift accumulates between slices. The old `time=false` mode implicitly forced `elapsed=1` on reappearance via an `i64::MAX` sentinel — that quirk is not reproducible under a typed time axis. Tests that depended on it now use `History::<i64, _>` with explicit `1..=n` timestamps. One test (`test_env_ttt`) had 3 Gaussian goldens updated to reflect the corrected semantics; documented in commit `33a7d90`.

## Final numbers

| Metric | Before T0 | After T2 | Delta |
|---|---|---|---|
| `Batch::iteration` | 29.84 µs | 21.36 µs | **-28%** |
| `Gaussian::mul` | 1.57 ns | 219 ps | **-86%** |
| `Gaussian::div` | 1.57 ns | 219 ps | **-86%** |
| Tests passing | 38 | 90 | +52 |

All other Gaussian ops unchanged (~219 ps add/sub, ~264 ps pi/tau reads).

## Test plan

- [x] `cargo test --features approx` — 90/90 pass (68 lib + 10 api_shape + 6 game + 4 record_winner + 2 equivalence)
- [x] `cargo clippy --all-targets --features approx -- -D warnings` — clean
- [x] `cargo +nightly fmt --check` — clean
- [x] `cargo bench --bench batch` — 21.36 µs
- [x] `cargo bench --bench gaussian` — unchanged from T1
- [x] `cargo run --example atp --features approx` — rewritten in new API, runs clean
- [x] Historical Game-level goldens preserved in `tests/equivalence.rs`
- [x] Public API matches spec Section 4 (verified by integration tests in `tests/api_shape.rs`)

## Commit history

~45 commits total across T0 + T1 + T2. Each task is self-contained and individually tested; the branch is bisectable. See `git log main..t2-new-api-surface` for the full list.

## Deferred to later tiers

- `Outcome::Scored` + `MarginFactor` — T4
- `Damped` / `Residual` schedules — T4
- `Send + Sync` bounds + Rayon parallelism — T3
- N-team `predict_outcome` — T4
- `Game::custom` full ergonomics — T4

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #1
Co-authored-by: Anders Olsson <anders.e.olsson@gmail.com>
Co-committed-by: Anders Olsson <anders.e.olsson@gmail.com>
2026-04-24 11:20:04 +00:00

# TrueSkill-TT Engine Redesign — Design
**Date:** 2026-04-23
**Status:** Approved (pending implementation plan)
## Summary
Comprehensive redesign of the TrueSkill-TT engine targeting four orthogonal goals:
1. **Performance** — substantially faster offline convergence and incremental online updates.
2. **Accuracy and richer match formats** — support for score margins, free-for-all with partial orders, correlated skills.
3. **Better convergence** — replace ad-hoc capped iteration with a pluggable `Schedule` trait covering all three nested loops.
4. **Better API surface** — typed event description, observer-based progress reporting, generic time axis, structured errors, ergonomic builders.
The design is comprehensive (Approach 1 of three considered) but delivered in five tiers so each step is independently shippable and validated by benchmarks.
## Goals & non-goals
**Goals**
- 10–30× speedup on the offline convergence path for representative workloads (1000+ players, 1000+ events, 30 iterations)
- Order-of-magnitude speedup on incremental "add a single event" workloads
- Pluggable factor graph allowing new factor types without engine changes
- Optional Rayon-backed parallelism on top of `Send + Sync`-correct internals
- Typed, ergonomic public API; replace nested `Vec<Vec<Vec<_>>>` shapes with `Event<T, K>` / `Team<K>` / `Member<K>`
- Generic time axis: `Untimed`, `i64`, or user-supplied
- Observer-based progress instead of `verbose: bool` + `println!`
- Structured `Result<_, InferenceError>` at API boundaries
**Non-goals**
- WebAssembly support is not a goal; we may break it if a crate or feature requires it.
- No GPU offload.
- No `no_std` support.
- No persistent format / serde — possible future feature.
- No replacement of the Gaussian/EP approximation itself in this design (the underlying inference math stays the same; we change layout, dispatch, scheduling, and API around it).
## Workload assumptions
Baseline workload that drives perf decisions:
- ~1000+ players
- ~1000+ events total
- ~50–60 events per time slice (per day)
- Both online (incremental adds) and offline (full convergence) are common
- Offline convergence runs frequently
## Section 1 — Core types & traits
The foundation everything else builds on.
### `Gaussian` — natural-parameter storage
Switch storage from `(mu, sigma)` to natural parameters `(pi, tau)` where `pi = sigma⁻²`, `tau = mu · pi`. Multiplication and division dominate the hot path; in nat-params they are direct adds/subs of the components, no `sqrt`. Reads of `mu`/`sigma` become accessor methods (`tau / pi`, `1.0 / pi.sqrt()`). The trade is correct because reads are vanishingly rare compared to writes in EP.
```rust
pub struct Gaussian { pi: f64, tau: f64 }
pub const UNIFORM: Gaussian = Gaussian { pi: 0.0, tau: 0.0 }; // replaces N_INF
```
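To make the storage trade concrete, here is a minimal sketch of the natural-parameter form. The `from_mu_sigma` constructor and the use of `std::ops::Mul`/`Div` are assumptions made for illustration; the spec only fixes the two fields and the accessor formulas.

```rust
/// Gaussian in natural parameters: pi = 1 / sigma^2, tau = mu * pi.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Gaussian {
    pub pi: f64,
    pub tau: f64,
}

pub const UNIFORM: Gaussian = Gaussian { pi: 0.0, tau: 0.0 };

impl Gaussian {
    /// Hypothetical constructor from the moment form (mu, sigma).
    pub fn from_mu_sigma(mu: f64, sigma: f64) -> Self {
        let pi = 1.0 / (sigma * sigma);
        Gaussian { pi, tau: mu * pi }
    }

    /// Accessor: mu = tau / pi.
    pub fn mu(&self) -> f64 {
        self.tau / self.pi
    }

    /// Accessor: sigma = 1 / sqrt(pi). This is the only place a sqrt is paid.
    pub fn sigma(&self) -> f64 {
        1.0 / self.pi.sqrt()
    }
}

impl std::ops::Mul for Gaussian {
    type Output = Gaussian;
    /// Product of densities: component-wise addition, no sqrt.
    fn mul(self, rhs: Gaussian) -> Gaussian {
        Gaussian { pi: self.pi + rhs.pi, tau: self.tau + rhs.tau }
    }
}

impl std::ops::Div for Gaussian {
    type Output = Gaussian;
    /// Quotient of densities: component-wise subtraction, no sqrt.
    fn div(self, rhs: Gaussian) -> Gaussian {
        Gaussian { pi: self.pi - rhs.pi, tau: self.tau - rhs.tau }
    }
}
```

In this sketch, multiplying N(25, 5²) by N(30, 5²) yields mean 27.5 and sigma 5/√2, and dividing the product by the second factor recovers the first. The accessors are not meaningful on `UNIFORM`, since `pi` is zero there.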
### `Time` trait
Replaces the bare `i64` time field. Keeps `History` parametric.
```rust
pub trait Time: Copy + Ord + Send + Sync + 'static {
    fn elapsed_to(&self, later: &Self) -> i64;
}
pub struct Untimed; // ZST for the no-time-axis case
impl Time for Untimed { fn elapsed_to(&self, _: &Self) -> i64 { 0 } }
impl Time for i64 { fn elapsed_to(&self, later: &Self) -> i64 { later - self } }
// Optional impls behind feature flags: time::OffsetDateTime, chrono types
```
### `Drift<T>` trait
Generic over `T: Time` so seasonal/calendar-aware drift is possible without going through `i64`.
```rust
pub trait Drift<T: Time>: Copy + Send + Sync {
    fn variance_delta(&self, from: &T, to: &T) -> f64;
}
```
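`ConstantDrift(f64)` impl: `from.elapsed_to(to) as f64 * gamma * gamma`, where `gamma` is the wrapped `f64`. The receiver is the earlier time, matching the `elapsed_to(&self, later)` signature, so the variance delta is non-negative when `to` is later than `from`.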
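A minimal sketch of how `Time` and `Drift<T>` might fit together, assuming elapsed time is measured from `from` to the later `to`. The derives on `Untimed` and the blanket `impl<T: Time>` for `ConstantDrift` are assumptions, not something the spec pins down.

```rust
pub trait Time: Copy + Ord + Send + Sync + 'static {
    fn elapsed_to(&self, later: &Self) -> i64;
}

/// ZST for histories without a time axis.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Untimed;

impl Time for Untimed {
    fn elapsed_to(&self, _later: &Self) -> i64 {
        0
    }
}

impl Time for i64 {
    fn elapsed_to(&self, later: &Self) -> i64 {
        later - self
    }
}

pub trait Drift<T: Time>: Copy + Send + Sync {
    fn variance_delta(&self, from: &T, to: &T) -> f64;
}

/// Constant per-unit-time drift; the wrapped value is gamma.
#[derive(Clone, Copy, Debug)]
pub struct ConstantDrift(pub f64);

// Assumption: one impl covers every time axis, since it only needs elapsed_to.
impl<T: Time> Drift<T> for ConstantDrift {
    fn variance_delta(&self, from: &T, to: &T) -> f64 {
        let gamma = self.0;
        from.elapsed_to(to) as f64 * gamma * gamma
    }
}
```

With `gamma = 0.5`, four elapsed time units add a variance of `4 * 0.25 = 1.0`; under `Untimed` the delta is always zero, so no drift accumulates.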
### `Index` and `KeyTable<K>`
`Index(usize)` is the handle into dense per-`History` `Vec` storage. Public, but intended for use by power users on hot paths who want to skip the `KeyTable` lookup. Casual API takes `&K`. `KeyTable<K>` (renamed from `IndexMap`, to avoid colliding with the `indexmap` crate's type) maps user keys → `Index`.
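One way the key-to-handle mapping could look is sketched here. The method names `get_or_create`, `get`, and `key` and the internal `HashMap` plus `Vec` layout are hypothetical; the spec only says that `KeyTable<K>` maps user keys to `Index`.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Handle into dense per-history storage.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct Index(pub usize);

/// Hypothetical layout: a hash map for key -> Index, a Vec for Index -> key.
pub struct KeyTable<K> {
    by_key: HashMap<K, Index>,
    keys: Vec<K>,
}

impl<K: Eq + Hash + Clone> KeyTable<K> {
    pub fn new() -> Self {
        KeyTable { by_key: HashMap::new(), keys: Vec::new() }
    }

    /// Returns the existing handle for `key`, or assigns the next dense slot.
    pub fn get_or_create(&mut self, key: &K) -> Index {
        if let Some(&idx) = self.by_key.get(key) {
            return idx;
        }
        let idx = Index(self.keys.len());
        self.by_key.insert(key.clone(), idx);
        self.keys.push(key.clone());
        idx
    }

    /// Lookup without insertion.
    pub fn get(&self, key: &K) -> Option<Index> {
        self.by_key.get(key).copied()
    }

    /// Reverse lookup, used when public outputs are keyed on `&K`.
    pub fn key(&self, idx: Index) -> Option<&K> {
        self.keys.get(idx.0)
    }
}
```

Because handles are handed out densely from zero, a caller that holds an `Index` can address a per-agent `Vec` directly with `idx.0` and skip the hash lookup.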
### `Observer` trait
Replaces `verbose: bool` + `println!`. Default no-op impls; user overrides what they need.
```rust
pub trait Observer<T: Time>: Send + Sync {
    fn on_iteration_end(&self, _iter: usize, _max_step: (f64, f64)) {}
    fn on_batch_processed(&self, _time: &T, _idx: usize, _n_events: usize) {}
    fn on_converged(&self, _iters: usize, _final_step: (f64, f64)) {}
}
pub struct NullObserver;
impl<T: Time> Observer<T> for NullObserver {}
```
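Because the hooks take `&self` and the trait requires `Send + Sync`, an observer that records anything needs interior mutability. The sketch below uses a hypothetical `StepRecorder` that overrides only one hook; the trait definitions are repeated so the block stands alone.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

pub trait Time: Copy + Ord + Send + Sync + 'static {
    fn elapsed_to(&self, later: &Self) -> i64;
}

impl Time for i64 {
    fn elapsed_to(&self, later: &Self) -> i64 {
        later - self
    }
}

pub trait Observer<T: Time>: Send + Sync {
    fn on_iteration_end(&self, _iter: usize, _max_step: (f64, f64)) {}
    fn on_batch_processed(&self, _time: &T, _idx: usize, _n_events: usize) {}
    fn on_converged(&self, _iters: usize, _final_step: (f64, f64)) {}
}

/// Hypothetical observer: counts iterations and remembers the last step size.
#[derive(Default)]
pub struct StepRecorder {
    iterations: AtomicUsize,
    last_step: Mutex<Option<(f64, f64)>>,
}

impl StepRecorder {
    pub fn iterations(&self) -> usize {
        self.iterations.load(Ordering::Relaxed)
    }

    pub fn last_step(&self) -> Option<(f64, f64)> {
        *self.last_step.lock().unwrap()
    }
}

// Only one hook is overridden; the others keep their no-op defaults.
impl<T: Time> Observer<T> for StepRecorder {
    fn on_iteration_end(&self, _iter: usize, max_step: (f64, f64)) {
        self.iterations.fetch_add(1, Ordering::Relaxed);
        *self.last_step.lock().unwrap() = Some(max_step);
    }
}
```

An atomic counter and a `Mutex` keep the recorder safe to call from several worker threads at once, which matters once hooks can fire from parallel code.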
### Trade-offs
- `Gaussian` natural-param representation: anyone reading `mu`/`sigma` in a hot loop pays a sqrt — but that's correct, hot reads are rare.
- `Time` as a trait (not enum) keeps it open-ended at zero runtime cost; default `History<i64, _>` keeps the call sites familiar.
- `Observer` is a trait (not a closure) so different sites can have different signatures without losing type safety. `NullObserver` is a ZST.
## Section 2 — Factor graph architecture
The current `Game::likelihoods` is a hand-rolled, hard-coded graph. To unlock richer formats and let us experiment with EP schedules, the graph itself becomes a data structure.
### Variable / Factor model
Variables hold their current Gaussian marginal. Factors hold their outgoing messages to each connected variable plus do the local computation. Standard EP: factor's update is "divide marginal by old outgoing → cavity → apply local approximation → multiply marginal by new outgoing."
```rust
pub trait Factor: Send + Sync {
    fn variables(&self) -> &[VarId];
    fn propagate(&mut self, vars: &mut VarStore) -> (f64, f64); // returns max delta
    fn log_evidence(&self, _vars: &VarStore) -> f64 { 0.0 }
}
```
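The divide, approximate, multiply cycle described above can be sketched with a deliberately trivial factor. `ObservationFactor` is hypothetical and not part of the planned catalog: its likelihood is already Gaussian, so the "local approximation" step is exact and the new message is just the likelihood. The `VarStore` layout and the inherent `mul`/`div` methods are likewise assumptions.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Gaussian {
    pub pi: f64,
    pub tau: f64,
}

pub const UNIFORM: Gaussian = Gaussian { pi: 0.0, tau: 0.0 };

impl Gaussian {
    pub fn mul(self, o: Gaussian) -> Gaussian {
        Gaussian { pi: self.pi + o.pi, tau: self.tau + o.tau }
    }
    pub fn div(self, o: Gaussian) -> Gaussian {
        Gaussian { pi: self.pi - o.pi, tau: self.tau - o.tau }
    }
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct VarId(pub usize);

/// Assumed layout: one marginal per variable, indexed by VarId.
pub struct VarStore {
    pub marginals: Vec<Gaussian>,
}

pub trait Factor: Send + Sync {
    fn variables(&self) -> &[VarId];
    fn propagate(&mut self, vars: &mut VarStore) -> (f64, f64);
    fn log_evidence(&self, _vars: &VarStore) -> f64 {
        0.0
    }
}

/// Hypothetical single-variable factor with a fixed Gaussian likelihood.
pub struct ObservationFactor {
    var: [VarId; 1],
    likelihood: Gaussian,
    outgoing: Gaussian, // last message sent to the variable
}

impl ObservationFactor {
    pub fn new(var: VarId, likelihood: Gaussian) -> Self {
        ObservationFactor { var: [var], likelihood, outgoing: UNIFORM }
    }
}

impl Factor for ObservationFactor {
    fn variables(&self) -> &[VarId] {
        &self.var
    }

    fn propagate(&mut self, vars: &mut VarStore) -> (f64, f64) {
        let slot = self.var[0].0;
        // 1. Cavity: remove this factor's previous contribution.
        let cavity = vars.marginals[slot].div(self.outgoing);
        // 2. Local approximation: exact here, so the message is the likelihood.
        let new_msg = self.likelihood;
        // 3. Fold the new message back into the marginal.
        vars.marginals[slot] = cavity.mul(new_msg);
        // Step size in natural-parameter space: (|d pi|, |d tau|).
        let delta = (
            (new_msg.pi - self.outgoing.pi).abs(),
            (new_msg.tau - self.outgoing.tau).abs(),
        );
        self.outgoing = new_msg;
        delta
    }
}
```

Calling `propagate` a second time leaves the marginal unchanged and returns a zero step, which is the fixed-point behavior a schedule relies on to stop.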
### Built-in factor catalog
| Factor | Purpose | Status |
|---|---|---|
| `PerformanceFactor` | skill → performance (add β² noise, optional weight) | replaces inline `performance() * weight` |
| `TeamSumFactor` | weighted sum of player perfs → team perf | replaces inline `fold` |
| `RankDiffFactor` | (team_a perf) − (team_b perf) → diff var | currently `team[e].posterior_win() - team[e+1].posterior_lose()` |
| `TruncFactor` | EP truncation: `P(diff > margin)`, or `P(\|diff\| < margin)` for draws | wraps current `v_w` / `approx` |
| `MarginFactor` *(future)* | use observed score margin as soft evidence | enables richer match formats |
| `SynergyFactor` *(future)* | couples teammates' skills | enables different topology |
| `ScoreFactor` *(future)* | continuous outcome (e.g., points scored) | enables score-based outcomes |
The first four together exactly reproduce today's algorithm. The last three are extension slots.
### Game = factor graph + schedule
```rust
pub struct Game<S: Schedule = DefaultSchedule> {
    vars: VarStore,      // SoA: Vec<Gaussian> marginals
    factors: FactorList, // enum dispatch over BuiltinFactor (see Open Questions)
    schedule: S,
}
```
Lean toward **enum dispatch** (`enum BuiltinFactor { Perf(...), Sum(...), RankDiff(...), Trunc(...), ... }`) over `Box<dyn Factor>` for the built-ins:
- avoids per-message vtable overhead in the hottest loop
- keeps factor data inline (no heap indirection)
- still allows user-defined factors via a `BuiltinFactor::Custom(Box<dyn Factor>)` variant
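The dispatch shape can be illustrated with a stripped-down sketch. Everything here is a simplified stand-in: `VarStore` holds plain `f64` values, `propagate` returns a single delta, and `SumFactor` and `ClampFactor` are hypothetical factors chosen only to show a statically dispatched variant next to a boxed one.

```rust
/// Simplified stand-in for the real variable store.
pub struct VarStore {
    pub values: Vec<f64>,
}

pub trait Factor: Send + Sync {
    fn propagate(&mut self, vars: &mut VarStore) -> f64;
}

/// Hypothetical built-in: writes the sum of its inputs to an output slot.
pub struct SumFactor {
    pub inputs: Vec<usize>,
    pub output: usize,
}

impl Factor for SumFactor {
    fn propagate(&mut self, vars: &mut VarStore) -> f64 {
        let sum: f64 = self.inputs.iter().map(|&i| vars.values[i]).sum();
        let delta = (sum - vars.values[self.output]).abs();
        vars.values[self.output] = sum;
        delta
    }
}

/// Hypothetical user-defined factor: caps one slot at `max`.
pub struct ClampFactor {
    pub slot: usize,
    pub max: f64,
}

impl Factor for ClampFactor {
    fn propagate(&mut self, vars: &mut VarStore) -> f64 {
        let old = vars.values[self.slot];
        let new = old.min(self.max);
        vars.values[self.slot] = new;
        (new - old).abs()
    }
}

pub enum BuiltinFactor {
    Sum(SumFactor),
    Custom(Box<dyn Factor>),
}

impl BuiltinFactor {
    pub fn propagate(&mut self, vars: &mut VarStore) -> f64 {
        match self {
            // Concrete type known at compile time: direct, inlinable call.
            BuiltinFactor::Sum(f) => f.propagate(vars),
            // Escape hatch: one vtable call through the boxed trait object.
            BuiltinFactor::Custom(f) => f.propagate(vars),
        }
    }
}
```

A `Vec<BuiltinFactor>` stores the built-in variants inline, so only the `Custom` arm pays for heap indirection and dynamic dispatch.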
### Schedule trait
Controls iteration order and stopping. Default = current behavior (sweep forward, then backward, until ε or max iters). Pluggable so we can later try damped EP or junction-tree schedules.
### High-level constructors
```rust
Game::ranked(teams, results, options) // dominant case
Game::free_for_all(players, ranking) // FFA with possible ties
Game::custom(builder) // power users build their own graph
```
`GameOptions` carries iteration cap, epsilon, p_draw, and approximation choice. Today these are scattered between method args and module constants.
### Trade-offs
- Enum dispatch over trait objects for built-ins; richer factors drop in via new enum variants.
- Variables and factor messages stored as `Vec<Gaussian>` indexed by `VarId` / edge slot — flat, cache-friendly.
- `Schedule` is a generic parameter (zero-cost); most users get default; experimentation is open.
### Open question
Whether `enum BuiltinFactor` will feel too closed-world. The `Custom(Box<dyn Factor>)` escape hatch helps but inner-loop perf for user factors will be slower. Acceptable for now; flagged for future revisit if it becomes a problem.
## Section 3 — Storage layout (SoA + arenas)
### Dense Vec keyed by `Index`
Every `HashMap<Index, T>` becomes a `Vec<T>` (or `Vec<Option<T>>` for sparse) indexed directly by `Index.0`. The public-facing `KeyTable<K>` continues to map arbitrary keys → `Index`.
### SoA at hot layers, AoS at boundaries
The `Skill` struct stays as a public type for the API (returned from `learning_curves`, etc.), but inside `TimeSlice` we lay it out column-wise:
```rust
struct TimeSliceSkills {
    forward: Vec<Gaussian>, // [n_agents]
    backward: Vec<Gaussian>,
    likelihood: Vec<Gaussian>,
    online: Vec<Gaussian>,
    elapsed: Vec<i64>,
    present: Vec<bool>,
}
```
Within a slice, the inner loops touch one column repeatedly across many events — keeping the column contiguous improves cache utilization and makes the eventual SIMD step (Section 6) straightforward.
`Gaussian` itself stays as a single 16-byte struct in the `Vec<Gaussian>`. Splitting into two parallel `Vec<f64>`s wins for pure SIMD over thousands of Gaussians but loses for the random-access patterns dominant in EP. Revisit if benchmarks demand it.
### Arena allocator inside `Game`
Replace per-event allocations with a `ScratchArena` reused across calls.
```rust
pub struct ScratchArena {
    var_buf: Vec<Gaussian>,
    factor_buf: Vec<Gaussian>, // edge messages
    bool_buf: Vec<bool>,
    f64_buf: Vec<f64>,
}
impl ScratchArena {
    fn reset(&mut self); // sets len=0, keeps capacity
    fn alloc_vars(&mut self, n: usize) -> &mut [Gaussian];
}
```
`TimeSlice` owns one `ScratchArena`; each event borrows it for the duration of its `Game` construction and inference. For the parallel-slice story (Section 6), each Rayon task gets its own arena.
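A possible body for the two arena methods is sketched here, assuming `reset` maps onto `Vec::clear` (which keeps capacity) and that freshly allocated variables start out uniform. The `var_capacity` helper is hypothetical and exists only to make the reuse observable.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Gaussian {
    pub pi: f64,
    pub tau: f64,
}

pub const UNIFORM: Gaussian = Gaussian { pi: 0.0, tau: 0.0 };

#[derive(Default)]
pub struct ScratchArena {
    var_buf: Vec<Gaussian>,
    factor_buf: Vec<Gaussian>, // edge messages
    bool_buf: Vec<bool>,
    f64_buf: Vec<f64>,
}

impl ScratchArena {
    /// Sets every buffer's length to 0; `Vec::clear` keeps the capacity.
    pub fn reset(&mut self) {
        self.var_buf.clear();
        self.factor_buf.clear();
        self.bool_buf.clear();
        self.f64_buf.clear();
    }

    /// Hands out `n` fresh variable slots at the end of the buffer.
    /// Assumption: new slots are initialized to the uniform message.
    pub fn alloc_vars(&mut self, n: usize) -> &mut [Gaussian] {
        let start = self.var_buf.len();
        self.var_buf.resize(start + n, UNIFORM);
        &mut self.var_buf[start..]
    }

    /// Hypothetical helper for inspecting reuse.
    pub fn var_capacity(&self) -> usize {
        self.var_buf.capacity()
    }
}
```

After the first event has grown the buffers, later events of the same or smaller shape reuse the existing capacity, so the steady state performs no heap allocation.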
### Per-event storage layout
Inside a `TimeSlice`, each event is stored column-wise as well, with `Item` inlined into team-level parallel arrays:
```rust
struct EventStorage {
    teams: SmallVec<[TeamStorage; 4]>,
    outcome: Outcome,
    weights: SmallVec<[SmallVec<[f64; 4]>; 4]>,
    evidence: f64,
}
struct TeamStorage {
    competitors: SmallVec<[Index; 4]>,      // who's on the team
    edge_messages: SmallVec<[Gaussian; 4]>, // outgoing message per slot
    output: f64,
}
```
Iteration over `(competitor, edge_message)` pairs zips two slices — no per-element struct.
### SmallVec for typical shapes
Teams ≤ ~5 players, games ≤ ~8 teams. `SmallVec<[T; 8]>` for team membership and `SmallVec<[T; 4]>` for team rosters keeps the common case allocation-free.
### Trade-offs
- Dense `Vec<T>` keyed by `Index` is faster but means agent removal needs tombstones (or just leaves slots present-but-inactive). Acceptable: TrueSkill histories rarely remove players.
- SoA at `TimeSlice` level only, not at `History` level. `History` keeps `Vec<TimeSlice>` because slices are heterogeneous in size.
- One `ScratchArena` per `TimeSlice` keeps the lifetime story simple.
### Open question
The `TimeSliceSkills` sketch above uses (b) **dense + present mask**: one slot per agent in the history, indexed directly by `Index`, with a `present: Vec<bool>` mask for batches the agent didn't participate in. The alternative is (a) **sparse columnar**: a `Vec<Index>` of present agents and parallel `Vec<Gaussian>` columns of length `n_present`, with a separate lookup (binary search or auxiliary table) to find a given `Index`'s slot.
(b) gives O(1) lookup and SIMD-friendly columns but wastes memory for sparsely populated slices. (a) is leaner per-slice but pays per-lookup cost in the inner loop. Bench both during T0 and pick. Default proposal: (b), since modern systems are memory-rich and the parallelism story is cleaner.
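The two layouts can be compared side by side on a single column. The struct names `DenseColumn` and `SparseColumn` are hypothetical, and the sparse variant assumes its `agents` list is kept sorted so that binary search applies.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Gaussian {
    pub pi: f64,
    pub tau: f64,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Index(pub usize);

/// (b) Dense + present mask: one slot per agent in the whole history.
pub struct DenseColumn {
    pub forward: Vec<Gaussian>,
    pub present: Vec<bool>,
}

impl DenseColumn {
    /// O(1): index straight into the column, then check the mask.
    pub fn get(&self, idx: Index) -> Option<Gaussian> {
        if *self.present.get(idx.0)? {
            Some(self.forward[idx.0])
        } else {
            None
        }
    }
}

/// (a) Sparse columnar: only present agents, kept sorted by Index.
pub struct SparseColumn {
    pub agents: Vec<Index>,
    pub forward: Vec<Gaussian>,
}

impl SparseColumn {
    /// O(log n_present): binary search for the agent's slot.
    pub fn get(&self, idx: Index) -> Option<Gaussian> {
        let slot = self.agents.binary_search(&idx).ok()?;
        Some(self.forward[slot])
    }
}
```

For a slice in which one of four agents appears, the dense column stores four Gaussians and the sparse one stores a single Gaussian plus one `Index`; both answer every lookup identically.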
## Section 4 — API surface
### Typed event description
```rust
pub struct Event<T: Time, K> {
    pub time: T,
    pub teams: SmallVec<[Team<K>; 4]>,
    pub outcome: Outcome,
}
pub struct Team<K> {
    pub members: SmallVec<[Member<K>; 4]>,
}
pub struct Member<K> {
    pub key: K,
    pub weight: f64,           // default 1.0
    pub prior: Option<Rating>, // per-event override
}
pub enum Outcome {
    Ranked(SmallVec<[u32; 4]>), // rank per team; equal ranks = tie
    Scored(SmallVec<[f64; 4]>), // continuous score per team (engages MarginFactor)
}
```
`Outcome::winner(0)`, `Outcome::draw()`, `Outcome::ranking([0,1,2])` are convenience constructors.
### Builders
```rust
let mut history = History::<i64, _>::builder()
    .mu(25.0).sigma(25.0/3.0).beta(25.0/6.0)
    .drift(ConstantDrift(0.03))
    .p_draw(0.10)
    .convergence(ConvergenceOptions { max_iter: 30, epsilon: 1e-6 })
    .observer(LogObserver::default())
    .build();
```
For the no-time case, pass `Untimed` as the time parameter:
```rust
let mut history = History::<Untimed, _>::builder().build();
```
### Three-tier event ingestion
```rust
// 1. Bulk ingestion (high-throughput path)
history.add_events(events_iter)?;
// 2. One-off match (very common in practice)
history.record_winner("alice", "bob", time)?;
history.record_draw("alice", "bob", time)?;
// 3. Builder for irregular shapes
history.event(time)
    .team(["alice", "bob"]).weights([1.0, 0.7])
    .team(["carol"])
    .ranking([1, 0])
    .commit()?;
```
### Convergence & queries
```rust
let report: ConvergenceReport = history.converge()?;
let curve: Vec<(i64, Gaussian)> = history.learning_curve(&"alice");
let all = history.learning_curves(); // HashMap<&K, Vec<(T, Gaussian)>>
let now = history.current_skill(&"alice"); // Option<Gaussian>
let ev = history.log_evidence();
let ev_for = history.log_evidence_for(&["alice", "bob"]);
let q = history.predict_quality(&[&["alice"], &["bob"]]);
let p_win = history.predict_outcome(&[&["alice"], &["bob"]]);
```
### Standalone Game
```rust
let g = Game::ranked(&[&[alice], &[bob]], Outcome::winner(0), &options);
let post = g.posteriors();
// Convenience
let (a, b) = Game::one_v_one(&alice, &bob, Outcome::winner(0));
```
### Errors
Replace `debug_assert!`/`panic!` at the API boundary with `Result`.
```rust
pub enum InferenceError {
    MismatchedShape { kind: &'static str, expected: usize, got: usize },
    InvalidProbability { value: f64 },
    ConvergenceFailed { last_step: (f64, f64), iterations: usize },
    NegativePrecision { pi: f64 },
}
```
Hot inner loops still use `debug_assert!` for invariants the API has already enforced.
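A sketch of what boundary validation could look like with this error type. The `Display` wording, the helper names `check_p_draw` and `check_weights`, and the accepted range `[0, 1)` for a draw probability are all assumptions made for illustration.

```rust
use std::fmt;

#[derive(Debug, Clone, PartialEq)]
pub enum InferenceError {
    MismatchedShape { kind: &'static str, expected: usize, got: usize },
    InvalidProbability { value: f64 },
    ConvergenceFailed { last_step: (f64, f64), iterations: usize },
    NegativePrecision { pi: f64 },
}

impl fmt::Display for InferenceError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            InferenceError::MismatchedShape { kind, expected, got } => {
                write!(f, "mismatched {}: expected {}, got {}", kind, expected, got)
            }
            InferenceError::InvalidProbability { value } => {
                write!(f, "invalid probability: {}", value)
            }
            InferenceError::ConvergenceFailed { last_step, iterations } => {
                write!(f, "no convergence after {} iterations (last step {:?})", iterations, last_step)
            }
            InferenceError::NegativePrecision { pi } => {
                write!(f, "negative precision: {}", pi)
            }
        }
    }
}

impl std::error::Error for InferenceError {}

/// Hypothetical boundary check. Assumes valid draw probabilities lie in [0, 1);
/// NaN fails the range test and is rejected as well.
pub fn check_p_draw(value: f64) -> Result<f64, InferenceError> {
    if (0.0..1.0).contains(&value) {
        Ok(value)
    } else {
        Err(InferenceError::InvalidProbability { value })
    }
}

/// Hypothetical boundary check: one weight per team member.
pub fn check_weights(n_members: usize, weights: &[f64]) -> Result<(), InferenceError> {
    if weights.len() == n_members {
        Ok(())
    } else {
        Err(InferenceError::MismatchedShape {
            kind: "weights",
            expected: n_members,
            got: weights.len(),
        })
    }
}
```

Once checks like these have run at ingestion, the inner loops can keep `debug_assert!` for the same invariants without paying for them in release builds.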
### Trade-offs
- Generic over user's `K`; engine works in `Index`. Public outputs use `&K`.
- `SmallVec` everywhere on the event-description path.
- Three-tier API so casual users don't drown in types and bulk users still get throughput.
- `Outcome` enum replaces the "lower number wins" `&[f64]` convention.
### Open question
Whether to expose `Index` directly to users via an `intern_key(&K) -> Index` method, letting hot-path callers skip the `KeyTable` lookup on every call. Recommendation: yes, with a public `Index` handle plus `history.lookup<Q>(&Q) -> Option<Index>` where `K: Borrow<Q>` (the same bound `HashMap::get` uses, so a `String`-keyed history can be queried with `&str`). The casual API still takes `&K` everywhere; power users can promote to `Index` when profiling demands.
## Section 4½ — Naming pass
| Current | New | Rationale |
|---|---|---|
| `History` | `History` (kept) | Matches upstream; reads cleanly. |
| `Batch` | `TimeSlice` | Says what it is: every event sharing one timestamp. |
| `Player` | `Rating` | The struct holds prior/beta/drift — that's a rating configuration. Resolves the `Player`/`Agent` confusion. |
| `Agent` | `Competitor` | Holds dynamic state for someone competing in the history; fits the domain. |
| `Skill` | `Skill` (kept) | Per-time-slice skill estimate; clearer than `BatchSkill`. |
| `Item` | inlined into `TeamStorage` columns (engine) / `Member<K>` (public) | Eliminates the per-element struct in the hot path; gives API users a clear "team member" name. |
| `Game` | `Game` (kept) | `Match` collides with Rust's `match`. |
| `Index` | `Index` (kept) | Internal handle. |
| `IndexMap` | `KeyTable` | Avoids confusion with the `indexmap` crate. |
## Section 5 — Convergence & message scheduling
### Three nested loops, one mechanism
The system has three nested convergence loops:
1. Within-game: EP sweeps over the factor graph
2. Within-time-slice: re-running games as inputs change
3. Cross-history: forward-pass then backward-pass over all slices
All three implement `Workload`; one `Schedule` impl drives all of them.
```rust
pub trait Schedule {
    fn run<W: Workload>(&self, workload: &mut W) -> ScheduleReport;
}
pub trait Workload {
    fn step(&mut self) -> (f64, f64);
    fn snapshot_evidence(&self) -> f64 { 0.0 }
}
pub struct ScheduleReport {
    pub iterations: usize,
    pub final_step: (f64, f64),
    pub converged: bool,
}
```
### Built-in schedules
| Schedule | Behavior | Use |
|---|---|---|
| `EpsilonOrMax { eps, max }` | Default. Sweep until `(dpi, dtau) ≤ eps` or `max` iters. | All three loops. Replicates current behavior. |
| `Damped { eps, max, alpha }` | Same, but writes `α·new + (1 − α)·old`. | Stuck oscillations. |
| `Residual { eps, max }` | Priority-queue: re-update factor with largest pending delta first. | Faster convergence on uneven graphs. |
| `OneShot` | Exactly one pass, no convergence check. | Online incremental adds. |
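A minimal sketch of the default schedule against the `Schedule` and `Workload` traits, assuming a single tolerance applied to both step components and at least one step whenever `max > 0`. The `Halving` workload is a hypothetical stand-in whose step size halves on every call.

```rust
pub trait Workload {
    fn step(&mut self) -> (f64, f64);
    fn snapshot_evidence(&self) -> f64 {
        0.0
    }
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ScheduleReport {
    pub iterations: usize,
    pub final_step: (f64, f64),
    pub converged: bool,
}

pub trait Schedule {
    fn run<W: Workload>(&self, workload: &mut W) -> ScheduleReport;
}

pub struct EpsilonOrMax {
    pub eps: f64,
    pub max: usize,
}

impl Schedule for EpsilonOrMax {
    fn run<W: Workload>(&self, workload: &mut W) -> ScheduleReport {
        let mut final_step = (f64::INFINITY, f64::INFINITY);
        for iter in 1..=self.max {
            final_step = workload.step();
            // Stop once both components of the step are within tolerance.
            if final_step.0 <= self.eps && final_step.1 <= self.eps {
                return ScheduleReport { iterations: iter, final_step, converged: true };
            }
        }
        // Iteration cap reached without meeting the tolerance.
        ScheduleReport { iterations: self.max, final_step, converged: false }
    }
}

/// Hypothetical workload: each step is half the size of the previous one.
pub struct Halving {
    pub step: f64,
}

impl Workload for Halving {
    fn step(&mut self) -> (f64, f64) {
        self.step *= 0.5;
        (self.step, self.step)
    }
}
```

Starting from a step of 1.0 with `eps = 1e-3`, the tenth halving (2⁻¹⁰ ≈ 0.00098) is the first within tolerance, so the report shows 10 iterations; with `max = 5` the loop stops at the cap and reports `converged: false`.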
### Stopping in natural-param space
Switch from `(|Δmu|, |Δsigma|) ≤ epsilon` to `(|Δpi|, |Δtau|) ≤ (eps_pi, eps_tau)`:
- `mu` and `sigma` are on different scales; one tolerance is wrong for both
- We store in nat-params anyway, so checking convergence in mu/sigma would cost extra sqrts for no benefit
- Nat-param delta is the natural geometry of the EP fixed point
Default `EpsilonOrMax::default()` exposes a single `epsilon` for simplicity; advanced ctor exposes both tolerances.
### Within-game improvements
- Replace hard-cap of 10 iterations with `GameOptions::schedule` that propagates `ScheduleReport` upward
- Fast path: graphs with no diff chain (1v1 with 1 iter sufficient) skip the loop entirely
- FFA / many-team ranks benefit from `Residual`; opt-in
### Within-slice and cross-history improvements
- **No more old/new HashMap snapshotting**: track deltas inline as we write under SoA
- **Per-slice dirty bits**: a `TimeSlice` whose neighbor messages haven't changed since its last full sweep doesn't need to re-run. Track `time_slice.dirty` and skip clean ones during the cross-history sweep. Big win for online-add (the locality case).
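The dirty-bit idea can be sketched as a forward sweep that skips clean slices and re-marks neighbors when a slice's outgoing messages change. The `TimeSlice` stand-in, the `runs` counter, and the rule "a change dirties both neighbors" are simplifying assumptions; the real propagation rule may be narrower.

```rust
/// Stand-in for a time slice: just the dirty flag and a run counter.
pub struct TimeSlice {
    pub dirty: bool,
    pub runs: usize,
}

/// One forward sweep. `run_slice` re-runs a slice's games and returns true
/// if its outgoing messages changed. Returns (processed, skipped).
pub fn sweep_forward<F>(slices: &mut [TimeSlice], mut run_slice: F) -> (usize, usize)
where
    F: FnMut(&mut TimeSlice) -> bool,
{
    let (mut processed, mut skipped) = (0, 0);
    for i in 0..slices.len() {
        if !slices[i].dirty {
            skipped += 1;
            continue;
        }
        let changed = run_slice(&mut slices[i]);
        slices[i].dirty = false;
        processed += 1;
        if changed {
            // The earlier neighbor is picked up by the next backward sweep,
            // the later neighbor later in this same sweep.
            if i > 0 {
                slices[i - 1].dirty = true;
            }
            if i + 1 < slices.len() {
                slices[i + 1].dirty = true;
            }
        }
    }
    (processed, skipped)
}
```

With five slices of which only the fourth is dirty, a sweep in which just that slice's messages change runs two slices (the dirty one and its later neighbor) and skips three, leaving the earlier neighbor flagged for the backward pass.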
### `ConvergenceReport`
```rust
pub struct ConvergenceReport {
    pub iterations: usize,
    pub final_step: (f64, f64),
    pub log_evidence: f64,
    pub converged: bool,
    pub per_iteration_time: SmallVec<[Duration; 32]>,
    pub batches_skipped: usize,
}
```
`Observer` continues to receive per-iteration callbacks for live UI; `ConvergenceReport` is the post-hoc summary.
### Trade-offs
- One `Schedule` trait shared across loops — fewer concepts, more composable.
- Convergence checks in nat-param space — slightly different exact threshold than today; tests' epsilons re-tuned mechanically.
- Dirty-bit skipping changes iteration order vs. today; fixed point is the same, iteration counts may shift downward.
- `Residual` and `Damped` are opt-in; default behavior matches today closely.
### Open question
Whether `Schedule::run` should take an optional `Observer` reference. Recommendation: observation lives at a higher layer (`History::converge` calls observer hooks; `Schedule` is purely the loop driver).
## Section 6 — Concurrency & parallelism
### What's parallelizable
| Operation | Parallelism | Strategy |
|---|---|---|
| `History::converge()` (full forward+backward) | Sequential across slices | Within each slice: color-group events in parallel via Rayon |
| `History::add_events(...)` | Sequential append, but ingestion of typed events into `EventStorage` parallelizes trivially | n/a |
| `History::learning_curves()` | Per-key parallel | `into_par_iter()` |
| `History::log_evidence_for(targets)` | Per-batch parallel, reduce sum | `par_iter().map(...).sum()` |
| `Game` inference | Sequential | n/a (too small to amortize Rayon overhead) |
### Within-slice color-group parallelism
When events are added to a slice, partition them into color groups where events in the same color touch no shared `Index`. Within a color, run events in parallel via Rayon. Across colors, run sequentially. Preserves asynchronous-EP semantics exactly.
Alternative: synchronous EP with snapshot. All events read from a frozen skill snapshot, write deltas to thread-local buffers, barrier merges. Trivially parallel but weaker per-iteration convergence — needs damping. Available as a `Schedule` impl, opt-in.
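The color-group partitioning described above could be prototyped with a greedy first-fit pass. This is a sketch under stated assumptions: events are represented as lists of raw agent indices, the function name `color_groups` is hypothetical, and first-fit does not by itself preserve the original ordering between conflicting events, so a production version would need an extra rule if that ordering matters.

```rust
use std::collections::HashSet;

/// Greedy first-fit partition. `events[i]` lists the agent indices that
/// event `i` touches. Returns, per color, the event positions in that color;
/// no two events in the same color share an agent.
pub fn color_groups(events: &[Vec<usize>]) -> Vec<Vec<usize>> {
    let mut groups: Vec<Vec<usize>> = Vec::new();
    let mut used: Vec<HashSet<usize>> = Vec::new(); // agents seen per color

    for (event_id, members) in events.iter().enumerate() {
        // First color whose agent set is disjoint from this event's members.
        let fit = used
            .iter()
            .position(|seen| members.iter().all(|m| !seen.contains(m)));
        let color = match fit {
            Some(c) => c,
            None => {
                groups.push(Vec::new());
                used.push(HashSet::new());
                groups.len() - 1
            }
        };
        groups[color].push(event_id);
        used[color].extend(members.iter().copied());
    }
    groups
}
```

For the events `[0,1]`, `[2,3]`, `[1,2]`, `[4,5]`, `[0,3]`, the pass produces two colors: events 0, 1, and 3 in the first, events 2 and 4 in the second. Each color can then be handed to a parallel iterator while the colors themselves run in sequence.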
### `Send + Sync` requirements
All public traits (`Time`, `Drift`, `Observer`, `Factor`, `Schedule`) require `Send + Sync`. `Observer` impls must be thread-safe (called from arbitrary worker threads).
### Rayon as default-on feature
`rayon` as default-on feature; with `default-features = false`, parallel paths fall back to sequential iterators behind `cfg(feature = "rayon")`.
### Expected speedup ballpark
For 1000 players, 60 events/slice × 1000 slices, 30 convergence iterations:
| Source | Estimated speedup vs. today |
|---|---|
| `HashMap` → dense `Vec` | 2–4× |
| Natural-param `Gaussian`, no-sqrt mul/div | 1.5–2× |
| Pre-allocated `ScratchArena` | 1.2–1.5× |
| Color-group parallel events in slice (8 cores) | 2–4× |
| Dirty-bit slice skipping (online add case) | 5–50× |
| **Combined (offline converge)** | ~10–30× |
| **Combined (online add)** | ~50–500× depending on locality |
These are pre-implementation estimates. Each tier validates with criterion.
### Trade-offs
- Color-group parallelism requires up-front graph coloring at ingestion. Cost: linear in events, run once per `add_events`. Cheap.
- Default = asynchronous EP (preserves current semantics). Synchronous opt-in only.
- Cross-slice sweep stays sequential; no speculative parallel sweeps.
- Rayon default-on but feature-gated.
### Open question
Whether to expose color-group partitioning to users. Recommendation: hidden by default, escape hatch via `add_events_with_partition(...)` for power users who already know their event independence.
## Section 7 — Migration, testing, and delivery plan
The crate is unreleased, so version-bump ceremony doesn't apply. Tiers are sequencing of work and milestones, not releases.
### Tier sequence
**T0 — Numerical parity (no API change)**
Internal-only. Public surface unchanged.
- Switch `Gaussian` storage to natural parameters `(pi, tau)`. `mu()`/`sigma()` become accessors.
- Replace `HashMap<Index, _>` with dense `Vec<_>` keyed by `Index.0` everywhere.
- Introduce `ScratchArena` inside `Batch` so `Game::new` stops allocating per-event.
- Drop the `panic!` in `mu_sigma`; return `Result` propagated upward.
**Acceptance:** existing test suite passes (bit-equal where possible, ULP-bounded where natural-param arithmetic shifts a rounding); `cargo bench` shows ≥3× win on `batch` benchmark; no API breakage.
**T1 — Factor graph machinery (internal-only)**
- Introduce `Factor`, `VarStore`, `Schedule` as `pub(crate)` types.
- Re-implement `Game::likelihoods()` on top of `BuiltinFactor::{Perf, TeamSum, RankDiff, Trunc}` driven by `EpsilonOrMax`.
- Replace within-game iteration tracking with `ScheduleReport`.
**Acceptance:** existing test suite passes (ULP-bounded); within-game iteration counts unchanged; benchmarks ≥ T0.
**T2 — New API surface (breaking)**
All renames and the new public API land together. No half-renamed intermediate state.
- New types: `Rating`, `TimeSlice`, `Competitor`, `Member<K>`, `Outcome`, `Event<T, K>`, `KeyTable<K>`.
- `Time` trait introduced; `History<T: Time, D: Drift<T>>` is generic.
- Three-tier API surface: `record_winner`, `event(...).team(...).commit()`, bulk `add_events(iter)`.
- `Observer` trait + `ConvergenceReport`; `verbose: bool` deleted.
- `panic!`/`debug_assert!` at API boundary become `Result<_, InferenceError>`.
- Promote `Factor`/`Schedule`/`VarStore` to `pub` under a `factors` module.
**Acceptance:** full test suite rewritten in new API; equivalence tests prove identical posteriors vs. old API on the same inputs.
**T3 — Concurrency**
- `Send + Sync` audit and bounds on all public traits.
- Color-group partitioning at `TimeSlice` ingestion.
- `rayon` as default-on feature with `#[cfg(feature = "rayon")]` fallback.
- Parallel paths: within-slice color groups, `learning_curves`, `log_evidence_for`.
**Acceptance:** deterministic posteriors across `RAYON_NUM_THREADS={1,2,4,8}`; benchmarks show >2× on 8-core for offline converge.
**T4 — Richer factor types & schedules**
Each shipped independently after T3.
- `MarginFactor` → enables `Outcome::Scored`.
- `Damped` and `Residual` schedules.
- `SynergyFactor`, `ScoreFactor` → same pattern when wanted.
Each comes with its own benchmark and a worked example in `examples/`.
### Testing strategy
| Layer | Approach |
|---|---|
| **Numerical correctness** | Keep existing hardcoded golden values from `test_1vs1`, `test_1vs1_draw`, `test_2vs1vs2_mixed`, etc. through T0–T1 unchanged. They are a regression net against the original Python port. |
| **API parity** | T2 adds an `equivalence` test module that runs identical inputs through old vs. new construction and compares posteriors within ULPs. |
| **Property tests** | Add `proptest` for: factor graph fixed-point invariance under message order, `Outcome` round-trip, `Gaussian` mul/div associativity in nat-params, schedule convergence regardless of starting state. |
| **Determinism** | T3 adds tests that run identical input across multiple Rayon thread counts and assert identical posteriors. |
| **Benchmark gates** | Each tier has a "must not regress" gate vs. the previous tier on the existing `batch` and `gaussian` criterion suites. T0 must beat baseline by ≥3×; T1 ≥ T0; etc. |
### Risk management
- **T0 risk: rounding drift in tests.** Mitigation: where natural-param arithmetic legitimately changes the last ULPs, update goldens *and* simultaneously add a parity test against a snapshot taken from baseline to prove the difference is bounded.
- **T2 risk: API design mistakes.** Mitigation: review the spec and a worked example before implementing; iterate on feedback.
- **T3 risk: subtle race conditions in color-group partitioning.** Mitigation: `loom` tests for the merge step; deterministic-output assertion across thread counts.
- **Cross-tier risk: scope creep.** Each tier has a closed checklist; new ideas go to the next tier's wishlist.
### What we're explicitly *not* doing
- No GPU offload.
- No `no_std` support.
- No serde / persistence in this design.
- No incremental online API beyond `record_winner` / `add_events`.
## Open questions summary
Collected here for the review pass:
1. **`enum BuiltinFactor` extensibility** — may feel too closed-world; revisit if user-defined factors via `Custom(Box<dyn Factor>)` become common.
2. **Sparse vs. dense per-slice skill storage** — default to dense + `present` mask; sparse columnar is the alternative. Decided by T0 benchmarks.
3. **`Index` exposure for hot paths** — expose `intern_key`/`lookup` so power users can promote `&K` to `Index` and skip the `KeyTable` lookup; casual API still takes `&K` everywhere.
4. **`Schedule::run` and observer wiring** — observation stays at higher layer (`History::converge` calls observer hooks; `Schedule` is purely the loop driver).
5. **Color-group partition exposure** — hidden by default, escape hatch via `add_events_with_partition(...)`.