# T0 + T1 + T2: engine redesign through new API surface (#1)
Implements tiers T0, T1, T2 of `docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md`. All three tiers have landed together on this branch because they build on one another; this PR rolls them up for a single review pass.

Per-tier plans:
- T0: `docs/superpowers/plans/2026-04-23-t0-numerical-parity.md`
- T1: `docs/superpowers/plans/2026-04-24-t1-factor-graph.md`
- T2: `docs/superpowers/plans/2026-04-24-t2-new-api-surface.md`

## Summary

### T0 — Numerical parity (internal)

- `Gaussian` switched to natural-parameter storage `(pi, tau)`; mul/div now ~7× faster (218 ps vs 1.57 ns).
- `HashMap<Index, _>` → dense `Vec<_>` keyed by `Index.0` (via `AgentStore<D>`, `SkillStore`).
- `ScratchArena` eliminates per-event allocations in `Game::likelihoods`.
- `InferenceError` seed type added (1 variant).
- 38 → 53 tests passing through T1.
- Benchmark: `Batch::iteration` 29.84 → 21.25 µs.

### T1 — Factor graph machinery (internal)

- `Factor` trait + `BuiltinFactor` enum (TeamSum / RankDiff / Trunc) driving within-game inference.
- `VarStore` flat storage for variable marginals.
- `Schedule` trait + `EpsilonOrMax` impl replacing the hand-rolled EP loop.
- `Game::likelihoods` rebuilt on the factor-graph machinery; iteration counts and goldens preserved to within 1e-6.
- 53 tests passing.
- Benchmark: `Batch::iteration` 23.01 µs (slight regression absorbed in T2).

### T2 — New API surface (breaking)

**Renames:**
- `IndexMap → KeyTable`, `Player → Rating`, `Agent → Competitor`, `Batch → TimeSlice`

**New types:**
- `Time` trait with `Untimed` ZST and `i64` impls; `Drift<T>`, `Rating<T, D>`, `Competitor<T, D>`, `TimeSlice<T>`, `History<T, D, O, K>` all generic.
- `Event<T, K>`, `Team<K>`, `Member<K>`, `Outcome` (`Ranked` variant; `#[non_exhaustive]`).
- `Observer<T>` trait + `NullObserver`.
- `ConvergenceOptions`, `ConvergenceReport`.
- `GameOptions`, `OwnedGame<T, D>`.

**Three-tier ingestion:**
- `history.record_winner(&K, &K, T)` / `record_draw(&K, &K, T)` — 1v1 convenience.
- `history.add_events(iter)` — typed bulk.
- `history.event(T).team([...]).weights([...]).ranking([...]).commit()` — fluent.

**Query API:** `current_skill`, `learning_curve`, `learning_curves` (keyed on `K`), `log_evidence`, `log_evidence_for`, `predict_quality`, `predict_outcome`.

**Game constructors:** `ranked`, `one_v_one`, `free_for_all`, `custom` — all returning `Result<_, InferenceError>`.

**`factors` module:** `Factor`, `Schedule`, `VarStore`, `VarId`, `BuiltinFactor`, `EpsilonOrMax`, `ScheduleReport`, `TeamSumFactor`, `RankDiffFactor`, `TruncFactor` now public.

**Errors:** `InferenceError` gains `MismatchedShape`, `InvalidProbability`, `ConvergenceFailed`; boundary panics converted to `Result`.

**Removed (breaking):** `History::convergence(iters, eps, verbose)`, `HistoryBuilder::gamma(f64)`, `HistoryBuilder::time(bool)`, `History.time: bool`, `learning_curves_by_index`, nested-Vec public `add_events`.

## Behavior change (documented in CHANGELOG)

`Time = Untimed` has `elapsed_to → 0`, so no drift accumulates between slices. The old `time=false` mode implicitly forced `elapsed=1` on reappearance via an `i64::MAX` sentinel — that quirk is not reproducible under a typed time axis. Tests that depended on it now use `History::<i64, _>` with explicit `1..=n` timestamps. One test (`test_env_ttt`) had 3 Gaussian goldens updated to reflect the corrected semantics; documented in commit `33a7d90`.

## Final numbers

| Metric | Before T0 | After T2 | Delta |
|---|---|---|---|
| `Batch::iteration` | 29.84 µs | 21.36 µs | **-28%** |
| `Gaussian::mul` | 1.57 ns | 219 ps | **-86%** |
| `Gaussian::div` | 1.57 ns | 219 ps | **-86%** |
| Tests passing | 38 | 90 | +52 |

All other Gaussian ops unchanged (~219 ps add/sub, ~264 ps pi/tau reads).

## Test plan

- [x] `cargo test --features approx` — 90/90 pass (68 lib + 10 api_shape + 6 game + 4 record_winner + 2 equivalence)
- [x] `cargo clippy --all-targets --features approx -- -D warnings` — clean
- [x] `cargo +nightly fmt --check` — clean
- [x] `cargo bench --bench batch` — 21.36 µs
- [x] `cargo bench --bench gaussian` — unchanged from T1
- [x] `cargo run --example atp --features approx` — rewritten in new API, runs clean
- [x] Historical Game-level goldens preserved in `tests/equivalence.rs`
- [x] Public API matches spec Section 4 (verified by integration tests in `tests/api_shape.rs`)

## Commit history

~45 commits total across T0 + T1 + T2. Each task is self-contained and individually tested; the branch is bisectable. See `git log main..t2-new-api-surface` for the full list.

## Deferred to later tiers

- `Outcome::Scored` + `MarginFactor` — T4
- `Damped` / `Residual` schedules — T4
- `Send + Sync` bounds + Rayon parallelism — T3
- N-team `predict_outcome` — T4
- `Game::custom` full ergonomics — T4

Reviewed-on: #1
Co-authored-by: Anders Olsson <anders.e.olsson@gmail.com>
Co-committed-by: Anders Olsson <anders.e.olsson@gmail.com>

# TrueSkill-TT Engine Redesign — Design

Date: 2026-04-23
Status: Approved (pending implementation plan)

## Summary

Comprehensive redesign of the TrueSkill-TT engine targeting four orthogonal goals:

  1. Performance — substantially faster offline convergence and incremental online updates.
  2. Accuracy and richer match formats — support for score margins, free-for-all with partial orders, correlated skills.
  3. Better convergence — replace ad-hoc capped iteration with a pluggable Schedule trait covering all three nested loops.
  4. Better API surface — typed event description, observer-based progress reporting, generic time axis, structured errors, ergonomic builders.

The design is comprehensive (Approach 1 of three considered) but delivered in five tiers so each step is independently shippable and validated by benchmarks.

## Goals & non-goals

### Goals

- 10–30× speedup on the offline convergence path for representative workloads (1000+ players, 1000+ events, 30 iterations)
- Order-of-magnitude speedup on incremental "add a single event" workloads
- Pluggable factor graph allowing new factor types without engine changes
- Optional Rayon-backed parallelism on top of Send + Sync-correct internals
- Typed, ergonomic public API; replace nested Vec<Vec<Vec<_>>> shapes with Event<T, K> / Team<K> / Member<K>
- Generic time axis: Untimed, i64, or user-supplied
- Observer-based progress instead of verbose: bool + println!
- Structured Result<_, InferenceError> at API boundaries

### Non-goals

- WebAssembly support is not a goal; we may break it if a crate or feature requires it.
- No GPU offload.
- No no_std support.
- No persistent format / serde — possible future feature.
- No replacement of the Gaussian/EP approximation itself in this design (the underlying inference math stays the same; we change layout, dispatch, scheduling, and API around it).

## Workload assumptions

Baseline workload that drives perf decisions:

- ~1000+ players
- ~1000+ events total
- ~50–60 events per time slice (per day)
- Both online (incremental adds) and offline (full convergence) are common
- Offline convergence runs frequently

## Section 1 — Core types & traits

The foundation everything else builds on.

### Gaussian — natural-parameter storage

Switch storage from (mu, sigma) to natural parameters (pi, tau) where pi = sigma⁻², tau = mu · pi. Multiplication and division dominate the hot path; in nat-params they are direct adds/subs of the components, no sqrt. Reads of mu/sigma become accessor methods (tau / pi, 1.0 / pi.sqrt()). The trade is correct because reads are vanishingly rare compared to writes in EP.

```rust
pub struct Gaussian { pi: f64, tau: f64 }
pub const UNIFORM: Gaussian = Gaussian { pi: 0.0, tau: 0.0 }; // replaces N_INF
```
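
The PR benchmarks `Gaussian::mul` / `Gaussian::div`; in natural parameters those are component-wise adds and subs, with the sqrt deferred to the rare accessor reads. A sketch of that arithmetic (the bodies are our reading, not the landed code):

```rust
impl Gaussian {
    // Accessors pay the division/sqrt; EP's hot loop never calls them.
    pub fn mu(&self) -> f64 { self.tau / self.pi }
    pub fn sigma(&self) -> f64 { 1.0 / self.pi.sqrt() }

    // Product of two Gaussian densities: add natural parameters.
    pub fn mul(&self, other: &Gaussian) -> Gaussian {
        Gaussian { pi: self.pi + other.pi, tau: self.tau + other.tau }
    }

    // Division (used to form EP cavities): subtract natural parameters.
    pub fn div(&self, other: &Gaussian) -> Gaussian {
        Gaussian { pi: self.pi - other.pi, tau: self.tau - other.tau }
    }
}
```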

### Time trait

Replaces the bare i64 time field. Keeps History parametric.

```rust
pub trait Time: Copy + Ord + Send + Sync + 'static {
    fn elapsed_to(&self, later: &Self) -> i64;
}
pub struct Untimed; // ZST for the no-time-axis case
impl Time for Untimed { fn elapsed_to(&self, _: &Self) -> i64 { 0 } }
impl Time for i64 { fn elapsed_to(&self, later: &Self) -> i64 { later - self } }
// Optional impls behind feature flags: time::OffsetDateTime, chrono types
```

### Drift<T> trait

Generic over T: Time so seasonal/calendar-aware drift is possible without going through i64.

```rust
pub trait Drift<T: Time>: Copy + Send + Sync {
    fn variance_delta(&self, from: &T, to: &T) -> f64;
}
```

ConstantDrift(f64) impl: from.elapsed_to(to) as f64 * gamma * gamma.
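
Under the trait definitions above, the impl is a one-liner (a sketch; only the name `ConstantDrift` is given by the design):

```rust
#[derive(Copy, Clone)]
pub struct ConstantDrift(pub f64); // the inner value is gamma

impl<T: Time> Drift<T> for ConstantDrift {
    fn variance_delta(&self, from: &T, to: &T) -> f64 {
        // Ticks elapsed between the two slice timestamps, scaled by gamma².
        from.elapsed_to(to) as f64 * self.0 * self.0
    }
}
```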

### Index and KeyTable<K>

Index(usize) is the handle into dense per-History Vec storage. Public, but intended for use by power users on hot paths who want to skip the KeyTable lookup. Casual API takes &K. KeyTable<K> (renamed from IndexMap, to avoid colliding with the indexmap crate's type) maps user keys → Index.
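
A plausible shape for the pair — a HashMap-backed table with a dense reverse map (field layout is illustrative, not the landed code):

```rust
use std::collections::HashMap;
use std::hash::Hash;

#[derive(Copy, Clone, PartialEq, Eq, Debug)]
pub struct Index(pub usize);

pub struct KeyTable<K> {
    to_index: HashMap<K, Index>,
    keys: Vec<K>, // reverse map: keys[idx.0] is the user key
}

impl<K: Eq + Hash + Clone> KeyTable<K> {
    /// Returns the existing handle or mints the next dense Index.
    pub fn intern(&mut self, key: &K) -> Index {
        if let Some(&idx) = self.to_index.get(key) {
            return idx;
        }
        let idx = Index(self.keys.len());
        self.keys.push(key.clone());
        self.to_index.insert(key.clone(), idx);
        idx
    }

    pub fn lookup(&self, key: &K) -> Option<Index> {
        self.to_index.get(key).copied()
    }
}
```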

### Observer trait

Replaces verbose: bool + println!. Default no-op impls; user overrides what they need.

```rust
pub trait Observer<T: Time>: Send + Sync {
    fn on_iteration_end(&self, _iter: usize, _max_step: (f64, f64)) {}
    fn on_batch_processed(&self, _time: &T, _idx: usize, _n_events: usize) {}
    fn on_converged(&self, _iters: usize, _final_step: (f64, f64)) {}
}
pub struct NullObserver;
impl<T: Time> Observer<T> for NullObserver {}
```

### Trade-offs

- Gaussian natural-param representation: anyone reading mu/sigma in a hot loop pays a sqrt — but that trade is correct; hot reads are rare.
- Time as a trait (not an enum) keeps it open-ended at zero runtime cost; the default History<i64, _> keeps call sites familiar.
- Observer is a trait (not a closure) so different sites can have different signatures without losing type safety. NullObserver is a ZST.

## Section 2 — Factor graph architecture

The current Game::likelihoods is a hand-rolled, hard-coded graph. To unlock richer formats and let us experiment with EP schedules, the graph itself becomes a data structure.

### Variable / Factor model

Variables hold their current Gaussian marginal. Factors hold their outgoing messages to each connected variable and perform the local computation. Standard EP: a factor's update is "divide marginal by old outgoing → cavity → apply local approximation → multiply marginal by new outgoing."

```rust
pub trait Factor: Send + Sync {
    fn variables(&self) -> &[VarId];
    fn propagate(&mut self, vars: &mut VarStore) -> (f64, f64); // returns max delta
    fn log_evidence(&self, _vars: &VarStore) -> f64 { 0.0 }
}
```
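
What a `propagate` body does for one (factor, variable) edge, spelled out in terms of the recipe above (mul/div as in the Gaussian sketch in Section 1; everything else here is illustrative):

```rust
// One EP update for a single edge. `marginal` is the variable's current
// belief, `old_msg` the factor's stored outgoing message, and `approximate`
// the factor-local computation (e.g., truncation moments for TruncFactor).
fn update_edge(
    marginal: &mut Gaussian,
    old_msg: &mut Gaussian,
    approximate: impl Fn(&Gaussian) -> Gaussian,
) -> (f64, f64) {
    let cavity = marginal.div(old_msg);  // divide marginal by old outgoing
    let new_msg = approximate(&cavity);  // apply local approximation
    let updated = cavity.mul(&new_msg);  // multiply marginal by new outgoing
    let delta = (
        (updated.pi - marginal.pi).abs(),
        (updated.tau - marginal.tau).abs(),
    );
    *marginal = updated;
    *old_msg = new_msg;
    delta // feeds the schedule's (dpi, dtau) stopping check
}
```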

### Built-in factor catalog

| Factor | Purpose | Status |
|---|---|---|
| PerformanceFactor | skill → performance (add β² noise, optional weight) | replaces inline performance() * weight |
| TeamSumFactor | weighted sum of player perfs → team perf | replaces inline fold |
| RankDiffFactor | (team_a perf) − (team_b perf) → diff var | currently team[e].posterior_win() − team[e+1].posterior_lose() |
| TruncFactor | EP truncation: P(diff > margin) or P(\|diff\| < margin) | |
| MarginFactor (future) | use observed score margin as soft evidence | enables richer match formats |
| SynergyFactor (future) | couples teammates' skills | enables different topology |
| ScoreFactor (future) | continuous outcome (e.g., points scored) | enables score-based outcomes |

The first four together exactly reproduce today's algorithm. The last three are extension slots.

### Game = factor graph + schedule

```rust
pub struct Game<S: Schedule = DefaultSchedule> {
    vars: VarStore,            // SoA: Vec<Gaussian> marginals
    factors: FactorList,       // enum dispatch over BuiltinFactor (see Open Questions)
    schedule: S,
}
```

Lean toward enum dispatch (enum BuiltinFactor { Perf(...), Sum(...), RankDiff(...), Trunc(...), ... }) over Box<dyn Factor> for the built-ins:

- avoids per-message vtable overhead in the hottest loop
- keeps factor data inline (no heap indirection)
- still allows user-defined factors via a BuiltinFactor::Custom(Box<dyn Factor>) variant

### Schedule trait

Controls iteration order and stopping. Default = current behavior (sweep forward, then backward, until ε or max iters). Pluggable so we can later try damped EP or junction-tree schedules.

### High-level constructors

```rust
Game::ranked(teams, results, options)    // dominant case
Game::free_for_all(players, ranking)     // FFA with possible ties
Game::custom(builder)                    // power users build their own graph
```

GameOptions carries iteration cap, epsilon, p_draw, and approximation choice. Today these are scattered between method args and module constants.
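
One possible layout for that bundle (field names are assumptions, and the approximation choice is elided here since the design doesn't pin down its type):

```rust
pub struct GameOptions {
    pub max_iter: usize, // within-game iteration cap
    pub epsilon: f64,    // within-game convergence tolerance
    pub p_draw: f64,     // prior draw probability
}
```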

### Trade-offs

- Enum dispatch over trait objects for built-ins; richer factors drop in via new enum variants.
- Variables and factor messages stored as Vec<Gaussian> indexed by VarId / edge slot — flat, cache-friendly.
- Schedule is a generic parameter (zero-cost); most users get the default; experimentation is open.

### Open question

Whether enum BuiltinFactor will feel too closed-world. The Custom(Box<dyn Factor>) escape hatch helps but inner-loop perf for user factors will be slower. Acceptable for now; flagged for future revisit if it becomes a problem.

## Section 3 — Storage layout (SoA + arenas)

### Dense Vec keyed by Index

Every HashMap<Index, T> becomes a Vec<T> (or Vec<Option<T>> for sparse) indexed directly by Index.0. The public-facing KeyTable<K> continues to map arbitrary keys → Index.
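
The PR names the concrete stores AgentStore<D> and SkillStore; a generic sketch of the dense-map pattern they share (illustrative, not the landed code):

```rust
// Dense map: slot i holds the value for Index(i); None marks an absent agent.
pub struct DenseStore<T> {
    slots: Vec<Option<T>>,
}

impl<T> DenseStore<T> {
    pub fn get(&self, idx: Index) -> Option<&T> {
        self.slots.get(idx.0).and_then(|slot| slot.as_ref())
    }

    pub fn insert(&mut self, idx: Index, value: T) {
        if self.slots.len() <= idx.0 {
            self.slots.resize_with(idx.0 + 1, || None);
        }
        self.slots[idx.0] = Some(value);
    }
}
```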

### SoA at hot layers, AoS at boundaries

The Skill struct stays as a public type for the API (returned from learning_curves, etc.), but inside TimeSlice we lay it out column-wise:

```rust
struct TimeSliceSkills {
    forward:    Vec<Gaussian>,   // [n_agents]
    backward:   Vec<Gaussian>,
    likelihood: Vec<Gaussian>,
    online:     Vec<Gaussian>,
    elapsed:    Vec<i64>,
    present:    Vec<bool>,
}
```

Within a slice, the inner loops touch one column repeatedly across many events — keeping the column contiguous improves cache utilization and makes the eventual SIMD step (Section 6) straightforward.

Gaussian itself stays as a single 16-byte struct in the Vec<Gaussian>. Splitting into two parallel Vec<f64>s wins for pure SIMD over thousands of Gaussians but loses for the random-access patterns dominant in EP. Revisit if benchmarks demand it.

### Arena allocator inside Game

Replace per-event allocations with a ScratchArena reused across calls.

```rust
pub struct ScratchArena {
    var_buf:     Vec<Gaussian>,
    factor_buf:  Vec<Gaussian>,    // edge messages
    bool_buf:    Vec<bool>,
    f64_buf:     Vec<f64>,
}
impl ScratchArena {
    fn reset(&mut self);                    // sets len=0, keeps capacity
    fn alloc_vars(&mut self, n: usize) -> &mut [Gaussian];
}
```

TimeSlice owns one ScratchArena; each event borrows it for the duration of its Game construction and inference. For the parallel-slice story (Section 6), each Rayon task gets its own arena.

### Per-event storage layout

Inside a TimeSlice, each event is stored column-wise as well, with Item inlined into team-level parallel arrays:

```rust
struct EventStorage {
    teams:   SmallVec<[TeamStorage; 4]>,
    outcome: Outcome,
    weights: SmallVec<[SmallVec<[f64; 4]>; 4]>,
    evidence: f64,
}
struct TeamStorage {
    competitors:   SmallVec<[Index; 4]>,    // who's on the team
    edge_messages: SmallVec<[Gaussian; 4]>, // outgoing message per slot
    output:        f64,
}
```

Iteration over (competitor, edge_message) pairs zips two slices — no per-element struct.
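
Concretely, the inner loop has this shape (illustrative):

```rust
// (competitor, edge_message) pairs come from zipping the two parallel columns.
for (idx, msg) in team.competitors.iter().zip(team.edge_messages.iter_mut()) {
    // idx: &Index into the slice's skill columns; msg: &mut Gaussian to rewrite.
}
```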

### SmallVec for typical shapes

Teams are ≤ ~5 players and games ≤ ~8 teams; SmallVec<[T; 8]> for team membership and SmallVec<[T; 4]> for team rosters keep the common case allocation-free.

### Trade-offs

- Dense Vec<T> keyed by Index is faster but means agent removal needs tombstones (or just leaves slots present-but-inactive). Acceptable: TrueSkill histories rarely remove players.
- SoA at TimeSlice level only, not at History level. History keeps Vec<TimeSlice> because slices are heterogeneous in size.
- One ScratchArena per TimeSlice keeps the lifetime story simple.

### Open question

The TimeSliceSkills sketch above uses (b) dense + present mask: one slot per agent in the history, indexed directly by Index, with a present: Vec<bool> mask for batches the agent didn't participate in. The alternative is (a) sparse columnar: a Vec<Index> of present agents and parallel Vec<Gaussian> columns of length n_present, with a separate lookup (binary search or auxiliary table) to find a given Index's slot.

(b) gives O(1) lookup and SIMD-friendly columns but wastes memory for sparsely populated slices. (a) is leaner per-slice but pays per-lookup cost in the inner loop. Bench both during T0 and pick. Default proposal: (b), since modern systems are memory-rich and the parallelism story is cleaner.

## Section 4 — API surface

### Typed event description

```rust
pub struct Event<T: Time, K> {
    pub time: T,
    pub teams: SmallVec<[Team<K>; 4]>,
    pub outcome: Outcome,
}

pub struct Team<K> {
    pub members: SmallVec<[Member<K>; 4]>,
}

pub struct Member<K> {
    pub key: K,
    pub weight: f64,                 // default 1.0
    pub prior: Option<Rating>,       // per-event override
}

pub enum Outcome {
    Ranked(SmallVec<[u32; 4]>),  // rank per team; equal ranks = tie
    Scored(SmallVec<[f64; 4]>),  // continuous score per team (engages MarginFactor)
}
```

Outcome::winner(0), Outcome::draw(), Outcome::ranking([0,1,2]) are convenience constructors.
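
Sketches of those constructors over the Ranked variant — the winner/draw bodies assume the common two-team case; the design names only the signatures:

```rust
use smallvec::smallvec;

impl Outcome {
    /// Team `i` takes rank 0; assumes a two-team event.
    pub fn winner(i: u32) -> Outcome {
        Outcome::Ranked(if i == 0 { smallvec![0, 1] } else { smallvec![1, 0] })
    }

    /// Equal ranks denote a tie; assumes a two-team event.
    pub fn draw() -> Outcome {
        Outcome::Ranked(smallvec![0, 0])
    }

    /// Explicit rank per team; lower rank places higher.
    pub fn ranking(ranks: impl IntoIterator<Item = u32>) -> Outcome {
        Outcome::Ranked(ranks.into_iter().collect())
    }
}
```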

### Builders

```rust
let mut history = History::<i64, _>::builder()
    .mu(25.0).sigma(25.0/3.0).beta(25.0/6.0)
    .drift(ConstantDrift(0.03))
    .p_draw(0.10)
    .convergence(ConvergenceOptions { max_iter: 30, epsilon: 1e-6 })
    .observer(LogObserver::default())
    .build();
```

For the no-time case, type inference picks Untimed:

```rust
let mut history = History::<Untimed, _>::builder().build();
```

### Three-tier event ingestion

```rust
// 1. Bulk ingestion (high-throughput path)
history.add_events(events_iter)?;

// 2. One-off match (very common in practice)
history.record_winner("alice", "bob", time)?;
history.record_draw("alice", "bob", time)?;

// 3. Builder for irregular shapes
history.event(time)
    .team(["alice", "bob"]).weights([1.0, 0.7])
    .team(["carol"])
    .ranking([1, 0])
    .commit()?;
```

### Convergence & queries

```rust
let report: ConvergenceReport = history.converge()?;

let curve: Vec<(i64, Gaussian)> = history.learning_curve(&"alice");
let all = history.learning_curves();           // HashMap<&K, Vec<(T, Gaussian)>>
let now = history.current_skill(&"alice");     // Option<Gaussian>

let ev = history.log_evidence();
let ev_for = history.log_evidence_for(&["alice", "bob"]);

let q = history.predict_quality(&[&["alice"], &["bob"]]);
let p_win = history.predict_outcome(&[&["alice"], &["bob"]]);
```

### Standalone Game

```rust
let g = Game::ranked(&[&[alice], &[bob]], Outcome::winner(0), &options);
let post = g.posteriors();

// Convenience
let (a, b) = Game::one_v_one(&alice, &bob, Outcome::winner(0));
```

### Errors

Replace debug_assert!/panic! at the API boundary with Result.

```rust
pub enum InferenceError {
    MismatchedShape { kind: &'static str, expected: usize, got: usize },
    InvalidProbability { value: f64 },
    ConvergenceFailed { last_step: (f64, f64), iterations: usize },
    NegativePrecision { pi: f64 },
}
```

Hot inner loops still use debug_assert! for invariants the API has already enforced.

### Trade-offs

- Generic over user's K; engine works in Index. Public outputs use &K.
- SmallVec everywhere on the event-description path.
- Three-tier API so casual users don't drown in types and bulk users still get throughput.
- Outcome enum replaces the "lower number wins" &[f64] convention.

### Open question

Whether to expose Index directly to users via an intern_key(&K) -> Index method, letting hot-path callers skip the KeyTable lookup on every call. Recommendation: yes — public Index handle plus history.lookup<Q: Borrow<K>>(&Q) -> Option<Index>. The casual API still takes &K everywhere; power users can promote to Index when profiling demands.

## Section 4½ — Naming pass

| Current | New | Rationale |
|---|---|---|
| History | History (kept) | Matches upstream; reads cleanly. |
| Batch | TimeSlice | Says what it is: every event sharing one timestamp. |
| Player | Rating | The struct holds prior/beta/drift — that's a rating configuration. Resolves the Player/Agent confusion. |
| Agent | Competitor | Holds dynamic state for someone competing in the history; fits the domain. |
| Skill | Skill (kept) | Per-time-slice skill estimate; clearer than BatchSkill. |
| Item | inlined into TeamStorage columns (engine) / Member<K> (public) | Eliminates the per-element struct in the hot path; gives API users a clear "team member" name. |
| Game | Game (kept) | Match collides with Rust's match. |
| Index | Index (kept) | Internal handle. |
| IndexMap | KeyTable | Avoids confusion with the indexmap crate. |

## Section 5 — Convergence & message scheduling

### Three nested loops, one mechanism

The system has three nested convergence loops:

  1. Within-game: EP sweeps over the factor graph
  2. Within-time-slice: re-running games as inputs change
  3. Cross-history: forward-pass then backward-pass over all slices

All three implement Workload; one Schedule impl drives all of them.

```rust
pub trait Schedule {
    fn run<W: Workload>(&self, workload: &mut W) -> ScheduleReport;
}

pub trait Workload {
    fn step(&mut self) -> (f64, f64);
    fn snapshot_evidence(&self) -> f64 { 0.0 }
}

pub struct ScheduleReport {
    pub iterations: usize,
    pub final_step: (f64, f64),
    pub converged: bool,
}
```
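
A sketch of the default schedule in terms of these traits, using the single shared epsilon that EpsilonOrMax::default() exposes (the body is our reading, not the landed code):

```rust
pub struct EpsilonOrMax {
    pub eps: f64,   // shared tolerance for both (dpi, dtau); the advanced ctor splits them
    pub max: usize, // iteration cap
}

impl Schedule for EpsilonOrMax {
    fn run<W: Workload>(&self, workload: &mut W) -> ScheduleReport {
        let mut last = (f64::INFINITY, f64::INFINITY);
        for i in 1..=self.max {
            last = workload.step(); // one full sweep; returns max (dpi, dtau)
            if last.0 <= self.eps && last.1 <= self.eps {
                return ScheduleReport { iterations: i, final_step: last, converged: true };
            }
        }
        ScheduleReport { iterations: self.max, final_step: last, converged: false }
    }
}
```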

### Built-in schedules

| Schedule | Behavior | Use |
|---|---|---|
| EpsilonOrMax { eps, max } | Default. Sweep until (dpi, dtau) ≤ eps or max iters. | All three loops. Replicates current behavior. |
| Damped { eps, max, alpha } | Same, but writes α·new + (1−α)·old. | Stuck oscillations. |
| Residual { eps, max } | Priority-queue: re-update the factor with the largest pending delta first. | Faster convergence on uneven graphs. |
| OneShot | Exactly one pass, no convergence check. | Online incremental adds. |

### Stopping in natural-param space

Switch from (|Δmu|, |Δsigma|) ≤ epsilon to (|Δpi|, |Δtau|) ≤ (eps_pi, eps_tau):

- mu and sigma are on different scales; one tolerance is wrong for both
- We store in nat-params anyway — checking convergence in mu/sigma would pay needless sqrts
- The nat-param delta is the natural geometry of the EP fixed point

Default EpsilonOrMax::default() exposes a single epsilon for simplicity; advanced ctor exposes both tolerances.

### Within-game improvements

- Replace the hard cap of 10 iterations with GameOptions::schedule, which propagates a ScheduleReport upward
- Fast path: graphs with no diff chain (1v1, where 1 iteration suffices) skip the loop entirely
- FFA / many-team ranks benefit from Residual; opt-in

### Within-slice and cross-history improvements

- No more old/new HashMap snapshotting: track deltas inline as we write under SoA
- Per-slice dirty bits: a TimeSlice whose neighbor messages haven't changed since its last full sweep doesn't need to re-run. Track time_slice.dirty and skip clean ones during the cross-history sweep (see the sketch below). Big win for online add (the locality case).
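
How the skip might sit in the forward sweep (slice fields and methods here are assumptions, not the landed code):

```rust
// Sketch: forward pass with dirty-bit skipping. A slice re-runs only if its
// own events changed or the forward message flowing into it changed.
fn forward_pass(slices: &mut [TimeSlice], report: &mut ConvergenceReport) {
    for i in 0..slices.len() {
        if !slices[i].dirty {
            report.batches_skipped += 1;
            continue;
        }
        let changed = slices[i].sweep(); // within-slice EP to its fixed point
        slices[i].dirty = false;
        if changed && i + 1 < slices.len() {
            slices[i + 1].dirty = true; // our forward message moved; wake the neighbor
        }
    }
}
```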

### ConvergenceReport

```rust
pub struct ConvergenceReport {
    pub iterations: usize,
    pub final_step: (f64, f64),
    pub log_evidence: f64,
    pub converged: bool,
    pub per_iteration_time: SmallVec<[Duration; 32]>,
    pub batches_skipped: usize,
}
```

Observer continues to receive per-iteration callbacks for live UI; ConvergenceReport is the post-hoc summary.

### Trade-offs

- One Schedule trait shared across loops — fewer concepts, more composable.
- Convergence checks in nat-param space — slightly different exact threshold than today; tests' epsilons re-tuned mechanically.
- Dirty-bit skipping changes iteration order vs. today; the fixed point is the same, but iteration counts may shift downward.
- Residual and Damped are opt-in; default behavior matches today closely.

### Open question

Whether Schedule::run should take an optional Observer reference. Recommendation: observation lives at a higher layer (History::converge calls observer hooks; Schedule is purely the loop driver).

## Section 6 — Concurrency & parallelism

### What's parallelizable

| Operation | Parallelism | Strategy |
|---|---|---|
| History::converge() (full forward+backward) | Sequential across slices | Within each slice: color-group events in parallel via Rayon |
| History::add_events(...) | Sequential append, but ingestion of typed events into EventStorage parallelizes trivially | n/a |
| History::learning_curves() | Per-key parallel | into_par_iter() |
| History::log_evidence_for(targets) | Per-batch parallel, reduce sum | par_iter().map(...).sum() |
| Game inference | Sequential | n/a (too small to amortize Rayon overhead) |

### Within-slice color-group parallelism

When events are added to a slice, partition them into color groups where events in the same color touch no shared Index. Within a color, run events in parallel via Rayon. Across colors, run sequentially. Preserves asynchronous-EP semantics exactly.
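
A greedy first-fit partition at ingestion time (assumed helper; the landed partitioning may differ):

```rust
use std::collections::HashSet;

// Assign each event (described by the set of Index values it touches) to the
// first color group sharing none of its indices. Groups run in parallel
// internally and sequentially with respect to each other.
fn color_groups(events: &[Vec<Index>]) -> Vec<Vec<usize>> {
    let mut groups: Vec<(HashSet<usize>, Vec<usize>)> = Vec::new();
    for (event_id, touched) in events.iter().enumerate() {
        let slot = groups
            .iter()
            .position(|(seen, _)| touched.iter().all(|idx| !seen.contains(&idx.0)));
        let slot = slot.unwrap_or_else(|| {
            groups.push((HashSet::new(), Vec::new()));
            groups.len() - 1
        });
        let (seen, members) = &mut groups[slot];
        seen.extend(touched.iter().map(|idx| idx.0));
        members.push(event_id);
    }
    groups.into_iter().map(|(_, members)| members).collect()
}
```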

Alternative: synchronous EP with snapshot. All events read from a frozen skill snapshot, write deltas to thread-local buffers, barrier merges. Trivially parallel but weaker per-iteration convergence — needs damping. Available as a Schedule impl, opt-in.

### Send + Sync requirements

All public traits (Time, Drift, Observer, Factor, Schedule) require Send + Sync. Observer impls must be thread-safe (called from arbitrary worker threads).

### Rayon as default-on feature

rayon as default-on feature; with default-features = false, parallel paths fall back to sequential iterators behind cfg(feature = "rayon").
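
The usual shape of that fallback (illustrative; the function name is not the crate's):

```rust
// Cargo.toml (sketch):
// [features]
// default = ["rayon"]
// rayon = ["dep:rayon"]

#[cfg(feature = "rayon")]
fn for_each_event<F>(events: &mut [EventStorage], f: F)
where
    F: Fn(&mut EventStorage) + Send + Sync,
{
    use rayon::prelude::*;
    events.par_iter_mut().for_each(|e| f(e));
}

#[cfg(not(feature = "rayon"))]
fn for_each_event<F>(events: &mut [EventStorage], f: F)
where
    F: Fn(&mut EventStorage) + Send + Sync,
{
    events.iter_mut().for_each(|e| f(e));
}
```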

### Expected speedup ballpark

For 1000 players, 60 events/slice × 1000 slices, 30 convergence iterations:

| Source | Estimated speedup vs. today |
|---|---|
| HashMap → dense Vec | 2–4× |
| Natural-param Gaussian, no-sqrt mul/div | 1.5–2× |
| Pre-allocated ScratchArena | 1.2–1.5× |
| Color-group parallel events in slice (8 cores) | 2–4× |
| Dirty-bit slice skipping (online add case) | 5–50× |
| Combined (offline converge) | ~10–30× |
| Combined (online add) | ~50–500×, depending on locality |

These are pre-implementation estimates. Each tier validates with criterion.

### Trade-offs

- Color-group parallelism requires up-front graph coloring at ingestion. Cost: linear in events, run once per add_events. Cheap.
- Default = asynchronous EP (preserves current semantics). Synchronous opt-in only.
- Cross-slice sweep stays sequential; no speculative parallel sweeps.
- Rayon default-on but feature-gated.

### Open question

Whether to expose color-group partitioning to users. Recommendation: hidden by default, escape hatch via add_events_with_partition(...) for power users who already know their event independence.

## Section 7 — Migration, testing, and delivery plan

The crate is unreleased, so version-bump ceremony doesn't apply. Tiers are sequencing of work and milestones, not releases.

### Tier sequence

#### T0 — Numerical parity (no API change)

Internal-only. Public surface unchanged.

- Switch Gaussian storage to natural parameters (pi, tau). mu()/sigma() become accessors.
- Replace HashMap<Index, _> with dense Vec<_> keyed by Index.0 everywhere.
- Introduce ScratchArena inside Batch so Game::new stops allocating per-event.
- Drop the panic! in mu_sigma; return Result propagated upward.

Acceptance: existing test suite passes (bit-equal where possible, ULP-bounded where natural-param arithmetic shifts a rounding); cargo bench shows ≥3× win on batch benchmark; no API breakage.

#### T1 — Factor graph machinery (internal-only)

- Introduce Factor, VarStore, Schedule as pub(crate) types.
- Re-implement Game::likelihoods() on top of BuiltinFactor::{Perf, TeamSum, RankDiff, Trunc} driven by EpsilonOrMax.
- Replace within-game iteration tracking with ScheduleReport.

Acceptance: existing test suite passes (ULP-bounded); within-game iteration counts unchanged; benchmarks ≥ T0.

#### T2 — New API surface (breaking)

All renames and the new public API land together. No half-renamed intermediate state.

- New types: Rating, TimeSlice, Competitor, Member<K>, Outcome, Event<T, K>, KeyTable<K>.
- Time trait introduced; History<T: Time, D: Drift<T>> is generic.
- Three-tier API surface: record_winner, event(...).team(...).commit(), bulk add_events(iter).
- Observer trait + ConvergenceReport; verbose: bool deleted.
- panic!/debug_assert! at the API boundary become Result<_, InferenceError>.
- Promote Factor/Schedule/VarStore to pub under a factors module.

Acceptance: full test suite rewritten in new API; equivalence tests prove identical posteriors vs. old API on the same inputs.

#### T3 — Concurrency

- Send + Sync audit and bounds on all public traits.
- Color-group partitioning at TimeSlice ingestion.
- rayon as default-on feature with #[cfg(feature = "rayon")] fallback.
- Parallel paths: within-slice color groups, learning_curves, log_evidence_for.

Acceptance: deterministic posteriors across RAYON_NUM_THREADS={1,2,4,8}; benchmarks show >2× on 8-core for offline converge.

#### T4 — Richer factor types & schedules

Each shipped independently after T3.

- MarginFactor → enables Outcome::Scored.
- Damped and Residual schedules.
- SynergyFactor, ScoreFactor → same pattern when wanted.

Each comes with its own benchmark and a worked example in examples/.

### Testing strategy

| Layer | Approach |
|---|---|
| Numerical correctness | Keep existing hardcoded golden values from test_1vs1, test_1vs1_draw, test_2vs1vs2_mixed, etc. through T0–T1 unchanged. They are a regression net against the original Python port. |
| API parity | T2 adds an equivalence test module that runs identical inputs through old vs. new construction and compares posteriors within ULPs. |
| Property tests | Add proptest for: factor graph fixed-point invariance under message order, Outcome round-trip, Gaussian mul/div associativity in nat-params, schedule convergence regardless of starting state (see the sketch below). |
| Determinism | T3 adds tests that run identical input across multiple Rayon thread counts and assert identical posteriors. |
| Benchmark gates | Each tier has a "must not regress" gate vs. the previous tier on the existing batch and gaussian criterion suites. T0 must beat baseline by ≥3×; T1 ≥ T0; etc. |
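
For instance, the nat-param associativity property might look like this (strategy ranges and tolerance are assumptions):

```rust
use proptest::prelude::*;

proptest! {
    // mul adds (pi, tau) component-wise, so it should be associative up to
    // floating-point error.
    #[test]
    fn gaussian_mul_associative(
        a in (0.1f64..100.0, -100.0f64..100.0),
        b in (0.1f64..100.0, -100.0f64..100.0),
        c in (0.1f64..100.0, -100.0f64..100.0),
    ) {
        let g = |(pi, tau): (f64, f64)| Gaussian { pi, tau };
        let left = g(a).mul(&g(b)).mul(&g(c));
        let right = g(a).mul(&g(b).mul(&g(c)));
        prop_assert!((left.pi - right.pi).abs() < 1e-9);
        prop_assert!((left.tau - right.tau).abs() < 1e-9);
    }
}
```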

### Risk management

- T0 risk: rounding drift in tests. Mitigation: where natural-param arithmetic legitimately changes the last ULPs, update goldens and simultaneously add a parity test against a snapshot taken from baseline to prove the difference is bounded.
- T2 risk: API design mistakes. Mitigation: review the spec and a worked example before implementing; iterate on feedback.
- T3 risk: subtle race conditions in color-group partitioning. Mitigation: loom tests for the merge step; deterministic-output assertion across thread counts.
- Cross-tier risk: scope creep. Each tier has a closed checklist; new ideas go to the next tier's wishlist.

### What we're explicitly not doing

- No GPU offload.
- No no_std support.
- No serde / persistence in this design.
- No incremental online API beyond record_winner / add_events.

### Open questions summary

Collected here for the review pass:

  1. enum BuiltinFactor extensibility — may feel too closed-world; revisit if user-defined factors via Custom(Box<dyn Factor>) become common.
  2. Sparse vs. dense per-slice skill storage — default to dense + present mask; sparse columnar is the alternative. Decided by T0 benchmarks.
  3. Index exposure for hot paths — expose intern_key/lookup so power users can promote &K to Index and skip the KeyTable lookup; casual API still takes &K everywhere.
  4. Schedule::run and observer wiring — observation stays at higher layer (History::converge calls observer hooks; Schedule is purely the loop driver).
  5. Color-group partition exposure — hidden by default, escape hatch via add_events_with_partition(...).