# T0 + T1 + T2: engine redesign through new API surface (#1)
Implements tiers T0, T1, T2 of `docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md`. All three tiers have landed together on this branch because they build on one another; this PR rolls them up for a single review pass.

Per-tier plans:
- T0: `docs/superpowers/plans/2026-04-23-t0-numerical-parity.md`
- T1: `docs/superpowers/plans/2026-04-24-t1-factor-graph.md`
- T2: `docs/superpowers/plans/2026-04-24-t2-new-api-surface.md`

## Summary

### T0 — Numerical parity (internal)

- `Gaussian` switched to natural-parameter storage `(pi, tau)`; mul/div now ~7× faster (218 ps vs 1.57 ns).
- `HashMap<Index, _>` → dense `Vec<_>` keyed by `Index.0` (via `AgentStore<D>`, `SkillStore`).
- `ScratchArena` eliminates per-event allocations in `Game::likelihoods`.
- `InferenceError` seed type added (1 variant).
- 38 → 53 tests passing through T1.
- Benchmark: `Batch::iteration` 29.84 → 21.25 µs.

### T1 — Factor graph machinery (internal)

- `Factor` trait + `BuiltinFactor` enum (TeamSum / RankDiff / Trunc) driving within-game inference.
- `VarStore` flat storage for variable marginals.
- `Schedule` trait + `EpsilonOrMax` impl replacing the hand-rolled EP loop.
- `Game::likelihoods` rebuilt on the factor-graph machinery; iteration counts and goldens preserved to within 1e-6.
- 53 tests passing.
- Benchmark: `Batch::iteration` 23.01 µs (slight regression absorbed in T2).

### T2 — New API surface (breaking)

**Renames:**
- `IndexMap → KeyTable`, `Player → Rating`, `Agent → Competitor`, `Batch → TimeSlice`

**New types:**
- `Time` trait with `Untimed` ZST and `i64` impls; `Drift<T>`, `Rating<T, D>`, `Competitor<T, D>`, `TimeSlice<T>`, `History<T, D, O, K>` all generic.
- `Event<T, K>`, `Team<K>`, `Member<K>`, `Outcome` (`Ranked` variant; `#[non_exhaustive]`).
- `Observer<T>` trait + `NullObserver`.
- `ConvergenceOptions`, `ConvergenceReport`.
- `GameOptions`, `OwnedGame<T, D>`.

**Three-tier ingestion:**
- `history.record_winner(&K, &K, T)` / `record_draw(&K, &K, T)` — 1v1 convenience.
- `history.add_events(iter)` — typed bulk.
- `history.event(T).team([...]).weights([...]).ranking([...]).commit()` — fluent.

**Query API:** `current_skill`, `learning_curve`, `learning_curves` (keyed on `K`), `log_evidence`, `log_evidence_for`, `predict_quality`, `predict_outcome`.

**Game constructors:** `ranked`, `one_v_one`, `free_for_all`, `custom` — all returning `Result<_, InferenceError>`.

**`factors` module:** `Factor`, `Schedule`, `VarStore`, `VarId`, `BuiltinFactor`, `EpsilonOrMax`, `ScheduleReport`, `TeamSumFactor`, `RankDiffFactor`, `TruncFactor` now public.

**Errors:** `InferenceError` gains `MismatchedShape`, `InvalidProbability`, `ConvergenceFailed`; boundary panics converted to `Result`.

**Removed (breaking):** `History::convergence(iters, eps, verbose)`, `HistoryBuilder::gamma(f64)`, `HistoryBuilder::time(bool)`, `History.time: bool`, `learning_curves_by_index`, nested-Vec public `add_events`.

## Behavior change (documented in CHANGELOG)

`Time = Untimed` has `elapsed_to → 0`, so no drift accumulates between slices. The old `time=false` mode implicitly forced `elapsed=1` on reappearance via an `i64::MAX` sentinel — that quirk is not reproducible under a typed time axis. Tests that depended on it now use `History::<i64, _>` with explicit `1..=n` timestamps. One test (`test_env_ttt`) had 3 Gaussian goldens updated to reflect the corrected semantics; documented in commit `33a7d90`.

## Final numbers

| Metric | Before T0 | After T2 | Delta |
|---|---|---|---|
| `Batch::iteration` | 29.84 µs | 21.36 µs | **-28%** |
| `Gaussian::mul` | 1.57 ns | 219 ps | **-86%** |
| `Gaussian::div` | 1.57 ns | 219 ps | **-86%** |
| Tests passing | 38 | 90 | +52 |

All other Gaussian ops unchanged (~219 ps add/sub, ~264 ps pi/tau reads).

## Test plan

- [x] `cargo test --features approx` — 90/90 pass (68 lib + 10 api_shape + 6 game + 4 record_winner + 2 equivalence)
- [x] `cargo clippy --all-targets --features approx -- -D warnings` — clean
- [x] `cargo +nightly fmt --check` — clean
- [x] `cargo bench --bench batch` — 21.36 µs
- [x] `cargo bench --bench gaussian` — unchanged from T1
- [x] `cargo run --example atp --features approx` — rewritten in new API, runs clean
- [x] Historical Game-level goldens preserved in `tests/equivalence.rs`
- [x] Public API matches spec Section 4 (verified by integration tests in `tests/api_shape.rs`)

## Commit history

~45 commits total across T0 + T1 + T2. Each task is self-contained and individually tested; the branch is bisectable. See `git log main..t2-new-api-surface` for the full list.

## Deferred to later tiers

- `Outcome::Scored` + `MarginFactor` — T4
- `Damped` / `Residual` schedules — T4
- `Send + Sync` bounds + Rayon parallelism — T3
- N-team `predict_outcome` — T4
- `Game::custom` full ergonomics — T4

Reviewed-on: #1
Co-authored-by: Anders Olsson <anders.e.olsson@gmail.com>
Co-committed-by: Anders Olsson <anders.e.olsson@gmail.com>

# TrueSkill-TT Engine Redesign — Design

Date: 2026-04-23
Status: Approved (pending implementation plan)

## Summary

Comprehensive redesign of the TrueSkill-TT engine targeting four orthogonal goals:

  1. Performance — substantially faster offline convergence and incremental online updates.
  2. Accuracy and richer match formats — support for score margins, free-for-all with partial orders, correlated skills.
  3. Better convergence — replace ad-hoc capped iteration with a pluggable Schedule trait covering all three nested loops.
  4. Better API surface — typed event description, observer-based progress reporting, generic time axis, structured errors, ergonomic builders.

The design is comprehensive (Approach 1 of three considered) but delivered in five tiers so each step is independently shippable and validated by benchmarks.

## Goals & non-goals

### Goals

- 10–30× speedup on the offline convergence path for representative workloads (1000+ players, 1000+ events, 30 iterations)
- Order-of-magnitude speedup on incremental "add a single event" workloads
- Pluggable factor graph allowing new factor types without engine changes
- Optional Rayon-backed parallelism on top of Send + Sync-correct internals
- Typed, ergonomic public API; replace nested Vec<Vec<Vec<_>>> shapes with Event<T, K> / Team<K> / Member<K>
- Generic time axis: Untimed, i64, or user-supplied
- Observer-based progress instead of verbose: bool + println!
- Structured Result<_, InferenceError> at API boundaries

### Non-goals

- WebAssembly support is not a goal; we may break it if a crate or feature requires it.
- No GPU offload.
- No no_std support.
- No persistent format / serde — possible future feature.
- No replacement of the Gaussian/EP approximation itself in this design (the underlying inference math stays the same; we change layout, dispatch, scheduling, and API around it).

## Workload assumptions

Baseline workload that drives perf decisions:

- ~1000+ players
- ~1000+ events total
- ~50–60 events per time slice (per day)
- Both online (incremental adds) and offline (full convergence) are common
- Offline convergence runs frequently

## Section 1 — Core types & traits

The foundation everything else builds on.

### Gaussian — natural-parameter storage

Switch storage from (mu, sigma) to natural parameters (pi, tau) where pi = sigma⁻², tau = mu · pi. Multiplication and division dominate the hot path; in nat-params they are direct adds/subs of the components, no sqrt. Reads of mu/sigma become accessor methods (tau / pi, 1.0 / pi.sqrt()). The trade is correct because reads are vanishingly rare compared to writes in EP.

```rust
pub struct Gaussian { pi: f64, tau: f64 }
pub const UNIFORM: Gaussian = Gaussian { pi: 0.0, tau: 0.0 }; // replaces N_INF
```
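
The PR benchmarks `Gaussian::mul` / `Gaussian::div`; in natural parameters those are component-wise adds and subs, with the sqrt deferred to the rare accessor reads. A sketch of that arithmetic (the bodies are our reading, not the landed code):

```rust
impl Gaussian {
    // Accessors pay the division/sqrt; EP's hot loop never calls them.
    pub fn mu(&self) -> f64 { self.tau / self.pi }
    pub fn sigma(&self) -> f64 { 1.0 / self.pi.sqrt() }

    // Product of two Gaussian densities: add natural parameters.
    pub fn mul(&self, other: &Gaussian) -> Gaussian {
        Gaussian { pi: self.pi + other.pi, tau: self.tau + other.tau }
    }

    // Division (used to form EP cavities): subtract natural parameters.
    pub fn div(&self, other: &Gaussian) -> Gaussian {
        Gaussian { pi: self.pi - other.pi, tau: self.tau - other.tau }
    }
}
```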

### Time trait

Replaces the bare i64 time field. Keeps History parametric.

```rust
pub trait Time: Copy + Ord + Send + Sync + 'static {
    fn elapsed_to(&self, later: &Self) -> i64;
}
pub struct Untimed; // ZST for the no-time-axis case
impl Time for Untimed { fn elapsed_to(&self, _: &Self) -> i64 { 0 } }
impl Time for i64 { fn elapsed_to(&self, later: &Self) -> i64 { later - self } }
// Optional impls behind feature flags: time::OffsetDateTime, chrono types
```

### Drift<T> trait

Generic over T: Time so seasonal/calendar-aware drift is possible without going through i64.

```rust
pub trait Drift<T: Time>: Copy + Send + Sync {
    fn variance_delta(&self, from: &T, to: &T) -> f64;
}
```

ConstantDrift(f64) impl: from.elapsed_to(to) as f64 * gamma * gamma.
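
Under the trait definitions above, the impl is a one-liner (a sketch; only the name `ConstantDrift` is given by the design):

```rust
#[derive(Copy, Clone)]
pub struct ConstantDrift(pub f64); // the inner value is gamma

impl<T: Time> Drift<T> for ConstantDrift {
    fn variance_delta(&self, from: &T, to: &T) -> f64 {
        // Ticks elapsed between the two slice timestamps, scaled by gamma².
        from.elapsed_to(to) as f64 * self.0 * self.0
    }
}
```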

### Index and KeyTable<K>

Index(usize) is the handle into dense per-History Vec storage. Public, but intended for use by power users on hot paths who want to skip the KeyTable lookup. Casual API takes &K. KeyTable<K> (renamed from IndexMap, to avoid colliding with the indexmap crate's type) maps user keys → Index.
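
A plausible shape for the pair — a HashMap-backed table with a dense reverse map (field layout is illustrative, not the landed code):

```rust
use std::collections::HashMap;
use std::hash::Hash;

#[derive(Copy, Clone, PartialEq, Eq, Debug)]
pub struct Index(pub usize);

pub struct KeyTable<K> {
    to_index: HashMap<K, Index>,
    keys: Vec<K>, // reverse map: keys[idx.0] is the user key
}

impl<K: Eq + Hash + Clone> KeyTable<K> {
    /// Returns the existing handle or mints the next dense Index.
    pub fn intern(&mut self, key: &K) -> Index {
        if let Some(&idx) = self.to_index.get(key) {
            return idx;
        }
        let idx = Index(self.keys.len());
        self.keys.push(key.clone());
        self.to_index.insert(key.clone(), idx);
        idx
    }

    pub fn lookup(&self, key: &K) -> Option<Index> {
        self.to_index.get(key).copied()
    }
}
```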

### Observer trait

Replaces verbose: bool + println!. Default no-op impls; user overrides what they need.

```rust
pub trait Observer<T: Time>: Send + Sync {
    fn on_iteration_end(&self, _iter: usize, _max_step: (f64, f64)) {}
    fn on_batch_processed(&self, _time: &T, _idx: usize, _n_events: usize) {}
    fn on_converged(&self, _iters: usize, _final_step: (f64, f64)) {}
}
pub struct NullObserver;
impl<T: Time> Observer<T> for NullObserver {}
```

### Trade-offs

- Gaussian natural-param representation: anyone reading mu/sigma in a hot loop pays a sqrt — but that trade is correct; hot reads are rare.
- Time as a trait (not an enum) keeps it open-ended at zero runtime cost; the default History<i64, _> keeps call sites familiar.
- Observer is a trait (not a closure) so different sites can have different signatures without losing type safety. NullObserver is a ZST.

## Section 2 — Factor graph architecture

The current Game::likelihoods is a hand-rolled, hard-coded graph. To unlock richer formats and let us experiment with EP schedules, the graph itself becomes a data structure.

### Variable / Factor model

Variables hold their current Gaussian marginal. Factors hold their outgoing messages to each connected variable and perform the local computation. Standard EP: a factor's update is "divide marginal by old outgoing → cavity → apply local approximation → multiply marginal by new outgoing."

```rust
pub trait Factor: Send + Sync {
    fn variables(&self) -> &[VarId];
    fn propagate(&mut self, vars: &mut VarStore) -> (f64, f64); // returns max delta
    fn log_evidence(&self, _vars: &VarStore) -> f64 { 0.0 }
}
```
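
What a `propagate` body does for one (factor, variable) edge, spelled out in terms of the recipe above (mul/div as in the Gaussian sketch in Section 1; everything else here is illustrative):

```rust
// One EP update for a single edge. `marginal` is the variable's current
// belief, `old_msg` the factor's stored outgoing message, and `approximate`
// the factor-local computation (e.g., truncation moments for TruncFactor).
fn update_edge(
    marginal: &mut Gaussian,
    old_msg: &mut Gaussian,
    approximate: impl Fn(&Gaussian) -> Gaussian,
) -> (f64, f64) {
    let cavity = marginal.div(old_msg);  // divide marginal by old outgoing
    let new_msg = approximate(&cavity);  // apply local approximation
    let updated = cavity.mul(&new_msg);  // multiply marginal by new outgoing
    let delta = (
        (updated.pi - marginal.pi).abs(),
        (updated.tau - marginal.tau).abs(),
    );
    *marginal = updated;
    *old_msg = new_msg;
    delta // feeds the schedule's (dpi, dtau) stopping check
}
```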

### Built-in factor catalog

| Factor | Purpose | Status |
|---|---|---|
| PerformanceFactor | skill → performance (add β² noise, optional weight) | replaces inline performance() * weight |
| TeamSumFactor | weighted sum of player perfs → team perf | replaces inline fold |
| RankDiffFactor | (team_a perf) − (team_b perf) → diff var | currently team[e].posterior_win() − team[e+1].posterior_lose() |
| TruncFactor | EP truncation: P(diff > margin) or P(\|diff\| < margin) | |
| MarginFactor (future) | use observed score margin as soft evidence | enables richer match formats |
| SynergyFactor (future) | couples teammates' skills | enables different topology |
| ScoreFactor (future) | continuous outcome (e.g., points scored) | enables score-based outcomes |

The first four together exactly reproduce today's algorithm. The last three are extension slots.

### Game = factor graph + schedule

```rust
pub struct Game<S: Schedule = DefaultSchedule> {
    vars: VarStore,            // SoA: Vec<Gaussian> marginals
    factors: FactorList,       // enum dispatch over BuiltinFactor (see Open Questions)
    schedule: S,
}
```

Lean toward enum dispatch (enum BuiltinFactor { Perf(...), Sum(...), RankDiff(...), Trunc(...), ... }) over Box<dyn Factor> for the built-ins:

- avoids per-message vtable overhead in the hottest loop
- keeps factor data inline (no heap indirection)
- still allows user-defined factors via a BuiltinFactor::Custom(Box<dyn Factor>) variant

### Schedule trait

Controls iteration order and stopping. Default = current behavior (sweep forward, then backward, until ε or max iters). Pluggable so we can later try damped EP or junction-tree schedules.

### High-level constructors

```rust
Game::ranked(teams, results, options)    // dominant case
Game::free_for_all(players, ranking)     // FFA with possible ties
Game::custom(builder)                    // power users build their own graph
```

GameOptions carries iteration cap, epsilon, p_draw, and approximation choice. Today these are scattered between method args and module constants.
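
One possible layout for that bundle (field names are assumptions, and the approximation choice is elided here since the design doesn't pin down its type):

```rust
pub struct GameOptions {
    pub max_iter: usize, // within-game iteration cap
    pub epsilon: f64,    // within-game convergence tolerance
    pub p_draw: f64,     // prior draw probability
}
```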

### Trade-offs

- Enum dispatch over trait objects for built-ins; richer factors drop in via new enum variants.
- Variables and factor messages stored as Vec<Gaussian> indexed by VarId / edge slot — flat, cache-friendly.
- Schedule is a generic parameter (zero-cost); most users get the default; experimentation is open.

### Open question

Whether enum BuiltinFactor will feel too closed-world. The Custom(Box<dyn Factor>) escape hatch helps but inner-loop perf for user factors will be slower. Acceptable for now; flagged for future revisit if it becomes a problem.

## Section 3 — Storage layout (SoA + arenas)

### Dense Vec keyed by Index

Every HashMap<Index, T> becomes a Vec<T> (or Vec<Option<T>> for sparse) indexed directly by Index.0. The public-facing KeyTable<K> continues to map arbitrary keys → Index.
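
The PR names the concrete stores AgentStore<D> and SkillStore; a generic sketch of the dense-map pattern they share (illustrative, not the landed code):

```rust
// Dense map: slot i holds the value for Index(i); None marks an absent agent.
pub struct DenseStore<T> {
    slots: Vec<Option<T>>,
}

impl<T> DenseStore<T> {
    pub fn get(&self, idx: Index) -> Option<&T> {
        self.slots.get(idx.0).and_then(|slot| slot.as_ref())
    }

    pub fn insert(&mut self, idx: Index, value: T) {
        if self.slots.len() <= idx.0 {
            self.slots.resize_with(idx.0 + 1, || None);
        }
        self.slots[idx.0] = Some(value);
    }
}
```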

### SoA at hot layers, AoS at boundaries

The Skill struct stays as a public type for the API (returned from learning_curves, etc.), but inside TimeSlice we lay it out column-wise:

```rust
struct TimeSliceSkills {
    forward:    Vec<Gaussian>,   // [n_agents]
    backward:   Vec<Gaussian>,
    likelihood: Vec<Gaussian>,
    online:     Vec<Gaussian>,
    elapsed:    Vec<i64>,
    present:    Vec<bool>,
}
```

Within a slice, the inner loops touch one column repeatedly across many events — keeping the column contiguous improves cache utilization and makes the eventual SIMD step (Section 6) straightforward.

Gaussian itself stays as a single 16-byte struct in the Vec<Gaussian>. Splitting into two parallel Vec<f64>s wins for pure SIMD over thousands of Gaussians but loses for the random-access patterns dominant in EP. Revisit if benchmarks demand it.

### Arena allocator inside Game

Replace per-event allocations with a ScratchArena reused across calls.

```rust
pub struct ScratchArena {
    var_buf:     Vec<Gaussian>,
    factor_buf:  Vec<Gaussian>,    // edge messages
    bool_buf:    Vec<bool>,
    f64_buf:     Vec<f64>,
}
impl ScratchArena {
    fn reset(&mut self);                    // sets len=0, keeps capacity
    fn alloc_vars(&mut self, n: usize) -> &mut [Gaussian];
}
```

TimeSlice owns one ScratchArena; each event borrows it for the duration of its Game construction and inference. For the parallel-slice story (Section 6), each Rayon task gets its own arena.

### Per-event storage layout

Inside a TimeSlice, each event is stored column-wise as well, with Item inlined into team-level parallel arrays:

```rust
struct EventStorage {
    teams:   SmallVec<[TeamStorage; 4]>,
    outcome: Outcome,
    weights: SmallVec<[SmallVec<[f64; 4]>; 4]>,
    evidence: f64,
}
struct TeamStorage {
    competitors:   SmallVec<[Index; 4]>,    // who's on the team
    edge_messages: SmallVec<[Gaussian; 4]>, // outgoing message per slot
    output:        f64,
}
```

Iteration over (competitor, edge_message) pairs zips two slices — no per-element struct.
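
Concretely, the inner loop has this shape (illustrative):

```rust
// (competitor, edge_message) pairs come from zipping the two parallel columns.
for (idx, msg) in team.competitors.iter().zip(team.edge_messages.iter_mut()) {
    // idx: &Index into the slice's skill columns; msg: &mut Gaussian to rewrite.
}
```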

### SmallVec for typical shapes

Teams are ≤ ~5 players and games ≤ ~8 teams; SmallVec<[T; 8]> for team membership and SmallVec<[T; 4]> for team rosters keep the common case allocation-free.

### Trade-offs

- Dense Vec<T> keyed by Index is faster but means agent removal needs tombstones (or just leaves slots present-but-inactive). Acceptable: TrueSkill histories rarely remove players.
- SoA at TimeSlice level only, not at History level. History keeps Vec<TimeSlice> because slices are heterogeneous in size.
- One ScratchArena per TimeSlice keeps the lifetime story simple.

### Open question

The TimeSliceSkills sketch above uses (b) dense + present mask: one slot per agent in the history, indexed directly by Index, with a present: Vec<bool> mask for batches the agent didn't participate in. The alternative is (a) sparse columnar: a Vec<Index> of present agents and parallel Vec<Gaussian> columns of length n_present, with a separate lookup (binary search or auxiliary table) to find a given Index's slot.

(b) gives O(1) lookup and SIMD-friendly columns but wastes memory for sparsely populated slices. (a) is leaner per-slice but pays per-lookup cost in the inner loop. Bench both during T0 and pick. Default proposal: (b), since modern systems are memory-rich and the parallelism story is cleaner.

## Section 4 — API surface

### Typed event description

```rust
pub struct Event<T: Time, K> {
    pub time: T,
    pub teams: SmallVec<[Team<K>; 4]>,
    pub outcome: Outcome,
}

pub struct Team<K> {
    pub members: SmallVec<[Member<K>; 4]>,
}

pub struct Member<K> {
    pub key: K,
    pub weight: f64,                 // default 1.0
    pub prior: Option<Rating>,       // per-event override
}

pub enum Outcome {
    Ranked(SmallVec<[u32; 4]>),  // rank per team; equal ranks = tie
    Scored(SmallVec<[f64; 4]>),  // continuous score per team (engages MarginFactor)
}
```

Outcome::winner(0), Outcome::draw(), Outcome::ranking([0,1,2]) are convenience constructors.
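
Sketches of those constructors over the Ranked variant — the winner/draw bodies assume the common two-team case; the design names only the signatures:

```rust
use smallvec::smallvec;

impl Outcome {
    /// Team `i` takes rank 0; assumes a two-team event.
    pub fn winner(i: u32) -> Outcome {
        Outcome::Ranked(if i == 0 { smallvec![0, 1] } else { smallvec![1, 0] })
    }

    /// Equal ranks denote a tie; assumes a two-team event.
    pub fn draw() -> Outcome {
        Outcome::Ranked(smallvec![0, 0])
    }

    /// Explicit rank per team; lower rank places higher.
    pub fn ranking(ranks: impl IntoIterator<Item = u32>) -> Outcome {
        Outcome::Ranked(ranks.into_iter().collect())
    }
}
```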

### Builders

```rust
let mut history = History::<i64, _>::builder()
    .mu(25.0).sigma(25.0/3.0).beta(25.0/6.0)
    .drift(ConstantDrift(0.03))
    .p_draw(0.10)
    .convergence(ConvergenceOptions { max_iter: 30, epsilon: 1e-6 })
    .observer(LogObserver::default())
    .build();
```

For the no-time case, type inference picks Untimed:

```rust
let mut history = History::<Untimed, _>::builder().build();
```

### Three-tier event ingestion

```rust
// 1. Bulk ingestion (high-throughput path)
history.add_events(events_iter)?;

// 2. One-off match (very common in practice)
history.record_winner("alice", "bob", time)?;
history.record_draw("alice", "bob", time)?;

// 3. Builder for irregular shapes
history.event(time)
    .team(["alice", "bob"]).weights([1.0, 0.7])
    .team(["carol"])
    .ranking([1, 0])
    .commit()?;
```

### Convergence & queries

```rust
let report: ConvergenceReport = history.converge()?;

let curve: Vec<(i64, Gaussian)> = history.learning_curve(&"alice");
let all = history.learning_curves();           // HashMap<&K, Vec<(T, Gaussian)>>
let now = history.current_skill(&"alice");     // Option<Gaussian>

let ev = history.log_evidence();
let ev_for = history.log_evidence_for(&["alice", "bob"]);

let q = history.predict_quality(&[&["alice"], &["bob"]]);
let p_win = history.predict_outcome(&[&["alice"], &["bob"]]);
```

### Standalone Game

```rust
let g = Game::ranked(&[&[alice], &[bob]], Outcome::winner(0), &options);
let post = g.posteriors();

// Convenience
let (a, b) = Game::one_v_one(&alice, &bob, Outcome::winner(0));
```

### Errors

Replace debug_assert!/panic! at the API boundary with Result.

```rust
pub enum InferenceError {
    MismatchedShape { kind: &'static str, expected: usize, got: usize },
    InvalidProbability { value: f64 },
    ConvergenceFailed { last_step: (f64, f64), iterations: usize },
    NegativePrecision { pi: f64 },
}
```

Hot inner loops still use debug_assert! for invariants the API has already enforced.

### Trade-offs

- Generic over user's K; engine works in Index. Public outputs use &K.
- SmallVec everywhere on the event-description path.
- Three-tier API so casual users don't drown in types and bulk users still get throughput.
- Outcome enum replaces the "lower number wins" &[f64] convention.

### Open question

Whether to expose Index directly to users via an intern_key(&K) -> Index method, letting hot-path callers skip the KeyTable lookup on every call. Recommendation: yes — public Index handle plus history.lookup<Q: Borrow<K>>(&Q) -> Option<Index>. The casual API still takes &K everywhere; power users can promote to Index when profiling demands.

## Section 4½ — Naming pass

| Current | New | Rationale |
|---|---|---|
| History | History (kept) | Matches upstream; reads cleanly. |
| Batch | TimeSlice | Says what it is: every event sharing one timestamp. |
| Player | Rating | The struct holds prior/beta/drift — that's a rating configuration. Resolves the Player/Agent confusion. |
| Agent | Competitor | Holds dynamic state for someone competing in the history; fits the domain. |
| Skill | Skill (kept) | Per-time-slice skill estimate; clearer than BatchSkill. |
| Item | inlined into TeamStorage columns (engine) / Member<K> (public) | Eliminates the per-element struct in the hot path; gives API users a clear "team member" name. |
| Game | Game (kept) | Match collides with Rust's match. |
| Index | Index (kept) | Internal handle. |
| IndexMap | KeyTable | Avoids confusion with the indexmap crate. |

## Section 5 — Convergence & message scheduling

### Three nested loops, one mechanism

The system has three nested convergence loops:

  1. Within-game: EP sweeps over the factor graph
  2. Within-time-slice: re-running games as inputs change
  3. Cross-history: forward-pass then backward-pass over all slices

All three implement Workload; one Schedule impl drives all of them.

```rust
pub trait Schedule {
    fn run<W: Workload>(&self, workload: &mut W) -> ScheduleReport;
}

pub trait Workload {
    fn step(&mut self) -> (f64, f64);
    fn snapshot_evidence(&self) -> f64 { 0.0 }
}

pub struct ScheduleReport {
    pub iterations: usize,
    pub final_step: (f64, f64),
    pub converged: bool,
}
```
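
A sketch of the default schedule in terms of these traits, using the single shared epsilon that EpsilonOrMax::default() exposes (the body is our reading, not the landed code):

```rust
pub struct EpsilonOrMax {
    pub eps: f64,   // shared tolerance for both (dpi, dtau); the advanced ctor splits them
    pub max: usize, // iteration cap
}

impl Schedule for EpsilonOrMax {
    fn run<W: Workload>(&self, workload: &mut W) -> ScheduleReport {
        let mut last = (f64::INFINITY, f64::INFINITY);
        for i in 1..=self.max {
            last = workload.step(); // one full sweep; returns max (dpi, dtau)
            if last.0 <= self.eps && last.1 <= self.eps {
                return ScheduleReport { iterations: i, final_step: last, converged: true };
            }
        }
        ScheduleReport { iterations: self.max, final_step: last, converged: false }
    }
}
```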

### Built-in schedules

| Schedule | Behavior | Use |
|---|---|---|
| EpsilonOrMax { eps, max } | Default. Sweep until (dpi, dtau) ≤ eps or max iters. | All three loops. Replicates current behavior. |
| Damped { eps, max, alpha } | Same, but writes α·new + (1−α)·old. | Stuck oscillations. |
| Residual { eps, max } | Priority-queue: re-update the factor with the largest pending delta first. | Faster convergence on uneven graphs. |
| OneShot | Exactly one pass, no convergence check. | Online incremental adds. |

### Stopping in natural-param space

Switch from (|Δmu|, |Δsigma|) ≤ epsilon to (|Δpi|, |Δtau|) ≤ (eps_pi, eps_tau):

- mu and sigma are on different scales; one tolerance is wrong for both
- We store in nat-params anyway — checking convergence in mu/sigma would pay needless sqrts
- The nat-param delta is the natural geometry of the EP fixed point

Default EpsilonOrMax::default() exposes a single epsilon for simplicity; advanced ctor exposes both tolerances.

### Within-game improvements

- Replace the hard cap of 10 iterations with GameOptions::schedule, which propagates a ScheduleReport upward
- Fast path: graphs with no diff chain (1v1, where 1 iteration suffices) skip the loop entirely
- FFA / many-team ranks benefit from Residual; opt-in

### Within-slice and cross-history improvements

- No more old/new HashMap snapshotting: track deltas inline as we write under SoA
- Per-slice dirty bits: a TimeSlice whose neighbor messages haven't changed since its last full sweep doesn't need to re-run. Track time_slice.dirty and skip clean ones during the cross-history sweep (see the sketch below). Big win for online add (the locality case).
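
How the skip might sit in the forward sweep (slice fields and methods here are assumptions, not the landed code):

```rust
// Sketch: forward pass with dirty-bit skipping. A slice re-runs only if its
// own events changed or the forward message flowing into it changed.
fn forward_pass(slices: &mut [TimeSlice], report: &mut ConvergenceReport) {
    for i in 0..slices.len() {
        if !slices[i].dirty {
            report.batches_skipped += 1;
            continue;
        }
        let changed = slices[i].sweep(); // within-slice EP to its fixed point
        slices[i].dirty = false;
        if changed && i + 1 < slices.len() {
            slices[i + 1].dirty = true; // our forward message moved; wake the neighbor
        }
    }
}
```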

### ConvergenceReport

```rust
pub struct ConvergenceReport {
    pub iterations: usize,
    pub final_step: (f64, f64),
    pub log_evidence: f64,
    pub converged: bool,
    pub per_iteration_time: SmallVec<[Duration; 32]>,
    pub batches_skipped: usize,
}
```

Observer continues to receive per-iteration callbacks for live UI; ConvergenceReport is the post-hoc summary.

### Trade-offs

- One Schedule trait shared across loops — fewer concepts, more composable.
- Convergence checks in nat-param space — slightly different exact threshold than today; tests' epsilons re-tuned mechanically.
- Dirty-bit skipping changes iteration order vs. today; the fixed point is the same, but iteration counts may shift downward.
- Residual and Damped are opt-in; default behavior matches today closely.

### Open question

Whether Schedule::run should take an optional Observer reference. Recommendation: observation lives at a higher layer (History::converge calls observer hooks; Schedule is purely the loop driver).

## Section 6 — Concurrency & parallelism

### What's parallelizable

| Operation | Parallelism | Strategy |
|---|---|---|
| History::converge() (full forward+backward) | Sequential across slices | Within each slice: color-group events in parallel via Rayon |
| History::add_events(...) | Sequential append, but ingestion of typed events into EventStorage parallelizes trivially | n/a |
| History::learning_curves() | Per-key parallel | into_par_iter() |
| History::log_evidence_for(targets) | Per-batch parallel, reduce sum | par_iter().map(...).sum() |
| Game inference | Sequential | n/a (too small to amortize Rayon overhead) |

### Within-slice color-group parallelism

When events are added to a slice, partition them into color groups where events in the same color touch no shared Index. Within a color, run events in parallel via Rayon. Across colors, run sequentially. Preserves asynchronous-EP semantics exactly.
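
A greedy first-fit partition at ingestion time (assumed helper; the landed partitioning may differ):

```rust
use std::collections::HashSet;

// Assign each event (described by the set of Index values it touches) to the
// first color group sharing none of its indices. Groups run in parallel
// internally and sequentially with respect to each other.
fn color_groups(events: &[Vec<Index>]) -> Vec<Vec<usize>> {
    let mut groups: Vec<(HashSet<usize>, Vec<usize>)> = Vec::new();
    for (event_id, touched) in events.iter().enumerate() {
        let slot = groups
            .iter()
            .position(|(seen, _)| touched.iter().all(|idx| !seen.contains(&idx.0)));
        let slot = slot.unwrap_or_else(|| {
            groups.push((HashSet::new(), Vec::new()));
            groups.len() - 1
        });
        let (seen, members) = &mut groups[slot];
        seen.extend(touched.iter().map(|idx| idx.0));
        members.push(event_id);
    }
    groups.into_iter().map(|(_, members)| members).collect()
}
```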

Alternative: synchronous EP with snapshot. All events read from a frozen skill snapshot, write deltas to thread-local buffers, barrier merges. Trivially parallel but weaker per-iteration convergence — needs damping. Available as a Schedule impl, opt-in.

### Send + Sync requirements

All public traits (Time, Drift, Observer, Factor, Schedule) require Send + Sync. Observer impls must be thread-safe (called from arbitrary worker threads).

### Rayon as default-on feature

rayon as default-on feature; with default-features = false, parallel paths fall back to sequential iterators behind cfg(feature = "rayon").
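
The usual shape of that fallback (illustrative; the function name is not the crate's):

```rust
// Cargo.toml (sketch):
// [features]
// default = ["rayon"]
// rayon = ["dep:rayon"]

#[cfg(feature = "rayon")]
fn for_each_event<F>(events: &mut [EventStorage], f: F)
where
    F: Fn(&mut EventStorage) + Send + Sync,
{
    use rayon::prelude::*;
    events.par_iter_mut().for_each(|e| f(e));
}

#[cfg(not(feature = "rayon"))]
fn for_each_event<F>(events: &mut [EventStorage], f: F)
where
    F: Fn(&mut EventStorage) + Send + Sync,
{
    events.iter_mut().for_each(|e| f(e));
}
```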

### Expected speedup ballpark

For 1000 players, 60 events/slice × 1000 slices, 30 convergence iterations:

| Source | Estimated speedup vs. today |
|---|---|
| HashMap → dense Vec | 2–4× |
| Natural-param Gaussian, no-sqrt mul/div | 1.5–2× |
| Pre-allocated ScratchArena | 1.2–1.5× |
| Color-group parallel events in slice (8 cores) | 2–4× |
| Dirty-bit slice skipping (online add case) | 5–50× |
| Combined (offline converge) | ~10–30× |
| Combined (online add) | ~50–500×, depending on locality |

These are pre-implementation estimates. Each tier validates with criterion.

### Trade-offs

- Color-group parallelism requires up-front graph coloring at ingestion. Cost: linear in events, run once per add_events. Cheap.
- Default = asynchronous EP (preserves current semantics). Synchronous opt-in only.
- Cross-slice sweep stays sequential; no speculative parallel sweeps.
- Rayon default-on but feature-gated.

### Open question

Whether to expose color-group partitioning to users. Recommendation: hidden by default, escape hatch via add_events_with_partition(...) for power users who already know their event independence.

## Section 7 — Migration, testing, and delivery plan

The crate is unreleased, so version-bump ceremony doesn't apply. Tiers are sequencing of work and milestones, not releases.

### Tier sequence

#### T0 — Numerical parity (no API change)

Internal-only. Public surface unchanged.

- Switch Gaussian storage to natural parameters (pi, tau). mu()/sigma() become accessors.
- Replace HashMap<Index, _> with dense Vec<_> keyed by Index.0 everywhere.
- Introduce ScratchArena inside Batch so Game::new stops allocating per-event.
- Drop the panic! in mu_sigma; return Result propagated upward.

Acceptance: existing test suite passes (bit-equal where possible, ULP-bounded where natural-param arithmetic shifts a rounding); cargo bench shows ≥3× win on batch benchmark; no API breakage.

#### T1 — Factor graph machinery (internal-only)

- Introduce Factor, VarStore, Schedule as pub(crate) types.
- Re-implement Game::likelihoods() on top of BuiltinFactor::{Perf, TeamSum, RankDiff, Trunc} driven by EpsilonOrMax.
- Replace within-game iteration tracking with ScheduleReport.

Acceptance: existing test suite passes (ULP-bounded); within-game iteration counts unchanged; benchmarks ≥ T0.

#### T2 — New API surface (breaking)

All renames and the new public API land together. No half-renamed intermediate state.

- New types: Rating, TimeSlice, Competitor, Member<K>, Outcome, Event<T, K>, KeyTable<K>.
- Time trait introduced; History<T: Time, D: Drift<T>> is generic.
- Three-tier API surface: record_winner, event(...).team(...).commit(), bulk add_events(iter).
- Observer trait + ConvergenceReport; verbose: bool deleted.
- panic!/debug_assert! at the API boundary become Result<_, InferenceError>.
- Promote Factor/Schedule/VarStore to pub under a factors module.

Acceptance: full test suite rewritten in new API; equivalence tests prove identical posteriors vs. old API on the same inputs.

#### T3 — Concurrency

- Send + Sync audit and bounds on all public traits.
- Color-group partitioning at TimeSlice ingestion.
- rayon as default-on feature with #[cfg(feature = "rayon")] fallback.
- Parallel paths: within-slice color groups, learning_curves, log_evidence_for.

Acceptance: deterministic posteriors across RAYON_NUM_THREADS={1,2,4,8}; benchmarks show >2× on 8-core for offline converge.

#### T4 — Richer factor types & schedules

Each shipped independently after T3.

- MarginFactor → enables Outcome::Scored.
- Damped and Residual schedules.
- SynergyFactor, ScoreFactor → same pattern when wanted.

Each comes with its own benchmark and a worked example in examples/.

### Testing strategy

| Layer | Approach |
|---|---|
| Numerical correctness | Keep existing hardcoded golden values from test_1vs1, test_1vs1_draw, test_2vs1vs2_mixed, etc. through T0–T1 unchanged. They are a regression net against the original Python port. |
| API parity | T2 adds an equivalence test module that runs identical inputs through old vs. new construction and compares posteriors within ULPs. |
| Property tests | Add proptest for: factor graph fixed-point invariance under message order, Outcome round-trip, Gaussian mul/div associativity in nat-params, schedule convergence regardless of starting state (see the sketch below). |
| Determinism | T3 adds tests that run identical input across multiple Rayon thread counts and assert identical posteriors. |
| Benchmark gates | Each tier has a "must not regress" gate vs. the previous tier on the existing batch and gaussian criterion suites. T0 must beat baseline by ≥3×; T1 ≥ T0; etc. |
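
For instance, the nat-param associativity property might look like this (strategy ranges and tolerance are assumptions):

```rust
use proptest::prelude::*;

proptest! {
    // mul adds (pi, tau) component-wise, so it should be associative up to
    // floating-point error.
    #[test]
    fn gaussian_mul_associative(
        a in (0.1f64..100.0, -100.0f64..100.0),
        b in (0.1f64..100.0, -100.0f64..100.0),
        c in (0.1f64..100.0, -100.0f64..100.0),
    ) {
        let g = |(pi, tau): (f64, f64)| Gaussian { pi, tau };
        let left = g(a).mul(&g(b)).mul(&g(c));
        let right = g(a).mul(&g(b).mul(&g(c)));
        prop_assert!((left.pi - right.pi).abs() < 1e-9);
        prop_assert!((left.tau - right.tau).abs() < 1e-9);
    }
}
```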

### Risk management

- T0 risk: rounding drift in tests. Mitigation: where natural-param arithmetic legitimately changes the last ULPs, update goldens and simultaneously add a parity test against a snapshot taken from baseline to prove the difference is bounded.
- T2 risk: API design mistakes. Mitigation: review the spec and a worked example before implementing; iterate on feedback.
- T3 risk: subtle race conditions in color-group partitioning. Mitigation: loom tests for the merge step; deterministic-output assertion across thread counts.
- Cross-tier risk: scope creep. Each tier has a closed checklist; new ideas go to the next tier's wishlist.

### What we're explicitly not doing

- No GPU offload.
- No no_std support.
- No serde / persistence in this design.
- No incremental online API beyond record_winner / add_events.

### Open questions summary

Collected here for the review pass:

  1. enum BuiltinFactor extensibility — may feel too closed-world; revisit if user-defined factors via Custom(Box<dyn Factor>) become common.
  2. Sparse vs. dense per-slice skill storage — default to dense + present mask; sparse columnar is the alternative. Decided by T0 benchmarks.
  3. Index exposure for hot paths — expose intern_key/lookup so power users can promote &K to Index and skip the KeyTable lookup; casual API still takes &K everywhere.
  4. Schedule::run and observer wiring — observation stays at higher layer (History::converge calls observer hooks; Schedule is purely the loop driver).
  5. Color-group partition exposure — hidden by default, escape hatch via add_events_with_partition(...).