trueskill-tt/docs/superpowers/plans/2026-04-23-t0-numerical-parity.md

# T0 — Numerical Parity Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Internally rebuild the engine's data plane for performance — switch `Gaussian` to natural-parameter storage, replace `HashMap<Index, _>` with dense `Vec<_>` storage, eliminate per-event allocations via a reusable `ScratchArena`, and convert the lone `panic!` in `mu_sigma` into a propagated `Result`. **No top-level public API change.** Existing test suite must still pass.

**Architecture:** This is the first tier of a five-tier engine redesign documented in `docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md`. T0 is invisible to library users — same constructors, same methods, same outputs (within ULP-bounded floating-point drift) — but the internals are the new layout the rest of the redesign builds on.

**Tech Stack:** Rust 2024 edition, criterion for benchmarks, approx for floating-point comparisons, the existing `trueskill-tt` crate.

## Acceptance criteria

- All existing tests pass (`cargo test --features approx`). Where natural-parameter rounding shifts a hardcoded golden value, update the golden and add a parity comment explaining the ULP-bounded drift.
- `cargo bench --bench batch` shows ≥3× speedup on `Batch::iteration` vs. baseline.
- `cargo bench --bench gaussian` compiles (it currently doesn't — see Task 2).
- `cargo clippy --all-targets` clean.
- `cargo fmt --check` clean.
- No change to the top-level public API (`History`, `HistoryBuilder`, `Game::new`, `Game::posteriors`, `Player::new`, `quality()`, `Gaussian::from_ms`, etc.). Internal `pub mod` types may change signatures (`Batch::add_events`, `Agent`) — these are exposed in `lib.rs` but were never user-facing, and the in-tree benches are the only consumers besides the engine itself.

## File map

**Created:**
- `src/storage/mod.rs` — module root for the dense storage types
- `src/storage/skill_store.rs` — `SkillStore`: dense `Vec`-backed replacement for `HashMap<Index, Skill>`
- `src/storage/agent_store.rs` — `AgentStore<D>`: dense `Vec`-backed replacement for `HashMap<Index, Agent<D>>`
- `src/arena.rs` — `ScratchArena`: reusable scratch buffers for `Game::likelihoods`
- `benches/baseline.txt` — captured baseline numbers for the acceptance gate (committed for reference)

**Modified:**
- `src/gaussian.rs` — switch storage to `(pi, tau)`; `mu()`/`sigma()` become public accessors; ops rewritten
- `src/lib.rs` — module declarations for `storage` and `arena`; constants `N00`/`N01`/`N_INF` keep their public values but are constructed from natural params; `mu_sigma` becomes `Result`-returning
- `src/approx.rs` — `AbsDiffEq`/`RelativeEq`/`UlpsEq` impls compare via `mu()`/`sigma()` accessors instead of fields
- `src/agent.rs` — `clean()` takes `&mut AgentStore<D>` (changed from generic iterator); same observable behavior
- `src/batch.rs` — `Skill` map becomes `SkillStore`; `add_events`/`iteration`/etc. take `&AgentStore<D>` instead of `&HashMap<Index, Agent<D>>`; `Batch` owns a `ScratchArena`
- `src/history.rs` — `agents: HashMap<…>` becomes `agents: AgentStore<D>`; internal call sites updated
- `src/game.rs` — `Game::new` takes `&mut ScratchArena`; per-event `Vec` allocations replaced by arena slices
- `benches/gaussian.rs` — uses public `pi()`/`tau()` accessors (was broken)
- `benches/batch.rs` — passes `&AgentStore<D>` to `Batch::add_events` (built once before the bench loop)

**Touched (test-only golden updates — see Task 3):**
- `src/gaussian.rs` (test module)
- `src/game.rs` (test module)
- `src/batch.rs` (test module)
- `src/history.rs` (test module if affected)
- `src/lib.rs` (test module)

---

## Task 1: Establish baseline benchmark numbers

**Files:**
- Create: `benches/baseline.txt`

The acceptance criterion is "≥3× speedup on `Batch::iteration`." Capture the floor first.

- [ ] **Step 1: Fix the broken gaussian bench so we can capture both baselines**

The bench currently calls private methods. Make `pi()` and `tau()` `pub` *before* changing the representation. This is the only baseline-prep change.

Edit `src/gaussian.rs`:

```rust
//   fn pi(&self) -> f64 {
// becomes:
    pub fn pi(&self) -> f64 {

//   fn tau(&self) -> f64 {
// becomes:
    pub fn tau(&self) -> f64 {
```

Verify: `cargo bench --no-run` succeeds for both `batch` and `gaussian`.

- [ ] **Step 2: Run baseline benchmarks and save numbers**

```bash
cargo bench --bench batch 2>&1 | tee /tmp/bench-batch-baseline.txt
cargo bench --bench gaussian 2>&1 | tee /tmp/bench-gaussian-baseline.txt
```

Extract the headline numbers (lines like `Batch::iteration       time:   [X.XX µs Y.YY µs Z.ZZ µs]`) and write them into `benches/baseline.txt`:

```
# Baseline numbers captured before T0 changes
# Hardware: <fill in hostname and CPU model from `uname -a` / `sysctl -n machdep.cpu.brand_string`>
# Date: 2026-04-23

Batch::iteration         <X.XX> µs
Gaussian::add            <X.XX> ns
Gaussian::sub            <X.XX> ns
Gaussian::mul            <X.XX> ns
Gaussian::div            <X.XX> ns
Gaussian::pi             <X.XX> ns
Gaussian::tau            <X.XX> ns
Gaussian::pi_tau_combined <X.XX> ns
```

- [ ] **Step 3: Commit**

```bash
git add src/gaussian.rs benches/baseline.txt
git commit -m "$(cat <<'EOF'
bench: capture T0 baseline; expose pi/tau accessors

Promotes Gaussian::pi and Gaussian::tau to public so benches/gaussian.rs
compiles, then captures the baseline numbers we'll measure T0 against.
EOF
)"
```

---

## Task 2: Switch `Gaussian` to natural-parameter storage

**Files:**
- Modify: `src/gaussian.rs` (rewrite struct + ops)
- Modify: `src/lib.rs` (constants)
- Modify: `src/approx.rs` (compare via accessors)

This is the core change. Operations move from "store mu/sigma, recompute pi/tau on every mul/div" to "store pi/tau, recompute mu/sigma only when read." Hot-path mul/div become pure addition/subtraction of stored fields — no `sqrt`, no `powi(-2)`, no division.

### Mathematical mapping

| Operation | Before (mu/sigma stored) | After (pi/tau stored) |
|---|---|---|
| `pi()` | `sigma⁻²` (compute) | field read |
| `tau()` | `mu · pi()` (compute) | field read |
| `mu()` | field read | `tau / pi` (or `0` if `pi == 0`) |
| `sigma()` | field read | `1 / √pi` (or `∞` if `pi == 0`) |
| `a * b` (factor product) | convert to nat, add, convert back | `Gaussian { pi: a.pi + b.pi, tau: a.tau + b.tau }` — pure adds |
| `a / b` (cavity) | convert to nat, sub, convert back | `Gaussian { pi: a.pi - b.pi, tau: a.tau - b.tau }` — pure subs |
| `a + b` (variance addition) | `mu1+mu2`, `√(σ1²+σ2²)` | requires moment form: convert + compute + convert back |
| `a - b` (variance addition; same shape) | `mu1-mu2`, `√(σ1²+σ2²)` | same as `+` for sigma; requires moment form |
| `a * scalar` | `mu·k`, `σ·k` | `pi: pi/k²`, `tau: tau/k` (derived: σ' = σ·k → pi' = pi/k²; mu' = mu·k → tau' = mu'·pi' = (mu·k)·(pi/k²) = (mu·pi)/k = tau/k) |
| `forget(δ²)` | `mu`, `√(σ² + δ²)` | requires moment form |
| `delta(other)` | `(|Δmu|, |Δsigma|)` | accessor-based; same observable |
| `exclude(other)` | `mu1-mu2`, `√(σ1² - σ2²)` | requires moment form |

### Special-value table

| Constant | Public meaning | mu/sigma form (today) | pi/tau form (after) |
|---|---|---|---|
| `N_INF` | uniform / no info | `{ mu: 0, sigma: ∞ }` | `{ pi: 0.0, tau: 0.0 }` |
| `N00` | additive identity | `{ mu: 0, sigma: 0 }` | `{ pi: f64::INFINITY, tau: 0.0 }` |
| `N01` | standard normal | `{ mu: 0, sigma: 1 }` | `{ pi: 1.0, tau: 0.0 }` |

### Tasks

- [ ] **Step 1: Replace `src/gaussian.rs` with the natural-parameter implementation**

```rust
use std::ops;

use crate::{MU, N_INF, SIGMA};

/// A Gaussian distribution stored in natural parameters.
///
/// `pi  = 1 / sigma^2`  (precision)
/// `tau = mu * pi`      (precision-adjusted mean)
///
/// This representation makes message-passing operations (`*` and `/`) into
/// pure additions/subtractions of the stored fields, with no `sqrt` or
/// reciprocal in the hot path.
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct Gaussian {
    pi: f64,
    tau: f64,
}

impl Gaussian {
    /// Construct from mean and standard deviation. Maintained for API
    /// compatibility with the previous representation.
    pub const fn from_ms(mu: f64, sigma: f64) -> Self {
        if sigma == f64::INFINITY {
            Self { pi: 0.0, tau: 0.0 }
        } else if sigma == 0.0 {
            // mu == 0 is the only N00-like usage in the codebase; we honour
            // any other point-mass construction using the IEEE convention
            // (tau = mu * inf = inf for nonzero mu).
            Self {
                pi: f64::INFINITY,
                tau: if mu == 0.0 { 0.0 } else { f64::INFINITY * mu },
            }
        } else {
            let pi = 1.0 / (sigma * sigma);
            Self { pi, tau: mu * pi }
        }
    }

    /// Construct directly from natural parameters. Internal helper.
    #[inline]
    pub(crate) const fn from_natural(pi: f64, tau: f64) -> Self {
        Self { pi, tau }
    }

    #[inline]
    pub fn pi(&self) -> f64 {
        self.pi
    }

    #[inline]
    pub fn tau(&self) -> f64 {
        self.tau
    }

    #[inline]
    pub fn mu(&self) -> f64 {
        if self.pi == 0.0 { 0.0 } else { self.tau / self.pi }
    }

    #[inline]
    pub fn sigma(&self) -> f64 {
        if self.pi == 0.0 {
            f64::INFINITY
        } else if self.pi.is_infinite() {
            0.0
        } else {
            1.0 / self.pi.sqrt()
        }
    }

    pub(crate) fn delta(&self, other: Gaussian) -> (f64, f64) {
        ((self.mu() - other.mu()).abs(), (self.sigma() - other.sigma()).abs())
    }

    pub(crate) fn exclude(&self, other: Gaussian) -> Self {
        let mu = self.mu() - other.mu();
        let var = self.sigma().powi(2) - other.sigma().powi(2);
        Self::from_ms(mu, var.sqrt())
    }

    pub(crate) fn forget(&self, variance_delta: f64) -> Self {
        let mu = self.mu();
        let var = self.sigma().powi(2) + variance_delta;
        Self::from_ms(mu, var.sqrt())
    }
}

impl Default for Gaussian {
    fn default() -> Self {
        Self::from_ms(MU, SIGMA)
    }
}

impl ops::Add<Gaussian> for Gaussian {
    type Output = Gaussian;
    fn add(self, rhs: Gaussian) -> Self::Output {
        // Variance addition: (mu1 + mu2, sqrt(s1^2 + s2^2)).
        // Used for combining performance and noise; rare relative to mul/div.
        let mu = self.mu() + rhs.mu();
        let var = self.sigma().powi(2) + rhs.sigma().powi(2);
        Self::from_ms(mu, var.sqrt())
    }
}

impl ops::Sub<Gaussian> for Gaussian {
    type Output = Gaussian;
    fn sub(self, rhs: Gaussian) -> Self::Output {
        // (mu1 - mu2, sqrt(s1^2 + s2^2)). Same sigma combination as add.
        let mu = self.mu() - rhs.mu();
        let var = self.sigma().powi(2) + rhs.sigma().powi(2);
        Self::from_ms(mu, var.sqrt())
    }
}

impl ops::Mul<Gaussian> for Gaussian {
    type Output = Gaussian;
    fn mul(self, rhs: Gaussian) -> Self::Output {
        // Factor product: nat-param add. Hot path.
        Self::from_natural(self.pi + rhs.pi, self.tau + rhs.tau)
    }
}

impl ops::Mul<f64> for Gaussian {
    type Output = Gaussian;
    fn mul(self, scalar: f64) -> Self::Output {
        if !scalar.is_finite() {
            return N_INF;
        }
        if scalar == 0.0 {
            return N_INF;
        }
        // sigma' = sigma * |scalar|  =>  pi' = pi / scalar^2
        // mu'    = mu * scalar       =>  tau' = mu' * pi' = (mu * scalar) * (pi / scalar^2)
        //                                              = (mu * pi) / scalar = tau / scalar
        Self::from_natural(self.pi / (scalar * scalar), self.tau / scalar)
    }
}

impl ops::Div<Gaussian> for Gaussian {
    type Output = Gaussian;
    fn div(self, rhs: Gaussian) -> Self::Output {
        // Cavity: nat-param sub. Hot path.
        Self::from_natural(self.pi - rhs.pi, self.tau - rhs.tau)
    }
}
```

- [ ] **Step 2: Update `src/lib.rs` constants**

The `pub const` constants need to remain `const`-constructible. `from_ms` is `const fn`, so just rely on it:

```rust
// In src/lib.rs, no change needed to the lines:
pub const N01: Gaussian = Gaussian::from_ms(0.0, 1.0);
pub const N00: Gaussian = Gaussian::from_ms(0.0, 0.0);
pub const N_INF: Gaussian = Gaussian::from_ms(0.0, f64::INFINITY);
```

These continue to work because `from_ms` already handles `sigma == 0.0` and `sigma == f64::INFINITY` as special cases producing the right nat-param values.

- [ ] **Step 3: Rewrite `src/approx.rs` to compare via accessors**

The `Gaussian` struct no longer has `mu`/`sigma` as public fields; the approx impls used them directly. Switch to accessors.

```rust
use approx::{AbsDiffEq, RelativeEq, UlpsEq};

use crate::gaussian::Gaussian;

impl AbsDiffEq for Gaussian {
    type Epsilon = <f64 as AbsDiffEq>::Epsilon;

    fn default_epsilon() -> Self::Epsilon {
        f64::default_epsilon()
    }

    fn abs_diff_eq(&self, other: &Self, epsilon: Self::Epsilon) -> bool {
        f64::abs_diff_eq(&self.mu(), &other.mu(), epsilon)
            && f64::abs_diff_eq(&self.sigma(), &other.sigma(), epsilon)
    }
}

impl RelativeEq for Gaussian {
    fn default_max_relative() -> Self::Epsilon {
        f64::default_max_relative()
    }

    fn relative_eq(
        &self,
        other: &Self,
        epsilon: Self::Epsilon,
        max_relative: Self::Epsilon,
    ) -> bool {
        f64::relative_eq(&self.mu(), &other.mu(), epsilon, max_relative)
            && f64::relative_eq(&self.sigma(), &other.sigma(), epsilon, max_relative)
    }
}

impl UlpsEq for Gaussian {
    fn default_max_ulps() -> u32 {
        f64::default_max_ulps()
    }

    fn ulps_eq(&self, other: &Self, epsilon: Self::Epsilon, max_ulps: u32) -> bool {
        f64::ulps_eq(&self.mu(), &other.mu(), epsilon, max_ulps)
            && f64::ulps_eq(&self.sigma(), &other.sigma(), epsilon, max_ulps)
    }
}
```

- [ ] **Step 4: Update `benches/gaussian.rs` (now a no-op since Task 1 made the methods public)**

No change needed — Task 1 already promoted `pi()`/`tau()` to `pub`. After Task 2 they remain `pub` and become trivial field reads. Verify:

```bash
cargo bench --no-run --bench gaussian
```

Expected: clean compile.

- [ ] **Step 5: Search for direct field access on `Gaussian` outside `gaussian.rs` and replace with accessors**

```bash
grep -rn '\.mu\b\|\.sigma\b' src/ benches/ | grep -v 'gaussian.rs'
```

For each hit, change `g.mu` → `g.mu()` and `g.sigma` → `g.sigma()`. Likely sites:

- `src/lib.rs` — `cdf()`, `pdf()`, `compute_margin()`, `evidence()`, `approx()`, `quality()` — search for `.mu` and `.sigma`
- `src/gaussian.rs` (own tests at the bottom)

Specifically known sites in `src/lib.rs`:
- `evidence(...)`: uses `d[e].prior.mu` and `d[e].prior.sigma`
- `approx()`: uses `n.mu`, `n.sigma`
- `quality()`: uses `rating.mu` and `rating.sigma` inside the loops

Update each.

- [ ] **Step 6: Run tests, fix broken golden values**

```bash
cargo test --features approx 2>&1 | tee /tmp/t0-test-output.txt
```

For each `assert_eq!(g, Gaussian { mu: X, sigma: Y })` that fails:
- The struct literal pattern won't even compile (no public `mu`/`sigma` fields). Replace with `Gaussian::from_ms(X, Y)`.
- For exact `assert_eq!`, also check whether the failure is "compile error" (struct literal) or "runtime mismatch" (ULP drift). For runtime drift, capture the new value (it should differ from the old by at most a few ULPs in the last hex digit) and update the literal.

Likely affected tests:
- `src/gaussian.rs::tests::test_add` — struct literal `Gaussian { mu: ..., sigma: ... }`
- `src/gaussian.rs::tests::test_sub`
- `src/gaussian.rs::tests::test_mul`
- `src/gaussian.rs::tests::test_div`

Rewrite each like:

```rust
#[test]
fn test_mul() {
    let n = Gaussian::from_ms(25.0, 25.0 / 3.0);
    let m = Gaussian::from_ms(0.0, 1.0);

    let result = n * m;
    // Original goldens: mu = 0.35488958990536273, sigma = 0.992876838486922
    // Verify within ULPs (the natural-param mul should produce identical bits
    // for these inputs — if not, capture and document the new value).
    assert_eq!(result.mu(), 0.35488958990536273);
    assert_eq!(result.sigma(), 0.992876838486922);
}
```

If a golden differs at the last few ULPs, switch to `approx::assert_ulps_eq!` with `max_ulps = 4`:

```rust
use approx::assert_ulps_eq;
assert_ulps_eq!(result, Gaussian::from_ms(EXPECTED_MU, EXPECTED_SIGMA), max_ulps = 4);
```

- [ ] **Step 7: Run the full test suite, including game/batch/history**

```bash
cargo test --features approx
```

Tests in `game.rs`, `batch.rs`, `history.rs` already use `assert_ulps_eq!` with `epsilon = 1e-6` so they should be tolerant. If any fail beyond `1e-6`, investigate (it would suggest a real bug in the rewrite, not just ULP drift).

- [ ] **Step 8: Lint and format**

```bash
cargo clippy --all-targets --features approx -- -D warnings
cargo fmt
```

- [ ] **Step 9: Commit**

```bash
git add src/gaussian.rs src/lib.rs src/approx.rs benches/gaussian.rs
# Plus any test files you adjusted goldens in
git add src/gaussian.rs  # test module
git commit -m "$(cat <<'EOF'
refactor(gaussian): switch to natural-parameter storage

Internal change: Gaussian now stores (pi, tau) rather than (mu, sigma).
Multiplication and division become pure adds/subs of stored fields,
eliminating per-op sqrt and reciprocal. mu()/sigma() become public
accessors. Add/Sub/forget/exclude continue to work in moment form
(rare relative to mul/div in the EP hot path).

Test golden values updated where natural-parameter rounding shifts
the last few ULPs; differences bounded by 1e-6 absolute, validated
against the prior moment-form arithmetic.

Part of T0 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.
EOF
)"
```

---

## Task 3: Convert `mu_sigma` panic to `Result`

**Files:**
- Modify: `src/lib.rs` (or wherever `mu_sigma` lives — verify with `grep`)

The current `mu_sigma` panics on negative precision. This is an internal invariant violation and should be a structured error.

After Task 2, `mu_sigma` may have been deleted entirely (it was a helper for the old representation's mul/div). Check first:

```bash
grep -n 'fn mu_sigma\|mu_sigma(' src/
```

- [ ] **Step 1: If `mu_sigma` still exists, replace the panic with an error type**

If the function survived the Task 2 rewrite (it might still be referenced indirectly), replace the body:

```rust
// In src/lib.rs (or src/gaussian.rs)
#[derive(Debug, Clone, PartialEq)]
pub enum InferenceError {
    NegativePrecision { pi: f64 },
}

impl std::fmt::Display for InferenceError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            Self::NegativePrecision { pi } => {
                write!(f, "precision must be non-negative; got {pi}")
            }
        }
    }
}

impl std::error::Error for InferenceError {}

fn mu_sigma(tau: f64, pi: f64) -> Result<(f64, f64), InferenceError> {
    if pi > 0.0 {
        Ok((tau / pi, (1.0 / pi).sqrt()))
    } else if (pi + 1e-5) < 0.0 {
        Err(InferenceError::NegativePrecision { pi })
    } else {
        Ok((0.0, f64::INFINITY))
    }
}
```

If `mu_sigma` was deleted in Task 2, skip to Step 3 and just create the `InferenceError` enum.

- [ ] **Step 2: Update callers to propagate the `Result`**

If `mu_sigma` survived, update callers to `?` the error or unwrap with a `debug_assert!`. Hot-path callers (anything inside `propagate`/`iteration`) should be `debug_assert!` because the invariant is enforced at the API boundary. Public-API callers should propagate.

In practice after Task 2 there should be no surviving `mu_sigma` callers; all Gaussian arithmetic is direct nat-param work.

- [ ] **Step 3: Add `InferenceError` to public API**

In `src/lib.rs`:

```rust
mod error;
pub use error::InferenceError;
```

Create `src/error.rs`:

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum InferenceError {
    NegativePrecision { pi: f64 },
}

impl std::fmt::Display for InferenceError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            Self::NegativePrecision { pi } => {
                write!(f, "precision must be non-negative; got {pi}")
            }
        }
    }
}

impl std::error::Error for InferenceError {}
```

This is the seed of the wider `InferenceError` enum that T2 will expand. Other variants come later.

- [ ] **Step 4: Test**

```bash
cargo test --features approx
cargo clippy --all-targets --features approx -- -D warnings
cargo fmt
```

- [ ] **Step 5: Commit**

```bash
git add src/lib.rs src/error.rs
git commit -m "$(cat <<'EOF'
refactor: introduce InferenceError; remove mu_sigma panic

The 'precision should be greater than 0' panic in mu_sigma becomes
a structured InferenceError::NegativePrecision. Hot-path call sites
that have invariant guarantees use debug_assert!; API-boundary
sites propagate.

Seed for the wider error enum that T2 will expand.
EOF
)"
```

---

## Task 4: Introduce `SkillStore` (replace `HashMap<Index, Skill>` in `Batch`)

**Files:**
- Create: `src/storage/mod.rs`
- Create: `src/storage/skill_store.rs`
- Modify: `src/lib.rs` (declare `pub mod storage;`)
- Modify: `src/batch.rs` (use `SkillStore`)

`Batch` currently holds `pub(crate) skills: HashMap<Index, Skill>`. Every iteration hashes and resolves indices. Replace with a dense `Vec`-backed store.

### Design

Two operations dominate:
1. Lookup by `Index` (very common; in inner loop)
2. Iterate over all present `(Index, Skill)` pairs (per iteration, per posterior recompute)

Use the **dense + present mask** layout from the spec (Section 3 open question, default proposal):

```rust
pub struct SkillStore {
    skills:  Vec<Skill>,    // dense, indexed by Index.0
    present: Vec<bool>,     // parallel mask
    n_present: usize,       // cached count for iteration sizing
}
```

`get`/`get_mut` return `Option<&Skill>`/`Option<&mut Skill>`. `insert(idx, skill)` sets and marks present. Iteration is `present.iter().enumerate().filter(|(_, &p)| p).map(|(i, _)| (Index(i), &skills[i]))`.

### Tasks

- [ ] **Step 1: Write the test for `SkillStore`**

Create `src/storage/skill_store.rs`. Add tests as a `#[cfg(test)] mod tests` block at the bottom.

```rust
use crate::Index;
use crate::batch::Skill;

#[derive(Debug, Default)]
pub struct SkillStore {
    skills: Vec<Skill>,
    present: Vec<bool>,
    n_present: usize,
}

impl SkillStore {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn with_capacity(cap: usize) -> Self {
        Self {
            skills: Vec::with_capacity(cap),
            present: Vec::with_capacity(cap),
            n_present: 0,
        }
    }

    fn ensure_capacity(&mut self, idx: usize) {
        if idx >= self.skills.len() {
            self.skills.resize_with(idx + 1, Skill::default);
            self.present.resize(idx + 1, false);
        }
    }

    pub fn insert(&mut self, idx: Index, skill: Skill) {
        self.ensure_capacity(idx.0);
        if !self.present[idx.0] {
            self.n_present += 1;
        }
        self.skills[idx.0] = skill;
        self.present[idx.0] = true;
    }

    pub fn get(&self, idx: Index) -> Option<&Skill> {
        if idx.0 < self.present.len() && self.present[idx.0] {
            Some(&self.skills[idx.0])
        } else {
            None
        }
    }

    pub fn get_mut(&mut self, idx: Index) -> Option<&mut Skill> {
        if idx.0 < self.present.len() && self.present[idx.0] {
            Some(&mut self.skills[idx.0])
        } else {
            None
        }
    }

    pub fn contains(&self, idx: Index) -> bool {
        idx.0 < self.present.len() && self.present[idx.0]
    }

    pub fn len(&self) -> usize {
        self.n_present
    }

    pub fn is_empty(&self) -> bool {
        self.n_present == 0
    }

    /// Iterate present (Index, &Skill) pairs.
    pub fn iter(&self) -> impl Iterator<Item = (Index, &Skill)> {
        self.present
            .iter()
            .enumerate()
            .filter_map(move |(i, &p)| if p { Some((Index(i), &self.skills[i])) } else { None })
    }

    /// Iterate present (Index, &mut Skill) pairs.
    pub fn iter_mut(&mut self) -> impl Iterator<Item = (Index, &mut Skill)> {
        let SkillStore { skills, present, .. } = self;
        skills
            .iter_mut()
            .zip(present.iter())
            .enumerate()
            .filter_map(|(i, (s, &p))| if p { Some((Index(i), s)) } else { None })
    }

    /// Just the present indices, for cheap iteration when only keys matter.
    pub fn keys(&self) -> impl Iterator<Item = Index> + '_ {
        self.present
            .iter()
            .enumerate()
            .filter_map(|(i, &p)| if p { Some(Index(i)) } else { None })
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn insert_then_get() {
        let mut store = SkillStore::new();
        let idx = Index(3);
        let skill = Skill::default();
        store.insert(idx, skill);

        assert!(store.contains(idx));
        assert_eq!(store.len(), 1);
        assert!(store.get(idx).is_some());
    }

    #[test]
    fn missing_returns_none() {
        let store = SkillStore::new();
        assert!(store.get(Index(0)).is_none());
        assert!(!store.contains(Index(42)));
    }

    #[test]
    fn iter_skips_absent_slots() {
        let mut store = SkillStore::new();
        store.insert(Index(0), Skill::default());
        store.insert(Index(5), Skill::default());

        let keys: Vec<Index> = store.keys().collect();
        assert_eq!(keys, vec![Index(0), Index(5)]);
    }

    #[test]
    fn double_insert_same_index_does_not_double_count() {
        let mut store = SkillStore::new();
        store.insert(Index(2), Skill::default());
        store.insert(Index(2), Skill::default());
        assert_eq!(store.len(), 1);
    }
}
```

- [ ] **Step 2: Create `src/storage/mod.rs`**

```rust
mod skill_store;

pub use skill_store::SkillStore;
```

- [ ] **Step 3: Declare the module in `src/lib.rs`**

Add to the module declarations:

```rust
pub mod storage;
```

(Public so the (still-public) `batch` module can return references to it; we may tighten visibility in T2.)

- [ ] **Step 4: Run the new tests**

```bash
cargo test --features approx storage::skill_store
```

Expected: 4 passing tests.

- [ ] **Step 5: Replace `HashMap<Index, Skill>` in `Batch`**

In `src/batch.rs`, the change is mechanical. Replace:

```rust
pub(crate) skills: HashMap<Index, Skill>,
```

with:

```rust
pub(crate) skills: SkillStore,
```

Add the import:

```rust
use crate::storage::SkillStore;
```

Now update every use site. `HashMap` and `SkillStore` differ in:
- `map[&idx]` (panicking lookup) → `store.get(idx).expect("present")` or use `store.get_mut(idx).unwrap()`
- `map.insert(idx, val)` → `store.insert(idx, val)` (same signature)
- `map.iter()` returning `(&Index, &Skill)` → `store.iter()` returning `(Index, &Skill)` — note key is by-value; adjust borrow patterns
- `map.keys()` returning `&Index` → `store.keys()` returning `Index`
- `map.get(&idx)` → `store.get(idx)`
- `map.get_mut(&idx)` → `store.get_mut(idx)`

Walk through `batch.rs` and update each call. Specifically (line numbers approximate):
- `Batch::posteriors` — uses `iter()`
- `Batch::add_events` — uses `get_mut()`/`insert()`
- `Batch::iteration` — uses `get_mut()`/`get()` heavily
- `Batch::convergence` — uses `posteriors()` and the deltas
- `Batch::forward_prior_out`, `Batch::backward_prior_out` — use `get()`
- `Batch::new_backward_info`, `Batch::new_forward_info` — use `iter_mut()`

After the rewrite, the `Batch::new` constructor likely becomes:

```rust
pub fn new(time: i64, p_draw: f64) -> Self {
    Self {
        events: Vec::new(),
        skills: SkillStore::new(),
        time,
        p_draw,
    }
}
```

- [ ] **Step 6: Run the existing batch tests**

```bash
cargo test --features approx batch
```

Expected: all batch tests pass. If iteration order through `iter()` changed (ours sorts by Index; HashMap order was arbitrary), some "first event" comparisons may need adjustment — but the EP fixed point is order-independent so posteriors should match within ULPs.

- [ ] **Step 7: Run full test suite to catch any cross-file fallout**

```bash
cargo test --features approx
```

- [ ] **Step 8: Lint and format**

```bash
cargo clippy --all-targets --features approx -- -D warnings
cargo fmt
```

- [ ] **Step 9: Commit**

```bash
git add src/storage/ src/lib.rs src/batch.rs
git commit -m "$(cat <<'EOF'
refactor(batch): replace HashMap<Index, Skill> with dense SkillStore

Eliminates per-iteration hashing in the within-slice convergence loop
and improves cache locality. SkillStore is a Vec-backed dense store
with a parallel present mask; lookup is O(1) array indexing.

Iteration now visits indices in ascending order (HashMap was arbitrary);
the EP fixed point is order-independent so posteriors are unchanged
within ULPs.
EOF
)"
```

---

## Task 5: Introduce `AgentStore<D>` (replace `HashMap<Index, Agent<D>>` in `History`)

**Files:**
- Create: `src/storage/agent_store.rs`
- Modify: `src/storage/mod.rs` (re-export)
- Modify: `src/agent.rs` (`clean()` signature)
- Modify: `src/batch.rs` (signatures: `add_events`, `iteration`, `new_backward_info`, `new_forward_info`, `backward_prior_out`, `log_evidence` take `&AgentStore<D>` not `&HashMap<…>`)
- Modify: `src/history.rs` (use `AgentStore<D>` field)
- Modify: `benches/batch.rs` (build `AgentStore` instead of `HashMap`)

Same idea as Task 4, parameterised over `D: Drift`.

- [ ] **Step 1: Write `AgentStore<D>` with tests**

Create `src/storage/agent_store.rs`:

```rust
use crate::Index;
use crate::agent::Agent;
use crate::drift::Drift;

#[derive(Debug)]
pub struct AgentStore<D: Drift> {
    agents: Vec<Option<Agent<D>>>,
    n_present: usize,
}

impl<D: Drift> Default for AgentStore<D> {
    fn default() -> Self {
        Self {
            agents: Vec::new(),
            n_present: 0,
        }
    }
}

impl<D: Drift> AgentStore<D> {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn with_capacity(cap: usize) -> Self {
        Self {
            agents: Vec::with_capacity(cap),
            n_present: 0,
        }
    }

    fn ensure_capacity(&mut self, idx: usize) {
        if idx >= self.agents.len() {
            self.agents.resize_with(idx + 1, || None);
        }
    }

    pub fn insert(&mut self, idx: Index, agent: Agent<D>) {
        self.ensure_capacity(idx.0);
        if self.agents[idx.0].is_none() {
            self.n_present += 1;
        }
        self.agents[idx.0] = Some(agent);
    }

    pub fn get(&self, idx: Index) -> Option<&Agent<D>> {
        self.agents.get(idx.0).and_then(|slot| slot.as_ref())
    }

    pub fn get_mut(&mut self, idx: Index) -> Option<&mut Agent<D>> {
        self.agents.get_mut(idx.0).and_then(|slot| slot.as_mut())
    }

    pub fn contains(&self, idx: Index) -> bool {
        self.get(idx).is_some()
    }

    pub fn len(&self) -> usize {
        self.n_present
    }

    pub fn is_empty(&self) -> bool {
        self.n_present == 0
    }

    pub fn iter(&self) -> impl Iterator<Item = (Index, &Agent<D>)> {
        self.agents
            .iter()
            .enumerate()
            .filter_map(|(i, slot)| slot.as_ref().map(|a| (Index(i), a)))
    }

    pub fn iter_mut(&mut self) -> impl Iterator<Item = (Index, &mut Agent<D>)> {
        self.agents
            .iter_mut()
            .enumerate()
            .filter_map(|(i, slot)| slot.as_mut().map(|a| (Index(i), a)))
    }

    pub fn values_mut(&mut self) -> impl Iterator<Item = &mut Agent<D>> {
        self.agents.iter_mut().filter_map(|s| s.as_mut())
    }
}

#[cfg(test)]
mod tests {
    use super::*;
    use crate::agent::Agent;
    use crate::drift::ConstantDrift;
    use crate::player::Player;

    #[test]
    fn insert_then_get_roundtrip() {
        let mut store: AgentStore<ConstantDrift> = AgentStore::new();
        let idx = Index(7);
        let agent = Agent {
            player: Player::default(),
            ..Default::default()
        };
        store.insert(idx, agent);
        assert!(store.contains(idx));
        assert_eq!(store.len(), 1);
    }

    #[test]
    fn iter_in_index_order() {
        let mut store: AgentStore<ConstantDrift> = AgentStore::new();
        store.insert(Index(2), Agent::default());
        store.insert(Index(0), Agent::default());
        store.insert(Index(5), Agent::default());

        let keys: Vec<Index> = store.iter().map(|(i, _)| i).collect();
        assert_eq!(keys, vec![Index(0), Index(2), Index(5)]);
    }
}
```

- [ ] **Step 2: Re-export from `src/storage/mod.rs`**

```rust
mod agent_store;
mod skill_store;

pub use agent_store::AgentStore;
pub use skill_store::SkillStore;
```

- [ ] **Step 3: Run the new tests**

```bash
cargo test --features approx storage::agent_store
```

Expected: 2 passing tests.

- [ ] **Step 4: Update `src/agent.rs::clean`**

Current signature:

```rust
pub(crate) fn clean<'a, D: Drift + 'a, A: Iterator<Item = &'a mut Agent<D>>>(
    agents: A,
    last_time: bool,
) {
    for a in agents {
        a.message = N_INF;
        if last_time {
            a.last_time = i64::MIN;
        }
    }
}
```

Change to take `&mut AgentStore<D>`:

```rust
use crate::storage::AgentStore;

pub(crate) fn clean<D: Drift>(agents: &mut AgentStore<D>, last_time: bool) {
    for a in agents.values_mut() {
        a.message = N_INF;
        if last_time {
            a.last_time = i64::MIN;
        }
    }
}
```

Update call sites in `src/history.rs`:
- `agent::clean(self.agents.values_mut(), false);` → `agent::clean(&mut self.agents, false);`

- [ ] **Step 5: Update `Batch` method signatures**

In `src/batch.rs`, replace `&HashMap<Index, Agent<D>>` with `&AgentStore<D>` in:

- `Batch::add_events`
- `Batch::iteration`
- `Batch::convergence`
- `Batch::backward_prior_out`
- `Batch::new_backward_info`
- `Batch::new_forward_info`
- `Batch::log_evidence`
- `Item::within_prior` (also takes `&HashMap<Index, Agent<D>>`)
- `Event::within_priors`

The body changes are mechanical: `agents[&idx]` → `agents.get(idx).expect(...)` (or `agents.get(idx).unwrap()` since the call sites have invariants that the index is present).

Add the import:

```rust
use crate::storage::{AgentStore, SkillStore};
```

Remove the now-unused `use std::collections::HashMap;` if there are no other uses.

- [ ] **Step 6: Update `History` to use `AgentStore<D>`**

In `src/history.rs`:

```rust
// Change the field
pub struct History<D: Drift = ConstantDrift> {
    size: usize,
    pub(crate) batches: Vec<Batch>,
    agents: AgentStore<D>,           // was: HashMap<Index, Agent<D>>
    time: bool,
    // ...
}

// Change the constructors
impl Default for History<ConstantDrift> {
    fn default() -> Self {
        Self {
            // ...
            agents: AgentStore::new(),
            // ...
        }
    }
}
```

And in `HistoryBuilder::build`:

```rust
pub fn build(self) -> History<D> {
    History {
        // ...
        agents: AgentStore::new(),
        // ...
    }
}
```

Walk through every `self.agents.foo(...)` call site in `history.rs` and replace `HashMap` semantics with `AgentStore` semantics. Likely changes:
- `self.agents.contains_key(agent)` → `self.agents.contains(*agent)` (note: `Index` is `Copy`)
- `self.agents.insert(*agent, val)` → `self.agents.insert(*agent, val)` (same signature)
- `self.agents.get_mut(agent).unwrap()` → `self.agents.get_mut(*agent).unwrap()`
- `&self.agents` passed to `Batch::*` methods now passes `&AgentStore<D>` directly (no `.values()`/`.iter()` needed)

- [ ] **Step 7: Update `benches/batch.rs`**

The bench currently builds `HashMap<Index, Agent>` and passes `&agents`. Change to:

```rust
use trueskill_tt::{
    BETA, GAMMA, IndexMap, MU, P_DRAW, SIGMA, agent::Agent, batch::Batch, drift::ConstantDrift,
    gaussian::Gaussian, player::Player, storage::AgentStore,
};

fn criterion_benchmark(criterion: &mut Criterion) {
    let mut index = IndexMap::new();
    let a = index.get_or_create("a");
    let b = index.get_or_create("b");
    let c = index.get_or_create("c");

    let mut agents: AgentStore<ConstantDrift> = AgentStore::new();
    for idx in [a, b, c] {
        agents.insert(
            idx,
            Agent {
                player: Player::new(Gaussian::from_ms(MU, SIGMA), BETA, ConstantDrift(GAMMA)),
                ..Default::default()
            },
        );
    }

    // ... rest unchanged, but pass &agents (now &AgentStore)
}
```

- [ ] **Step 8: Run all tests**

```bash
cargo test --features approx
```

Likely failure mode: tests in `batch.rs` and `history.rs` build their own `HashMap<Index, Agent>` and pass it. Update those tests to build `AgentStore` instead. Search:

```bash
grep -rn 'HashMap.*Agent\|HashMap::new.*agent' src/
```

For each test, replace:

```rust
let mut agents = HashMap::new();
agents.insert(a, Agent { ... });
```

with:

```rust
let mut agents: AgentStore<ConstantDrift> = AgentStore::new();
agents.insert(a, Agent { ... });
```

- [ ] **Step 9: Lint, format, commit**

```bash
cargo clippy --all-targets --features approx -- -D warnings
cargo fmt
git add src/ benches/batch.rs
git commit -m "$(cat <<'EOF'
refactor: replace HashMap<Index, Agent> with dense AgentStore

History and Batch now hold agents in a Vec<Option<Agent<D>>> indexed
directly by Index. Eliminates HashMap overhead in the cross-history
forward/backward sweep and within-slice iteration.

Public Batch::* signatures now take &AgentStore<D> instead of
&HashMap<Index, Agent<D>>. The benches/batch.rs and tests are
updated to build AgentStore. Top-level History API is unchanged.
EOF
)"
```

---

## Task 6: Introduce `ScratchArena` and reuse buffers across `Game::new` calls

**Files:**
- Create: `src/arena.rs`
- Modify: `src/lib.rs` (declare module)
- Modify: `src/game.rs` (`Game::new` takes `&mut ScratchArena`; replace per-event Vecs)
- Modify: `src/batch.rs` (`Batch` owns a `ScratchArena`; passes to `Game::new`)

`Game::likelihoods` allocates four Vecs per call:
- `team: Vec<TeamMessage>` (length `n_teams`)
- `diff: Vec<DiffMessage>` (length `n_teams - 1`)
- `tie: Vec<bool>` (length `n_teams - 1`)
- `margin: Vec<f64>` (length `n_teams - 1`)

For 60 games × 30 iterations × cross-history convergence, that's hundreds of thousands of small allocations per converge. Replace with a reusable arena owned by the `Batch`.

### Design

```rust
pub struct ScratchArena {
    teams:   Vec<TeamMessage>,
    diffs:   Vec<DiffMessage>,
    ties:    Vec<bool>,
    margins: Vec<f64>,
}
```

Each scratch slice is reset by `clear()` (sets `len = 0`, retains capacity). `Game::new` borrows the arena, reserves enough capacity, and writes into the buffers. Lifetimes: the arena outlives any single `Game`; `Game` borrows mutably for the duration of construction and inference.

Easiest API: `ScratchArena` owns the buffers; `Game::new` takes `&mut ScratchArena` and reads `Vec<Vec<Gaussian>>` (the public `likelihoods` field) before returning. The lifetime of the borrowed buffers ends with `Game::new`.

### Tasks

- [ ] **Step 1: Write `ScratchArena` with a small test**

Create `src/arena.rs`:

```rust
use crate::message::{DiffMessage, TeamMessage};

#[derive(Debug, Default)]
pub struct ScratchArena {
    pub(crate) teams:   Vec<TeamMessage>,
    pub(crate) diffs:   Vec<DiffMessage>,
    pub(crate) ties:    Vec<bool>,
    pub(crate) margins: Vec<f64>,
}

impl ScratchArena {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn with_capacity(n_teams: usize) -> Self {
        Self {
            teams:   Vec::with_capacity(n_teams),
            diffs:   Vec::with_capacity(n_teams.saturating_sub(1)),
            ties:    Vec::with_capacity(n_teams.saturating_sub(1)),
            margins: Vec::with_capacity(n_teams.saturating_sub(1)),
        }
    }

    pub fn reset(&mut self) {
        self.teams.clear();
        self.diffs.clear();
        self.ties.clear();
        self.margins.clear();
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn reset_keeps_capacity() {
        let mut arena = ScratchArena::with_capacity(8);
        let cap_before = arena.teams.capacity();
        arena.teams.push(TeamMessage::default());
        arena.reset();
        assert_eq!(arena.teams.len(), 0);
        assert_eq!(arena.teams.capacity(), cap_before);
    }
}
```

Note: `TeamMessage` and `DiffMessage` are currently `pub(crate)`. The `ScratchArena` lives in the engine, so `pub(crate)` access is fine — `ScratchArena`'s fields are also `pub(crate)`.

- [ ] **Step 2: Declare the module**

In `src/lib.rs`:

```rust
pub mod arena;
```

(Public for the same reason as `storage` — `Batch` exposes it as a constructor argument; T2 can tighten.)

- [ ] **Step 3: Run the test**

```bash
cargo test --features approx arena
```

Expected: 1 passing test.

- [ ] **Step 4: Update `Game::new` to take `&mut ScratchArena`**

In `src/game.rs`:

```rust
use crate::arena::ScratchArena;

impl<'a, D: Drift> Game<'a, D> {
    pub fn new(
        teams: Vec<Vec<Player<D>>>,
        result: &'a [f64],
        weights: &'a [Vec<f64>],
        p_draw: f64,
        arena: &mut ScratchArena,
    ) -> Self {
        // ... same debug_assert! checks ...

        arena.reset();
        // Pre-size buffers to the exact required lengths.
        // Building team/diff/tie/margin into arena buffers.

        let mut this = Self {
            teams,
            result,
            weights,
            p_draw,
            likelihoods: Vec::new(),
            evidence: 0.0,
        };

        this.likelihoods(arena);
        this
    }

    fn likelihoods(&mut self, arena: &mut ScratchArena) {
        let o = sort_perm(self.result, true);

        // Build team messages into the arena instead of allocating fresh.
        arena.teams.clear();
        arena.teams.extend(o.iter().map(|&e| {
            let performance = self.teams[e]
                .iter()
                .zip(self.weights[e].iter())
                .fold(N00, |p, (player, &weight)| {
                    p + (player.performance() * weight)
                });
            TeamMessage {
                prior: performance,
                ..Default::default()
            }
        }));

        // Same for diffs/ties/margins. Use slice references for the inner loop.
        arena.diffs.clear();
        arena.diffs.extend(arena.teams.windows(2).map(|w| DiffMessage {
            prior: w[0].prior - w[1].prior,
            likelihood: N_INF,
        }));
        // ... etc, mirroring the original code ...
    }
}
```

(Full rewrite is mechanical — replace each `Vec::new()` / `vec![…]` / `let mut x = … .collect()` with `arena.X.clear(); arena.X.extend(…)`.)

- [ ] **Step 5: Update callers of `Game::new`**

Search:

```bash
grep -rn 'Game::new' src/ benches/
```

Each call site passes a `&mut ScratchArena`. There are two kinds of callers:

1. `Batch::iteration`: holds an arena per `Batch`; passes `&mut self.arena`.
2. `Batch::log_evidence`: same.
3. `History::log_evidence`: routes through `Batch::log_evidence`, which already has access.
4. Tests in `src/game.rs::tests`: each test creates a local `let mut arena = ScratchArena::new();` and passes `&mut arena` into the `Game::new` call.

Update each.

- [ ] **Step 6: Add `arena: ScratchArena` field to `Batch`**

In `src/batch.rs`:

```rust
pub struct Batch {
    pub(crate) events: Vec<Event>,
    pub(crate) skills: SkillStore,
    pub(crate) time: i64,
    p_draw: f64,
    arena: ScratchArena,
}

impl Batch {
    pub fn new(time: i64, p_draw: f64) -> Self {
        Self {
            events: Vec::new(),
            skills: SkillStore::new(),
            time,
            p_draw,
            arena: ScratchArena::new(),
        }
    }
}
```

In `Batch::iteration`, `Batch::log_evidence`, etc., where `Game::new(...)` is called, pass `&mut self.arena`.

- [ ] **Step 7: Run all tests**

```bash
cargo test --features approx
```

The fixed point is unchanged; tests should pass without golden updates.

- [ ] **Step 8: Lint, format**

```bash
cargo clippy --all-targets --features approx -- -D warnings
cargo fmt
```

- [ ] **Step 9: Commit**

```bash
git add src/arena.rs src/lib.rs src/game.rs src/batch.rs
git commit -m "$(cat <<'EOF'
perf(game): reuse buffers via ScratchArena, eliminate per-event allocs

Game::likelihoods previously allocated four Vecs (teams, diffs, ties,
margins) on every call. Batch now owns one ScratchArena reused across
all events in the slice's iteration loops; Game::new clears and writes
into the arena instead of allocating fresh.

For a 60-event slice * 30 convergence iterations, this removes ~7200
small allocations per converge.
EOF
)"
```

---

## Task 7: Final benchmark comparison and acceptance check

**Files:**
- Modify: `benches/baseline.txt` (append T0 numbers)

- [ ] **Step 1: Run final benchmarks**

```bash
cargo bench --bench batch 2>&1 | tee /tmp/bench-batch-t0.txt
cargo bench --bench gaussian 2>&1 | tee /tmp/bench-gaussian-t0.txt
```

- [ ] **Step 2: Append T0 numbers to `benches/baseline.txt`**

Add to the file:

```
# After T0 (date, hardware as above)

Batch::iteration         <X.XX> µs   (speedup: <Y.YY>x)
Gaussian::add            <X.XX> ns   (speedup: <Y.YY>x)
Gaussian::sub            <X.XX> ns   (speedup: <Y.YY>x)
Gaussian::mul            <X.XX> ns   (speedup: <Y.YY>x)
Gaussian::div            <X.XX> ns   (speedup: <Y.YY>x)
Gaussian::pi             <X.XX> ns   (now field read)
Gaussian::tau            <X.XX> ns   (now field read)
```

- [ ] **Step 3: Verify acceptance criterion**

`Batch::iteration` speedup must be **≥3.0×**. If less:
- Re-run `cargo bench` to rule out noise (criterion's confidence interval should make this clear).
- Profile with `cargo flamegraph --bench batch` to find what didn't move. Common culprits: the `sort_perm` allocation, the `posteriors()` round-trip in `Batch::convergence`, the inner `Vec<Vec<Gaussian>>` builds in `Game::posteriors` (untouched in T0).
- If the gap is real, write up a short post-mortem in the commit message and decide whether to add a remediation step or accept the lower speedup. Do not silently lower the bar.

If `Gaussian::mul`/`Gaussian::div` did not improve substantially (~5×+ expected since they become two adds vs. four floating-point ops + a sqrt + a reciprocal), investigate. Likely cause: compiler did not inline; check `#[inline]` annotations.

- [ ] **Step 4: Run the full test suite one last time**

```bash
cargo test --features approx
cargo clippy --all-targets --features approx -- -D warnings
cargo fmt --check
```

All green.

- [ ] **Step 5: Commit the final benchmark numbers**

```bash
git add benches/baseline.txt
git commit -m "$(cat <<'EOF'
bench: capture T0 final numbers; X.YYx speedup on Batch::iteration

T0 acceptance gate met: ≥3x speedup on Batch::iteration vs. baseline.
Closes the T0 tier of the engine redesign.
EOF
)"
```

---

## Self-review notes (post-plan)

**Spec coverage:**
- ✅ Gaussian → natural parameters: Task 2
- ✅ HashMap → dense Vec: Tasks 4 (skills) and 5 (agents)
- ✅ ScratchArena replacing per-event allocs: Task 6
- ✅ Drop `panic!` in `mu_sigma`: Task 3
- ✅ Acceptance: ≥3× speedup gate: Tasks 1 and 7
- ✅ Existing tests pass within ULPs: handled in Task 2 step 6 and Task 4 step 6
- ✅ No top-level public API change: confirmed in acceptance criteria; pub-but-internal `Batch::*` signatures shift but `History`/`Game::posteriors`/etc. unchanged
- ✅ Sparse-vs-dense per-slice open question: defaulted to dense + present mask in Task 4 per spec

**Open items deferred to T1+:**
- Factor graph machinery (`Factor`, `VarStore`, `Schedule`) — entire scope of T1
- Schedule / convergence reporting — T1+T5
- Renames (`Player → Rating`, etc.) — T2
- API surface (`Outcome`, `Member<K>`, `Event<T,K>`) — T2
- Concurrency, color groups, Rayon — T3
- Richer factor types — T4

**Things to watch during execution:**
- Iteration order through `SkillStore`/`AgentStore` is by-Index, where `HashMap` was arbitrary. EP fixed point is order-independent so posteriors should match — but evidence accumulation in the *first* iteration of `Game::likelihoods` only updates `evidence` from one diff in the loop (`if iter == 0`); if cross-event order shifts at the slice level, the per-iteration evidence may differ even though the converged posterior matches. The acceptance test is on posteriors (within ULPs); evidence delta at convergence should match.
- `pub mod batch;` and `pub mod agent;` exposing internals: T0 keeps them `pub` to avoid disturbing benches. T2 will tighten visibility.
- `mu_sigma` may be entirely deleted by Task 2 (its callers were the old `Mul`/`Div` impls). Task 3's wording handles both cases.