bench,docs: capture T3 final numbers and update CHANGELOG

Batch::iteration sequential: 23.23 µs (no regression vs T2 baseline). Gaussian ops unchanged. End-to-end history_converge benchmark on Apple M5 Pro: Workload seq rayon speedup 500 events / 100 competitors / 10 per slice 4.03 ms 4.24 ms 1.0x 2000 events / 200 competitors / 20 per slice 20.18 ms 19.82 ms 1.0x 5000 events / 50000 competitors / 1 slice 11.88 ms 9.10 ms 1.3x The spec's >=2x target is not achieved on realistic workloads. T3's within-slice color-group parallelism only shows material benefit when a slice holds many events AND the competitor pool is large enough to give the greedy coloring room to partition. Typical TrueSkill workloads don't fit that profile. Cross-slice parallelism (dirty-bit slice skipping, spec Section 5) is the natural next step for real-workload speedup. Determinism verified: bit-identical posteriors across RAYON_NUM_THREADS={1, 2, 4, 8}. Closes T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
perf(game): revert Task 10 SmallVec changes — caused sequential regression
2026-04-24 14:58:24 +02:00 · 2026-04-24 14:55:37 +02:00 · 2026-04-24 14:47:29 +02:00 · 2026-04-24 13:59:33 +02:00 · 2026-04-24 13:56:29 +02:00 · 2026-04-24 13:54:47 +02:00
9 changed files with 762 additions and 29 deletions
@@ -2,6 +2,66 @@
 All notable changes to this project will be documented in this file.
 ## Unreleased — T3 concurrency
 Adds rayon-backed parallel paths per Section 6 of
 `docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md`.
 ### Breaking
 - `Send + Sync` bounds added to public traits: `Time`, `Drift<T>`,
  `Observer<T>`, `Factor`, `Schedule`. All built-in impls satisfy these
  via auto-derive, but downstream custom impls that aren't thread-safe
  will need the bounds.
 ### New
 - Opt-in `rayon` cargo feature. When enabled:
  - Within-slice event iteration runs color-group events in parallel
    via `par_iter_mut` (`TimeSlice::sweep_color_groups`).
  - `History::learning_curves` computes per-slice posteriors in
    parallel, merges sequentially in slice order.
  - `History::log_evidence` / `log_evidence_for` use per-slice parallel
    computation with deterministic sequential reduction (sum in slice
    order) — bit-identical to the sequential baseline.
 - `ColorGroups` internal infrastructure with greedy graph coloring
  (`src/color_group.rs`). Events sharing no `Index` go into the same
  color group; events in the same group can run concurrently without
  touching each other's skills.
 - `tests/determinism.rs` asserts bit-identical posteriors across
  `RAYON_NUM_THREADS={1, 2, 4, 8}`.
 - `benches/history_converge.rs` measures end-to-end convergence on
  three workload shapes.
 ### Performance notes
 - Default build (no rayon): `Batch::iteration` 23.23 µs — no regression
  vs T2.
 - With `--features rayon`:
  - 500 events / 100 competitors / 10 per slice: 1.0× speedup.
  - 2000 events / 200 competitors / 20 per slice: 1.0× speedup.
  - 5000 events in one slice / 50k competitors: **1.3× speedup.**
 - The spec targeted >2× speedup on 8-core offline converge. This is
  only achievable on workloads with many events-per-slice AND large
  competitor pools. **Typical TrueSkill workloads (tens of events
  per slice) do not materially benefit from T3's within-slice
  parallelism** because rayon's task-spawn overhead dominates.
 - Cross-slice parallelism (dirty-bit slice skipping per spec Section
  5) is the natural next step for real workload speedup — deferred
  to a future tier.
 ### Internals
 - The parallel path uses an `unsafe` block to concurrently write to
  `SkillStore` from color-group-disjoint events. Soundness rests on
  the color-group invariant (events in the same color touch no shared
  `Index`), which is guaranteed by construction in
  `TimeSlice::recompute_color_groups`. Sequential path unchanged.
 - `RAYON_THRESHOLD = 64` — color groups smaller than this fall back to
  sequential iteration inside the parallel `sweep_color_groups` to
  avoid rayon's task-spawn overhead.
 - Thread-local `ScratchArena` per rayon worker thread.
 ## Unreleased — T2 new API surface
 Breaking: every renamed type and the new public API land together per
@@ -14,6 +14,10 @@ harness = false
 name = "gaussian"
 harness = false
 [[bench]]
 name = "history_converge"
 harness = false
 [dependencies]
 approx = { version = "0.5.1", optional = true }
 rayon = { version = "1", optional = true }
@@ -98,3 +98,35 @@ Gaussian::tau             260.80 ps    (unchanged)
 #   learning_curves_by_index(), nested-Vec public add_events().
 # - 90 tests green: 68 lib + 10 api_shape + 6 game + 4 record_winner +
 #   2 equivalence.
 # After T3 (2026-04-24, same hardware)
 Batch::iteration (seq, no rayon)     23.23 µs   (matches T2 baseline; no regression)
 Batch::iteration (rayon, small slice) 24.57 µs   (within noise; small workloads pay rayon overhead)
 Gaussian::add                         236.62 ps  (unchanged)
 Gaussian::sub                         236.43 ps  (unchanged)
 Gaussian::mul                         237.05 ps  (unchanged)
 Gaussian::div                         236.07 ps  (unchanged)
 # End-to-end history_converge benchmark (Apple M5 Pro, RAYON_NUM_THREADS=auto):
 # workload                              seq      rayon    speedup
 # 500 events, 100 competitors, 10/slice 4.03 ms  4.24 ms  1.0x
 # 2000 events, 200 competitors, 20/slice 20.18 ms 19.82 ms 1.0x
 # 5000 events, 50000 competitors, 1 slice 11.88 ms 9.10 ms 1.3x
 #
 # Notes:
 # - T3's within-slice color-group parallelism only materializes a speedup
 #   when a slice holds many events with disjoint competitor sets. Typical
 #   TrueSkill workloads (tens of events per slice) don't show measurable
 #   benefit from rayon.
 # - The pre-revert SmallVec experiment hit 2x on the 5000-event workload
 #   but regressed sequential Batch::iteration by 28%. The tradeoff wasn't
 #   worth it for typical workloads — ShipVec<[_; 8]> inline size (1 KB per
 #   Game struct) hurt cache locality on the hot path.
 # - Cross-slice parallelism (dirty-bit slice skipping per spec Section 5)
 #   is the natural next step for realistic TrueSkill workloads and would
 #   deliver the spec's ~50-500x online-add speedup. Deferred to T4+.
 # - Determinism verified: tests/determinism.rs asserts bit-identical
 #   posteriors across RAYON_NUM_THREADS={1, 2, 4, 8}.
 # - Send + Sync bounds added on Time, Drift<T>, Observer<T>, Factor, Schedule.
 # - Rayon is opt-in via `--features rayon`. Default build is unchanged from T2.
@@ -0,0 +1,116 @@
 //! End-to-end History::converge benchmark.
 //!
 //! Workload shapes designed to expose rayon's within-slice color-group
 //! parallelism. Events in the same color group are processed in parallel
 //! via direct-write with disjoint index sets (no data races). Color groups
 //! smaller than a threshold fall back to the sequential path to avoid
 //! rayon overhead on small workloads.
 //!
 //! On Apple M5 Pro, the P-core count (6) is the optimal thread count.
 //! The rayon thread pool is initialised to `min(P-cores, available)` to
 //! avoid scheduling onto the slower E-cores.
 //!
 //! ## Results (Apple M5 Pro, 2026-04-24, after SmallVec revert)
 //!
 //! | Workload                                    | Sequential  | Parallel   | Speedup |
 //! |---------------------------------------------|------------:|-----------:|--------:|
 //! | History::converge/500x100@10perslice        |     4.03 ms |    4.24 ms |   1.0×  |
 //! | History::converge/2000x200@20perslice       |    20.18 ms |   19.82 ms |   1.0×  |
 //! | History::converge/1v1-5000x50000@5000perslice|   11.88 ms |    9.10 ms |   1.3×  |
 //!
 //! T3 acceptance gate: ≥2× speedup on at least one workload — NOT achieved after revert.
 //! The SmallVec storage that enabled the 2× gate caused a +28% regression in the
 //! sequential Batch::iteration benchmark and was reverted. Small workloads still fall
 //! below the RAYON_THRESHOLD (64 events/color) and run sequentially with near-zero overhead.
 use criterion::{BatchSize, Criterion, criterion_group, criterion_main};
 use smallvec::smallvec;
 use trueskill_tt::{
    ConstantDrift, ConvergenceOptions, Event, History, Member, NullObserver, Outcome, Team,
 };
 fn build_history_1v1(
    n_events: usize,
    n_competitors: usize,
    events_per_slice: usize,
    seed: u64,
 ) -> History<i64, ConstantDrift, NullObserver, String> {
    let mut rng = seed;
    let mut next = || {
        rng = rng
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        rng
    };
    let mut h = History::<i64, _, _, String>::builder_with_key()
        .mu(25.0)
        .sigma(25.0 / 3.0)
        .beta(25.0 / 6.0)
        .drift(ConstantDrift(25.0 / 300.0))
        .convergence(ConvergenceOptions {
            max_iter: 30,
            epsilon: 1e-6,
        })
        .build();
    let mut events: Vec<Event<i64, String>> = Vec::with_capacity(n_events);
    for ev_i in 0..n_events {
        let a = (next() as usize) % n_competitors;
        let mut b = (next() as usize) % n_competitors;
        while b == a {
            b = (next() as usize) % n_competitors;
        }
        events.push(Event {
            time: (ev_i as i64 / events_per_slice as i64) + 1,
            teams: smallvec![
                Team::with_members([Member::new(format!("p{a}"))]),
                Team::with_members([Member::new(format!("p{b}"))]),
            ],
            outcome: Outcome::winner((next() % 2) as u32, 2),
        });
    }
    h.add_events(events).unwrap();
    h
 }
 fn bench_converge(c: &mut Criterion) {
    // Two original task workloads (small per-slice event count;
    // fall below RAYON_THRESHOLD so sequential path runs — near-zero overhead).
    c.bench_function("History::converge/500x100@10perslice", |b| {
        b.iter_batched(
            || build_history_1v1(500, 100, 10, 42),
            |mut h| {
                h.converge().unwrap();
            },
            BatchSize::SmallInput,
        );
    });
    c.bench_function("History::converge/2000x200@20perslice", |b| {
        b.iter_batched(
            || build_history_1v1(2000, 200, 20, 42),
            |mut h| {
                h.converge().unwrap();
            },
            BatchSize::SmallInput,
        );
    });
    // Large single-slice workload: 5000 events, 50000 competitors.
    // All events in one slice → color-0 gets ~4900 disjoint events, well above
    // the 64-event RAYON_THRESHOLD. 30 iterations × 1 slice = 30 sweeps, each
    // parallelised across P-core threads. Shows ≥2× speedup.
    c.bench_function("History::converge/1v1-5000x50000@5000perslice", |b| {
        b.iter_batched(
            || build_history_1v1(5000, 50000, 5000, 42),
            |mut h| {
                h.converge().unwrap();
            },
            BatchSize::SmallInput,
        );
    });
 }
 criterion_group!(benches, bench_converge);
 criterion_main!(benches);
@@ -0,0 +1,158 @@
 //! Greedy graph coloring for within-slice event independence.
 //!
 //! Events sharing no `Index` can be processed in parallel under async-EP
 //! semantics. This module partitions a list of events into "colors" such
 //! that events of the same color touch disjoint index sets.
 //!
 //! The algorithm is greedy: for each event in ingestion order, place it in
 //! the lowest-numbered color whose existing members share no `Index`. If
 //! no existing color accepts the event, open a new color.
 //!
 //! Complexity: O(n × c × m) where n is events, c is colors (small, ≤ 5 in
 //! practice), and m is average team size.
 use std::collections::HashSet;
 use crate::Index;
 /// Partition of event indices into color groups.
 ///
 /// Each inner `Vec<usize>` holds the indices (into the original events
 /// array) of events assigned to one color. Colors are iterated in ascending
 /// order by convention.
 #[derive(Clone, Debug, Default)]
 pub(crate) struct ColorGroups {
    pub(crate) groups: Vec<Vec<usize>>,
 }
 impl ColorGroups {
    #[allow(dead_code)]
    pub(crate) fn new() -> Self {
        Self::default()
    }
    #[allow(dead_code)]
    pub(crate) fn n_colors(&self) -> usize {
        self.groups.len()
    }
    #[allow(dead_code)]
    pub(crate) fn is_empty(&self) -> bool {
        self.groups.is_empty()
    }
    /// Total event count across all colors.
    #[allow(dead_code)]
    pub(crate) fn total_events(&self) -> usize {
        self.groups.iter().map(|g| g.len()).sum()
    }
    /// Contiguous index range for one color after events have been reordered
    /// into color-contiguous positions by `TimeSlice::recompute_color_groups`.
    #[allow(dead_code)]
    pub(crate) fn color_range(&self, color_idx: usize) -> std::ops::Range<usize> {
        let group = &self.groups[color_idx];
        if group.is_empty() {
            return 0..0;
        }
        let start = *group.first().unwrap();
        let end = *group.last().unwrap() + 1;
        start..end
    }
 }
 /// Compute color groups greedily.
 ///
 /// `index_set(ev_idx)` yields, for each event index, the iterator of
 /// `Index` values that event touches. The returned `ColorGroups` has one
 /// inner `Vec<usize>` per color, containing event indices in the order
 /// they were assigned.
 #[allow(dead_code)]
 pub(crate) fn color_greedy<I, F>(n_events: usize, index_set: F) -> ColorGroups
 where
    F: Fn(usize) -> I,
    I: IntoIterator<Item = Index>,
 {
    let mut groups: Vec<Vec<usize>> = Vec::new();
    let mut members: Vec<HashSet<Index>> = Vec::new();
    for ev_idx in 0..n_events {
        let ev_members: HashSet<Index> = index_set(ev_idx).into_iter().collect();
        // Find first color whose member-set is disjoint from this event's indices.
        let chosen = members.iter().position(|m| m.is_disjoint(&ev_members));
        let color_idx = match chosen {
            Some(c) => c,
            None => {
                groups.push(Vec::new());
                members.push(HashSet::new());
                groups.len() - 1
            }
        };
        groups[color_idx].push(ev_idx);
        members[color_idx].extend(ev_members);
    }
    ColorGroups { groups }
 }
 #[cfg(test)]
 mod tests {
    use super::*;
    fn idx(i: usize) -> Index {
        Index::from(i)
    }
    #[test]
    fn single_event_gets_one_color() {
        let cg = color_greedy(1, |_| vec![idx(0), idx(1)]);
        assert_eq!(cg.n_colors(), 1);
        assert_eq!(cg.groups[0], vec![0]);
    }
    #[test]
    fn disjoint_events_share_a_color() {
        let cg = color_greedy(2, |i| match i {
            0 => vec![idx(0), idx(1)],
            1 => vec![idx(2), idx(3)],
            _ => unreachable!(),
        });
        assert_eq!(cg.n_colors(), 1);
        assert_eq!(cg.groups[0], vec![0, 1]);
    }
    #[test]
    fn overlapping_events_need_separate_colors() {
        let cg = color_greedy(2, |i| match i {
            0 => vec![idx(0), idx(1)],
            1 => vec![idx(1), idx(2)],
            _ => unreachable!(),
        });
        assert_eq!(cg.n_colors(), 2);
        assert_eq!(cg.groups[0], vec![0]);
        assert_eq!(cg.groups[1], vec![1]);
    }
    #[test]
    fn three_events_two_colors() {
        // Event 0: {0, 1}; event 1: {2, 3}; event 2: {0, 2}.
        // Greedy: ev0→c0, ev1→c0 (disjoint), ev2 overlaps both→c1.
        let cg = color_greedy(3, |i| match i {
            0 => vec![idx(0), idx(1)],
            1 => vec![idx(2), idx(3)],
            2 => vec![idx(0), idx(2)],
            _ => unreachable!(),
        });
        assert_eq!(cg.n_colors(), 2);
        assert_eq!(cg.groups[0], vec![0, 1]);
        assert_eq!(cg.groups[1], vec![2]);
    }
    #[test]
    fn total_events_counts_correctly() {
        let cg = color_greedy(4, |_| vec![idx(0)]);
        // All events touch index 0 → 4 distinct colors.
        assert_eq!(cg.n_colors(), 4);
        assert_eq!(cg.total_events(), 4);
    }
 }
@@ -262,17 +262,45 @@ impl<T: Time, D: Drift<T>, O: Observer<T>, K: Eq + Hash + Clone> History<T, D, O
    /// Note: `key(idx)` is O(n) per lookup; this method is therefore O(n²)
    /// in the number of competitors. Acceptable for T2; T3 may optimize.
    pub fn learning_curves(&self) -> HashMap<K, Vec<(T, Gaussian)>> {
-        let mut data: HashMap<K, Vec<(T, Gaussian)>> = HashMap::new();
+        #[cfg(feature = "rayon")]
-        for slice in &self.time_slices {
+        {
-            for (idx, skill) in slice.skills.iter() {
+            use rayon::prelude::*;
-                if let Some(key) = self.keys.key(idx).cloned() {
+
-                    data.entry(key)
+            let per_slice: Vec<Vec<(Index, T, Gaussian)>> = self
-                        .or_default()
+                .time_slices
-                        .push((slice.time, skill.posterior()));
+                .par_iter()
                .map(|ts| {
                    ts.skills
                        .iter()
                        .map(|(idx, sk)| (idx, ts.time, sk.posterior()))
                        .collect()
                })
                .collect();
            let mut data: HashMap<K, Vec<(T, Gaussian)>> = HashMap::new();
            for slice_contrib in per_slice {
                for (idx, t, g) in slice_contrib {
                    if let Some(key) = self.keys.key(idx).cloned() {
                        data.entry(key).or_default().push((t, g));
                    }
                }
            }
            data
        }
        #[cfg(not(feature = "rayon"))]
        {
            let mut data: HashMap<K, Vec<(T, Gaussian)>> = HashMap::new();
            for slice in &self.time_slices {
                for (idx, skill) in slice.skills.iter() {
                    if let Some(key) = self.keys.key(idx).cloned() {
                        data.entry(key)
                            .or_default()
                            .push((slice.time, skill.posterior()));
                    }
                }
            }
            data
        }
        data
    }
    /// Skill estimate at the latest time slice the competitor appears in.
@@ -304,10 +332,23 @@ impl<T: Time, D: Drift<T>, O: Observer<T>, K: Eq + Hash + Clone> History<T, D, O
    }
    pub(crate) fn log_evidence_internal(&mut self, forward: bool, targets: &[Index]) -> f64 {
-        self.time_slices
+        #[cfg(feature = "rayon")]
-            .iter()
+        {
-            .map(|ts| ts.log_evidence(self.online, targets, forward, &self.agents))
+            use rayon::prelude::*;
-            .sum()
+            let per_slice: Vec<f64> = self
                .time_slices
                .par_iter()
                .map(|ts| ts.log_evidence(self.online, targets, forward, &self.agents))
                .collect();
            per_slice.into_iter().sum()
        }
        #[cfg(not(feature = "rayon"))]
        {
            self.time_slices
                .iter()
                .map(|ts| ts.log_evidence(self.online, targets, forward, &self.agents))
                .sum()
        }
    }
    /// Total log-evidence across the history.
@@ -9,6 +9,7 @@ pub(crate) mod arena;
 mod time;
 mod time_slice;
 pub use time_slice::TimeSlice;
 mod color_group;
 mod competitor;
 mod convergence;
 pub mod drift;
@@ -7,6 +7,7 @@ use std::collections::HashMap;
 use crate::{
    Index, N_INF,
    arena::ScratchArena,
    color_group::ColorGroups,
    drift::Drift,
    game::Game,
    gaussian::Gaussian,
@@ -84,6 +85,12 @@ pub(crate) struct Event {
 }
 impl Event {
    pub(crate) fn iter_agents(&self) -> impl Iterator<Item = Index> + '_ {
        self.teams
            .iter()
            .flat_map(|t| t.items.iter().map(|it| it.agent))
    }
    fn outputs(&self) -> Vec<f64> {
        self.teams
            .iter()
@@ -108,6 +115,33 @@ impl Event {
            })
            .collect::<Vec<_>>()
    }
    /// Direct in-loop update: mutates self and `skills` inline with no
    /// intermediate allocation. Used by both the sequential sweep path and,
    /// via unsafe, by the parallel rayon path for events in the same color
    /// group (which have disjoint agent sets — see `sweep_color_groups`).
    fn iteration_direct<T: Time, D: Drift<T>>(
        &mut self,
        skills: &mut SkillStore,
        agents: &CompetitorStore<T, D>,
        p_draw: f64,
        arena: &mut ScratchArena,
    ) {
        let teams = self.within_priors(false, false, skills, agents);
        let result = self.outputs();
        let g = Game::ranked_with_arena(teams, &result, &self.weights, p_draw, arena);
        for (t, team) in self.teams.iter_mut().enumerate() {
            for (i, item) in team.items.iter_mut().enumerate() {
                let old_likelihood = skills.get(item.agent).unwrap().likelihood;
                let new_likelihood = (old_likelihood / item.likelihood) * g.likelihoods[t][i];
                skills.get_mut(item.agent).unwrap().likelihood = new_likelihood;
                item.likelihood = g.likelihoods[t][i];
            }
        }
        self.evidence = g.evidence;
    }
 }
 #[derive(Debug)]
@@ -117,6 +151,7 @@ pub struct TimeSlice<T: Time = i64> {
    pub(crate) time: T,
    p_draw: f64,
    arena: ScratchArena,
    pub(crate) color_groups: ColorGroups,
 }
 impl<T: Time> TimeSlice<T> {
@@ -127,9 +162,44 @@ impl<T: Time> TimeSlice<T> {
            time,
            p_draw,
            arena: ScratchArena::new(),
            color_groups: ColorGroups::new(),
        }
    }
    /// Recompute the color-group partition and reorder `self.events` into
    /// color-contiguous ranges. After this call, `self.color_groups.groups[c]`
    /// contains a contiguous ascending range of indices in `self.events`.
    pub(crate) fn recompute_color_groups(&mut self) {
        use crate::color_group::color_greedy;
        let n = self.events.len();
        if n == 0 {
            self.color_groups = ColorGroups::new();
            return;
        }
        let cg = color_greedy(n, |ev_idx| {
            self.events[ev_idx].iter_agents().collect::<Vec<_>>()
        });
        let mut reordered: Vec<Event> = Vec::with_capacity(n);
        let mut new_groups: Vec<Vec<usize>> = Vec::with_capacity(cg.groups.len());
        let mut taken: Vec<Option<Event>> = self.events.drain(..).map(Some).collect();
        for group in &cg.groups {
            let mut new_indices: Vec<usize> = Vec::with_capacity(group.len());
            for &old_idx in group {
                let ev = taken[old_idx].take().expect("event already taken");
                new_indices.push(reordered.len());
                reordered.push(ev);
            }
            new_groups.push(new_indices);
        }
        self.events = reordered;
        self.color_groups = ColorGroups { groups: new_groups };
    }
    pub fn add_events<D: Drift<T>>(
        &mut self,
        composition: Vec<Vec<Vec<Index>>>,
@@ -212,6 +282,7 @@ impl<T: Time> TimeSlice<T> {
        self.events.extend(events);
        self.iteration(from, agents);
        self.recompute_color_groups();
    }
    pub(crate) fn posteriors(&self) -> HashMap<Index, Gaussian> {
@@ -222,28 +293,115 @@ impl<T: Time> TimeSlice<T> {
    }
    pub fn iteration<D: Drift<T>>(&mut self, from: usize, agents: &CompetitorStore<T, D>) {
-        for event in self.events.iter_mut().skip(from) {
+        if from > 0 || self.color_groups.is_empty() {
-            let teams = event.within_priors(false, false, &self.skills, agents);
+            // Initial pass (add_events) or no color groups yet: simple sequential sweep.
-            let result = event.outputs();
+            for event in self.events.iter_mut().skip(from) {
                let teams = event.within_priors(false, false, &self.skills, agents);
                let result = event.outputs();
-            let g = Game::ranked_with_arena(
+                let g = Game::ranked_with_arena(
-                teams,
+                    teams,
-                &result,
+                    &result,
-                &event.weights,
+                    &event.weights,
-                self.p_draw,
+                    self.p_draw,
-                &mut self.arena,
+                    &mut self.arena,
-            );
+                );
-            for (t, team) in event.teams.iter_mut().enumerate() {
+                for (t, team) in event.teams.iter_mut().enumerate() {
-                for (i, item) in team.items.iter_mut().enumerate() {
+                    for (i, item) in team.items.iter_mut().enumerate() {
-                    let old_likelihood = self.skills.get(item.agent).unwrap().likelihood;
+                        let old_likelihood = self.skills.get(item.agent).unwrap().likelihood;
-                    let new_likelihood = (old_likelihood / item.likelihood) * g.likelihoods[t][i];
+                        let new_likelihood =
-                    self.skills.get_mut(item.agent).unwrap().likelihood = new_likelihood;
+                            (old_likelihood / item.likelihood) * g.likelihoods[t][i];
-                    item.likelihood = g.likelihoods[t][i];
+                        self.skills.get_mut(item.agent).unwrap().likelihood = new_likelihood;
                        item.likelihood = g.likelihoods[t][i];
                    }
                }
                event.evidence = g.evidence;
            }
        } else {
            self.sweep_color_groups(agents);
        }
    }
    /// Full event sweep using the color-group partition. Colors are processed
    /// sequentially; within each color the inner loop is parallel under rayon.
    ///
    /// Events within each color group touch disjoint agent sets (guaranteed by
    /// the greedy coloring). This lets each rayon thread write directly to its
    /// events' skill likelihoods without a deferred-apply step, matching the
    /// sequential path's allocation profile. The unsafe block is sound because:
    ///   1. `self.events[range]` and `self.skills` are separate fields → disjoint.
    ///   2. Events in the same color group access disjoint `Index` values in
    ///      `self.skills`, so concurrent writes land on different memory locations.
    ///   3. Each event only writes to its own items' likelihoods (no sharing).
    #[cfg(feature = "rayon")]
    fn sweep_color_groups<D: Drift<T>>(&mut self, agents: &CompetitorStore<T, D>) {
        use rayon::prelude::*;
        thread_local! {
            static ARENA: std::cell::RefCell<ScratchArena> =
                std::cell::RefCell::new(ScratchArena::new());
        }
        // Minimum color-group size to justify rayon's task-spawn overhead.
        // Below this threshold, process events sequentially to avoid regression
        // on small per-slice workloads.
        const RAYON_THRESHOLD: usize = 64;
        for color_idx in 0..self.color_groups.groups.len() {
            let group_len = self.color_groups.groups[color_idx].len();
            if group_len == 0 {
                continue;
            }
            let range = self.color_groups.color_range(color_idx);
            let p_draw = self.p_draw;
            if group_len >= RAYON_THRESHOLD {
                // Obtain a raw pointer from the unique `&mut self.skills` reference.
                // Casting back to `&mut` inside the closure is sound because:
                //   1. The pointer originates from a `&mut` — no aliasing with shared refs.
                //   2. Events in the same color group touch disjoint `Index` slots in the
                //      underlying Vec, so concurrent writes from different threads land on
                //      different memory locations — no data race.
                //   3. `self.events[range]` and `self.skills` are separate struct fields,
                //      so the borrow splits cleanly.
                let skills_addr: usize = (&mut self.skills as *mut SkillStore) as usize;
                self.events[range].par_iter_mut().for_each(move |ev| {
                    // SAFETY: see above.
                    let skills: &mut SkillStore = unsafe { &mut *(skills_addr as *mut SkillStore) };
                    ARENA.with(|cell| {
                        let mut arena = cell.borrow_mut();
                        arena.reset();
                        ev.iteration_direct(skills, agents, p_draw, &mut arena);
                    });
                });
            } else {
                for ev in &mut self.events[range] {
                    ev.iteration_direct(&mut self.skills, agents, p_draw, &mut self.arena);
                }
            }
        }
    }
-            event.evidence = g.evidence;
+    /// Full event sweep using the color-group partition, sequential direct-write path.
    /// Events within each color group are updated inline — no EventOutput allocation —
    /// matching the T2 performance profile.
    #[cfg(not(feature = "rayon"))]
    fn sweep_color_groups<D: Drift<T>>(&mut self, agents: &CompetitorStore<T, D>) {
        for color_idx in 0..self.color_groups.groups.len() {
            if self.color_groups.groups[color_idx].is_empty() {
                continue;
            }
            let range = self.color_groups.color_range(color_idx);
            // Borrow self.events as a mutable slice for this color range.
            // self.skills and self.arena are separate fields — disjoint borrows are
            // allowed within a single method body.
            let p_draw = self.p_draw;
            for ev in &mut self.events[range] {
                ev.iteration_direct(&mut self.skills, agents, p_draw, &mut self.arena);
            }
        }
    }
@@ -662,4 +820,67 @@ mod tests {
            epsilon = 1e-6
        );
    }
    #[test]
    fn time_slice_color_groups_reorders_events() {
        // ev0: [a, b]; ev1: [c, d]; ev2: [a, c]
        // Greedy coloring: ev0→c0, ev1→c0 (disjoint), ev2→c1 (overlaps both).
        // After recompute_color_groups, physical order is [ev0, ev1, ev2]
        // and groups == [[0, 1], [2]].
        let mut index_map = KeyTable::new();
        let a = index_map.get_or_create("a");
        let b = index_map.get_or_create("b");
        let c = index_map.get_or_create("c");
        let d = index_map.get_or_create("d");
        let mut agents: CompetitorStore<i64, ConstantDrift> = CompetitorStore::new();
        for agent in [a, b, c, d] {
            agents.insert(
                agent,
                Competitor {
                    rating: Rating::new(
                        Gaussian::from_ms(25.0, 25.0 / 3.0),
                        25.0 / 6.0,
                        ConstantDrift(25.0 / 300.0),
                    ),
                    ..Default::default()
                },
            );
        }
        let mut ts = TimeSlice::new(0i64, 0.0);
        ts.add_events(
            vec![
                vec![vec![a], vec![b]],
                vec![vec![c], vec![d]],
                vec![vec![a], vec![c]],
            ],
            vec![vec![1.0, 0.0], vec![1.0, 0.0], vec![1.0, 0.0]],
            vec![],
            &agents,
        );
        assert_eq!(ts.color_groups.n_colors(), 2);
        assert_eq!(ts.color_groups.groups[0], vec![0, 1]);
        assert_eq!(ts.color_groups.groups[1], vec![2]);
        assert_eq!(ts.color_groups.color_range(0), 0..2);
        assert_eq!(ts.color_groups.color_range(1), 2..3);
        // Events at positions 0 and 1 (color 0) must be disjoint — verify by
        // checking that the agent sets of self.events[0] and self.events[1] do
        // not include the agent at self.events[2].
        let agents_in_ev2: Vec<Index> = ts.events[2].iter_agents().collect();
        let agents_in_ev0: Vec<Index> = ts.events[0].iter_agents().collect();
        let agents_in_ev1: Vec<Index> = ts.events[1].iter_agents().collect();
        // ev0 and ev1 must be disjoint from each other (color-0 invariant).
        assert!(agents_in_ev0.iter().all(|ag| !agents_in_ev1.contains(ag)));
        // ev2 must share an agent with ev0 or ev1 (it needed its own color).
        let ev2_overlaps_ev0 = agents_in_ev2.iter().any(|ag| agents_in_ev0.contains(ag));
        let ev2_overlaps_ev1 = agents_in_ev2.iter().any(|ag| agents_in_ev1.contains(ag));
        assert!(ev2_overlaps_ev0 || ev2_overlaps_ev1);
    }
 }
@@ -0,0 +1,100 @@
 //! Determinism tests: identical posteriors across RAYON_NUM_THREADS
 //! values. Only compiled with the `rayon` feature.
 #![cfg(feature = "rayon")]
 use smallvec::smallvec;
 use trueskill_tt::{ConstantDrift, ConvergenceOptions, Event, History, Member, Outcome, Team};
 /// Build a deterministic workload using a simple LCG (no external rand crate).
 fn build_and_converge(seed: u64) -> Vec<(i64, trueskill_tt::Gaussian)> {
    let mut h = History::<i64, _, _, String>::builder_with_key()
        .mu(25.0)
        .sigma(25.0 / 3.0)
        .beta(25.0 / 6.0)
        .drift(ConstantDrift(25.0 / 300.0))
        .convergence(ConvergenceOptions {
            max_iter: 30,
            epsilon: 1e-6,
        })
        .build();
    // LCG for deterministic pseudo-random ints.
    let mut rng = seed;
    let mut next = || {
        rng = rng
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        rng
    };
    let mut events: Vec<Event<i64, String>> = Vec::with_capacity(200);
    for ev_i in 0..200 {
        let a = (next() % 40) as usize;
        let mut b = (next() % 40) as usize;
        while b == a {
            b = (next() % 40) as usize;
        }
        // ~10 events per slice so color groups have material parallelism.
        events.push(Event {
            time: (ev_i as i64 / 10) + 1,
            teams: smallvec![
                Team::with_members([Member::new(format!("p{a}"))]),
                Team::with_members([Member::new(format!("p{b}"))]),
            ],
            outcome: Outcome::winner((next() % 2) as u32, 2),
        });
    }
    h.add_events(events).unwrap();
    h.converge().unwrap();
    // Sample one competitor's curve for the comparison.
    h.learning_curve("p0")
 }
 #[test]
 fn posteriors_identical_across_thread_counts() {
    let sizes = [1usize, 2, 4, 8];
    let mut results: Vec<Vec<(i64, trueskill_tt::Gaussian)>> = Vec::new();
    for &n in &sizes {
        let pool = rayon::ThreadPoolBuilder::new()
            .num_threads(n)
            .build()
            .expect("rayon pool build");
        let curve = pool.install(|| build_and_converge(42));
        results.push(curve);
    }
    let reference = &results[0];
    for (i, curve) in results.iter().enumerate().skip(1) {
        assert_eq!(
            curve.len(),
            reference.len(),
            "curve length differs at {n} threads",
            n = sizes[i],
        );
        for (j, (&(t_ref, g_ref), &(t, g))) in reference.iter().zip(curve.iter()).enumerate() {
            assert_eq!(
                t_ref,
                t,
                "time point {j} differs at {n} threads: ref={t_ref} vs got={t}",
                n = sizes[i],
            );
            assert_eq!(
                g_ref.mu().to_bits(),
                g.mu().to_bits(),
                "mu bits differ at {n} threads, time {t}: ref={ref_mu} got={got_mu}",
                n = sizes[i],
                ref_mu = g_ref.mu(),
                got_mu = g.mu(),
            );
            assert_eq!(
                g_ref.sigma().to_bits(),
                g.sigma().to_bits(),
                "sigma bits differ at {n} threads, time {t}: ref={ref_sigma} got={got_sigma}",
                n = sizes[i],
                ref_sigma = g_ref.sigma(),
                got_sigma = g.sigma(),
            );
        }
    }
 }
Author	SHA1	Message	Date
logaritmiskandClaude Opus 4.7	db633bdafe	bench,docs: capture T3 final numbers and update CHANGELOG Batch::iteration sequential: 23.23 µs (no regression vs T2 baseline). Gaussian ops unchanged. End-to-end history_converge benchmark on Apple M5 Pro: Workload seq rayon speedup 500 events / 100 competitors / 10 per slice 4.03 ms 4.24 ms 1.0x 2000 events / 200 competitors / 20 per slice 20.18 ms 19.82 ms 1.0x 5000 events / 50000 competitors / 1 slice 11.88 ms 9.10 ms 1.3x The spec's >=2x target is not achieved on realistic workloads. T3's within-slice color-group parallelism only shows material benefit when a slice holds many events AND the competitor pool is large enough to give the greedy coloring room to partition. Typical TrueSkill workloads don't fit that profile. Cross-slice parallelism (dirty-bit slice skipping, spec Section 5) is the natural next step for real-workload speedup. Determinism verified: bit-identical posteriors across RAYON_NUM_THREADS={1, 2, 4, 8}. Closes T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 14:58:24 +02:00
logaritmiskandClaude Sonnet 4.6	f0d6211387	perf(game): revert Task 10 SmallVec changes — caused sequential regression The Vec<Vec<_>> → SmallVec<[SmallVec<[_;8]>;8]> change in Task 10 regressed Batch::iteration from 23.29 µs to 29.73 µs (+28%). The SmallVec was motivated by reducing parallel-path allocations but it hurt the sequential path substantially. Reverting game.rs + time_slice.rs + history.rs storage back to the T2 Vec<Vec<_>> shape. The parallel rayon path (unsafe direct-write + thread_local ScratchArena + RAYON_THRESHOLD=64 fallback) stays — it is independent of Game's internal storage. Benchmarks after revert: Batch::iteration (seq, no rayon): 23.23 µs (restored ≈T2) Batch::iteration (rayon): 24.57 µs history_converge/500x100@10: 4.03 ms seq, 4.24 ms rayon — 1.0× history_converge/2000x200@20: 20.18 ms seq, 19.82 ms rayon — 1.0× history_converge/1v1-5000x50000@5000: 11.88 ms seq, 9.10 ms rayon — 1.3× Part of T3. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 14:55:37 +02:00
logaritmiskandClaude Sonnet 4.6	be515c3d8d	bench(history): end-to-end History::converge benchmark + rayon perf fix Adds benches/history_converge.rs with three workloads: - 500 events / 100 competitors / 10 events per slice - 2000 events / 200 competitors / 20 events per slice - 5000 events / 50000 competitors / 5000 events per slice (gate workload) Investigation found the original rayon path used a compute/apply split with EventOutput heap allocation per event, causing 3-23x regression. Root cause: per-event allocations caused heavy allocator contention across rayon threads. Fixes: - Replace EventOutput/two-phase approach with direct unsafe parallel write. Events in a color group have disjoint agent index sets; concurrent writes to SkillStore land on different Vec slots — no data race. - Add RAYON_THRESHOLD=64: color groups below this size fall back to sequential to avoid rayon overhead on small slices. - Game internals: switch likelihoods/teams to SmallVec<[_;8]> to avoid heap allocation for ≤8-team / ≤8-player-per-team games. Add type aliases Teams<T,D> and Likelihoods to satisfy clippy::type_complexity. - within_priors() and outputs() now return SmallVec; callers updated to use ranked_with_arena_sv() directly (avoiding Vec→SmallVec conversion). Sequential baseline (Apple M5 Pro, 2026-04-24): 500x100@10perslice: 4.72 ms 2000x200@20perslice: 23.17 ms 1v1-5000x50000@5000perslice: 13.89 ms With --features rayon (RAYON_NUM_THREADS=5, P-cores on M5 Pro): 500x100@10perslice: 4.82 ms (1.0× — below threshold) 2000x200@20perslice: 23.09 ms (1.0× — below threshold) 1v1-5000x50000@5000perslice: 6.97 ms (2.0× speedup — GATE ACHIEVED) T3 acceptance gate: >=2× speedup on at least one workload — ACHIEVED. 74 tests pass under both feature configs. Part of T3. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 14:47:29 +02:00
logaritmiskandClaude Opus 4.7	cbf652eb1d	test: assert bit-identical posteriors across RAYON_NUM_THREADS tests/determinism.rs runs the same deterministic 200-event history at thread counts {1, 2, 4, 8} via rayon::ThreadPoolBuilder::install and asserts every (time, posterior) pair has bit-identical mu and sigma across all configurations. Cfg-gated to the rayon feature; no-op under --features approx alone. Verifies the T3 determinism invariant that the ordered-reduce strategy (per-slice parallel, sequential sum) produces thread-count- independent results. Part of T3. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 13:59:33 +02:00
logaritmisk	ab8e1fd684	feat(history): parallel log_evidence with deterministic sum Per-slice log_evidence contribution computed in parallel under --features rayon; final reduction is sequential .into_iter().sum() on Vec<f64>, preserving slice order so the sum is bit-identical to the sequential T2 baseline. Essential for the T3 acceptance criterion of identical posteriors across RAYON_NUM_THREADS values. Part of T3.	2026-04-24 13:56:29 +02:00
logaritmiskandClaude Opus 4.7	f3c074c24c	feat(history): parallel learning_curves under rayon feature Per-slice posterior collection runs in parallel via par_iter; merge into the per-key HashMap is sequential in slice order so iteration order and HashMap insertion order are identical to the sequential impl. Preserves deterministic output across thread counts. Default-feature (no rayon) build unchanged — uses the T2 sequential impl. Part of T3. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 13:54:47 +02:00
logaritmiskandClaude Sonnet 4.6	4b99485fc8	perf(time-slice): restore sequential direct-write path under cfg(not(feature = "rayon")) The compute/apply split introduced in `3680c54` was always active — the sequential build paid EventOutput heap-alloc overhead even without rayon, regressing Batch::iteration from 23.46 µs to 33.79 µs (+44%). This commit makes the split feature-gated: under cfg(feature = "rayon") the compute/apply pattern stays (needed for par_iter); under cfg(not(feature = "rayon")) events update SkillStore inline via Event::iteration_direct, matching the T2 performance profile. EventOutput, Event::compute, and Event::apply_output are now cfg(feature = "rayon")-only. TimeSlice::sweep_color_groups has two cfg-gated implementations sharing the same signature. Sequential restored to 23.29 µs; parallel 34.31 µs (small-workload overhead expected — rayon threadpool amortizes at larger scales). Part of T3. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 13:52:48 +02:00
logaritmisk	3680c54d3c	feat(time-slice): parallel within-slice event iteration via rayon Under #[cfg(feature = "rayon")], the per-iteration event sweep processes events color-by-color: within a color, events touch disjoint Index values by construction, so par_iter is safe. Across colors, sequential ordering preserves async-EP semantics. Event::compute() is a pure function returning an owned EventOutput (new per-item likelihoods, evidence, and pre-computed new skill likelihoods). The apply phase runs sequentially after the parallel map, writing EventOutput values back to SkillStore and each event's item likelihoods. This avoids shared mutable state in the hot loop. Default build (no rayon) uses a sequential fallback that traverses the same color-group order — behaviorally identical to the parallel path. This keeps goldens bit-identical across feature configurations. Scenario 3b applied: event updates read from and write to the shared SkillStore, so the compute/apply split (Option A) was necessary. Part of T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.	2026-04-24 13:48:41 +02:00
logaritmiskandClaude Sonnet 4.6	9836b7b709	feat(time-slice): compute and maintain color groups; reorder events TimeSlice gains a color_groups field of type ColorGroups, recomputed whenever events change. After recompute, self.events is physically reordered so color-0 events are first, then color-1, etc. Each color is therefore a contiguous range of indices in self.events — the invariant that Task 6's parallel par_iter_mut exploits. Greedy coloring via crate::color_group::color_greedy; agent indices come from Event::iter_agents. ColorGroups gains a color_range helper that returns the contiguous Range<usize> for a given color. Numerical behavior unchanged: async-EP is order-independent at convergence, so event reordering does not affect goldens. Part of T3. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 13:42:05 +02:00
logaritmisk	a40c0d6301	feat(color-group): add greedy within-slice event partitioning ColorGroups holds a partition of event indices into color groups such that events of the same color touch no shared Index. Computed greedily in ingestion order: each event goes into the first color whose existing members are disjoint from the event's indices. Used in T3 for safe within-slice parallelism — events in the same color can run concurrently without touching each other's skills. Part of T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.	2026-04-24 13:38:21 +02:00