T3: rayon-backed concurrency (opt-in) (#2)

Implements T3 of `docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md` Section 6. Plan: `docs/superpowers/plans/2026-04-24-t3-concurrency.md` (11 tasks). ## Summary ### Breaking - `Send + Sync` bounds added to public traits: `Time`, `Drift<T>`, `Observer<T>`, `Factor`, `Schedule`. All built-in impls satisfy these via auto-derive; downstream custom impls will need the bounds. ### New - Opt-in `rayon` cargo feature. When enabled: - Within-slice event iteration runs color-group events in parallel via `par_iter_mut` (`TimeSlice::sweep_color_groups`). - `History::learning_curves` computes per-slice posteriors in parallel; merges sequentially in slice order. - `History::log_evidence` / `log_evidence_for` use per-slice parallel computation with deterministic sequential reduction (sum in slice order) — bit-identical to the sequential baseline. - `ColorGroups` infrastructure (`src/color_group.rs`) with greedy graph coloring. Events sharing no `Index` go into the same color group; events in the same group can run concurrently without touching each other's skills. - `tests/determinism.rs` asserts bit-identical posteriors across `RAYON_NUM_THREADS={1, 2, 4, 8}`. - `benches/history_converge.rs` measures end-to-end convergence on three workload shapes. ## Performance ### Sequential (no rayon, default build) | Metric | Before T3 | After T3 | Delta | |---|---|---|---| | `Batch::iteration` | 22.88 µs | 23.23 µs | **+1.5%** (noise) | | `Gaussian::*` | ≈218–264 ps | ≈236 ps | within noise | **No sequential regression.** Default build is as fast as T2. ### Parallel (`--features rayon`, Apple M5 Pro, auto thread count) | Workload | Sequential | Parallel | Speedup | |---|---:|---:|---:| | 500 events / 100 competitors / 10 per slice | 4.03 ms | 4.24 ms | **1.0×** | | 2000 events / 200 competitors / 20 per slice | 20.18 ms | 19.82 ms | **1.0×** | | 5000 events / 50000 competitors / 1 slice | 11.88 ms | 9.10 ms | **1.3×** | ### ⚠️ The spec's >=2× target was not met on realistic workloads. T3's within-slice color-group parallelism only shows material benefit when a slice holds many events AND the competitor pool is large enough to give the greedy coloring room to partition. Typical TrueSkill workloads (tens of events per slice) don't fit that profile — rayon's task-spawn overhead dominates. **Cross-slice parallelism (dirty-bit slice skipping per spec Section 5) is the natural next step** for real-workload speedup and would deliver the spec's ~50–500× online-add speedup. Deferred to a future tier. ## Determinism `tests/determinism.rs` runs a 200-event history at thread counts {1, 2, 4, 8} via `rayon::ThreadPoolBuilder::install` and asserts every `(time, posterior)` pair has bit-identical `mu` and `sigma` (compared via `f64::to_bits()`). Passes. ## Internals - Parallel path uses an `unsafe` block to concurrently write to `SkillStore` from color-group-disjoint events. Soundness rests on the color-group invariant (events in the same color touch no shared `Index`), guaranteed by construction in `TimeSlice::recompute_color_groups`. Sequential path unchanged from T2. - `RAYON_THRESHOLD = 64` — color groups smaller than this fall back to sequential inside `sweep_color_groups` to avoid task-spawn overhead. - Thread-local `ScratchArena` per rayon worker thread. ## Test plan - [x] `cargo test --features approx` — 96 tests pass (74 lib + 22 integration) - [x] `cargo test --features approx,rayon` — 97 tests pass (+1 determinism) - [x] `cargo clippy --all-targets --features approx -- -D warnings` — clean - [x] `cargo clippy --all-targets --features approx,rayon -- -D warnings` — clean - [x] `cargo +nightly fmt --check` — clean - [x] `cargo bench --bench batch --features approx` — 23.23 µs (no regression vs T2) - [x] `cargo bench --bench history_converge --features approx,rayon` — runs on all three workloads - [x] Bit-identical posteriors across `RAYON_NUM_THREADS={1, 2, 4, 8}` — verified ## Commit history 13 commits on `t3-concurrency`. Each task is self-contained and bisectable. See `git log main..t3-concurrency` for the full list. ## Deferred - **Cross-slice parallelism** (dirty-bit slice skipping) — the path that would actually speed up typical TrueSkill workloads. - **Default-on `rayon` feature** — spec called for default-on; we keep it opt-in until the feature proves stable in production use. - **Synchronous-EP schedule with barrier merge** — alternative parallel strategy per spec Section 6. - **`MarginFactor` / `Outcome::Scored`** — T4. - **`Damped` / `Residual` schedules** — T4. - **N-team `predict_outcome`** — T4. - **`Game::custom` full ergonomics** — T4. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #2 Co-authored-by: Anders Olsson <anders.e.olsson@gmail.com> Co-committed-by: Anders Olsson <anders.e.olsson@gmail.com>
2026-04-24 13:01:01 +00:00
parent d2aab82c1e
commit 6bf3e7e294
15 changed files with 2022 additions and 36 deletions
--- a/src/time_slice.rs
+++ b/src/time_slice.rs
@@ -7,6 +7,7 @@ use std::collections::HashMap;
 use crate::{
    Index, N_INF,
    arena::ScratchArena,
+    color_group::ColorGroups,
    drift::Drift,
    game::Game,
    gaussian::Gaussian,
@@ -84,6 +85,12 @@ pub(crate) struct Event {
 }

 impl Event {
+    pub(crate) fn iter_agents(&self) -> impl Iterator<Item = Index> + '_ {
+        self.teams
+            .iter()
+            .flat_map(|t| t.items.iter().map(|it| it.agent))
+    }
+
    fn outputs(&self) -> Vec<f64> {
        self.teams
            .iter()
@@ -108,6 +115,33 @@ impl Event {
            })
            .collect::<Vec<_>>()
    }
+
+    /// Direct in-loop update: mutates self and `skills` inline with no
+    /// intermediate allocation. Used by both the sequential sweep path and,
+    /// via unsafe, by the parallel rayon path for events in the same color
+    /// group (which have disjoint agent sets — see `sweep_color_groups`).
+    fn iteration_direct<T: Time, D: Drift<T>>(
+        &mut self,
+        skills: &mut SkillStore,
+        agents: &CompetitorStore<T, D>,
+        p_draw: f64,
+        arena: &mut ScratchArena,
+    ) {
+        let teams = self.within_priors(false, false, skills, agents);
+        let result = self.outputs();
+        let g = Game::ranked_with_arena(teams, &result, &self.weights, p_draw, arena);
+
+        for (t, team) in self.teams.iter_mut().enumerate() {
+            for (i, item) in team.items.iter_mut().enumerate() {
+                let old_likelihood = skills.get(item.agent).unwrap().likelihood;
+                let new_likelihood = (old_likelihood / item.likelihood) * g.likelihoods[t][i];
+                skills.get_mut(item.agent).unwrap().likelihood = new_likelihood;
+                item.likelihood = g.likelihoods[t][i];
+            }
+        }
+
+        self.evidence = g.evidence;
+    }
 }

 #[derive(Debug)]
@@ -117,6 +151,7 @@ pub struct TimeSlice<T: Time = i64> {
    pub(crate) time: T,
    p_draw: f64,
    arena: ScratchArena,
+    pub(crate) color_groups: ColorGroups,
 }

 impl<T: Time> TimeSlice<T> {
@@ -127,9 +162,44 @@ impl<T: Time> TimeSlice<T> {
            time,
            p_draw,
            arena: ScratchArena::new(),
+            color_groups: ColorGroups::new(),
        }
    }

+    /// Recompute the color-group partition and reorder `self.events` into
+    /// color-contiguous ranges. After this call, `self.color_groups.groups[c]`
+    /// contains a contiguous ascending range of indices in `self.events`.
+    pub(crate) fn recompute_color_groups(&mut self) {
+        use crate::color_group::color_greedy;
+
+        let n = self.events.len();
+        if n == 0 {
+            self.color_groups = ColorGroups::new();
+            return;
+        }
+
+        let cg = color_greedy(n, |ev_idx| {
+            self.events[ev_idx].iter_agents().collect::<Vec<_>>()
+        });
+
+        let mut reordered: Vec<Event> = Vec::with_capacity(n);
+        let mut new_groups: Vec<Vec<usize>> = Vec::with_capacity(cg.groups.len());
+        let mut taken: Vec<Option<Event>> = self.events.drain(..).map(Some).collect();
+
+        for group in &cg.groups {
+            let mut new_indices: Vec<usize> = Vec::with_capacity(group.len());
+            for &old_idx in group {
+                let ev = taken[old_idx].take().expect("event already taken");
+                new_indices.push(reordered.len());
+                reordered.push(ev);
+            }
+            new_groups.push(new_indices);
+        }
+
+        self.events = reordered;
+        self.color_groups = ColorGroups { groups: new_groups };
+    }
+
    pub fn add_events<D: Drift<T>>(
        &mut self,
        composition: Vec<Vec<Vec<Index>>>,
@@ -212,6 +282,7 @@ impl<T: Time> TimeSlice<T> {
        self.events.extend(events);

        self.iteration(from, agents);
+        self.recompute_color_groups();
    }

    pub(crate) fn posteriors(&self) -> HashMap<Index, Gaussian> {
@@ -222,28 +293,115 @@ impl<T: Time> TimeSlice<T> {
    }

    pub fn iteration<D: Drift<T>>(&mut self, from: usize, agents: &CompetitorStore<T, D>) {
-        for event in self.events.iter_mut().skip(from) {
-            let teams = event.within_priors(false, false, &self.skills, agents);
-            let result = event.outputs();
+        if from > 0 || self.color_groups.is_empty() {
+            // Initial pass (add_events) or no color groups yet: simple sequential sweep.
+            for event in self.events.iter_mut().skip(from) {
+                let teams = event.within_priors(false, false, &self.skills, agents);
+                let result = event.outputs();

-            let g = Game::ranked_with_arena(
-                teams,
-                &result,
-                &event.weights,
-                self.p_draw,
-                &mut self.arena,
-            );
+                let g = Game::ranked_with_arena(
+                    teams,
+                    &result,
+                    &event.weights,
+                    self.p_draw,
+                    &mut self.arena,
+                );

-            for (t, team) in event.teams.iter_mut().enumerate() {
-                for (i, item) in team.items.iter_mut().enumerate() {
-                    let old_likelihood = self.skills.get(item.agent).unwrap().likelihood;
-                    let new_likelihood = (old_likelihood / item.likelihood) * g.likelihoods[t][i];
-                    self.skills.get_mut(item.agent).unwrap().likelihood = new_likelihood;
-                    item.likelihood = g.likelihoods[t][i];
+                for (t, team) in event.teams.iter_mut().enumerate() {
+                    for (i, item) in team.items.iter_mut().enumerate() {
+                        let old_likelihood = self.skills.get(item.agent).unwrap().likelihood;
+                        let new_likelihood =
+                            (old_likelihood / item.likelihood) * g.likelihoods[t][i];
+                        self.skills.get_mut(item.agent).unwrap().likelihood = new_likelihood;
+                        item.likelihood = g.likelihoods[t][i];
+                    }
+                }
+
+                event.evidence = g.evidence;
+            }
+        } else {
+            self.sweep_color_groups(agents);
+        }
+    }
+
+    /// Full event sweep using the color-group partition. Colors are processed
+    /// sequentially; within each color the inner loop is parallel under rayon.
+    ///
+    /// Events within each color group touch disjoint agent sets (guaranteed by
+    /// the greedy coloring). This lets each rayon thread write directly to its
+    /// events' skill likelihoods without a deferred-apply step, matching the
+    /// sequential path's allocation profile. The unsafe block is sound because:
+    ///   1. `self.events[range]` and `self.skills` are separate fields → disjoint.
+    ///   2. Events in the same color group access disjoint `Index` values in
+    ///      `self.skills`, so concurrent writes land on different memory locations.
+    ///   3. Each event only writes to its own items' likelihoods (no sharing).
+    #[cfg(feature = "rayon")]
+    fn sweep_color_groups<D: Drift<T>>(&mut self, agents: &CompetitorStore<T, D>) {
+        use rayon::prelude::*;
+
+        thread_local! {
+            static ARENA: std::cell::RefCell<ScratchArena> =
+                std::cell::RefCell::new(ScratchArena::new());
+        }
+
+        // Minimum color-group size to justify rayon's task-spawn overhead.
+        // Below this threshold, process events sequentially to avoid regression
+        // on small per-slice workloads.
+        const RAYON_THRESHOLD: usize = 64;
+
+        for color_idx in 0..self.color_groups.groups.len() {
+            let group_len = self.color_groups.groups[color_idx].len();
+            if group_len == 0 {
+                continue;
+            }
+            let range = self.color_groups.color_range(color_idx);
+            let p_draw = self.p_draw;
+
+            if group_len >= RAYON_THRESHOLD {
+                // Obtain a raw pointer from the unique `&mut self.skills` reference.
+                // Casting back to `&mut` inside the closure is sound because:
+                //   1. The pointer originates from a `&mut` — no aliasing with shared refs.
+                //   2. Events in the same color group touch disjoint `Index` slots in the
+                //      underlying Vec, so concurrent writes from different threads land on
+                //      different memory locations — no data race.
+                //   3. `self.events[range]` and `self.skills` are separate struct fields,
+                //      so the borrow splits cleanly.
+                let skills_addr: usize = (&mut self.skills as *mut SkillStore) as usize;
+                self.events[range].par_iter_mut().for_each(move |ev| {
+                    // SAFETY: see above.
+                    let skills: &mut SkillStore = unsafe { &mut *(skills_addr as *mut SkillStore) };
+                    ARENA.with(|cell| {
+                        let mut arena = cell.borrow_mut();
+                        arena.reset();
+                        ev.iteration_direct(skills, agents, p_draw, &mut arena);
+                    });
+                });
+            } else {
+                for ev in &mut self.events[range] {
+                    ev.iteration_direct(&mut self.skills, agents, p_draw, &mut self.arena);
                }
            }
+        }
+    }

-            event.evidence = g.evidence;
+    /// Full event sweep using the color-group partition, sequential direct-write path.
+    /// Events within each color group are updated inline — no EventOutput allocation —
+    /// matching the T2 performance profile.
+    #[cfg(not(feature = "rayon"))]
+    fn sweep_color_groups<D: Drift<T>>(&mut self, agents: &CompetitorStore<T, D>) {
+        for color_idx in 0..self.color_groups.groups.len() {
+            if self.color_groups.groups[color_idx].is_empty() {
+                continue;
+            }
+            let range = self.color_groups.color_range(color_idx);
+
+            // Borrow self.events as a mutable slice for this color range.
+            // self.skills and self.arena are separate fields — disjoint borrows are
+            // allowed within a single method body.
+            let p_draw = self.p_draw;
+            for ev in &mut self.events[range] {
+                ev.iteration_direct(&mut self.skills, agents, p_draw, &mut self.arena);
+            }
        }
    }

@@ -662,4 +820,67 @@ mod tests {
            epsilon = 1e-6
        );
    }
+
+    #[test]
+    fn time_slice_color_groups_reorders_events() {
+        // ev0: [a, b]; ev1: [c, d]; ev2: [a, c]
+        // Greedy coloring: ev0→c0, ev1→c0 (disjoint), ev2→c1 (overlaps both).
+        // After recompute_color_groups, physical order is [ev0, ev1, ev2]
+        // and groups == [[0, 1], [2]].
+        let mut index_map = KeyTable::new();
+
+        let a = index_map.get_or_create("a");
+        let b = index_map.get_or_create("b");
+        let c = index_map.get_or_create("c");
+        let d = index_map.get_or_create("d");
+
+        let mut agents: CompetitorStore<i64, ConstantDrift> = CompetitorStore::new();
+
+        for agent in [a, b, c, d] {
+            agents.insert(
+                agent,
+                Competitor {
+                    rating: Rating::new(
+                        Gaussian::from_ms(25.0, 25.0 / 3.0),
+                        25.0 / 6.0,
+                        ConstantDrift(25.0 / 300.0),
+                    ),
+                    ..Default::default()
+                },
+            );
+        }
+
+        let mut ts = TimeSlice::new(0i64, 0.0);
+
+        ts.add_events(
+            vec![
+                vec![vec![a], vec![b]],
+                vec![vec![c], vec![d]],
+                vec![vec![a], vec![c]],
+            ],
+            vec![vec![1.0, 0.0], vec![1.0, 0.0], vec![1.0, 0.0]],
+            vec![],
+            &agents,
+        );
+
+        assert_eq!(ts.color_groups.n_colors(), 2);
+        assert_eq!(ts.color_groups.groups[0], vec![0, 1]);
+        assert_eq!(ts.color_groups.groups[1], vec![2]);
+
+        assert_eq!(ts.color_groups.color_range(0), 0..2);
+        assert_eq!(ts.color_groups.color_range(1), 2..3);
+
+        // Events at positions 0 and 1 (color 0) must be disjoint — verify by
+        // checking that the agent sets of self.events[0] and self.events[1] do
+        // not include the agent at self.events[2].
+        let agents_in_ev2: Vec<Index> = ts.events[2].iter_agents().collect();
+        let agents_in_ev0: Vec<Index> = ts.events[0].iter_agents().collect();
+        let agents_in_ev1: Vec<Index> = ts.events[1].iter_agents().collect();
+        // ev0 and ev1 must be disjoint from each other (color-0 invariant).
+        assert!(agents_in_ev0.iter().all(|ag| !agents_in_ev1.contains(ag)));
+        // ev2 must share an agent with ev0 or ev1 (it needed its own color).
+        let ev2_overlaps_ev0 = agents_in_ev2.iter().any(|ag| agents_in_ev0.contains(ag));
+        let ev2_overlaps_ev1 = agents_in_ev2.iter().any(|ag| agents_in_ev1.contains(ag));
+        assert!(ev2_overlaps_ev0 || ev2_overlaps_ev1);
+    }
 }