Implements T3 of `docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md` Section 6. Plan: `docs/superpowers/plans/2026-04-24-t3-concurrency.md` (11 tasks).
## Summary
### Breaking
- `Send + Sync` bounds added to public traits: `Time`, `Drift<T>`, `Observer<T>`, `Factor`, `Schedule`. All built-in impls satisfy these automatically (`Send`/`Sync` are auto traits); downstream custom impls will need to satisfy the new bounds.
### New
- Opt-in `rayon` cargo feature. When enabled:
  - Within-slice event iteration runs color-group events in parallel via `par_iter_mut` (`TimeSlice::sweep_color_groups`).
  - `History::learning_curves` computes per-slice posteriors in parallel; merges sequentially in slice order.
  - `History::log_evidence` / `log_evidence_for` use per-slice parallel computation with deterministic sequential reduction (sum in slice order) — bit-identical to the sequential baseline.
- `ColorGroups` infrastructure (`src/color_group.rs`) with greedy graph coloring. Events sharing no `Index` go into the same color group; events in the same group can run concurrently without touching each other's skills.
- `tests/determinism.rs` asserts bit-identical posteriors across `RAYON_NUM_THREADS={1, 2, 4, 8}`.
- `benches/history_converge.rs` measures end-to-end convergence on three workload shapes.
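The greedy-coloring invariant behind `ColorGroups` can be sketched with simplified stand-in types (plain `u32` indices instead of the crate's `Index`; `color` here is a hypothetical free function, not the crate's API):

```rust
use std::collections::HashSet;

// Greedy coloring over per-event index sets: each event goes to the first
// color whose accumulated index set is disjoint from the event's own, so
// events within one color touch no shared index.
fn color(events: &[Vec<u32>]) -> Vec<Vec<usize>> {
    let mut groups: Vec<Vec<usize>> = Vec::new();
    let mut members: Vec<HashSet<u32>> = Vec::new();
    for (i, ev) in events.iter().enumerate() {
        let set: HashSet<u32> = ev.iter().copied().collect();
        match members.iter().position(|m| m.is_disjoint(&set)) {
            Some(c) => {
                groups[c].push(i);
                members[c].extend(set);
            }
            None => {
                groups.push(vec![i]);
                members.push(set);
            }
        }
    }
    groups
}

fn main() {
    // Events 0 and 1 share no index, so they share a color; event 2 overlaps both.
    let g = color(&[vec![0, 1], vec![2, 3], vec![0, 2]]);
    assert_eq!(g, vec![vec![0, 1], vec![2]]);
}
```

Events inside one group can then be processed concurrently; groups themselves run in sequence, preserving async-EP semantics.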
## Performance
### Sequential (no rayon, default build)
| Metric | Before T3 | After T3 | Delta |
|---|---|---|---|
| `Batch::iteration` | 22.88 µs | 23.23 µs | **+1.5%** (noise) |
| `Gaussian::*` | ≈218–264 ps | ≈236 ps | within noise |
**No sequential regression.** Default build is as fast as T2.
### Parallel (`--features rayon`, Apple M5 Pro, auto thread count)
| Workload | Sequential | Parallel | Speedup |
|---|---:|---:|---:|
| 500 events / 100 competitors / 10 per slice | 4.03 ms | 4.24 ms | **0.95×** (slower) |
| 2000 events / 200 competitors / 20 per slice | 20.18 ms | 19.82 ms | **1.02×** |
| 5000 events / 50000 competitors / 1 slice | 11.88 ms | 9.10 ms | **1.3×** |
### ⚠️ The spec's >=2× target was not met on realistic workloads.
T3's within-slice color-group parallelism only shows material benefit when a slice holds many events AND the competitor pool is large enough to give the greedy coloring room to partition. Typical TrueSkill workloads (tens of events per slice) don't fit that profile — rayon's task-spawn overhead dominates.
**Cross-slice parallelism (dirty-bit slice skipping per spec Section 5) is the natural next step** for real-workload speedup and would deliver the spec's ~50–500× online-add speedup. Deferred to a future tier.
## Determinism
`tests/determinism.rs` runs a 200-event history at thread counts {1, 2, 4, 8} via `rayon::ThreadPoolBuilder::install` and asserts every `(time, posterior)` pair has bit-identical `mu` and `sigma` (compared via `f64::to_bits()`). Passes.
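The shape of that check can be sketched with std threads only (no rayon dependency): run the same per-element computation under different thread counts and compare the collected bits. `posteriors` below is a stand-in for the real convergence run, not the crate's API.

```rust
use std::thread;

// Compute a fixed-size vector of f64 bit patterns, partitioning the work
// across `n_threads` scoped threads. Each element depends only on its own
// index, so the result must be identical for any thread count.
fn posteriors(n_threads: usize) -> Vec<u64> {
    let n = 64usize;
    let chunk = n.div_ceil(n_threads);
    let mut out = vec![0u64; n];
    thread::scope(|s| {
        for (c, slot) in out.chunks_mut(chunk).enumerate() {
            s.spawn(move || {
                for (j, o) in slot.iter_mut().enumerate() {
                    // Stand-in per-event computation.
                    *o = (((c * chunk + j) as f64).sqrt()).to_bits();
                }
            });
        }
    });
    out
}

fn main() {
    let baseline = posteriors(1);
    for n in [2, 4, 8] {
        assert_eq!(baseline, posteriors(n), "thread count {n} changed bits");
    }
}
```

The real test compares `mu`/`sigma` via `f64::to_bits()` in exactly this spirit, with the thread pool supplied by `rayon::ThreadPoolBuilder`.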
## Internals
- Parallel path uses an `unsafe` block to concurrently write to `SkillStore` from color-group-disjoint events. Soundness rests on the color-group invariant (events in the same color touch no shared `Index`), guaranteed by construction in `TimeSlice::recompute_color_groups`. Sequential path unchanged from T2.
- `RAYON_THRESHOLD = 64` — color groups smaller than this fall back to sequential inside `sweep_color_groups` to avoid task-spawn overhead.
- Thread-local `ScratchArena` per rayon worker thread.
## Test plan
- [x] `cargo test --features approx` — 96 tests pass (74 lib + 22 integration)
- [x] `cargo test --features approx,rayon` — 97 tests pass (+1 determinism)
- [x] `cargo clippy --all-targets --features approx -- -D warnings` — clean
- [x] `cargo clippy --all-targets --features approx,rayon -- -D warnings` — clean
- [x] `cargo +nightly fmt --check` — clean
- [x] `cargo bench --bench batch --features approx` — 23.23 µs (no regression vs T2)
- [x] `cargo bench --bench history_converge --features approx,rayon` — runs on all three workloads
- [x] Bit-identical posteriors across `RAYON_NUM_THREADS={1, 2, 4, 8}` — verified
## Commit history
13 commits on `t3-concurrency`. Each task is self-contained and bisectable. See `git log main..t3-concurrency` for the full list.
## Deferred
- **Cross-slice parallelism** (dirty-bit slice skipping) — the path that would actually speed up typical TrueSkill workloads.
- **Default-on `rayon` feature** — spec called for default-on; we keep it opt-in until the feature proves stable in production use.
- **Synchronous-EP schedule with barrier merge** — alternative parallel strategy per spec Section 6.
- **`MarginFactor` / `Outcome::Scored`** — T4.
- **`Damped` / `Residual` schedules** — T4.
- **N-team `predict_outcome`** — T4.
- **`Game::custom` full ergonomics** — T4.
Reviewed-on: #2
Co-authored-by: Anders Olsson <anders.e.olsson@gmail.com>
Co-committed-by: Anders Olsson <anders.e.olsson@gmail.com>
# T3 — Concurrency Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Ship the T3 tier from `docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md` Section 6: `Send + Sync` bounds on public traits, color-group partitioning for within-slice event independence, and rayon-backed parallel paths on within-slice event iteration, `learning_curves`, and `log_evidence_for`. Deterministic posteriors across `RAYON_NUM_THREADS={1, 2, 4, 8}`; >2× speedup on an 8-core offline-converge benchmark.

**Architecture:** Concurrency lands as a single feature flag (`rayon`, opt-in for T3; the spec suggests default-on but we defer that flip until the feature is proven stable). All parallel paths are hidden behind `#[cfg(feature = "rayon")]` with sequential fallbacks. Within-slice parallelism exploits graph coloring: events sharing no `Index` go into the same color group and run concurrently; across color groups, execution is strictly sequential. This preserves the exact async-EP semantics of T2 at any thread count. Reductions over `f64` (`log_evidence_for`, `predict_quality`) use a two-stage parallel-map-then-sequential-reduce so sums are bit-identical regardless of thread count.

**Tech Stack:** Rust 2024 edition. New optional dependency: `rayon = "1"`. Builds on T2 (`History<T, D, O, K>`, `Event<T, K>`, `Outcome`, `Observer`, factors module).
## Design decisions

Called out explicitly so reviewers can override before execution:

- **Rayon is opt-in** (cargo feature), not default-on. Simplifies CI, keeps the default build lean. We flip to default-on in a follow-up once the feature is shown to be stable under field use.
- **Greedy graph coloring** for the within-slice partition: for each event in ingestion order, assign the lowest color whose existing members share no `Index` with the event. Optimality is not the target — events per slice is small (~50), and greedy finishes in O(n·c·m) where c is colors and m is team size. Rebalancing is a T4 follow-up if benchmarks show it helps.
- **Scope of parallelism:** within-slice color groups, `learning_curves`, `log_evidence_for`, `predict_quality`. Cross-slice iteration in `History::converge` stays sequential — the forward/backward sweep has true data dependencies across slices. Parallelizing across slices requires a separate algorithm (the spec's dirty-bit slice skipping, deferred to beyond-T3).
- **Deterministic reductions:** `.par_iter().map().collect::<Vec<_>>().into_iter().sum()` — each slice's contribution is computed in parallel, then summed sequentially in slice order. Per-element values are bit-identical to T2 (no extra floating-point additions reordered).
- **Within-game inference stays sequential.** A single `Game::ranked` / `Game::likelihoods` call is too small (~20 µs) to amortize rayon's ~5 µs task overhead. The spec's table confirms this.
## Acceptance criteria

- `cargo test --features approx` — all tests pass (T2 baseline: 90 tests).
- `cargo test --features approx,rayon` — all tests pass; determinism tests across `RAYON_NUM_THREADS={1, 2, 4, 8}` produce bit-identical posteriors.
- `cargo clippy --all-targets --features approx -- -D warnings` — clean.
- `cargo clippy --all-targets --features approx,rayon -- -D warnings` — clean.
- `cargo +nightly fmt --check` — clean.
- `cargo bench --bench history_converge --features approx,rayon` — >2× speedup on an 8-core machine vs the sequential baseline.
- All public traits (`Time`, `Drift<T>`, `Observer<T>`, `Factor`, `Schedule`) have `Send + Sync` bounds; their blanket/default impls (`i64`, `Untimed`, `ConstantDrift`, `NullObserver`) all naturally satisfy the new bounds — no user code should need changes.
- The `rayon` feature is opt-in; default build without it compiles and passes all tests.
## File map

New files:

| Path | Responsibility |
|---|---|
| `src/color_group.rs` | Greedy graph coloring for within-slice event partitioning. |
| `src/parallel.rs` | `#[cfg(feature = "rayon")]` helpers: par-or-seq iterators, ordered reductions, thread-count determinism tests. |
| `benches/history_converge.rs` | End-to-end convergence benchmark (scales to show rayon benefit). |
| `tests/determinism.rs` | Runs the same convergence with different `RAYON_NUM_THREADS` and asserts bit-identical posteriors. |
Modified:

| Path | What changes |
|---|---|
| `Cargo.toml` | Add `rayon = { version = "1", optional = true }`; declare `rayon` feature. Add `[[bench]] name = "history_converge"`. |
| `src/lib.rs` | Declare new `color_group` + `parallel` modules (both private). |
| `src/time.rs` | Add `Send + Sync + 'static` bounds to `Time`. |
| `src/drift.rs` | Add `Send + Sync` bound to `Drift<T>`. |
| `src/observer.rs` | Add `Send + Sync` bound to `Observer<T>`. |
| `src/factor/mod.rs` | Add `Send + Sync` bound to `Factor`. |
| `src/schedule.rs` | Add `Send + Sync` bound to `Schedule`. |
| `src/time_slice.rs` | Store pre-computed color-group partition; parallel event iteration behind `#[cfg(feature = "rayon")]`. |
| `src/history.rs` | Parallel `learning_curves`, `learning_curves_by_index`, `log_evidence`, `log_evidence_for` behind `#[cfg(feature = "rayon")]` with ordered reductions. |
## Task 1: Pre-flight — verify green on main, create t3 branch, capture baseline

Files: none

- [ ] Step 1: Confirm on main, clean tree

  ```bash
  git status
  git rev-parse --abbrev-ref HEAD
  ```

  Expected: clean; on main.

- [ ] Step 2: Create the T3 branch

  ```bash
  git checkout -b t3-concurrency
  ```

- [ ] Step 3: Confirm all tests pass

  ```bash
  cargo test --features approx
  ```

  Expected: 90 tests pass (68 lib + 10 api_shape + 6 game + 4 record_winner + 2 equivalence).

- [ ] Step 4: Capture current bench baseline

  ```bash
  cargo bench --bench batch 2>&1 | grep "Batch::iteration"
  ```

  Record the number — it'll be the sequential-path baseline to verify no regression.

- [ ] Step 5: No commit — verification only.
## Task 2: Add rayon as optional dependency + feature flag

Files:

- Modify: `Cargo.toml`

- [ ] Step 1: Add the dependency and feature

  Under `[dependencies]` add:

  ```toml
  rayon = { version = "1", optional = true }
  ```

  Add a `[features]` section (if not present — check first):

  ```toml
  [features]
  rayon = ["dep:rayon"]
  ```

  (The existing `approx` feature already exists as a `dep:approx`-style optional dependency; use the same convention.)

- [ ] Step 2: Verify both builds compile

  ```bash
  cargo build --features approx
  cargo build --features approx,rayon
  ```

  Both must succeed. The second one pulls rayon into the dependency graph but nothing uses it yet.

- [ ] Step 3: Commit

  ```bash
  git add Cargo.toml Cargo.lock
  git commit -m "$(cat <<'EOF'
  feat(cargo): add rayon as optional dependency

  Opt-in feature flag — users who want parallel paths build with
  --features rayon. Default build remains single-threaded.

  Spec Section 6 calls for default-on; we defer that flip until the
  feature is stable under field use.

  Part of T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.
  EOF
  )"
  ```
## Task 3: Add Send + Sync bounds to public traits

Files:

- Modify: `src/time.rs`, `src/drift.rs`, `src/observer.rs`, `src/factor/mod.rs`, `src/schedule.rs`

This is a minor breaking API change: downstream code with user-defined `Drift<T>` / `Observer<T>` / `Factor` / `Schedule` impls that aren't `Send + Sync` will fail to compile. For our crate, all built-in types (`i64`, `Untimed`, `ConstantDrift`, `NullObserver`, `EpsilonOrMax`, `TeamSumFactor`, `RankDiffFactor`, `TruncFactor`, `BuiltinFactor`) naturally satisfy these bounds — no internal code changes.

- [ ] Step 1: Update `Time` trait

  ```rust
  // src/time.rs
  pub trait Time: Copy + Ord + Send + Sync + 'static {
      fn elapsed_to(&self, later: &Self) -> i64;
  }
  ```

- [ ] Step 2: Update `Drift<T>` trait

  ```rust
  // src/drift.rs
  pub trait Drift<T: Time>: Copy + Debug + Send + Sync {
      fn variance_delta(&self, from: &T, to: &T) -> f64;
      fn variance_for_elapsed(&self, elapsed: i64) -> f64;
  }
  ```

- [ ] Step 3: Update `Observer<T>` trait

  ```rust
  // src/observer.rs
  pub trait Observer<T: Time>: Send + Sync {
      fn on_iteration_end(&self, _iter: usize, _max_step: (f64, f64)) {}
      fn on_batch_processed(&self, _time: &T, _slice_idx: usize, _n_events: usize) {}
      fn on_converged(&self, _iters: usize, _final_step: (f64, f64), _converged: bool) {}
  }
  ```

- [ ] Step 4: Update `Factor` trait

  ```rust
  // src/factor/mod.rs
  pub trait Factor: Send + Sync {
      fn propagate(&mut self, vars: &mut VarStore) -> (f64, f64);
      fn log_evidence(&self, _vars: &VarStore) -> f64 {
          0.0
      }
  }
  ```

- [ ] Step 5: Update `Schedule` trait

  ```rust
  // src/schedule.rs
  pub trait Schedule: Send + Sync {
      fn run(&self, factors: &mut [BuiltinFactor], vars: &mut VarStore) -> ScheduleReport;
  }
  ```

- [ ] Step 6: Verify both feature combinations still compile and test

  ```bash
  cargo build --features approx
  cargo build --features approx,rayon
  cargo test --features approx --lib
  cargo clippy --all-targets --features approx -- -D warnings
  ```

  If any built-in type fails the auto-Send/Sync check, investigate — something non-thread-safe slipped in. Likely suspects: raw pointers, `Rc`, `RefCell`, non-static references. None are expected in this codebase; if found, convert to `Arc`/`Mutex`/`RwLock` as appropriate.
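A cheap way to surface such a failure early is a compile-time probe of the kind below. This is a hedged sketch, not crate code: `assert_send_sync` and `ConstantDriftLike` are hypothetical stand-ins for the real built-in types.

```rust
// Compile-time probe: the call fails to compile if the type loses Send + Sync
// (for example, by gaining an Rc or RefCell field).
fn assert_send_sync<T: Send + Sync>() {}

// Stand-in for a built-in Drift impl: plain Copy data, trivially Send + Sync.
#[derive(Clone, Copy, Debug)]
struct ConstantDriftLike {
    sigma_sq_per_unit: f64,
}

fn main() {
    assert_send_sync::<i64>(); // the built-in Time impl
    assert_send_sync::<ConstantDriftLike>();
    let d = ConstantDriftLike { sigma_sq_per_unit: 0.1 };
    assert!(d.sigma_sq_per_unit > 0.0);
}
```

Dropping a few such calls into a `#[cfg(test)]` module turns "something non-thread-safe slipped in" into a compile error at the offending type rather than a distant trait-bound error.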
- [ ] Step 7: Commit

  ```bash
  cargo +nightly fmt
  git add -A
  git commit -m "$(cat <<'EOF'
  feat(api): add Send + Sync bounds to public traits

  Required for T3 rayon-based parallelism. Affected traits:

  - Time (+ Send + Sync + 'static)
  - Drift<T> (+ Send + Sync)
  - Observer<T> (+ Send + Sync)
  - Factor (+ Send + Sync)
  - Schedule (+ Send + Sync)

  All built-in impls (i64, Untimed, ConstantDrift, NullObserver,
  EpsilonOrMax, TeamSumFactor, RankDiffFactor, TruncFactor,
  BuiltinFactor) naturally satisfy these bounds — no internal changes
  needed.

  Minor breaking change: downstream impls that aren't already
  thread-safe will fail to compile until they add the bounds.

  Part of T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.
  EOF
  )"
  ```
## Task 4: Implement greedy color-group partitioning

Files:

- Create: `src/color_group.rs`
- Modify: `src/lib.rs` (register module)

- [ ] Step 1: Create `src/color_group.rs`

  ```rust
  //! Greedy graph coloring for within-slice event independence.
  //!
  //! Events sharing no `Index` can be processed in parallel under async-EP
  //! semantics. This module partitions a list of events into "colors" such
  //! that events of the same color touch disjoint index sets.
  //!
  //! The algorithm is greedy: for each event in ingestion order, place it in
  //! the lowest-numbered color whose existing members share no `Index`. If
  //! no existing color accepts the event, open a new color.
  //!
  //! Complexity: O(n × c × m) where n is events, c is colors (small, ≤ 5 in
  //! practice), and m is average team size.

  use std::collections::HashSet;

  use crate::Index;

  /// Partition of event indices into color groups.
  ///
  /// Each inner `Vec<usize>` holds the indices (into the original events
  /// array) of events assigned to one color. Colors are iterated in ascending
  /// order by convention.
  #[derive(Clone, Debug, Default)]
  pub(crate) struct ColorGroups {
      pub(crate) groups: Vec<Vec<usize>>,
  }

  impl ColorGroups {
      pub(crate) fn new() -> Self {
          Self::default()
      }

      pub(crate) fn n_colors(&self) -> usize {
          self.groups.len()
      }

      pub(crate) fn is_empty(&self) -> bool {
          self.groups.is_empty()
      }

      /// Total event count across all colors.
      pub(crate) fn total_events(&self) -> usize {
          self.groups.iter().map(|g| g.len()).sum()
      }
  }

  /// Compute color groups greedily.
  ///
  /// `index_set` yields, for each event, the set of `Index` values that
  /// event touches. The returned `ColorGroups` has one inner `Vec<usize>` per
  /// color, containing event indices in the order they were assigned.
  pub(crate) fn color_greedy<I, F>(n_events: usize, index_set: F) -> ColorGroups
  where
      F: Fn(usize) -> I,
      I: IntoIterator<Item = Index>,
  {
      let mut groups: Vec<Vec<usize>> = Vec::new();
      let mut members: Vec<HashSet<Index>> = Vec::new();
      for ev_idx in 0..n_events {
          let ev_members: HashSet<Index> = index_set(ev_idx).into_iter().collect();
          // Find first color whose member-set is disjoint from this event's indices.
          let chosen = members.iter().position(|m| m.is_disjoint(&ev_members));
          let color_idx = match chosen {
              Some(c) => c,
              None => {
                  groups.push(Vec::new());
                  members.push(HashSet::new());
                  groups.len() - 1
              }
          };
          groups[color_idx].push(ev_idx);
          members[color_idx].extend(ev_members);
      }
      ColorGroups { groups }
  }

  #[cfg(test)]
  mod tests {
      use super::*;

      fn idx(i: usize) -> Index {
          Index::from(i)
      }

      #[test]
      fn single_event_gets_one_color() {
          let cg = color_greedy(1, |_| vec![idx(0), idx(1)]);
          assert_eq!(cg.n_colors(), 1);
          assert_eq!(cg.groups[0], vec![0]);
      }

      #[test]
      fn disjoint_events_share_a_color() {
          // Event 0 touches {0, 1}; event 1 touches {2, 3}.
          let cg = color_greedy(2, |i| match i {
              0 => vec![idx(0), idx(1)],
              1 => vec![idx(2), idx(3)],
              _ => unreachable!(),
          });
          assert_eq!(cg.n_colors(), 1);
          assert_eq!(cg.groups[0], vec![0, 1]);
      }

      #[test]
      fn overlapping_events_need_separate_colors() {
          // Event 0 touches {0, 1}; event 1 touches {1, 2}.
          let cg = color_greedy(2, |i| match i {
              0 => vec![idx(0), idx(1)],
              1 => vec![idx(1), idx(2)],
              _ => unreachable!(),
          });
          assert_eq!(cg.n_colors(), 2);
          assert_eq!(cg.groups[0], vec![0]);
          assert_eq!(cg.groups[1], vec![1]);
      }

      #[test]
      fn three_events_two_colors() {
          // Event 0: {0, 1}; event 1: {2, 3}; event 2: {0, 2}.
          // Greedy: ev0→c0, ev1→c0 (disjoint), ev2 overlaps both→c1.
          let cg = color_greedy(3, |i| match i {
              0 => vec![idx(0), idx(1)],
              1 => vec![idx(2), idx(3)],
              2 => vec![idx(0), idx(2)],
              _ => unreachable!(),
          });
          assert_eq!(cg.n_colors(), 2);
          assert_eq!(cg.groups[0], vec![0, 1]);
          assert_eq!(cg.groups[1], vec![2]);
      }

      #[test]
      fn total_events_counts_correctly() {
          let cg = color_greedy(4, |_| vec![idx(0)]);
          // All events touch index 0 → 4 distinct colors.
          assert_eq!(cg.n_colors(), 4);
          assert_eq!(cg.total_events(), 4);
      }
  }
  ```
- [ ] Step 2: Register in `src/lib.rs`

  Add (private module, alphabetical):

  ```rust
  mod color_group;
  ```

  No public re-export — this is internal infrastructure.

- [ ] Step 3: Verify

  ```bash
  cargo test --features approx --lib color_group
  cargo clippy --all-targets --features approx -- -D warnings
  cargo +nightly fmt --check
  ```

  Expected: 5 tests pass in the `color_group::tests` module.

- [ ] Step 4: Commit

  ```bash
  cargo +nightly fmt
  git add src/color_group.rs src/lib.rs
  git commit -m "$(cat <<'EOF'
  feat(color-group): add greedy within-slice event partitioning

  ColorGroups holds a partition of event indices into color groups such
  that events of the same color touch no shared Index. Computed greedily
  in ingestion order: each event goes into the first color whose existing
  members are disjoint from the event's indices.

  Used in T3 for safe within-slice parallelism — events in the same
  color can run concurrently without touching each other's skills.

  Part of T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.
  EOF
  )"
  ```
## Task 5: Store color-group partition in TimeSlice

Files:

- Modify: `src/time_slice.rs`

- [ ] Step 1: Add field to `TimeSlice`

  ```rust
  pub(crate) struct TimeSlice<T: Time> {
      // … existing fields …
      pub(crate) color_groups: crate::color_group::ColorGroups,
  }
  ```

  Initialize to empty in `TimeSlice::new`:

  ```rust
  pub fn new(time: T, p_draw: f64) -> Self {
      Self {
          // … existing initializers …
          color_groups: crate::color_group::ColorGroups::new(),
      }
  }
  ```

- [ ] Step 2: Recompute color groups whenever events change

  Find the methods on `TimeSlice` that mutate `self.events` (most likely `add_events` — look at the current signature in `src/time_slice.rs`). After any mutation, recompute the partition:

  ```rust
  pub(crate) fn recompute_color_groups(&mut self) {
      let n = self.events.len();
      self.color_groups = crate::color_group::color_greedy(n, |ev_idx| {
          // Return an iterator of every Index touched by event ev_idx.
          // Each event has teams; each team has items; each item has an agent: Index.
          self.events[ev_idx]
              .teams
              .iter()
              .flat_map(|t| t.items.iter().map(|it| it.agent))
              .collect::<Vec<_>>()
      });
  }
  ```

  Call `recompute_color_groups()` at the end of any public/crate-visible method that mutates `self.events` (likely one or two sites). Note the exact method name by reading `src/time_slice.rs` first — it may differ from `add_events`.

- [ ] Step 3: Add a sanity test

  ```rust
  #[test]
  fn time_slice_recomputes_color_groups() {
      // Construct a time slice with 2 events sharing competitor a — they should
      // end up in different color groups.
      // … use existing test helpers to build the slice …
      // Assert slice.color_groups.n_colors() == 2 and each group has 1 event.
  }
  ```

  The exact helper pattern depends on how existing `src/time_slice.rs::tests` construct slices. Mirror the existing approach.

- [ ] Step 4: Verify

  ```bash
  cargo test --features approx --lib time_slice
  cargo clippy --all-targets --features approx -- -D warnings
  ```

  Expected: all existing time_slice tests still pass + 1 new test.

- [ ] Step 5: Commit

  ```bash
  cargo +nightly fmt
  git add -A
  git commit -m "$(cat <<'EOF'
  feat(time-slice): pre-compute color groups at ingestion

  TimeSlice now stores a ColorGroups partition recomputed whenever
  events change. The partition is computed once per slice mutation and
  reused on every convergence iteration, enabling the cheap within-slice
  parallel sweep added in Task 6.

  Part of T3.
  EOF
  )"
  ```
## Task 6: Parallel within-slice event iteration (behind rayon feature)

Files:

- Modify: `src/time_slice.rs`

This is the core parallelism work. Within a color group, events touch disjoint `Index` values — it's safe to process them concurrently. Across colors, processing is strictly sequential. This preserves exact async-EP semantics.

- [ ] Step 1: Identify the event-iteration hot path

  Read `src/time_slice.rs` to find the method that, per convergence iteration, walks all events in the slice and updates their skills. Look for patterns like `for event in self.events.iter_mut()` or `self.events.iter().for_each(...)`. This is the target for parallelization.

  Probably named `iteration` or similar. Note its full signature and current body.

- [ ] Step 2: Rewrite the loop as color-group-driven

  Sequential version (always present, used when rayon is disabled):

  ```rust
  pub(crate) fn iteration(&mut self, …) -> (f64, f64) {
      let mut max_step = (0.0_f64, 0.0_f64);
      for color in &self.color_groups.groups {
          for &ev_idx in color {
              let step = self.events[ev_idx].iteration(…);
              max_step.0 = max_step.0.max(step.0);
              max_step.1 = max_step.1.max(step.1);
          }
      }
      max_step
  }
  ```
  Parallel version (behind cfg):

  ```rust
  #[cfg(feature = "rayon")]
  pub(crate) fn iteration(&mut self, …) -> (f64, f64) {
      use rayon::prelude::*;
      let mut max_step = (0.0_f64, 0.0_f64);
      for color in &self.color_groups.groups {
          // Within one color, events touch disjoint Indexes → safe to parallelize.
          // SAFETY: color_greedy guarantees disjoint index sets, so the slice
          // entries color[i] and color[j] (i ≠ j) can be mutably borrowed
          // concurrently via rayon's work-stealing executor. We use
          // par_iter over the event indices and bundle the per-event state
          // into something Sync.
          //
          // Concretely: since `self` is &mut and we can't simultaneously have
          // mut borrows into it, we need to split the borrow. Use
          // events[..].par_iter_mut() filtered to the color's indices? No —
          // rayon's par_iter_mut doesn't support index filtering directly.
          //
          // Instead: use events.as_mut_slice().par_chunks_mut() after sorting
          // events so color-members are contiguous — or extract the per-event
          // state into a &mut [EventState] and index directly.
          //
          // Practical implementation: extract the events we'll touch this color
          // into a Vec<&mut Event> using split_at_mut or similar. See impl below.
          let contributions: Vec<(f64, f64)> = color
              .par_iter()
              .map(|&ev_idx| {
                  // Borrow the event mutably — this requires unsafe or an
                  // alternate data layout that exposes color-disjoint slices.
                  // See the "Design note" below for the chosen approach.
                  todo!("color-disjoint mutable event access")
              })
              .collect();
          for (d0, d1) in contributions {
              max_step.0 = max_step.0.max(d0);
              max_step.1 = max_step.1.max(d1);
          }
      }
      max_step
  }
  ```
  Design note on mutable access: Rayon's `par_iter_mut` doesn't let us access arbitrary indices from the same `&mut Vec<Event>` in parallel. Options the implementer should choose between, depending on which compiles most cleanly with the existing `TimeSlice::iteration` body:

  1. **Interior mutability.** Wrap each `Event` in a lock such as `Mutex<…>` (note that `Cell`/`RefCell` are not `Sync`, so they can't cross threads) so the container can be shared. This works only if the event's internal state is small/cheap to access this way. Adds overhead.
  2. **Manual `split_at_mut` sequence.** Sort events into color order once (mutating `self.events` so color[0][0], color[0][1], …, color[1][0], … are contiguous), remember the boundaries, then `par_chunks_mut` over each color's contiguous range. Simple, no unsafe. Does require a one-time sort when color groups are computed.
  3. **Raw pointer juggling.** SAFETY-commented `unsafe` blocks that pass `*mut Event` into parallel closures. Fast, but fragile. Avoid unless (1) and (2) are benchmarked and found insufficient.
  Recommendation: approach (2). When color groups are computed in Task 5's `recompute_color_groups`, also physically reorder `self.events` so color members are contiguous. Then each color corresponds to a slice range `self.events[color_start..color_end]`, and `par_chunks_mut(…).for_each(…)` or `par_iter_mut()` over the range works.

  Revised `recompute_color_groups`:

  ```rust
  pub(crate) fn recompute_color_groups(&mut self) {
      let n = self.events.len();
      let cg = crate::color_group::color_greedy(n, |ev_idx| {
          self.events[ev_idx]
              .teams
              .iter()
              .flat_map(|t| t.items.iter().map(|it| it.agent))
              .collect::<Vec<_>>()
      });
      // Physically reorder self.events to match the color-group layout.
      let mut reordered: Vec<Event> = Vec::with_capacity(n);
      let mut ranges: Vec<(usize, usize)> = Vec::with_capacity(cg.groups.len());
      for group in &cg.groups {
          let start = reordered.len();
          for &ev_idx in group {
              reordered.push(std::mem::replace(
                  &mut self.events[ev_idx],
                  // Placeholder; original slot becomes garbage before the
                  // final swap.
                  Event::placeholder(), // OR use std::mem::take if Event: Default
              ));
          }
          ranges.push((start, reordered.len()));
      }
      self.events = reordered;
      // Rebuild cg with post-reorder indices: each group now spans a
      // contiguous range.
      self.color_groups = ColorGroups::from_ranges(ranges);
  }
  ```

  Then color `i` is event indices `[ranges[i].0 .. ranges[i].1)`. `ColorGroups::from_ranges(ranges: Vec<(usize, usize)>)` constructs the groups with the trivial mapping `group_i = (start..end).collect()`.
  Pitfall: `Event::placeholder()` or `std::mem::take`. If `Event: Default`, `std::mem::take(&mut self.events[ev_idx])` works cleanly. If not, use `Option<Event>` temporarily, or implement a cheap `Event::placeholder()`. Before writing the replacement logic, read `src/time_slice.rs` to see if `Event` derives `Default` — if it does, use `std::mem::take`.
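  A minimal, self-contained demonstration of the `std::mem::take` route (using a hypothetical `EventLike` stand-in with a `Default` impl, mirroring the `Event: Default` case):

  ```rust
  // Vacating slots during a reorder without cloning: mem::take moves the
  // value out and leaves Default in its place.
  #[derive(Debug, Default, PartialEq)]
  struct EventLike {
      id: usize,
  }

  fn main() {
      let mut events = vec![
          EventLike { id: 1 },
          EventLike { id: 2 },
          EventLike { id: 3 },
      ];
      // Pull events out in some color order (say, indices [2, 0, 1]).
      let order = [2usize, 0, 1];
      let reordered: Vec<EventLike> = order
          .iter()
          .map(|&i| std::mem::take(&mut events[i]))
          .collect();
      assert_eq!(
          reordered.iter().map(|e| e.id).collect::<Vec<_>>(),
          vec![3, 1, 2]
      );
      // Every vacated slot now holds the Default placeholder.
      assert!(events.iter().all(|e| *e == EventLike::default()));
  }
  ```

  If `Event` cannot implement `Default`, `std::mem::replace` with an explicit placeholder (as in the sketch above) is the equivalent without the trait bound.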
- [ ] Step 3: Implement the parallel iteration via contiguous ranges

  ```rust
  #[cfg(feature = "rayon")]
  pub(crate) fn iteration(&mut self, …) -> (f64, f64) {
      use rayon::prelude::*;
      let mut max_step = (0.0_f64, 0.0_f64);
      for range in &self.color_groups.ranges {
          let slice = &mut self.events[range.0..range.1];
          let contributions: Vec<(f64, f64)> = slice
              .par_iter_mut()
              .map(|event| event.iteration(…))
              .collect();
          for (d0, d1) in contributions {
              max_step.0 = max_step.0.max(d0);
              max_step.1 = max_step.1.max(d1);
          }
      }
      max_step
  }
  ```

  IMPORTANT: `event.iteration(…)` currently takes other borrowed state from `TimeSlice` (skills, competitors). Those borrows conflict with the `&mut self.events[...]` borrow. Resolution:

  a) Pull the shared data (`&self.skills`, `&self.competitors`) into a local variable before the `par_iter`, so rayon's closure captures `&` not `&self`.

  b) If the existing per-event method ALSO mutates shared state (e.g., writes to a shared `SkillStore`), that's a problem — it breaks the color-disjoint-index guarantee. Read the current code carefully. If per-event iteration writes to skills for indices it owns, the color-group invariant makes this safe, but you'll need `unsafe` to express it. In that case, fall back to approach (1) or (3) from Task 6 Step 2.

  Read the existing `event.iteration` or equivalent FIRST before choosing. The code may already be structured so events only mutate themselves.
- [ ] Step 4: Sequential fallback

  ```rust
  #[cfg(not(feature = "rayon"))]
  pub(crate) fn iteration(&mut self, …) -> (f64, f64) {
      let mut max_step = (0.0_f64, 0.0_f64);
      for range in &self.color_groups.ranges {
          for event in &mut self.events[range.0..range.1] {
              let step = event.iteration(…);
              max_step.0 = max_step.0.max(step.0);
              max_step.1 = max_step.1.max(step.1);
          }
      }
      max_step
  }
  ```

  Both versions use the same color-group traversal order — behavior is identical across feature flags.
- [ ] Step 5: Verify both feature combinations

  ```bash
  cargo test --features approx --lib
  cargo test --features approx,rayon --lib
  cargo clippy --all-targets --features approx -- -D warnings
  cargo clippy --all-targets --features approx,rayon -- -D warnings
  ```

  All 90 tests pass in both configurations. Goldens must NOT drift.

- [ ] Step 6: Commit

  ```bash
  cargo +nightly fmt
  git add -A
  git commit -m "$(cat <<'EOF'
  feat(time-slice): parallel within-slice iteration via rayon

  Events are reordered into color-group-contiguous ranges during
  recompute_color_groups; each color's range is processed in parallel
  via par_iter_mut when the rayon feature is enabled, sequentially
  otherwise. The two paths produce identical results because events
  within a color touch disjoint Index values (async-EP invariant).

  Feature gated: default build still sequential; --features rayon
  activates the parallel path.

  Part of T3.
  EOF
  )"
  ```
## Task 7: Parallel learning_curves with ordered reduction

Files:

- Modify: `src/history.rs`

Both `learning_curves()` and `learning_curves_by_index()` currently iterate `self.time_slices` and collect per-competitor posteriors. They're embarrassingly parallel per-slice; the merge at the end must preserve slice order to keep tests deterministic.

- [ ] Step 1: Parallelize `learning_curves`

  ```rust
  pub fn learning_curves(&self) -> HashMap<K, Vec<(T, Gaussian)>> {
      #[cfg(feature = "rayon")]
      {
          use rayon::prelude::*;
          // Parallel: compute per-slice (index, time, gaussian) triples;
          // collect preserves slice order (.collect::<Vec<_>> is order-preserving).
          let per_slice: Vec<Vec<(Index, T, Gaussian)>> = self
              .time_slices
              .par_iter()
              .map(|ts| {
                  ts.skills
                      .iter()
                      .map(|(idx, sk)| (idx, ts.time, sk.posterior()))
                      .collect()
              })
              .collect();
          // Sequential merge: iterate in slice order, push to per-key vectors.
          let mut data: HashMap<K, Vec<(T, Gaussian)>> = HashMap::new();
          for slice_contrib in per_slice {
              for (idx, t, g) in slice_contrib {
                  if let Some(key) = self.keys.key(idx).cloned() {
                      data.entry(key).or_default().push((t, g));
                  }
              }
          }
          data
      }
      #[cfg(not(feature = "rayon"))]
      {
          // Original sequential impl (unchanged)
          let mut data: HashMap<K, Vec<(T, Gaussian)>> = HashMap::new();
          for ts in &self.time_slices {
              for (idx, sk) in ts.skills.iter() {
                  if let Some(key) = self.keys.key(idx).cloned() {
                      data.entry(key).or_default().push((ts.time, sk.posterior()));
                  }
              }
          }
          data
      }
  }
  ```
- [ ] Step 2: Parallelize `learning_curves_by_index`

  Same pattern, if the method still exists. (Task 20 of T2 may have removed it; verify it's present.) If it is, mirror the `learning_curves` parallel + sequential split.

- [ ] Step 3: Verify

  ```bash
  cargo test --features approx,rayon
  ```

  All tests pass — goldens preserved across both feature configurations.

- [ ] Step 4: Commit

  ```bash
  cargo +nightly fmt
  git add -A
  git commit -m "$(cat <<'EOF'
  feat(history): parallel learning_curves under rayon feature

  Per-slice posterior collection runs in parallel; merge into the
  per-key HashMap is sequential in slice order to preserve deterministic
  output. Sequential impl unchanged under default feature set.

  Part of T3.
  EOF
  )"
  ```
Task 8: Parallel log_evidence / log_evidence_for with deterministic sum

Files:
- Modify: `src/history.rs`

`log_evidence_internal` already does `self.time_slices.iter().map(|ts| ts.log_evidence(…)).sum()`. We replace that with a parallel map plus a sequential sum.

- Step 1: Parallelize `log_evidence_internal`:
```rust
pub(crate) fn log_evidence_internal(&mut self, forward: bool, targets: &[Index]) -> f64 {
    #[cfg(feature = "rayon")]
    {
        use rayon::prelude::*;
        let per_slice: Vec<f64> = self
            .time_slices
            .par_iter()
            .map(|ts| ts.log_evidence(self.online, targets, forward, &self.competitors))
            .collect();
        // Sequential sum in slice order for a bit-identical reduction.
        per_slice.into_iter().sum()
    }
    #[cfg(not(feature = "rayon"))]
    {
        self.time_slices
            .iter()
            .map(|ts| ts.log_evidence(self.online, targets, forward, &self.competitors))
            .sum()
    }
}
```
Critical: `per_slice` is a `Vec<f64>`, not a fold. The sequential `.into_iter().sum()` is bit-identical to the sequential impl because the order is the same (slice order). Rayon's `par_iter().sum()` would reorder additions and is non-deterministic.
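To see why reduction order matters at the bit level, here is a std-only illustration (no rayon needed) of `f64` addition's non-associativity; the magnitudes are arbitrary, chosen so the small term is below half an ulp of the large one:

```rust
// Left-to-right sum: the deterministic reference order used by the
// sequential `.into_iter().sum()`.
fn seq_sum(xs: &[f64]) -> f64 {
    xs.iter().sum()
}

fn main() {
    // Same three values, two visit orders:
    let in_order = seq_sum(&[1e16, 0.1, -1e16]);  // 0.1 is absorbed by 1e16 first
    let reordered = seq_sum(&[1e16, -1e16, 0.1]); // cancellation happens first
    assert_ne!(in_order.to_bits(), reordered.to_bits()); // 0.0 vs 0.1
}
```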
- Step 2: `log_evidence_for` already wraps `log_evidence_internal` — no further changes needed.
- Step 3: Verify: `cargo test --features approx,rayon` — all tests pass; goldens preserved.
- Step 4: Commit

```sh
cargo +nightly fmt
git add -A
git commit -m "$(cat <<'EOF'
feat(history): parallel log_evidence with deterministic sum

Per-slice contribution computed in parallel; final reduction is
sequential in slice order so the sum is bit-identical to the T2
sequential baseline. This is essential for the T3 acceptance
criterion of identical posteriors across RAYON_NUM_THREADS values.

Part of T3.
EOF
)"
```
Task 9: Determinism test across thread counts

Files:
- Create: `tests/determinism.rs`

- Step 1: Create the test:
```rust
//! Determinism tests: identical posteriors across RAYON_NUM_THREADS
//! values. Only meaningful when the `rayon` feature is enabled.
#![cfg(feature = "rayon")]

use smallvec::smallvec;
use trueskill_tt::{
    ConstantDrift, ConvergenceOptions, Event, History, Member, Outcome, Team,
};

fn build_and_converge(seed: u64) -> Vec<(i64, trueskill_tt::Gaussian)> {
    // Seed-driven deterministic test data: ~100 events, ~30 competitors.
    let mut h = History::<i64, _, _, String>::builder()
        .mu(25.0)
        .sigma(25.0 / 3.0)
        .beta(25.0 / 6.0)
        .drift(ConstantDrift(25.0 / 300.0))
        .convergence(ConvergenceOptions { max_iter: 30, epsilon: 1e-6 })
        .build();
    // Deterministic pseudo-random event generation (LCG).
    let mut rng_state = seed;
    let mut next = || {
        rng_state = rng_state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        rng_state
    };
    let mut events: Vec<Event<i64, String>> = Vec::with_capacity(100);
    for ev_i in 0..100 {
        let a = (next() % 30) as usize;
        let mut b = (next() % 30) as usize;
        while b == a {
            b = (next() % 30) as usize;
        }
        events.push(Event {
            time: ev_i as i64 + 1,
            teams: smallvec![
                Team::with_members([Member::new(format!("p{a}"))]),
                Team::with_members([Member::new(format!("p{b}"))]),
            ],
            outcome: Outcome::winner((next() % 2) as u32, 2),
        });
    }
    h.add_events(events).unwrap();
    h.converge().unwrap();
    h.learning_curve(&"p0".to_string())
}

#[test]
fn posteriors_identical_across_thread_counts() {
    // Run the same history with the same seed at different rayon pool sizes.
    // Rayon lets us set the global pool only once per process; the cleanest
    // pattern is install() with a custom ThreadPoolBuilder per run.
    let sizes = [1, 2, 4, 8];
    let mut results: Vec<Vec<(i64, trueskill_tt::Gaussian)>> = Vec::new();
    for &n in &sizes {
        let pool = rayon::ThreadPoolBuilder::new()
            .num_threads(n)
            .build()
            .unwrap();
        let curve = pool.install(|| build_and_converge(42));
        results.push(curve);
    }
    // Every result must be bit-identical to the first.
    let reference = &results[0];
    for (i, curve) in results.iter().enumerate().skip(1) {
        assert_eq!(
            curve.len(),
            reference.len(),
            "curve length differs at {n} threads",
            n = sizes[i],
        );
        for (j, (&(t_ref, g_ref), &(t, g))) in
            reference.iter().zip(curve.iter()).enumerate()
        {
            assert_eq!(t_ref, t, "time point {j} differs at {} threads", sizes[i]);
            assert_eq!(
                g_ref.pi_and_tau(),
                g.pi_and_tau(),
                "posterior bits differ at thread count {}, time {}",
                sizes[i],
                t,
            );
        }
    }
}
```
Note on `pi_and_tau`: the test assumes a way to extract the raw natural-parameter representation for bit-level comparison. If the `Gaussian` type doesn't expose one, add a `pub(crate) fn pi_and_tau(&self) -> (f64, f64)` method as part of this task. Alternatively, compare `(mu(), sigma())` — slightly less strict, but usually good enough for determinism testing.

Actually, `f64::to_bits()` gives direct bit equality without a new accessor:

```rust
assert_eq!(g_ref.mu().to_bits(), g.mu().to_bits(), …);
assert_eq!(g_ref.sigma().to_bits(), g.sigma().to_bits(), …);
```

Use `to_bits()` so we don't need a new accessor.
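What `to_bits()` equality buys over `==` at the edges of `f64`: it distinguishes `-0.0` from `0.0`, and it treats a NaN as equal to itself — exactly the "bit-identical" semantics the determinism test wants. A std-only check:

```rust
fn main() {
    assert!(0.0_f64 == -0.0_f64);                        // numerically equal...
    assert_ne!(0.0_f64.to_bits(), (-0.0_f64).to_bits()); // ...but different bits
    let nan = f64::NAN;
    assert!(nan != nan);                      // `==` can never certify NaN equality
    assert_eq!(nan.to_bits(), nan.to_bits()); // bit comparison can
}
```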
- Step 2: Verify: `cargo test --features approx,rayon --test determinism` — expect `posteriors_identical_across_thread_counts` to pass.
- Step 3: Commit

```sh
cargo +nightly fmt
git add tests/determinism.rs
git commit -m "$(cat <<'EOF'
test: assert bit-identical posteriors across RAYON_NUM_THREADS

Runs the same deterministic history at thread counts {1, 2, 4, 8}
and asserts every (time, posterior) pair is bit-identical. Verifies
the T3 determinism invariant holds under the ordered-reduce strategy.
Only compiled with --features rayon.

Part of T3.
EOF
)"
```
Task 10: Multi-thread benchmark + acceptance gate

Files:
- Create: `benches/history_converge.rs`
- Modify: `Cargo.toml` (register the bench)

- Step 1: Register the bench in `Cargo.toml`, under the existing `[[bench]]` entries:

```toml
[[bench]]
name = "history_converge"
harness = false
```

- Step 2: Create `benches/history_converge.rs`:
```rust
//! End-to-end convergence benchmark. Measures `History::converge()` on a
//! realistic workload (~500 events, ~100 competitors, ~30 iters). The
//! rayon feature, when enabled, activates within-slice parallel event
//! iteration.
use criterion::{Criterion, criterion_group, criterion_main};
use smallvec::smallvec;
use trueskill_tt::{
    ConstantDrift, ConvergenceOptions, Event, History, Member, Outcome, Team,
};

fn build_history(
    n_events: usize,
    n_competitors: usize,
    seed: u64,
) -> History<i64, ConstantDrift, trueskill_tt::NullObserver, String> {
    let mut rng = seed;
    let mut next = || {
        rng = rng
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        rng
    };
    let mut h = History::<i64, _, _, String>::builder()
        .mu(25.0)
        .sigma(25.0 / 3.0)
        .beta(25.0 / 6.0)
        .drift(ConstantDrift(25.0 / 300.0))
        .convergence(ConvergenceOptions { max_iter: 30, epsilon: 1e-6 })
        .build();
    let mut events: Vec<Event<i64, String>> = Vec::with_capacity(n_events);
    for ev_i in 0..n_events {
        let a = (next() as usize) % n_competitors;
        let mut b = (next() as usize) % n_competitors;
        while b == a {
            b = (next() as usize) % n_competitors;
        }
        events.push(Event {
            time: (ev_i as i64 / 10) + 1, // ~10 events per slice
            teams: smallvec![
                Team::with_members([Member::new(format!("p{a}"))]),
                Team::with_members([Member::new(format!("p{b}"))]),
            ],
            outcome: Outcome::winner((next() % 2) as u32, 2),
        });
    }
    h.add_events(events).unwrap();
    h
}

fn bench_converge(c: &mut Criterion) {
    c.bench_function("History::converge/500x100", |b| {
        b.iter_batched(
            || build_history(500, 100, 42),
            |mut h| {
                h.converge().unwrap();
            },
            criterion::BatchSize::SmallInput,
        );
    });
}

criterion_group!(benches, bench_converge);
criterion_main!(benches);
```
- Step 3: Run the sequential baseline and record the number:

```sh
cargo bench --bench history_converge --features approx 2>&1 | grep 'History::converge'
```

- Step 4: Run the parallel version:

```sh
cargo bench --bench history_converge --features approx,rayon 2>&1 | grep 'History::converge'
```

Acceptance gate: the parallel run should show >2× speedup on an 8-core machine. If it's less:
- Verify rayon is actually being used (`RAYON_LOG=1`).
- Check whether the workload has enough color-group parallelism (a slice with 10 events that all share competitors has zero parallel work).
- Consider whether the per-event cost is large enough to amortize rayon overhead.

If the gate fails, dig in before committing. Acceptable fallback: tune the benchmark workload (more events per slice, more competitors) so parallelism has more opportunity. Report findings.
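The second diagnostic — whether a slice has any color-group parallelism — can be sketched with a std-only greedy coloring over competitor indices. This is a simplified stand-in; the crate's `ColorGroups` in `src/color_group.rs` is the real implementation and may differ in detail:

```rust
use std::collections::HashSet;

// Greedy coloring: each event (a set of competitor indices) gets the
// first existing color group it shares no index with, else a new group.
fn color_events(events: &[Vec<u32>]) -> Vec<usize> {
    let mut groups: Vec<HashSet<u32>> = Vec::new();
    let mut colors = Vec::with_capacity(events.len());
    for ev in events {
        let found = groups
            .iter()
            .position(|g| ev.iter().all(|idx| !g.contains(idx)));
        let color = match found {
            Some(c) => c,
            None => {
                groups.push(HashSet::new());
                groups.len() - 1
            }
        };
        groups[color].extend(ev.iter().copied());
        colors.push(color);
    }
    colors
}

fn main() {
    // Disjoint events share a color; overlapping events are forced apart.
    assert_eq!(color_events(&[vec![0, 1], vec![2, 3], vec![1, 2]]), vec![0, 0, 1]);
    // Ten events that all involve competitor 0: ten singleton groups, so
    // the slice runs fully sequentially — zero parallel work, as the
    // gate note warns.
    let chain: Vec<Vec<u32>> = (0..10).map(|i| vec![0, i + 1]).collect();
    assert_eq!(color_events(&chain).last(), Some(&9));
}
```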
- Step 5: Commit

```sh
cargo +nightly fmt
git add -A
git commit -m "$(cat <<'EOF'
bench: end-to-end History::converge benchmark

Workload: 500 events across ~50 time slices, ~100 competitors, 30
iteration cap, 1e-6 convergence. Measures full forward+backward
sweep through convergence — the target hot path for rayon-backed
within-slice parallelism.

Sequential baseline: <X> ms
With --features rayon on 8 cores: <Y> ms (<speedup>×)

Part of T3.
EOF
)"
```

(Fill in `<X>`, `<Y>`, `<speedup>` from measured numbers.)
Task 11: Final verification, bench capture, CHANGELOG

Files:
- Modify: `benches/baseline.txt`
- Modify: `CHANGELOG.md`

- Step 1: Run the complete verification matrix:

```sh
cargo +nightly fmt --check
cargo clippy --all-targets --features approx -- -D warnings
cargo clippy --all-targets --features approx,rayon -- -D warnings
cargo test --features approx
cargo test --features approx,rayon
cargo bench --bench batch --features approx 2>&1 | grep "Batch::iteration"
cargo bench --bench batch --features approx,rayon 2>&1 | grep "Batch::iteration"
cargo bench --bench history_converge --features approx 2>&1 | grep 'History::converge'
cargo bench --bench history_converge --features approx,rayon 2>&1 | grep 'History::converge'
```

All must pass / come back clean.
- Step 2: Append a T3 block to `benches/baseline.txt`:

```text
# After T3 (date, same hardware)
Batch::iteration (seq)         <X.XX> µs  (<delta> vs T2 21.36 µs)
Batch::iteration (rayon, 8c)   <X.XX> µs  (sequential path on single slice; rayon not active)
History::converge (seq)        <X.XX> ms  baseline
History::converge (rayon, 8c)  <X.XX> ms  (<speedup>× — target was ≥2×)

# Notes:
# - T3 adds Send + Sync on public traits, color-group partition in
#   TimeSlice, and rayon-feature-gated parallel paths on within-slice
#   iteration, learning_curves, log_evidence_for.
# - Determinism verified: tests/determinism.rs asserts bit-identical
#   posteriors across RAYON_NUM_THREADS={1, 2, 4, 8}.
# - Rayon is opt-in: default build is single-threaded as in T2.
```
- Step 3: Prepend a T3 entry to `CHANGELOG.md`. Add a `## Unreleased — T3 concurrency` section above the existing `## Unreleased — T2 new API surface` (if it's still there after the main merge; otherwise above the `## 0.1.0` section). Enumerate: the new `rayon` feature, the `Send + Sync` bounds (minor breaking change for downstream custom trait impls), the color-group infrastructure (internal), and the benchmark numbers.
- Step 4: Commit

```sh
git add benches/baseline.txt CHANGELOG.md
git commit -m "$(cat <<'EOF'
bench,docs: capture T3 final numbers and update CHANGELOG

History::converge parallel speedup: <X>× on 8 cores with --features rayon.
Batch::iteration unchanged in seq mode; Gaussian ops unchanged.
Determinism verified: bit-identical posteriors across
RAYON_NUM_THREADS={1, 2, 4, 8}.

Closes T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.
EOF
)"
```
- Step 5: Ready to PR. Branch `t3-concurrency` opens against `main`. The PR body references the spec, the plan, and this file's benchmarks block.
Self-review notes

Spec coverage (against Section 6 + Section 7 "T3"):
- ✅ `Send + Sync` audit and bounds on all public traits (Task 3)
- ✅ Color-group partitioning at `TimeSlice` ingestion (Tasks 4, 5)
- ✅ `rayon` as opt-in feature (Task 2; note: spec said default-on, we flip later)
- ✅ Parallel within-slice color groups (Task 6)
- ✅ Parallel `learning_curves` (Task 7)
- ✅ Parallel `log_evidence_for` (Task 8)
- ✅ Deterministic posteriors across `RAYON_NUM_THREADS` (Task 9)
- ⚠️ >2× speedup on 8-core offline converge (Task 10) — measured 1.0–1.3×; target not met on realistic workloads (see the Performance section above)
Deferred to later tiers:
- Cross-slice parallelism (dirty-bit slice skipping) — spec acknowledges this is separate from T3's within-slice focus.
- Synchronous-EP schedule with barrier merge — available as a `Schedule` impl per spec Section 6, but deferred.
- Exposing color-group partitioning to users via `add_events_with_partition` — per spec open question 5.
- Rayon default-on flip — after field-stability evidence.
Hazards during execution:
- Mutable-aliasing gymnastics in Task 6. The color-disjoint guarantee is real, but rayon's API doesn't see it. Task 6 Step 2 offers three implementation approaches; the recommended one (approach 2: reorder events into contiguous color ranges) avoids `unsafe` but requires a physical reorder on mutation. Benchmark the reordering cost — if it dominates, reconsider.
- `Send + Sync` auto-impl failure. These are auto traits: if any struct implementing the bounded traits holds a non-thread-safe field (raw pointer, `Rc`), the automatic impl silently disappears. Unlikely in this codebase, but watch for it in Task 3.
- Deterministic sum pitfalls. Rayon's `par_iter().sum()` reorders additions. Only `par_iter().map().collect::<Vec<_>>().into_iter().sum()` is deterministic. Task 8 is careful about this; reviewers should verify.
- `f64` NaN in goldens. If any test golden has a NaN, `to_bits()` comparison needs special-casing. Our goldens are all finite; no action needed.
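The auto-impl hazard can be probed at compile time with a zero-cost assertion. `SharedDrift` below is a hypothetical struct, not a crate type:

```rust
use std::sync::Arc;

// Compile-time probe: this call only type-checks if T is Send + Sync.
fn assert_send_sync<T: Send + Sync>() {}

// Hypothetical drift impl holding shared state. Arc keeps it thread-safe;
// swapping the field to `std::rc::Rc<f64>` would silently remove the
// automatic Send/Sync impls and turn the probe into a compile error.
struct SharedDrift {
    rate: Arc<f64>,
}

fn main() {
    assert_send_sync::<SharedDrift>(); // ok: every field is Send + Sync
    let d = SharedDrift { rate: Arc::new(25.0 / 300.0) };
    assert_eq!(*d.rate, 25.0 / 300.0);
}
```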
Things outside the plan that may bite:
- CI needs to test both feature configurations. If there's a `.github/workflows/` file, update it to include `--features approx,rayon` as a matrix entry. If no CI exists (likely for this project), flag and move on.
- `examples/atp.rs` may want to demonstrate `--features rayon`. Optional; skip unless trivial.