trueskill-tt/docs/superpowers/plans/2026-04-24-t3-concurrency.md
Anders Olsson f0793a8470 docs: add T3 concurrency implementation plan
11-task plan for rayon-backed within-slice parallelism per
Section 6 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 13:34:00 +02:00


T3 — Concurrency Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Ship the T3 tier from docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md Section 6: Send + Sync bounds on public traits, color-group partitioning for within-slice event independence, and rayon-backed parallel paths on within-slice event iteration, learning_curves, and log_evidence_for. Deterministic posteriors across RAYON_NUM_THREADS={1, 2, 4, 8}; >2× speedup on an 8-core offline-converge benchmark.

Architecture: Concurrency lands as a single feature flag (rayon, opt-in for T3; the spec suggests default-on but we defer that flip until the feature is proven stable). All parallel paths are hidden behind #[cfg(feature = "rayon")] with sequential fallbacks. Within-slice parallelism exploits graph coloring: events sharing no Index go into the same color group and run concurrently; across color groups, execution is strictly sequential. This preserves the exact async-EP semantics of T2 at any thread count. Reductions over f64 (log_evidence_for, predict_quality) use two-stage parallel-then-sequential-reduce so sums are bit-identical regardless of thread count.

Tech Stack: Rust 2024 edition. New optional dependency: rayon = "1". Builds on T2 (History<T, D, O, K>, Event<T, K>, Outcome, Observer, factors module).

Design decisions

Called out explicitly so reviewers can override before execution:

  1. Rayon is opt-in (cargo feature), not default-on. Simplifies CI, keeps the default build lean. We flip to default-on in a follow-up once the feature is shown to be stable under field use.
  2. Greedy graph coloring for the within-slice partition: for each event in ingestion order, assign the lowest color whose existing members share no Index with the event. Optimality is not the target — the number of events per slice is small (~50), and greedy finishes in O(n·c·m) where c is the number of colors and m is team size. Rebalancing is a T4 follow-up if benchmarks show it helps.
  3. Scope of parallelism: within-slice color groups, learning_curves, log_evidence_for, predict_quality. Cross-slice iteration in History::converge stays sequential — the forward/backward sweep has true data dependencies across slices. Parallelizing across slices requires a separate algorithm (the spec's dirty-bit slice skipping, deferred to beyond-T3).
  4. Deterministic reductions: .par_iter().map().collect::<Vec<_>>().into_iter().sum() — each slice's contribution is computed in parallel, then summed sequentially in slice order. Per-element values are bit-identical to T2 (no extra floating-point additions reordered).
  5. Within-game inference stays sequential. A single Game::ranked / Game::likelihoods call is too small (~20 µs) to amortize rayon's ~5 µs task overhead. The spec's table confirms this.

Acceptance criteria

  • cargo test --features approx — all tests pass (T2 baseline: 90 tests).
  • cargo test --features approx,rayon — all tests pass; determinism tests across RAYON_NUM_THREADS={1, 2, 4, 8} produce bit-identical posteriors.
  • cargo clippy --all-targets --features approx -- -D warnings — clean.
  • cargo clippy --all-targets --features approx,rayon -- -D warnings — clean.
  • cargo +nightly fmt --check — clean.
  • cargo bench --bench history_converge --features approx,rayon — >2× speedup on an 8-core machine vs the sequential baseline.
  • All public traits (Time, Drift<T>, Observer<T>, Factor, Schedule) have Send + Sync bounds; their blanket/default impls (i64, Untimed, ConstantDrift, NullObserver) all naturally satisfy the new bounds — no user code should need changes.
  • The rayon feature is opt-in; default build without it compiles and passes all tests.

File map

New files:

Path Responsibility
src/color_group.rs Greedy graph coloring for within-slice event partitioning.
src/parallel.rs #[cfg(feature = "rayon")] helpers: par_or_seq iterators, ordered reductions, thread-count determinism tests.
benches/history_converge.rs End-to-end convergence benchmark (scales to show rayon benefit).
tests/determinism.rs Runs the same convergence with different RAYON_NUM_THREADS and asserts bit-identical posteriors.

Modified:

Path What changes
Cargo.toml Add rayon = { version = "1", optional = true }; declare rayon feature. Add [[bench]] name = "history_converge".
src/lib.rs Declare new color_group + parallel modules (both private).
src/time.rs Add Send + Sync + 'static bounds to Time.
src/drift.rs Add Send + Sync bound to Drift<T>.
src/observer.rs Add Send + Sync bound to Observer<T>.
src/factor/mod.rs Add Send + Sync bound to Factor.
src/schedule.rs Add Send + Sync bound to Schedule.
src/time_slice.rs Store pre-computed color-group partition; parallel event iteration behind #[cfg(feature = "rayon")].
src/history.rs Parallel learning_curves, learning_curves_by_index, log_evidence, log_evidence_for behind #[cfg(feature = "rayon")] with ordered reductions.

Task 1: Pre-flight — verify green on main, create t3 branch, capture baseline

Files: none

  • Step 1: Confirm on main, clean tree
git status
git rev-parse --abbrev-ref HEAD

Expected: clean; on main.

  • Step 2: Create the T3 branch
git checkout -b t3-concurrency
  • Step 3: Confirm all tests pass
cargo test --features approx

Expected: 90 tests pass (68 lib + 10 api_shape + 6 game + 4 record_winner + 2 equivalence).

  • Step 4: Capture current bench baseline
cargo bench --bench batch 2>&1 | grep "Batch::iteration"

Record the number — it'll be the sequential-path baseline to verify no regression.

  • Step 5: No commit — verification only.

Task 2: Add rayon as optional dependency + feature flag

Files:

  • Modify: Cargo.toml

  • Step 1: Add the dependency and feature

Under [dependencies] add:

rayon = { version = "1", optional = true }

Add a [features] section (if not present — check first):

[features]
rayon = ["dep:rayon"]

(The approx feature is already declared as a dep:approx-style optional dependency; follow the same convention.)

  • Step 2: Verify both builds compile
cargo build --features approx
cargo build --features approx,rayon

Both must succeed. The second one pulls rayon into the dependency graph but nothing uses it yet.

  • Step 3: Commit
git add Cargo.toml Cargo.lock
git commit -m "$(cat <<'EOF'
feat(cargo): add rayon as optional dependency

Opt-in feature flag — users who want parallel paths build with
--features rayon. Default build remains single-threaded.

Spec Section 6 calls for default-on; we defer that flip until the
feature is stable under field use.

Part of T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.
EOF
)"

Task 3: Add Send + Sync bounds to public traits

Files:

  • Modify: src/time.rs, src/drift.rs, src/observer.rs, src/factor/mod.rs, src/schedule.rs

This is a minor breaking API change: downstream code with user-defined Drift<T> / Observer<T> / Factor / Schedule impls that aren't Send + Sync will fail to compile. For our crate, all built-in types (i64, Untimed, ConstantDrift, NullObserver, EpsilonOrMax, TeamSumFactor, RankDiffFactor, TruncFactor, BuiltinFactor) naturally satisfy these bounds — no internal code changes.
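What the breaking change looks like downstream, sketched with a stand-in trait (the real Observer is generic over T; this non-generic version only illustrates the bound). A RefCell-backed counter would stop compiling under the new bounds; an atomic satisfies them without locking:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Stand-in for the post-Task-3 Observer trait (simplified, illustrative).
trait Observer: Send + Sync {
    fn on_iteration_end(&self, iter: usize);
}

// A downstream observer counting iterations. With `count: RefCell<usize>`
// this type would not be Sync and would now fail to compile; AtomicUsize
// restores thread safety.
struct IterationCounter {
    count: AtomicUsize,
}

impl Observer for IterationCounter {
    fn on_iteration_end(&self, _iter: usize) {
        self.count.fetch_add(1, Ordering::Relaxed);
    }
}

fn main() {
    let obs = IterationCounter { count: AtomicUsize::new(0) };
    for i in 0..5 {
        obs.on_iteration_end(i);
    }
    assert_eq!(obs.count.load(Ordering::Relaxed), 5);
}
```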

  • Step 1: Update Time trait
// src/time.rs
pub trait Time: Copy + Ord + Send + Sync + 'static {
    fn elapsed_to(&self, later: &Self) -> i64;
}
  • Step 2: Update Drift<T> trait
// src/drift.rs
pub trait Drift<T: Time>: Copy + Debug + Send + Sync {
    fn variance_delta(&self, from: &T, to: &T) -> f64;
    fn variance_for_elapsed(&self, elapsed: i64) -> f64;
}
  • Step 3: Update Observer<T> trait
// src/observer.rs
pub trait Observer<T: Time>: Send + Sync {
    fn on_iteration_end(&self, _iter: usize, _max_step: (f64, f64)) {}
    fn on_batch_processed(&self, _time: &T, _slice_idx: usize, _n_events: usize) {}
    fn on_converged(&self, _iters: usize, _final_step: (f64, f64), _converged: bool) {}
}
  • Step 4: Update Factor trait
// src/factor/mod.rs
pub trait Factor: Send + Sync {
    fn propagate(&mut self, vars: &mut VarStore) -> (f64, f64);
    fn log_evidence(&self, _vars: &VarStore) -> f64 { 0.0 }
}
  • Step 5: Update Schedule trait
// src/schedule.rs
pub trait Schedule: Send + Sync {
    fn run(&self, factors: &mut [BuiltinFactor], vars: &mut VarStore) -> ScheduleReport;
}
  • Step 6: Verify both feature combinations still compile and test
cargo build --features approx
cargo build --features approx,rayon
cargo test --features approx --lib
cargo clippy --all-targets --features approx -- -D warnings

If any built-in type fails the auto-Send/Sync check, investigate — something non-thread-safe slipped in. Likely suspects: raw pointers, Rc, RefCell, non-static references. None are expected in this codebase; if found, convert to Arc/Mutex/RwLock as appropriate.
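A quick compile-time probe for this check (the helper name is ours, not part of the crate):

```rust
use std::sync::Arc;

// Zero-cost probe: instantiating this for a type fails to compile
// unless the type is Send + Sync.
fn assert_send_sync<T: Send + Sync>() {}

fn main() {
    // Built-in-style types pass…
    assert_send_sync::<i64>();
    assert_send_sync::<Arc<Vec<f64>>>();
    // …while assert_send_sync::<std::rc::Rc<f64>>() (not Send or Sync)
    // or assert_send_sync::<std::cell::RefCell<f64>>() (not Sync) would
    // be compile errors — exactly the suspects named above.
    println!("all probed types are Send + Sync");
}
```

Dropping a few of these probes into a #[cfg(test)] module pinpoints which trait impl pulls in the non-thread-safe field.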

  • Step 7: Commit
cargo +nightly fmt
git add -A
git commit -m "$(cat <<'EOF'
feat(api): add Send + Sync bounds to public traits

Required for T3 rayon-based parallelism. Affected traits:
- Time (+ Send + Sync + 'static)
- Drift<T> (+ Send + Sync)
- Observer<T> (+ Send + Sync)
- Factor (+ Send + Sync)
- Schedule (+ Send + Sync)

All built-in impls (i64, Untimed, ConstantDrift, NullObserver,
EpsilonOrMax, TeamSumFactor, RankDiffFactor, TruncFactor,
BuiltinFactor) naturally satisfy these bounds — no internal changes
needed.

Minor breaking change: downstream impls that aren't already
thread-safe will fail to compile until they add the bounds.

Part of T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.
EOF
)"

Task 4: Implement greedy color-group partitioning

Files:

  • Create: src/color_group.rs

  • Modify: src/lib.rs (register module)

  • Step 1: Create src/color_group.rs

//! Greedy graph coloring for within-slice event independence.
//!
//! Events sharing no `Index` can be processed in parallel under async-EP
//! semantics. This module partitions a list of events into "colors" such
//! that events of the same color touch disjoint index sets.
//!
//! The algorithm is greedy: for each event in ingestion order, place it in
//! the lowest-numbered color whose existing members share no `Index`. If
//! no existing color accepts the event, open a new color.
//!
//! Complexity: O(n × c × m) where n is events, c is colors (small, ≤ 5 in
//! practice), and m is average team size.

use std::collections::HashSet;

use crate::Index;

/// Partition of event indices into color groups.
///
/// Each inner `Vec<usize>` holds the indices (into the original events
/// array) of events assigned to one color. Colors are iterated in ascending
/// order by convention.
#[derive(Clone, Debug, Default)]
pub(crate) struct ColorGroups {
    pub(crate) groups: Vec<Vec<usize>>,
}

impl ColorGroups {
    pub(crate) fn new() -> Self {
        Self::default()
    }

    pub(crate) fn n_colors(&self) -> usize {
        self.groups.len()
    }

    pub(crate) fn is_empty(&self) -> bool {
        self.groups.is_empty()
    }

    /// Total event count across all colors.
    pub(crate) fn total_events(&self) -> usize {
        self.groups.iter().map(|g| g.len()).sum()
    }
}

/// Compute color groups greedily.
///
/// `index_set` maps each event index to the set of `Index` values that
/// event touches. The returned `ColorGroups` has one inner `Vec<usize>` per
/// color, containing event indices in the order they were assigned.
pub(crate) fn color_greedy<I, F>(n_events: usize, index_set: F) -> ColorGroups
where
    F: Fn(usize) -> I,
    I: IntoIterator<Item = Index>,
{
    let mut groups: Vec<Vec<usize>> = Vec::new();
    let mut members: Vec<HashSet<Index>> = Vec::new();

    for ev_idx in 0..n_events {
        let ev_members: HashSet<Index> = index_set(ev_idx).into_iter().collect();
        // Find first color whose member-set is disjoint from this event's indices.
        let chosen = members
            .iter()
            .position(|m| m.is_disjoint(&ev_members));
        let color_idx = match chosen {
            Some(c) => c,
            None => {
                groups.push(Vec::new());
                members.push(HashSet::new());
                groups.len() - 1
            }
        };
        groups[color_idx].push(ev_idx);
        members[color_idx].extend(ev_members);
    }

    ColorGroups { groups }
}

#[cfg(test)]
mod tests {
    use super::*;

    fn idx(i: usize) -> Index {
        Index::from(i)
    }

    #[test]
    fn single_event_gets_one_color() {
        let cg = color_greedy(1, |_| vec![idx(0), idx(1)]);
        assert_eq!(cg.n_colors(), 1);
        assert_eq!(cg.groups[0], vec![0]);
    }

    #[test]
    fn disjoint_events_share_a_color() {
        // Event 0 touches {0, 1}; event 1 touches {2, 3}.
        let cg = color_greedy(2, |i| match i {
            0 => vec![idx(0), idx(1)],
            1 => vec![idx(2), idx(3)],
            _ => unreachable!(),
        });
        assert_eq!(cg.n_colors(), 1);
        assert_eq!(cg.groups[0], vec![0, 1]);
    }

    #[test]
    fn overlapping_events_need_separate_colors() {
        // Event 0 touches {0, 1}; event 1 touches {1, 2}.
        let cg = color_greedy(2, |i| match i {
            0 => vec![idx(0), idx(1)],
            1 => vec![idx(1), idx(2)],
            _ => unreachable!(),
        });
        assert_eq!(cg.n_colors(), 2);
        assert_eq!(cg.groups[0], vec![0]);
        assert_eq!(cg.groups[1], vec![1]);
    }

    #[test]
    fn three_events_two_colors() {
        // Event 0: {0, 1}; event 1: {2, 3}; event 2: {0, 2}.
        // Greedy: ev0→c0, ev1→c0 (disjoint), ev2 overlaps both→c1.
        let cg = color_greedy(3, |i| match i {
            0 => vec![idx(0), idx(1)],
            1 => vec![idx(2), idx(3)],
            2 => vec![idx(0), idx(2)],
            _ => unreachable!(),
        });
        assert_eq!(cg.n_colors(), 2);
        assert_eq!(cg.groups[0], vec![0, 1]);
        assert_eq!(cg.groups[1], vec![2]);
    }

    #[test]
    fn total_events_counts_correctly() {
        let cg = color_greedy(4, |_| vec![idx(0)]);
        // All events touch index 0 → 4 distinct colors.
        assert_eq!(cg.n_colors(), 4);
        assert_eq!(cg.total_events(), 4);
    }
}
  • Step 2: Register in src/lib.rs

Add (private module, alphabetical):

mod color_group;

No public re-export — this is internal infrastructure.

  • Step 3: Verify
cargo test --features approx --lib color_group
cargo clippy --all-targets --features approx -- -D warnings
cargo +nightly fmt --check

Expected: 5 tests pass in the color_group::tests module.

  • Step 4: Commit
cargo +nightly fmt
git add src/color_group.rs src/lib.rs
git commit -m "$(cat <<'EOF'
feat(color-group): add greedy within-slice event partitioning

ColorGroups holds a partition of event indices into color groups such
that events of the same color touch no shared Index. Computed greedily
in ingestion order: each event goes into the first color whose existing
members are disjoint from the event's indices.

Used in T3 for safe within-slice parallelism — events in the same
color can run concurrently without touching each other's skills.

Part of T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.
EOF
)"

Task 5: Store color-group partition in TimeSlice

Files:

  • Modify: src/time_slice.rs

  • Step 1: Add field to TimeSlice

pub(crate) struct TimeSlice<T: Time> {
    // … existing fields …
    pub(crate) color_groups: crate::color_group::ColorGroups,
}

Initialize to empty in TimeSlice::new:

pub fn new(time: T, p_draw: f64) -> Self {
    Self {
        // … existing initializers …
        color_groups: crate::color_group::ColorGroups::new(),
    }
}
  • Step 2: Recompute color groups whenever events change

Find the methods on TimeSlice that mutate self.events (most likely add_events — look at the current signature in src/time_slice.rs). After any mutation, recompute the partition:

pub(crate) fn recompute_color_groups(&mut self) {
    let n = self.events.len();
    self.color_groups = crate::color_group::color_greedy(n, |ev_idx| {
        // Return an iterator of every Index touched by event ev_idx.
        // Each event has teams; each team has items; each item has an agent: Index.
        self.events[ev_idx]
            .teams
            .iter()
            .flat_map(|t| t.items.iter().map(|it| it.agent))
            .collect::<Vec<_>>()
    });
}

Call recompute_color_groups() at the end of any public/crate-visible method that mutates self.events (likely one or two sites). Note the exact method name by reading src/time_slice.rs first — it may differ from "add_events".

  • Step 3: Add a sanity test
#[test]
fn time_slice_recomputes_color_groups() {
    // Construct a time slice with 2 events sharing competitor a — they should
    // end up in different color groups.
    // … use existing test helpers to build the slice …
    // Assert slice.color_groups.n_colors() == 2 and each group has 1 event.
}

The exact helper pattern depends on how existing src/time_slice.rs::tests construct slices. Mirror the existing approach.
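A dependency-free sketch of what the assertion checks — a minimal greedy coloring over plain index sets (usize stands in for Index; this mirrors, not reuses, the crate's color_greedy):

```rust
use std::collections::HashSet;

// Greedy coloring: each event goes to the first color whose members are
// disjoint from its index set, else a new color is opened.
fn greedy_colors(events: &[Vec<usize>]) -> Vec<Vec<usize>> {
    let mut groups: Vec<Vec<usize>> = Vec::new();
    let mut members: Vec<HashSet<usize>> = Vec::new();
    for (ev, idxs) in events.iter().enumerate() {
        let set: HashSet<usize> = idxs.iter().copied().collect();
        match members.iter().position(|m| m.is_disjoint(&set)) {
            Some(c) => {
                groups[c].push(ev);
                members[c].extend(set);
            }
            None => {
                groups.push(vec![ev]);
                members.push(set);
            }
        }
    }
    groups
}

fn main() {
    // Two events sharing competitor 0 conflict → two colors.
    assert_eq!(greedy_colors(&[vec![0, 1], vec![0, 2]]), vec![vec![0], vec![1]]);
    // Disjoint events share a color.
    assert_eq!(greedy_colors(&[vec![0, 1], vec![2, 3]]), vec![vec![0, 1]]);
}
```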

  • Step 4: Verify
cargo test --features approx --lib time_slice
cargo clippy --all-targets --features approx -- -D warnings

Expected: all existing time_slice tests still pass + 1 new test.

  • Step 5: Commit
cargo +nightly fmt
git add -A
git commit -m "$(cat <<'EOF'
feat(time-slice): pre-compute color groups at ingestion

TimeSlice now stores a ColorGroups partition recomputed whenever
events change. The partition is computed once per slice mutation and
reused on every convergence iteration, enabling the cheap within-slice
parallel sweep added in Task 6.

Part of T3.
EOF
)"

Task 6: Parallel within-slice event iteration (behind rayon feature)

Files:

  • Modify: src/time_slice.rs

This is the core parallelism work. Within a color group, events touch disjoint Index values — it's safe to process them concurrently. Across colors, processing is strictly sequential. This preserves exact async-EP semantics.

  • Step 1: Identify the event-iteration hot path

Read src/time_slice.rs to find the method that, per convergence iteration, walks all events in the slice and updates their skills. Look for patterns like for event in self.events.iter_mut() or self.events.iter().for_each(...). This is the target for parallelization.

Probably named iteration or similar. Note its full signature and current body.

  • Step 2: Rewrite the loop as color-group-driven

Sequential version (compiled when rayon is disabled):

#[cfg(not(feature = "rayon"))]
pub(crate) fn iteration(&mut self) -> (f64, f64) {
    let mut max_step = (0.0_f64, 0.0_f64);
    for color in &self.color_groups.groups {
        for &ev_idx in color {
            let step = self.events[ev_idx].iteration();
            max_step.0 = max_step.0.max(step.0);
            max_step.1 = max_step.1.max(step.1);
        }
    }
    max_step
}

Parallel version (behind cfg):

#[cfg(feature = "rayon")]
pub(crate) fn iteration(&mut self) -> (f64, f64) {
    use rayon::prelude::*;
    let mut max_step = (0.0_f64, 0.0_f64);
    for color in &self.color_groups.groups {
        // Within one color, events touch disjoint Indexes → safe to run
        // concurrently. But rayon's par_iter_mut cannot hand out mutable
        // borrows at arbitrary indices of the same Vec<Event>, so a naive
        // par_iter over the color's index list does not compile without
        // unsafe. The "Design note" below weighs the options; this draft
        // marks the gap with todo!.
        let contributions: Vec<(f64, f64)> = color
            .par_iter()
            .map(|&ev_idx| {
                // Borrow the event mutably — this requires unsafe or an
                // alternate data layout that exposes color-disjoint slices.
                // See the "Design note" below for the chosen approach.
                todo!("color-disjoint mutable event access")
            })
            .collect();
        for (d0, d1) in contributions {
            max_step.0 = max_step.0.max(d0);
            max_step.1 = max_step.1.max(d1);
        }
    }
    max_step
}

Design note on mutable access: Rayon's par_iter_mut doesn't let us access arbitrary indices from the same &mut Vec<Event> in parallel. Options the implementer should choose between, depending on which compiles most cleanly with the existing TimeSlice::iteration body:

  1. Interior mutability. Wrap each Event in Cell<…> or RefCell<…>, then Event: Sync. This works only if the event's internal state is small/cheap to copy. Adds overhead.

  2. Manual split_at_mut sequence. Sort events into color order once (mutating self.events so color[0][0], color[0][1], …, color[1][0], … are contiguous), remember the boundaries, then par_chunks_mut over each color's contiguous range. Simple, no unsafe. Does require a one-time sort when color groups are computed.

  3. Raw pointer juggling. SAFETY-commented unsafe blocks that pass *mut Event into parallel closures. Fast, but fragile. Avoid unless (1) and (2) are benchmarked and found insufficient.

Recommendation: approach (2). When color groups are computed in Task 5's recompute_color_groups, also physically reorder self.events so color members are contiguous. Then each color corresponds to a slice range self.events[color_start..color_end], and par_chunks_mut(…).for_each(…) or par_iter_mut() over the range works.

Revised recompute_color_groups:

pub(crate) fn recompute_color_groups(&mut self) {
    let n = self.events.len();
    let cg = crate::color_group::color_greedy(n, |ev_idx| {
        self.events[ev_idx]
            .teams
            .iter()
            .flat_map(|t| t.items.iter().map(|it| it.agent))
            .collect::<Vec<_>>()
    });

    // Physically reorder self.events to match the color-group layout.
    let mut reordered: Vec<Event> = Vec::with_capacity(n);
    let mut ranges: Vec<(usize, usize)> = Vec::with_capacity(cg.groups.len());
    for group in &cg.groups {
        let start = reordered.len();
        for &ev_idx in group {
            reordered.push(std::mem::replace(
                &mut self.events[ev_idx],
                // Placeholder; original slot becomes garbage before the
                // final swap.
                Event::placeholder(),  // OR use std::mem::take if Event: Default
            ));
        }
        ranges.push((start, reordered.len()));
    }
    self.events = reordered;
    // Rebuild cg with post-reorder indices: each group now spans a
    // contiguous range.
    self.color_groups = ColorGroups::from_ranges(ranges);
}

Then color i covers the contiguous event range [ranges[i].0 .. ranges[i].1).

ColorGroups::from_ranges(ranges: Vec<(usize, usize)>) stores the ranges in a new ranges: Vec<(usize, usize)> field (alongside or replacing groups), so iteration can slice self.events[start..end] directly via self.color_groups.ranges. If groups is kept, each group is the trivial mapping group_i = (start..end).collect().
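Why contiguity pays off — a stdlib sketch of mutating one color's range in parallel (u64 stands in for Event, thread::scope for rayon's par_iter_mut, and the ranges are made up):

```rust
use std::thread;

// With events contiguous per color, each color's range is split out as
// one &mut slice and its elements updated concurrently.
fn process_colors(events: &mut [u64], ranges: &[(usize, usize)]) {
    for &(start, end) in ranges {
        let color = &mut events[start..end];
        thread::scope(|s| {
            // chunks_mut(1) yields disjoint &mut chunks — safe to hand
            // one to each spawned thread, no unsafe needed.
            for ev in color.chunks_mut(1) {
                s.spawn(move || ev[0] *= 10);
            }
        });
        // scope joins here before the next color starts, matching the
        // strictly-sequential across-colors rule.
    }
}

fn main() {
    let mut events: Vec<u64> = (0..6).collect();
    process_colors(&mut events, &[(0, 3), (3, 6)]);
    assert_eq!(events, vec![0, 10, 20, 30, 40, 50]);
}
```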

Pitfall: Event::placeholder() or std::mem::take. If Event: Default, std::mem::take(&mut self.events[ev_idx]) works cleanly. If not, use Option<Event> temporarily, or implement a cheap Event::placeholder(). Before writing the replacement logic, read src/time_slice.rs to see if Event derives Default — if it does, use std::mem::take.
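The Default path from the pitfall, sketched (Fake stands in for Event; the color order is invented for illustration):

```rust
// std::mem::take moves the value out and leaves Default::default()
// behind — no placeholder constructor needed when Event: Default.
#[derive(Default, Debug, PartialEq)]
struct Fake(u32);

fn reorder(events: &mut Vec<Fake>, order: &[usize]) -> Vec<Fake> {
    order.iter().map(|&i| std::mem::take(&mut events[i])).collect()
}

fn main() {
    let mut events = vec![Fake(10), Fake(20), Fake(30)];
    // Hypothetical color layout: event 2, then 0, then 1.
    let reordered = reorder(&mut events, &[2, 0, 1]);
    assert_eq!(reordered, vec![Fake(30), Fake(10), Fake(20)]);
    // Without Default: wrap elements in Option<Event> and use
    // Option::take for the same move-out-and-leave-empty effect.
}
```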

  • Step 3: Implement the parallel iteration via contiguous ranges
#[cfg(feature = "rayon")]
pub(crate) fn iteration(&mut self) -> (f64, f64) {
    use rayon::prelude::*;
    let mut max_step = (0.0_f64, 0.0_f64);
    for range in &self.color_groups.ranges {
        let slice = &mut self.events[range.0..range.1];
        let contributions: Vec<(f64, f64)> = slice
            .par_iter_mut()
            .map(|event| event.iteration())
            .collect();
        for (d0, d1) in contributions {
            max_step.0 = max_step.0.max(d0);
            max_step.1 = max_step.1.max(d1);
        }
    }
    max_step
}

IMPORTANT: event.iteration(…) currently takes other borrowed state from TimeSlice (skills, competitors). Those borrows conflict with the &mut self.events[...] borrow. Resolution:

a) Pull the shared data (&self.skills, &self.competitors) into a local variable before the par_iter, so rayon's closure captures & not &self.

b) If the existing per-event method ALSO mutates shared state (e.g., writes to a shared SkillStore), that's a problem — it breaks the color-disjoint-index guarantee. Read the current code carefully. If per-event iteration writes to skills for indices it owns, the color-group invariant makes this safe, but you'll need unsafe to express it. In that case, fall back to approach (1) or (3) from Task 6 Step 2.

Read the existing event.iteration or equivalent FIRST before choosing. The code may already be structured so events only mutate themselves.

  • Step 4: Sequential fallback
#[cfg(not(feature = "rayon"))]
pub(crate) fn iteration(&mut self) -> (f64, f64) {
    let mut max_step = (0.0_f64, 0.0_f64);
    for range in &self.color_groups.ranges {
        for event in &mut self.events[range.0..range.1] {
            let step = event.iteration();
            max_step.0 = max_step.0.max(step.0);
            max_step.1 = max_step.1.max(step.1);
        }
    }
    max_step
}

Both versions use the same color-group traversal order — behavior is identical across feature flags.

  • Step 5: Verify both feature combinations
cargo test --features approx --lib
cargo test --features approx,rayon --lib
cargo clippy --all-targets --features approx -- -D warnings
cargo clippy --all-targets --features approx,rayon -- -D warnings

All 90 tests pass in both configurations. Goldens must NOT drift.

  • Step 6: Commit
cargo +nightly fmt
git add -A
git commit -m "$(cat <<'EOF'
feat(time-slice): parallel within-slice iteration via rayon

Events are reordered into color-group-contiguous ranges during
recompute_color_groups; each color's range is processed in parallel
via par_iter_mut when the rayon feature is enabled, sequentially
otherwise. The two paths produce identical results because events
within a color touch disjoint Index values (async-EP invariant).

Feature gated: default build still sequential; --features rayon
activates the parallel path.

Part of T3.
EOF
)"

Task 7: Parallel learning_curves with ordered reduction

Files:

  • Modify: src/history.rs

Both learning_curves() and learning_curves_by_index() currently iterate self.time_slices and collect per-competitor posteriors. They're embarrassingly parallel per-slice; the merge at the end must preserve slice order to keep tests deterministic.

  • Step 1: Parallelize learning_curves
pub fn learning_curves(&self) -> HashMap<K, Vec<(T, Gaussian)>> {
    #[cfg(feature = "rayon")]
    {
        use rayon::prelude::*;
        // Parallel: compute per-slice (index, time, gaussian) triples;
        // collect preserves slice order (.collect::<Vec<_>> is order-preserving).
        let per_slice: Vec<Vec<(Index, T, Gaussian)>> = self
            .time_slices
            .par_iter()
            .map(|ts| {
                ts.skills
                    .iter()
                    .map(|(idx, sk)| (idx, ts.time, sk.posterior()))
                    .collect()
            })
            .collect();
        // Sequential merge: iterate in slice order, push to per-key vectors.
        let mut data: HashMap<K, Vec<(T, Gaussian)>> = HashMap::new();
        for slice_contrib in per_slice {
            for (idx, t, g) in slice_contrib {
                if let Some(key) = self.keys.key(idx).cloned() {
                    data.entry(key).or_default().push((t, g));
                }
            }
        }
        data
    }
    #[cfg(not(feature = "rayon"))]
    {
        // Original sequential impl (unchanged)
        let mut data: HashMap<K, Vec<(T, Gaussian)>> = HashMap::new();
        for ts in &self.time_slices {
            for (idx, sk) in ts.skills.iter() {
                if let Some(key) = self.keys.key(idx).cloned() {
                    data.entry(key).or_default().push((ts.time, sk.posterior()));
                }
            }
        }
        data
    }
}
  • Step 2: Parallelize learning_curves_by_index

Same pattern, if the method still exists. (Task 20 of T2 may have removed it — verify before porting.) If present, mirror the learning_curves parallel+sequential split.

  • Step 3: Verify
cargo test --features approx,rayon

All tests pass — goldens preserved across both feature configurations.

  • Step 4: Commit
cargo +nightly fmt
git add -A
git commit -m "$(cat <<'EOF'
feat(history): parallel learning_curves under rayon feature

Per-slice posterior collection runs in parallel; merge into the
per-key HashMap is sequential in slice order to preserve deterministic
output. Sequential impl unchanged under default feature set.

Part of T3.
EOF
)"

Task 8: Parallel log_evidence / log_evidence_for with deterministic sum

Files:

  • Modify: src/history.rs

log_evidence_internal already does self.time_slices.iter().map(|ts| ts.log_evidence(…)).sum(). We replace with parallel map + sequential sum.

  • Step 1: Parallelize log_evidence_internal
pub(crate) fn log_evidence_internal(&mut self, forward: bool, targets: &[Index]) -> f64 {
    #[cfg(feature = "rayon")]
    {
        use rayon::prelude::*;
        let per_slice: Vec<f64> = self
            .time_slices
            .par_iter()
            .map(|ts| ts.log_evidence(self.online, targets, forward, &self.competitors))
            .collect();
        // Sequential sum in slice order for bit-identical reduction.
        per_slice.into_iter().sum()
    }
    #[cfg(not(feature = "rayon"))]
    {
        self.time_slices
            .iter()
            .map(|ts| ts.log_evidence(self.online, targets, forward, &self.competitors))
            .sum()
    }
}

Critical: per_slice is a Vec<f64>, not a fold. The sequential .into_iter().sum() is bit-identical to the sequential impl because the order is the same (slice order). Rayon's par_iter().sum() would reorder additions and is non-deterministic.
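The non-determinism being avoided, concretely: float addition is not associative, so any reduction that regroups additions (as par_iter().sum() may, depending on how work is split) can change the bits:

```rust
fn main() {
    let xs = [0.1_f64, 0.2, 0.3];
    let slice_order = (xs[0] + xs[1]) + xs[2]; // what .into_iter().sum() does
    let regrouped = xs[0] + (xs[1] + xs[2]);   // one grouping a parallel sum might use
    assert_ne!(slice_order.to_bits(), regrouped.to_bits());
    println!("{slice_order:.17} vs {regrouped:.17}");
}
```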

  • Step 2: log_evidence_for already wraps log_evidence_internal — no further changes needed.

  • Step 3: Verify

cargo test --features approx,rayon

All tests pass — goldens preserved.

  • Step 4: Commit
cargo +nightly fmt
git add -A
git commit -m "$(cat <<'EOF'
feat(history): parallel log_evidence with deterministic sum

Per-slice contribution computed in parallel; final reduction is
sequential in slice order so the sum is bit-identical to the T2
sequential baseline. This is essential for the T3 acceptance
criterion of identical posteriors across RAYON_NUM_THREADS values.

Part of T3.
EOF
)"

Task 9: Determinism test across thread counts

Files:

  • Create: tests/determinism.rs

  • Step 1: Create the test

```rust
//! Determinism tests: identical posteriors across RAYON_NUM_THREADS
//! values. Only meaningful when the `rayon` feature is enabled.

#![cfg(feature = "rayon")]

use smallvec::smallvec;
use trueskill_tt::{
    ConstantDrift, ConvergenceOptions, Event, History, Member, Outcome, Team,
};

fn build_and_converge(seed: u64) -> Vec<(i64, trueskill_tt::Gaussian)> {
    // Seed-driven deterministic test data: ~100 events, ~30 competitors.
    let mut h = History::<i64, _, _, String>::builder()
        .mu(25.0)
        .sigma(25.0 / 3.0)
        .beta(25.0 / 6.0)
        .drift(ConstantDrift(25.0 / 300.0))
        .convergence(ConvergenceOptions { max_iter: 30, epsilon: 1e-6 })
        .build();

    // Deterministic pseudo-random event generation.
    let mut rng_state = seed;
    let mut next = || {
        rng_state = rng_state.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        rng_state
    };

    let mut events: Vec<Event<i64, String>> = Vec::with_capacity(100);
    for ev_i in 0..100 {
        let a = (next() % 30) as usize;
        let mut b = (next() % 30) as usize;
        while b == a {
            b = (next() % 30) as usize;
        }
        events.push(Event {
            time: ev_i as i64 + 1,
            teams: smallvec![
                Team::with_members([Member::new(format!("p{a}"))]),
                Team::with_members([Member::new(format!("p{b}"))]),
            ],
            outcome: Outcome::winner((next() % 2) as u32, 2),
        });
    }
    h.add_events(events).unwrap();
    h.converge().unwrap();
    h.learning_curve(&"p0".to_string())
}

#[test]
fn posteriors_identical_across_thread_counts() {
    // Run the same history with the same seed at different rayon pool sizes.
    // Rayon lets us set the pool once per process; the cleanest pattern is
    // to use install() with a custom ThreadPoolBuilder per run.

    let sizes = [1, 2, 4, 8];
    let mut results: Vec<Vec<(i64, trueskill_tt::Gaussian)>> = Vec::new();
    for &n in &sizes {
        let pool = rayon::ThreadPoolBuilder::new()
            .num_threads(n)
            .build()
            .unwrap();
        let curve = pool.install(|| build_and_converge(42));
        results.push(curve);
    }

    // Every result must be bit-identical to the first.
    let reference = &results[0];
    for (i, curve) in results.iter().enumerate().skip(1) {
        assert_eq!(
            curve.len(),
            reference.len(),
            "curve length differs at {n} threads",
            n = sizes[i],
        );
        for (j, (&(t_ref, g_ref), &(t, g))) in reference.iter().zip(curve.iter()).enumerate() {
            assert_eq!(t_ref, t, "time point {j} differs at {} threads", sizes[i]);
            assert_eq!(
                g_ref.pi_and_tau(),
                g.pi_and_tau(),
                "posterior bits differ at thread count {}, time {}",
                sizes[i],
                t,
            );
        }
    }
}
```

Note on `pi_and_tau`: the test above calls a `pi_and_tau()` accessor to extract the raw natural-parameter representation for bit-level comparison. There is no need to add one — `f64::to_bits()` on the public accessors already gives direct bit equality:

```rust
assert_eq!(g_ref.mu().to_bits(), g.mu().to_bits());
assert_eq!(g_ref.sigma().to_bits(), g.sigma().to_bits());
```

Replace the `pi_and_tau()` comparison in the test with these two asserts so no new accessor is needed.
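One reason to prefer `to_bits()` over `==` for this check: bit equality is strictly stronger than float equality. A small self-contained illustration:

```rust
fn main() {
    // `==` conflates the two signed zeros; bit comparison distinguishes
    // them, so even a sign flip introduced by a reordered reduction
    // would be caught by the determinism test.
    assert!(0.0_f64 == -0.0_f64);
    assert_ne!(0.0_f64.to_bits(), (-0.0_f64).to_bits());
}
```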

  • Step 2: Verify
```sh
cargo test --features approx,rayon --test determinism
```

Expected: posteriors_identical_across_thread_counts passes.

  • Step 3: Commit
```sh
cargo +nightly fmt
git add tests/determinism.rs
git commit -m "$(cat <<'EOF'
test: assert bit-identical posteriors across RAYON_NUM_THREADS

Runs the same deterministic history at thread counts {1, 2, 4, 8}
and asserts every (time, posterior) pair is bit-identical. Verifies
the T3 determinism invariant holds under the ordered-reduce strategy.

Only compiled with --features rayon.

Part of T3.
EOF
)"
```

Task 10: Multi-thread benchmark + acceptance gate

Files:

  • Create: benches/history_converge.rs

  • Modify: Cargo.toml (register the bench)

  • Step 1: Register the bench in Cargo.toml

Add under the existing `[[bench]]` entries:

```toml
[[bench]]
name = "history_converge"
harness = false
```
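For reference — not part of this task — the feature wiring this bench depends on comes from Task 2 and presumably looks like the following (exact spelling is an assumption if Task 2 chose differently):

```toml
[dependencies]
rayon = { version = "1", optional = true }

[features]
rayon = ["dep:rayon"]
```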
  • Step 2: Create benches/history_converge.rs
```rust
//! End-to-end convergence benchmark. Measures `History::converge()` on a
//! realistic workload (~500 events, ~100 competitors, ~30 iters). The
//! rayon feature, when enabled, activates within-slice parallel event
//! iteration.

use criterion::{Criterion, criterion_group, criterion_main};
use smallvec::smallvec;
use trueskill_tt::{
    ConstantDrift, ConvergenceOptions, Event, History, Member, Outcome, Team,
};

fn build_history(
    n_events: usize,
    n_competitors: usize,
    seed: u64,
) -> History<i64, ConstantDrift, trueskill_tt::NullObserver, String> {
    let mut rng = seed;
    let mut next = || {
        rng = rng.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        rng
    };

    let mut h = History::<i64, _, _, String>::builder()
        .mu(25.0)
        .sigma(25.0 / 3.0)
        .beta(25.0 / 6.0)
        .drift(ConstantDrift(25.0 / 300.0))
        .convergence(ConvergenceOptions { max_iter: 30, epsilon: 1e-6 })
        .build();

    let mut events: Vec<Event<i64, String>> = Vec::with_capacity(n_events);
    for ev_i in 0..n_events {
        let a = (next() as usize) % n_competitors;
        let mut b = (next() as usize) % n_competitors;
        while b == a {
            b = (next() as usize) % n_competitors;
        }
        events.push(Event {
            time: (ev_i as i64 / 10) + 1, // ~10 events per slice
            teams: smallvec![
                Team::with_members([Member::new(format!("p{a}"))]),
                Team::with_members([Member::new(format!("p{b}"))]),
            ],
            outcome: Outcome::winner((next() % 2) as u32, 2),
        });
    }
    h.add_events(events).unwrap();
    h
}

fn bench_converge(c: &mut Criterion) {
    c.bench_function("History::converge/500x100", |b| {
        b.iter_batched(
            || build_history(500, 100, 42),
            |mut h| {
                h.converge().unwrap();
            },
            criterion::BatchSize::SmallInput,
        );
    });
}

criterion_group!(benches, bench_converge);
criterion_main!(benches);
```
  • Step 3: Run sequential baseline
```sh
cargo bench --bench history_converge --features approx 2>&1 | grep 'History::converge'
```

Record the sequential baseline number.

  • Step 4: Run parallel version
```sh
cargo bench --bench history_converge --features approx,rayon 2>&1 | grep 'History::converge'
```

Acceptance gate: parallel run should show >2× speedup on an 8-core machine. If it's less:

  • Verify rayon is actually being used (RAYON_LOG=1).
  • Check whether the workload has enough color-group parallelism (a slice with 10 events that all share competitors has 0 parallel work).
  • Consider whether the per-event cost is large enough to amortize rayon overhead.

If the gate fails, dig in before committing. Acceptable fallback: tune the benchmark workload (more events per slice, more competitors) so parallelism has more opportunity. Report findings.

  • Step 5: Commit
```sh
cargo +nightly fmt
git add -A
git commit -m "$(cat <<'EOF'
bench: end-to-end History::converge benchmark

Workload: 500 events across ~50 time slices, ~100 competitors, 30
iteration cap, 1e-6 convergence. Measures full forward+backward
sweep through convergence — the target hot path for rayon-backed
within-slice parallelism.

Sequential baseline: <X> ms
With --features rayon on 8 cores: <Y> ms (<speedup>×)

Part of T3.
EOF
)"
```

(Fill in <X>, <Y>, <speedup> from measured numbers.)


Task 11: Final verification, bench capture, CHANGELOG

Files:

  • Modify: benches/baseline.txt

  • Modify: CHANGELOG.md

  • Step 1: Run the complete verification matrix

```sh
cargo +nightly fmt --check
cargo clippy --all-targets --features approx -- -D warnings
cargo clippy --all-targets --features approx,rayon -- -D warnings
cargo test --features approx
cargo test --features approx,rayon
cargo bench --bench batch --features approx 2>&1 | grep "Batch::iteration"
cargo bench --bench batch --features approx,rayon 2>&1 | grep "Batch::iteration"
cargo bench --bench history_converge --features approx 2>&1 | grep 'History::converge'
cargo bench --bench history_converge --features approx,rayon 2>&1 | grep 'History::converge'
```

All must pass / be clean.

  • Step 2: Append T3 block to benches/baseline.txt
```text
# After T3 (date, same hardware)

Batch::iteration (seq)        <X.XX> µs   (<delta> vs T2 21.36 µs)
Batch::iteration (rayon, 8c)  <X.XX> µs   (sequential path on single slice; rayon not active)
History::converge (seq)       <X.XX> ms   baseline
History::converge (rayon, 8c) <X.XX> ms   (<speedup>× — target was ≥2×)

# Notes:
# - T3 adds Send + Sync on public traits, color-group partition in
#   TimeSlice, and rayon-feature-gated parallel paths on within-slice
#   iteration, learning_curves, log_evidence_for.
# - Determinism verified: tests/determinism.rs asserts bit-identical
#   posteriors across RAYON_NUM_THREADS={1, 2, 4, 8}.
# - Rayon is opt-in: default build is single-threaded as in T2.
```
  • Step 3: Prepend T3 entry to CHANGELOG.md

Add a `## Unreleased — T3 concurrency` section above the existing `## Unreleased — T2 new API surface` section (if it is still there after the main merge; otherwise above the `## 0.1.0` section). Enumerate: the new `rayon` feature, the `Send + Sync` bounds (a minor breaking change for downstream custom trait impls), the internal color-group infrastructure, and the benchmark numbers.
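One possible shape for the entry, drawn from the items listed above (the speedup figure is a placeholder to be filled from Task 11's measurements):

```markdown
## Unreleased — T3 concurrency

- New opt-in `rayon` cargo feature: within-slice parallel event iteration,
  plus parallel `learning_curves` and `log_evidence_for`. The default build
  remains single-threaded.
- `Send + Sync` bounds added to public traits (minor breaking change for
  downstream custom trait impls).
- Internal color-group partitioning in `TimeSlice` (no public API surface).
- `History::converge` on 8 cores: <speedup>× vs the sequential baseline;
  posteriors bit-identical across `RAYON_NUM_THREADS={1, 2, 4, 8}`.
```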

  • Step 4: Commit
```sh
git add benches/baseline.txt CHANGELOG.md
git commit -m "$(cat <<'EOF'
bench,docs: capture T3 final numbers and update CHANGELOG

History::converge parallel speedup: <X>× on 8 cores with --features rayon.
Batch::iteration unchanged in seq mode; Gaussian ops unchanged.

Determinism verified: bit-identical posteriors across
RAYON_NUM_THREADS={1, 2, 4, 8}.

Closes T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.
EOF
)"
```
  • Step 5: Ready to PR

Open the `t3-concurrency` branch as a PR against `main`. The PR body references the spec, the plan, and this file's benchmarks block.


Self-review notes

Spec coverage (against Section 6 + Section 7 "T3"):

  • Send + Sync audit and bounds on all public traits (Task 3)
  • Color-group partitioning at TimeSlice ingestion (Tasks 4, 5)
  • rayon as opt-in feature (Task 2; note: spec said default-on, we flip later)
  • Parallel within-slice color groups (Task 6)
  • Parallel learning_curves (Task 7)
  • Parallel log_evidence_for (Task 8)
  • Deterministic posteriors across RAYON_NUM_THREADS (Task 9)
  • >2× speedup on 8-core offline converge (Task 10)

Deferred to later tiers:

  • Cross-slice parallelism (dirty-bit slice skipping) — spec acknowledges this is separate from T3's within-slice focus.
  • Synchronous-EP schedule with barrier merge — available as a Schedule impl, per spec Section 6, but deferred.
  • Exposing color-group partitioning to users via add_events_with_partition — per spec open question 5.
  • Rayon default-on flip — after field-stability evidence.

Hazards during execution:

  • Mutable-aliasing gymnastics in Task 6. The color-disjoint guarantee is real but rayon's API doesn't see it. Task 6 Step 2 offers three implementation approaches; the recommended one (approach 2, reorder events into contiguous color ranges) avoids unsafe but requires a physical reorder on mutation. Benchmark reordering cost — if it dominates, reconsider.
  • Send + Sync auto-derive failure. If any struct between traits holds a non-thread-safe primitive (raw pointer, Rc), auto-derive fails. Unlikely in this codebase, but watch for it in Task 3.
  • Deterministic sum pitfalls. Rayon's `par_iter().sum()` reorders additions nondeterministically. The deterministic pattern used here is `par_iter().map().collect::<Vec<_>>().into_iter().sum()`: the parallel part produces per-item values, and the final additions happen sequentially in slice order. Task 8 is careful about this; reviewers should verify.
  • f64 NaN in goldens. If any test golden has a NaN, to_bits() comparison needs special casing. Our goldens are all finite; no action needed.
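To make the NaN caveat concrete: `==` on NaN is always false, while `to_bits()` compares payloads, so the two comparisons disagree in both directions. A quick sketch:

```rust
fn main() {
    let a = f64::NAN;
    // `==` is false even for the very same NaN value...
    assert!(a != a);
    // ...but its bit pattern is trivially equal to itself.
    assert_eq!(a.to_bits(), f64::NAN.to_bits());
    // Distinct NaN payloads are both NaN yet compare unequal bitwise,
    // which is why goldens containing NaN would need special casing.
    let b = f64::from_bits(f64::NAN.to_bits() | 1);
    assert!(b.is_nan());
    assert_ne!(a.to_bits(), b.to_bits());
}
```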

Things outside the plan that may bite:

  • CI needs to test both feature configurations. If there's a .github/workflows/ file, update it to include --features approx,rayon as a matrix entry. If no CI exists (likely for this project), flag and move on.
  • examples/atp.rs — may want to demonstrate --features rayon. Optional; skip unless trivial.