Files
trueskill-tt/docs/superpowers/plans/2026-04-24-t3-concurrency.md
Anders Olsson 6bf3e7e294 T3: rayon-backed concurrency (opt-in) (#2)
Implements T3 of `docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md` Section 6. Plan: `docs/superpowers/plans/2026-04-24-t3-concurrency.md` (11 tasks).

## Summary

### Breaking

- `Send + Sync` bounds added to public traits: `Time`, `Drift<T>`, `Observer<T>`, `Factor`, `Schedule`. All built-in impls satisfy them automatically (`Send`/`Sync` are auto traits); downstream custom impls must now be thread-safe.

### New

- Opt-in `rayon` cargo feature. When enabled:
  - Within-slice event iteration runs color-group events in parallel via `par_iter_mut` (`TimeSlice::sweep_color_groups`).
  - `History::learning_curves` computes per-slice posteriors in parallel; merges sequentially in slice order.
  - `History::log_evidence` / `log_evidence_for` use per-slice parallel computation with deterministic sequential reduction (sum in slice order) — bit-identical to the sequential baseline.
- `ColorGroups` infrastructure (`src/color_group.rs`) with greedy graph coloring. Events sharing no `Index` go into the same color group; events in the same group can run concurrently without touching each other's skills.
- `tests/determinism.rs` asserts bit-identical posteriors across `RAYON_NUM_THREADS={1, 2, 4, 8}`.
- `benches/history_converge.rs` measures end-to-end convergence on three workload shapes.

## Performance

### Sequential (no rayon, default build)

| Metric | Before T3 | After T3 | Delta |
|---|---|---|---|
| `Batch::iteration` | 22.88 µs | 23.23 µs | **+1.5%** (noise) |
| `Gaussian::*` | ≈218–264 ps | ≈236 ps | within noise |

**No sequential regression.** Default build is as fast as T2.

### Parallel (`--features rayon`, Apple M5 Pro, auto thread count)

| Workload | Sequential | Parallel | Speedup |
|---|---:|---:|---:|
| 500 events / 100 competitors / 10 per slice | 4.03 ms | 4.24 ms | **0.95×** |
| 2000 events / 200 competitors / 20 per slice | 20.18 ms | 19.82 ms | **1.02×** |
| 5000 events / 50000 competitors / 1 slice | 11.88 ms | 9.10 ms | **1.31×** |

### ⚠️ The spec's ≥2× target was not met on realistic workloads.

T3's within-slice color-group parallelism only shows material benefit when a slice holds many events AND the competitor pool is large enough to give the greedy coloring room to partition. Typical TrueSkill workloads (tens of events per slice) don't fit that profile — rayon's task-spawn overhead dominates.

**Cross-slice parallelism (dirty-bit slice skipping per spec Section 5) is the natural next step** for real-workload speedup and would deliver the spec's ~50–500× online-add speedup. Deferred to a future tier.

## Determinism

`tests/determinism.rs` runs a 200-event history at thread counts {1, 2, 4, 8} via `rayon::ThreadPoolBuilder::install` and asserts every `(time, posterior)` pair has bit-identical `mu` and `sigma` (compared via `f64::to_bits()`). Passes.
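The core of that comparison can be sketched in a few lines (helper name hypothetical; the real test walks the `(time, posterior)` pairs from `learning_curves`):

```rust
// Bitwise comparison of (mu, sigma) pairs. Plain f64 equality would accept
// -0.0 == 0.0 and reject NaN == NaN; to_bits() catches every
// representational difference, which is what "bit-identical" demands.
fn bit_identical(a: &[(f64, f64)], b: &[(f64, f64)]) -> bool {
    a.len() == b.len()
        && a.iter().zip(b).all(|((m1, s1), (m2, s2))| {
            m1.to_bits() == m2.to_bits() && s1.to_bits() == s2.to_bits()
        })
}
```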

## Internals

- Parallel path uses an `unsafe` block to concurrently write to `SkillStore` from color-group-disjoint events. Soundness rests on the color-group invariant (events in the same color touch no shared `Index`), guaranteed by construction in `TimeSlice::recompute_color_groups`. Sequential path unchanged from T2.
- `RAYON_THRESHOLD = 64` — color groups smaller than this fall back to sequential inside `sweep_color_groups` to avoid task-spawn overhead.
- Thread-local `ScratchArena` per rayon worker thread.
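`par_or_seq` (in `src/parallel.rs`, not shown here) presumably wraps that threshold check; a sketch of one plausible shape, with the signature assumed:

```rust
/// Run `f` over `items`, in parallel only when the group is large enough
/// to amortize rayon's task-spawn overhead (hypothetical helper shape).
const RAYON_THRESHOLD: usize = 64;

fn par_or_seq<T: Send, F: Fn(&mut T) + Send + Sync>(items: &mut [T], f: F) {
    if items.len() < RAYON_THRESHOLD {
        // Small group: plain sequential loop, no thread-pool involvement.
        items.iter_mut().for_each(&f);
    } else {
        #[cfg(feature = "rayon")]
        {
            use rayon::prelude::*;
            items.par_iter_mut().for_each(|x| f(x));
        }
        #[cfg(not(feature = "rayon"))]
        items.iter_mut().for_each(&f);
    }
}
```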

## Test plan

- [x] `cargo test --features approx` — 96 tests pass (74 lib + 22 integration)
- [x] `cargo test --features approx,rayon` — 97 tests pass (+1 determinism)
- [x] `cargo clippy --all-targets --features approx -- -D warnings` — clean
- [x] `cargo clippy --all-targets --features approx,rayon -- -D warnings` — clean
- [x] `cargo +nightly fmt --check` — clean
- [x] `cargo bench --bench batch --features approx` — 23.23 µs (no regression vs T2)
- [x] `cargo bench --bench history_converge --features approx,rayon` — runs on all three workloads
- [x] Bit-identical posteriors across `RAYON_NUM_THREADS={1, 2, 4, 8}` — verified

## Commit history

13 commits on `t3-concurrency`. Each task is self-contained and bisectable. See `git log main..t3-concurrency` for the full list.

## Deferred

- **Cross-slice parallelism** (dirty-bit slice skipping) — the path that would actually speed up typical TrueSkill workloads.
- **Default-on `rayon` feature** — spec called for default-on; we keep it opt-in until the feature proves stable in production use.
- **Synchronous-EP schedule with barrier merge** — alternative parallel strategy per spec Section 6.
- **`MarginFactor` / `Outcome::Scored`** — T4.
- **`Damped` / `Residual` schedules** — T4.
- **N-team `predict_outcome`** — T4.
- **`Game::custom` full ergonomics** — T4.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #2
Co-authored-by: Anders Olsson <anders.e.olsson@gmail.com>
Co-committed-by: Anders Olsson <anders.e.olsson@gmail.com>
2026-04-24 13:01:01 +00:00

# T3 — Concurrency Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Ship the T3 tier from `docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md` Section 6: `Send + Sync` bounds on public traits, color-group partitioning for within-slice event independence, and rayon-backed parallel paths on within-slice event iteration, `learning_curves`, and `log_evidence_for`. Deterministic posteriors across `RAYON_NUM_THREADS={1, 2, 4, 8}`; >2× speedup on an 8-core offline-converge benchmark.

**Architecture:** Concurrency lands as a single feature flag (`rayon`, opt-in for T3; the spec suggests default-on but we defer that flip until the feature is proven stable). All parallel paths are hidden behind `#[cfg(feature = "rayon")]` with sequential fallbacks. Within-slice parallelism exploits graph coloring: events sharing no `Index` go into the same color group and run concurrently; across color groups, execution is strictly sequential. This preserves the exact async-EP semantics of T2 at any thread count. Reductions over f64 (`log_evidence_for`, `predict_quality`) use two-stage parallel-then-sequential-reduce so sums are bit-identical regardless of thread count.

**Tech Stack:** Rust 2024 edition. New optional dependency: `rayon = "1"`. Builds on T2 (`History<T, D, O, K>`, `Event<T, K>`, `Outcome`, `Observer`, `factors` module).
## Design decisions
Called out explicitly so reviewers can override before execution:
1. **Rayon is opt-in** (`cargo feature`), not default-on. Simplifies CI, keeps the default build lean. We flip to default-on in a follow-up once the feature is shown to be stable under field use.
2. **Greedy graph coloring** for the within-slice partition: for each event in ingestion order, assign the lowest color whose existing members share no `Index` with the event. Optimality is not the target — events per slice is small (~50), and greedy finishes in O(n·c·m) where c is colors and m is team size. Rebalancing is a T4 follow-up if benchmarks show it helps.
3. **Scope of parallelism:** within-slice color groups, `learning_curves`, `log_evidence_for`, `predict_quality`. Cross-slice iteration in `History::converge` stays **sequential** — the forward/backward sweep has true data dependencies across slices. Parallelizing across slices requires a separate algorithm (the spec's dirty-bit slice skipping, deferred to beyond-T3).
4. **Deterministic reductions:** `.par_iter().map().collect::<Vec<_>>().into_iter().sum()` — each slice's contribution is computed in parallel, then summed sequentially in slice order. Per-element values are bit-identical to T2 (no extra floating-point additions reordered).
5. **Within-game inference stays sequential.** A single `Game::ranked` / `Game::likelihoods` call is too small (~20 µs) to amortize rayon's ~5 µs task overhead. The spec's table confirms this.
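Decision 4's two-stage reduction can be illustrated with std threads standing in for rayon (the rayon version behaves identically because `collect` into a `Vec` is order-preserving):

```rust
// Stage 1: compute each slice's contribution in parallel. Completion order
// is nondeterministic, but each value lands in its own slot.
// Stage 2: sum sequentially in slice order, so the f64 additions happen in
// exactly the same order as the sequential baseline.
fn deterministic_sum(slices: &[Vec<f64>]) -> f64 {
    let per_slice: Vec<f64> = std::thread::scope(|s| {
        let handles: Vec<_> = slices
            .iter()
            .map(|sl| s.spawn(move || sl.iter().sum::<f64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    per_slice.into_iter().sum() // sequential, slice-order reduction
}
```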
## Acceptance criteria
- `cargo test --features approx` — all tests pass (T2 baseline: 90 tests).
- `cargo test --features approx,rayon` — all tests pass; determinism tests across `RAYON_NUM_THREADS={1, 2, 4, 8}` produce bit-identical posteriors.
- `cargo clippy --all-targets --features approx -- -D warnings` — clean.
- `cargo clippy --all-targets --features approx,rayon -- -D warnings` — clean.
- `cargo +nightly fmt --check` — clean.
- `cargo bench --bench history_converge --features approx,rayon` — >2× speedup on an 8-core machine vs the sequential baseline.
- All public traits (`Time`, `Drift<T>`, `Observer<T>`, `Factor`, `Schedule`) have `Send + Sync` bounds; their blanket/default impls (`i64`, `Untimed`, `ConstantDrift`, `NullObserver`) all naturally satisfy the new bounds — no user code should need changes.
- The `rayon` feature is opt-in; default build without it compiles and passes all tests.
## File map
**New files:**

| Path | Responsibility |
|---|---|
| `src/color_group.rs` | Greedy graph coloring for within-slice event partitioning. |
| `src/parallel.rs` | `#[cfg(feature = "rayon")]` helpers: `par_or_seq` iterators, ordered reductions, thread-count determinism tests. |
| `benches/history_converge.rs` | End-to-end convergence benchmark (scales to show rayon benefit). |
| `tests/determinism.rs` | Runs the same convergence with different `RAYON_NUM_THREADS` and asserts bit-identical posteriors. |
**Modified:**

| Path | What changes |
|---|---|
| `Cargo.toml` | Add `rayon = { version = "1", optional = true }`; declare `rayon` feature. Add `[[bench]] name = "history_converge"`. |
| `src/lib.rs` | Declare new `color_group` + `parallel` modules (both private). |
| `src/time.rs` | Add `Send + Sync + 'static` bounds to `Time`. |
| `src/drift.rs` | Add `Send + Sync` bound to `Drift<T>`. |
| `src/observer.rs` | Add `Send + Sync` bound to `Observer<T>`. |
| `src/factor/mod.rs` | Add `Send + Sync` bound to `Factor`. |
| `src/schedule.rs` | Add `Send + Sync` bound to `Schedule`. |
| `src/time_slice.rs` | Store pre-computed color-group partition; parallel event iteration behind `#[cfg(feature = "rayon")]`. |
| `src/history.rs` | Parallel `learning_curves`, `learning_curves_by_index`, `log_evidence`, `log_evidence_for` behind `#[cfg(feature = "rayon")]` with ordered reductions. |
---
## Task 1: Pre-flight — verify green on main, create t3 branch, capture baseline
**Files:** none
- [ ] **Step 1: Confirm on main, clean tree**
```bash
git status
git rev-parse --abbrev-ref HEAD
```
Expected: clean; on `main`.
- [ ] **Step 2: Create the T3 branch**
```bash
git checkout -b t3-concurrency
```
- [ ] **Step 3: Confirm all tests pass**
```bash
cargo test --features approx
```
Expected: 90 tests pass (68 lib + 10 api_shape + 6 game + 4 record_winner + 2 equivalence).
- [ ] **Step 4: Capture current bench baseline**
```bash
cargo bench --bench batch 2>&1 | grep "Batch::iteration"
```
Record the number — it'll be the sequential-path baseline to verify no regression.
- [ ] **Step 5: No commit** — verification only.
---
## Task 2: Add `rayon` as optional dependency + feature flag
**Files:**
- Modify: `Cargo.toml`
- [ ] **Step 1: Add the dependency and feature**
Under `[dependencies]` add:
```toml
rayon = { version = "1", optional = true }
```
Add a `[features]` section (if not present — check first):
```toml
[features]
rayon = ["dep:rayon"]
```
(The existing `approx` feature already exists as a `dep:approx`-style optional dependency; use the same convention.)
- [ ] **Step 2: Verify both builds compile**
```bash
cargo build --features approx
cargo build --features approx,rayon
```
Both must succeed. The second one pulls `rayon` into the dependency graph but nothing uses it yet.
- [ ] **Step 3: Commit**
```bash
git add Cargo.toml Cargo.lock
git commit -m "$(cat <<'EOF'
feat(cargo): add rayon as optional dependency
Opt-in feature flag — users who want parallel paths build with
--features rayon. Default build remains single-threaded.
Spec Section 6 calls for default-on; we defer that flip until the
feature is stable under field use.
Part of T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.
EOF
)"
```
---
## Task 3: Add `Send + Sync` bounds to public traits
**Files:**
- Modify: `src/time.rs`, `src/drift.rs`, `src/observer.rs`, `src/factor/mod.rs`, `src/schedule.rs`
This is a minor breaking API change: downstream code with user-defined `Drift<T>` / `Observer<T>` / `Factor` / `Schedule` impls that aren't `Send + Sync` will fail to compile. For our crate, all built-in types (`i64`, `Untimed`, `ConstantDrift`, `NullObserver`, `EpsilonOrMax`, `TeamSumFactor`, `RankDiffFactor`, `TruncFactor`, `BuiltinFactor`) naturally satisfy these bounds — no internal code changes.
- [ ] **Step 1: Update `Time` trait**
```rust
// src/time.rs
pub trait Time: Copy + Ord + Send + Sync + 'static {
    fn elapsed_to(&self, later: &Self) -> i64;
}
```
- [ ] **Step 2: Update `Drift<T>` trait**
```rust
// src/drift.rs
pub trait Drift<T: Time>: Copy + Debug + Send + Sync {
    fn variance_delta(&self, from: &T, to: &T) -> f64;
    fn variance_for_elapsed(&self, elapsed: i64) -> f64;
}
```
- [ ] **Step 3: Update `Observer<T>` trait**
```rust
// src/observer.rs
pub trait Observer<T: Time>: Send + Sync {
    fn on_iteration_end(&self, _iter: usize, _max_step: (f64, f64)) {}
    fn on_batch_processed(&self, _time: &T, _slice_idx: usize, _n_events: usize) {}
    fn on_converged(&self, _iters: usize, _final_step: (f64, f64), _converged: bool) {}
}
```
- [ ] **Step 4: Update `Factor` trait**
```rust
// src/factor/mod.rs
pub trait Factor: Send + Sync {
    fn propagate(&mut self, vars: &mut VarStore) -> (f64, f64);
    fn log_evidence(&self, _vars: &VarStore) -> f64 {
        0.0
    }
}
```
- [ ] **Step 5: Update `Schedule` trait**
```rust
// src/schedule.rs
pub trait Schedule: Send + Sync {
    fn run(&self, factors: &mut [BuiltinFactor], vars: &mut VarStore) -> ScheduleReport;
}
```
- [ ] **Step 6: Verify both feature combinations still compile and test**
```bash
cargo build --features approx
cargo build --features approx,rayon
cargo test --features approx --lib
cargo clippy --all-targets --features approx -- -D warnings
```
If any built-in type fails the auto-`Send`/`Sync` check, investigate — something non-thread-safe slipped in. Likely suspects: raw pointers, `Rc`, `RefCell`, non-static references. None are expected in this codebase; if found, convert to `Arc`/`Mutex`/`RwLock` as appropriate.
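A compile-time assertion catches a non-`Send`/`Sync` type before any test runs; this is a generic idiom, not part of the plan's file map, and `ConstantDrift` here is a stand-in for the crate's type:

```rust
// Fails to compile, rather than at runtime, if T is not Send + Sync.
fn assert_send_sync<T: Send + Sync>() {}

// Stand-in for the crate's built-in type, assumed field-wise thread-safe.
#[derive(Clone, Copy, Debug)]
struct ConstantDrift {
    sigma_per_unit: f64,
}

fn _trait_bound_checks() {
    assert_send_sync::<i64>();
    assert_send_sync::<ConstantDrift>();
    // assert_send_sync::<std::rc::Rc<f64>>(); // would not compile: Rc is !Send
}
```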
- [ ] **Step 7: Commit**
```bash
cargo +nightly fmt
git add -A
git commit -m "$(cat <<'EOF'
feat(api): add Send + Sync bounds to public traits
Required for T3 rayon-based parallelism. Affected traits:
- Time (+ Send + Sync + 'static)
- Drift<T> (+ Send + Sync)
- Observer<T> (+ Send + Sync)
- Factor (+ Send + Sync)
- Schedule (+ Send + Sync)
All built-in impls (i64, Untimed, ConstantDrift, NullObserver,
EpsilonOrMax, TeamSumFactor, RankDiffFactor, TruncFactor,
BuiltinFactor) naturally satisfy these bounds — no internal changes
needed.
Minor breaking change: downstream impls that aren't already
thread-safe will fail to compile until they add the bounds.
Part of T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.
EOF
)"
```
---
## Task 4: Implement greedy color-group partitioning
**Files:**
- Create: `src/color_group.rs`
- Modify: `src/lib.rs` (register module)
- [ ] **Step 1: Create `src/color_group.rs`**
```rust
//! Greedy graph coloring for within-slice event independence.
//!
//! Events sharing no `Index` can be processed in parallel under async-EP
//! semantics. This module partitions a list of events into "colors" such
//! that events of the same color touch disjoint index sets.
//!
//! The algorithm is greedy: for each event in ingestion order, place it in
//! the lowest-numbered color whose existing members share no `Index`. If
//! no existing color accepts the event, open a new color.
//!
//! Complexity: O(n × c × m) where n is events, c is colors (small, ≤ 5 in
//! practice), and m is average team size.
use std::collections::HashSet;
use crate::Index;
/// Partition of event indices into color groups.
///
/// Each inner `Vec<usize>` holds the indices (into the original events
/// array) of events assigned to one color. Colors are iterated in ascending
/// order by convention.
#[derive(Clone, Debug, Default)]
pub(crate) struct ColorGroups {
    pub(crate) groups: Vec<Vec<usize>>,
}

impl ColorGroups {
    pub(crate) fn new() -> Self {
        Self::default()
    }

    pub(crate) fn n_colors(&self) -> usize {
        self.groups.len()
    }

    pub(crate) fn is_empty(&self) -> bool {
        self.groups.is_empty()
    }

    /// Total event count across all colors.
    pub(crate) fn total_events(&self) -> usize {
        self.groups.iter().map(|g| g.len()).sum()
    }
}
/// Compute color groups greedily.
///
/// `event_indices` yields, for each event, the set of `Index` values that
/// event touches. The returned `ColorGroups` has one inner `Vec<usize>` per
/// color, containing event indices in the order they were assigned.
pub(crate) fn color_greedy<I, F>(n_events: usize, index_set: F) -> ColorGroups
where
    F: Fn(usize) -> I,
    I: IntoIterator<Item = Index>,
{
    let mut groups: Vec<Vec<usize>> = Vec::new();
    let mut members: Vec<HashSet<Index>> = Vec::new();
    for ev_idx in 0..n_events {
        let ev_members: HashSet<Index> = index_set(ev_idx).into_iter().collect();
        // Find the first color whose member set is disjoint from this
        // event's indices.
        let chosen = members.iter().position(|m| m.is_disjoint(&ev_members));
        let color_idx = match chosen {
            Some(c) => c,
            None => {
                groups.push(Vec::new());
                members.push(HashSet::new());
                groups.len() - 1
            }
        };
        groups[color_idx].push(ev_idx);
        members[color_idx].extend(ev_members);
    }
    ColorGroups { groups }
}
#[cfg(test)]
mod tests {
    use super::*;

    fn idx(i: usize) -> Index {
        Index::from(i)
    }

    #[test]
    fn single_event_gets_one_color() {
        let cg = color_greedy(1, |_| vec![idx(0), idx(1)]);
        assert_eq!(cg.n_colors(), 1);
        assert_eq!(cg.groups[0], vec![0]);
    }

    #[test]
    fn disjoint_events_share_a_color() {
        // Event 0 touches {0, 1}; event 1 touches {2, 3}.
        let cg = color_greedy(2, |i| match i {
            0 => vec![idx(0), idx(1)],
            1 => vec![idx(2), idx(3)],
            _ => unreachable!(),
        });
        assert_eq!(cg.n_colors(), 1);
        assert_eq!(cg.groups[0], vec![0, 1]);
    }

    #[test]
    fn overlapping_events_need_separate_colors() {
        // Event 0 touches {0, 1}; event 1 touches {1, 2}.
        let cg = color_greedy(2, |i| match i {
            0 => vec![idx(0), idx(1)],
            1 => vec![idx(1), idx(2)],
            _ => unreachable!(),
        });
        assert_eq!(cg.n_colors(), 2);
        assert_eq!(cg.groups[0], vec![0]);
        assert_eq!(cg.groups[1], vec![1]);
    }

    #[test]
    fn three_events_two_colors() {
        // Event 0: {0, 1}; event 1: {2, 3}; event 2: {0, 2}.
        // Greedy: ev0→c0, ev1→c0 (disjoint), ev2 overlaps both→c1.
        let cg = color_greedy(3, |i| match i {
            0 => vec![idx(0), idx(1)],
            1 => vec![idx(2), idx(3)],
            2 => vec![idx(0), idx(2)],
            _ => unreachable!(),
        });
        assert_eq!(cg.n_colors(), 2);
        assert_eq!(cg.groups[0], vec![0, 1]);
        assert_eq!(cg.groups[1], vec![2]);
    }

    #[test]
    fn total_events_counts_correctly() {
        let cg = color_greedy(4, |_| vec![idx(0)]);
        // All events touch index 0 → 4 distinct colors.
        assert_eq!(cg.n_colors(), 4);
        assert_eq!(cg.total_events(), 4);
    }
}
```
- [ ] **Step 2: Register in `src/lib.rs`**
Add (private module, alphabetical):
```rust
mod color_group;
```
No public re-export — this is internal infrastructure.
- [ ] **Step 3: Verify**
```bash
cargo test --features approx --lib color_group
cargo clippy --all-targets --features approx -- -D warnings
cargo +nightly fmt --check
```
Expected: 5 tests pass in the `color_group::tests` module.
- [ ] **Step 4: Commit**
```bash
cargo +nightly fmt
git add src/color_group.rs src/lib.rs
git commit -m "$(cat <<'EOF'
feat(color-group): add greedy within-slice event partitioning
ColorGroups holds a partition of event indices into color groups such
that events of the same color touch no shared Index. Computed greedily
in ingestion order: each event goes into the first color whose existing
members are disjoint from the event's indices.
Used in T3 for safe within-slice parallelism — events in the same
color can run concurrently without touching each other's skills.
Part of T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.
EOF
)"
```
---
## Task 5: Store color-group partition in `TimeSlice`
**Files:**
- Modify: `src/time_slice.rs`
- [ ] **Step 1: Add field to `TimeSlice`**
```rust
pub(crate) struct TimeSlice<T: Time> {
    // … existing fields …
    pub(crate) color_groups: crate::color_group::ColorGroups,
}
```
Initialize to empty in `TimeSlice::new`:
```rust
pub fn new(time: T, p_draw: f64) -> Self {
    Self {
        // … existing initializers …
        color_groups: crate::color_group::ColorGroups::new(),
    }
}
```
- [ ] **Step 2: Recompute color groups whenever events change**
Find the methods on `TimeSlice` that mutate `self.events` (most likely `add_events` — look at the current signature in `src/time_slice.rs`). After any mutation, recompute the partition:
```rust
pub(crate) fn recompute_color_groups(&mut self) {
    let n = self.events.len();
    self.color_groups = crate::color_group::color_greedy(n, |ev_idx| {
        // Return every Index touched by event ev_idx.
        // Each event has teams; each team has items; each item has an agent: Index.
        self.events[ev_idx]
            .teams
            .iter()
            .flat_map(|t| t.items.iter().map(|it| it.agent))
            .collect::<Vec<_>>()
    });
}
```
Call `recompute_color_groups()` at the end of any public/crate-visible method that mutates `self.events` (likely one or two sites). **Note the exact method name by reading `src/time_slice.rs` first** — it may differ from "add_events".
- [ ] **Step 3: Add a sanity test**
```rust
#[test]
fn time_slice_recomputes_color_groups() {
    // Construct a time slice with 2 events sharing competitor a — they
    // should end up in different color groups.
    // … use existing test helpers to build the slice …
    // Assert slice.color_groups.n_colors() == 2 and each group has 1 event.
}
```
The exact helper pattern depends on how existing `src/time_slice.rs::tests` construct slices. Mirror the existing approach.
- [ ] **Step 4: Verify**
```bash
cargo test --features approx --lib time_slice
cargo clippy --all-targets --features approx -- -D warnings
```
Expected: all existing time_slice tests still pass + 1 new test.
- [ ] **Step 5: Commit**
```bash
cargo +nightly fmt
git add -A
git commit -m "$(cat <<'EOF'
feat(time-slice): pre-compute color groups at ingestion
TimeSlice now stores a ColorGroups partition recomputed whenever
events change. The partition is computed once per slice mutation and
reused on every convergence iteration, enabling the cheap within-slice
parallel sweep added in Task 6.
Part of T3.
EOF
)"
```
---
## Task 6: Parallel within-slice event iteration (behind `rayon` feature)
**Files:**
- Modify: `src/time_slice.rs`
This is the core parallelism work. Within a color group, events touch disjoint `Index` values — it's safe to process them concurrently. Across colors, processing is strictly sequential. This preserves exact async-EP semantics.
- [ ] **Step 1: Identify the event-iteration hot path**
Read `src/time_slice.rs` to find the method that, per convergence iteration, walks all events in the slice and updates their skills. Look for patterns like `for event in self.events.iter_mut()` or `self.events.iter().for_each(...)`. This is the target for parallelization.
Probably named `iteration` or similar. Note its full signature and current body.
- [ ] **Step 2: Rewrite the loop as color-group-driven**
**Sequential version** (always present, used when `rayon` is disabled):
```rust
pub(crate) fn iteration(&mut self) -> (f64, f64) {
    let mut max_step = (0.0_f64, 0.0_f64);
    for color in &self.color_groups.groups {
        for &ev_idx in color {
            let step = self.events[ev_idx].iteration();
            max_step.0 = max_step.0.max(step.0);
            max_step.1 = max_step.1.max(step.1);
        }
    }
    max_step
}
```
**Parallel version** (behind `cfg`):
```rust
#[cfg(feature = "rayon")]
pub(crate) fn iteration(&mut self, ) -> (f64, f64) {
use rayon::prelude::*;
let mut max_step = (0.0_f64, 0.0_f64);
for color in &self.color_groups.groups {
// Within one color, events touch disjoint Indexes → safe to parallelize.
// SAFETY: color_greedy guarantees disjoint index sets, so the slice
// entries color[i] and color[j] (i ≠ j) can be mutably borrowed
// concurrently via rayon's work-stealing executor. We use
// par_iter over the event indices and bundle the per-event state
// into something Sync.
//
// Concretely: since `self` is &mut and we can't simultaneously have
// mut borrows into it, we need to split the borrow. Use
// events[..].par_iter_mut() filtered to the color's indices? No —
// rayon's par_iter_mut doesn't support index filtering directly.
//
// Instead: use events.as_mut_slice().par_chunks_mut() after sorting
// events so color-members are contiguous — or extract the per-event
// state into a &mut [EventState] and index directly.
//
// Practical implementation: extract the events we'll touch this color
// into a Vec<&mut Event> using split_at_mut or similar. See impl below.
let contributions: Vec<(f64, f64)> = color
.par_iter()
.map(|&ev_idx| {
// Borrow the event mutably — this requires unsafe or an
// alternate data layout that exposes color-disjoint slices.
// See the "Design note" below for the chosen approach.
todo!("color-disjoint mutable event access")
})
.collect();
for (d0, d1) in contributions {
max_step.0 = max_step.0.max(d0);
max_step.1 = max_step.1.max(d1);
}
}
max_step
}
```
**Design note on mutable access:** Rayon's `par_iter_mut` doesn't let us access arbitrary indices from the same `&mut Vec<Event>` in parallel. Options the implementer should choose between, depending on which compiles most cleanly with the existing `TimeSlice::iteration` body:
1. **Interior mutability.** Wrap each `Event` in `Cell<…>` or `RefCell<…>`, then `Event: Sync`. This works only if the event's internal state is small/cheap to copy. Adds overhead.
2. **Manual `split_at_mut` sequence.** Sort events into color order once (mutating `self.events` so color[0][0], color[0][1], …, color[1][0], … are contiguous), remember the boundaries, then `par_chunks_mut` over each color's contiguous range. Simple, no unsafe. Does require a one-time sort when color groups are computed.
3. **Raw pointer juggling.** `SAFETY`-commented `unsafe` blocks that pass `*mut Event` into parallel closures. Fast, but fragile. Avoid unless (1) and (2) are benchmarked and found insufficient.
**Recommendation: approach (2).** When color groups are computed in Task 5's `recompute_color_groups`, also physically reorder `self.events` so color members are contiguous. Then each color corresponds to a slice range `self.events[color_start..color_end]`, and `par_chunks_mut(…).for_each(…)` or `par_iter_mut()` over the range works.
**Revised `recompute_color_groups`:**
```rust
pub(crate) fn recompute_color_groups(&mut self) {
    let n = self.events.len();
    let cg = crate::color_group::color_greedy(n, |ev_idx| {
        self.events[ev_idx]
            .teams
            .iter()
            .flat_map(|t| t.items.iter().map(|it| it.agent))
            .collect::<Vec<_>>()
    });
    // Physically reorder self.events to match the color-group layout.
    let mut reordered: Vec<Event> = Vec::with_capacity(n);
    let mut ranges: Vec<(usize, usize)> = Vec::with_capacity(cg.groups.len());
    for group in &cg.groups {
        let start = reordered.len();
        for &ev_idx in group {
            reordered.push(std::mem::replace(
                &mut self.events[ev_idx],
                // Placeholder; the original slot becomes garbage before the
                // final swap.
                Event::placeholder(), // OR use std::mem::take if Event: Default
            ));
        }
        ranges.push((start, reordered.len()));
    }
    self.events = reordered;
    // Rebuild the partition with post-reorder indices: each group now spans
    // a contiguous range.
    self.color_groups = ColorGroups::from_ranges(ranges);
}
```
Then color[i] is event indices `[ranges[i].0 .. ranges[i].1)`.
`ColorGroups::from_ranges(ranges: Vec<(usize, usize)>)` constructs the groups with the trivial mapping `group_i = (start..end).collect()`.
**Pitfall**: `Event::placeholder()` or `std::mem::take`. If `Event: Default`, `std::mem::take(&mut self.events[ev_idx])` works cleanly. If not, use `Option<Event>` temporarily, or implement a cheap `Event::placeholder()`. **Before writing the replacement logic, read `src/time_slice.rs` to see if Event derives Default — if it does, use `std::mem::take`.**
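`ColorGroups::from_ranges` is referenced but never defined in this plan. One plausible shape, assuming the struct gains a `ranges` field (replacing or accompanying `groups`) to match the contiguous layout used in Steps 3 and 4:

```rust
/// Contiguous color-group layout after the physical reorder: color i owns
/// event indices [ranges[i].0 .. ranges[i].1). Hypothetical sketch; field
/// and method names are assumptions, not the crate's actual API.
#[derive(Clone, Debug, Default)]
struct ColorGroups {
    ranges: Vec<(usize, usize)>,
}

impl ColorGroups {
    fn from_ranges(ranges: Vec<(usize, usize)>) -> Self {
        // Ranges must tile [0, n) without gaps: each group's start is the
        // previous group's end.
        debug_assert!(ranges.windows(2).all(|w| w[0].1 == w[1].0));
        Self { ranges }
    }

    fn n_colors(&self) -> usize {
        self.ranges.len()
    }

    fn total_events(&self) -> usize {
        self.ranges.iter().map(|(s, e)| e - s).sum()
    }
}
```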
- [ ] **Step 3: Implement the parallel iteration via contiguous ranges**
```rust
#[cfg(feature = "rayon")]
pub(crate) fn iteration(&mut self, ) -> (f64, f64) {
use rayon::prelude::*;
let mut max_step = (0.0_f64, 0.0_f64);
for range in &self.color_groups.ranges {
let slice = &mut self.events[range.0..range.1];
let contributions: Vec<(f64, f64)> = slice
.par_iter_mut()
.map(|event| event.iteration())
.collect();
for (d0, d1) in contributions {
max_step.0 = max_step.0.max(d0);
max_step.1 = max_step.1.max(d1);
}
}
max_step
}
```
**IMPORTANT**: `event.iteration(…)` currently takes other borrowed state from `TimeSlice` (skills, competitors). Those borrows conflict with the `&mut self.events[...]` borrow. Resolution:
a) Pull the shared data (`&self.skills`, `&self.competitors`) into a local variable before the par_iter, so rayon's closure captures `&` not `&self`.
b) If the existing per-event method ALSO mutates shared state (e.g., writes to a shared SkillStore), that's a problem — it breaks the color-disjoint-index guarantee. Read the current code carefully. If per-event iteration writes to skills for indices it owns, the color-group invariant makes this safe, but you'll need unsafe to express it. In that case, fall back to approach (1) or (3) from Task 6 Step 2.
**Read the existing `event.iteration` or equivalent FIRST before choosing.** The code may already be structured so events only mutate themselves.
- [ ] **Step 4: Sequential fallback**
```rust
#[cfg(not(feature = "rayon"))]
pub(crate) fn iteration(&mut self) -> (f64, f64) {
    let mut max_step = (0.0_f64, 0.0_f64);
    for range in &self.color_groups.ranges {
        for event in &mut self.events[range.0..range.1] {
            let step = event.iteration();
            max_step.0 = max_step.0.max(step.0);
            max_step.1 = max_step.1.max(step.1);
        }
    }
    max_step
}
```
Both versions use the same color-group traversal order — behavior is identical across feature flags.
- [ ] **Step 5: Verify both feature combinations**
```bash
cargo test --features approx --lib
cargo test --features approx,rayon --lib
cargo clippy --all-targets --features approx -- -D warnings
cargo clippy --all-targets --features approx,rayon -- -D warnings
```
All 90 tests pass in both configurations. Goldens must NOT drift.
- [ ] **Step 6: Commit**
```bash
cargo +nightly fmt
git add -A
git commit -m "$(cat <<'EOF'
feat(time-slice): parallel within-slice iteration via rayon
Events are reordered into color-group-contiguous ranges during
recompute_color_groups; each color's range is processed in parallel
via par_iter_mut when the rayon feature is enabled, sequentially
otherwise. The two paths produce identical results because events
within a color touch disjoint Index values (async-EP invariant).
Feature gated: default build still sequential; --features rayon
activates the parallel path.
Part of T3.
EOF
)"
```
---
## Task 7: Parallel `learning_curves` with ordered reduction
**Files:**
- Modify: `src/history.rs`
Both `learning_curves()` and `learning_curves_by_index()` currently iterate `self.time_slices` and collect per-competitor posteriors. They're embarrassingly parallel per-slice; the merge at the end must preserve slice order to keep tests deterministic.
- [ ] **Step 1: Parallelize `learning_curves`**
```rust
pub fn learning_curves(&self) -> HashMap<K, Vec<(T, Gaussian)>> {
    #[cfg(feature = "rayon")]
    {
        use rayon::prelude::*;

        // Parallel: compute per-slice (index, time, gaussian) triples;
        // collect preserves slice order (.collect::<Vec<_>> is order-preserving).
        let per_slice: Vec<Vec<(Index, T, Gaussian)>> = self
            .time_slices
            .par_iter()
            .map(|ts| {
                ts.skills
                    .iter()
                    .map(|(idx, sk)| (idx, ts.time, sk.posterior()))
                    .collect()
            })
            .collect();
        // Sequential merge: iterate in slice order, push to per-key vectors.
        let mut data: HashMap<K, Vec<(T, Gaussian)>> = HashMap::new();
        for slice_contrib in per_slice {
            for (idx, t, g) in slice_contrib {
                if let Some(key) = self.keys.key(idx).cloned() {
                    data.entry(key).or_default().push((t, g));
                }
            }
        }
        data
    }
    #[cfg(not(feature = "rayon"))]
    {
        // Original sequential impl (unchanged).
        let mut data: HashMap<K, Vec<(T, Gaussian)>> = HashMap::new();
        for ts in &self.time_slices {
            for (idx, sk) in ts.skills.iter() {
                if let Some(key) = self.keys.key(idx).cloned() {
                    data.entry(key).or_default().push((ts.time, sk.posterior()));
                }
            }
        }
        data
    }
}
```
- [ ] **Step 2: Parallelize `learning_curves_by_index`**
Same pattern, if the method still exists. (Task 20 of T2 may have removed it; verify it's present.) If it is, mirror the `learning_curves` parallel+sequential split.
- [ ] **Step 3: Verify**
```bash
cargo test --features approx,rayon
```
All tests pass — goldens preserved across both feature configurations.
- [ ] **Step 4: Commit**
```bash
cargo +nightly fmt
git add -A
git commit -m "$(cat <<'EOF'
feat(history): parallel learning_curves under rayon feature
Per-slice posterior collection runs in parallel; merge into the
per-key HashMap is sequential in slice order to preserve deterministic
output. Sequential impl unchanged under default feature set.
Part of T3.
EOF
)"
```
---
## Task 8: Parallel `log_evidence` / `log_evidence_for` with deterministic sum
**Files:**
- Modify: `src/history.rs`
`log_evidence_internal` already does `self.time_slices.iter().map(|ts| ts.log_evidence(…)).sum()`. We replace with parallel map + sequential sum.
- [ ] **Step 1: Parallelize `log_evidence_internal`**
```rust
pub(crate) fn log_evidence_internal(&mut self, forward: bool, targets: &[Index]) -> f64 {
    #[cfg(feature = "rayon")]
    {
        use rayon::prelude::*;
        let per_slice: Vec<f64> = self
            .time_slices
            .par_iter()
            .map(|ts| ts.log_evidence(self.online, targets, forward, &self.competitors))
            .collect();
        // Sequential sum in slice order for bit-identical reduction.
        per_slice.into_iter().sum()
    }
    #[cfg(not(feature = "rayon"))]
    {
        self.time_slices
            .iter()
            .map(|ts| ts.log_evidence(self.online, targets, forward, &self.competitors))
            .sum()
    }
}
```
**Critical:** `per_slice` is a `Vec<f64>`, not a fold. The sequential `.into_iter().sum()` is bit-identical to the sequential impl because the order is the same (slice order). Rayon's `par_iter().sum()` would reorder additions and is non-deterministic.
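A std-only illustration of why the reduction order matters: floating-point addition is not associative, so summing the same per-slice values in a different order can legitimately change bits run-to-run, which is exactly what a nondeterministic parallel reduction would do.

```rust
// Same three addends, two orders. The reordered sum loses the 1.0
// because (1e16 + 1.0) rounds back to 1e16 before -1e16 cancels it.
fn sums() -> (f64, f64) {
    let slice_order = [1e16_f64, -1e16, 1.0];
    let reordered = [1e16_f64, 1.0, -1e16]; // same values, shuffled
    (slice_order.iter().sum(), reordered.iter().sum())
}

fn main() {
    let (a, b) = sums();
    assert_eq!(a, 1.0); // (1e16 + -1e16) + 1.0
    assert_eq!(b, 0.0); // (1e16 + 1.0) == 1e16; then + -1e16
    assert_ne!(a.to_bits(), b.to_bits());
}
```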
- [ ] **Step 2: `log_evidence_for` already wraps `log_evidence_internal`** — no further changes needed.
- [ ] **Step 3: Verify**
```bash
cargo test --features approx,rayon
```
All tests pass — goldens preserved.
- [ ] **Step 4: Commit**
```bash
cargo +nightly fmt
git add -A
git commit -m "$(cat <<'EOF'
feat(history): parallel log_evidence with deterministic sum
Per-slice contribution computed in parallel; final reduction is
sequential in slice order so the sum is bit-identical to the T2
sequential baseline. This is essential for the T3 acceptance
criterion of identical posteriors across RAYON_NUM_THREADS values.
Part of T3.
EOF
)"
```
---
## Task 9: Determinism test across thread counts
**Files:**
- Create: `tests/determinism.rs`
- [ ] **Step 1: Create the test**
```rust
//! Determinism tests: identical posteriors across RAYON_NUM_THREADS
//! values. Only meaningful when the `rayon` feature is enabled.
#![cfg(feature = "rayon")]
use smallvec::smallvec;
use trueskill_tt::{
    ConstantDrift, ConvergenceOptions, Event, History, Member, Outcome, Team,
};

fn build_and_converge(seed: u64) -> Vec<(i64, trueskill_tt::Gaussian)> {
    // Seed-driven deterministic test data: ~100 events, ~30 competitors.
    let mut h = History::<i64, _, _, String>::builder()
        .mu(25.0)
        .sigma(25.0 / 3.0)
        .beta(25.0 / 6.0)
        .drift(ConstantDrift(25.0 / 300.0))
        .convergence(ConvergenceOptions { max_iter: 30, epsilon: 1e-6 })
        .build();
    // Deterministic pseudo-random event generation (64-bit LCG).
    let mut rng_state = seed;
    let mut next = || {
        rng_state = rng_state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        rng_state
    };
    let mut events: Vec<Event<i64, String>> = Vec::with_capacity(100);
    for ev_i in 0..100 {
        let a = (next() % 30) as usize;
        let mut b = (next() % 30) as usize;
        while b == a {
            b = (next() % 30) as usize;
        }
        events.push(Event {
            time: ev_i as i64 + 1,
            teams: smallvec![
                Team::with_members([Member::new(format!("p{a}"))]),
                Team::with_members([Member::new(format!("p{b}"))]),
            ],
            outcome: Outcome::winner((next() % 2) as u32, 2),
        });
    }
    h.add_events(events).unwrap();
    h.converge().unwrap();
    h.learning_curve(&"p0".to_string())
}

#[test]
fn posteriors_identical_across_thread_counts() {
    // Run the same history with the same seed at different rayon pool sizes.
    // Rayon fixes the global pool once per process, so the cleanest pattern
    // is an explicit ThreadPoolBuilder per run, entered via install().
    let sizes = [1, 2, 4, 8];
    let mut results: Vec<Vec<(i64, trueskill_tt::Gaussian)>> = Vec::new();
    for &n in &sizes {
        let pool = rayon::ThreadPoolBuilder::new()
            .num_threads(n)
            .build()
            .unwrap();
        let curve = pool.install(|| build_and_converge(42));
        results.push(curve);
    }
    // Every result must be bit-identical to the first.
    let reference = &results[0];
    for (i, curve) in results.iter().enumerate().skip(1) {
        assert_eq!(
            curve.len(),
            reference.len(),
            "curve length differs at {n} threads",
            n = sizes[i],
        );
        // Bind by reference so Gaussian need not be Copy.
        for (j, ((t_ref, g_ref), (t, g))) in reference.iter().zip(curve.iter()).enumerate() {
            assert_eq!(t_ref, t, "time point {j} differs at {} threads", sizes[i]);
            assert_eq!(
                g_ref.pi_and_tau(),
                g.pi_and_tau(),
                "posterior bits differ at thread count {}, time {}",
                sizes[i],
                t,
            );
        }
    }
}
```
**Note on `pi_and_tau`**: the test expects a way to extract the raw nat-param representation for bit-level comparison. If the `Gaussian` type doesn't expose it, adding an accessor means making it `pub`, not `pub(crate)` — integration tests under `tests/` compile as a separate crate and can't see `pub(crate)` items. Simpler: `f64::to_bits()` on the existing `mu()`/`sigma()` accessors gives direct bit equality:
```rust
assert_eq!(g_ref.mu().to_bits(), g.mu().to_bits());
assert_eq!(g_ref.sigma().to_bits(), g.sigma().to_bits());
```
Use `to_bits()` so we don't need a new accessor.
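One caveat worth keeping in mind: `to_bits` equality is stricter than `==` in exactly the ways a determinism test wants. A std-only check:

```rust
// to_bits() compares representations, == compares values:
// - signed zeros are == but have different bit patterns;
// - a NaN is never == to itself, but its bits are stable.
fn main() {
    assert!(0.0_f64 == -0.0_f64); // value equality conflates the zeros
    assert_ne!(0.0_f64.to_bits(), (-0.0_f64).to_bits());
    let nan = f64::NAN;
    assert!(nan != nan); // value equality rejects identical NaNs
    assert_eq!(nan.to_bits(), nan.to_bits());
}
```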
- [ ] **Step 2: Verify**
```bash
cargo test --features approx,rayon --test determinism
```
Expected: `posteriors_identical_across_thread_counts` passes.
- [ ] **Step 3: Commit**
```bash
cargo +nightly fmt
git add tests/determinism.rs
git commit -m "$(cat <<'EOF'
test: assert bit-identical posteriors across RAYON_NUM_THREADS
Runs the same deterministic history at thread counts {1, 2, 4, 8}
and asserts every (time, posterior) pair is bit-identical. Verifies
the T3 determinism invariant holds under the ordered-reduce strategy.
Only compiled with --features rayon.
Part of T3.
EOF
)"
```
---
## Task 10: Multi-thread benchmark + acceptance gate
**Files:**
- Create: `benches/history_converge.rs`
- Modify: `Cargo.toml` (register the bench)
- [ ] **Step 1: Register the bench in `Cargo.toml`**
Add under the existing `[[bench]]` entries:
```toml
[[bench]]
name = "history_converge"
harness = false
```
- [ ] **Step 2: Create `benches/history_converge.rs`**
```rust
//! End-to-end convergence benchmark. Measures `History::converge()` on a
//! realistic workload (~500 events, ~100 competitors, ~30 iters). The
//! rayon feature, when enabled, activates within-slice parallel event
//! iteration.
use criterion::{Criterion, criterion_group, criterion_main};
use smallvec::smallvec;
use trueskill_tt::{
    ConstantDrift, ConvergenceOptions, Event, History, Member, Outcome, Team,
};

fn build_history(
    n_events: usize,
    n_competitors: usize,
    seed: u64,
) -> History<i64, ConstantDrift, trueskill_tt::NullObserver, String> {
    let mut rng = seed;
    let mut next = || {
        rng = rng
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        rng
    };
    let mut h = History::<i64, _, _, String>::builder()
        .mu(25.0)
        .sigma(25.0 / 3.0)
        .beta(25.0 / 6.0)
        .drift(ConstantDrift(25.0 / 300.0))
        .convergence(ConvergenceOptions { max_iter: 30, epsilon: 1e-6 })
        .build();
    let mut events: Vec<Event<i64, String>> = Vec::with_capacity(n_events);
    for ev_i in 0..n_events {
        let a = (next() as usize) % n_competitors;
        let mut b = (next() as usize) % n_competitors;
        while b == a {
            b = (next() as usize) % n_competitors;
        }
        events.push(Event {
            time: (ev_i as i64 / 10) + 1, // ~10 events per slice
            teams: smallvec![
                Team::with_members([Member::new(format!("p{a}"))]),
                Team::with_members([Member::new(format!("p{b}"))]),
            ],
            outcome: Outcome::winner((next() % 2) as u32, 2),
        });
    }
    h.add_events(events).unwrap();
    h
}

fn bench_converge(c: &mut Criterion) {
    c.bench_function("History::converge/500x100", |b| {
        b.iter_batched(
            || build_history(500, 100, 42),
            |mut h| {
                h.converge().unwrap();
            },
            criterion::BatchSize::SmallInput,
        );
    });
}

criterion_group!(benches, bench_converge);
criterion_main!(benches);
```
- [ ] **Step 3: Run sequential baseline**
```bash
cargo bench --bench history_converge --features approx 2>&1 | grep 'History::converge'
```
Record the sequential baseline number.
- [ ] **Step 4: Run parallel version**
```bash
cargo bench --bench history_converge --features approx,rayon 2>&1 | grep 'History::converge'
```
**Acceptance gate:** parallel run should show >2× speedup on an 8-core machine. If it's less:
- Verify rayon is actually being used (`RAYON_LOG=1`).
- Check whether the workload has enough color-group parallelism (a slice with 10 events that all share competitors has 0 parallel work).
- Consider whether the per-event cost is large enough to amortize rayon overhead.
If the gate fails, dig in before committing. Acceptable fallback: tune the benchmark workload (more events per slice, more competitors) so parallelism has more opportunity. Report findings.
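The second diagnostic above (color-group parallelism) can be checked by hand with a greedy-coloring sketch. This is a std-only toy, not the crate's `ColorGroups` implementation; `color_events` and the `Vec<usize>` index sets are illustrative:

```rust
use std::collections::HashSet;

/// Greedy coloring sketch: assign each event the smallest color whose
/// already-assigned events share no competitor index with it. Events
/// in the same color are mutually disjoint and could run in parallel.
fn color_events(events: &[Vec<usize>]) -> Vec<usize> {
    let mut colors: Vec<usize> = Vec::with_capacity(events.len());
    let mut used: Vec<HashSet<usize>> = Vec::new(); // indices touched per color
    for ev in events {
        let color = used
            .iter()
            .position(|idxs| ev.iter().all(|i| !idxs.contains(i)))
            .unwrap_or_else(|| {
                used.push(HashSet::new());
                used.len() - 1
            });
        used[color].extend(ev.iter().copied());
        colors.push(color);
    }
    colors
}

fn main() {
    // Events 0 and 1 share competitor 1 and must split; event 2 is
    // disjoint from event 0 and can reuse its color.
    assert_eq!(color_events(&[vec![0, 1], vec![1, 2], vec![3, 4]]), vec![0, 1, 0]);
    // Worst case: every event shares competitor 0, so each event gets
    // its own color -- zero parallel work, as the bullet above warns.
    assert_eq!(color_events(&[vec![0, 1], vec![0, 2], vec![0, 3]]), vec![0, 1, 2]);
}
```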
- [ ] **Step 5: Commit**
```bash
cargo +nightly fmt
git add -A
git commit -m "$(cat <<'EOF'
bench: end-to-end History::converge benchmark
Workload: 500 events across ~50 time slices, ~100 competitors, 30
iteration cap, 1e-6 convergence. Measures full forward+backward
sweep through convergence — the target hot path for rayon-backed
within-slice parallelism.
Sequential baseline: <X> ms
With --features rayon on 8 cores: <Y> ms (<speedup>×)
Part of T3.
EOF
)"
```
(Fill in `<X>`, `<Y>`, `<speedup>` from measured numbers.)
---
## Task 11: Final verification, bench capture, CHANGELOG
**Files:**
- Modify: `benches/baseline.txt`
- Modify: `CHANGELOG.md`
- [ ] **Step 1: Run the complete verification matrix**
```bash
cargo +nightly fmt --check
cargo clippy --all-targets --features approx -- -D warnings
cargo clippy --all-targets --features approx,rayon -- -D warnings
cargo test --features approx
cargo test --features approx,rayon
cargo bench --bench batch --features approx 2>&1 | grep "Batch::iteration"
cargo bench --bench batch --features approx,rayon 2>&1 | grep "Batch::iteration"
cargo bench --bench history_converge --features approx 2>&1 | grep 'History::converge'
cargo bench --bench history_converge --features approx,rayon 2>&1 | grep 'History::converge'
```
All must pass / be clean.
- [ ] **Step 2: Append T3 block to `benches/baseline.txt`**
```
# After T3 (date, same hardware)
Batch::iteration (seq) <X.XX> µs (<delta> vs T2 21.36 µs)
Batch::iteration (rayon, 8c) <X.XX> µs (sequential path on single slice; rayon not active)
History::converge (seq) <X.XX> ms baseline
History::converge (rayon, 8c) <X.XX> ms (<speedup>× — target was ≥2×)
# Notes:
# - T3 adds Send + Sync on public traits, color-group partition in
# TimeSlice, and rayon-feature-gated parallel paths on within-slice
# iteration, learning_curves, log_evidence_for.
# - Determinism verified: tests/determinism.rs asserts bit-identical
# posteriors across RAYON_NUM_THREADS={1, 2, 4, 8}.
# - Rayon is opt-in: default build is single-threaded as in T2.
```
- [ ] **Step 3: Prepend T3 entry to `CHANGELOG.md`**
Add a `## Unreleased — T3 concurrency` section above the existing `## Unreleased — T2 new API surface` (if it's still there after the main merge; otherwise above the `## 0.1.0` section). Enumerate: new `rayon` feature, `Send + Sync` bounds (minor breaking for downstream custom trait impls), color-group infrastructure (internal), benchmark numbers.
- [ ] **Step 4: Commit**
```bash
git add benches/baseline.txt CHANGELOG.md
git commit -m "$(cat <<'EOF'
bench,docs: capture T3 final numbers and update CHANGELOG
History::converge parallel speedup: <X>× on 8 cores with --features rayon.
Batch::iteration unchanged in seq mode; Gaussian ops unchanged.
Determinism verified: bit-identical posteriors across
RAYON_NUM_THREADS={1, 2, 4, 8}.
Closes T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.
EOF
)"
```
- [ ] **Step 5: Ready to PR**
Branch `t3-concurrency` should open against `main`. PR body references the spec, the plan, and this file's benchmarks block.
---
## Self-review notes
**Spec coverage** (against Section 6 + Section 7 "T3"):
- ✅ `Send + Sync` audit and bounds on all public traits (Task 3)
- ✅ Color-group partitioning at `TimeSlice` ingestion (Tasks 4, 5)
- ✅ `rayon` as opt-in feature (Task 2; note: spec said default-on, we flip later)
- ✅ Parallel within-slice color groups (Task 6)
- ✅ Parallel `learning_curves` (Task 7)
- ✅ Parallel `log_evidence_for` (Task 8)
- ✅ Deterministic posteriors across `RAYON_NUM_THREADS` (Task 9)
- ✅ >2× speedup on 8-core offline converge (Task 10)
**Deferred to later tiers:**
- Cross-slice parallelism (dirty-bit slice skipping) — spec acknowledges this is separate from T3's within-slice focus.
- Synchronous-EP schedule with barrier merge — available as a `Schedule` impl, per spec Section 6, but deferred.
- Exposing color-group partitioning to users via `add_events_with_partition` — per spec open question 5.
- Rayon default-on flip — after field-stability evidence.
**Hazards during execution:**
- **Mutable-aliasing gymnastics in Task 6.** The color-disjoint guarantee is real but rayon's API doesn't see it. Task 6 Step 2 offers three implementation approaches; the recommended one (approach 2, reorder events into contiguous color ranges) avoids `unsafe` but requires a physical reorder on mutation. Benchmark reordering cost — if it dominates, reconsider.
- **`Send + Sync` auto-derive failure.** If any struct between traits holds a non-thread-safe primitive (raw pointer, `Rc`), auto-derive fails. Unlikely in this codebase, but watch for it in Task 3.
- **Deterministic sum pitfalls.** Rayon's `par_iter().sum()` reorders additions. Only `par_iter().map().collect::<Vec<_>>().into_iter().sum()` is deterministic. Task 8 is careful about this; reviewers should verify.
- **`f64` NaN in goldens.** If any test golden has a NaN, `to_bits()` comparison needs special casing. Our goldens are all finite; no action needed.
**Things outside the plan that may bite:**
- CI needs to test both feature configurations. If there's a `.github/workflows/` file, update it to include `--features approx,rayon` as a matrix entry. If no CI exists (likely for this project), flag and move on.
- `examples/atp.rs` — may want to demonstrate `--features rayon`. Optional; skip unless trivial.