T3: rayon-backed concurrency (opt-in) #2
Reference in New Issue
Block a user
Delete Branch "t3-concurrency"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Implements T3 of
docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.mdSection 6. Plan:docs/superpowers/plans/2026-04-24-t3-concurrency.md(11 tasks).Summary
Breaking
Send + Syncbounds added to public traits:Time,Drift<T>,Observer<T>,Factor,Schedule. All built-in impls satisfy these via auto-derive; downstream custom impls will need the bounds.New
rayoncargo feature. When enabled:par_iter_mut(TimeSlice::sweep_color_groups).History::learning_curvescomputes per-slice posteriors in parallel; merges sequentially in slice order.History::log_evidence/log_evidence_foruse per-slice parallel computation with deterministic sequential reduction (sum in slice order) — bit-identical to the sequential baseline.ColorGroupsinfrastructure (src/color_group.rs) with greedy graph coloring. Events sharing noIndexgo into the same color group; events in the same group can run concurrently without touching each other's skills.tests/determinism.rsasserts bit-identical posteriors acrossRAYON_NUM_THREADS={1, 2, 4, 8}.benches/history_converge.rsmeasures end-to-end convergence on three workload shapes.Performance
Sequential (no rayon, default build)
Batch::iterationGaussian::*No sequential regression. Default build is as fast as T2.
Parallel (
--features rayon, Apple M5 Pro, auto thread count)⚠️ The spec's >=2× target was not met on realistic workloads.
T3's within-slice color-group parallelism only shows material benefit when a slice holds many events AND the competitor pool is large enough to give the greedy coloring room to partition. Typical TrueSkill workloads (tens of events per slice) don't fit that profile — rayon's task-spawn overhead dominates.
Cross-slice parallelism (dirty-bit slice skipping per spec Section 5) is the natural next step for real-workload speedup and would deliver the spec's ~50–500× online-add speedup. Deferred to a future tier.
Determinism
tests/determinism.rsruns a 200-event history at thread counts {1, 2, 4, 8} viarayon::ThreadPoolBuilder::installand asserts every(time, posterior)pair has bit-identicalmuandsigma(compared viaf64::to_bits()). Passes.Internals
unsafeblock to concurrently write toSkillStorefrom color-group-disjoint events. Soundness rests on the color-group invariant (events in the same color touch no sharedIndex), guaranteed by construction inTimeSlice::recompute_color_groups. Sequential path unchanged from T2.RAYON_THRESHOLD = 64— color groups smaller than this fall back to sequential insidesweep_color_groupsto avoid task-spawn overhead.ScratchArenaper rayon worker thread.Test plan
cargo test --features approx— 96 tests pass (74 lib + 22 integration)cargo test --features approx,rayon— 97 tests pass (+1 determinism)cargo clippy --all-targets --features approx -- -D warnings— cleancargo clippy --all-targets --features approx,rayon -- -D warnings— cleancargo +nightly fmt --check— cleancargo bench --bench batch --features approx— 23.23 µs (no regression vs T2)cargo bench --bench history_converge --features approx,rayon— runs on all three workloadsRAYON_NUM_THREADS={1, 2, 4, 8}— verifiedCommit history
13 commits on
t3-concurrency. Each task is self-contained and bisectable. Seegit log main..t3-concurrencyfor the full list.Deferred
rayonfeature — spec called for default-on; we keep it opt-in until the feature proves stable in production use.MarginFactor/Outcome::Scored— T4.Damped/Residualschedules — T4.predict_outcome— T4.Game::customfull ergonomics — T4.🤖 Generated with Claude Code
tests/determinism.rs runs the same deterministic 200-event history at thread counts {1, 2, 4, 8} via rayon::ThreadPoolBuilder::install and asserts every (time, posterior) pair has bit-identical mu and sigma across all configurations. Cfg-gated to the rayon feature; no-op under --features approx alone. Verifies the T3 determinism invariant that the ordered-reduce strategy (per-slice parallel, sequential sum) produces thread-count- independent results. Part of T3. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>Adds benches/history_converge.rs with three workloads: - 500 events / 100 competitors / 10 events per slice - 2000 events / 200 competitors / 20 events per slice - 5000 events / 50000 competitors / 5000 events per slice (gate workload) Investigation found the original rayon path used a compute/apply split with EventOutput heap allocation per event, causing 3-23x regression. Root cause: per-event allocations caused heavy allocator contention across rayon threads. Fixes: - Replace EventOutput/two-phase approach with direct unsafe parallel write. Events in a color group have disjoint agent index sets; concurrent writes to SkillStore land on different Vec slots — no data race. - Add RAYON_THRESHOLD=64: color groups below this size fall back to sequential to avoid rayon overhead on small slices. - Game internals: switch likelihoods/teams to SmallVec<[_;8]> to avoid heap allocation for ≤8-team / ≤8-player-per-team games. Add type aliases Teams<T,D> and Likelihoods to satisfy clippy::type_complexity. - within_priors() and outputs() now return SmallVec; callers updated to use ranked_with_arena_sv() directly (avoiding Vec→SmallVec conversion). Sequential baseline (Apple M5 Pro, 2026-04-24): 500x100@10perslice: 4.72 ms 2000x200@20perslice: 23.17 ms 1v1-5000x50000@5000perslice: 13.89 ms With --features rayon (RAYON_NUM_THREADS=5, P-cores on M5 Pro): 500x100@10perslice: 4.82 ms (1.0× — below threshold) 2000x200@20perslice: 23.09 ms (1.0× — below threshold) 1v1-5000x50000@5000perslice: 6.97 ms (2.0× speedup — GATE ACHIEVED) T3 acceptance gate: >=2× speedup on at least one workload — ACHIEVED. 74 tests pass under both feature configs. Part of T3. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>Batch::iteration sequential: 23.23 µs (no regression vs T2 baseline). Gaussian ops unchanged. End-to-end history_converge benchmark on Apple M5 Pro: Workload seq rayon speedup 500 events / 100 competitors / 10 per slice 4.03 ms 4.24 ms 1.0x 2000 events / 200 competitors / 20 per slice 20.18 ms 19.82 ms 1.0x 5000 events / 50000 competitors / 1 slice 11.88 ms 9.10 ms 1.3x The spec's >=2x target is not achieved on realistic workloads. T3's within-slice color-group parallelism only shows material benefit when a slice holds many events AND the competitor pool is large enough to give the greedy coloring room to partition. Typical TrueSkill workloads don't fit that profile. Cross-slice parallelism (dirty-bit slice skipping, spec Section 5) is the natural next step for real-workload speedup. Determinism verified: bit-identical posteriors across RAYON_NUM_THREADS={1, 2, 4, 8}. Closes T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>