diff --git a/CHANGELOG.md b/CHANGELOG.md
index ce3ed37..e5136db 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,6 +2,66 @@
 
 All notable changes to this project will be documented in this file.
 
+## Unreleased — T3 concurrency
+
+Adds rayon-backed parallel paths per Section 6 of
+`docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md`.
+
+### Breaking
+
+- `Send + Sync` bounds added to public traits: `Time`, `Drift`,
+  `Observer`, `Factor`, `Schedule`. All built-in impls satisfy these
+  via auto-derive, but downstream custom impls that aren't thread-safe
+  will need the bounds.
+
+### New
+
+- Opt-in `rayon` cargo feature. When enabled:
+  - Within-slice event iteration runs color-group events in parallel
+    via `par_iter_mut` (`TimeSlice::sweep_color_groups`).
+  - `History::learning_curves` computes per-slice posteriors in
+    parallel, merges sequentially in slice order.
+  - `History::log_evidence` / `log_evidence_for` use per-slice parallel
+    computation with deterministic sequential reduction (sum in slice
+    order) — bit-identical to the sequential baseline.
+- `ColorGroups` internal infrastructure with greedy graph coloring
+  (`src/color_group.rs`). Events sharing no `Index` go into the same
+  color group; events in the same group can run concurrently without
+  touching each other's skills.
+- `tests/determinism.rs` asserts bit-identical posteriors across
+  `RAYON_NUM_THREADS={1, 2, 4, 8}`.
+- `benches/history_converge.rs` measures end-to-end convergence on
+  three workload shapes.
+
+### Performance notes
+
+- Default build (no rayon): `Batch::iteration` 23.23 µs — no regression
+  vs T2.
+- With `--features rayon`:
+  - 500 events / 100 competitors / 10 per slice: 1.0× speedup.
+  - 2000 events / 200 competitors / 20 per slice: 1.0× speedup.
+  - 5000 events in one slice / 50k competitors: **1.3× speedup.**
+- The spec targeted >2× speedup on 8-core offline converge. This is
+  only achievable on workloads with many events-per-slice AND large
+  competitor pools. **Typical TrueSkill workloads (tens of events
+  per slice) do not materially benefit from T3's within-slice
+  parallelism** because rayon's task-spawn overhead dominates.
+- Cross-slice parallelism (dirty-bit slice skipping per spec Section
+  5) is the natural next step for real workload speedup — deferred
+  to a future tier.
+
+### Internals
+
+- The parallel path uses an `unsafe` block to concurrently write to
+  `SkillStore` from color-group-disjoint events. Soundness rests on
+  the color-group invariant (events in the same color touch no shared
+  `Index`), which is guaranteed by construction in
+  `TimeSlice::recompute_color_groups`. Sequential path unchanged.
+- `RAYON_THRESHOLD = 64` — color groups smaller than this fall back to
+  sequential iteration inside the parallel `sweep_color_groups` to
+  avoid rayon's task-spawn overhead.
+- Thread-local `ScratchArena` per rayon worker thread.
+
 ## Unreleased — T2 new API surface
 
 Breaking: every renamed type and the new public API land together per
diff --git a/benches/baseline.txt b/benches/baseline.txt
index 26f63ae..2d6e7f2 100644
--- a/benches/baseline.txt
+++ b/benches/baseline.txt
@@ -98,3 +98,35 @@ Gaussian::tau 260.80 ps (unchanged)
 #   learning_curves_by_index(), nested-Vec public add_events().
 # - 90 tests green: 68 lib + 10 api_shape + 6 game + 4 record_winner +
 #   2 equivalence.
+
+# After T3 (2026-04-24, same hardware)
+
+Batch::iteration (seq, no rayon)        23.23 µs   (matches T2 baseline; no regression)
+Batch::iteration (rayon, small slice)   24.57 µs   (within noise; small workloads pay rayon overhead)
+Gaussian::add                           236.62 ps  (unchanged)
+Gaussian::sub                           236.43 ps  (unchanged)
+Gaussian::mul                           237.05 ps  (unchanged)
+Gaussian::div                           236.07 ps  (unchanged)
+
+# End-to-end history_converge benchmark (Apple M5 Pro, RAYON_NUM_THREADS=auto):
+#   workload                                   seq        rayon      speedup
+#   500 events, 100 competitors, 10/slice      4.03 ms    4.24 ms    1.0x
+#   2000 events, 200 competitors, 20/slice     20.18 ms   19.82 ms   1.0x
+#   5000 events, 50000 competitors, 1 slice    11.88 ms   9.10 ms    1.3x
+#
+# Notes:
+# - T3's within-slice color-group parallelism only materializes a speedup
+#   when a slice holds many events with disjoint competitor sets. Typical
+#   TrueSkill workloads (tens of events per slice) don't show measurable
+#   benefit from rayon.
+# - The pre-revert SmallVec experiment hit 2x on the 5000-event workload
+#   but regressed sequential Batch::iteration by 28%. The tradeoff wasn't
+#   worth it for typical workloads — SmallVec<[_; 8]> inline size (1 KB per
+#   Game struct) hurt cache locality on the hot path.
+# - Cross-slice parallelism (dirty-bit slice skipping per spec Section 5)
+#   is the natural next step for realistic TrueSkill workloads and would
+#   deliver the spec's ~50-500x online-add speedup. Deferred to T4+.
+# - Determinism verified: tests/determinism.rs asserts bit-identical
+#   posteriors across RAYON_NUM_THREADS={1, 2, 4, 8}.
+# - Send + Sync bounds added on Time, Drift, Observer, Factor, Schedule.
+# - Rayon is opt-in via `--features rayon`. Default build is unchanged from T2.
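
Reviewer note: the greedy coloring that the CHANGELOG entry attributes to `src/color_group.rs` is not included in this diff. As a rough illustration of the invariant it describes (events in one color group share no `Index`), a minimal sketch might look like the following. The function name `greedy_color_groups`, the `Index` alias, and the representation of an event as a `Vec<Index>` are all assumptions made for this example, not the crate's actual API.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the crate's competitor `Index` type.
type Index = usize;

/// Greedy first-fit coloring (a sketch, not the crate's implementation).
///
/// Each event is given as the list of competitor indices it touches.
/// An event joins the first existing group whose events touch none of
/// the same indices; if no such group exists, a new group is opened.
/// Returns, per group, the positions of its events in `events`.
fn greedy_color_groups(events: &[Vec<Index>]) -> Vec<Vec<usize>> {
    let mut groups: Vec<Vec<usize>> = Vec::new();
    // used[g] holds every Index already touched by group g.
    let mut used: Vec<HashSet<Index>> = Vec::new();

    for (event_id, competitors) in events.iter().enumerate() {
        // First group with no overlapping competitor, if any.
        let found = used
            .iter()
            .position(|seen| competitors.iter().all(|c| !seen.contains(c)));

        let slot = match found {
            Some(slot) => slot,
            None => {
                groups.push(Vec::new());
                used.push(HashSet::new());
                groups.len() - 1
            }
        };

        groups[slot].push(event_id);
        used[slot].extend(competitors.iter().copied());
    }
    groups
}
```

With four events touching `[0, 1]`, `[2, 3]`, `[1, 2]`, and `[4, 5]`, the third event conflicts with the first group and opens a second one, while the other three share a group. Because events are visited in input order and always take the first fitting group, the assignment does not depend on thread count, which is consistent with the determinism claim in the notes above.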