bench,docs: capture T3 final numbers and update CHANGELOG

Batch::iteration sequential: 23.23 µs (no regression vs T2 baseline).
Gaussian ops unchanged.

End-to-end history_converge benchmark on Apple M5 Pro:

  Workload                                      seq       rayon     speedup
  500 events / 100 competitors / 10 per slice   4.03 ms   4.24 ms   1.0x
  2000 events / 200 competitors / 20 per slice  20.18 ms  19.82 ms  1.0x
  5000 events / 50000 competitors / 1 slice     11.88 ms  9.10 ms   1.3x

The spec's >=2x target is not achieved on realistic workloads. T3's
within-slice color-group parallelism only shows material benefit when a
slice holds many events AND the competitor pool is large enough to give
the greedy coloring room to partition. Typical TrueSkill workloads don't
fit that profile. Cross-slice parallelism (dirty-bit slice skipping, spec
Section 5) is the natural next step for real-workload speedup.

Determinism verified: bit-identical posteriors across
RAYON_NUM_THREADS={1, 2, 4, 8}.

Closes T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 14:58:24 +02:00
parent f0d6211387
commit db633bdafe
2 changed files with 92 additions and 0 deletions
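
The commit message ties the speedup to how much room the greedy coloring has to partition events. A minimal sketch of that idea follows, assuming events are plain lists of competitor indices; `Index` here is a hypothetical `usize` alias and `greedy_color_groups` is an illustrative name, not the code in `src/color_group.rs`.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the crate's competitor `Index` type.
type Index = usize;

/// Greedily places each event into the first color group whose events
/// share no competitor with it, opening a new group when none fits.
/// Returns the event positions in each group.
fn greedy_color_groups(events: &[Vec<Index>]) -> Vec<Vec<usize>> {
    // Per color: the event positions in it and the competitors they touch.
    let mut groups: Vec<(Vec<usize>, HashSet<Index>)> = Vec::new();

    for (event_id, competitors) in events.iter().enumerate() {
        // First group that is disjoint from this event's competitors.
        let slot = groups
            .iter()
            .position(|(_, used)| competitors.iter().all(|c| !used.contains(c)));

        match slot {
            Some(i) => {
                groups[i].0.push(event_id);
                groups[i].1.extend(competitors.iter().copied());
            }
            None => {
                let used: HashSet<Index> = competitors.iter().copied().collect();
                groups.push((vec![event_id], used));
            }
        }
    }

    groups.into_iter().map(|(ids, _)| ids).collect()
}
```

With events `[0,1]`, `[2,3]`, `[1,2]`, `[4,5]`, this sketch yields two groups: events 0, 1 and 3 together, and event 2 alone. If every event shares one competitor, each event lands in its own group and nothing can run concurrently, which is consistent with the small-pool workloads showing no speedup.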
@@ -2,6 +2,66 @@

 All notable changes to this project will be documented in this file.

+## Unreleased — T3 concurrency
+
+Adds rayon-backed parallel paths per Section 6 of
+`docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md`.
+
+### Breaking
+
+- `Send + Sync` bounds added to public traits: `Time`, `Drift<T>`,
+  `Observer<T>`, `Factor`, `Schedule`. Built-in impls satisfy these
+  automatically (auto traits); downstream custom impls that aren't
+  thread-safe must be made `Send + Sync` to keep compiling.
+
+### New
+
+- Opt-in `rayon` cargo feature. When enabled:
+  - Within-slice event iteration runs color-group events in parallel
+    via `par_iter_mut` (`TimeSlice::sweep_color_groups`).
+  - `History::learning_curves` computes per-slice posteriors in
+    parallel, merges sequentially in slice order.
+  - `History::log_evidence` / `log_evidence_for` use per-slice parallel
+    computation with deterministic sequential reduction (sum in slice
+    order) — bit-identical to the sequential baseline.
+- `ColorGroups` internal infrastructure with greedy graph coloring
+  (`src/color_group.rs`). Events sharing no `Index` go into the same
+  color group; events in the same group can run concurrently without
+  touching each other's skills.
+- `tests/determinism.rs` asserts bit-identical posteriors across
+  `RAYON_NUM_THREADS={1, 2, 4, 8}`.
+- `benches/history_converge.rs` measures end-to-end convergence on
+  three workload shapes.
+
+### Performance notes
+
+- Default build (no rayon): `Batch::iteration` 23.23 µs — no regression
+  vs T2.
+- With `--features rayon`:
+  - 500 events / 100 competitors / 10 per slice: 1.0× speedup.
+  - 2000 events / 200 competitors / 20 per slice: 1.0× speedup.
+  - 5000 events in one slice / 50k competitors: **1.3× speedup.**
+- The spec targeted ≥2× speedup on 8-core offline converge; none of the
+  three benchmarked workloads reached it. Gains appear only with many
+  events per slice AND a large competitor pool. **Typical TrueSkill
+  workloads (tens of events per slice) do not materially benefit from T3's
+  within-slice parallelism** because rayon's task-spawn overhead dominates.
+- Cross-slice parallelism (dirty-bit slice skipping per spec Section
+  5) is the natural next step for real workload speedup — deferred
+  to a future tier.
+
+### Internals
+
+- The parallel path uses an `unsafe` block to concurrently write to
+  `SkillStore` from color-group-disjoint events. Soundness rests on
+  the color-group invariant (events in the same color touch no shared
+  `Index`), which is guaranteed by construction in
+  `TimeSlice::recompute_color_groups`. Sequential path unchanged.
+- `RAYON_THRESHOLD = 64` — color groups smaller than this fall back to
+  sequential iteration inside the parallel `sweep_color_groups` to
+  avoid rayon's task-spawn overhead.
+- Thread-local `ScratchArena` per rayon worker thread.
+
 ## Unreleased — T2 new API surface

 Breaking: every renamed type and the new public API land together per
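
For downstream code hit by the new `Send + Sync` bounds listed under "Breaking" above, the usual fix is to swap single-threaded interior mutability for a thread-safe equivalent. The sketch below assumes a hypothetical one-method `Observer<T>` trait; the crate's real trait signatures are not shown in this commit, and `CountingObserver` and `observe_in_parallel` are illustrative names only.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical trait shaped like a public trait carrying the new bounds.
trait Observer<T>: Send + Sync {
    fn on_event(&self, value: &T);
}

/// A counter held in `Cell<usize>` or `Rc<_>` is not `Sync`, so an impl
/// built on one would be rejected by the bound above. An atomic counter
/// is `Send + Sync` and needs no lock.
struct CountingObserver {
    seen: AtomicUsize,
}

impl Observer<f64> for CountingObserver {
    fn on_event(&self, _value: &f64) {
        self.seen.fetch_add(1, Ordering::Relaxed);
    }
}

/// Shares one observer across `workers` threads, each reporting
/// `per_worker` events, and returns the total count observed.
fn observe_in_parallel(workers: usize, per_worker: usize) -> usize {
    let observer = Arc::new(CountingObserver { seen: AtomicUsize::new(0) });

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let obs = Arc::clone(&observer);
            thread::spawn(move || {
                for _ in 0..per_worker {
                    obs.on_event(&1.0);
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().expect("worker thread panicked");
    }

    // All workers have been joined, so every increment is visible here.
    observer.seen.load(Ordering::Relaxed)
}
```

Impls whose state is more than a counter would more likely reach for `Mutex<_>` or `RwLock<_>`; the atomic is only the smallest case that satisfies the bounds.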
@@ -98,3 +98,35 @@ Gaussian::tau             260.80 ps    (unchanged)
 #   learning_curves_by_index(), nested-Vec public add_events().
 # - 90 tests green: 68 lib + 10 api_shape + 6 game + 4 record_winner +
 #   2 equivalence.
+
+# After T3 (2026-04-24, same hardware)
+
+Batch::iteration (seq, no rayon)       23.23 µs   (matches T2 baseline; no regression)
+Batch::iteration (rayon, small slice)  24.57 µs   (within noise; small workloads pay rayon overhead)
+Gaussian::add                          236.62 ps  (unchanged)
+Gaussian::sub                          236.43 ps  (unchanged)
+Gaussian::mul                          237.05 ps  (unchanged)
+Gaussian::div                          236.07 ps  (unchanged)
+
+# End-to-end history_converge benchmark (Apple M5 Pro, RAYON_NUM_THREADS=auto):
+# workload                                 seq       rayon     speedup
+# 500 events, 100 competitors, 10/slice    4.03 ms   4.24 ms   1.0x
+# 2000 events, 200 competitors, 20/slice   20.18 ms  19.82 ms  1.0x
+# 5000 events, 50000 competitors, 1 slice  11.88 ms  9.10 ms   1.3x
+#
+# Notes:
+# - T3's within-slice color-group parallelism only materializes a speedup
+#   when a slice holds many events with disjoint competitor sets. Typical
+#   TrueSkill workloads (tens of events per slice) don't show measurable
+#   benefit from rayon.
+# - The pre-revert SmallVec experiment hit 2x on the 5000-event workload
+#   but regressed sequential Batch::iteration by 28%. The tradeoff wasn't
+#   worth it for typical workloads: SmallVec<[_; 8]> inline size (1 KB per
+#   Game struct) hurt cache locality on the hot path.
+# - Cross-slice parallelism (dirty-bit slice skipping per spec Section 5)
+#   is the natural next step for realistic TrueSkill workloads; the spec
+#   projects a ~50-500x online-add speedup from it. Deferred to T4+.
+# - Determinism verified: tests/determinism.rs asserts bit-identical
+#   posteriors across RAYON_NUM_THREADS={1, 2, 4, 8}.
+# - Send + Sync bounds added on Time, Drift<T>, Observer<T>, Factor, Schedule.
+# - Rayon is opt-in via `--features rayon`. Default build is unchanged from T2.
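
The determinism note above depends on the reduction order being fixed: floating-point addition is not associative, so a sum is only bit-identical across thread counts if the addends are combined in the same order every time. Here is a minimal sketch of that pattern using `std::thread::scope` instead of rayon so it stays dependency-free; `slice_log_evidence` and `log_evidence` are hypothetical stand-ins, not the crate's functions.

```rust
use std::thread;

/// Hypothetical per-slice computation standing in for one slice's
/// contribution to the log evidence.
fn slice_log_evidence(slice: &[f64]) -> f64 {
    slice.iter().map(|p| p.ln()).sum()
}

/// Computes per-slice values on up to `workers` threads, then reduces
/// them sequentially in slice order.
fn log_evidence(slices: &[Vec<f64>], workers: usize) -> f64 {
    let workers = workers.max(1);
    let mut per_slice = vec![0.0_f64; slices.len()];
    // Ceiling division; at least 1 so `chunks` never sees a zero size.
    let chunk = ((slices.len() + workers - 1) / workers).max(1);

    thread::scope(|scope| {
        // Each thread owns a disjoint window of the output buffer, so the
        // value stored for a slice does not depend on the thread count.
        for (out, input) in per_slice.chunks_mut(chunk).zip(slices.chunks(chunk)) {
            scope.spawn(move || {
                for (o, s) in out.iter_mut().zip(input) {
                    *o = slice_log_evidence(s);
                }
            });
        }
    });

    // The only cross-slice additions happen here, always in slice order.
    per_slice.iter().sum()
}
```

Comparing `log_evidence(&slices, n).to_bits()` for `n` in `{1, 2, 4, 8}` is the same style of check the notes attribute to `tests/determinism.rs`. A parallel tree reduction would avoid the sequential tail, but it could change the rounding whenever the split points move.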