bench,docs: capture T3 final numbers and update CHANGELOG
Batch::iteration sequential: 23.23 µs (no regression vs T2 baseline).
Gaussian ops unchanged.
End-to-end history_converge benchmark on Apple M5 Pro:
Workload                                       seq       rayon     speedup
500 events / 100 competitors / 10 per slice    4.03 ms   4.24 ms   1.0x
2000 events / 200 competitors / 20 per slice   20.18 ms  19.82 ms  1.0x
5000 events / 50000 competitors / 1 slice      11.88 ms  9.10 ms   1.3x
The spec's >=2x target is not achieved on realistic workloads. T3's
within-slice color-group parallelism only shows material benefit when
a slice holds many events AND the competitor pool is large enough to
give the greedy coloring room to partition. Typical TrueSkill
workloads don't fit that profile. Cross-slice parallelism (dirty-bit
slice skipping, spec Section 5) is the natural next step for
real-workload speedup.
Determinism verified: bit-identical posteriors across
RAYON_NUM_THREADS={1, 2, 4, 8}.
Closes T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
60
CHANGELOG.md
@@ -2,6 +2,66 @@
All notable changes to this project will be documented in this file.

## Unreleased — T3 concurrency

Adds rayon-backed parallel paths per Section 6 of
`docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md`.

### Breaking

- `Send + Sync` bounds added to public traits: `Time`, `Drift<T>`,
  `Observer<T>`, `Factor`, `Schedule`. All built-in impls satisfy these
  automatically (`Send` and `Sync` are auto traits), but downstream
  custom impls that aren't thread-safe will need to be made
  `Send + Sync`.

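As a rough illustration of what the new bounds mean for downstream code, the sketch below uses a hypothetical stand-in trait (`Observer` with a made-up `on_event` method; the crate's real trait signatures are not shown in this changelog). It assumes a custom impl that previously kept state in a non-`Sync` type such as `Cell` and shows one possible thread-safe replacement using an atomic counter.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Hypothetical stand-in for a public trait that gained `Send + Sync`.
// The method name and signature are assumptions for illustration only.
trait Observer: Send + Sync {
    fn on_event(&self, event_id: usize);
}

// An impl holding `Cell<usize>` or `Rc<RefCell<_>>` is not `Sync` and
// would no longer satisfy the bound. An atomic counter is one
// thread-safe replacement.
struct CountingObserver {
    seen: AtomicUsize,
}

impl Observer for CountingObserver {
    fn on_event(&self, _event_id: usize) {
        self.seen.fetch_add(1, Ordering::Relaxed);
    }
}

// Shares one observer across `workers` scoped threads, each reporting
// `per_worker` events, and returns the total number of events seen.
fn count_events(workers: usize, per_worker: usize) -> usize {
    let obs = CountingObserver { seen: AtomicUsize::new(0) };
    let shared: &dyn Observer = &obs;
    thread::scope(|s| {
        for w in 0..workers {
            s.spawn(move || {
                for i in 0..per_worker {
                    shared.on_event(w * per_worker + i);
                }
            });
        }
    });
    obs.seen.load(Ordering::Relaxed)
}
```

Calling `count_events(4, 100)` shares the observer by reference across four threads, which only compiles because the trait object is `Sync`.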
### New

- Opt-in `rayon` cargo feature. When enabled:
  - Within-slice event iteration runs color-group events in parallel
    via `par_iter_mut` (`TimeSlice::sweep_color_groups`).
  - `History::learning_curves` computes per-slice posteriors in
    parallel, then merges them sequentially in slice order.
  - `History::log_evidence` / `log_evidence_for` use per-slice parallel
    computation with deterministic sequential reduction (sum in slice
    order), so results are bit-identical to the sequential baseline.
- `ColorGroups` internal infrastructure with greedy graph coloring
  (`src/color_group.rs`). Events that share no `Index` can be placed in
  the same color group; events in the same group can run concurrently
  without touching each other's skills.
- `tests/determinism.rs` asserts bit-identical posteriors across
  `RAYON_NUM_THREADS={1, 2, 4, 8}`.
- `benches/history_converge.rs` measures end-to-end convergence on
  three workload shapes.

### Performance notes

- Default build (no rayon): `Batch::iteration` 23.23 µs, no regression
  vs T2.
- With `--features rayon`:
  - 500 events / 100 competitors / 10 per slice: 1.0× speedup.
  - 2000 events / 200 competitors / 20 per slice: 1.0× speedup.
  - 5000 events in one slice / 50k competitors: **1.3× speedup.**
- The spec targeted >2× speedup on 8-core offline converge. None of the
  measured workloads reach it; a material gain (1.3×) appears only with
  many events per slice AND a large competitor pool. **Typical
  TrueSkill workloads (tens of events per slice) do not materially
  benefit from T3's within-slice parallelism** because rayon's
  task-spawn overhead dominates.
- Cross-slice parallelism (dirty-bit slice skipping per spec Section
  5) is the natural next step for real-workload speedup; it is
  deferred to a future tier.

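For reference, the speedup figures above are sequential time divided by rayon time. The small sketch below recomputes them to two decimals from the `history_converge` timings reported with this change (seq and rayon, in ms); the function and constant names are illustrative only.

```rust
// Speedup is sequential time divided by parallel time. Values below
// 1.0 mean the rayon build was slower than the sequential one.
fn speedup(seq_ms: f64, rayon_ms: f64) -> f64 {
    seq_ms / rayon_ms
}

// (workload, seq ms, rayon ms) as reported for this change.
const RUNS: [(&str, f64, f64); 3] = [
    ("500 events / 100 competitors / 10 per slice", 4.03, 4.24),
    ("2000 events / 200 competitors / 20 per slice", 20.18, 19.82),
    ("5000 events / 50000 competitors / 1 slice", 11.88, 9.10),
];

// Formats one line per workload, e.g. "<workload>: 1.31x".
fn report() -> Vec<String> {
    RUNS.iter()
        .map(|(name, seq, par)| format!("{name}: {:.2}x", speedup(*seq, *par)))
        .collect()
}
```

At two decimals the three workloads come out at 0.95x, 1.02x and 1.31x, so the first "1.0×" entry is in fact slightly slower with rayon enabled.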
### Internals

- The parallel path uses an `unsafe` block to concurrently write to
  `SkillStore` from color-group-disjoint events. Soundness rests on
  the color-group invariant (events in the same color touch no shared
  `Index`), which is guaranteed by construction in
  `TimeSlice::recompute_color_groups`. The sequential path is unchanged.
- `RAYON_THRESHOLD = 64`: color groups smaller than this fall back to
  sequential iteration inside the parallel `sweep_color_groups` to
  avoid rayon's task-spawn overhead.
- Thread-local `ScratchArena` per rayon worker thread.

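To make the color-group invariant and the threshold rule concrete, here is a minimal first-fit greedy coloring sketch. It is an assumption-laden illustration, not the contents of `src/color_group.rs`: the `Event` alias, `color_groups`, and `runs_in_parallel` are hypothetical names, and events are modeled simply as lists of competitor indices.

```rust
use std::collections::HashSet;

// Hypothetical event model: the competitor indices an event touches.
type Event = Vec<usize>;

// First-fit greedy coloring: place each event in the first group that
// shares no competitor index with it, opening a new group when none
// fits. Returns, per group, the positions of its events in `events`.
fn color_groups(events: &[Event]) -> Vec<Vec<usize>> {
    let mut groups: Vec<Vec<usize>> = Vec::new();
    // Indices already claimed by each group.
    let mut used: Vec<HashSet<usize>> = Vec::new();
    for (pos, event) in events.iter().enumerate() {
        let slot = used
            .iter()
            .position(|seen| event.iter().all(|i| !seen.contains(i)));
        let g = match slot {
            Some(g) => g,
            None => {
                groups.push(Vec::new());
                used.push(HashSet::new());
                groups.len() - 1
            }
        };
        groups[g].push(pos);
        used[g].extend(event.iter().copied());
    }
    groups
}

// Mirrors the threshold rule: groups smaller than `threshold` are
// iterated sequentially, larger ones are worth dispatching in parallel.
fn runs_in_parallel(group: &[usize], threshold: usize) -> bool {
    group.len() >= threshold
}
```

With events `[0,1]`, `[2,3]`, `[1,2]`, `[4,5]`, `[0,5]`, this yields two groups, `[0, 1, 3]` and `[2, 4]`; within each group no competitor index repeats, which is the property the `unsafe` concurrent writes rely on. With a threshold of 64, both groups would take the sequential fallback.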
## Unreleased — T2 new API surface

Breaking: every renamed type and the new public API land together per