bench,docs: capture T3 final numbers and update CHANGELOG
Batch::iteration sequential: 23.23 µs (no regression vs T2 baseline).
Gaussian ops unchanged.
End-to-end history_converge benchmark on Apple M5 Pro:
Workload                                       seq       rayon     speedup
500 events / 100 competitors / 10 per slice    4.03 ms   4.24 ms   1.0x
2000 events / 200 competitors / 20 per slice   20.18 ms  19.82 ms  1.0x
5000 events / 50000 competitors / 1 slice      11.88 ms  9.10 ms   1.3x
The spec's >=2x target is not achieved on realistic workloads. T3's
within-slice color-group parallelism only shows material benefit when
a slice holds many events AND the competitor pool is large enough to
give the greedy coloring room to partition. Typical TrueSkill
workloads don't fit that profile. Cross-slice parallelism (dirty-bit
slice skipping, spec Section 5) is the natural next step for
real-workload speedup.
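
To illustrate why the competitor pool size matters, here is a minimal sketch of greedy coloring over events. It is not the crate's implementation: the `Event` struct and the `color_groups` function are hypothetical names, and events are reduced to bare competitor ids. Each event goes into the first group whose competitors it does not share, so events inside one group could be processed in parallel.

```rust
use std::collections::HashSet;

/// Hypothetical event type: only the ids of the competitors it touches.
pub struct Event {
    pub competitors: Vec<u32>,
}

/// Greedily place each event in the first color group whose competitor set
/// it does not overlap, opening a new group when every existing one conflicts.
/// Events in the same group touch disjoint competitors.
/// Returns the groups as lists of event indices.
pub fn color_groups(events: &[Event]) -> Vec<Vec<usize>> {
    let mut groups: Vec<Vec<usize>> = Vec::new();
    // used[g] holds every competitor id already claimed by group g.
    let mut used: Vec<HashSet<u32>> = Vec::new();

    for (idx, event) in events.iter().enumerate() {
        let slot = used
            .iter()
            .position(|seen| event.competitors.iter().all(|c| !seen.contains(c)));
        let g = match slot {
            Some(g) => g,
            None => {
                groups.push(Vec::new());
                used.push(HashSet::new());
                groups.len() - 1
            }
        };
        groups[g].push(idx);
        used[g].extend(event.competitors.iter().copied());
    }
    groups
}
```

With a small pool, for example three events over competitors {1, 2}, {2, 3}, and {1, 3}, every event conflicts with the others and the sketch yields three single-event groups, leaving nothing to run in parallel. A large pool with mostly disjoint events collapses into a few wide groups instead.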
Determinism verified: bit-identical posteriors across
RAYON_NUM_THREADS={1, 2, 4, 8}.
Closes T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -98,3 +98,35 @@ Gaussian::tau 260.80 ps (unchanged)
# learning_curves_by_index(), nested-Vec public add_events().
# - 90 tests green: 68 lib + 10 api_shape + 6 game + 4 record_winner +
# 2 equivalence.
# After T3 (2026-04-24, same hardware)
Batch::iteration (seq, no rayon) 23.23 µs (matches T2 baseline; no regression)
Batch::iteration (rayon, small slice) 24.57 µs (within noise; small workloads pay rayon overhead)
Gaussian::add 236.62 ps (unchanged)
Gaussian::sub 236.43 ps (unchanged)
Gaussian::mul 237.05 ps (unchanged)
Gaussian::div 236.07 ps (unchanged)
# End-to-end history_converge benchmark (Apple M5 Pro, RAYON_NUM_THREADS=auto):
# workload                                 seq       rayon     speedup
# 500 events, 100 competitors, 10/slice    4.03 ms   4.24 ms   1.0x
# 2000 events, 200 competitors, 20/slice   20.18 ms  19.82 ms  1.0x
# 5000 events, 50000 competitors, 1 slice  11.88 ms  9.10 ms   1.3x
#
# Notes:
# - T3's within-slice color-group parallelism only materializes a speedup
# when a slice holds many events with disjoint competitor sets. Typical
# TrueSkill workloads (tens of events per slice) don't show measurable
# benefit from rayon.
# - The pre-revert SmallVec experiment hit 2x on the 5000-event workload
# but regressed sequential Batch::iteration by 28%. The tradeoff wasn't
# worth it for typical workloads; the SmallVec<[_; 8]> inline size (1 KB per
# Game struct) hurt cache locality on the hot path.
# - Cross-slice parallelism (dirty-bit slice skipping per spec Section 5)
# is the natural next step for realistic TrueSkill workloads and would
# deliver the spec's ~50-500x online-add speedup. Deferred to T4+.
# - Determinism verified: tests/determinism.rs asserts bit-identical
# posteriors across RAYON_NUM_THREADS={1, 2, 4, 8}.
# - Send + Sync bounds added on Time, Drift<T>, Observer<T>, Factor, Schedule.
# - Rayon is opt-in via `--features rayon`. Default build is unchanged from T2.
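
The determinism note above relies on comparing posteriors bit for bit rather than within a tolerance. The following is a minimal sketch of what such a comparison can look like; it is not taken from tests/determinism.rs, and `bit_identical` and `sum_in_order` are hypothetical helper names. The summation helper is there only to show why ordering matters: floating-point addition is not associative, so a parallel run has to combine results in a fixed order to reproduce the sequential bits.

```rust
/// True when both slices have the same length and every pair of values has
/// the same IEEE 754 bit pattern. Stricter than `==`: it distinguishes
/// 0.0 from -0.0.
pub fn bit_identical(a: &[f64], b: &[f64]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| x.to_bits() == y.to_bits())
}

/// Sum the values strictly left to right, starting from 0.0.
pub fn sum_in_order(values: &[f64]) -> f64 {
    values.iter().fold(0.0, |acc, v| acc + v)
}
```

Summing `[0.1, 0.2, 0.3]` and `[0.3, 0.2, 0.1]` with `sum_in_order` produces results that differ in the last bit, so `bit_identical` reports a mismatch even though the inputs hold the same numbers. A test in this style would run the same workload under each thread count and require `bit_identical` to hold against the single-threaded result.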