Batch::iteration sequential: 23.23 µs (no regression vs T2 baseline).
Gaussian ops unchanged.
End-to-end history_converge benchmark on Apple M5 Pro:
Workload seq rayon speedup
500 events / 100 competitors / 10 per slice 4.03 ms 4.24 ms 1.0x
2000 events / 200 competitors / 20 per slice 20.18 ms 19.82 ms 1.0x
5000 events / 50000 competitors / 1 slice 11.88 ms 9.10 ms 1.3x
The spec's >=2x target is not achieved on realistic workloads. T3's
within-slice color-group parallelism only shows material benefit when
a slice holds many events AND the competitor pool is large enough to
give the greedy coloring room to partition. Typical TrueSkill
workloads don't fit that profile. Cross-slice parallelism (dirty-bit
slice skipping, spec Section 5) is the natural next step for
real-workload speedup.
Determinism verified: bit-identical posteriors across
RAYON_NUM_THREADS={1, 2, 4, 8}.
Closes T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>