logaritmisk/trueskill-tt

T3: rayon-backed concurrency (opt-in) #2

Merged

logaritmisk merged 13 commits from t3-concurrency into main

2026-04-24 13:01:01 +00:00

Author	SHA1	Message	Date
Anders Olsson	db633bdafe	bench,docs: capture T3 final numbers and update CHANGELOG Batch::iteration sequential: 23.23 µs (no regression vs T2 baseline). Gaussian ops unchanged. End-to-end history_converge benchmark on Apple M5 Pro: Workload seq rayon speedup 500 events / 100 competitors / 10 per slice 4.03 ms 4.24 ms 1.0x 2000 events / 200 competitors / 20 per slice 20.18 ms 19.82 ms 1.0x 5000 events / 50000 competitors / 1 slice 11.88 ms 9.10 ms 1.3x The spec's >=2x target is not achieved on realistic workloads. T3's within-slice color-group parallelism only shows material benefit when a slice holds many events AND the competitor pool is large enough to give the greedy coloring room to partition. Typical TrueSkill workloads don't fit that profile. Cross-slice parallelism (dirty-bit slice skipping, spec Section 5) is the natural next step for real-workload speedup. Determinism verified: bit-identical posteriors across RAYON_NUM_THREADS={1, 2, 4, 8}. Closes T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 14:58:24 +02:00
Anders Olsson	f0d6211387	perf(game): revert Task 10 SmallVec changes — caused sequential regression The Vec<Vec<_>> → SmallVec<[SmallVec<[_;8]>;8]> change in Task 10 regressed Batch::iteration from 23.29 µs to 29.73 µs (+28%). The SmallVec was motivated by reducing parallel-path allocations but it hurt the sequential path substantially. Reverting game.rs + time_slice.rs + history.rs storage back to the T2 Vec<Vec<_>> shape. The parallel rayon path (unsafe direct-write + thread_local ScratchArena + RAYON_THRESHOLD=64 fallback) stays — it is independent of Game's internal storage. Benchmarks after revert: Batch::iteration (seq, no rayon): 23.23 µs (restored ≈T2) Batch::iteration (rayon): 24.57 µs history_converge/500x100@10: 4.03 ms seq, 4.24 ms rayon — 1.0× history_converge/2000x200@20: 20.18 ms seq, 19.82 ms rayon — 1.0× history_converge/1v1-5000x50000@5000: 11.88 ms seq, 9.10 ms rayon — 1.3× Part of T3. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 14:55:37 +02:00
Anders Olsson	be515c3d8d	bench(history): end-to-end History::converge benchmark + rayon perf fix Adds benches/history_converge.rs with three workloads: - 500 events / 100 competitors / 10 events per slice - 2000 events / 200 competitors / 20 events per slice - 5000 events / 50000 competitors / 5000 events per slice (gate workload) Investigation found the original rayon path used a compute/apply split with EventOutput heap allocation per event, causing 3-23x regression. Root cause: per-event allocations caused heavy allocator contention across rayon threads. Fixes: - Replace EventOutput/two-phase approach with direct unsafe parallel write. Events in a color group have disjoint agent index sets; concurrent writes to SkillStore land on different Vec slots — no data race. - Add RAYON_THRESHOLD=64: color groups below this size fall back to sequential to avoid rayon overhead on small slices. - Game internals: switch likelihoods/teams to SmallVec<[_;8]> to avoid heap allocation for ≤8-team / ≤8-player-per-team games. Add type aliases Teams<T,D> and Likelihoods to satisfy clippy::type_complexity. - within_priors() and outputs() now return SmallVec; callers updated to use ranked_with_arena_sv() directly (avoiding Vec→SmallVec conversion). Sequential baseline (Apple M5 Pro, 2026-04-24): 500x100@10perslice: 4.72 ms 2000x200@20perslice: 23.17 ms 1v1-5000x50000@5000perslice: 13.89 ms With --features rayon (RAYON_NUM_THREADS=5, P-cores on M5 Pro): 500x100@10perslice: 4.82 ms (1.0× — below threshold) 2000x200@20perslice: 23.09 ms (1.0× — below threshold) 1v1-5000x50000@5000perslice: 6.97 ms (2.0× speedup — GATE ACHIEVED) T3 acceptance gate: >=2× speedup on at least one workload — ACHIEVED. 74 tests pass under both feature configs. Part of T3. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 14:47:29 +02:00
Anders Olsson	cbf652eb1d	test: assert bit-identical posteriors across RAYON_NUM_THREADS tests/determinism.rs runs the same deterministic 200-event history at thread counts {1, 2, 4, 8} via rayon::ThreadPoolBuilder::install and asserts every (time, posterior) pair has bit-identical mu and sigma across all configurations. Cfg-gated to the rayon feature; no-op under --features approx alone. Verifies the T3 determinism invariant that the ordered-reduce strategy (per-slice parallel, sequential sum) produces thread-count- independent results. Part of T3. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 13:59:33 +02:00
Anders Olsson	ab8e1fd684	feat(history): parallel log_evidence with deterministic sum Per-slice log_evidence contribution computed in parallel under --features rayon; final reduction is sequential .into_iter().sum() on Vec<f64>, preserving slice order so the sum is bit-identical to the sequential T2 baseline. Essential for the T3 acceptance criterion of identical posteriors across RAYON_NUM_THREADS values. Part of T3.	2026-04-24 13:56:29 +02:00
Anders Olsson	f3c074c24c	feat(history): parallel learning_curves under rayon feature Per-slice posterior collection runs in parallel via par_iter; merge into the per-key HashMap is sequential in slice order so iteration order and HashMap insertion order are identical to the sequential impl. Preserves deterministic output across thread counts. Default-feature (no rayon) build unchanged — uses the T2 sequential impl. Part of T3. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 13:54:47 +02:00
Anders Olsson	4b99485fc8	perf(time-slice): restore sequential direct-write path under cfg(not(feature = "rayon")) The compute/apply split introduced in `3680c54` was always active — the sequential build paid EventOutput heap-alloc overhead even without rayon, regressing Batch::iteration from 23.46 µs to 33.79 µs (+44%). This commit makes the split feature-gated: under cfg(feature = "rayon") the compute/apply pattern stays (needed for par_iter); under cfg(not(feature = "rayon")) events update SkillStore inline via Event::iteration_direct, matching the T2 performance profile. EventOutput, Event::compute, and Event::apply_output are now cfg(feature = "rayon")-only. TimeSlice::sweep_color_groups has two cfg-gated implementations sharing the same signature. Sequential restored to 23.29 µs; parallel 34.31 µs (small-workload overhead expected — rayon threadpool amortizes at larger scales). Part of T3. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 13:52:48 +02:00
Anders Olsson	3680c54d3c	feat(time-slice): parallel within-slice event iteration via rayon Under #[cfg(feature = "rayon")], the per-iteration event sweep processes events color-by-color: within a color, events touch disjoint Index values by construction, so par_iter is safe. Across colors, sequential ordering preserves async-EP semantics. Event::compute() is a pure function returning an owned EventOutput (new per-item likelihoods, evidence, and pre-computed new skill likelihoods). The apply phase runs sequentially after the parallel map, writing EventOutput values back to SkillStore and each event's item likelihoods. This avoids shared mutable state in the hot loop. Default build (no rayon) uses a sequential fallback that traverses the same color-group order — behaviorally identical to the parallel path. This keeps goldens bit-identical across feature configurations. Scenario 3b applied: event updates read from and write to the shared SkillStore, so the compute/apply split (Option A) was necessary. Part of T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.	2026-04-24 13:48:41 +02:00
Anders Olsson	9836b7b709	feat(time-slice): compute and maintain color groups; reorder events TimeSlice gains a color_groups field of type ColorGroups, recomputed whenever events change. After recompute, self.events is physically reordered so color-0 events are first, then color-1, etc. Each color is therefore a contiguous range of indices in self.events — the invariant that Task 6's parallel par_iter_mut exploits. Greedy coloring via crate::color_group::color_greedy; agent indices come from Event::iter_agents. ColorGroups gains a color_range helper that returns the contiguous Range<usize> for a given color. Numerical behavior unchanged: async-EP is order-independent at convergence, so event reordering does not affect goldens. Part of T3. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 13:42:05 +02:00
Anders Olsson	a40c0d6301	feat(color-group): add greedy within-slice event partitioning ColorGroups holds a partition of event indices into color groups such that events of the same color touch no shared Index. Computed greedily in ingestion order: each event goes into the first color whose existing members are disjoint from the event's indices. Used in T3 for safe within-slice parallelism — events in the same color can run concurrently without touching each other's skills. Part of T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.	2026-04-24 13:38:21 +02:00
Anders Olsson	4f302ed28e	feat(api): add Send + Sync bounds to public traits Required for T3 rayon-based parallelism. Affected traits: - Time (+ Send + Sync + 'static) - Drift<T> (+ Send + Sync) - Observer<T> (+ Send + Sync) - Factor (+ Send + Sync) - Schedule (+ Send + Sync) All built-in impls (i64, Untimed, ConstantDrift, NullObserver, EpsilonOrMax, TeamSumFactor, RankDiffFactor, TruncFactor, BuiltinFactor) naturally satisfy these bounds via auto-derive. Minor breaking change: downstream custom impls that aren't already thread-safe will need to add the bounds. Part of T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.	2026-04-24 13:36:39 +02:00
Anders Olsson	9fe40042da	feat(cargo): add rayon as optional dependency Opt-in feature flag — users who want parallel paths build with --features rayon. Default build remains single-threaded. Spec Section 6 calls for default-on; we defer that flip until the feature is stable under field use. Part of T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.	2026-04-24 13:35:15 +02:00
Anders Olsson	f0793a8470	docs: add T3 concurrency implementation plan 11-task plan for rayon-backed within-slice parallelism per Section 6 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 13:34:00 +02:00