Commit Graph

2 Commits

Author SHA1 Message Date
f0d6211387 perf(game): revert Task 10 SmallVec changes — caused sequential regression
The Vec<Vec<_>> → SmallVec<[SmallVec<[_;8]>;8]> change in Task 10
regressed Batch::iteration from 23.29 µs to 29.73 µs (+28%). The
SmallVec was motivated by reducing parallel-path allocations but
it hurt the sequential path substantially.

Reverting game.rs + time_slice.rs + history.rs storage back to the T2
Vec<Vec<_>> shape. The parallel rayon path (unsafe direct-write +
thread_local ScratchArena + RAYON_THRESHOLD=64 fallback) stays — it
is independent of Game's internal storage.

Benchmarks after revert:
  Batch::iteration (seq, no rayon): 23.23 µs (restored ≈T2)
  Batch::iteration (rayon):         24.57 µs
  history_converge/500x100@10:       4.03 ms seq,  4.24 ms rayon — 1.0×
  history_converge/2000x200@20:     20.18 ms seq, 19.82 ms rayon — 1.0×
  history_converge/1v1-5000x50000@5000: 11.88 ms seq, 9.10 ms rayon — 1.3×

Part of T3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 14:55:37 +02:00
be515c3d8d bench(history): end-to-end History::converge benchmark + rayon perf fix
Adds benches/history_converge.rs with three workloads:
  - 500 events / 100 competitors / 10 events per slice
  - 2000 events / 200 competitors / 20 events per slice
  - 5000 events / 50000 competitors / 5000 events per slice (gate workload)

Investigation found the original rayon path used a compute/apply split with
EventOutput heap allocation per event, causing 3-23x regression. Root cause:
per-event allocations caused heavy allocator contention across rayon threads.

Fixes:
  - Replace EventOutput/two-phase approach with direct unsafe parallel write.
    Events in a color group have disjoint agent index sets; concurrent writes
    to SkillStore land on different Vec slots — no data race.
  - Add RAYON_THRESHOLD=64: color groups below this size fall back to
    sequential to avoid rayon overhead on small slices.
  - Game internals: switch likelihoods/teams to SmallVec<[_;8]> to avoid
    heap allocation for ≤8-team / ≤8-player-per-team games. Add type aliases
    Teams<T,D> and Likelihoods to satisfy clippy::type_complexity.
  - within_priors() and outputs() now return SmallVec; callers updated to
    use ranked_with_arena_sv() directly (avoiding Vec→SmallVec conversion).

Sequential baseline (Apple M5 Pro, 2026-04-24):
  500x100@10perslice:            4.72 ms
  2000x200@20perslice:          23.17 ms
  1v1-5000x50000@5000perslice:  13.89 ms

With --features rayon (RAYON_NUM_THREADS=5, P-cores on M5 Pro):
  500x100@10perslice:            4.82 ms  (1.0× — below threshold)
  2000x200@20perslice:          23.09 ms  (1.0× — below threshold)
  1v1-5000x50000@5000perslice:   6.97 ms  (2.0× speedup — GATE ACHIEVED)

T3 acceptance gate: >=2× speedup on at least one workload — ACHIEVED.
74 tests pass under both feature configs.

Part of T3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 14:47:29 +02:00