perf(game): revert Task 10 SmallVec changes — caused sequential regression

The Vec<Vec<_>> → SmallVec<[SmallVec<[_;8]>;8]> change in Task 10
regressed Batch::iteration from 23.29 µs to 29.73 µs (+28%). SmallVec
was introduced to cut allocations on the parallel path, but it slowed
the sequential path substantially.
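
To make the two storage shapes concrete, here is a minimal std-only sketch. `Inline` is a hypothetical stand-in for an inline-capacity vector like SmallVec (without the heap-spill path), and `u64` is an assumed element type, since the message elides it as `_`. It only compares the size of the outer value; it does not establish why the benchmark regressed.

```rust
use std::mem::{size_of, MaybeUninit};

/// Hypothetical stand-in for an inline-capacity vector such as
/// SmallVec<[T; N]>: up to N elements live inside the value itself.
/// The real SmallVec can also spill to the heap; that part is omitted.
#[allow(dead_code)]
struct Inline<T, const N: usize> {
    len: usize,
    buf: [MaybeUninit<T>; N],
}

/// Size of the heap-backed nested shape (the T2 layout).
fn heap_nested_size() -> usize {
    size_of::<Vec<Vec<u64>>>()
}

/// Size of the inline nested shape (approximating the Task 10 layout).
fn inline_nested_size() -> usize {
    size_of::<Inline<Inline<u64, 8>, 8>>()
}

fn main() {
    println!("Vec<Vec<u64>>:           {} bytes", heap_nested_size());
    println!("Inline<Inline<u64,8>,8>: {} bytes", inline_nested_size());
}
```

The nested inline value carries all 8 × 8 slots in place, so every move or copy of it touches far more bytes than the pointer-sized header of a `Vec<Vec<_>>`.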

Revert the storage in game.rs, time_slice.rs, and history.rs to the T2
Vec<Vec<_>> shape. The parallel rayon path (unsafe direct-write +
thread_local ScratchArena + RAYON_THRESHOLD=64 fallback) stays; it
is independent of Game's internal storage.
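
The threshold fallback mentioned above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this commit: `process` and `process_all` are hypothetical names, and `std::thread::scope` stands in for rayon so the example has no dependencies. Only the constant name and its value of 64 come from the message.

```rust
use std::thread;

/// Assumed to mirror the RAYON_THRESHOLD named in the commit message.
const RAYON_THRESHOLD: usize = 64;

/// Hypothetical per-event work; stands in for the real update.
fn process(event: u64) -> u64 {
    event.wrapping_mul(31).wrapping_add(7)
}

/// Combines `process` over all events, going parallel only for large inputs.
fn process_all(events: &[u64], workers: usize) -> u64 {
    if events.len() < RAYON_THRESHOLD || workers <= 1 {
        // Small input: stay on the calling thread and skip the hand-off cost.
        return events.iter().map(|&e| process(e)).fold(0u64, u64::wrapping_add);
    }
    // Large input: split into roughly equal chunks, one per worker.
    let chunk = (events.len() + workers - 1) / workers;
    thread::scope(|s| {
        let handles: Vec<_> = events
            .chunks(chunk)
            .map(|c| {
                s.spawn(move || c.iter().map(|&e| process(e)).fold(0u64, u64::wrapping_add))
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .fold(0u64, u64::wrapping_add)
    })
}

fn main() {
    let events: Vec<u64> = (0..1000).collect();
    println!("combined result: {}", process_all(&events, 4));
}
```

Because the dispatch depends only on the input length, it works the same whichever container holds the events, which fits the message's point that the parallel path is independent of Game's storage.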

Benchmarks after revert:
  Batch::iteration (seq, no rayon): 23.23 µs (restored ≈T2)
  Batch::iteration (rayon):         24.57 µs
  history_converge/500x100@10:       4.03 ms seq,  4.24 ms rayon — 1.0×
  history_converge/2000x200@20:     20.18 ms seq, 19.82 ms rayon — 1.0×
  history_converge/1v1-5000x50000@5000: 11.88 ms seq, 9.10 ms rayon — 1.3×
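
The speedup column and the +28% figure follow from simple ratios. The sketch below recomputes them from the numbers in this message; the function names and the `AFTER_REVERT` table are illustrative, and the 2.0 gate is the T3 acceptance threshold quoted in the benchmark's doc comment.

```rust
/// Speedup of the parallel run over the sequential run (> 1 means rayon wins).
fn speedup(seq_ms: f64, par_ms: f64) -> f64 {
    seq_ms / par_ms
}

/// Relative change from `before` to `after`, in percent.
fn pct_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// True if at least one workload reaches the required speedup.
fn gate_met(runs: &[(&str, f64, f64)], gate: f64) -> bool {
    runs.iter().any(|&(_, seq, par)| speedup(seq, par) >= gate)
}

/// (workload, sequential ms, rayon ms), copied from the results above.
const AFTER_REVERT: [(&str, f64, f64); 3] = [
    ("500x100@10", 4.03, 4.24),
    ("2000x200@20", 20.18, 19.82),
    ("1v1-5000x50000@5000", 11.88, 9.10),
];

fn main() {
    for (name, seq, par) in AFTER_REVERT {
        println!("{name}: {:.1}x", speedup(seq, par));
    }
    // Batch::iteration before and after the Task 10 SmallVec change, in µs.
    println!("SmallVec regression: {:+.0}%", pct_change(23.29, 29.73));
    println!("T3 gate (>= 2.0x) met: {}", gate_met(&AFTER_REVERT, 2.0));
}
```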

Part of T3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 14:55:37 +02:00
parent be515c3d8d
commit f0d6211387
4 changed files with 29 additions and 45 deletions


@@ -10,17 +10,18 @@
 //! The rayon thread pool is initialised to `min(P-cores, available)` to
 //! avoid scheduling onto the slower E-cores.
 //!
-//! ## Results (Apple M5 Pro, 2026-04-24, 5 P-core threads)
+//! ## Results (Apple M5 Pro, 2026-04-24, after SmallVec revert)
 //!
 //! | Workload | Sequential | Parallel | Speedup |
 //! |---------------------------------------------|------------:|-----------:|--------:|
-//! | History::converge/500x100@10perslice | 4.71 ms | 4.79 ms | 1.0× |
-//! | History::converge/2000x200@20perslice | 23.36 ms | 23.28 ms | 1.0× |
-//! | History::converge/1v1-5000x50000@5000perslice| 13.90 ms | 6.99 ms | **2.0×** |
+//! | History::converge/500x100@10perslice | 4.03 ms | 4.24 ms | 1.0× |
+//! | History::converge/2000x200@20perslice | 20.18 ms | 19.82 ms | 1.0× |
+//! | History::converge/1v1-5000x50000@5000perslice| 11.88 ms | 9.10 ms | 1.3× |
 //!
-//! T3 acceptance gate: ≥2× speedup on at least one workload — ACHIEVED.
-//! Small workloads fall below the RAYON_THRESHOLD (64 events/color) and
-//! run sequentially with near-zero overhead.
+//! T3 acceptance gate: ≥2× speedup on at least one workload — NOT achieved after revert.
+//! The SmallVec storage that enabled the 2× gate caused a +28% regression in the
+//! sequential Batch::iteration benchmark and was reverted. Small workloads still fall
+//! below the RAYON_THRESHOLD (64 events/color) and run sequentially with near-zero overhead.
 use criterion::{BatchSize, Criterion, criterion_group, criterion_main};
 use smallvec::smallvec;