Files
trueskill-tt/benches/baseline.txt
Anders Olsson 6437649436 perf(arena): pool team_prior/lhood/inv buffers to eliminate per-game allocs
Move team_prior, lhood_lose, lhood_win, inv_buf into ScratchArena so
their Vec capacity is reused across games in a Batch. Eliminates 5
per-game heap allocations (the trunc Vec remains local due to borrow
constraints with arena.vars).

Batch::iteration: 23.0 µs (down from 27.0 µs with naive local Vecs;
8% above T0 21.253 µs baseline due to TruncFactor propagate overhead).
2026-04-24 09:10:48 +02:00

68 lines
3.4 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Baseline numbers captured before T0 changes
# Hardware: lrrr.local / Apple M5 Pro
# Date: 2026-04-24
Batch::iteration 29.840 µs
Gaussian::add 219.58 ps
Gaussian::sub 219.41 ps
Gaussian::mul 1.568 ns ← hot path; target ≥1.5× improvement
Gaussian::div 1.572 ns ← hot path; target ≥1.5× improvement
Gaussian::pi 262.89 ps
Gaussian::tau 262.47 ps
Gaussian::pi_tau_combined 219.40 ps
# After T0 (2026-04-24, same hardware)
Batch::iteration 21.253 µs (1.40× — below 3× target; see post-mortem)
Gaussian::add 218.62 ps (1.00× — unchanged, Add/Sub use moment form)
Gaussian::sub 220.15 ps (1.00×)
Gaussian::mul 218.69 ps (7.17× — nat-param: now two f64 adds, no sqrt)
Gaussian::div 218.64 ps (7.19× — nat-param: now two f64 subs, no sqrt)
Gaussian::pi 263.19 ps (1.00× — now a field read, same cost)
Gaussian::tau 263.51 ps (1.00× — now a field read, same cost)
Gaussian::pi_tau_combined 219.13 ps (1.00×)
# Post-mortem: Batch::iteration 1.40× vs. 3× target
#
# Root cause: the bench has 100 tiny 2-team events. Each event still allocates
# ~10 Vecs per iteration (down from ~18). The arena covers teams/diffs/ties/margins
# (was 4 Vecs, now 0 new allocs) but the following remain:
# - within_priors() returns Vec<Vec<Player<D>>>: 3 Vecs per event (300 total)
# - event.outputs() returns Vec<f64>: 1 Vec per event (100 total)
# - sort_perm() allocates 2 scratch Vecs: 200 total
# - Game::likelihoods = collect() allocates Vec<Vec<Gaussian>>: 4 Vecs (400 total)
# Total remaining: ~1000 allocs per iteration call vs. ~1800 before (44% reduction).
#
# The HashMap → dense Vec win (target 24×) benefits the History-level forward/backward
# sweep, NOT Batch::iteration in isolation — so this bench doesn't show it.
#
# To hit ≥3× on Batch::iteration:
# - Arena-ify sort_perm (use a stack-fixed array for small n_teams)
# - Pass a within_priors output buffer through the arena
# - Make Game::likelihoods write into an arena slice rather than allocating
# These land in T1 (factor graph) when we redesign Game's internals.
# After T1 (2026-04-24, same hardware)
Batch::iteration 23.010 µs (1.08× vs T0 21.253 µs — slight regression)
Gaussian::add 231.23 ps (unchanged)
Gaussian::sub 235.38 ps (unchanged)
Gaussian::mul 234.55 ps (unchanged — nat-param storage)
Gaussian::div 233.27 ps (unchanged)
Gaussian::pi 272.68 ps (unchanged)
Gaussian::tau 272.73 ps (unchanged)
Gaussian::pi_tau_combined 234.xx ps (unchanged)
# Notes:
# - Batch::iteration 23.0 µs vs target ≤ 21.5 µs (8% above target).
# Root cause: TruncFactor::propagate adds one extra Gaussian mul + div per
# diff vs the old inline EP computation. trunc Vec is still a fresh
# per-game allocation (borrow checker prevents putting it in the arena
# alongside vars). These are addressable in T2.
# - arena.team_prior, lhood_lose, lhood_win, inv_buf, sort_buf all reuse
# capacity across games (pooled in ScratchArena). sort_perm() allocation
# eliminated. message.rs deleted.
# - Gaussian operations unchanged vs T0.
# - All 53 tests pass. factor graph infrastructure (VarStore, Factor trait,
# BuiltinFactor, TruncFactor, EpsilonOrMax schedule) in place for T2.