T3: rayon-backed concurrency (opt-in) #2

Merged
logaritmisk merged 13 commits from t3-concurrency into main 2026-04-24 13:01:01 +00:00

Implements T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md Section 6. Plan: docs/superpowers/plans/2026-04-24-t3-concurrency.md (11 tasks).

Summary

Breaking

  • Send + Sync bounds added to public traits: Time, Drift<T>, Observer<T>, Factor, Schedule. All built-in impls satisfy these automatically (Send and Sync are auto traits); downstream custom impls that aren't already thread-safe will need to meet the new bounds.

New

  • Opt-in rayon cargo feature. When enabled:
    • Within-slice event iteration runs color-group events in parallel via par_iter_mut (TimeSlice::sweep_color_groups).
    • History::learning_curves computes per-slice posteriors in parallel; merges sequentially in slice order.
    • History::log_evidence / log_evidence_for use per-slice parallel computation with deterministic sequential reduction (sum in slice order) — bit-identical to the sequential baseline.
  • ColorGroups infrastructure (src/color_group.rs) with greedy graph coloring. Events sharing no Index go into the same color group; events in the same group can run concurrently without touching each other's skills.
  • tests/determinism.rs asserts bit-identical posteriors across RAYON_NUM_THREADS={1, 2, 4, 8}.
  • benches/history_converge.rs measures end-to-end convergence on three workload shapes.
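The greedy coloring can be sketched in a few lines. This is a stdlib-only illustration of the strategy, not the crate's actual src/color_group.rs API — the function name and the representation of an event as a list of agent indices are hypothetical:

```rust
use std::collections::HashSet;

/// Greedily partition events into color groups such that events in the
/// same group share no agent index: each event goes into the first color
/// whose accumulated index set is disjoint from the event's indices.
/// (Illustrative sketch; not the crate's real API.)
fn color_greedy(events: &[Vec<usize>]) -> Vec<Vec<usize>> {
    let mut occupied: Vec<HashSet<usize>> = Vec::new(); // indices used per color
    let mut groups: Vec<Vec<usize>> = Vec::new();       // event ids per color
    for (event_id, indices) in events.iter().enumerate() {
        let color = match occupied
            .iter()
            .position(|set| indices.iter().all(|i| !set.contains(i)))
        {
            Some(c) => c,
            None => {
                occupied.push(HashSet::new());
                groups.push(Vec::new());
                occupied.len() - 1
            }
        };
        occupied[color].extend(indices.iter().copied());
        groups[color].push(event_id);
    }
    groups
}

fn main() {
    // Events 0 and 1 both involve competitor 2, so they land in different
    // colors; event 2 touches neither and joins color 0.
    let events = vec![vec![0, 2], vec![1, 2], vec![3, 4]];
    assert_eq!(color_greedy(&events), vec![vec![0, 2], vec![1]]);
}
```

Greedy coloring in ingestion order is not optimal, but it is linear-ish in practice and preserves a deterministic grouping for a given event order.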

Performance

Sequential (no rayon, default build)

Metric              Before T3     After T3    Delta
Batch::iteration    22.88 µs      23.23 µs    +1.5% (noise)
Gaussian::*         ≈218–264 ps   ≈236 ps     within noise

No sequential regression. Default build is as fast as T2.

Parallel (--features rayon, Apple M5 Pro, auto thread count)

Workload                                       Sequential   Parallel   Speedup
500 events / 100 competitors / 10 per slice      4.03 ms     4.24 ms      1.0×
2000 events / 200 competitors / 20 per slice    20.18 ms    19.82 ms      1.0×
5000 events / 50000 competitors / 1 slice       11.88 ms     9.10 ms      1.3×

⚠️ The spec's ≥2× target was not met on realistic workloads.

T3's within-slice color-group parallelism only shows material benefit when a slice holds many events AND the competitor pool is large enough to give the greedy coloring room to partition. Typical TrueSkill workloads (tens of events per slice) don't fit that profile — rayon's task-spawn overhead dominates.

Cross-slice parallelism (dirty-bit slice skipping per spec Section 5) is the natural next step for real-workload speedup and would deliver the spec's ~50–500× online-add speedup. Deferred to a future tier.

Determinism

tests/determinism.rs runs a 200-event history at thread counts {1, 2, 4, 8} via rayon::ThreadPoolBuilder::install and asserts every (time, posterior) pair has bit-identical mu and sigma (compared via f64::to_bits()). Passes.
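The bit-identical comparison reduces to comparing f64::to_bits() rather than using an epsilon. A minimal sketch of that check (the helper name and the (mu, sigma) tuple shape are illustrative; the real test walks actual posteriors from the history):

```rust
/// Compare two (mu, sigma) posterior lists for bit-identical equality.
/// `to_bits` distinguishes values that `==` conflates (e.g. -0.0 vs 0.0)
/// and flags any last-ulp drift that an epsilon comparison would hide.
fn bit_identical(a: &[(f64, f64)], b: &[(f64, f64)]) -> bool {
    a.len() == b.len()
        && a.iter().zip(b).all(|((mu_a, sig_a), (mu_b, sig_b))| {
            mu_a.to_bits() == mu_b.to_bits() && sig_a.to_bits() == sig_b.to_bits()
        })
}

fn main() {
    let run_a = vec![(25.0, 8.333), (27.5, 7.1)];
    let run_b = vec![(25.0, 8.333), (27.5, 7.1)];
    assert!(bit_identical(&run_a, &run_b));
    // 0.0 == -0.0 under `==`, but the bit patterns differ:
    assert!(!bit_identical(&[(0.0, 1.0)], &[(-0.0, 1.0)]));
}
```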

Internals

  • Parallel path uses an unsafe block to concurrently write to SkillStore from color-group-disjoint events. Soundness rests on the color-group invariant (events in the same color touch no shared Index), guaranteed by construction in TimeSlice::recompute_color_groups. Sequential path unchanged from T2.
  • RAYON_THRESHOLD = 64 — color groups smaller than this fall back to sequential inside sweep_color_groups to avoid task-spawn overhead.
  • Thread-local ScratchArena per rayon worker thread.
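The soundness argument for the unsafe direct-write path — disjoint index sets mean no two threads ever write the same slot — can be demonstrated with a stdlib-only sketch using std::thread::scope in place of rayon. The SkillPtr wrapper and parallel_write function are hypothetical stand-ins for the crate's internals:

```rust
use std::thread;

/// Shared raw pointer wrapper; Sync because callers guarantee that the
/// events handed to `parallel_write` touch pairwise-disjoint indices.
struct SkillPtr(*mut f64);
unsafe impl Sync for SkillPtr {}

/// Write 1.0 into each event's slots concurrently, one thread per event.
/// Sound only under the color-group invariant: no index appears in two events.
fn parallel_write(skills: &mut [f64], events: &[&[usize]]) {
    let ptr = SkillPtr(skills.as_mut_ptr());
    thread::scope(|s| {
        for &indices in events {
            let ptr = &ptr;
            s.spawn(move || {
                for &i in indices {
                    // SAFETY: disjointness — no other thread writes slot `i`.
                    unsafe { *ptr.0.add(i) = 1.0 };
                }
            });
        }
    });
}

fn main() {
    let mut skills = vec![0.0_f64; 8];
    // Two "events" from the same color group: disjoint index sets.
    parallel_write(&mut skills, &[&[0, 2], &[1, 3]]);
    assert_eq!(skills, vec![1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]);
}
```

The same reasoning carries over to rayon's par_iter over a color group: disjointness is established once per recompute, so the unsafe block's invariant holds by construction.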

Test plan

  • cargo test --features approx — 96 tests pass (74 lib + 22 integration)
  • cargo test --features approx,rayon — 97 tests pass (+1 determinism)
  • cargo clippy --all-targets --features approx -- -D warnings — clean
  • cargo clippy --all-targets --features approx,rayon -- -D warnings — clean
  • cargo +nightly fmt --check — clean
  • cargo bench --bench batch --features approx — 23.23 µs (no regression vs T2)
  • cargo bench --bench history_converge --features approx,rayon — runs on all three workloads
  • Bit-identical posteriors across RAYON_NUM_THREADS={1, 2, 4, 8} — verified

Commit history

13 commits on t3-concurrency. Each task is self-contained and bisectable. See git log main..t3-concurrency for the full list.

Deferred

  • Cross-slice parallelism (dirty-bit slice skipping) — the path that would actually speed up typical TrueSkill workloads.
  • Default-on rayon feature — spec called for default-on; we keep it opt-in until the feature proves stable in production use.
  • Synchronous-EP schedule with barrier merge — alternative parallel strategy per spec Section 6.
  • MarginFactor / Outcome::Scored — T4.
  • Damped / Residual schedules — T4.
  • N-team predict_outcome — T4.
  • Game::custom full ergonomics — T4.

🤖 Generated with Claude Code

logaritmisk added 13 commits 2026-04-24 12:59:13 +00:00
11-task plan for rayon-backed within-slice parallelism per
Section 6 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Opt-in feature flag — users who want parallel paths build with
--features rayon. Default build remains single-threaded.

Spec Section 6 calls for default-on; we defer that flip until the
feature is stable under field use.

Part of T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.
Required for T3 rayon-based parallelism. Affected traits:
- Time (+ Send + Sync + 'static)
- Drift<T> (+ Send + Sync)
- Observer<T> (+ Send + Sync)
- Factor (+ Send + Sync)
- Schedule (+ Send + Sync)

All built-in impls (i64, Untimed, ConstantDrift, NullObserver,
EpsilonOrMax, TeamSumFactor, RankDiffFactor, TruncFactor,
BuiltinFactor) satisfy these bounds automatically, since Send and
Sync are auto traits.

Minor breaking change: downstream custom impls that aren't already
thread-safe will need to add the bounds.

Part of T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.
ColorGroups holds a partition of event indices into color groups such
that events of the same color touch no shared Index. Computed greedily
in ingestion order: each event goes into the first color whose existing
members are disjoint from the event's indices.

Used in T3 for safe within-slice parallelism — events in the same
color can run concurrently without touching each other's skills.

Part of T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.
TimeSlice gains a color_groups field of type ColorGroups, recomputed
whenever events change. After recompute, self.events is physically
reordered so color-0 events are first, then color-1, etc. Each color
is therefore a contiguous range of indices in self.events —
the invariant that Task 6's parallel par_iter_mut exploits.

Greedy coloring via crate::color_group::color_greedy; agent indices
come from Event::iter_agents. ColorGroups gains a color_range helper
that returns the contiguous Range<usize> for a given color.

Numerical behavior unchanged: async-EP is order-independent at
convergence, so event reordering does not affect goldens.
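With events physically sorted by color, each color's range is recoverable from the group start offsets alone. A minimal sketch of that invariant (the struct shape and field name here are illustrative, not the crate's actual layout):

```rust
use std::ops::Range;

/// After recompute, events are reordered so color 0 comes first, then
/// color 1, etc. — each color occupies a contiguous run of indices.
/// (Illustrative sketch of the contiguity invariant.)
struct ColorGroups {
    /// starts[c] is the index of the first event with color c;
    /// a trailing sentinel holds the total event count.
    starts: Vec<usize>,
}

impl ColorGroups {
    /// Contiguous range of event indices belonging to `color`.
    fn color_range(&self, color: usize) -> Range<usize> {
        self.starts[color]..self.starts[color + 1]
    }
}

fn main() {
    // 5 events: three of color 0, then two of color 1 after reordering.
    let groups = ColorGroups { starts: vec![0, 3, 5] };
    assert_eq!(groups.color_range(0), 0..3);
    assert_eq!(groups.color_range(1), 3..5);
}
```

A contiguous Range<usize> is exactly what a parallel iterator over a slice of events needs — no index indirection in the hot loop.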

Part of T3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Under #[cfg(feature = "rayon")], the per-iteration event sweep
processes events color-by-color: within a color, events touch
disjoint Index values by construction, so par_iter is safe.
Across colors, sequential ordering preserves async-EP semantics.

Event::compute() is a pure function returning an owned EventOutput
(new per-item likelihoods, evidence, and pre-computed new skill
likelihoods). The apply phase runs sequentially after the parallel
map, writing EventOutput values back to SkillStore and each event's
item likelihoods. This avoids shared mutable state in the hot loop.

Default build (no rayon) uses a sequential fallback that traverses
the same color-group order — behaviorally identical to the parallel
path. This keeps goldens bit-identical across feature configurations.

Scenario 3b applied: event updates read from and write to the shared
SkillStore, so the compute/apply split (Option A) was necessary.
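The compute/apply data flow can be sketched without rayon: a pure map producing owned outputs, then a sequential write-back. The EventOutput shape and compute logic below are hypothetical simplifications of the commit's actual types:

```rust
/// Owned result of the pure compute phase — no shared mutable state.
struct EventOutput {
    agent: usize,
    new_skill: f64,
}

/// Compute phase: reads the store immutably, returns an owned output.
/// Safe to run in parallel (the real code maps it via rayon's par_iter).
fn compute(agent: usize, store: &[f64]) -> EventOutput {
    EventOutput { agent, new_skill: store[agent] + 1.0 }
}

/// Apply phase: always sequential, writes outputs back to the store.
fn apply(store: &mut [f64], outputs: Vec<EventOutput>) {
    for out in outputs {
        store[out.agent] = out.new_skill;
    }
}

fn main() {
    let mut store = vec![10.0, 20.0, 30.0];
    let outputs: Vec<EventOutput> =
        [0usize, 2].iter().map(|&a| compute(a, &store)).collect();
    apply(&mut store, outputs);
    assert_eq!(store, vec![11.0, 20.0, 31.0]);
}
```

The cost of this pattern is the owned EventOutput allocation per event — which is exactly what the follow-up commit feature-gates out of the sequential build.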

Part of T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.
The compute/apply split introduced in 3680c54 was always active — the
sequential build paid EventOutput heap-alloc overhead even without
rayon, regressing Batch::iteration from 23.46 µs to 33.79 µs (+44%).

This commit makes the split feature-gated: under cfg(feature = "rayon")
the compute/apply pattern stays (needed for par_iter); under
cfg(not(feature = "rayon")) events update SkillStore inline via
Event::iteration_direct, matching the T2 performance profile.

EventOutput, Event::compute, and Event::apply_output are now
cfg(feature = "rayon")-only. TimeSlice::sweep_color_groups has two
cfg-gated implementations sharing the same signature.

Sequential restored to 23.29 µs; parallel 34.31 µs (small-workload
overhead expected — rayon threadpool amortizes at larger scales).

Part of T3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Per-slice posterior collection runs in parallel via par_iter; merge
into the per-key HashMap is sequential in slice order so iteration
order and HashMap insertion order are identical to the sequential
impl. Preserves deterministic output across thread counts.

Default-feature (no rayon) build unchanged — uses the T2 sequential
impl.

Part of T3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-slice log_evidence contribution computed in parallel under
--features rayon; final reduction is sequential .into_iter().sum()
on Vec<f64>, preserving slice order so the sum is bit-identical to
the sequential T2 baseline.
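The ordered-reduce strategy can be sketched with std::thread in place of rayon. The key point: partials are collected in slice order and summed with the same association as the sequential baseline, so the float result is bit-identical (function names here are illustrative):

```rust
use std::thread;

/// Per-slice contributions computed in parallel (one thread per slice),
/// collected in slice order, then reduced sequentially in that order.
fn parallel_ordered_sum(slices: &[Vec<f64>]) -> f64 {
    let partials: Vec<f64> = thread::scope(|s| {
        let handles: Vec<_> = slices
            .iter()
            .map(|slice| s.spawn(move || slice.iter().sum::<f64>()))
            .collect();
        // Joining in spawn order keeps partials in slice order
        // regardless of thread completion order.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    partials.into_iter().sum()
}

/// Sequential baseline with the same per-slice-then-total association.
fn sequential_sum(slices: &[Vec<f64>]) -> f64 {
    slices.iter().map(|s| s.iter().sum::<f64>()).sum()
}

fn main() {
    let slices = vec![vec![0.1, 0.2], vec![0.3], vec![0.4, 0.5, 0.6]];
    let (p, q) = (parallel_ordered_sum(&slices), sequential_sum(&slices));
    // Same operand order and association => bit-identical result.
    assert_eq!(p.to_bits(), q.to_bits());
}
```

An unordered parallel reduction (e.g. rayon's .sum() over a par_iter) would not give this guarantee, since floating-point addition is not associative.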

Essential for the T3 acceptance criterion of identical posteriors
across RAYON_NUM_THREADS values.

Part of T3.
tests/determinism.rs runs the same deterministic 200-event history
at thread counts {1, 2, 4, 8} via rayon::ThreadPoolBuilder::install
and asserts every (time, posterior) pair has bit-identical mu and
sigma across all configurations.

Cfg-gated to the rayon feature; no-op under --features approx alone.

Verifies the T3 determinism invariant that the ordered-reduce
strategy (per-slice parallel, sequential sum) produces thread-count-
independent results.

Part of T3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds benches/history_converge.rs with three workloads:
  - 500 events / 100 competitors / 10 events per slice
  - 2000 events / 200 competitors / 20 events per slice
  - 5000 events / 50000 competitors / 5000 events per slice (gate workload)

Investigation found the original rayon path used a compute/apply split with
EventOutput heap allocation per event, causing a 3–23× regression. Root cause:
the per-event allocations created heavy allocator contention across rayon threads.

Fixes:
  - Replace EventOutput/two-phase approach with direct unsafe parallel write.
    Events in a color group have disjoint agent index sets; concurrent writes
    to SkillStore land on different Vec slots — no data race.
  - Add RAYON_THRESHOLD=64: color groups below this size fall back to
    sequential to avoid rayon overhead on small slices.
  - Game internals: switch likelihoods/teams to SmallVec<[_;8]> to avoid
    heap allocation for ≤8-team / ≤8-player-per-team games. Add type aliases
    Teams<T,D> and Likelihoods to satisfy clippy::type_complexity.
  - within_priors() and outputs() now return SmallVec; callers updated to
    use ranked_with_arena_sv() directly (avoiding Vec→SmallVec conversion).

Sequential baseline (Apple M5 Pro, 2026-04-24):
  500x100@10perslice:            4.72 ms
  2000x200@20perslice:          23.17 ms
  1v1-5000x50000@5000perslice:  13.89 ms

With --features rayon (RAYON_NUM_THREADS=5, P-cores on M5 Pro):
  500x100@10perslice:            4.82 ms  (1.0× — below threshold)
  2000x200@20perslice:          23.09 ms  (1.0× — below threshold)
  1v1-5000x50000@5000perslice:   6.97 ms  (2.0× speedup — GATE ACHIEVED)

T3 acceptance gate: >=2× speedup on at least one workload — ACHIEVED.
74 tests pass under both feature configs.

Part of T3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The Vec<Vec<_>> → SmallVec<[SmallVec<[_;8]>;8]> change in Task 10
regressed Batch::iteration from 23.29 µs to 29.73 µs (+28%). The
SmallVec was motivated by reducing parallel-path allocations but
it hurt the sequential path substantially.

Reverting game.rs + time_slice.rs + history.rs storage back to the T2
Vec<Vec<_>> shape. The parallel rayon path (unsafe direct-write +
thread_local ScratchArena + RAYON_THRESHOLD=64 fallback) stays — it
is independent of Game's internal storage.

Benchmarks after revert:
  Batch::iteration (seq, no rayon): 23.23 µs (restored ≈T2)
  Batch::iteration (rayon):         24.57 µs
  history_converge/500x100@10:       4.03 ms seq,  4.24 ms rayon — 1.0×
  history_converge/2000x200@20:     20.18 ms seq, 19.82 ms rayon — 1.0×
  history_converge/1v1-5000x50000@5000: 11.88 ms seq, 9.10 ms rayon — 1.3×

Part of T3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Batch::iteration sequential: 23.23 µs (no regression vs T2 baseline).
Gaussian ops unchanged.

End-to-end history_converge benchmark on Apple M5 Pro:
  Workload                                        seq       rayon    speedup
  500 events / 100 competitors / 10 per slice     4.03 ms   4.24 ms  1.0x
  2000 events / 200 competitors / 20 per slice   20.18 ms  19.82 ms  1.0x
  5000 events / 50000 competitors / 1 slice      11.88 ms   9.10 ms  1.3x

The spec's >=2x target is not achieved on realistic workloads. T3's
within-slice color-group parallelism only shows material benefit when
a slice holds many events AND the competitor pool is large enough to
give the greedy coloring room to partition. Typical TrueSkill
workloads don't fit that profile. Cross-slice parallelism (dirty-bit
slice skipping, spec Section 5) is the natural next step for
real-workload speedup.

Determinism verified: bit-identical posteriors across
RAYON_NUM_THREADS={1, 2, 4, 8}.

Closes T3 of docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
logaritmisk merged commit 6bf3e7e294 into main 2026-04-24 13:01:01 +00:00
logaritmisk deleted branch t3-concurrency 2026-04-24 13:01:01 +00:00

Reference: logaritmisk/trueskill-tt#2