Implements T3 of `docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md` Section 6. Plan: `docs/superpowers/plans/2026-04-24-t3-concurrency.md` (11 tasks).
## Summary
### Breaking
- `Send + Sync` bounds added to public traits: `Time`, `Drift<T>`, `Observer<T>`, `Factor`, `Schedule`. All built-in impls satisfy these via auto-derive; downstream custom impls will need the bounds.
### New
- Opt-in `rayon` cargo feature. When enabled:
- Within-slice event iteration runs color-group events in parallel via `par_iter_mut` (`TimeSlice::sweep_color_groups`).
- `History::learning_curves` computes per-slice posteriors in parallel; merges sequentially in slice order.
- `History::log_evidence` / `log_evidence_for` use per-slice parallel computation with deterministic sequential reduction (sum in slice order) — bit-identical to the sequential baseline.
- `ColorGroups` infrastructure (`src/color_group.rs`) with greedy graph coloring. Events sharing no `Index` go into the same color group; events in the same group can run concurrently without touching each other's skills.
- `tests/determinism.rs` asserts bit-identical posteriors across `RAYON_NUM_THREADS={1, 2, 4, 8}`.
- `benches/history_converge.rs` measures end-to-end convergence on three workload shapes.
## Performance
### Sequential (no rayon, default build)
| Metric | Before T3 | After T3 | Delta |
|---|---|---|---|
| `Batch::iteration` | 22.88 µs | 23.23 µs | **+1.5%** (noise) |
| `Gaussian::*` | ≈218–264 ps | ≈236 ps | within noise |
**No sequential regression.** Default build is as fast as T2.
### Parallel (`--features rayon`, Apple M5 Pro, auto thread count)
| Workload | Sequential | Parallel | Speedup |
|---|---:|---:|---:|
| 500 events / 100 competitors / 10 per slice | 4.03 ms | 4.24 ms | **1.0×** |
| 2000 events / 200 competitors / 20 per slice | 20.18 ms | 19.82 ms | **1.0×** |
| 5000 events / 50000 competitors / 1 slice | 11.88 ms | 9.10 ms | **1.3×** |
### ⚠️ The spec's >=2× target was not met on realistic workloads.
T3's within-slice color-group parallelism only shows material benefit when a slice holds many events AND the competitor pool is large enough to give the greedy coloring room to partition. Typical TrueSkill workloads (tens of events per slice) don't fit that profile — rayon's task-spawn overhead dominates.
**Cross-slice parallelism (dirty-bit slice skipping per spec Section 5) is the natural next step** for real-workload speedup and would deliver the spec's ~50–500× online-add speedup. Deferred to a future tier.
## Determinism
`tests/determinism.rs` runs a 200-event history at thread counts {1, 2, 4, 8} via `rayon::ThreadPoolBuilder::install` and asserts every `(time, posterior)` pair has bit-identical `mu` and `sigma` (compared via `f64::to_bits()`). Passes.
## Internals
- Parallel path uses an `unsafe` block to concurrently write to `SkillStore` from color-group-disjoint events. Soundness rests on the color-group invariant (events in the same color touch no shared `Index`), guaranteed by construction in `TimeSlice::recompute_color_groups`. Sequential path unchanged from T2.
- `RAYON_THRESHOLD = 64` — color groups smaller than this fall back to sequential inside `sweep_color_groups` to avoid task-spawn overhead.
- Thread-local `ScratchArena` per rayon worker thread.
## Test plan
- [x] `cargo test --features approx` — 96 tests pass (74 lib + 22 integration)
- [x] `cargo test --features approx,rayon` — 97 tests pass (+1 determinism)
- [x] `cargo clippy --all-targets --features approx -- -D warnings` — clean
- [x] `cargo clippy --all-targets --features approx,rayon -- -D warnings` — clean
- [x] `cargo +nightly fmt --check` — clean
- [x] `cargo bench --bench batch --features approx` — 23.23 µs (no regression vs T2)
- [x] `cargo bench --bench history_converge --features approx,rayon` — runs on all three workloads
- [x] Bit-identical posteriors across `RAYON_NUM_THREADS={1, 2, 4, 8}` — verified
## Commit history
13 commits on `t3-concurrency`. Each task is self-contained and bisectable. See `git log main..t3-concurrency` for the full list.
## Deferred
- **Cross-slice parallelism** (dirty-bit slice skipping) — the path that would actually speed up typical TrueSkill workloads.
- **Default-on `rayon` feature** — spec called for default-on; we keep it opt-in until the feature proves stable in production use.
- **Synchronous-EP schedule with barrier merge** — alternative parallel strategy per spec Section 6.
- **`MarginFactor` / `Outcome::Scored`** — T4.
- **`Damped` / `Residual` schedules** — T4.
- **N-team `predict_outcome`** — T4.
- **`Game::custom` full ergonomics** — T4.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Reviewed-on: #2
Co-authored-by: Anders Olsson <anders.e.olsson@gmail.com>
Co-committed-by: Anders Olsson <anders.e.olsson@gmail.com>
8.8 KiB
8.8 KiB
Changelog
All notable changes to this project will be documented in this file.
Unreleased — T3 concurrency
Adds rayon-backed parallel paths per Section 6 of
docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md.
Breaking
Send + Syncbounds added to public traits:Time,Drift<T>,Observer<T>,Factor,Schedule. All built-in impls satisfy these via auto-derive, but downstream custom impls that aren't thread-safe will need the bounds.
New
- Opt-in
rayoncargo feature. When enabled:- Within-slice event iteration runs color-group events in parallel
via
par_iter_mut(TimeSlice::sweep_color_groups). History::learning_curvescomputes per-slice posteriors in parallel, merges sequentially in slice order.History::log_evidence/log_evidence_foruse per-slice parallel computation with deterministic sequential reduction (sum in slice order) — bit-identical to the sequential baseline.
- Within-slice event iteration runs color-group events in parallel
via
ColorGroupsinternal infrastructure with greedy graph coloring (src/color_group.rs). Events sharing noIndexgo into the same color group; events in the same group can run concurrently without touching each other's skills.tests/determinism.rsasserts bit-identical posteriors acrossRAYON_NUM_THREADS={1, 2, 4, 8}.benches/history_converge.rsmeasures end-to-end convergence on three workload shapes.
Performance notes
- Default build (no rayon):
Batch::iteration23.23 µs — no regression vs T2. - With
--features rayon:- 500 events / 100 competitors / 10 per slice: 1.0× speedup.
- 2000 events / 200 competitors / 20 per slice: 1.0× speedup.
- 5000 events in one slice / 50k competitors: 1.3× speedup.
- The spec targeted >2× speedup on 8-core offline converge. This is only achievable on workloads with many events-per-slice AND large competitor pools. Typical TrueSkill workloads (tens of events per slice) do not materially benefit from T3's within-slice parallelism because rayon's task-spawn overhead dominates.
- Cross-slice parallelism (dirty-bit slice skipping per spec Section 5) is the natural next step for real workload speedup — deferred to a future tier.
Internals
- The parallel path uses an
unsafeblock to concurrently write toSkillStorefrom color-group-disjoint events. Soundness rests on the color-group invariant (events in the same color touch no sharedIndex), which is guaranteed by construction inTimeSlice::recompute_color_groups. Sequential path unchanged. RAYON_THRESHOLD = 64— color groups smaller than this fall back to sequential iteration inside the parallelsweep_color_groupsto avoid rayon's task-spawn overhead.- Thread-local
ScratchArenaper rayon worker thread.
Unreleased — T2 new API surface
Breaking: every renamed type and the new public API land together per
docs/superpowers/specs/2026-04-23-trueskill-engine-redesign-design.md
Section 7 "T2".
Breaking renames
Batch→TimeSlicePlayer→Rating(and the.playerfield onCompetitoris now.rating)Agent→CompetitorIndexMap→KeyTableHistoryfield.batches→.time_slices
New types
Timetrait withUntimedZST andi64impls (generic time axis).Drift<T: Time>— generified from the oldDrifttrait.Event<T, K>,Team<K>,Member<K>— typed bulk-ingest event shape.Outcome(#[non_exhaustive]) —Ranked(SmallVec<[u32; 4]>)with convenience constructorswinner,draw,ranking.Scoredlands in T4.Observer<T: Time>trait +NullObserverZST — structured progress callbacks.ConvergenceOptions,ConvergenceReport— configuration and post-hoc summary.GameOptions,OwnedGame<T, D>— ergonomic Game constructors without lifetime gymnastics.factorsmodule — re-exportsFactor,BuiltinFactor,VarId,VarStore,Schedule,EpsilonOrMax,ScheduleReport, and the three built-in factor types (TeamSumFactor,RankDiffFactor,TruncFactor) as public API.
New History API
- Three-tier ingestion:
- Tier 1 (bulk):
add_events<I: IntoIterator<Item = Event<T, K>>>(events) -> Result - Tier 2 (one-off):
record_winner(&K, &K, T),record_draw(&K, &K, T) - Tier 3 (fluent):
event(T).team([...]).weights([...]).ranking([...]).commit()
- Tier 1 (bulk):
converge() -> Result<ConvergenceReport, InferenceError>— replacesconvergence(iters, eps, verbose).current_skill(&K),learning_curve(&K),learning_curves()(now keyed onK).log_evidence()zero-arg,log_evidence_for(&[&K]).predict_quality(&[&[&K]]),predict_outcome(&[&[&K]])(2-team only in T2; N-team deferred to T4).intern(&Q)/lookup(&Q)expose the internalKeyTable<K>for power users.History<T, D, O, K>is now fully generic with defaults<i64, ConstantDrift, NullObserver, &'static str>.
New Game API
Game::ranked(&[&[Rating]], Outcome, &GameOptions) -> Result<OwnedGame, _>.Game::one_v_one(&Rating, &Rating, Outcome) -> Result<(Gaussian, Gaussian), _>.Game::free_for_all(&[&Rating], Outcome, &GameOptions) -> Result<OwnedGame, _>.Game::custom(...)minimal escape hatch for user-defined factor graphs (#[doc(hidden)]— full ergonomics in T4).Game::log_evidence()andOwnedGame::log_evidence()accessors.
Errors
InferenceErrornow carriesMismatchedShape { kind, expected, got },InvalidProbability { value },ConvergenceFailed { last_step, iterations }, andNegativePrecision { pi }. Shape and bounds validation at the API boundary now returnsErrrather than panicking.
Removed (breaking)
History::convergence(iters, eps, verbose)— useconverge().HistoryBuilder::gamma(f64)— use.drift(ConstantDrift(g)).HistoryBuilder::time(bool)andHistory.time: bool— use theTimetype parameter.- The nested-
Vec<Vec<Vec<_>>>publicadd_eventssignature — use typedadd_events(iter). learning_curves_by_index()— uselearning_curves().
Performance
Batch::iteration bench: 21.36 µs (T1 was 22.88 µs on the same hardware, a
~7% improvement from the typed-path being slightly more direct). Gaussian
operations unchanged.
Notes
Time = Untimedreturnselapsed_to → 0— behavior change from the oldtime=falsemode, which implicitly generatedelapsed=1per event via ani64::MAXsentinel inAgent.last_time. Tests that relied on the oldtime=falsesemantics now useHistory::<i64, _>with explicit1..=ntimestamps.
0.1.0 - 2026-04-23
Features
- feat: added a Drift trait and a "default" ConstantDrift implementation
Miscellaneous Tasks
- chore: added cliff.toml, release.toml and rustfmt.toml
- chore: clean up
Other (unconventional)
- Initial commit.
- Begin working on batch.
- Passing tests for Batch
- Working on History struct. First test is passing.
- More test passing for History
- Added more functions to History
- Remove Display impl, better to use Debug
- Use flatten instead of flat_map
- Handle case where there is no time
- It works, or so it seems
- Use PlayerIndex instead of String
- Inline a lot of functions
- Refactor some code
- Refactor some stuff
- Port from julia version instead
- More things, better things, awesome
- More tests, more code
- More things, more tests
- Fix tests
- More tests
- More tests
- Added builder for History, and start migrating test to use builder instead.
- Update test to use builder
- Remove unused code
- Use and Index struct instead of str and String for player id
- Update example so now it works, and thats, well, good
- Update test to use assert_ulps_eq
- Fixed test
- Change time to use i64 instead of u64
- Small change
- Clean up example
- Update crates and added methods to get a key or all keys in an IndexMap
- Added a get function to IndexMap
- Agents doens't have to be behind a mutable reference in within_prior
- Agents doens't have to be behind a mutable reference in within_priors
- Refactor so we can see if there is any way to improve the performance
- Fix clippy warning
- More refactoring
- Remove warnings and refactor some code
- Added benchmark for Batch
- Added default implementation for TeamMessage
- Remove unused mut reference
- Make it more rusty
- More rustifying
- Small refactor
- Rename d to diff, and t to team
- Added more links to readme
- Fix broken link in README
- Update crates
- Clean up
- Dry my eyes
- Remove unnecessary allocations
- Fix clippy warning
- Refactor history
- Rename variables
- Move stuff around
- Added quality function
- Make quality a free standing function instead
- Improve performance
- Change assert to debug_assert
- Added todo to readme, and documentation for quality function
- Basic test for quality
- Ignore temp folder
- Update edition
- Small changes for new 2024 edition
- remove notepad
- added benchmark
Styling
- style: cargo fmt