docs: add xy MCP supervisor design spec

Approved design for the MVP: single xy binary with a Cargo workspace
(xy-protocol, xy-supervisor, xy-ipc, xy), Unix socket + newline-delimited
JSON-RPC, per-server KDL configs at XDG paths (XDG on macOS too via
etcetera), supervisor-per-server task model with per-server restart policy,
log capture to disk + ring buffer + broadcast for follow.

MVP commands: daemon, list, status, start/stop/restart (name|--all),
reload, logs. Process-alive supervision only; HTTP/MCP-aware probes,
container isolation, launchd integration, and TUI deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-25 10:51:57 +02:00
parent cd2746cc3d
commit 8f7200aa25
@@ -0,0 +1,352 @@
# xy — HTTP MCP Server Supervisor
**Date:** 2026-05-25
**Status:** Approved — ready for implementation planning
## Problem
HTTP-based MCP servers (currently two, more likely) need a long-running parent
process so they survive terminal closures and can be inspected, restarted, and
upgraded without ad-hoc terminal tabs. Today they're launched manually and
their lifetime is coupled to a terminal window.
## Goals (MVP)
- Run as a background daemon on macOS.
- Auto-launch every configured MCP server when the daemon starts.
- Provide a CLI to start, stop, restart, reload, list, and tail logs.
- Per-server restart policy with backoff.
- Capture stdout/stderr to rotating log files and an in-memory ring buffer.
## Non-goals (deferred)
- Container isolation (planned for a later phase).
- TUI dashboard.
- macOS status bar app.
- HTTP/MCP-level health probes.
- Auto-start at login via launchd (manual daemon launch only for MVP).
- Remote management (everything is local-socket only).
## Architecture
```
┌──────────────────────────────┐
│ xy daemon (process) │
│ │
xy CLI ──────►│ JSON-RPC server │
(Unix socket) │ │ │
│ ▼ │
│ Command handlers │
│ │ │
│ ▼ │
│ Supervisor (one task per │
│ managed server): │
│ spawn → wait → restart │
│ per per-server policy │
│ │ │
│ ▼ │
│ Log capture: stdout/stderr ──►│──► $XDG_STATE_HOME/xy/logs/<name>.log
│ Ring buffer (in RAM) │
└──────────────────────────────┘
Child MCP server processes
(HTTP, fixed port from KDL)
```
### Filesystem layout
XDG semantics on both Linux and macOS (no `~/Library/Application Support`).
Use the `etcetera` crate's `Xdg` strategy, or hand-rolled env-var resolution.
| Purpose | Path |
|---------------|---------------------------------------------------------|
| Configs | `${XDG_CONFIG_HOME:-~/.config}/xy/servers/*.kdl` |
| Logs | `${XDG_STATE_HOME:-~/.local/state}/xy/logs/<name>.log` |
| Socket | `${XDG_RUNTIME_DIR}/xy.sock` if set, else `${XDG_STATE_HOME:-~/.local/state}/xy/xy.sock` |
| Pidfile | `${XDG_STATE_HOME:-~/.local/state}/xy/xy.pid` |
Socket permissions: `0600`.
### Concurrency model
Tokio multi-thread runtime. Each managed server owns one **supervisor task**
that holds the canonical state for that server. RPC handlers communicate with
supervisor tasks via channels:
- `mpsc::Sender<SupervisorCmd>``Start`, `Stop`, `Restart`, `Shutdown`, `Reconfigure(ServerConfig)`.
- `watch::Receiver<ServerState>` — outsiders observe current state without locks.
- `broadcast::Sender<LogLine>` — live `logs --follow` subscribers.
No shared mutexes for server state; the supervisor task is the owner.
## Crate layout (Cargo workspace)
```
xy/
├── Cargo.toml # workspace manifest
├── crates/
│ ├── xy-protocol/ # JSON-RPC types + KDL config schema (lib)
│ ├── xy-supervisor/ # process lifecycle, restart policy, log capture (lib)
│ ├── xy-ipc/ # socket framing + JSON-RPC client/server (lib)
│ └── xy/ # binary: clap CLI + daemon command, wires it all together
└── docs/superpowers/specs/
```
Single `xy` binary; `xy daemon` runs the supervisor in-process, all other
subcommands act as JSON-RPC clients.
### Dependencies
- `tokio` (features: rt-multi-thread, net, process, signal, sync, fs, io-util, macros)
- `clap` with `derive` feature
- `serde`, `serde_json`
- `kdl` (KDL parser) with a small typed schema wrapper
- `tracing`, `tracing-subscriber` (env-filter)
- `thiserror` (libraries), `anyhow` (binary)
- `etcetera` (XDG paths, works correctly on macOS)
- `nix` (SIGTERM/SIGKILL, process groups)
Format with `cargo +nightly fmt`. Lint with `cargo clippy --all-targets -- -D warnings`.
## KDL config schema
One file per server: `${XDG_CONFIG_HOME}/xy/servers/<name>.kdl`. Filename stem
is the canonical server name; the file itself does not repeat it.
Example `~/.config/xy/servers/insikt.kdl`:
```kdl
command "/Users/olsson/.cargo/bin/insikt-mcp"
args "--http" "--port" "8421"
port 8421
env {
RUST_LOG "info"
INSIKT_DATA_DIR "/Users/olsson/.local/share/insikt"
}
working-dir "/Users/olsson/Laboratory/insikt"
restart {
policy "on-failure" // "always" | "on-failure" | "never"
backoff-initial "1s"
backoff-max "30s"
max-retries-per-minute 5
}
stop {
grace "10s" // SIGTERM, then SIGKILL after this
}
```
### Field semantics
| Field | Required | Default | Notes |
|---|---|---|---|
| `command` | yes | — | Absolute path to executable. |
| `args` | no | `[]` | String list. |
| `port` | yes | — | Informational; xy doesn't bind it. Used for `list` display and load-time conflict detection across configs. |
| `env` | no | `{}` | Merged onto inherited parent env; KDL wins on conflict. |
| `working-dir` | no | daemon's cwd | Process working directory. |
| `restart.policy` | no | `on-failure` | `always` \| `on-failure` \| `never`. |
| `restart.backoff-initial` | no | `1s` | Humantime duration. |
| `restart.backoff-max` | no | `30s` | Cap for exponential backoff. |
| `restart.max-retries-per-minute` | no | `5` | Sliding-60s window. Exceeded → `failed`. |
| `stop.grace` | no | `10s` | SIGTERM → wait → SIGKILL window. |
### Validation at load
- Every file must parse and produce a complete `ServerConfig`.
- No two configs may declare the same `port`.
- `command` must exist and be executable (warn but allow if not — child spawn will fail and supervisor will mark `failed`).
Validation failures at daemon startup are **fatal** (exit non-zero). Failures
during `reload` are returned to the CLI client as JSON-RPC errors; the daemon
keeps running.
## JSON-RPC protocol
Transport: Unix socket, newline-delimited JSON (one JSON-RPC 2.0 message per line).
### Methods
| Method | Params | Result |
|----------|-------------------------------|--------|
| `list` | — | `[{name, state, pid?, port, uptime_secs?, restart_count, last_exit?}]` |
| `status` | `{name}` | single entry as above + recent state transitions |
| `start` | `{name}` or `{all: true}` | `{started: [...], already_running: [...]}` |
| `stop` | `{name}` or `{all: true}` | `{stopped: [...], not_running: [...]}` |
| `restart`| `{name}` or `{all: true}` | `{restarted: [...]}` |
| `reload` | — | `{added: [...], removed: [...], changed: [...], unchanged: [...]}` |
| `logs` | `{name, tail?: u32, follow?: bool}` | Initial response `{subscription_id}`; the daemon then sends JSON-RPC notifications `log` `{subscription_id, name, stream, line, ts}` for each line. A final `log_end` notification `{subscription_id}` closes the stream. For non-`follow`, `log_end` fires after the buffered tail. For `follow`, the stream stays open until the client closes the connection or calls `logs_cancel {subscription_id}`. |
### Server states
`stopped` | `starting` | `running` | `restarting` | `failed` | `stopping`
### `reload` semantics
Diff current in-memory configs against on-disk config dir:
- **Added** (new file): start.
- **Removed** (file gone): stop running process.
- **Changed** (content hash differs): stop, then start with new config.
- **Unchanged**: leave alone.
### Error codes
Standard JSON-RPC error objects with our codes:
| Code | Name |
|----------|-------------------|
| `-32001` | `ServerNotFound` |
| `-32002` | `PortConflict` |
| `-32003` | `ConfigInvalid` |
| `-32004` | `AlreadyRunning` |
| `-32005` | `NotRunning` |
| `-32006` | `SpawnFailed` |
## Supervisor state machine
```
stopped ─── start ──► starting
▲ │
│ (spawn)
(stop_cmd) │
│ ▼
stopping ◄── stop ─── running ─── child_exit ──► (eval policy)
│ ▲ │
(SIGTERM, │ │
grace timer, (spawn ok) │
SIGKILL) │ │
│ │ ┌─ restart ─► restarting ──┐
▼ │ │ │
stopped └──────────────┤ │
│ │
└─ no-restart / cap hit ──► failed
start ────┘
reload
```
### Spawn flow
1. Open / rotate log file (append mode; size threshold 10 MB, keep last 5 generations).
2. Build `tokio::process::Command`:
- `command`, `args`, merged env, `working-dir`.
- `kill_on_drop(true)`.
- `process_group(0)` — own process group so signals don't leak.
3. Spawn. Pipe stdout and stderr.
4. Spin up two log pumps per child:
- `stdout_pump`: line-buffered → log file + ring buffer + broadcast channel.
- `stderr_pump`: same, tagged `stderr`.
5. `await child.wait()`. On exit, evaluate restart policy.
### Stop flow
1. Send `SIGTERM` to the process group.
2. Start grace timer (`stop.grace`).
3. On timer fire, `SIGKILL` the process group.
4. `await child.wait()`.
5. Close log pumps, transition to `stopped`.
### Shutdown (daemon receives SIGTERM/SIGINT)
Broadcast `Shutdown` to all supervisor tasks → each runs its stop flow in
parallel → daemon awaits all with an outer deadline of `2 × max(stop.grace)`
across configs → exit `0`.
### Daemon boot
1. Resolve XDG paths, create state directories if missing.
2. Acquire pidfile (fail if another daemon is alive).
3. Load and validate all configs. Fatal on any failure.
4. Bind Unix socket (0600 perms).
5. Spawn one supervisor task per config; send each an immediate `Start`
(auto-launch behavior).
6. Serve JSON-RPC until shutdown signal.
## Log handling
Per server:
- **Disk file** at `${XDG_STATE_HOME}/xy/logs/<name>.log`. Combined stdout+stderr with a leading tag per line: `[out]` / `[err]`. Size-based rotation: when current file ≥ 10 MB, rename to `<name>.log.1` (shifting older generations), open fresh. Keep at most 5 generations.
- **Ring buffer** in RAM, ~1 MB per server, holds the most recent log lines. Source for `logs --tail` without re-reading disk.
- **Broadcast channel** (`tokio::sync::broadcast`) for live `logs --follow` subscribers. Lagged subscribers are dropped with a warning.
## CLI surface
`clap` with `derive`. Subcommand structure:
```
xy daemon # foreground daemon (logs to stderr)
xy list # all configured servers + state
xy status <name> # single server detail
xy start <name|--all>
xy stop <name|--all>
xy restart <name|--all>
xy reload
xy logs <name> [--tail N] [--follow]
```
CLI exit codes:
| Code | Meaning |
|------|---------|
| 0 | success |
| 1 | operational error (server not found, port conflict on reload) |
| 2 | daemon unreachable (socket missing or refused) |
| 3 | config invalid |
## Error handling
Two layers, kept separate:
- **Libraries** (`xy-protocol`, `xy-supervisor`, `xy-ipc`): `thiserror` enums per crate. Callers can match on variants (e.g., `SupervisorError::AlreadyRunning`, `ConfigError::DuplicatePort { name_a, name_b, port }`).
- **Binary** (`xy`): `anyhow` for top-level startup and CLI reporting. The IPC layer has one match site translating typed errors into JSON-RPC error objects.
### Fatal vs non-fatal
| Class | Examples | Behavior |
|---|---|---|
| Fatal at daemon startup | socket bind fails; state dir uncreatable; any config invalid; duplicate port | exit non-zero, log to stderr |
| Non-fatal at runtime | child spawn fails; restart cap hit; log file write fails | log, mark server `failed` (or degrade log subsystem), daemon keeps running |
## Testing strategy
### Unit tests
- **`xy-supervisor`**: state-machine transitions using a mock `ChildHandle` trait so tests don't actually spawn processes. Cases:
- Restart policy decisions (`always` / `on-failure` / `never` × clean/dirty exit).
- Backoff math (initial, exponential, cap).
- Retry window (sliding 60s) → `failed` transition.
- Stop flow: grace timer expires → SIGKILL escalation.
- **`xy-protocol`**: KDL parse cases (minimal, full, invalid). JSON-RPC envelope round-trips. Error-code mapping.
### Integration tests
In `crates/xy/tests/`:
- Spin up the real daemon on a temp socket with temp state and config dirs (per-test `XDG_*` env via `tempfile`).
- Use tiny long-running test-only binaries built in the workspace:
- `xy-test-sleep-server`: sleeps until SIGTERM, prints periodic lines.
- `xy-test-exit-immediately`: exits non-zero immediately, used for failure-mode tests.
- Drive the real CLI subcommands; assert on `list` output and observable state transitions.
### CI
```
cargo +nightly fmt --check
cargo clippy --all-targets -- -D warnings
cargo test --all
```
## Future work (out of scope for MVP)
- Container isolation (rootless podman / Docker backend per server).
- HTTP/MCP-aware health probes.
- launchd LaunchAgent install command.
- TUI dashboard (would reuse `xy-protocol` over the same socket).
- macOS status bar app (same).
- Optional auth on the socket if it ever leaves the user's machine.