diff --git a/docs/superpowers/specs/2026-05-25-xy-mcp-supervisor-design.md b/docs/superpowers/specs/2026-05-25-xy-mcp-supervisor-design.md new file mode 100644 index 0000000..da11e73 --- /dev/null +++ b/docs/superpowers/specs/2026-05-25-xy-mcp-supervisor-design.md @@ -0,0 +1,352 @@ +# xy — HTTP MCP Server Supervisor + +**Date:** 2026-05-25 +**Status:** Approved — ready for implementation planning + +## Problem + +HTTP-based MCP servers (currently two, more likely) need a long-running parent +process so they survive terminal closures and can be inspected, restarted, and +upgraded without ad-hoc terminal tabs. Today they're launched manually and +their lifetime is coupled to a terminal window. + +## Goals (MVP) + +- Run as a background daemon on macOS. +- Auto-launch every configured MCP server when the daemon starts. +- Provide a CLI to start, stop, restart, reload, list, and tail logs. +- Per-server restart policy with backoff. +- Capture stdout/stderr to rotating log files and an in-memory ring buffer. + +## Non-goals (deferred) + +- Container isolation (planned for a later phase). +- TUI dashboard. +- macOS status bar app. +- HTTP/MCP-level health probes. +- Auto-start at login via launchd (manual daemon launch only for MVP). +- Remote management (everything is local-socket only). + +## Architecture + +``` + ┌──────────────────────────────┐ + │ xy daemon (process) │ + │ │ + xy CLI ──────►│ JSON-RPC server │ + (Unix socket) │ │ │ + │ ▼ │ + │ Command handlers │ + │ │ │ + │ ▼ │ + │ Supervisor (one task per │ + │ managed server): │ + │ spawn → wait → restart │ + │ per per-server policy │ + │ │ │ + │ ▼ │ + │ Log capture: stdout/stderr ──►│──► $XDG_STATE_HOME/xy/logs/.log + │ Ring buffer (in RAM) │ + └──────────────────────────────┘ + │ + ▼ + Child MCP server processes + (HTTP, fixed port from KDL) +``` + +### Filesystem layout + +XDG semantics on both Linux and macOS (no `~/Library/Application Support`). +Use the `etcetera` crate's `Xdg` strategy, or hand-rolled env-var resolution. + +| Purpose | Path | +|---------------|---------------------------------------------------------| +| Configs | `${XDG_CONFIG_HOME:-~/.config}/xy/servers/*.kdl` | +| Logs | `${XDG_STATE_HOME:-~/.local/state}/xy/logs/.log` | +| Socket | `${XDG_RUNTIME_DIR}/xy.sock` if set, else `${XDG_STATE_HOME:-~/.local/state}/xy/xy.sock` | +| Pidfile | `${XDG_STATE_HOME:-~/.local/state}/xy/xy.pid` | + +Socket permissions: `0600`. + +### Concurrency model + +Tokio multi-thread runtime. Each managed server owns one **supervisor task** +that holds the canonical state for that server. RPC handlers communicate with +supervisor tasks via channels: + +- `mpsc::Sender` — `Start`, `Stop`, `Restart`, `Shutdown`, `Reconfigure(ServerConfig)`. +- `watch::Receiver` — outsiders observe current state without locks. +- `broadcast::Sender` — live `logs --follow` subscribers. + +No shared mutexes for server state; the supervisor task is the owner. + +## Crate layout (Cargo workspace) + +``` +xy/ +├── Cargo.toml # workspace manifest +├── crates/ +│ ├── xy-protocol/ # JSON-RPC types + KDL config schema (lib) +│ ├── xy-supervisor/ # process lifecycle, restart policy, log capture (lib) +│ ├── xy-ipc/ # socket framing + JSON-RPC client/server (lib) +│ └── xy/ # binary: clap CLI + daemon command, wires it all together +└── docs/superpowers/specs/ +``` + +Single `xy` binary; `xy daemon` runs the supervisor in-process, all other +subcommands act as JSON-RPC clients. + +### Dependencies + +- `tokio` (features: rt-multi-thread, net, process, signal, sync, fs, io-util, macros) +- `clap` with `derive` feature +- `serde`, `serde_json` +- `kdl` (KDL parser) with a small typed schema wrapper +- `tracing`, `tracing-subscriber` (env-filter) +- `thiserror` (libraries), `anyhow` (binary) +- `etcetera` (XDG paths, works correctly on macOS) +- `nix` (SIGTERM/SIGKILL, process groups) + +Format with `cargo +nightly fmt`. Lint with `cargo clippy --all-targets -- -D warnings`. + +## KDL config schema + +One file per server: `${XDG_CONFIG_HOME}/xy/servers/.kdl`. Filename stem +is the canonical server name; the file itself does not repeat it. + +Example `~/.config/xy/servers/insikt.kdl`: + +```kdl +command "/Users/olsson/.cargo/bin/insikt-mcp" +args "--http" "--port" "8421" +port 8421 + +env { + RUST_LOG "info" + INSIKT_DATA_DIR "/Users/olsson/.local/share/insikt" +} + +working-dir "/Users/olsson/Laboratory/insikt" + +restart { + policy "on-failure" // "always" | "on-failure" | "never" + backoff-initial "1s" + backoff-max "30s" + max-retries-per-minute 5 +} + +stop { + grace "10s" // SIGTERM, then SIGKILL after this +} +``` + +### Field semantics + +| Field | Required | Default | Notes | +|---|---|---|---| +| `command` | yes | — | Absolute path to executable. | +| `args` | no | `[]` | String list. | +| `port` | yes | — | Informational; xy doesn't bind it. Used for `list` display and load-time conflict detection across configs. | +| `env` | no | `{}` | Merged onto inherited parent env; KDL wins on conflict. | +| `working-dir` | no | daemon's cwd | Process working directory. | +| `restart.policy` | no | `on-failure` | `always` \| `on-failure` \| `never`. | +| `restart.backoff-initial` | no | `1s` | Humantime duration. | +| `restart.backoff-max` | no | `30s` | Cap for exponential backoff. | +| `restart.max-retries-per-minute` | no | `5` | Sliding-60s window. Exceeded → `failed`. | +| `stop.grace` | no | `10s` | SIGTERM → wait → SIGKILL window. | + +### Validation at load + +- Every file must parse and produce a complete `ServerConfig`. +- No two configs may declare the same `port`. +- `command` must exist and be executable (warn but allow if not — child spawn will fail and supervisor will mark `failed`). + +Validation failures at daemon startup are **fatal** (exit non-zero). Failures +during `reload` are returned to the CLI client as JSON-RPC errors; the daemon +keeps running. + +## JSON-RPC protocol + +Transport: Unix socket, newline-delimited JSON (one JSON-RPC 2.0 message per line). + +### Methods + +| Method | Params | Result | +|----------|-------------------------------|--------| +| `list` | — | `[{name, state, pid?, port, uptime_secs?, restart_count, last_exit?}]` | +| `status` | `{name}` | single entry as above + recent state transitions | +| `start` | `{name}` or `{all: true}` | `{started: [...], already_running: [...]}` | +| `stop` | `{name}` or `{all: true}` | `{stopped: [...], not_running: [...]}` | +| `restart`| `{name}` or `{all: true}` | `{restarted: [...]}` | +| `reload` | — | `{added: [...], removed: [...], changed: [...], unchanged: [...]}` | +| `logs` | `{name, tail?: u32, follow?: bool}` | Initial response `{subscription_id}`; the daemon then sends JSON-RPC notifications `log` `{subscription_id, name, stream, line, ts}` for each line. A final `log_end` notification `{subscription_id}` closes the stream. For non-`follow`, `log_end` fires after the buffered tail. For `follow`, the stream stays open until the client closes the connection or calls `logs_cancel {subscription_id}`. | + +### Server states + +`stopped` | `starting` | `running` | `restarting` | `failed` | `stopping` + +### `reload` semantics + +Diff current in-memory configs against on-disk config dir: + +- **Added** (new file): start. +- **Removed** (file gone): stop running process. +- **Changed** (content hash differs): stop, then start with new config. +- **Unchanged**: leave alone. + +### Error codes + +Standard JSON-RPC error objects with our codes: + +| Code | Name | +|----------|-------------------| +| `-32001` | `ServerNotFound` | +| `-32002` | `PortConflict` | +| `-32003` | `ConfigInvalid` | +| `-32004` | `AlreadyRunning` | +| `-32005` | `NotRunning` | +| `-32006` | `SpawnFailed` | + +## Supervisor state machine + +``` + stopped ─── start ──► starting + ▲ │ + │ (spawn) + (stop_cmd) │ + │ ▼ + stopping ◄── stop ─── running ─── child_exit ──► (eval policy) + │ ▲ │ + (SIGTERM, │ │ + grace timer, (spawn ok) │ + SIGKILL) │ │ + │ │ ┌─ restart ─► restarting ──┐ + ▼ │ │ │ + stopped └──────────────┤ │ + │ │ + └─ no-restart / cap hit ──► failed + │ + start ────┘ + reload +``` + +### Spawn flow + +1. Open / rotate log file (append mode; size threshold 10 MB, keep last 5 generations). +2. Build `tokio::process::Command`: + - `command`, `args`, merged env, `working-dir`. + - `kill_on_drop(true)`. + - `process_group(0)` — own process group so signals don't leak. +3. Spawn. Pipe stdout and stderr. +4. Spin up two log pumps per child: + - `stdout_pump`: line-buffered → log file + ring buffer + broadcast channel. + - `stderr_pump`: same, tagged `stderr`. +5. `await child.wait()`. On exit, evaluate restart policy. + +### Stop flow + +1. Send `SIGTERM` to the process group. +2. Start grace timer (`stop.grace`). +3. On timer fire, `SIGKILL` the process group. +4. `await child.wait()`. +5. Close log pumps, transition to `stopped`. + +### Shutdown (daemon receives SIGTERM/SIGINT) + +Broadcast `Shutdown` to all supervisor tasks → each runs its stop flow in +parallel → daemon awaits all with an outer deadline of `2 × max(stop.grace)` +across configs → exit `0`. + +### Daemon boot + +1. Resolve XDG paths, create state directories if missing. +2. Acquire pidfile (fail if another daemon is alive). +3. Load and validate all configs. Fatal on any failure. +4. Bind Unix socket (0600 perms). +5. Spawn one supervisor task per config; send each an immediate `Start` + (auto-launch behavior). +6. Serve JSON-RPC until shutdown signal. + +## Log handling + +Per server: + +- **Disk file** at `${XDG_STATE_HOME}/xy/logs/.log`. Combined stdout+stderr with a leading tag per line: `[out]` / `[err]`. Size-based rotation: when current file ≥ 10 MB, rename to `.log.1` (shifting older generations), open fresh. Keep at most 5 generations. +- **Ring buffer** in RAM, ~1 MB per server, holds the most recent log lines. Source for `logs --tail` without re-reading disk. +- **Broadcast channel** (`tokio::sync::broadcast`) for live `logs --follow` subscribers. Lagged subscribers are dropped with a warning. + +## CLI surface + +`clap` with `derive`. Subcommand structure: + +``` +xy daemon # foreground daemon (logs to stderr) +xy list # all configured servers + state +xy status # single server detail +xy start +xy stop +xy restart +xy reload +xy logs [--tail N] [--follow] +``` + +CLI exit codes: + +| Code | Meaning | +|------|---------| +| 0 | success | +| 1 | operational error (server not found, port conflict on reload) | +| 2 | daemon unreachable (socket missing or refused) | +| 3 | config invalid | + +## Error handling + +Two layers, kept separate: + +- **Libraries** (`xy-protocol`, `xy-supervisor`, `xy-ipc`): `thiserror` enums per crate. Callers can match on variants (e.g., `SupervisorError::AlreadyRunning`, `ConfigError::DuplicatePort { name_a, name_b, port }`). +- **Binary** (`xy`): `anyhow` for top-level startup and CLI reporting. The IPC layer has one match site translating typed errors into JSON-RPC error objects. + +### Fatal vs non-fatal + +| Class | Examples | Behavior | +|---|---|---| +| Fatal at daemon startup | socket bind fails; state dir uncreatable; any config invalid; duplicate port | exit non-zero, log to stderr | +| Non-fatal at runtime | child spawn fails; restart cap hit; log file write fails | log, mark server `failed` (or degrade log subsystem), daemon keeps running | + +## Testing strategy + +### Unit tests + +- **`xy-supervisor`**: state-machine transitions using a mock `ChildHandle` trait so tests don't actually spawn processes. Cases: + - Restart policy decisions (`always` / `on-failure` / `never` × clean/dirty exit). + - Backoff math (initial, exponential, cap). + - Retry window (sliding 60s) → `failed` transition. + - Stop flow: grace timer expires → SIGKILL escalation. +- **`xy-protocol`**: KDL parse cases (minimal, full, invalid). JSON-RPC envelope round-trips. Error-code mapping. + +### Integration tests + +In `crates/xy/tests/`: + +- Spin up the real daemon on a temp socket with temp state and config dirs (per-test `XDG_*` env via `tempfile`). +- Use tiny long-running test-only binaries built in the workspace: + - `xy-test-sleep-server`: sleeps until SIGTERM, prints periodic lines. + - `xy-test-exit-immediately`: exits non-zero immediately, used for failure-mode tests. +- Drive the real CLI subcommands; assert on `list` output and observable state transitions. + +### CI + +``` +cargo +nightly fmt --check +cargo clippy --all-targets -- -D warnings +cargo test --all +``` + +## Future work (out of scope for MVP) + +- Container isolation (rootless podman / Docker backend per server). +- HTTP/MCP-aware health probes. +- launchd LaunchAgent install command. +- TUI dashboard (would reuse `xy-protocol` over the same socket). +- macOS status bar app (same). +- Optional auth on the socket if it ever leaves the user's machine.