docs: add xy MCP supervisor design spec
Approved design for the MVP: single xy binary with a Cargo workspace (xy-protocol, xy-supervisor, xy-ipc, xy), Unix socket + newline-delimited JSON-RPC, per-server KDL configs at XDG paths (XDG on macOS too via etcetera), supervisor-per-server task model with per-server restart policy, log capture to disk + ring buffer + broadcast for follow. MVP commands: daemon, list, status, start/stop/restart (name|--all), reload, logs. Process-alive supervision only; HTTP/MCP-aware probes, container isolation, launchd integration, and TUI deferred. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,352 @@
# xy — HTTP MCP Server Supervisor

**Date:** 2026-05-25
**Status:** Approved — ready for implementation planning

## Problem

HTTP-based MCP servers (currently two, with more likely to follow) need a
long-running parent process so they survive terminal closures and can be
inspected, restarted, and upgraded without ad-hoc terminal tabs. Today they're
launched manually and their lifetime is coupled to a terminal window.

## Goals (MVP)

- Run as a background daemon on macOS.
- Auto-launch every configured MCP server when the daemon starts.
- Provide a CLI to start, stop, restart, reload, list, and tail logs.
- Per-server restart policy with backoff.
- Capture stdout/stderr to rotating log files and an in-memory ring buffer.

## Non-goals (deferred)

- Container isolation (planned for a later phase).
- TUI dashboard.
- macOS status bar app.
- HTTP/MCP-level health probes.
- Auto-start at login via launchd (manual daemon launch only for MVP).
- Remote management (everything is local-socket only).

## Architecture

```
              ┌──────────────────────────────┐
              │  xy daemon (process)         │
              │                              │
xy CLI ──────►│  JSON-RPC server             │
(Unix socket) │       │                      │
              │       ▼                      │
              │  Command handlers            │
              │       │                      │
              │       ▼                      │
              │  Supervisor (one task per    │
              │  managed server):            │
              │    spawn → wait → restart    │
              │    per per-server policy     │
              │       │                      │
              │       ▼                      │
              │  Log capture: stdout/stderr  │──► $XDG_STATE_HOME/xy/logs/<name>.log
              │  Ring buffer (in RAM)        │
              └──────────────────────────────┘
                             │
                             ▼
                 Child MCP server processes
                (HTTP, fixed port from KDL)
```

### Filesystem layout

XDG semantics on both Linux and macOS (no `~/Library/Application Support`).
Use the `etcetera` crate's `Xdg` strategy, or hand-rolled env-var resolution.

| Purpose | Path |
|---------|------|
| Configs | `${XDG_CONFIG_HOME:-~/.config}/xy/servers/*.kdl` |
| Logs    | `${XDG_STATE_HOME:-~/.local/state}/xy/logs/<name>.log` |
| Socket  | `${XDG_RUNTIME_DIR}/xy.sock` if set, else `${XDG_STATE_HOME:-~/.local/state}/xy/xy.sock` |
| Pidfile | `${XDG_STATE_HOME:-~/.local/state}/xy/xy.pid` |

Socket permissions: `0600`.
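
The table above can be read as a small resolution function. What follows is a minimal sketch of the hand-rolled env-var option, not the `etcetera` API. The names `XyPaths` and `resolve_paths` are hypothetical, and treating an empty variable as unset is an assumption taken from the XDG convention.

```rust
use std::path::{Path, PathBuf};

/// Resolved xy paths. Struct and field names are illustrative only.
#[derive(Debug, PartialEq)]
pub struct XyPaths {
    pub config_dir: PathBuf,
    pub log_dir: PathBuf,
    pub socket: PathBuf,
    pub pidfile: PathBuf,
}

/// Resolve all paths from an environment lookup and the user's home directory.
/// The lookup is injected so the logic can be tested without touching the
/// real process environment.
pub fn resolve_paths(env: impl Fn(&str) -> Option<String>, home: &Path) -> XyPaths {
    // An unset or empty variable falls back to the XDG default under `home`.
    let dir = |var: &str, fallback: &str| -> PathBuf {
        match env(var) {
            Some(v) if !v.is_empty() => PathBuf::from(v),
            _ => home.join(fallback),
        }
    };
    let config_home = dir("XDG_CONFIG_HOME", ".config");
    let state_dir = dir("XDG_STATE_HOME", ".local/state").join("xy");

    // XDG_RUNTIME_DIR has no default; fall back to the state directory.
    let socket = match env("XDG_RUNTIME_DIR") {
        Some(v) if !v.is_empty() => PathBuf::from(v).join("xy.sock"),
        _ => state_dir.join("xy.sock"),
    };

    XyPaths {
        config_dir: config_home.join("xy").join("servers"),
        log_dir: state_dir.join("logs"),
        socket,
        pidfile: state_dir.join("xy.pid"),
    }
}
```

In the daemon the lookup would presumably be `|k| std::env::var(k).ok()`; tests can pass a closure over a fixed map instead.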

### Concurrency model

Tokio multi-thread runtime. Each managed server owns one **supervisor task**
that holds the canonical state for that server. RPC handlers communicate with
supervisor tasks via channels:

- `mpsc::Sender<SupervisorCmd>` — `Start`, `Stop`, `Restart`, `Shutdown`, `Reconfigure(ServerConfig)`.
- `watch::Receiver<ServerState>` — outsiders observe current state without locks.
- `broadcast::Sender<LogLine>` — live `logs --follow` subscribers.

No shared mutexes for server state; the supervisor task is the owner.
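
The ownership pattern can be illustrated without Tokio. This sketch swaps in `std::thread` and `std::sync::mpsc` as stand-ins for Tokio tasks and channels, and replaces the `watch` channel with a hypothetical `Query` command that carries a reply sender. `Restart` and `Reconfigure` are omitted, and `ServerState` is reduced to two variants.

```rust
use std::sync::mpsc;
use std::thread;

/// Simplified state; the spec lists six states.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ServerState {
    Stopped,
    Running,
}

pub enum SupervisorCmd {
    Start,
    Stop,
    /// Stand-in for the `watch` channel: ask the owner for its current state.
    Query(mpsc::Sender<ServerState>),
    Shutdown,
}

/// Spawn the single owner of one server's state and return its command handle.
pub fn spawn_supervisor() -> (mpsc::Sender<SupervisorCmd>, thread::JoinHandle<ServerState>) {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        // The state lives only in this thread; nobody else can mutate it.
        let mut state = ServerState::Stopped;
        for cmd in rx {
            match cmd {
                SupervisorCmd::Start => state = ServerState::Running,
                SupervisorCmd::Stop => state = ServerState::Stopped,
                SupervisorCmd::Query(reply) => {
                    // The asker may have gone away; that is not an error here.
                    let _ = reply.send(state);
                }
                SupervisorCmd::Shutdown => {
                    state = ServerState::Stopped;
                    break;
                }
            }
        }
        state
    });
    (tx, handle)
}
```

Because commands are processed in order by one owner, a `Query` sent after `Start` always observes `Running`, with no lock involved.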

## Crate layout (Cargo workspace)

```
xy/
├── Cargo.toml              # workspace manifest
├── crates/
│   ├── xy-protocol/        # JSON-RPC types + KDL config schema (lib)
│   ├── xy-supervisor/      # process lifecycle, restart policy, log capture (lib)
│   ├── xy-ipc/             # socket framing + JSON-RPC client/server (lib)
│   └── xy/                 # binary: clap CLI + daemon command, wires it all together
└── docs/superpowers/specs/
```

Single `xy` binary; `xy daemon` runs the supervisor in-process, all other
subcommands act as JSON-RPC clients.

### Dependencies

- `tokio` (features: rt-multi-thread, net, process, signal, sync, fs, io-util, macros)
- `clap` with `derive` feature
- `serde`, `serde_json`
- `kdl` (KDL parser) with a small typed schema wrapper
- `tracing`, `tracing-subscriber` (env-filter)
- `thiserror` (libraries), `anyhow` (binary)
- `etcetera` (XDG paths, works correctly on macOS)
- `nix` (SIGTERM/SIGKILL, process groups)

Format with `cargo +nightly fmt`. Lint with `cargo clippy --all-targets -- -D warnings`.

## KDL config schema

One file per server: `${XDG_CONFIG_HOME}/xy/servers/<name>.kdl`. Filename stem
is the canonical server name; the file itself does not repeat it.

Example `~/.config/xy/servers/insikt.kdl`:

```kdl
command "/Users/olsson/.cargo/bin/insikt-mcp"
args "--http" "--port" "8421"
port 8421

env {
    RUST_LOG "info"
    INSIKT_DATA_DIR "/Users/olsson/.local/share/insikt"
}

working-dir "/Users/olsson/Laboratory/insikt"

restart {
    policy "on-failure"          // "always" | "on-failure" | "never"
    backoff-initial "1s"
    backoff-max "30s"
    max-retries-per-minute 5
}

stop {
    grace "10s"                  // SIGTERM, then SIGKILL after this
}
```

### Field semantics

| Field | Required | Default | Notes |
|---|---|---|---|
| `command` | yes | — | Absolute path to executable. |
| `args` | no | `[]` | String list. |
| `port` | yes | — | Informational; xy doesn't bind it. Used for `list` display and load-time conflict detection across configs. |
| `env` | no | `{}` | Merged onto inherited parent env; KDL wins on conflict. |
| `working-dir` | no | daemon's cwd | Process working directory. |
| `restart.policy` | no | `on-failure` | `always` \| `on-failure` \| `never`. |
| `restart.backoff-initial` | no | `1s` | Humantime duration. |
| `restart.backoff-max` | no | `30s` | Cap for exponential backoff. |
| `restart.max-retries-per-minute` | no | `5` | Sliding-60s window. Exceeded → `failed`. |
| `stop.grace` | no | `10s` | SIGTERM → wait → SIGKILL window. |
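
The three `restart.*` timing fields interact as sketched below. The spec says only "exponential"; doubling per attempt is an assumption. Reading the cap as "the attempt beyond `max-retries-per-minute` inside the window is refused" is also an assumption. `backoff_delay` and `RetryWindow` are hypothetical names.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Delay before restart attempt `attempt` (0-based): initial * 2^attempt,
/// capped at `max`. Doubling is assumed, not specified.
pub fn backoff_delay(initial: Duration, max: Duration, attempt: u32) -> Duration {
    // Saturate instead of overflowing for very large attempt counts.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    initial.checked_mul(factor).map_or(max, |d| d.min(max))
}

/// Sliding 60-second window over restart attempts.
pub struct RetryWindow {
    window: Duration,
    max_retries: usize,
    attempts: VecDeque<Instant>,
}

impl RetryWindow {
    pub fn new(max_retries: usize) -> Self {
        Self {
            window: Duration::from_secs(60),
            max_retries,
            attempts: VecDeque::new(),
        }
    }

    /// Record a restart attempt at `now`. Returns `false` when the cap is
    /// already used up, in which case the server should become `failed`.
    pub fn allow(&mut self, now: Instant) -> bool {
        // Forget attempts that have slid out of the window.
        while let Some(&oldest) = self.attempts.front() {
            if now.duration_since(oldest) >= self.window {
                self.attempts.pop_front();
            } else {
                break;
            }
        }
        if self.attempts.len() >= self.max_retries {
            return false;
        }
        self.attempts.push_back(now);
        true
    }
}
```

With the defaults (`1s`, `30s`), delays run 1s, 2s, 4s, 8s, 16s and then stay at 30s. Passing `now` explicitly keeps the window logic testable without sleeping.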

### Validation at load

- Every file must parse and produce a complete `ServerConfig`.
- No two configs may declare the same `port`.
- `command` should exist and be executable. If it does not, load warns but still accepts the config: the child spawn will fail and the supervisor will mark the server `failed`.

Validation failures at daemon startup (other than the `command` warning) are
**fatal** (exit non-zero). Failures during `reload` are returned to the CLI
client as JSON-RPC errors; the daemon keeps running.
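
The port-uniqueness rule could be checked along these lines. This is a sketch: the variant shape `DuplicatePort { name_a, name_b, port }` follows the error-handling section of this spec, while `check_ports` and its slice-of-pairs input are hypothetical simplifications of a real `ServerConfig` list.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum ConfigError {
    DuplicatePort { name_a: String, name_b: String, port: u16 },
}

/// Check `(server name, port)` pairs; the first clash found is reported.
pub fn check_ports(configs: &[(&str, u16)]) -> Result<(), ConfigError> {
    // Maps each port to the first server that declared it.
    let mut seen: HashMap<u16, &str> = HashMap::new();
    for &(name, port) in configs {
        if let Some(&other) = seen.get(&port) {
            return Err(ConfigError::DuplicatePort {
                name_a: other.to_string(),
                name_b: name.to_string(),
                port,
            });
        }
        seen.insert(port, name);
    }
    Ok(())
}
```

At startup the `Err` would be fatal; during `reload` the same value would be translated into a `PortConflict` JSON-RPC error.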

## JSON-RPC protocol

Transport: Unix socket, newline-delimited JSON (one JSON-RPC 2.0 message per line).
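
Newline-delimited framing reduces to "write one line, read one line". The sketch below works on any `Write` / `BufRead` pair so it can be exercised in memory. `write_frame` and `read_frame` are hypothetical helpers, and payloads are treated as opaque strings here, whereas the real implementation would presumably serialize with `serde_json`.

```rust
use std::io::{self, BufRead, Write};

/// Write one message as a single line. A raw newline inside the payload
/// would break framing, so it is rejected up front.
pub fn write_frame<W: Write>(mut w: W, json: &str) -> io::Result<()> {
    if json.contains('\n') {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "frame contains a raw newline",
        ));
    }
    w.write_all(json.as_bytes())?;
    w.write_all(b"\n")?;
    w.flush()
}

/// Read the next frame. `Ok(None)` means the peer closed the connection.
pub fn read_frame<R: BufRead>(mut r: R) -> io::Result<Option<String>> {
    let mut line = String::new();
    if r.read_line(&mut line)? == 0 {
        return Ok(None);
    }
    // Strip the delimiter (and a stray carriage return, if any).
    while line.ends_with('\n') || line.ends_with('\r') {
        line.pop();
    }
    Ok(Some(line))
}
```

Compact JSON never contains an unescaped newline, so the check in `write_frame` only guards against pretty-printed input.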

### Methods

| Method | Params | Result |
|--------|--------|--------|
| `list` | — | `[{name, state, pid?, port, uptime_secs?, restart_count, last_exit?}]` |
| `status` | `{name}` | single entry as above + recent state transitions |
| `start` | `{name}` or `{all: true}` | `{started: [...], already_running: [...]}` |
| `stop` | `{name}` or `{all: true}` | `{stopped: [...], not_running: [...]}` |
| `restart` | `{name}` or `{all: true}` | `{restarted: [...]}` |
| `reload` | — | `{added: [...], removed: [...], changed: [...], unchanged: [...]}` |
| `logs` | `{name, tail?: u32, follow?: bool}` | Initial response `{subscription_id}`; the daemon then sends JSON-RPC notifications `log` `{subscription_id, name, stream, line, ts}` for each line. A final `log_end` notification `{subscription_id}` closes the stream. For non-`follow`, `log_end` fires after the buffered tail. For `follow`, the stream stays open until the client closes the connection or calls `logs_cancel {subscription_id}`. |

### Server states

`stopped` | `starting` | `running` | `restarting` | `failed` | `stopping`

### `reload` semantics

Diff current in-memory configs against on-disk config dir:

- **Added** (new file): start.
- **Removed** (file gone): stop running process.
- **Changed** (content hash differs): stop, then start with new config.
- **Unchanged**: leave alone.
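
The four buckets fall out of a two-pass comparison of name-to-hash maps. In this sketch `ReloadPlan` and `plan_reload` are hypothetical names, and `u64` stands in for whatever content hash the implementation picks.

```rust
use std::collections::BTreeMap;

/// Mirrors the `reload` result shape from the methods table.
#[derive(Debug, Default, PartialEq)]
pub struct ReloadPlan {
    pub added: Vec<String>,
    pub removed: Vec<String>,
    pub changed: Vec<String>,
    pub unchanged: Vec<String>,
}

/// Both maps go from server name (config file stem) to a content hash.
pub fn plan_reload(
    current: &BTreeMap<String, u64>,
    on_disk: &BTreeMap<String, u64>,
) -> ReloadPlan {
    let mut plan = ReloadPlan::default();
    // Pass 1: classify everything that exists on disk.
    for (name, new_hash) in on_disk {
        match current.get(name) {
            None => plan.added.push(name.clone()),
            Some(old_hash) if old_hash != new_hash => plan.changed.push(name.clone()),
            Some(_) => plan.unchanged.push(name.clone()),
        }
    }
    // Pass 2: anything known in memory but gone from disk was removed.
    for name in current.keys() {
        if !on_disk.contains_key(name) {
            plan.removed.push(name.clone());
        }
    }
    plan
}
```

Using `BTreeMap` keeps each bucket sorted by name, which makes the CLI output stable across runs.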

### Error codes

Standard JSON-RPC error objects with our codes:

| Code     | Name             |
|----------|------------------|
| `-32001` | `ServerNotFound` |
| `-32002` | `PortConflict`   |
| `-32003` | `ConfigInvalid`  |
| `-32004` | `AlreadyRunning` |
| `-32005` | `NotRunning`     |
| `-32006` | `SpawnFailed`    |
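
One way to keep the table and the code in sync is a single enum with a `code()` mapping. The enum name `RpcError`, the `from_code` helper, and the hand-formatted JSON (using the variant name as `message`) are assumptions for illustration; the real IPC layer would presumably build the object with `serde_json`.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum RpcError {
    ServerNotFound,
    PortConflict,
    ConfigInvalid,
    AlreadyRunning,
    NotRunning,
    SpawnFailed,
}

impl RpcError {
    /// Numeric code from the table above.
    pub fn code(self) -> i32 {
        match self {
            RpcError::ServerNotFound => -32001,
            RpcError::PortConflict => -32002,
            RpcError::ConfigInvalid => -32003,
            RpcError::AlreadyRunning => -32004,
            RpcError::NotRunning => -32005,
            RpcError::SpawnFailed => -32006,
        }
    }

    /// Inverse mapping, useful on the CLI side.
    pub fn from_code(code: i32) -> Option<Self> {
        [
            RpcError::ServerNotFound,
            RpcError::PortConflict,
            RpcError::ConfigInvalid,
            RpcError::AlreadyRunning,
            RpcError::NotRunning,
            RpcError::SpawnFailed,
        ]
        .into_iter()
        .find(|e| e.code() == code)
    }

    /// Single-line JSON-RPC 2.0 error response for request `id`.
    pub fn to_json(self, id: u64) -> String {
        format!(
            r#"{{"jsonrpc":"2.0","id":{},"error":{{"code":{},"message":"{:?}"}}}}"#,
            id,
            self.code(),
            self
        )
    }
}
```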

## Supervisor state machine

```
   stopped ─── start ──► starting
      ▲                     │
      │                  (spawn)
  (stop_cmd)                │
      │                     ▼
   stopping ◄── stop ─── running ─── child_exit ──► (eval policy)
      │                     ▲                             │
  (SIGTERM,                 │                             │
   grace timer,         (spawn ok)                        │
   SIGKILL)                 │                             │
      │                     │              ┌─ restart ─► restarting ──┐
      ▼                     │              │                          │
   stopped                  └──────────────┤                          │
                                           │                          │
                                           └─ no-restart / cap hit ──► failed
                                                                         │
                                                               start ────┘
                                                               reload
```
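
The `(eval policy)` node in the diagram is a pure decision and can be written as one function. This sketch follows the diagram literally, so both "no-restart" and "cap hit" land in `failed`; whether a clean exit under `never` should instead map to `stopped` is not settled by the spec. `after_child_exit` is a hypothetical name.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ServerState {
    Stopped,
    Starting,
    Running,
    Restarting,
    Failed,
    Stopping,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum RestartPolicy {
    Always,
    OnFailure,
    Never,
}

/// State to enter after a running child exits on its own.
/// `exit_ok` is true for a zero exit status; `cap_hit` is true when the
/// retry window has no attempts left.
pub fn after_child_exit(policy: RestartPolicy, exit_ok: bool, cap_hit: bool) -> ServerState {
    let wants_restart = match policy {
        RestartPolicy::Always => true,
        RestartPolicy::OnFailure => !exit_ok,
        RestartPolicy::Never => false,
    };
    if wants_restart && !cap_hit {
        ServerState::Restarting
    } else {
        // Diagram edge: "no-restart / cap hit ──► failed".
        ServerState::Failed
    }
}
```

Keeping this free of I/O is what lets the unit tests cover the full policy × exit-status matrix without spawning processes.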

### Spawn flow

1. Open / rotate log file (append mode; size threshold 10 MB, keep last 5 generations).
2. Build `tokio::process::Command`:
   - `command`, `args`, merged env, `working-dir`.
   - `kill_on_drop(true)`.
   - `process_group(0)` — own process group so signals don't leak.
3. Spawn. Pipe stdout and stderr.
4. Spin up two log pumps per child:
   - `stdout_pump`: line-buffered → log file + ring buffer + broadcast channel.
   - `stderr_pump`: same, tagged `stderr`.
5. `await child.wait()`. On exit, evaluate restart policy.
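
Step 2 can be sketched with `std::process::Command`, used here as a stand-in for `tokio::process::Command` so the example needs no external crates. `kill_on_drop` is a Tokio-only option and is therefore only mentioned in a comment. `build_command` and its parameter list are hypothetical; the sketch is Unix-only because of `process_group`.

```rust
use std::collections::BTreeMap;
use std::os::unix::process::CommandExt; // provides `process_group` (Unix only)
use std::path::Path;
use std::process::{Command, Stdio};

/// Build, but do not spawn, the child command for one server.
pub fn build_command(
    program: &str,
    args: &[&str],
    env: &BTreeMap<String, String>,
    working_dir: Option<&Path>,
) -> Command {
    let mut cmd = Command::new(program);
    cmd.args(args)
        // Added on top of the inherited environment; these values win on conflict.
        .envs(env)
        // Pipes feed the stdout/stderr log pumps.
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        // 0 = new process group led by the child, so group signals stay contained.
        .process_group(0);
    // With Tokio's Command, `kill_on_drop(true)` would also be set here.
    if let Some(dir) = working_dir {
        // When absent, the child inherits the daemon's cwd.
        cmd.current_dir(dir);
    }
    cmd
}
```

Separating "build" from "spawn" keeps the configuration mapping inspectable in tests through `get_program`, `get_args`, and `get_envs`.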

### Stop flow

1. Send `SIGTERM` to the process group.
2. Start grace timer (`stop.grace`).
3. If the child has not exited when the timer fires, `SIGKILL` the process group.
4. `await child.wait()`.
5. Close log pumps, transition to `stopped`.

### Shutdown (daemon receives SIGTERM/SIGINT)

Broadcast `Shutdown` to all supervisor tasks → each runs its stop flow in
parallel → daemon awaits all with an outer deadline of `2 × max(stop.grace)`
across configs → exit `0`.
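
As a worked instance of the deadline formula: with grace periods of 10s and 25s the outer deadline is 2 × 25s = 50s. A minimal sketch, where `shutdown_deadline` is a hypothetical helper and returning zero for an empty config set is an assumption:

```rust
use std::time::Duration;

/// Outer shutdown deadline: twice the longest `stop.grace` across configs.
pub fn shutdown_deadline(graces: &[Duration]) -> Duration {
    // With no servers configured there is nothing to wait for.
    graces.iter().copied().max().unwrap_or(Duration::ZERO) * 2
}
```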

### Daemon boot

1. Resolve XDG paths, create state directories if missing.
2. Acquire pidfile (fail if another daemon is alive).
3. Load and validate all configs. Fatal on any failure.
4. Bind Unix socket (0600 perms).
5. Spawn one supervisor task per config; send each an immediate `Start`
   (auto-launch behavior).
6. Serve JSON-RPC until shutdown signal.

## Log handling

Per server:

- **Disk file** at `${XDG_STATE_HOME}/xy/logs/<name>.log`. Combined stdout+stderr with a leading tag per line: `[out]` / `[err]`. Size-based rotation: when current file ≥ 10 MB, rename to `<name>.log.1` (shifting older generations), open fresh. Keep at most 5 generations.
- **Ring buffer** in RAM, ~1 MB per server, holds the most recent log lines. Source for `logs --tail` without re-reading disk.
- **Broadcast channel** (`tokio::sync::broadcast`) for live `logs --follow` subscribers. Lagged subscribers are dropped with a warning.
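
The ring buffer and line tagging can be sketched as a byte-budgeted queue. `LogRing` and `tag_line` are hypothetical names; counting only line bytes against the budget (ignoring allocation overhead) and always keeping the newest line even if it alone exceeds the budget are assumptions.

```rust
use std::collections::VecDeque;

/// Prefix a captured line with its stream tag, as in the disk format.
pub fn tag_line(is_stderr: bool, line: &str) -> String {
    let tag = if is_stderr { "[err]" } else { "[out]" };
    format!("{} {}", tag, line)
}

/// Holds the most recent lines within a byte budget (~1 MB in the spec).
pub struct LogRing {
    max_bytes: usize,
    used: usize,
    lines: VecDeque<String>,
}

impl LogRing {
    pub fn new(max_bytes: usize) -> Self {
        Self { max_bytes, used: 0, lines: VecDeque::new() }
    }

    pub fn push(&mut self, line: String) {
        self.used += line.len();
        self.lines.push_back(line);
        // Evict oldest lines until back under budget, but keep the newest one.
        while self.used > self.max_bytes && self.lines.len() > 1 {
            if let Some(old) = self.lines.pop_front() {
                self.used -= old.len();
            }
        }
    }

    /// Last `n` lines in arrival order; backs `logs --tail N`.
    pub fn tail(&self, n: usize) -> Vec<&str> {
        let skip = self.lines.len().saturating_sub(n);
        self.lines.iter().skip(skip).map(String::as_str).collect()
    }
}
```

Each log pump would call `push(tag_line(...))` for every line it reads, alongside the disk write and the broadcast send.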

## CLI surface

`clap` with `derive`. Subcommand structure:

```
xy daemon                            # foreground daemon (logs to stderr)
xy list                              # all configured servers + state
xy status <name>                     # single server detail
xy start <name|--all>
xy stop <name|--all>
xy restart <name|--all>
xy reload
xy logs <name> [--tail N] [--follow]
```

CLI exit codes:

| Code | Meaning |
|------|---------|
| 0 | success |
| 1 | operational error (server not found, port conflict on reload) |
| 2 | daemon unreachable (socket missing or refused) |
| 3 | config invalid |

## Error handling

Two layers, kept separate:

- **Libraries** (`xy-protocol`, `xy-supervisor`, `xy-ipc`): `thiserror` enums per crate. Callers can match on variants (e.g., `SupervisorError::AlreadyRunning`, `ConfigError::DuplicatePort { name_a, name_b, port }`).
- **Binary** (`xy`): `anyhow` for top-level startup and CLI reporting. The IPC layer has one match site translating typed errors into JSON-RPC error objects.

### Fatal vs non-fatal

| Class | Examples | Behavior |
|---|---|---|
| Fatal at daemon startup | socket bind fails; state dir uncreatable; any config invalid; duplicate port | exit non-zero, log to stderr |
| Non-fatal at runtime | child spawn fails; restart cap hit; log file write fails | log, mark server `failed` (or degrade log subsystem), daemon keeps running |

## Testing strategy

### Unit tests

- **`xy-supervisor`**: state-machine transitions using a mock `ChildHandle` trait so tests don't actually spawn processes. Cases:
  - Restart policy decisions (`always` / `on-failure` / `never` × clean/dirty exit).
  - Backoff math (initial, exponential, cap).
  - Retry window (sliding 60s) → `failed` transition.
  - Stop flow: grace timer expires → SIGKILL escalation.
- **`xy-protocol`**: KDL parse cases (minimal, full, invalid). JSON-RPC envelope round-trips. Error-code mapping.

### Integration tests

In `crates/xy/tests/`:

- Spin up the real daemon on a temp socket with temp state and config dirs (per-test `XDG_*` env via `tempfile`).
- Use tiny long-running test-only binaries built in the workspace:
  - `xy-test-sleep-server`: sleeps until SIGTERM, prints periodic lines.
  - `xy-test-exit-immediately`: exits non-zero immediately, used for failure-mode tests.
- Drive the real CLI subcommands; assert on `list` output and observable state transitions.

### CI

```
cargo +nightly fmt --check
cargo clippy --all-targets -- -D warnings
cargo test --all
```

## Future work (out of scope for MVP)

- Container isolation (rootless podman / Docker backend per server).
- HTTP/MCP-aware health probes.
- launchd LaunchAgent install command.
- TUI dashboard (would reuse `xy-protocol` over the same socket).
- macOS status bar app (same).
- Optional auth on the socket if it ever leaves the user's machine.