Files
xy/docs/superpowers/specs/2026-05-25-xy-mcp-supervisor-design.md
logaritmisk 8f7200aa25 docs: add xy MCP supervisor design spec
Approved design for the MVP: single xy binary with a Cargo workspace
(xy-protocol, xy-supervisor, xy-ipc, xy), Unix socket + newline-delimited
JSON-RPC, per-server KDL configs at XDG paths (XDG on macOS too via
etcetera), supervisor-per-server task model with per-server restart policy,
log capture to disk + ring buffer + broadcast for follow.

MVP commands: daemon, list, status, start/stop/restart (name|--all),
reload, logs. Process-alive supervision only; HTTP/MCP-aware probes,
container isolation, launchd integration, and TUI deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 10:51:57 +02:00

15 KiB
Raw Permalink Blame History

xy — HTTP MCP Server Supervisor

Date: 2026-05-25 Status: Approved — ready for implementation planning

Problem

HTTP-based MCP servers (currently two, more likely) need a long-running parent process so they survive terminal closures and can be inspected, restarted, and upgraded without ad-hoc terminal tabs. Today they're launched manually and their lifetime is coupled to a terminal window.

Goals (MVP)

  • Run as a background daemon on macOS.
  • Auto-launch every configured MCP server when the daemon starts.
  • Provide a CLI to start, stop, restart, reload, list, and tail logs.
  • Per-server restart policy with backoff.
  • Capture stdout/stderr to rotating log files and an in-memory ring buffer.

Non-goals (deferred)

  • Container isolation (planned for a later phase).
  • TUI dashboard.
  • macOS status bar app.
  • HTTP/MCP-level health probes.
  • Auto-start at login via launchd (manual daemon launch only for MVP).
  • Remote management (everything is local-socket only).

Architecture

                 ┌──────────────────────────────┐
                 │     xy daemon (process)       │
                 │                                │
   xy CLI ──────►│  JSON-RPC server               │
  (Unix socket)  │       │                        │
                 │       ▼                        │
                 │  Command handlers              │
                 │       │                        │
                 │       ▼                        │
                 │  Supervisor (one task per      │
                 │   managed server):             │
                 │     spawn → wait → restart     │
                 │     per per-server policy      │
                 │       │                        │
                 │       ▼                        │
                 │  Log capture: stdout/stderr ──►│──► $XDG_STATE_HOME/xy/logs/<name>.log
                 │  Ring buffer (in RAM)          │
                 └──────────────────────────────┘
                            │
                            ▼
                  Child MCP server processes
                  (HTTP, fixed port from KDL)

Filesystem layout

XDG semantics on both Linux and macOS (no ~/Library/Application Support). Use the etcetera crate's Xdg strategy, or hand-rolled env-var resolution.

Purpose Path
Configs ${XDG_CONFIG_HOME:-~/.config}/xy/servers/*.kdl
Logs ${XDG_STATE_HOME:-~/.local/state}/xy/logs/<name>.log
Socket ${XDG_RUNTIME_DIR}/xy.sock if set, else ${XDG_STATE_HOME:-~/.local/state}/xy/xy.sock
Pidfile ${XDG_STATE_HOME:-~/.local/state}/xy/xy.pid

Socket permissions: 0600.

Concurrency model

Tokio multi-thread runtime. Each managed server owns one supervisor task that holds the canonical state for that server. RPC handlers communicate with supervisor tasks via channels:

  • mpsc::Sender<SupervisorCmd>Start, Stop, Restart, Shutdown, Reconfigure(ServerConfig).
  • watch::Receiver<ServerState> — outsiders observe current state without locks.
  • broadcast::Sender<LogLine> — live logs --follow subscribers.

No shared mutexes for server state; the supervisor task is the owner.

Crate layout (Cargo workspace)

xy/
├── Cargo.toml                       # workspace manifest
├── crates/
│   ├── xy-protocol/                 # JSON-RPC types + KDL config schema (lib)
│   ├── xy-supervisor/               # process lifecycle, restart policy, log capture (lib)
│   ├── xy-ipc/                      # socket framing + JSON-RPC client/server (lib)
│   └── xy/                          # binary: clap CLI + daemon command, wires it all together
└── docs/superpowers/specs/

Single xy binary; xy daemon runs the supervisor in-process, all other subcommands act as JSON-RPC clients.

Dependencies

  • tokio (features: rt-multi-thread, net, process, signal, sync, fs, io-util, macros)
  • clap with derive feature
  • serde, serde_json
  • kdl (KDL parser) with a small typed schema wrapper
  • tracing, tracing-subscriber (env-filter)
  • thiserror (libraries), anyhow (binary)
  • etcetera (XDG paths, works correctly on macOS)
  • nix (SIGTERM/SIGKILL, process groups)

Format with cargo +nightly fmt. Lint with cargo clippy --all-targets -- -D warnings.

KDL config schema

One file per server: ${XDG_CONFIG_HOME}/xy/servers/<name>.kdl. Filename stem is the canonical server name; the file itself does not repeat it.

Example ~/.config/xy/servers/insikt.kdl:

command "/Users/olsson/.cargo/bin/insikt-mcp"
args "--http" "--port" "8421"
port 8421

env {
    RUST_LOG "info"
    INSIKT_DATA_DIR "/Users/olsson/.local/share/insikt"
}

working-dir "/Users/olsson/Laboratory/insikt"

restart {
    policy "on-failure"      // "always" | "on-failure" | "never"
    backoff-initial "1s"
    backoff-max "30s"
    max-retries-per-minute 5
}

stop {
    grace "10s"              // SIGTERM, then SIGKILL after this
}

Field semantics

Field Required Default Notes
command yes Absolute path to executable.
args no [] String list.
port yes Informational; xy doesn't bind it. Used for list display and load-time conflict detection across configs.
env no {} Merged onto inherited parent env; KDL wins on conflict.
working-dir no daemon's cwd Process working directory.
restart.policy no on-failure always | on-failure | never.
restart.backoff-initial no 1s Humantime duration.
restart.backoff-max no 30s Cap for exponential backoff.
restart.max-retries-per-minute no 5 Sliding-60s window. Exceeded → failed.
stop.grace no 10s SIGTERM → wait → SIGKILL window.

Validation at load

  • Every file must parse and produce a complete ServerConfig.
  • No two configs may declare the same port.
  • command must exist and be executable (warn but allow if not — child spawn will fail and supervisor will mark failed).

Validation failures at daemon startup are fatal (exit non-zero). Failures during reload are returned to the CLI client as JSON-RPC errors; the daemon keeps running.

JSON-RPC protocol

Transport: Unix socket, newline-delimited JSON (one JSON-RPC 2.0 message per line).

Methods

Method Params Result
list [{name, state, pid?, port, uptime_secs?, restart_count, last_exit?}]
status {name} single entry as above + recent state transitions
start {name} or {all: true} {started: [...], already_running: [...]}
stop {name} or {all: true} {stopped: [...], not_running: [...]}
restart {name} or {all: true} {restarted: [...]}
reload {added: [...], removed: [...], changed: [...], unchanged: [...]}
logs {name, tail?: u32, follow?: bool} Initial response {subscription_id}; the daemon then sends JSON-RPC notifications log {subscription_id, name, stream, line, ts} for each line. A final log_end notification {subscription_id} closes the stream. For non-follow, log_end fires after the buffered tail. For follow, the stream stays open until the client closes the connection or calls logs_cancel {subscription_id}.

Server states

stopped | starting | running | restarting | failed | stopping

reload semantics

Diff current in-memory configs against on-disk config dir:

  • Added (new file): start.
  • Removed (file gone): stop running process.
  • Changed (content hash differs): stop, then start with new config.
  • Unchanged: leave alone.

Error codes

Standard JSON-RPC error objects with our codes:

Code Name
-32001 ServerNotFound
-32002 PortConflict
-32003 ConfigInvalid
-32004 AlreadyRunning
-32005 NotRunning
-32006 SpawnFailed

Supervisor state machine

                          stopped ─── start ──► starting
                             ▲                     │
                             │                  (spawn)
                          (stop_cmd)               │
                             │                     ▼
                          stopping ◄── stop ─── running ─── child_exit ──► (eval policy)
                             │                     ▲                            │
                          (SIGTERM,                │                            │
                           grace timer,        (spawn ok)                       │
                           SIGKILL)                │                            │
                             │                    │              ┌─ restart ─► restarting ──┐
                             ▼                    │              │                          │
                          stopped                 └──────────────┤                          │
                                                                 │                          │
                                                                 └─ no-restart / cap hit ──► failed
                                                                                            │
                                                                                  start ────┘
                                                                                  reload

Spawn flow

  1. Open / rotate log file (append mode; size threshold 10 MB, keep last 5 generations).
  2. Build tokio::process::Command:
    • command, args, merged env, working-dir.
    • kill_on_drop(true).
    • process_group(0) — own process group so signals don't leak.
  3. Spawn. Pipe stdout and stderr.
  4. Spin up two log pumps per child:
    • stdout_pump: line-buffered → log file + ring buffer + broadcast channel.
    • stderr_pump: same, tagged stderr.
  5. await child.wait(). On exit, evaluate restart policy.

Stop flow

  1. Send SIGTERM to the process group.
  2. Start grace timer (stop.grace).
  3. On timer fire, SIGKILL the process group.
  4. await child.wait().
  5. Close log pumps, transition to stopped.

Shutdown (daemon receives SIGTERM/SIGINT)

Broadcast Shutdown to all supervisor tasks → each runs its stop flow in parallel → daemon awaits all with an outer deadline of 2 × max(stop.grace) across configs → exit 0.

Daemon boot

  1. Resolve XDG paths, create state directories if missing.
  2. Acquire pidfile (fail if another daemon is alive).
  3. Load and validate all configs. Fatal on any failure.
  4. Bind Unix socket (0600 perms).
  5. Spawn one supervisor task per config; send each an immediate Start (auto-launch behavior).
  6. Serve JSON-RPC until shutdown signal.

Log handling

Per server:

  • Disk file at ${XDG_STATE_HOME}/xy/logs/<name>.log. Combined stdout+stderr with a leading tag per line: [out] / [err]. Size-based rotation: when current file ≥ 10 MB, rename to <name>.log.1 (shifting older generations), open fresh. Keep at most 5 generations.
  • Ring buffer in RAM, ~1 MB per server, holds the most recent log lines. Source for logs --tail without re-reading disk.
  • Broadcast channel (tokio::sync::broadcast) for live logs --follow subscribers. Lagged subscribers are dropped with a warning.

CLI surface

clap with derive. Subcommand structure:

xy daemon                     # foreground daemon (logs to stderr)
xy list                       # all configured servers + state
xy status <name>              # single server detail
xy start  <name|--all>
xy stop   <name|--all>
xy restart <name|--all>
xy reload
xy logs <name> [--tail N] [--follow]

CLI exit codes:

Code Meaning
0 success
1 operational error (server not found, port conflict on reload)
2 daemon unreachable (socket missing or refused)
3 config invalid

Error handling

Two layers, kept separate:

  • Libraries (xy-protocol, xy-supervisor, xy-ipc): thiserror enums per crate. Callers can match on variants (e.g., SupervisorError::AlreadyRunning, ConfigError::DuplicatePort { name_a, name_b, port }).
  • Binary (xy): anyhow for top-level startup and CLI reporting. The IPC layer has one match site translating typed errors into JSON-RPC error objects.

Fatal vs non-fatal

Class Examples Behavior
Fatal at daemon startup socket bind fails; state dir uncreatable; any config invalid; duplicate port exit non-zero, log to stderr
Non-fatal at runtime child spawn fails; restart cap hit; log file write fails log, mark server failed (or degrade log subsystem), daemon keeps running

Testing strategy

Unit tests

  • xy-supervisor: state-machine transitions using a mock ChildHandle trait so tests don't actually spawn processes. Cases:
    • Restart policy decisions (always / on-failure / never × clean/dirty exit).
    • Backoff math (initial, exponential, cap).
    • Retry window (sliding 60s) → failed transition.
    • Stop flow: grace timer expires → SIGKILL escalation.
  • xy-protocol: KDL parse cases (minimal, full, invalid). JSON-RPC envelope round-trips. Error-code mapping.

Integration tests

In crates/xy/tests/:

  • Spin up the real daemon on a temp socket with temp state and config dirs (per-test XDG_* env via tempfile).
  • Use tiny long-running test-only binaries built in the workspace:
    • xy-test-sleep-server: sleeps until SIGTERM, prints periodic lines.
    • xy-test-exit-immediately: exits non-zero immediately, used for failure-mode tests.
  • Drive the real CLI subcommands; assert on list output and observable state transitions.

CI

cargo +nightly fmt --check
cargo clippy --all-targets -- -D warnings
cargo test --all

Future work (out of scope for MVP)

  • Container isolation (rootless podman / Docker backend per server).
  • HTTP/MCP-aware health probes.
  • launchd LaunchAgent install command.
  • TUI dashboard (would reuse xy-protocol over the same socket).
  • macOS status bar app (same).
  • Optional auth on the socket if it ever leaves the user's machine.