Files

T

logaritmisk 8f7200aa25 docs: add xy MCP supervisor design spec

Approved design for the MVP: single xy binary with a Cargo workspace
(xy-protocol, xy-supervisor, xy-ipc, xy), Unix socket + newline-delimited
JSON-RPC, per-server KDL configs at XDG paths (XDG on macOS too via
etcetera), supervisor-per-server task model with per-server restart policy,
log capture to disk + ring buffer + broadcast for follow.

MVP commands: daemon, list, status, start/stop/restart (name|--all),
reload, logs. Process-alive supervision only; HTTP/MCP-aware probes,
container isolation, launchd integration, and TUI deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-25 10:51:57 +02:00

15 KiB

Raw Permalink Blame History

xy — HTTP MCP Server Supervisor

Date: 2026-05-25 Status: Approved — ready for implementation planning

Problem

HTTP-based MCP servers (currently two, more likely) need a long-running parent process so they survive terminal closures and can be inspected, restarted, and upgraded without ad-hoc terminal tabs. Today they're launched manually and their lifetime is coupled to a terminal window.

Goals (MVP)

Run as a background daemon on macOS.
Auto-launch every configured MCP server when the daemon starts.
Provide a CLI to start, stop, restart, reload, list, and tail logs.
Per-server restart policy with backoff.
Capture stdout/stderr to rotating log files and an in-memory ring buffer.

Non-goals (deferred)

Container isolation (planned for a later phase).
TUI dashboard.
macOS status bar app.
HTTP/MCP-level health probes.
Auto-start at login via launchd (manual daemon launch only for MVP).
Remote management (everything is local-socket only).

Architecture

                 ┌──────────────────────────────┐
                 │     xy daemon (process)       │
                 │                                │
   xy CLI ──────►│  JSON-RPC server               │
  (Unix socket)  │       │                        │
                 │       ▼                        │
                 │  Command handlers              │
                 │       │                        │
                 │       ▼                        │
                 │  Supervisor (one task per      │
                 │   managed server):             │
                 │     spawn → wait → restart     │
                 │     per per-server policy      │
                 │       │                        │
                 │       ▼                        │
                 │  Log capture: stdout/stderr ──►│──► $XDG_STATE_HOME/xy/logs/<name>.log
                 │  Ring buffer (in RAM)          │
                 └──────────────────────────────┘
                            │
                            ▼
                  Child MCP server processes
                  (HTTP, fixed port from KDL)

Filesystem layout

XDG semantics on both Linux and macOS (no ~/Library/Application Support). Use the etcetera crate's Xdg strategy, or hand-rolled env-var resolution.

Purpose	Path
Configs	`${XDG_CONFIG_HOME:-~/.config}/xy/servers/*.kdl`
Logs	`${XDG_STATE_HOME:-~/.local/state}/xy/logs/<name>.log`
Socket	`${XDG_RUNTIME_DIR}/xy.sock` if set, else `${XDG_STATE_HOME:-~/.local/state}/xy/xy.sock`
Pidfile	`${XDG_STATE_HOME:-~/.local/state}/xy/xy.pid`

Socket permissions: 0600.

Concurrency model

Tokio multi-thread runtime. Each managed server owns one supervisor task that holds the canonical state for that server. RPC handlers communicate with supervisor tasks via channels:

mpsc::Sender<SupervisorCmd> — Start, Stop, Restart, Shutdown, Reconfigure(ServerConfig).
watch::Receiver<ServerState> — outsiders observe current state without locks.
broadcast::Sender<LogLine> — live logs --follow subscribers.

No shared mutexes for server state; the supervisor task is the owner.

Crate layout (Cargo workspace)

xy/
├── Cargo.toml                       # workspace manifest
├── crates/
│   ├── xy-protocol/                 # JSON-RPC types + KDL config schema (lib)
│   ├── xy-supervisor/               # process lifecycle, restart policy, log capture (lib)
│   ├── xy-ipc/                      # socket framing + JSON-RPC client/server (lib)
│   └── xy/                          # binary: clap CLI + daemon command, wires it all together
└── docs/superpowers/specs/

Single xy binary; xy daemon runs the supervisor in-process, all other subcommands act as JSON-RPC clients.

Dependencies

tokio (features: rt-multi-thread, net, process, signal, sync, fs, io-util, macros)
clap with derive feature
serde, serde_json
kdl (KDL parser) with a small typed schema wrapper
tracing, tracing-subscriber (env-filter)
thiserror (libraries), anyhow (binary)
etcetera (XDG paths, works correctly on macOS)
nix (SIGTERM/SIGKILL, process groups)

Format with cargo +nightly fmt. Lint with cargo clippy --all-targets -- -D warnings.

KDL config schema

One file per server: ${XDG_CONFIG_HOME}/xy/servers/<name>.kdl. Filename stem is the canonical server name; the file itself does not repeat it.

Example ~/.config/xy/servers/insikt.kdl:

command "/Users/olsson/.cargo/bin/insikt-mcp"
args "--http" "--port" "8421"
port 8421

env {
    RUST_LOG "info"
    INSIKT_DATA_DIR "/Users/olsson/.local/share/insikt"
}

working-dir "/Users/olsson/Laboratory/insikt"

restart {
    policy "on-failure"      // "always" | "on-failure" | "never"
    backoff-initial "1s"
    backoff-max "30s"
    max-retries-per-minute 5
}

stop {
    grace "10s"              // SIGTERM, then SIGKILL after this
}

Field semantics

Field	Required	Default	Notes
`command`	yes	—	Absolute path to executable.
`args`	no	`[]`	String list.
`port`	yes	—	Informational; xy doesn't bind it. Used for `list` display and load-time conflict detection across configs.
`env`	no	`{}`	Merged onto inherited parent env; KDL wins on conflict.
`working-dir`	no	daemon's cwd	Process working directory.
`restart.policy`	no	`on-failure`	`always` \| `on-failure` \| `never`.
`restart.backoff-initial`	no	`1s`	Humantime duration.
`restart.backoff-max`	no	`30s`	Cap for exponential backoff.
`restart.max-retries-per-minute`	no	`5`	Sliding-60s window. Exceeded → `failed`.
`stop.grace`	no	`10s`	SIGTERM → wait → SIGKILL window.

Validation at load

Every file must parse and produce a complete ServerConfig.
No two configs may declare the same port.
command must exist and be executable (warn but allow if not — child spawn will fail and supervisor will mark failed).

Validation failures at daemon startup are fatal (exit non-zero). Failures during reload are returned to the CLI client as JSON-RPC errors; the daemon keeps running.

JSON-RPC protocol

Transport: Unix socket, newline-delimited JSON (one JSON-RPC 2.0 message per line).

Methods

Method	Params	Result
`list`	—	`[{name, state, pid?, port, uptime_secs?, restart_count, last_exit?}]`
`status`	`{name}`	single entry as above + recent state transitions
`start`	`{name}` or `{all: true}`	`{started: [...], already_running: [...]}`
`stop`	`{name}` or `{all: true}`	`{stopped: [...], not_running: [...]}`
`restart`	`{name}` or `{all: true}`	`{restarted: [...]}`
`reload`	—	`{added: [...], removed: [...], changed: [...], unchanged: [...]}`
`logs`	`{name, tail?: u32, follow?: bool}`	Initial response `{subscription_id}`; the daemon then sends JSON-RPC notifications `log` `{subscription_id, name, stream, line, ts}` for each line. A final `log_end` notification `{subscription_id}` closes the stream. For non-`follow`, `log_end` fires after the buffered tail. For `follow`, the stream stays open until the client closes the connection or calls `logs_cancel {subscription_id}`.

Server states

`reload` semantics

Diff current in-memory configs against on-disk config dir:

Added (new file): start.
Removed (file gone): stop running process.
Changed (content hash differs): stop, then start with new config.
Unchanged: leave alone.

Error codes

Standard JSON-RPC error objects with our codes:

Code	Name
`-32001`	`ServerNotFound`
`-32002`	`PortConflict`
`-32003`	`ConfigInvalid`
`-32004`	`AlreadyRunning`
`-32005`	`NotRunning`
`-32006`	`SpawnFailed`

Supervisor state machine

                          stopped ─── start ──► starting
                             ▲                     │
                             │                  (spawn)
                          (stop_cmd)               │
                             │                     ▼
                          stopping ◄── stop ─── running ─── child_exit ──► (eval policy)
                             │                     ▲                            │
                          (SIGTERM,                │                            │
                           grace timer,        (spawn ok)                       │
                           SIGKILL)                │                            │
                             │                    │              ┌─ restart ─► restarting ──┐
                             ▼                    │              │                          │
                          stopped                 └──────────────┤                          │
                                                                 │                          │
                                                                 └─ no-restart / cap hit ──► failed
                                                                                            │
                                                                                  start ────┘
                                                                                  reload

Spawn flow

Open / rotate log file (append mode; size threshold 10 MB, keep last 5 generations).
Build tokio::process::Command:
- command, args, merged env, working-dir.
- kill_on_drop(true).
- process_group(0) — own process group so signals don't leak.
Spawn. Pipe stdout and stderr.
Spin up two log pumps per child:
- stdout_pump: line-buffered → log file + ring buffer + broadcast channel.
- stderr_pump: same, tagged stderr.
await child.wait(). On exit, evaluate restart policy.

Stop flow

Send SIGTERM to the process group.
Start grace timer (stop.grace).
On timer fire, SIGKILL the process group.
await child.wait().
Close log pumps, transition to stopped.

Shutdown (daemon receives SIGTERM/SIGINT)

Broadcast Shutdown to all supervisor tasks → each runs its stop flow in parallel → daemon awaits all with an outer deadline of 2 × max(stop.grace) across configs → exit 0.

Daemon boot

Resolve XDG paths, create state directories if missing.
Acquire pidfile (fail if another daemon is alive).
Load and validate all configs. Fatal on any failure.
Bind Unix socket (0600 perms).
Spawn one supervisor task per config; send each an immediate Start (auto-launch behavior).
Serve JSON-RPC until shutdown signal.

Log handling

Per server:

Disk file at ${XDG_STATE_HOME}/xy/logs/<name>.log. Combined stdout+stderr with a leading tag per line: [out] / [err]. Size-based rotation: when current file ≥ 10 MB, rename to <name>.log.1 (shifting older generations), open fresh. Keep at most 5 generations.
Ring buffer in RAM, ~1 MB per server, holds the most recent log lines. Source for logs --tail without re-reading disk.
Broadcast channel (tokio::sync::broadcast) for live logs --follow subscribers. Lagged subscribers are dropped with a warning.

CLI surface

clap with derive. Subcommand structure:

xy daemon                     # foreground daemon (logs to stderr)
xy list                       # all configured servers + state
xy status <name>              # single server detail
xy start  <name|--all>
xy stop   <name|--all>
xy restart <name|--all>
xy reload
xy logs <name> [--tail N] [--follow]

CLI exit codes:

Code	Meaning
0	success
1	operational error (server not found, port conflict on reload)
2	daemon unreachable (socket missing or refused)
3	config invalid

Error handling

Two layers, kept separate:

Libraries (xy-protocol, xy-supervisor, xy-ipc): thiserror enums per crate. Callers can match on variants (e.g., SupervisorError::AlreadyRunning, ConfigError::DuplicatePort { name_a, name_b, port }).
Binary (xy): anyhow for top-level startup and CLI reporting. The IPC layer has one match site translating typed errors into JSON-RPC error objects.

Fatal vs non-fatal

Class	Examples	Behavior
Fatal at daemon startup	socket bind fails; state dir uncreatable; any config invalid; duplicate port	exit non-zero, log to stderr
Non-fatal at runtime	child spawn fails; restart cap hit; log file write fails	log, mark server `failed` (or degrade log subsystem), daemon keeps running

Testing strategy

Unit tests

xy-supervisor: state-machine transitions using a mock ChildHandle trait so tests don't actually spawn processes. Cases:
- Restart policy decisions (always / on-failure / never × clean/dirty exit).
- Backoff math (initial, exponential, cap).
- Retry window (sliding 60s) → failed transition.
- Stop flow: grace timer expires → SIGKILL escalation.
xy-protocol: KDL parse cases (minimal, full, invalid). JSON-RPC envelope round-trips. Error-code mapping.

Integration tests

In crates/xy/tests/:

Spin up the real daemon on a temp socket with temp state and config dirs (per-test XDG_* env via tempfile).
Use tiny long-running test-only binaries built in the workspace:
- xy-test-sleep-server: sleeps until SIGTERM, prints periodic lines.
- xy-test-exit-immediately: exits non-zero immediately, used for failure-mode tests.
Drive the real CLI subcommands; assert on list output and observable state transitions.

CI

cargo +nightly fmt --check
cargo clippy --all-targets -- -D warnings
cargo test --all

Future work (out of scope for MVP)

Container isolation (rootless podman / Docker backend per server).
HTTP/MCP-aware health probes.
launchd LaunchAgent install command.
TUI dashboard (would reuse xy-protocol over the same socket).
macOS status bar app (same).
Optional auth on the socket if it ever leaves the user's machine.

15 KiB Raw Permalink Blame History Unescape Escape