Daemon refuses to start after unclean shutdown: stale pidfile is treated as a running daemon #1

Closed
opened 2026-06-05 15:28:49 +00:00 by logaritmisk · 0 comments
Owner

Observed

After a machine restart (unclean shutdown), the daemon's pidfile and socket file were left on disk. Starting the daemon again failed with:

another xy daemon appears to be running

…but no daemon process existed. Manual cleanup (deleting the pid and sock files) was required before the daemon would start.

Root cause

PidFile::acquire (crates/xy/src/pidfile.rs) opens the pidfile with create_new(true), so any pre-existing file — live or stale — fails with AlreadyExists. Cleanup relies entirely on Drop, which never runs on power loss / SIGKILL / hard reboot.
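
The failure mode can be reproduced in isolation. The following is a minimal sketch, not the actual contents of pidfile.rs: `acquire_exclusive` and `stale_file_blocks_startup` are hypothetical names, and the PID written into the leftover file is arbitrary.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical stand-in for the current acquire logic: create the pidfile
/// exclusively and record our own PID in it.
fn acquire_exclusive(path: &Path) -> io::Result<File> {
    let mut file = OpenOptions::new()
        .write(true)
        .create_new(true) // fails with AlreadyExists if the path exists at all
        .open(path)?;
    writeln!(file, "{}", std::process::id())?;
    Ok(file)
}

/// Simulates an unclean shutdown: a pidfile is left behind and nothing
/// removes it before the next start. Returns the error kind of that start.
fn stale_file_blocks_startup(path: &Path) -> io::Result<Option<io::ErrorKind>> {
    fs::write(path, "999999\n")?; // leftover from the "previous boot"
    Ok(acquire_exclusive(path).err().map(|e| e.kind()))
}
```

The contents of the leftover file are never inspected, so a file naming a long-dead process is indistinguishable from one owned by a live daemon.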

The socket file is not actually the blocker: bind() in crates/xy-ipc/src/server.rs already removes a pre-existing sock file before binding. But startup aborts at the pidfile check (crates/xy/src/daemon/mod.rs:72) before reaching that point.

Proposed fix

On AlreadyExists:

  1. Read the PID from the existing file.
  2. Check whether that process is alive (kill(pid, 0); an ESRCH error means stale).
  3. If stale (or the file is unreadable/garbage), remove it and retry the create_new acquire (loop, to stay race-safe against a concurrent starter).
  4. If the process is alive, keep the current "already running" error — ideally including the PID in the message.
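
The four steps above could be sketched roughly as follows. This is an illustration under assumptions, not the real `PidFile::acquire`: the free function `acquire`, the helper `process_alive`, the retry bound of 5, and the hand-declared `kill` binding (used so the sketch needs no external crate; real code would more likely use the `libc` or `nix` crate) are all hypothetical. It is Unix-only, and `unsafe extern` needs Rust 1.82 or later.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{self, ErrorKind, Write};
use std::path::Path;

// Hand-written binding to the C library's kill(2); an assumption of this sketch.
unsafe extern "C" {
    fn kill(pid: i32, sig: i32) -> i32;
}

const ESRCH: i32 = 3; // "no such process" on Linux and macOS

/// Step 2: signal 0 runs the existence and permission checks without
/// delivering a signal.
fn process_alive(pid: i32) -> bool {
    if pid <= 0 {
        return false; // 0 and negative values address process groups
    }
    if unsafe { kill(pid, 0) } == 0 {
        return true;
    }
    // EPERM means the process exists but belongs to another user: still alive.
    io::Error::last_os_error().raw_os_error() != Some(ESRCH)
}

fn acquire(path: &Path) -> io::Result<File> {
    // Bounded retry so a pathological race cannot spin forever.
    for _ in 0..5 {
        match OpenOptions::new().write(true).create_new(true).open(path) {
            Ok(mut file) => {
                writeln!(file, "{}", std::process::id())?;
                return Ok(file);
            }
            Err(e) if e.kind() == ErrorKind::AlreadyExists => {
                // Step 1: read the recorded PID; unreadable or garbage
                // contents are treated as stale.
                let recorded = fs::read_to_string(path)
                    .ok()
                    .and_then(|s| s.trim().parse::<i32>().ok());
                if let Some(pid) = recorded.filter(|&p| process_alive(p)) {
                    // Step 4: keep the existing error, now with the PID.
                    return Err(io::Error::new(
                        ErrorKind::AlreadyExists,
                        format!("another xy daemon appears to be running (pid {pid})"),
                    ));
                }
                // Step 3: remove the stale file and retry. NotFound means a
                // concurrent starter removed it first, which is fine.
                match fs::remove_file(path) {
                    Ok(()) => {}
                    Err(e) if e.kind() == ErrorKind::NotFound => {}
                    Err(e) => return Err(e),
                }
            }
            Err(e) => return Err(e),
        }
    }
    Err(io::Error::new(
        ErrorKind::Other,
        "gave up acquiring the pidfile after repeated races",
    ))
}
```

Two caveats apply to this approach in general. A recycled PID makes an unrelated process look like a live daemon, and the window between reading the file and removing it lets one concurrent starter delete the file another has just created. The retry loop narrows that window but does not close it.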

Alternative worth considering: hold an flock() on the pidfile instead of relying on create_new. The kernel releases the lock when the process dies, regardless of how, which makes stale-file detection unnecessary (the file's existence stops being the liveness signal). PID contents stay useful for diagnostics.
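
A minimal sketch of the lock-based variant, under the same assumptions as any hand-rolled FFI: `acquire_locked` is a hypothetical name, the `flock` binding is declared by hand to avoid an external crate, and the constant values are those used on Linux and macOS. Unix-only; `unsafe extern` needs Rust 1.82 or later.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, ErrorKind, Seek, SeekFrom, Write};
use std::os::unix::io::AsRawFd;
use std::path::Path;

// Hand-written binding to flock(2); an assumption of this sketch.
unsafe extern "C" {
    fn flock(fd: i32, operation: i32) -> i32;
}

const LOCK_EX: i32 = 2; // exclusive lock
const LOCK_NB: i32 = 4; // fail immediately instead of blocking

/// The returned handle must stay open for the daemon's lifetime. The kernel
/// drops the lock when the handle is closed or the process dies.
fn acquire_locked(path: &Path) -> io::Result<File> {
    // No create_new and no truncation on open: a leftover file is harmless,
    // only the lock decides who owns it.
    let mut file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .truncate(false)
        .open(path)?;

    if unsafe { flock(file.as_raw_fd(), LOCK_EX | LOCK_NB) } != 0 {
        let err = io::Error::last_os_error();
        if err.kind() == ErrorKind::WouldBlock {
            return Err(io::Error::new(
                ErrorKind::AlreadyExists,
                "another xy daemon appears to be running",
            ));
        }
        return Err(err);
    }

    // We hold the lock, so it is safe to replace the old contents with our PID.
    file.set_len(0)?;
    file.seek(SeekFrom::Start(0))?;
    writeln!(file, "{}", std::process::id())?;
    Ok(file)
}
```

With this design the pidfile never needs to be deleted on shutdown, so `Drop` no longer carries any correctness burden.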

Either way: add a test that simulates the crash case — write a pidfile containing a dead PID, assert the daemon starts and replaces it.
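
One way to build the fixture for such a test is to spawn a short-lived child, reap it, and reuse its PID. The helpers below are a hypothetical sketch (`dead_pid` and `write_stale_pidfile` are made-up names) and assume a `true` binary is available on PATH.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::process::Command;

/// Returns a PID that belonged to a real process which has already exited
/// and been reaped.
fn dead_pid() -> io::Result<u32> {
    let mut child = Command::new("true").spawn()?; // assumes `true` is on PATH
    let pid = child.id();
    child.wait()?; // reaping frees the PID, so it no longer names a process
    Ok(pid)
}

/// Leaves the on-disk state a crashed daemon would: a pidfile naming a
/// process that is gone. Returns the PID that was written.
fn write_stale_pidfile(path: &Path) -> io::Result<u32> {
    let pid = dead_pid()?;
    fs::write(path, format!("{pid}\n"))?;
    Ok(pid)
}
```

A test would call `write_stale_pidfile`, run the real acquire path, and then assert that the file now holds the current process's PID. The kernel could in principle hand the freed PID to a new process before the assertion runs, but that is unlikely within a single test.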

Reference: logaritmisk/xy#1