Skip to content

PR 2/2: supervised systemd service + @restart#234

Open
spierenburg wants to merge 13 commits into
feat/module-reloadfrom
feat/service-supervision
Open

PR 2/2: supervised systemd service + @restart#234
spierenburg wants to merge 13 commits into
feat/module-reloadfrom
feat/service-supervision

Conversation

@spierenburg
Copy link
Copy Markdown
Collaborator

Summary

Runs matterbot/matterfeed as supervised systemd services with an operator-gated @restart/!restart. The @update/self-update option was deliberately cut — there is no chat-triggered code path; code updates remain a plain git pull on the host (then @reload from PR 1 for modules).

  • systemd units (contrib/systemd/): matterbot.service + matterfeed.service, one dedicated unprivileged matterbot user, hardened sandbox (NoNewPrivileges, ProtectSystem=strict, ProtectHome, PrivateTmp, empty CapabilityBoundingSet, ReadWritePaths=/var/lib/matterbot), Restart=always + StartLimitBurst crash-loop backstop. No updater unit / polkit / sudoers / second deploy account / read-only-tree (all were @update-only machinery).
  • @restart (isoperator-gated = botadmins ∪ botoperators): writes a marker, posts "Restarting now…", then cleanly exits (request_shutdown → websocket close → run_forever returns → sys.exit(0) → systemd Restart=always). On a non-systemd host it os.execv-re-execs instead. On first reconnect the bot posts "Back up. ✅" to the originating thread.
  • runtime_config.py: configure_logging (logfile of null/""/"-" → stdout/journald; a path → file, the legacy default) and resolve_feedmap (single source of truth) wired into both processes — this also fixes matterfeed's pre-existing feedmap split-brain (it read Modules.feedmap but wrote Matterbot.feedmap).
  • New config: lifecyclecmds gains !restart/@restart; botoperators: []; service_manager: systemd|none.

lifecycle.py (pure/tested) holds detect_service_manager, the restart-marker IO, and self_reexec. 27 tests pass (PR 1's 10 + 17 new); independent final review found 0 Critical / 0 Important and confirmed no @update/Path-B leakage.

Stacked PR. Base is feat/module-reload (PR #233) because PR 2 extends the lifecyclecmds config and the handle_post lifecycle branch PR 1 introduces. Review/merge #233 first, then this auto-retargets to main (rebase is trivial).

Test Plan

Automated (passing): pytest -q → 27 passed; matterbot.py/matterfeed.py compile; imports clean; ruff F821 clean.

Manual — needs a systemd host, a Mattermost operator (Matterbot.botoperators or botadmins) and a plain-user account, with lifecyclecmds (incl. @restart) in the deployed config.yaml:

  • Install per contrib/systemd/README.md; systemctl enable --now; journalctl -u matterbot shows startup (validates logfile: "-")
  • Operator @restart → "Restarting now…", clean exit, systemd relaunch, "Back up. ✅" in-thread
  • Plain user @restart → permission-denied, no restart
  • @reload (PR 1) still works for an admin; non-admin denied (regression check)
  • systemctl stop matterbot → stays stopped (Restart=always doesn't fight an explicit stop)
  • Crash-loop a broken start → unit failed after StartLimitBurst (no infinite flap)
  • service_manager: none from a shell → @restart re-execs, bot returns, "Back up. ✅" posts

Dependency-free helpers: configure_logging (logfile null/""/"-" → stdout
for journald, path → file as before) and resolve_feedmap (Matterbot.feedmap
single source of truth, Modules.feedmap deprecated fallback). 5 tests.
… feedmap key)

matterbot.py & matterfeed.py __main__ logging now go through
runtime_config.configure_logging (logfile sentinel → journald). Both
processes resolve the feedmap path via runtime_config.resolve_feedmap;
this fixes the pre-existing matterfeed split-brain (it read
Modules.feedmap but wrote Matterbot.feedmap). 15 tests green.
lifecyclecmds gains !restart/@restart (reload tokens kept from PR1).
Adds botoperators (ids/role names allowed to @restart) and
service_manager (systemd|none). Annotates the logfile sentinel and the
feedmap single-source-of-truth note.
isoperator(userid) = botadmins ∪ botoperators (role names or ids),
delegating to isadmin for the admin superset. Gates @restart in a later
task. 4 tests; 19 suite green.
…_reexec)

detect_service_manager (systemd only if INVOCATION_ID present; explicit
'none' wins), state_dir, write/read/clear_restart_marker (for the
post-restart "back up" notice), and self_reexec (non-systemd @restart
fallback). Pure/importable. 6 tests; 25 suite green. (T5+T6 committed
together so the file lands ruff-clean — every import is used.)
__init__ gains _shutdown_requested/_exit_intent. request_shutdown()
stops the websocket; run_forever() returns (skipping backoff/reconnect)
when the flag is set; __main__ installs a SIGUSR1 handler and exits 0
after run_forever returns so systemd Restart=always relaunches on the
on-disk code. 2 tests; 27 suite green.
Restructures the PR1 lifecycle elif into an intent dispatch: @restart →
isoperator-gated → write restart marker, post "Restarting now…", then
request_shutdown (systemd) or self_reexec (no supervisor). reload
branches untouched and still reachable; unknown lifecycle tokens inert.
27 suite green.
T8's @restart dispatch calls lifecycle.detect_service_manager/
write_restart_marker/self_reexec but matterbot.py never imported
lifecycle — a latent NameError on every @restart (py_compile/pytest/
import all pass since no test exercises that path; ruff F821 caught
it). 27 suite green; the three callsites confirmed resolvable.
report_restart() reads the restart marker on first connect and posts
"Back up. ✅" to the originating thread, then clears the marker. Called
once per process in run_forever (guarded), after welcome-reconcile and
before websocket connect. 27 suite green.
contrib/systemd/{matterbot,matterfeed}.service: Type=simple, one
unprivileged matterbot user, hardened sandbox (NoNewPrivileges,
ProtectSystem=strict, ProtectHome, PrivateTmp, empty CapabilityBoundingSet,
ReadWritePaths=/var/lib/matterbot), Restart=always + StartLimit
crash-loop backstop. README.md documents the single-account install,
config paths (journald via logfile:"-"), and @restart behaviour. No
updater unit / polkit / deploy account (@update was cut).
Replaces the "no daemonization" stance: tmux/screen still supported, but
systemd (contrib/systemd/README.md) recommended for production —
unprivileged user, hardened sandbox, operator-gated @restart. Notes the
non-systemd re-exec fallback and that there is no chat-triggered code
path (git pull on the host + @reload for modules).
M1 — isadmin/isoperator now return an explicit bool on the success path
and False (never None) in the except handler, so the lifecycle gate is
fail-closed *and* unambiguous (no None-is-falsy fall-through from the
admin sub-check). Also bind `log` at module scope: it was assigned only
inside `if __name__ == '__main__'`, so the except path raised NameError
instead of returning False whenever MattermostManager was used outside
the script entrypoint (unit tests, imports). getLogger is a singleton —
the entrypoint still attaches handlers/level.

M2 — restart marker is now written with O_NOFOLLOW + O_CREAT and fchmod
0o600, and read with O_RDONLY|O_NOFOLLOW (refuses a planted symlink,
fail-closed to "no marker"). A local host user can no longer steer the
post-restart confirmation by pre-creating/symlinking the marker.
StateDirectoryMode=0750 added to both systemd units so the state dir is
not world-traversable.

Tests assert the invariants: plain-user/API-error -> exactly False,
marker mode == 0o600, read and write both refuse symlinks. 32 passed.
L3 — split test deps out of the runtime requirements: requirements.txt
no longer pulls pytest into the production venv (the systemd README
installs requirements.txt only). New requirements-dev.txt pins
pytest>=8.0,<9.0.

L4 — the bare `except: # ... create an empty map instead` in
load_bindmap/load_feedmap was a lying comment: the code re-raises (and
should — a corrupt existing map must fail loud, not be silently blanked,
which would drop every binding unnoticed; a missing map is handled by
is_file() and never reaches the handler). Comment now matches behaviour;
`except:` tightened to `except Exception:`.

L5 — load_command_table silently overwrote on a case-insensitive command
dir-name collision (Foo/ vs foo/), so a command the operator believes is
loaded would vanish. The collision is now recorded in report["failed"]
and the first dir walked is kept. Test added (walk simulated so it is
host-FS-independent).

L6 — replaced command.lstrip('!@') (strips any leading run of !/@, so
"@@restart" -> "restart") with a single exact-prefix strip computed once.
Unreachable today (dispatch is gated on exact membership in
lifecyclecmds) but removes the latent fragility if that list changes. No
dedicated test: the branch is provably unreachable, a test would only
exercise mocked-up scaffolding.

33 passed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant