fix(startup): refuse to start when another ccbot already holds the lock#86

Merged
Time4Mind merged 1 commit into main from fix/singleton-flock-mutex
May 16, 2026
@Time4Mind
Owner

Summary

  • Telegram's getUpdates long-poll is exclusive per bot token: a second instance silently steals updates and the original spams telegram.error.Conflict: terminated by other getUpdates request on every poll. Today we hit this with two ccbot processes running side by side — one under ccbot-supervisor.sh in tmux ccbot:__main__, the other started ~1d earlier from a NetHunter-terminal shell outside tmux. User-visible behaviour: message delays, ghost responses, card edits failing.
  • Fix: main.py now acquires an exclusive fcntl.flock(LOCK_EX | LOCK_NB) on $CCBOT_DIR/ccbot.lock BEFORE any tmux / bot startup. FD_CLOEXEC is set so the lock doesn't leak into subprocess / asyncio.subprocess children. Handle lives at module scope so the OS holds the lock for the whole process lifetime.
  • A contending instance hits OSError on flock → logs the lock path, prints to stderr (supervisor captures it), sys.exit(1). The supervisor's restart-backoff loop just waits for the existing instance to die. No more silent getUpdates cross-fire.
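The acquire path described above can be sketched roughly as follows. This is a minimal illustration, not the code from `main.py`: the function name `acquire_singleton_lock`, the module-level `_lock_handle`, and the exact log wording are assumptions. Python 3.4+ already opens descriptors as non-inheritable, so setting `FD_CLOEXEC` explicitly is belt-and-braces.

```python
import fcntl
import sys
from pathlib import Path
from typing import IO, Optional

# Hypothetical module-level handle: keeping a reference here keeps the
# descriptor open, and therefore the lock held, until the process exits.
_lock_handle: Optional[IO[str]] = None


def acquire_singleton_lock(lock_path: Path) -> IO[str]:
    """Take an exclusive, non-blocking flock on lock_path or exit with status 1."""
    global _lock_handle
    # First launch: the state directory may not exist yet.
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    handle = open(lock_path, "a+")
    fd = handle.fileno()
    # Make sure the descriptor is closed on exec so children cannot inherit it.
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        # Another open file description already holds the lock.
        handle.close()
        print(f"another instance holds {lock_path}: {exc}", file=sys.stderr)
        sys.exit(1)
    _lock_handle = handle
    return handle
```

The call would sit at the very top of startup, before any tmux or bot work, for example `acquire_singleton_lock(Path(ccbot_dir) / "ccbot.lock")` with `ccbot_dir` standing in for however `$CCBOT_DIR` is resolved.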

Test plan

  • +4 unit tests in test_singleton_lock.py: happy-path acquire, contention → SystemExit(1), re-acquire after release (process death suffices — no stale-lock sweep needed), parent-directory creation.
  • Full suite 473/473 green; ruff + pyright clean.
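The "process death suffices" claim can be checked in isolation with a sketch like the one below. It is not taken from `test_singleton_lock.py`; the helper name `lock_is_free_after_holder_exits` and the inline child script are assumptions made for illustration. The child takes the lock and exits without unlocking or deleting the file, and the parent then tries to take the same lock.

```python
import fcntl
import subprocess
import sys
from pathlib import Path

# Child script: grab the lock, then exit without releasing it explicitly.
CHILD = """
import fcntl, sys
f = open(sys.argv[1], "a+")
fcntl.flock(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
"""


def lock_is_free_after_holder_exits(lock_path: Path) -> bool:
    """Return True if the lock can be taken once the holding process has exited."""
    subprocess.run([sys.executable, "-c", CHILD, str(lock_path)], check=True)
    with open(lock_path, "a+") as handle:
        try:
            fcntl.flock(handle.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return False
        return True
```

Because the kernel drops a `flock` when the last descriptor for the open file is closed, the leftover lock file on disk is harmless and no cleanup pass is required.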

🤖 Generated with Claude Code

Telegram's ``getUpdates`` long-poll is exclusive per bot token: a second
instance silently steals updates and the original starts spamming
``telegram.error.Conflict: terminated by other getUpdates request`` on
every poll. Today we hit this with two ccbot processes running side by
side — one under ``ccbot-supervisor.sh`` inside ``tmux ccbot:__main__``,
the other started ~1d earlier from a NetHunter-terminal interactive
shell outside tmux. Both kept retrying for hours; user-visible behaviour
was message delays, ghost responses, and Card edits failing.

Fix: ``main.py`` now opens ``$CCBOT_DIR/ccbot.lock`` and holds an
exclusive ``fcntl.flock(LOCK_EX | LOCK_NB)`` for the process lifetime
BEFORE any tmux / bot startup work runs. The handle lives at module
scope so the lock survives until the interpreter exits; ``FD_CLOEXEC``
prevents the lock from leaking into any ``subprocess`` /
``asyncio.subprocess`` child (a stray child outliving the parent would
otherwise hold the lock and block future starts).
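The hazard that ``FD_CLOEXEC`` guards against can be reproduced with a small experiment. The sketch below is illustrative only and deliberately leaks the descriptor through ``pass_fds``; the helper names ``try_lock`` and ``leaked_fd_keeps_lock`` are hypothetical and do not come from the PR.

```python
import fcntl
import subprocess
import sys
from pathlib import Path
from typing import Tuple


def try_lock(lock_path: Path) -> bool:
    """Probe the lock with a fresh open(); release it again immediately."""
    with open(lock_path, "a+") as probe:
        try:
            fcntl.flock(probe.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return False
        return True


def leaked_fd_keeps_lock(lock_path: Path) -> Tuple[bool, bool]:
    """Return (lockable while child lives, lockable after child exits)."""
    holder = open(lock_path, "a+")
    fcntl.flock(holder.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
    # Deliberate leak: pass_fds keeps the locked descriptor open in the child.
    child = subprocess.Popen(
        [sys.executable, "-c", "import sys; sys.stdin.read()"],
        stdin=subprocess.PIPE,
        pass_fds=(holder.fileno(),),
    )
    holder.close()  # the "parent" gives up its own copy of the descriptor
    while_child_alive = try_lock(lock_path)
    child.stdin.close()  # EOF on stdin lets the child exit
    child.wait()
    after_child_exit = try_lock(lock_path)
    return while_child_alive, after_child_exit
```

On Linux this is expected to return ``(False, True)``: the child's inherited copy keeps the lock alive after the parent closes its handle, and the lock only frees once the child exits.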

A contending instance hits ``OSError`` on ``flock``, logs the path and
the reason, prints to stderr (for supervisor capture), closes the
handle, and calls ``sys.exit(1)``. The supervisor's restart-backoff loop
then just waits for the existing instance to die naturally, so there is
no more silent ``getUpdates`` cross-fire.

+4 unit tests cover happy-path acquire, contention → SystemExit(1),
re-acquire after release (process death is enough — no stale-lock
sweep needed), and parent-directory creation for first-launch
``$CCBOT_DIR``.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Time4Mind Time4Mind merged commit f22aff9 into main May 16, 2026
4 checks passed
@Time4Mind Time4Mind deleted the fix/singleton-flock-mutex branch May 16, 2026 21:53
Time4Mind added a commit that referenced this pull request May 16, 2026
…87)

Recent PRs (#82–#86) changed bot behaviour in user-facing and
operator-facing ways that the docs hadn't caught up with yet:

- CLAUDE.md
  * Core Design Constraints: add the SessionStart + UserPromptSubmit
    hook story (self-heal) and the singleton flock.
  * Configuration: add ``ccbot.lock`` to the state-files list.
  * Hook Configuration: full block with both events + the safety
    contract (zero stdout, always exit 0, fast-path skip).

- .claude/rules/architecture.md
  * State-files diagram: ``session_map.json`` now lists both hook
    events; new ``ccbot.lock`` entry.
  * Key Design Decisions: hook self-heal, singleton lock, archive-
    time orphan-claude kill, startup orphan-window detection.

- .claude/rules/dm-architecture.md
  * window_id → claude session_id: both hook events update the map.
  * "What is unchanged": session_map.json description matches.

- .claude/rules/secrets.md
  * Add ``ccbot.lock`` to the where-things-are table.

- doc/dm-multisession-spec.md
  * State persistence: list ``ccbot.lock``.
  * Section 9 Recovery: flock acquire step, orphan-window scan, and
    the archive cleanup path that SIGTERMs orphan claude processes.

- src/ccbot/i18n.py
  * /help → Tips body (en / ru / zh): two new bullets — single-
    instance lock, hook self-heal — so operators have visibility
    into the new guarantees without digging through code.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>