
cron: catch up jobs that have never recorded a successful run#139

Draft
alex-fedotyev wants to merge 2 commits into ClickHouse:main from alex-fedotyev:alex/cron-catchup-never-run


@alex-fedotyev

Contributor

Summary

Stacked on #138 (the day-of-week fix); please review that one first. This diff includes #138's commit until it merges, after which it shows only the catch-up change. It is a draft for that reason.

_catchup_missed_jobs runs at startup and fires any job that should have run while the daemon was down. It skipped any job whose last successful run was missing, so a job that had never recorded a success was never backfilled. Together with the day-of-week bug that #138 fixes, a weekly job that missed its first fire during a restart could wait a full week, or longer if it kept missing, before it ran.

This replaces the success-only check with _catchup_baseline, which measures a missed fire against, in order:

  1. the last successful run,
  2. the most recent attempt of any status (so a job that has only ever errored is still retried),
  3. the server start time.

Anchoring a job with no history at all to the server start time keeps a brand-new, not-yet-due job from firing spuriously: its first fire is still in the future, so it is not overdue.
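The fallback order can be sketched as a small pure function. This is an illustration only, not the code in this PR: the function name `catchup_baseline_sketch` and its parameters are hypothetical, and the final fallback to `now` follows the behaviour described in the test plan for an unset server start.

```python
from datetime import datetime, timezone
from typing import Optional


def catchup_baseline_sketch(
    last_success: Optional[datetime],
    last_attempt: Optional[datetime],
    server_start: Optional[datetime],
    now: datetime,
) -> datetime:
    """Return the timestamp a missed fire is measured against.

    Hypothetical stand-in for _catchup_baseline: the first value that is
    set wins, in the order success, any attempt, server start, now.
    """
    for candidate in (last_success, last_attempt, server_start):
        if candidate is not None:
            return candidate
    return now


now = datetime(2026, 6, 22, 17, 0, tzinfo=timezone.utc)
start = datetime(2026, 6, 22, 16, 59, tzinfo=timezone.utc)
errored = datetime(2026, 6, 14, 17, 0, tzinfo=timezone.utc)

# A job that has only ever errored is measured against its last attempt.
error_only = catchup_baseline_sketch(None, errored, start, now)
# A job with no history at all is measured against the server start time.
no_history = catchup_baseline_sketch(None, None, start, now)
```

Keeping the selection free of I/O makes each fallback easy to test on its own; the lookups for the last success and last attempt would happen in the caller.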

Also adds get_last_cron_run (the most recent cron_logs row regardless of status) and records the server start time on startup.
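For readers unfamiliar with the lookup, a minimal sketch of "most recent row regardless of status" follows. Everything except the table name cron_logs is an assumption made for illustration: the SQLite backend, the column names (`job_id`, `status`, `started_at`), and the function name `get_last_cron_run_sketch` are hypothetical and may not match the real schema or signature.

```python
import sqlite3
from typing import Optional, Tuple


def get_last_cron_run_sketch(
    conn: sqlite3.Connection, job_id: str
) -> Optional[Tuple[str, str, str]]:
    """Return the newest cron_logs row for a job, whatever its status.

    Assumed schema: cron_logs(job_id, status, started_at), with started_at
    stored as ISO-8601 text in one timezone so that text order is time order.
    """
    return conn.execute(
        "SELECT job_id, status, started_at FROM cron_logs "
        "WHERE job_id = ? ORDER BY started_at DESC LIMIT 1",
        (job_id,),
    ).fetchone()


# In-memory demo: a job that has only ever errored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cron_logs (job_id TEXT, status TEXT, started_at TEXT)")
conn.executemany(
    "INSERT INTO cron_logs VALUES (?, ?, ?)",
    [
        ("weekly-report", "error", "2026-06-07T13:00:00+00:00"),
        ("weekly-report", "error", "2026-06-14T13:00:00+00:00"),
    ],
)
latest = get_last_cron_run_sketch(conn, "weekly-report")
```

The only difference from a success-only lookup is the missing status filter, which is what lets an error-only job supply a baseline.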

Test plan

  • pytest tests/test_cron.py: 88 passed locally.
  • New TestCatchupBaseline covers baseline selection: prefers the last success, falls back to the last attempt of any status when the job never succeeded, then to server start, then to now when server start is unset.
  • New end-to-end cases: a weekly job that only ever errored (an 8-day-old attempt) catches up; a job with no history at all is anchored to server start and does not fire.
  • Full suite passes apart from pre-existing failures unrelated to cron (Docker and codex environment tests).
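The two end-to-end cases can be worked through by hand with a simplified overdue check. This sketch is not the PR's _is_overdue: it replaces the scheduler trigger with a hand-rolled weekly calculation, and the names `next_weekly_fire` and `is_overdue_sketch` are hypothetical. The schedule is taken to be Monday 13:00 UTC.

```python
from datetime import datetime, timedelta, timezone


def next_weekly_fire(after: datetime, weekday: int, hour: int) -> datetime:
    """First weekly fire strictly after `after` (weekday: 0=Mon..6=Sun)."""
    candidate = after.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - candidate.weekday()) % 7)
    if candidate <= after:
        candidate += timedelta(days=7)
    return candidate


def is_overdue_sketch(baseline: datetime, now: datetime, weekday: int, hour: int) -> bool:
    """A job is overdue if a scheduled fire falls between baseline and now."""
    return next_weekly_fire(baseline, weekday, hour) <= now


now = datetime(2026, 6, 22, 17, 0, tzinfo=timezone.utc)  # a Monday

# Case 1: the only history is an errored attempt 8 days ago (a Sunday).
# The next fire after it was Monday 2026-06-15 13:00, which has passed.
error_only = is_overdue_sketch(now - timedelta(days=8), now, weekday=0, hour=13)

# Case 2: no history, so the baseline is the server start a minute ago.
# The next fire after it is Monday 2026-06-29 13:00, still in the future.
no_history = is_overdue_sketch(now - timedelta(minutes=1), now, weekday=0, hour=13)
```

Here `error_only` is True and `no_history` is False, matching the two cases above.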

APScheduler's numeric day_of_week is 0=Mon..6=Sun, while Unix crontab
is 0=Sun..6=Sat (7 also means Sun). CronTrigger.from_crontab passes the
number straight through without remapping, so every numeric day-of-week
schedule fired one weekday late: "0 13 * * 1" (Monday) actually ran on
Tuesday, and "0 3 * * 0" (Sunday) ran on Monday.

Add _crontab_to_trigger, a from_crontab replacement that translates the
day-of-week field to APScheduler's day-name aliases (preserving *,
ranges like 1-5, lists like 1,4, and step suffixes) and leaves the
other four fields and the timezone handling unchanged. Route the three
trigger-construction sites (job scheduling, source scheduling, and the
overdue check) through it.
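
The day-of-week translation can be illustrated with plain string handling. This is a sketch of the idea only, not _crontab_to_trigger itself: the helper names are hypothetical, step suffixes are copied through unchanged, and wraparound ranges or scheduler-side validation are not handled.

```python
# Unix crontab numbering: 0=Sun..6=Sat, with 7 as a second spelling of Sunday.
CRON_DOW_NAMES = ("sun", "mon", "tue", "wed", "thu", "fri", "sat", "sun")


def _dow_name(token: str) -> str:
    """Map one crontab weekday number to a day name; pass other tokens through."""
    if not token.isdigit():
        return token  # "*" or an existing alias such as "mon"
    number = int(token)
    if number > 7:
        raise ValueError(f"day-of-week out of range: {token}")
    return CRON_DOW_NAMES[number]


def translate_dow_field(field: str) -> str:
    """Translate a crontab day-of-week field to day-name aliases."""
    items = []
    for item in field.split(","):  # lists such as 1,4
        base, slash, step = item.partition("/")  # keep any step suffix as-is
        base = "-".join(_dow_name(tok) for tok in base.split("-"))  # ranges
        items.append(base + slash + step)
    return ",".join(items)


def translate_crontab(expr: str) -> str:
    """Rewrite only the fifth field of a five-field crontab expression."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    fields[4] = translate_dow_field(fields[4])
    return " ".join(fields)


monday = translate_crontab("0 13 * * 1")  # "0 13 * * mon"
sunday = translate_crontab("0 3 * * 0")   # "0 3 * * sun"
```

Using names rather than renumbering avoids depending on either numbering convention once the expression reaches the scheduler.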

Tests cover the day-of-week mapping, parity with from_crontab for
non-DOW schedules, the interval-string fallback, and a weekly
_is_overdue case that is due after exactly one week.

After a restart, every numeric day-of-week cron shifts to the day its
schedule already specifies; that is the intended correction.

_catchup_missed_jobs skipped any job whose last successful run was
missing, so a weekly job that missed its first fire during a daemon
restart was starved: with no success row to measure against, it was
never backfilled and waited a full week for the next scheduled fire.

Replace the success-only check with _catchup_baseline, which measures
a missed fire against the last successful run, then the most recent
attempt of any status (so a job that has only ever errored is still
retried), then the server start time. Anchoring a job with no history
to server start keeps a brand-new, not-yet-due job from firing
spuriously, since its first fire is still in the future.

Add get_last_cron_run (most recent cron_logs row regardless of status)
and record the server start time on startup.

Tests cover the baseline selection (success, error-only, no-history)
and end-to-end catch-up for a weekly job that only ever errored.
@alex-fedotyev alex-fedotyev force-pushed the alex/cron-catchup-never-run branch from 4397e60 to 1153f5a Compare June 22, 2026 17:33