Skip to content

feat(celery): Add task and worker lifecycle metrics#4439

Open
diogosilva30 wants to merge 3 commits intoopen-telemetry:mainfrom
diogosilva30:feat/celery-task-and-worker-metrics
Open

feat(celery): Add task and worker lifecycle metrics#4439
diogosilva30 wants to merge 3 commits intoopen-telemetry:mainfrom
diogosilva30:feat/celery-task-and-worker-metrics

Conversation

@diogosilva30
Copy link
Copy Markdown
Contributor

@diogosilva30 diogosilva30 commented Apr 15, 2026

Description

Implements a subset of the Prometheus metrics exposed by Celery Flower directly in the Celery instrumentation, so users who only need basic task/worker observability no longer need to run Flower as a separate service.

Fixes #3458

Type of change

  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)

Changes

Task-level metrics (in CeleryInstrumentor)

Metric tracking wired into the existing prerun/postrun/failure/retry handlers:

Flower metric Impl. name Labels Type
flower_events_total flower.events.total task, type, worker Counter
flower_task_runtime_seconds flower.task.runtime.seconds task, worker Histogram
flower_worker_number_of_currently_executing_tasks flower.worker.number.of.currently.executing.tasks worker UpDownCounter (gauge)

Worker lifecycle metrics (new CeleryWorkerInstrumentor)

A separate instrumentor that hooks into worker_ready / worker_shutdown / task_received / task_revoked signals in the main worker process:

Flower metric Impl. name Labels Type
flower_events_total flower.events.total task, type, worker Counter
flower_worker_online flower.worker.online worker UpDownCounter (gauge)

Registered as a new celery_worker entry point in pyproject.toml.

Omitted metrics

flower_task_prefetch_time_seconds and flower_worker_prefetched_tasks are not implemented. In pool=prefork mode (the default), task_received fires in the main process while task_prerun fires in a child process — there is no shared state to compute the time delta between them. Flower can do this because it consumes Celery events externally in a single process; replicating that inside the worker is not feasible without producer-side instrumentation or protocol changes.

Bug fix — memory leak in task_id_to_start_time

Previously, task_id_to_start_time was a class-level dict that was never cleaned up after task completion, growing unbounded over time. It is now instance-scoped and entries are removed in _trace_postrun.

Housekeeping

  • Type annotations added across __init__.py and utils.py
  • Typed CeleryGetter properly (Getter[Request])
  • Null guards added to detach_context / retrieve_context / retrieve_task_id_from_message in utils.py
  • Replaced bare dict metrics store with typed _CeleryTaskMetrics / _CeleryWorkerMetrics dataclasses

How Has This Been Tested?

  • Comprehensive unit tests covering all new signal handlers and metric emissions (test_metrics.py, ~1100 new lines)
  • Integration tests that start a real Celery worker in-process to verify worker_ready / worker_shutdown signals
  • Edge-case tests: duplicate signals, unknown workers, uninstrument/reinstrument cycles
  • Pre-commit hooks (ruff lint + format) passing
  • End-to-end testing with a real Django + Celery project exporting to Grafana via OTLP

Trying it out

A self-contained example project (Django + Celery + Docker Compose + Grafana LGTM) is available at celery-metrics-example (separate branch on the fork, not part of this PR).

Quick start:

git clone -b celery-metrics-example https://github.com/diogosilva30/opentelemetry-python-contrib.git /tmp/celery-example
cd /tmp/celery-example/instrumentation/opentelemetry-instrumentation-celery/example
docker compose up -d

# Exercise every metric path
docker compose run --rm -w /repo django \
    uv run --no-sync \
        --directory instrumentation/opentelemetry-instrumentation-celery/example \
        python exercise_metrics.py

Then open Grafana at http://localhost:3000 (admin/admin) and explore metrics prefixed with flower.*.

Does This PR Require a Core Repo Change?

  • No.

Checklist:

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

@diogosilva30 diogosilva30 marked this pull request as ready for review April 15, 2026 23:00
@diogosilva30 diogosilva30 requested a review from a team as a code owner April 15, 2026 23:00
@diogosilva30 diogosilva30 marked this pull request as draft April 16, 2026 00:10
@diogosilva30 diogosilva30 force-pushed the feat/celery-task-and-worker-metrics branch from 45f5df3 to 94a36e1 Compare April 16, 2026 15:54
@diogosilva30 diogosilva30 marked this pull request as ready for review April 16, 2026 15:59
@diogosilva30 diogosilva30 force-pushed the feat/celery-task-and-worker-metrics branch 3 times, most recently from 16069a0 to c6ae222 Compare April 21, 2026 10:27
Introduce Flower-compatible metrics for Celery task and worker events:

- flower.events.total: counter for task-sent, task-received,
  task-started, task-succeeded, task-failed, task-retried, task-revoked
- flower.task.runtime.seconds: histogram of task execution time
- flower.worker.number.of.currently.executing.tasks: gauge of in-flight
  tasks per worker
- flower.worker.online: gauge tracking worker online/offline status

CeleryInstrumentor (child process) handles task tracing and task-level
metrics.  CeleryWorkerInstrumentor (main process) handles worker
lifecycle signals (worker_ready, worker_shutdown) and main-process
events (task_received, task_revoked).

Prefetch-time metrics (task_prefetch_time_seconds, worker_prefetched_tasks)
are intentionally omitted — Celery splits task_received and task_prerun
across processes in prefork mode, making it impossible to compute the
delta without external event consumption.
@diogosilva30 diogosilva30 force-pushed the feat/celery-task-and-worker-metrics branch from c6ae222 to a136151 Compare April 21, 2026 10:32
@diogosilva30
Copy link
Copy Markdown
Contributor Author

Hey @xrmx @aabmass @lzchen — would appreciate a review when you get a chance!

The CI failures seem unrelated to my changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve metrics in Celery instrumentation

1 participant