[OTEL] Add OpenTelemetry observability support#285

Open
royischoss wants to merge 27 commits into mlrun:development from royischoss:ceml-641

Conversation

@royischoss
Contributor

@royischoss royischoss commented Apr 5, 2026

Adds OTel-based observability to MLRun CE with automatic Python instrumentation, deployment-mode metrics collection, and Prometheus integration.
https://iguazio.atlassian.net/browse/CEML-685

Changes

OTel operator sub-chart

  • Added opentelemetry-operator v0.78.1 as an optional dependency
  • crds.create: false — CRD rendering disabled on the sub-chart; the parent chart owns the CRDs via crds/ (see below)
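
A minimal sketch of how this dependency and override might be declared. The `condition` key and the `enabled` flag are assumptions, not confirmed by this PR, and depending on the chart's `apiVersion` the dependency entry may live in `requirements.yaml` rather than `Chart.yaml`:

```yaml
# Chart.yaml (or requirements.yaml): optional sub-chart dependency
dependencies:
  - name: opentelemetry-operator
    version: 0.78.1
    repository: https://open-telemetry.github.io/opentelemetry-helm-charts
    condition: opentelemetry-operator.enabled   # assumed toggle name
---
# values.yaml: the sub-chart does not render its own CRDs
opentelemetry-operator:
  enabled: false        # assumed default; the dependency is optional
  crds:
    create: false       # the parent chart ships the CRDs in crds/
```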

CRD bootstrap via crds/ directory

  • Three minimal stub CRDs added to charts/mlrun-ce/crds/:
    • crd-opentelemetrycollector.yaml
    • crd-opentelemetryinstrumentation.yaml
    • crd-opampbridges.yaml
  • Helm applies crds/ before any templates or hooks, so the OTel CRD types are established before the crd-readiness-job hook runs — no CRD polling needed
  • Stubs use x-kubernetes-preserve-unknown-fields: true (minimal schema); the operator's admission webhook handles full CR validation once it's running
  • tests/package.sh replaces the large CRD files inside the opentelemetry-operator sub-chart tarball with 41-byte stubs, keeping the Helm release Secret well under the 3 MB Kubernetes API limit
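
For illustration, a stub CRD of this kind could look roughly like the following. This is a sketch, not the file from the PR: the served version and the `status` subresource are assumptions.

```yaml
# crds/crd-opentelemetrycollector.yaml (illustrative stub)
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: opentelemetrycollectors.opentelemetry.io
spec:
  group: opentelemetry.io
  scope: Namespaced
  names:
    kind: OpenTelemetryCollector
    listKind: OpenTelemetryCollectorList
    plural: opentelemetrycollectors
    singular: opentelemetrycollector
  versions:
    - name: v1beta1          # assumed version; the real CRD may serve others
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          # Accept any fields; full validation is left to the operator webhook
          x-kubernetes-preserve-unknown-fields: true
      subresources:
        status: {}           # assumption
```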

New templates (templates/opentelemetry/)

  • Pre-install hook to label/annotate the namespace for OTel webhook injection and namespace-wide Python auto-instrumentation
  • collector.yaml and instrumentation.yaml — placeholder files; the actual CRs are applied by crd-readiness-job.yaml (post-install/post-upgrade hook) after the operator webhook is ready
  • RBAC for hook jobs
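
As a rough picture of what the pre-install hook leaves behind, the namespace might end up looking like this. The namespace name and the label key/value are hypothetical; `instrumentation.opentelemetry.io/inject-python` is the operator's annotation for Python auto-instrumentation:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: mlrun                          # hypothetical namespace name
  labels:
    # Hypothetical label matched by the operator webhook's namespace selector
    opentelemetry.io/inject: "enabled"
  annotations:
    # Namespace-wide Python auto-instrumentation for pods created here
    instrumentation.opentelemetry.io/inject-python: "true"
```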

Metrics: push model (OTLP → Prometheus)

  • OTel collector exports metrics by pushing directly to Prometheus via the otlphttp/prometheus exporter at http://prometheus-operated.<namespace>.svc:9090/api/v1/otlp
  • Prometheus is configured with --enable-feature=otlp-write-receiver and --web.enable-otlp-receiver (both required in Prometheus v3)
  • Consistent with how the non-CE production system handles metrics collection
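
A sketch of a collector CR wired for this push model, assuming an OTLP receiver on the default ports. The CR name is hypothetical, and `<namespace>` is left as a placeholder exactly as in the description above:

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: mlrun-otel                     # hypothetical name
spec:
  mode: deployment
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    exporters:
      # Pushes OTLP metrics to the Prometheus OTLP endpoint
      otlphttp/prometheus:
        endpoint: http://prometheus-operated.<namespace>.svc:9090/api/v1/otlp
    service:
      pipelines:
        metrics:
          receivers: [otlp]
          exporters: [otlphttp/prometheus]
```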

Instrumentation CR

  • Deployment-mode collector — single pod per namespace receiving OTLP from all instrumented workloads
  • Disabled aws_lambda OTel instrumentor to suppress irrelevant Lambda warnings
  • Removed duplicate OTEL_RESOURCE_ATTRIBUTES_* env vars (auto-injected by the operator)
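
A possible shape for the Instrumentation CR, as a sketch only. The CR name and collector endpoint are hypothetical, and the exact instrumentor name expected by `OTEL_PYTHON_DISABLED_INSTRUMENTATIONS` should be checked against the Python distro in use:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: mlrun-instrumentation          # hypothetical name
spec:
  exporter:
    # Hypothetical service name of the deployment-mode collector
    endpoint: http://mlrun-otel-collector:4318
  python:
    env:
      # Skip the Lambda instrumentor to avoid irrelevant warnings
      - name: OTEL_PYTHON_DISABLED_INSTRUMENTATIONS
        value: aws_lambda              # name as written in this PR; verify
```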

MLRun API crash fix

  • Added mlrun.api.extraEnvKeyValue.PYTHONPATH — OTel operator injects PYTHONPATH=/otel-auto-instrumentation-python:$(PYTHONPATH) using K8s env var expansion, which can't see Docker image ENV vars. Without this explicit K8s env var, $(PYTHONPATH) resolves to empty, dropping the MLRun services package path and crashing the API
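
In values terms, the fix amounts to something like the sketch below. The path value is a placeholder, since the actual package path set by the image is not stated here:

```yaml
mlrun:
  api:
    extraEnvKeyValue:
      # Placeholder: repeat the PYTHONPATH that the image sets via Docker ENV,
      # so the operator-injected $(PYTHONPATH) reference has a K8s-level value
      PYTHONPATH: /path/set/by/image/ENV
```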

Admin / non-admin split

  • Admin: installs OTel operator with namespace-selector webhook; CRs disabled
  • User namespace: operator disabled; collector + instrumentation CRs enabled
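
The split could be expressed with two values files along these lines. All `enabled` keys are assumed names for illustration; only `opentelemetry.collector.mode` appears in the chart templates quoted in this PR:

```yaml
# admin-values.yaml (illustrative): operator only, no CRs
opentelemetry-operator:
  enabled: true
opentelemetry:
  collector:
    enabled: false         # assumed key
  instrumentation:
    enabled: false         # assumed key
---
# user-values.yaml (illustrative): CRs only, no operator
opentelemetry-operator:
  enabled: false
opentelemetry:
  collector:
    enabled: true          # assumed key
    mode: deployment
  instrumentation:
    enabled: true          # assumed key
```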

🤖 Generated with Claude Code

- action: replace
  target_label: metrics_source
  replacement: otel_collector
kube-state-metrics:
Contributor Author

This is a design limitation: with no conditions in values.yaml, the scraping will run even if OTel is disabled.

Contributor

Why do we need this extra scraping job, rather than just adding the web.enable-otlp-receiver flag to the Prometheus deployment?

As you can see here

@royischoss royischoss marked this pull request as ready for review April 9, 2026 07:50
@royischoss royischoss requested a review from davesh0812 April 12, 2026 10:16
Comment thread charts/mlrun-ce/templates/_helpers.tpl Outdated
{{- include "mlrun-ce.otel.labels" . | nindent 4 }}
spec:
  mode: {{ .Values.opentelemetry.collector.mode }}
  upgradeStrategy: automatic
Collaborator

Maybe we should add an option in values.yaml so the user can change this value?
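
One way this could look, as a sketch: the `upgradeStrategy` values key is hypothetical and falls back to the current hard-coded value when unset.

```yaml
spec:
  mode: {{ .Values.opentelemetry.collector.mode }}
  # Hypothetical values key; defaults to today's behavior
  upgradeStrategy: {{ .Values.opentelemetry.collector.upgradeStrategy | default "automatic" }}
```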

Comment thread charts/mlrun-ce/README.md Outdated
mlrun/mlrun-ce
```

#### Split Installation (Admin/Non-Admin)
Collaborator

This can be removed, as it should be documented in the MLRun docs.

…ion accordingly. add request and limit for crdReadinessJob and namespaceLabelJob
# Conflicts:
#	charts/mlrun-ce/Chart.yaml
#	charts/mlrun-ce/README.md
#	charts/mlrun-ce/requirements.lock
…, change naming for otel metrics using metadata.name fieldRef
…, change naming for otel metrics using metadata.name fieldRef
3 participants