[OTEL] Add OpenTelemetry observability support #285
Open
royischoss wants to merge 27 commits into mlrun:development
…R_SPACE and MLRUN_MODEL_ENDPOINT_MONITORING__STORE_PREFIXES__MONITORING_APPLICATION plus removes MLRUN_MODEL_ENDPOINT_MONITORING__ENDPOINT_STORE_CONNECTION
# Conflicts:
#   charts/mlrun-ce/Chart.yaml
#   charts/mlrun-ce/README.md
#   charts/mlrun-ce/requirements.lock
#   charts/mlrun-ce/values.yaml
#   tests/kind-test.sh
royischoss commented Apr 9, 2026
    - action: replace
      target_label: metrics_source
      replacement: otel_collector
    kube-state-metrics:
Contributor
Author
This is a design limitation: values.yaml can't contain conditionals, so this scrape job runs even when OTel is disabled.
Contributor
Why do we need this extra scraping job instead of just adding the `web.enable-otlp-receiver` flag to the Prometheus deployment?
As you can see here
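Enabling the OTLP receiver directly on the Prometheus instance, as suggested above, could look roughly like the values below. This is a sketch that assumes Prometheus is managed through the kube-prometheus-stack sub-chart under a `kube-prometheus-stack` key; the exact nesting in mlrun-ce's values.yaml may differ.

```yaml
kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      # Passed to the Prometheus container via --enable-feature
      enableFeatures:
        - otlp-write-receiver
      # Extra CLI flags; "name" is written without the leading dashes
      additionalArgs:
        - name: web.enable-otlp-receiver
```

With the receiver on, the collector pushes metrics to `/api/v1/otlp/v1/metrics` on Prometheus rather than being scraped.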
        {{- include "mlrun-ce.otel.labels" . | nindent 4 }}
    spec:
      mode: {{ .Values.opentelemetry.collector.mode }}
      upgradeStrategy: automatic
Collaborator
Maybe we should add an option in values.yaml that lets users change this value?
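One way to expose it, sketched with a hypothetical `opentelemetry.collector.upgradeStrategy` key placed next to the existing `mode` value:

```yaml
opentelemetry:
  collector:
    # Hypothetical key; the operator accepts "automatic" or "none"
    upgradeStrategy: automatic
```

The collector template would then render `upgradeStrategy: {{ .Values.opentelemetry.collector.upgradeStrategy | default "automatic" }}`, so existing installs keep today's behavior unless they override the value.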
    mlrun/mlrun-ce
    ```

    #### Split Installation (Admin/Non-Admin)
Collaborator
This can be removed, as it should be documented in the MLRun docs.
…ion accordingly. Add requests and limits for crdReadinessJob and namespaceLabelJob
# Conflicts:
#   charts/mlrun-ce/Chart.yaml
#   charts/mlrun-ce/README.md
#   charts/mlrun-ce/requirements.lock
…, change naming for otel metrics using metadata.name fieldRef
Adds OTel-based observability to MLRun CE with automatic Python instrumentation, deployment-mode metrics collection, and Prometheus integration.
https://iguazio.atlassian.net/browse/CEML-685
Changes
OTel operator sub-chart
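A minimal sketch of how such an optional sub-chart is commonly wired, assuming the dependency lives in `requirements.yaml` (the chart ships a `requirements.lock`) and using a hypothetical `opentelemetry-operator.enabled` toggle; the PR's actual condition key may differ.

```yaml
# requirements.yaml (dependency entry)
dependencies:
  - name: opentelemetry-operator
    version: 0.78.1
    repository: https://open-telemetry.github.io/opentelemetry-helm-charts
    # Hypothetical toggle key; the real condition may differ
    condition: opentelemetry-operator.enabled
---
# values.yaml (parent chart)
opentelemetry-operator:
  enabled: false   # hypothetical; set to true to install the operator
  crds:
    create: false  # the parent chart ships the CRDs from its own crds/ directory
```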
- `opentelemetry-operator` v0.78.1 as an optional dependency
- `crds.create: false`: CRD rendering disabled on the sub-chart; the parent chart owns the CRDs via `crds/` (see below)

CRD bootstrap via `crds/` directory
- `charts/mlrun-ce/crds/`: `crd-opentelemetrycollector.yaml`, `crd-opentelemetryinstrumentation.yaml`, `crd-opampbridges.yaml`
- Helm installs `crds/` before any templates or hooks, so the OTel CRD types are established before the `crd-readiness-job` hook runs; no CRD polling needed
- `x-kubernetes-preserve-unknown-fields: true` (minimal schema); the operator's admission webhook handles full CR validation once it's running
- `tests/package.sh` replaces the large CRD files inside the `opentelemetry-operator` sub-chart tarball with 41-byte stubs, keeping the Helm release Secret under the 1 MiB Kubernetes Secret size limit

New templates (`templates/opentelemetry/`)
- `collector.yaml` and `instrumentation.yaml`: placeholder files; the actual CRs are applied by `crd-readiness-job.yaml` (post-install/post-upgrade hook) after the operator webhook is ready

Metrics: push model (OTLP → Prometheus)
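A minimal collector CR for this push path, assuming the `opentelemetry.io/v1beta1` API, a collector named `otel`, and the `mlrun` namespace (both placeholders, not taken from the PR):

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel          # placeholder; the operator exposes it as service "otel-collector"
  namespace: mlrun    # placeholder namespace
spec:
  mode: deployment
  upgradeStrategy: automatic
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch: {}
    exporters:
      otlphttp/prometheus:
        # otlphttp appends /v1/metrics, so data lands on /api/v1/otlp/v1/metrics
        endpoint: http://prometheus-operated.mlrun.svc:9090/api/v1/otlp
    service:
      pipelines:
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlphttp/prometheus]
```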
- `otlphttp/prometheus` exporter at `http://prometheus-operated.<namespace>.svc:9090/api/v1/otlp`
- `--enable-feature=otlp-write-receiver` and `--web.enable-otlp-receiver` (both required in Prometheus v3)

Instrumentation CR
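A sketch of an Instrumentation CR along these lines. The name, the `mlrun` namespace and the `otel-collector` service URL are placeholders rather than values from the PR.

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: mlrun-instrumentation   # placeholder name
  namespace: mlrun              # placeholder namespace
spec:
  exporter:
    # Python auto-instrumentation exports over OTLP/HTTP, hence port 4318
    endpoint: http://otel-collector.mlrun.svc:4318
  propagators:
    - tracecontext
    - baggage
  python:
    env:
      # Skip the AWS Lambda instrumentor, which only produces noise outside Lambda
      - name: OTEL_PYTHON_DISABLED_INSTRUMENTATIONS
        value: aws_lambda
```

Pods opt in through the operator's `instrumentation.opentelemetry.io/inject-python: "true"` annotation, set either on the pod template or on the namespace.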
- `aws_lambda` OTel instrumentor disabled to suppress irrelevant Lambda warnings
- `OTEL_RESOURCE_ATTRIBUTES_*` env vars (auto-injected by the operator)

MLRun API crash fix
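The values override could look like this. The path is a placeholder: it should match whatever PYTHONPATH the MLRun API image sets via `ENV`, which this description does not spell out.

```yaml
mlrun:
  api:
    extraEnvKeyValue:
      # Placeholder: copy the PYTHONPATH baked into the MLRun API image.
      # Declaring it as a container env var gives the operator's
      # $(PYTHONPATH) reference a value to expand.
      PYTHONPATH: "<PYTHONPATH from the mlrun-api image>"
```

This works because Kubernetes expands `$(VAR)` only from variables declared earlier in the same container's `env` list (plus service variables). Values set with a Dockerfile `ENV` are invisible to that expansion, and an unresolvable reference is left as the literal `$(VAR)` text.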
- `mlrun.api.extraEnvKeyValue.PYTHONPATH`: the OTel operator injects `PYTHONPATH=/otel-auto-instrumentation-python:$(PYTHONPATH)` using K8s env var expansion, which can't see Docker image `ENV` vars. Without this explicit K8s env var, `$(PYTHONPATH)` can't pick up the image's value, so the MLRun services package path is dropped and the API crashes

Admin / non-admin split
🤖 Generated with Claude Code