
[Feature Proposal]: OpenTelemetry Traces for Operations #1748

@notheotherben

Description

I've been using Rustic and one of the major selling points for me was the native OpenTelemetry integration (enabling me to more easily monitor the collection of backups across my infrastructure). I was, however, somewhat surprised to see that only the otel/metrics extension was supported.

I've previously used the otel/traces extension on another backup tool to make it significantly easier to gather and analyze contextual information relating to a backup (including how long different stages take, the success/failure of specific stages, stage-specific context, etc).

This is a feature I'd love to see in Rustic, and I'd be very happy to contribute it if you're open to the idea - and no worries if you have a reason you've chosen to avoid doing so.

How I Propose Implementing This

  • I'd use the tracing crate and tracing-opentelemetry to instrument key functions within commands::* (there's a rough sketch of what this could look like after this list).
  • I'd use different trace severity levels and control these through the config file using a dedicated log-level-traces property.
  • I'd update the opentelemetry config to point to the root OTEL_EXPORTER_OTLP_ENDPOINT rather than the current OTEL_EXPORTER_OTLP_METRICS_ENDPOINT (automatically detecting a /v1/metrics suffix and stripping it, to maintain backwards compatibility with older configs).
  • I'd use events to record information about specific warnings/failures which occur within the scope of a backup run (e.g. "unable to open the file for reading because it is locked by another process" etc).
  • I'd keep the trace spans reasonably high-level (representing the customer-facing units of work mapped from the config.toml file - things like snapshots etc) to ensure the volume remains reasonable.
  • I'd mark the top-level trace span as succeeded/failed based on the process exit code.
  • I'd include the trace ID in the logs written by Rustic for a given execution, allowing the logs to be tied back to their corresponding trace if required.
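
To make that a bit more concrete, here's a rough sketch of what the command-level instrumentation could look like. The function and field names (run_backup, source, etc.) are placeholders rather than Rustic's actual internals, and the otel.status_code handling relies on tracing-opentelemetry's special field mapping:

```rust
use tracing::{field::Empty, instrument, warn, Span};

// Hypothetical entry point for backing up one source defined in config.toml;
// the real functions to instrument live in commands::*.
#[instrument(skip_all, fields(source = %source_path, otel.status_code = Empty))]
fn run_backup(source_path: &str) -> std::io::Result<()> {
    // Warnings/failures inside the run are recorded as events on the span,
    // e.g. a file that can't be opened because another process holds a lock.
    warn!(path = %source_path, "unable to open the file for reading: locked by another process");

    let result = do_backup(source_path);

    // tracing-opentelemetry maps the special `otel.status_code` field onto the
    // OpenTelemetry span status, so a failed run marks the span as failed.
    if result.is_err() {
        Span::current().record("otel.status_code", "ERROR");
    }
    result
}

// Stand-in for the existing backup logic.
fn do_backup(_source: &str) -> std::io::Result<()> {
    Ok(())
}
```

Wiring this up would mostly be a matter of adding tracing_opentelemetry::layer() to the existing subscriber setup; the exact builder calls depend on which opentelemetry/opentelemetry_otlp versions Rustic ends up on, so I've left those out here.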

Warnings/Gotchas

  • I've personally found that the tracing-opentelemetry crate doesn't necessarily update on the same cadence as the opentelemetry crate, and the opentelemetry crate tends to make breaking changes on a regular basis, so if you're trying to stay on the latest versions, keeping the integration up to date can be a bit of a headache. That said, I see you're already using the opentelemetry crate and are likely already subject to much of this.
  • If we change the behaviour of the opentelemetry config key (to support its use for both metrics and traces), we run the risk of trying to ship traces to endpoints which don't support them (e.g. Prometheus), and we make it difficult for users to ship metrics and traces to different endpoints. We could sidestep this by not using the config to configure tracing (i.e. using environment variables instead), but I think that would be confusing since we use the config for metrics; similarly, we could add a new opentelemetry-traces key, but I think that would lead to confusion around why the existing key isn't called opentelemetry-metrics.
    I think the "best" approach would be to shift the opentelemetry key to point at the root endpoint (aka OTEL_EXPORTER_OTLP_ENDPOINT), with /v1/metrics and /v1/traces being appended automatically if they are not already present. If a /v1/metrics or /v1/traces suffix is present, we would only enable that subset of the functionality (i.e. an existing config continues to work exactly as it does today, but if you strip the /v1/metrics suffix you will start receiving traces as well) - there's a rough sketch of this after the list.
  • Traces, depending on the data we export from them, have the potential to expose sensitive information (paths might include PII etc). I think this is primarily a documentation issue (i.e. we should call out the types of information which are exposed through the tracing interface so that operators can plan for this).
  • If we use the tracing crate, we are likely to get a bunch of tracing coming along "for free" from other crates which implement it as well. This may result in extra spans being emitted beyond the scope of Rustic - and we may wish to consider whether that's useful or not (and at what log-level-traces levels those should be visible/hidden).
  • If we're shipping traces, we may need to wait for a period of time upon process completion before exiting (determined by the latency for emission, and potentially extending out to a configurable timeout if the backend is slow to respond). This could make the program appear "slow" at the completion of an operation.
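
To illustrate the endpoint handling from the second bullet above, here's a rough sketch of the suffix detection (the helper name and the OtlpTarget type are made up for discussion and aren't part of Rustic's current config model):

```rust
/// Which OTLP signals to enable and where to send them (hypothetical type).
struct OtlpTarget {
    root_endpoint: String,
    metrics: bool,
    traces: bool,
}

/// Derive the root endpoint and the enabled signals from the existing
/// `opentelemetry` config value:
///   - no suffix            -> metrics + traces against the root endpoint
///   - trailing /v1/metrics -> metrics only (today's behaviour, unchanged)
///   - trailing /v1/traces  -> traces only
fn split_otlp_endpoint(configured: &str) -> OtlpTarget {
    let trimmed = configured.trim_end_matches('/');
    if let Some(root) = trimmed.strip_suffix("/v1/metrics") {
        // Existing configs pointing at .../v1/metrics keep their metrics-only behaviour.
        OtlpTarget {
            root_endpoint: root.trim_end_matches('/').to_string(),
            metrics: true,
            traces: false,
        }
    } else if let Some(root) = trimmed.strip_suffix("/v1/traces") {
        OtlpTarget {
            root_endpoint: root.trim_end_matches('/').to_string(),
            metrics: false,
            traces: true,
        }
    } else {
        // A bare root endpoint enables both signals.
        OtlpTarget {
            root_endpoint: trimmed.to_string(),
            metrics: true,
            traces: true,
        }
    }
}

fn main() {
    // An existing config keeps working exactly as it does today: metrics only.
    let existing = split_otlp_endpoint("http://collector:4318/v1/metrics");
    assert!(existing.metrics && !existing.traces);
    assert_eq!(existing.root_endpoint, "http://collector:4318");

    // Dropping the suffix opts into traces as well.
    let root = split_otlp_endpoint("http://collector:4318");
    assert!(root.metrics && root.traces);
}
```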

Anyway, let me know your thoughts and whether this is something you'd be open to me working on.
