
[Feature Proposal]: OpenTelemetry Traces for Operations #1748

@notheotherben

Description

I've been using Rustic and one of the major selling points for me was the native OpenTelemetry integration (enabling me to more easily monitor the collection of backups across my infrastructure). I was, however, somewhat surprised to see that only the otel/metrics extension was supported.

I've previously used the otel/traces extension on another backup tool to make it significantly easier to gather and analyze contextual information relating to a backup (including how long different stages take, the success/failure of specific stages, stage-specific context, etc).

This is a feature I'd love to see in Rustic, and I'd be very happy to contribute it if you're open to the idea - and no worries if you have a reason you've chosen to avoid doing so.

How I Propose Implementing This

  • I'd use the tracing crate and tracing-opentelemetry to instrument key functions within commands::* (there's a rough sketch of what this could look like after this list).
  • I'd use different trace severity levels and control these through the config file using a dedicated log-level-traces property.
  • I'd update the opentelemetry config to point to the root OTEL_EXPORTER_OTLP_ENDPOINT rather than the current OTEL_EXPORTER_OTLP_METRICS_ENDPOINT (automatically detecting a /v1/metrics suffix and stripping it, to maintain backwards compatibility with older configs).
  • I'd use events to record information about specific warnings/failures which occur within the scope of a backup run (e.g. "unable to open the file for reading because it is locked by another process" etc).
  • I'd keep the trace spans reasonably high-level (representing the customer-facing units of work mapped from the config.toml file - things like snapshots etc) to ensure the volume remains reasonable.
  • I'd mark the top-level trace span as succeeded/failed based on the process exit code.
  • I'd include the trace ID in the logs written by Rustic for a given execution, allowing the logs to be tied back to their corresponding trace if required.
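
To make that a bit more concrete, here's a rough sketch of what the command-level instrumentation could look like. The function and field names (run_backup, source, etc.) are placeholders rather than Rustic's actual internals, and the otel.status_code handling relies on tracing-opentelemetry's special field mapping:

```rust
use tracing::{field::Empty, instrument, warn, Span};

// Hypothetical entry point for backing up one source defined in config.toml;
// the real functions to instrument live in commands::*.
#[instrument(skip_all, fields(source = %source_path, otel.status_code = Empty))]
fn run_backup(source_path: &str) -> std::io::Result<()> {
    // Warnings/failures inside the run are recorded as events on the span,
    // e.g. a file that can't be opened because another process holds a lock.
    warn!(path = %source_path, "unable to open the file for reading: locked by another process");

    let result = do_backup(source_path);

    // tracing-opentelemetry maps the special `otel.status_code` field onto the
    // OpenTelemetry span status, so a failed run marks the span as failed.
    if result.is_err() {
        Span::current().record("otel.status_code", "ERROR");
    }
    result
}

// Stand-in for the existing backup logic.
fn do_backup(_source: &str) -> std::io::Result<()> {
    Ok(())
}
```

Wiring this up would mostly be a matter of adding tracing_opentelemetry::layer() to the existing subscriber setup; the exact builder calls depend on which opentelemetry/opentelemetry_otlp versions Rustic ends up on, so I've left those out here.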

Warnings/Gotchas

  • I've personally found that the tracing-opentelemetry crate doesn't necessarily update on the same cadence as the opentelemetry crate, and the opentelemetry crate tends to make breaking changes on a regular basis, so if you're trying to stay on the latest versions, keeping the integration up to date can be a bit of a headache. That said, I see you're already using the opentelemetry crate and are likely already subject to much of this.
  • If we change the behaviour of the opentelemetry config key (to support its use for both metrics and traces), we run the risk of trying to ship traces to endpoints which don't support them (e.g. Prometheus), and we make it difficult for users to ship metrics and traces to different endpoints. We could sidestep this by not using the config to configure tracing (i.e. using environment variables instead), but I think that would be confusing since we use the config for metrics; similarly, we could add a new opentelemetry-traces key, but I think that would lead to confusion around why the existing key isn't called opentelemetry-metrics.
    I think the "best" approach would be to shift the opentelemetry key to point at the root endpoint (aka OTEL_EXPORTER_OTLP_ENDPOINT), with /v1/metrics and /v1/traces being appended automatically if they are not already present. If a /v1/metrics or /v1/traces suffix is present, we would only enable that subset of the functionality (i.e. an existing config continues to work exactly as it does today, but if you strip the /v1/metrics suffix you will start receiving traces as well) - there's a rough sketch of this after the list.
  • Traces, depending on the data we export from them, have the potential to expose sensitive information (paths might include PII etc). I think this is primarily a documentation issue (i.e. we should call out the types of information which are exposed through the tracing interface so that operators can plan for this).
  • If we use the tracing crate, we are likely to get a bunch of tracing coming along "for free" from other crates which implement it as well. This may result in extra spans being emitted beyond the scope of Rustic - and we may wish to consider whether that's useful or not (and at what log-level-traces levels those should be visible/hidden).
  • If we're shipping traces, we may need to wait for a period of time upon process completion before exiting (determined by the latency for emission, and potentially extending out to a configurable timeout if the backend is slow to respond). This could make the program appear "slow" at the completion of an operation.
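
To illustrate the endpoint handling from the second bullet above, here's a rough sketch of the suffix detection (the helper name and the OtlpTarget type are made up for discussion and aren't part of Rustic's current config model):

```rust
/// Which OTLP signals to enable and where to send them (hypothetical type).
struct OtlpTarget {
    root_endpoint: String,
    metrics: bool,
    traces: bool,
}

/// Derive the root endpoint and the enabled signals from the existing
/// `opentelemetry` config value:
///   - no suffix            -> metrics + traces against the root endpoint
///   - trailing /v1/metrics -> metrics only (today's behaviour, unchanged)
///   - trailing /v1/traces  -> traces only
fn split_otlp_endpoint(configured: &str) -> OtlpTarget {
    let trimmed = configured.trim_end_matches('/');
    if let Some(root) = trimmed.strip_suffix("/v1/metrics") {
        // Existing configs pointing at .../v1/metrics keep their metrics-only behaviour.
        OtlpTarget {
            root_endpoint: root.trim_end_matches('/').to_string(),
            metrics: true,
            traces: false,
        }
    } else if let Some(root) = trimmed.strip_suffix("/v1/traces") {
        OtlpTarget {
            root_endpoint: root.trim_end_matches('/').to_string(),
            metrics: false,
            traces: true,
        }
    } else {
        // A bare root endpoint enables both signals.
        OtlpTarget {
            root_endpoint: trimmed.to_string(),
            metrics: true,
            traces: true,
        }
    }
}

fn main() {
    // An existing config keeps working exactly as it does today: metrics only.
    let existing = split_otlp_endpoint("http://collector:4318/v1/metrics");
    assert!(existing.metrics && !existing.traces);
    assert_eq!(existing.root_endpoint, "http://collector:4318");

    // Dropping the suffix opts into traces as well.
    let root = split_otlp_endpoint("http://collector:4318");
    assert!(root.metrics && root.traces);
}
```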

Anyway, let me know your thoughts and whether this is something you'd be open to me working on.
