I've been using Rustic and one of the major selling points for me was the native OpenTelemetry integration (enabling me to more easily monitor the collection of backups across my infrastructure). I was, however, somewhat surprised to see that only the `otel/metrics` extension was supported.
I've previously used the `otel/traces` extension on another backup tool to make it significantly easier to gather and analyze contextual information relating to a backup (including how long different stages take, the success/failure of specific stages, stage-specific context, etc).
This is a feature I'd love to see in Rustic, and I'd be very happy to contribute it if you're open to the idea - and no worries if you have a reason you've chosen to avoid doing so.
## How I Propose Implementing This
- I'd use the `tracing` crate and `tracing-opentelemetry` to instrument key functions within `commands::*`.
- I'd use different trace severity levels and control these through the config file using a dedicated `log-level-traces` property.
- I'd update the `opentelemetry` config to point to the root `OTEL_EXPORTER_OTLP_ENDPOINT` rather than the current `OTEL_EXPORTER_OLTP_METRICS_ENDPOINT`-style metrics endpoint (automatically detecting a `/v1/metrics` suffix and stripping it, to maintain backwards compatibility with older configs).
- I'd use events to record information about specific warnings/failures which occur within the scope of a backup run (e.g. "unable to open the file for reading because it is locked by another process" etc).
- I'd keep the trace spans reasonably high-level (representing the customer-facing units of work mapped from the `config.toml` file - things like snapshots etc) to ensure the volume remains reasonable.
- I'd mark the top-level trace span as succeeded/failed based on the process exit code.
- I'd include the trace ID in the logs written by Rustic for a given execution, allowing the logs to be tied back to their corresponding trace if required.
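To make the endpoint handling in the third bullet concrete, here's a rough sketch of the suffix-detection logic using only the standard library. `normalize_otlp_endpoint` and `OtlpConfig` are hypothetical names for illustration, not existing Rustic code:

```rust
/// Hypothetical helper: given the configured endpoint, decide which
/// signals to enable and what base endpoint to use.
/// - ".../v1/metrics" -> metrics only (existing configs keep working)
/// - ".../v1/traces"  -> traces only
/// - anything else    -> both, with suffixes appended by the exporters
#[derive(Debug, PartialEq)]
struct OtlpConfig {
    base: String,
    metrics: bool,
    traces: bool,
}

fn normalize_otlp_endpoint(endpoint: &str) -> OtlpConfig {
    let trimmed = endpoint.trim_end_matches('/');
    if let Some(base) = trimmed.strip_suffix("/v1/metrics") {
        OtlpConfig { base: base.to_string(), metrics: true, traces: false }
    } else if let Some(base) = trimmed.strip_suffix("/v1/traces") {
        OtlpConfig { base: base.to_string(), metrics: false, traces: true }
    } else {
        OtlpConfig { base: trimmed.to_string(), metrics: true, traces: true }
    }
}

fn main() {
    // An existing metrics config keeps its metrics-only behaviour.
    println!("{:?}", normalize_otlp_endpoint("http://collector:4318/v1/metrics"));
    // A new-style root endpoint enables both signals.
    println!("{:?}", normalize_otlp_endpoint("http://collector:4318"));
}
```

This keeps the backwards-compatibility promise explicit: an unchanged `/v1/metrics` config resolves to metrics-only, and only stripping the suffix opts a user into traces.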
## Warnings/Gotchas
- I've personally found that the `tracing-opentelemetry` crate doesn't necessarily update on the same cadence as the `opentelemetry` crate, and the `opentelemetry` crate tends to make breaking changes on a regular basis, so if you're trying to stay on the latest versions it can be a bit of a headache to maintain. That said, I see you're already using the `opentelemetry` crate and are likely already subject to much of this.
- If we change the behaviour of the `opentelemetry` config key (to support its use for both metrics and traces) we run the risk of trying to ship traces to endpoints which don't support them (e.g. Prometheus), and we make it difficult for users to ship metrics and traces to different endpoints. We could bypass this by not using the config to configure tracing (i.e. using environment variables) but I think that would be confusing since we use the config for metrics; and similarly we could add a new key for `opentelemetry-traces` but I think that will lead to confusion around why the other isn't called `opentelemetry-metrics`.
  I think the "best" approach would be to shift to using `opentelemetry` to point to the root endpoint (aka `OTEL_EXPORTER_OTLP_ENDPOINT`) with `/v1/metrics` and `/v1/traces` being added automatically if they are not already present. If the `/v1/metrics` or `/v1/traces` suffix is present, then we only enable that subset of the functionality (i.e. if you've got an existing config, the app continues to work as it does today - but if you strip the `/v1/metrics` suffix you will start receiving traces).
- Traces, depending on the data we export from them, have the potential to expose sensitive information (paths might include PII etc). I think this is primarily a documentation issue (i.e. we should call out the types of information which are exposed through the tracing interface so that operators can plan for this).
- If we use the `tracing` crate, we are likely to get a bunch of tracing coming along "for free" from other crates which implement it as well. This may result in extra spans being emitted beyond the scope of Rustic - and we may wish to consider whether that's useful or not (and at what `log-level-traces` levels those should be visible/hidden).
- If we're shipping traces, we may need to wait for a period of time upon process completion before exiting (determined by the latency for emission, and potentially extending out to a configurable timeout if the backend is slow to respond). This could make the program appear "slow" at the completion of an operation.
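The bounded wait in the last bullet is a generic pattern rather than anything OpenTelemetry-specific. A minimal std-only sketch of the shape (in the real implementation the spawned thread would be the exporter's flush/shutdown call; `flush_spans` and the timing values here are stand-ins):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for "flush remaining spans to the backend".
/// Signals on the channel once the (simulated) export completes.
fn flush_spans(done: mpsc::Sender<()>) {
    thread::spawn(move || {
        // Simulate network latency while the backend accepts the batch.
        thread::sleep(Duration::from_millis(50));
        let _ = done.send(());
    });
}

/// Wait for the flush to finish, but never longer than `timeout`,
/// so a slow or unreachable backend can't hang the process on exit.
/// Returns true if the flush completed within the deadline.
fn shutdown_with_timeout(timeout: Duration) -> bool {
    let (tx, rx) = mpsc::channel();
    flush_spans(tx);
    rx.recv_timeout(timeout).is_ok()
}

fn main() {
    // Generous timeout: flush completes and the process exits promptly.
    println!("flushed: {}", shutdown_with_timeout(Duration::from_secs(1)));
    // Tiny timeout: we give up and exit rather than block indefinitely.
    println!("flushed: {}", shutdown_with_timeout(Duration::from_millis(1)));
}
```

Making the timeout a config property would let operators trade completeness of trace delivery against perceived shutdown latency.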
Anyway, let me know your thoughts and whether this is something you'd be open to me working on.