
Quasi-mature OpenTelemetry integration #223

Draft
Helveg wants to merge 9 commits into main from feature/test-telemetry


@Helveg (Contributor) commented Feb 25, 2026

Describe the work done

WIP branch for @drodarie to experiment with and see whether it helps trace MPI errors. VERY MUCH A DRAFT.

  • Can only write to JSON Lines in a rudimentary fashion; no replay yet.
  • Architecture is all wrong and messy.
  • It is untested whether running spans survive exceptions and the process being killed.

Still, it might already help. The first order of business is to make JSON Lines replay possible, then to make running spans immortal, and then to refactor everything into its own bsb-otel package.
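One way to read "make running spans immortal" is to write a record when a span starts, not only when it ends, so that a span cut short by an exception or a killed MPI rank still leaves a trace behind. The sketch below is a hypothetical, stdlib-only illustration of that idea: the `start`/`end` record layout and the `unfinished_spans` helper are assumptions for illustration, not the format this branch writes.

```python
import json


def unfinished_spans(lines):
    """Return {span_id: name} for spans that started but never ended.

    Hypothetical record layout (an assumption, not this branch's format):
    {"event": "start" | "end", "span_id": "...", "name": "..."}
    """
    open_spans = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # a killed rank may leave a truncated final line
        if record.get("event") == "start":
            open_spans[record["span_id"]] = record.get("name")
        elif record.get("event") == "end":
            open_spans.pop(record["span_id"], None)
    return open_spans


log = [
    '{"event": "start", "span_id": "a1", "name": "compile"}',
    '{"event": "start", "span_id": "b2", "name": "place"}',
    '{"event": "end", "span_id": "b2"}',
    '{"event": "sta',  # truncated write from a killed process
]
print(unfinished_spans(log))  # {'a1': 'compile'}
```

Whatever is left in the result after a crash points at the span that was running when the process died, which is the information an MPI hang or abort investigation usually needs.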

@drodarie, you can already work a bit in parallel with me and try to obtain the JSON Lines logs as follows:

```shell
# Pull the latest commit on this branch,
# then reinstall bsb-core so its package metadata is refreshed:
pip install -e ./packages/bsb-core
# Run your simulation with the instrumentation config
mpirun -n 8 opentelemetry-instrument --traces_exporter jsonlines bsb compile
```
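Once a JSON Lines file is produced, it can be inspected without any replay support. The sketch below assumes (not confirmed in this thread) that the branch's jsonlines exporter writes one span per line in the shape produced by the OpenTelemetry Python SDK's ReadableSpan.to_json(), with top-level "name", "start_time" and "end_time" keys; span_durations is a hypothetical helper.

```python
import json
from datetime import datetime

# Timestamp format used by ReadableSpan.to_json() in the OpenTelemetry Python SDK
ISO_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def span_durations(path):
    """Yield (span name, duration in seconds) for each ended span in a JSON Lines file.

    Assumes one ReadableSpan.to_json() object per line (an assumption about
    the jsonlines exporter on this branch).
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            span = json.loads(line)
            if not span.get("end_time"):
                continue  # span never ended, so it has no duration
            start = datetime.strptime(span["start_time"], ISO_FORMAT)
            end = datetime.strptime(span["end_time"], ISO_FORMAT)
            yield span["name"], (end - start).total_seconds()
```

For example, sorted(span_durations("rank0.jsonl"), key=lambda s: s[1], reverse=True)[:10] would list the ten slowest spans; the file name here is hypothetical, since the thread does not say where each MPI rank writes its output.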

@Helveg marked this pull request as draft on February 25, 2026, 10:55
@drodarie (Contributor) commented

@Helveg I tried to get OpenTelemetry logs from the BSB reconstruction on the HPC with the following command:
```shell
mpirun -n 48 opentelemetry-instrument --traces_exporter jsonlines bsb compile divergent_declive_column.yaml -v4 --clear
```
Unfortunately, the batch job ran to completion, but I did not get any JSON Lines files, and nothing in the logs explains why...

@Helveg (Contributor, Author) commented Mar 13, 2026

Can you try some smaller debug examples first?

  • Simple single-core bsb compile on the login node with an empty skeleton config
  • Multi-core skeleton compile
  • ...

If none of them work, I think I have a CINECA login, or I can use yours (but I keep forgetting how, hahaha :D), and then I can try to debug the code there.

@drodarie (Contributor) commented

OK, I have a small test. The following command produces a JSON Lines file on my local PC but not on the CINECA login node:

```shell
# bla.yaml does not exist
opentelemetry-instrument --traces_exporter jsonlines bsb compile bla.yaml -v4
```

Should we look into differences between the Python libraries installed on each machine?
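A quick way to compare the two environments is to dump every installed OpenTelemetry distribution with its version on each machine and diff the outputs (pip list | grep opentelemetry does roughly the same from the shell). A minimal stdlib sketch; otel_versions is a hypothetical helper name:

```python
from importlib.metadata import distributions


def otel_versions(prefix="opentelemetry"):
    """Map installed distribution names starting with `prefix` to their versions."""
    found = {}
    for dist in distributions():
        name = dist.metadata["Name"]
        if name and name.lower().startswith(prefix):
            found[name] = dist.version
    return dict(sorted(found.items()))


if __name__ == "__main__":
    for name, version in otel_versions().items():
        print(f"{name:40} {version}")
```

Running this inside the same virtual environment that mpirun launches matters: a login-node shell and the compute-node job can easily resolve different interpreters.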

@drodarie (Contributor) commented

OK, it seems I was missing some OpenTelemetry libraries, even though the code ran without throwing any errors.
Now it produces files on CINECA, so I will try again.
@Helveg, could you confirm which libraries are necessary?

```text
opentelemetry-api                        1.40.0
opentelemetry-distro                     0.61b0
opentelemetry-exporter-otlp              1.40.0
opentelemetry-exporter-otlp-proto-common 1.40.0
opentelemetry-exporter-otlp-proto-grpc   1.40.0
opentelemetry-exporter-otlp-proto-http   1.40.0
opentelemetry-instrumentation            0.61b0
opentelemetry-proto                      1.40.0
opentelemetry-sdk                        1.40.0
opentelemetry-semantic-conventions       0.61b0
```

@Helveg (Contributor, Author) commented Mar 13, 2026

Let's just say all of them. It's hard to tell because the OTel package ecosystem for Python specifically is a MESS. It's multiple monorepos on GitHub that all contain namespace packages for multiple namespaces 🥲 so it's really hard to tell what comes from which package. But I'll fix that as part of the move to a bsb-otel package!
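Because all of these distributions install into the shared opentelemetry namespace, one way to see which distribution actually provides a given subpackage is to scan each distribution's installed file record. A minimal stdlib sketch; owners is a hypothetical helper, not part of any OTel package:

```python
from importlib.metadata import distributions


def owners(path_prefix):
    """Map distribution names to their installed files under `path_prefix`.

    `path_prefix` is a package path relative to site-packages,
    e.g. "opentelemetry/sdk/trace/".
    """
    result = {}
    for dist in distributions():
        files = dist.files or []  # None when the RECORD file is missing
        hits = [str(f) for f in files if str(f).startswith(path_prefix)]
        if hits:
            result[dist.metadata["Name"]] = sorted(hits)
    return result
```

For example, owners("opentelemetry/exporter/") shows which exporter distributions contribute files there. For a coarser view, importlib.metadata.packages_distributions() (Python 3.10+) maps the top-level name opentelemetry to every distribution that contributes to it.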
