feat: expose more metrics on the relay server #4085

Open
Frando wants to merge 5 commits into main from Frando/relay-metrics

Conversation

Member

@Frando Frando commented Apr 8, 2026

Description

Adds more metrics to the relay server:

  • Metrics for TCP connections
  • Metrics for QUIC connections
  • Metrics for inactive clients

Also slightly improves logging for QUIC connections.

Breaking Changes

Notes & open questions

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant.
  • All breaking changes documented.
    • List all breaking changes in the above "Breaking Changes" section.
    • Open an issue or PR on any number0 repos that are affected by this breaking change. Give guidance on how the updates should be handled or do the actual updates themselves. The major ones are:

github-actions bot commented Apr 8, 2026

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/4085/docs/iroh/

Last updated: 2026-04-09T10:47:35Z

github-actions bot commented Apr 8, 2026

Netsim report & logs for this PR have been generated and are available at: LOGS
This report will remain available for 3 days.

Last updated for commit: dac9284

Comment on lines +70 to +77
/// Number of accepted QUIC connections.
pub quic_accepted: Counter,
/// Number of terminated QUIC connections.
pub quic_disconnected: Counter,
/// Number of QUIC connections that terminated with an error.
///
/// The number is *included* in `quic_disconnected` (not in addition to).
pub quic_disconnected_error: Counter,
Contributor

Can we call these qad_?

Member Author

done

Comment on lines +78 to +85
/// Number of accepted TCP connections.
pub tcp_accepted: Counter,
/// Number of terminated TCP connections.
pub tcp_disconnected: Counter,
/// Number of TCP connections that terminated with an error.
///
/// The number is *included* in `tcp_disconnected` (not in addition to).
pub tcp_disconnected_error: Counter,
Contributor

And these http_ I think.

Member Author

done
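
The "included, not in addition to" rule from the doc comments above can be sketched as follows. This is only an illustration under stated assumptions: `Counter` here is a hypothetical stand-in built on `AtomicU64`, not the relay's actual counter type, and `ConnMetrics`, `on_disconnect`, and `active` are made-up names. Every terminated connection bumps `disconnected`, and failed ones additionally bump `disconnected_error`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for a metrics counter (assumption, not the relay's type).
#[derive(Debug, Default)]
pub struct Counter(AtomicU64);

impl Counter {
    /// Increments the counter by one.
    pub fn inc(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }

    /// Returns the current value.
    pub fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Hypothetical group of connection counters, mirroring the three fields under review.
#[derive(Debug, Default)]
pub struct ConnMetrics {
    pub accepted: Counter,
    pub disconnected: Counter,
    pub disconnected_error: Counter,
}

impl ConnMetrics {
    /// Records the end of a connection.
    ///
    /// `disconnected` counts every termination; `disconnected_error` counts
    /// the failed subset, so it never exceeds `disconnected`.
    pub fn on_disconnect<E>(&self, result: &Result<(), E>) {
        self.disconnected.inc();
        if result.is_err() {
            self.disconnected_error.inc();
        }
    }

    /// Connections accepted but not yet terminated.
    pub fn active(&self) -> u64 {
        self.accepted.get().saturating_sub(self.disconnected.get())
    }
}
```

With this shape, the number of clean terminations would be `disconnected - disconnected_error`, and a dashboard can derive the error ratio without double counting.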

@Frando Frando requested a review from flub April 8, 2026 09:29
@Frando Frando force-pushed the Frando/relay-metrics branch from ef370b3 to 53a31a0 Compare April 8, 2026 09:31
@n0bot n0bot bot added this to iroh Apr 8, 2026
@github-project-automation github-project-automation bot moved this to 🚑 Needs Triage in iroh Apr 8, 2026
@github-project-automation github-project-automation bot moved this from 🚑 Needs Triage to 🏗 In progress in iroh Apr 8, 2026
Comment on lines +71 to +77
pub clients_inactive_add: Counter,

/// Number of times a client was removed from the inactive state.
///
/// Happens when a client disconnects while being inactive, or if a client is upgraded to be
/// active again (happens only when the currently-active client for that endpoint id disconnects).
pub clients_inactive_remove: Counter,
Contributor

Below you use the passive ("disconnected") and here you use the active. "added" and "removed" make more sense, I think.

Member Author

sgtm, renamed.
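
To make the two removal cases from the doc comment concrete, here is a rough sketch of per-endpoint bookkeeping. Everything in it is hypothetical: `EndpointClients` is not the relay's data structure, connection ids are plain `u64` values, and the rule that a newly connecting client demotes the previously active one is an assumption made for illustration only. It shows only when the "added" and "removed" counters would tick.

```rust
/// Hypothetical bookkeeping for one endpoint id (illustration only).
#[derive(Debug, Default)]
pub struct EndpointClients {
    /// Connection id of the currently active client, if any.
    active: Option<u64>,
    /// Connection ids of inactive clients, oldest first.
    inactive: Vec<u64>,
    /// Times a client entered the inactive state.
    pub inactive_added: u64,
    /// Times a client left the inactive state.
    pub inactive_removed: u64,
}

impl EndpointClients {
    /// Assumption: a new connection becomes active and demotes the previous one.
    pub fn connect(&mut self, conn: u64) {
        if let Some(prev) = self.active.replace(conn) {
            self.inactive.push(prev);
            self.inactive_added += 1;
        }
    }

    /// Handles a disconnect of either the active or an inactive client.
    pub fn disconnect(&mut self, conn: u64) {
        if self.active == Some(conn) {
            // The active client left: promote the most recent inactive one, if any.
            self.active = self.inactive.pop();
            if self.active.is_some() {
                self.inactive_removed += 1;
            }
        } else if let Some(pos) = self.inactive.iter().position(|&c| c == conn) {
            // An inactive client disconnected while still inactive.
            self.inactive.remove(pos);
            self.inactive_removed += 1;
        }
    }

    /// Clients currently inactive, derived from the two counters.
    pub fn inactive_now(&self) -> u64 {
        self.inactive_added - self.inactive_removed
    }
}
```

Both removal paths increment the same counter, so `inactive_added - inactive_removed` stays equal to the number of clients currently in the inactive state.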

///
/// After completion, each is counted in qad_accepted_disconnected.
/// The number of active connections is qad_accepted - qad_accepted_disconnected.
pub qad_accepted: Counter,
Contributor

IIUC this is the number of connections that existed? Why not qad_connections?

So qad_incoming = qad_incoming_disconnected + qad_accepted, is that correct?

And qad_accepted - qad_accepted_disconnected is the number of currently connected qad connections?

While qad_accepted_disconnected_error <= qad_accepted_disconnected?

The naming is weird, I can't think of what is more conventional right now. Maybe:

  • qad_incoming
  • qad_incoming_error
  • qad_connections
  • qad_connections_closed
  • qad_connections_errored (still a subset of closed, needs to be clearly documented)

The "usual" thing is to have a metric qad_conn_closed with a status field. But it seems that PR still hasn't been merged.

Anyway, would also like @Arqu 's opinion.
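
For readers unfamiliar with the "closed counter with a status field" convention mentioned above, a minimal sketch might look like this. It is not an API of any metrics library and not what this PR implements: `CloseStatus` and `LabeledCounter` are hypothetical names, and a plain `HashMap` stands in for a real labelled metric family.

```rust
use std::collections::HashMap;

/// Hypothetical status label for a closed connection.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum CloseStatus {
    Ok,
    Error,
}

/// Hypothetical counter keyed by a status label (illustration only).
#[derive(Debug, Default)]
pub struct LabeledCounter {
    counts: HashMap<CloseStatus, u64>,
}

impl LabeledCounter {
    /// Increments the series for the given status.
    pub fn inc(&mut self, status: CloseStatus) {
        *self.counts.entry(status).or_insert(0) += 1;
    }

    /// Returns the count for one status, or zero if it was never recorded.
    pub fn get(&self, status: CloseStatus) -> u64 {
        self.counts.get(&status).copied().unwrap_or(0)
    }

    /// Returns the total across all statuses.
    pub fn total(&self) -> u64 {
        self.counts.values().sum()
    }
}
```

In this layout the statuses are disjoint and the total is their sum, whereas the separate `_errored` counter proposed in the list above is a subset of the closed counter and has to be documented as such.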

Member Author

@Frando Frando Apr 8, 2026

> So qad_incoming = qad_incoming_disconnected + qad_accepted, is that correct?

If none are in flight, yes. More precisely:
qad_incoming = qad_incoming_disconnected + qad_accepted + qad_inflight
i.e.
qad_inflight = qad_incoming - qad_incoming_disconnected - qad_accepted
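
The relation above can be checked with a small sketch over a snapshot of the counters. The field names follow the names used in this thread at this point in the review (they were renamed later), and `QadSnapshot` itself is a hypothetical helper, not part of the PR. Saturating subtraction is an assumption here, chosen so that counters read at slightly different instants cannot underflow.

```rust
/// Hypothetical point-in-time snapshot of the QAD counters discussed above.
#[derive(Debug, Clone, Copy, Default)]
pub struct QadSnapshot {
    pub incoming: u64,
    pub incoming_disconnected: u64,
    pub accepted: u64,
    pub accepted_disconnected: u64,
}

impl QadSnapshot {
    /// Incoming connections that have neither failed nor been accepted yet:
    /// inflight = incoming - incoming_disconnected - accepted.
    pub fn inflight(&self) -> u64 {
        self.incoming
            .saturating_sub(self.incoming_disconnected)
            .saturating_sub(self.accepted)
    }

    /// Accepted connections that are still open:
    /// active = accepted - accepted_disconnected.
    pub fn active(&self) -> u64 {
        self.accepted.saturating_sub(self.accepted_disconnected)
    }
}
```

For example, a snapshot with 10 incoming, 2 incoming_disconnected, 7 accepted, and 3 accepted_disconnected gives 1 in-flight and 4 active connections.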

Member Author

@flub I pushed a commit with renames and expanded docs.

@Frando Frando requested a review from flub April 8, 2026 13:50
@Frando Frando force-pushed the Frando/relay-metrics branch from 821563f to d9bd1ba Compare April 9, 2026 10:45
Contributor

@flub flub left a comment

lgtm, would still like @Arqu 's opinion

// TODO: only important stat that we cannot track right now
// pub average_queue_duration:
//
/// Number of incoming QUIC connections.
Contributor

Suggested change
- /// Number of incoming QUIC connections.
+ /// Number of incoming QAD connections.

@flub flub requested a review from Arqu April 9, 2026 14:02
Collaborator

@Arqu Arqu left a comment

No complaints from my side

Projects

Status: 🏗 In progress

3 participants