fix: stabilize analytics sample cohort #5

Open
silentgeckoaudit3801 wants to merge 1 commit into sonic-mast:main from silentgeckoaudit3801:main

@silentgeckoaudit3801

Closes #3.

Sorts agents by verification timestamp and BTC address before selecting the 20-agent sample. This makes the cohort deterministic when the API changes result ordering, preventing cumulative estimates from regressing solely because a different cohort was selected.

The existing public data demonstrates the failure: total_messages fell 651 -> 602 and later 602 -> 553.
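A minimal sketch of why the sort makes the selection independent of API ordering. The helper name `selectStableSample` and the sample records are made up for illustration; it assumes `verifiedAt` is an ISO-8601 string and `btcAddress` is unique per agent, which this PR does not confirm.

```javascript
// Hypothetical helper: picks the same cohort regardless of input order.
// Assumes verifiedAt is an ISO-8601 string, so string order matches time order.
function selectStableSample(agents, sampleSize) {
  return [...agents] // copy so the caller's array is not mutated
    .sort((a, b) =>
      a.verifiedAt.localeCompare(b.verifiedAt) ||
      a.btcAddress.localeCompare(b.btcAddress)) // tie-break on address
    .slice(0, sampleSize);
}

// Made-up records, not real agents.
const agents = [
  { btcAddress: 'bc1qaaa', verifiedAt: '2025-01-03T00:00:00Z' },
  { btcAddress: 'bc1qbbb', verifiedAt: '2025-01-01T00:00:00Z' },
  { btcAddress: 'bc1qccc', verifiedAt: '2025-01-02T00:00:00Z' },
  { btcAddress: 'bc1qddd', verifiedAt: '2025-01-01T00:00:00Z' },
];

// Simulate the API returning the same agents in a different order.
const reordered = [...agents].reverse();

const first = selectStableSample(agents, 2).map((a) => a.btcAddress);
const second = selectStableSample(reordered, 2).map((a) => a.btcAddress);

// Both selections contain bc1qbbb and bc1qddd, in that order.
console.log(first, second);
```

Without the sort, `agents.slice(0, 2)` and `reordered.slice(0, 2)` would pick different agents, which is the kind of cohort change that can make a cumulative estimate drop.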

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request stabilizes the sampled cohort of agents in scripts/collect-metrics.js by sorting them by verification timestamp (verifiedAt) and Bitcoin address before slicing. The reviewer identified a critical selection bias issue with this approach: sorting by verifiedAt in ascending order permanently limits the sample to the oldest agents. They recommended a deterministic, unbiased hashed sampling method based on the SHA-256 hash of the Bitcoin address to ensure fair representation.

Comment on lines +37 to +40
// Keep the cohort stable when the API changes its result ordering.
const sample = [...agents]
  .sort((a, b) => a.verifiedAt.localeCompare(b.verifiedAt) || a.btcAddress.localeCompare(b.btcAddress))
  .slice(0, sampleSize);

high

Methodological Issue: Selection Bias in Cohort Sampling

Sorting the agents by verifiedAt ascending means the sample cohort is permanently frozen to the 20 oldest registered agents in the network. As the network grows, new agents are completely excluded from the sample.

This introduces a severe selection bias because:

  1. The behavior of the oldest 20 agents is extrapolated to the entire network using scaleFactor = agents.length / sample.length.
  2. If these 20 oldest agents become inactive or have unusually high activity compared to the rest of the network, the extrapolated metrics (total messages, sats received/sent) will be highly inaccurate.
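To make the extrapolation concern concrete, here is a small sketch with invented numbers. The helper `extrapolateTotal` and the `messages` field are hypothetical; only the `scaleFactor = agents.length / sample.length` formula comes from the comment above.

```javascript
// Hypothetical helper: scales a sample sum up to the whole network,
// using scaleFactor = agents.length / sample.length.
function extrapolateTotal(agents, sample, field) {
  const scaleFactor = agents.length / sample.length;
  const sampleSum = sample.reduce((sum, agent) => sum + agent[field], 0);
  return Math.round(sampleSum * scaleFactor);
}

// Invented network: 2 early agents that are very active, 8 later quiet ones.
const agents = [
  ...Array.from({ length: 2 }, (_, i) => ({ id: `old-${i}`, messages: 100 })),
  ...Array.from({ length: 8 }, (_, i) => ({ id: `new-${i}`, messages: 5 })),
];

// Ground truth over all agents: 2 * 100 + 8 * 5 = 240.
const trueTotal = agents.reduce((sum, agent) => sum + agent.messages, 0);

// Sampling only the oldest agents: (100 + 100) * (10 / 2) = 1000.
const oldestOnly = agents.slice(0, 2);
const estimate = extrapolateTotal(agents, oldestOnly, 'messages');

console.log({ trueTotal, estimate });
```

In this toy case the oldest-only cohort overestimates the total by roughly 4x. If the early agents were inactive instead, the same formula would underestimate.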

Recommended Solution: Deterministic, Unbiased Hashed Sampling

To keep the cohort stable and deterministic without biasing it to the oldest agents, we should sort by a property that is randomly distributed across all agents, such as a hash of their btcAddress.

Using a hash (like SHA-256) is crucial because:

  • It ensures that new agents have an equal chance of being sampled.
  • It avoids address format bias (e.g., legacy addresses starting with 1 sorting before Bech32 addresses starting with bc1 lexicographically).

Suggested change
- // Keep the cohort stable when the API changes its result ordering.
- const sample = [...agents]
-   .sort((a, b) => a.verifiedAt.localeCompare(b.verifiedAt) || a.btcAddress.localeCompare(b.btcAddress))
-   .slice(0, sampleSize);
+ // Keep the cohort stable and unbiased by sorting by the hash of the BTC address.
+ const crypto = require('crypto');
+ const getHash = (addr) => crypto.createHash('sha256').update(addr || '').digest('hex');
+ const sample = [...agents]
+   .sort((a, b) => getHash(a.btcAddress).localeCompare(getHash(b.btcAddress)))
+   .slice(0, sampleSize);
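One possible variant of the hashed approach, sketched under assumptions: it computes each hash once before sorting instead of inside the comparator, and compares the hex digests with plain string operators so the order does not depend on locale settings. The names `sha256Hex` and `selectHashedSample` and the example addresses are made up for illustration and are not part of this PR.

```javascript
const crypto = require('crypto');

// Hypothetical helper: hex-encoded SHA-256 of a value (missing values hash as '').
function sha256Hex(value) {
  return crypto.createHash('sha256').update(String(value ?? '')).digest('hex');
}

// Hypothetical helper: picks the sampleSize agents with the smallest address hashes.
function selectHashedSample(agents, sampleSize) {
  return agents
    .map((agent) => ({ agent, key: sha256Hex(agent.btcAddress) })) // hash once per agent
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0)) // locale-independent
    .slice(0, sampleSize)
    .map(({ agent }) => agent);
}

// Made-up placeholder addresses in different formats, not real ones.
const agents = [
  { btcAddress: '1ExampleLegacyAddr' },
  { btcAddress: '3ExampleScriptAddr' },
  { btcAddress: 'bc1qexamplesegwit' },
  { btcAddress: 'bc1pexampletaproot' },
];

const sampleA = selectHashedSample(agents, 2).map((a) => a.btcAddress);
const sampleB = selectHashedSample([...agents].reverse(), 2).map((a) => a.btcAddress);

// Same cohort regardless of input order.
console.log(JSON.stringify(sampleA) === JSON.stringify(sampleB)); // prints true
```

Note that a hashed cohort is stable but not frozen: a newly added agent whose hash falls among the lowest `sampleSize` values displaces one existing member, so cumulative estimates may still shift slightly when that happens.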

Development

Successfully merging this pull request may close these issues.

Cumulative total_messages can decrease when the sampled agent cohort changes