add protection for potentially invalid disk data #2105

Open
chadpatel wants to merge 1 commit into aws:main from chadpatel:invalid-disk-data-workaround
@chadpatel
Contributor

Description of the issue

The systemmetricsreceiver disk scraper computes aggregate_disk_free and aggregate_disk_used by summing across all /dev/-prefixed mounts. On Linux, gopsutil's disk.UsageWithContext derives byte counts from the kernel's statfs syscall by multiplying a block count by the block size: Total is uint64(stat.Blocks) * uint64(stat.Bsize), and Free is computed the same way from the available-block count (stat.Bavail). Neither product is validated. Certain filesystem states (transient loop mounts, mounts under heavy I/O pressure, or broken FUSE devices) can cause statfs to return garbage values where Free > Total or both values are near 2^63. When these are summed into the aggregate, the resulting metric is nonsensical and poisons downstream min/max rollups.
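To make the arithmetic concrete, here is a minimal, Linux-oriented sketch of the block-count multiplication using the standard library's syscall.Statfs. It is not gopsutil's code: bytesFromBlocks is a hypothetical helper, and taking Free from Bavail is an assumption about which field is used. The point it illustrates is that the multiplication is unchecked, so a garbage block count passes straight through or silently wraps.

```go
package main

import (
	"fmt"
	"syscall"
)

// bytesFromBlocks is a hypothetical helper: it multiplies a block count by
// the block size with no validation, so bad input yields bad (or wrapped) output.
func bytesFromBlocks(blocks uint64, bsize int64) uint64 {
	return blocks * uint64(bsize)
}

func main() {
	var st syscall.Statfs_t
	if err := syscall.Statfs("/", &st); err != nil {
		fmt.Println("statfs failed:", err)
		return
	}
	total := bytesFromBlocks(uint64(st.Blocks), int64(st.Bsize))
	// Assumption: free space is taken from the available-block count.
	free := bytesFromBlocks(uint64(st.Bavail), int64(st.Bsize))
	fmt.Println("total:", total, "free:", free, "free <= total:", free <= total)

	// A garbage block count wraps around uint64 without any error.
	fmt.Println(bytesFromBlocks(1<<62, 4096)) // 0
}
```

On a healthy mount the last field of the first line prints true; the second line shows that nothing in the multiplication itself can flag a corrupt statfs result.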

Observed in production on an EC2 host with a nearly full root XFS volume: 5 of ~2100 one-minute samples reported aggregate_disk_free ≈ -9.2e18 (about -2^63, the Long.MIN_VALUE range after float64 conversion). Values like this can break downstream consumers that don't expect negative or astronomically large byte counts.

Description of changes

Adds a per-mount plausibility check (isPlausibleDiskUsage) before accumulating into the aggregate sum. A mount is rejected if:

  • Total == 0 (statfs failure or pseudo-filesystem)
  • Free > Total (physically impossible — the exact signature of the observed bug)
  • Total or Free exceeds 1 PiB (1 << 50 bytes, ~16x the largest EBS volume)
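The three rejection rules above can be sketched as a small predicate. This is an illustration rather than the PR's actual code: the signature of isPlausibleDiskUsage and the constant name maxPlausibleBytes are assumptions.

```go
package main

import "fmt"

// maxPlausibleBytes is the 1 PiB cap from the description (name is hypothetical).
const maxPlausibleBytes uint64 = 1 << 50

// isPlausibleDiskUsage reports whether a mount's totals pass all three checks.
// The two-argument signature is assumed for illustration.
func isPlausibleDiskUsage(total, free uint64) bool {
	if total == 0 {
		return false // failed stat call or pseudo-filesystem
	}
	if free > total {
		return false // physically impossible
	}
	// The free check is implied by the two above; it is kept to mirror the rules as listed.
	if total > maxPlausibleBytes || free > maxPlausibleBytes {
		return false
	}
	return true
}

func main() {
	fmt.Println(isPlausibleDiskUsage(100<<30, 20<<30)) // true: ordinary 100 GiB volume
	fmt.Println(isPlausibleDiskUsage(100<<30, 0))      // true: full disk
	fmt.Println(isPlausibleDiskUsage(1<<50, 1<<40))    // true: exactly at the cap
	fmt.Println(isPlausibleDiskUsage(0, 0))            // false: zero total
	fmt.Println(isPlausibleDiskUsage(4096, 1<<62))     // false: free > total
	fmt.Println(isPlausibleDiskUsage(1<<50+1, 0))      // false: above the cap
}
```

Because the cap comparison uses "exceeds", a mount of exactly 1 PiB is accepted.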

When any mount fails the check, the entire sample is dropped rather than emitting a partial sum. CloudWatch handles missing datapoints correctly; it does not handle wrong datapoints correctly — a partial sum silently understates free space and poisons min() aggregations.
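The all-or-nothing behavior can be sketched as follows. The mountUsage type, the aggregateDiskUsage function, and the inlined plausible helper are hypothetical names for illustration; the scraper's real types and control flow may differ.

```go
package main

import "fmt"

// mountUsage is a hypothetical per-mount record, in bytes.
type mountUsage struct {
	Total, Free, Used uint64
}

// plausible applies the checks described above with a 1 PiB cap.
func plausible(m mountUsage) bool {
	const maxBytes uint64 = 1 << 50
	return m.Total != 0 && m.Free <= m.Total && m.Total <= maxBytes
}

// aggregateDiskUsage sums free and used bytes across mounts. If any mount
// fails the check it returns ok == false so the caller emits no datapoint
// at all, instead of a partial sum.
func aggregateDiskUsage(mounts []mountUsage) (free, used uint64, ok bool) {
	for _, m := range mounts {
		if !plausible(m) {
			return 0, 0, false
		}
		free += m.Free
		used += m.Used
	}
	return free, used, true
}

func main() {
	good := []mountUsage{{Total: 100, Free: 40, Used: 60}, {Total: 50, Free: 10, Used: 40}}
	fmt.Println(aggregateDiskUsage(good)) // 50 100 true

	bad := append(good, mountUsage{Total: 4096, Free: 1 << 62})
	fmt.Println(aggregateDiskUsage(bad)) // 0 0 false
}
```

Returning early rather than skipping the bad mount is the design choice the description argues for: a skipped mount would produce a sum that looks valid but understates free space.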

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

  • Added TestDiskScraperDropsSampleWhenFreeExceedsTotal — reproduces the exact bug signature (tiny total, huge free)
  • Added TestDiskScraperDropsSampleWhenTotalExceedsCap — mount above 1 PiB cap
  • Added TestDiskScraperDropsSampleWhenTotalIsZero — defense-in-depth (already filtered by DiskUsage())
  • Added TestIsPlausibleDiskUsage — table-driven test covering 6 cases (normal, full disk, at cap, zero total, free > total, above cap)
  • All existing disk scraper tests pass with Total field added to test fixtures
  • Full package: go test ./receiver/systemmetricsreceiver/ — all disk/plausibility tests pass

Requirements

  1. Run make fmt and make fmt-sh
  2. Run make lint

@chadpatel chadpatel requested a review from a team as a code owner April 29, 2026 16:02
@zhihonl zhihonl added the "ready for testing" label (indicates this PR is ready for integration tests to run) Apr 29, 2026