add protection for potentially invalid disk data #2105
zhihonl approved these changes Apr 29, 2026
sky333999 approved these changes May 4, 2026
Description of the issue
The systemmetricsreceiver disk scraper aggregates aggregate_disk_free and aggregate_disk_used by summing across all /dev/-prefixed mounts. On Linux, gopsutil's disk.UsageWithContext computes Total as uint64(stat.Blocks) * uint64(stat.Bsize) and Free from the free-block count times the same block size, both taken from the kernel's statfs syscall. Certain filesystem states (transient loop mounts, mounts under heavy I/O pressure, or broken FUSE devices) can cause statfs to return garbage values where Free > Total or both values are near 2^63. When these are summed into the aggregate, the resulting metric is nonsensical and poisons downstream min/max rollups.

Observed in production on an EC2 host with a nearly-full root XFS volume: 5 out of ~2100 one-minute samples reported aggregate_disk_free ≈ -9.2e18 (Long.MIN_VALUE range after float64 conversion), which can break downstream consumers that don't expect negative or astronomically large byte values.
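As a purely illustrative, self-contained sketch (not code from the agent; the mount sizes and the signed reinterpretation step are assumptions), this shows how a single corrupted mount reporting Free near 2^63 pushes the unsigned sum into a value that reads as roughly -9.2e18 once a downstream consumer interprets it as a signed 64-bit integer:

```go
package main

import "fmt"

func main() {
	// Two plausible mounts plus one corrupted entry whose Free is near 2^63.
	freeBytes := []uint64{
		50 << 30,  // 50 GiB root volume
		200 << 30, // 200 GiB data volume
		1<<63 - 1, // corrupted mount reporting ~9.2e18 bytes free
	}

	var aggregate uint64
	for _, f := range freeBytes {
		aggregate += f
	}

	// A consumer that stores this in a signed 64-bit field sees a number in the
	// Long.MIN_VALUE range instead of a sane aggregate of a few hundred GiB.
	fmt.Printf("unsigned sum: %d\n", aggregate)
	fmt.Printf("as int64:     %d\n", int64(aggregate))
}
```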
Description of changes
Adds a per-mount plausibility check (isPlausibleDiskUsage) before accumulating into the aggregate sum. A mount is rejected if:
- Total == 0 (statvfs failure or pseudo-filesystem)
- Free > Total (physically impossible; the exact signature of the observed bug)
- Total or Free exceeds 1 PiB (1 << 50 bytes, ~16x the largest EBS volume)

When any mount fails the check, the entire sample is dropped rather than emitting a partial sum. CloudWatch handles missing datapoints correctly; it does not handle wrong datapoints correctly: a partial sum silently understates free space and poisons min() aggregations.
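A minimal sketch of the check and the drop-whole-sample behavior described above, assuming gopsutil v3's disk.UsageStat as the input type; the package name, helper names, and wiring into the scraper are illustrative, not the receiver's actual code:

```go
package diskscraper

import "github.com/shirou/gopsutil/v3/disk"

// maxPlausibleBytes is 1 PiB, well above the largest real EBS volume.
const maxPlausibleBytes = uint64(1) << 50

// isPlausibleDiskUsage rejects mounts whose statfs-derived numbers cannot be real.
func isPlausibleDiskUsage(u *disk.UsageStat) bool {
	if u == nil || u.Total == 0 {
		return false // statvfs failure or pseudo-filesystem
	}
	if u.Free > u.Total {
		return false // physically impossible; the observed bug signature
	}
	if u.Total > maxPlausibleBytes || u.Free > maxPlausibleBytes {
		return false // implausibly large for any real volume
	}
	return true
}

// aggregateDiskFree sums Free across mounts, or reports ok=false to drop the
// whole sample when any mount fails the plausibility check.
func aggregateDiskFree(mounts []*disk.UsageStat) (total uint64, ok bool) {
	for _, u := range mounts {
		if !isPlausibleDiskUsage(u) {
			return 0, false // drop the sample rather than emit a partial sum
		}
		total += u.Free
	}
	return total, true
}
```

Returning ok=false for the whole sample mirrors the rationale above: a missing datapoint is recoverable downstream, while a plausible-looking but partial sum is not.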
License
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Tests
- TestDiskScraperDropsSampleWhenFreeExceedsTotal: reproduces the exact bug signature (tiny total, huge free)
- TestDiskScraperDropsSampleWhenTotalExceedsCap: mount above the 1 PiB cap
- TestDiskScraperDropsSampleWhenTotalIsZero: defense in depth (already filtered by DiskUsage())
- TestIsPlausibleDiskUsage: table-driven test covering 6 cases (normal, full disk, at cap, zero total, free > total, above cap); a sketch follows below
- Total field added to test fixtures
- go test ./receiver/systemmetricsreceiver/ passes all disk/plausibility tests
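A sketch of the table-driven test, paired with the isPlausibleDiskUsage sketch above; the case names and expected outcomes follow that sketch rather than the PR's exact fixtures:

```go
package diskscraper

import (
	"testing"

	"github.com/shirou/gopsutil/v3/disk"
)

func TestIsPlausibleDiskUsage(t *testing.T) {
	const pib = uint64(1) << 50
	cases := []struct {
		name  string
		usage disk.UsageStat
		want  bool
	}{
		{"normal", disk.UsageStat{Total: 100 << 30, Free: 40 << 30}, true},
		{"full disk", disk.UsageStat{Total: 100 << 30, Free: 0}, true},
		{"at cap", disk.UsageStat{Total: pib, Free: pib}, true},
		{"zero total", disk.UsageStat{Total: 0, Free: 0}, false},
		{"free exceeds total", disk.UsageStat{Total: 10 << 30, Free: 1 << 62}, false},
		{"above cap", disk.UsageStat{Total: pib + 1, Free: 0}, false},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			if got := isPlausibleDiskUsage(&tc.usage); got != tc.want {
				t.Errorf("isPlausibleDiskUsage(%+v) = %v, want %v", tc.usage, got, tc.want)
			}
		})
	}
}
```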
Requirements
- make fmt and make fmt-sh
- make lint