feat(disruption): disk full injection. #1058

Draft
Zenithar wants to merge 2 commits into main from zenithar/chaos-controller/disk_full_disruption

@Zenithar Zenithar commented Apr 8, 2026

What does this PR do?

  • Adds new functionality

Adds a new diskFull disruption kind that genuinely fills a target pod volume using the fallocate(2) syscall, causing real ENOSPC errors on subsequent write operations. This fills a gap: the existing disruptions (DiskPressure = I/O throttling, DiskFailure = eBPF on openat only) don't simulate actual disk exhaustion that is visible to monitoring and affects every write-path syscall.

Features

  • Volume fill (v1): Creates a ballast file via fallocate(2) syscall (instant, O(1) on ext4/xfs) to genuinely consume disk space. Falls back to writing zeros on unsupported filesystems.
  • eBPF write interception (v2, optional): Launches an eBPF program to intercept write syscalls and return configurable error codes (ENOSPC, EDQUOT, EIO, etc.) with probability-based failure injection.
  • Safety: A 1Mi minimum free-space floor is kept by default (overridable via unsafeMode.allowDiskFullNoFloor). The disruption is supported at pod level only, and the webhook emits a warning about ephemeral-storage eviction risk.
  • Pure Go fallocate: Vendored fallocate/ package (adapted from detailyang/go-fallocate, MIT) — no dependency on fallocate or dd binaries in the injector image.

How it differs from existing disruptions

| Disruption | Mechanism | ENOSPC on writes? | Visible to df/monitoring? |
|---|---|---|---|
| Disk Pressure | Cgroup blkio throttling | No | No |
| Disk Failure | eBPF on openat only | Only on file open | No |
| Disk Full (new) | Real space allocation + optional eBPF | Yes (all syscalls) | Yes |

Example

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: disk-full-test
spec:
  selector:
    app: my-service
  count: 1
  level: pod
  duration: 10m
  diskFull:
    path: "/data"
    capacity: "95%"
    # Optional: eBPF write interception
    writeSyscall:
      exitCode: ENOSPC
      probability: "50%"
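
To make the capacity setting concrete, here is a rough sketch of how a ballast size could be derived from a target percentage and the 1Mi floor. It assumes that capacity means "target used space as a percentage of the volume's total size", which the description does not spell out. The function names and the way total and free space are obtained are hypothetical.

```go
package main

import "fmt"

// minFreeFloor mirrors the 1Mi minimum free-space floor from the description.
const minFreeFloor uint64 = 1 << 20

// percentOf returns total*pct/100 without overflowing uint64.
func percentOf(total, pct uint64) uint64 {
	return (total/100)*pct + (total%100)*pct/100
}

// ballastBytes (hypothetical) returns how many bytes to allocate so that used
// space reaches targetPct of total, while leaving at least floor bytes free.
func ballastBytes(total, free, targetPct, floor uint64) uint64 {
	if free > total {
		free = total
	}
	used := total - free
	target := percentOf(total, targetPct)
	if target <= used || free <= floor {
		return 0 // already at or past the target, or nothing safe to fill
	}
	need := target - used
	if maxFill := free - floor; need > maxFill {
		need = maxFill
	}
	return need
}

func main() {
	const gib = uint64(1) << 30
	total, free := 10*gib, 6*gib

	fmt.Println("fill for 95%: ", ballastBytes(total, free, 95, minFreeFloor))
	fmt.Println("fill for 100%:", ballastBytes(total, free, 100, minFreeFloor))
}
```

With a 100% target the floor becomes the limiting factor, so the volume ends up with exactly 1Mi free rather than zero.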

Code Quality Checklist

  • The documentation is up to date.
  • My code is sufficiently commented and passes continuous integration checks.
  • I have signed my commit (see Contributing Docs).

Testing

  • I leveraged continuous integration testing
    • by adding new unit tests.
  • I manually tested the following steps:
    • locally.
    • as a canary deployment to a cluster.

Test coverage (50 specs)

  • Spec validation (39 tests): capacity/remaining mutual exclusivity, boundary values, writeSyscall probability, GenerateArgs, Explain, GetExitCodeInt
  • Injector (11 tests): creation, inject with capacity/remaining, dry-run, remaining > available (skip), inject+clean round trip, idempotent cleanup
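
The mutual-exclusivity and boundary checks mentioned above might look something like the following. This is only a sketch under assumptions: the struct, field names, and the 1% to 100% range are hypothetical and are not taken from api/v1beta1/disk_full.go, and the format of remaining is left unchecked because the description does not state it.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// diskFullSpec is a hypothetical stand-in for the CRD's diskFull section.
type diskFullSpec struct {
	Path      string
	Capacity  string // e.g. "95%"
	Remaining string // format not described in the PR text
}

// parsePercent accepts values such as "95%" in an assumed 1..100 range.
func parsePercent(s string) (int, error) {
	if !strings.HasSuffix(s, "%") {
		return 0, fmt.Errorf("%q: missing %% suffix", s)
	}
	n, err := strconv.Atoi(strings.TrimSuffix(s, "%"))
	if err != nil {
		return 0, fmt.Errorf("%q: %w", s, err)
	}
	if n < 1 || n > 100 {
		return 0, fmt.Errorf("%q: must be between 1%% and 100%%", s)
	}
	return n, nil
}

// validate enforces that exactly one of Capacity or Remaining is set.
func (s diskFullSpec) validate() error {
	if s.Path == "" {
		return errors.New("path is required")
	}
	switch {
	case s.Capacity == "" && s.Remaining == "":
		return errors.New("one of capacity or remaining must be set")
	case s.Capacity != "" && s.Remaining != "":
		return errors.New("capacity and remaining are mutually exclusive")
	}
	if s.Capacity != "" {
		_, err := parsePercent(s.Capacity)
		return err
	}
	return nil
}

func main() {
	specs := []diskFullSpec{
		{Path: "/data", Capacity: "95%"},
		{Path: "/data", Capacity: "95%", Remaining: "1Gi"},
		{Path: "/data", Capacity: "101%"},
	}
	for _, s := range specs {
		fmt.Printf("%+v -> %v\n", s, s.validate())
	}
}
```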

Files changed (27 files, ~2000 lines)

| Component | Files |
|---|---|
| CRD spec + validation | api/v1beta1/disk_full.go, disruption_types.go, disruption_webhook.go, safemode.go |
| Injector | injector/disk_full.go (volume fill + eBPF launch) |
| CLI | cli/injector/disk_full.go, cli/injector/main.go |
| fallocate package | fallocate/ (4 platform-specific files, adapted from go-fallocate MIT) |
| eBPF program | ebpf/disk-full-write/ (C kernel program + Go userspace binary) |
| Safemode | safemode/safemode_disk_full.go, safemode/safemode.go |
| Types | types/types.go (DisruptionKindDiskFull), ebpf/const-*.go (SysWrite) |
| Docs | docs/disk_full.md, docs/disruption_catalogue.md |
| Tests | api/v1beta1/disk_full_test.go, injector/disk_full_test.go |

Signed-off-by: Thibault NORMAND <thibault.normand@datadoghq.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Thibault NORMAND <me@zenithar.org>
@Zenithar Zenithar force-pushed the zenithar/chaos-controller/disk_full_disruption branch from d238abd to 6109573 Compare April 8, 2026 15:11
datadog-prod-us1-4 bot commented Apr 8, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 57.09%
Overall Coverage: 38.98% (+0.49%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 499ca66
