feat: add 9 Konflux Web RCA incident scenarios (SPAI-414) #854

Open
Aliciapet11 wants to merge 2 commits into itbench-hub:main from Aliciapet11:feat/konflux-incident-fault-types

@Aliciapet11

Summary

  • Add 9 new fault types and 9 corresponding scenarios derived from real Konflux production incidents (Web RCA data)
  • Analyzed 85 curated Web RCA incidents from DevLake, classified 55 as relevant using a relevance evaluation framework, and consolidated them into 9 new fault categories not currently in ITBench
  • All faults and scenarios generated using ITBench automation tools (make generate-docs, make validate-docs, make generate-resource-files) — 0 validation failures

New Fault Types

| Fault Index | ID | Based On | Real Incidents |
|---|---|---|---|
| 31 | crashing-kubernetes-controller-workload | Controller/operator CrashLoopBackOff | 12 |
| 32 | exhausted-etcd-database-storage | etcd storage full / quota exceeded | 6 |
| 33 | stalled-pipeline-controller | Pipeline processing stall | 7 |
| 34 | cluster-availability-loss | Full cluster outage | 9 |
| 35 | breaking-configuration-change | Policy/config breaking production | 6 |
| 36 | corrupted-kubernetes-secret-credentials | Secret rotation / auth failure | 3 |
| 37 | node-resource-exhaustion | Node-level resource pressure | 3 |
| 38 | failed-release-pipeline-service | Release signing / image failure | 3 |
| 39 | monitoring-probe-failure | Monitoring probes going down | 2 |
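
As a quick cross-check, the per-fault incident counts listed above can be summed and compared with the 51-incident total quoted under Context. The sketch below is illustrative only: the dictionary name and helper are hypothetical and not part of the PR, and the numbers are copied by hand from the table.

```python
# Real-incident counts per fault index, copied from the New Fault Types table.
# REAL_INCIDENTS and total_incidents are hypothetical names used for illustration.
REAL_INCIDENTS = {
    31: 12,  # crashing-kubernetes-controller-workload
    32: 6,   # exhausted-etcd-database-storage
    33: 7,   # stalled-pipeline-controller
    34: 9,   # cluster-availability-loss
    35: 6,   # breaking-configuration-change
    36: 3,   # corrupted-kubernetes-secret-credentials
    37: 3,   # node-resource-exhaustion
    38: 3,   # failed-release-pipeline-service
    39: 2,   # monitoring-probe-failure
}


def total_incidents(counts):
    """Return the total number of incidents across all fault types."""
    return sum(counts.values())


total = total_incidents(REAL_INCIDENTS)
print(total)  # 51
```

The sum matches the 51 incidents cited under Context. The Summary's figure of 55 incidents classified as relevant is a separate number; the PR does not state how the remaining 4 were handled.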

New Scenarios (all using OpenTelemetry Demo)

| Scenario ID | Fault Applied To | Complexity |
|---|---|---|
| 115 | recommendation service: CrashLoopBackOff | medium |
| 116 | etcd: storage exhaustion via ConfigMaps | high |
| 117 | kafka: scaled to zero, stalling consumers | medium |
| 118 | checkout: nodes cordoned, pods evicted | high |
| 119 | cart: breaking env var change | medium |
| 120 | email: corrupted secret credentials | medium |
| 121 | product-catalog: excessive resource requests | medium |
| 122 | shipping: invalid container image | medium |
| 123 | load-generator: monitoring probe scaled to zero | low |
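
The commit messages pair each fault index with exactly one scenario ID (Fault 31 / Scenario 115 through Fault 39 / Scenario 123). A minimal sketch of that pairing follows, assuming the target names from the scenario list above; the `PAIRS` list and `check_pairs` helper are hypothetical and exist only to illustrate a consistency check.

```python
# (fault index, scenario ID, target) triples taken from the commit messages
# and the New Scenarios list. PAIRS and check_pairs are hypothetical names.
PAIRS = [
    (31, 115, "recommendation"),
    (32, 116, "etcd"),
    (33, 117, "kafka"),
    (34, 118, "checkout"),
    (35, 119, "cart"),
    (36, 120, "email"),
    (37, 121, "product-catalog"),
    (38, 122, "shipping"),
    (39, 123, "load-generator"),
]


def check_pairs(pairs):
    """Return True if faults and scenarios are contiguous and paired one to one."""
    faults = [fault for fault, _, _ in pairs]
    scenarios = [scenario for _, scenario, _ in pairs]
    contiguous_faults = faults == list(range(faults[0], faults[0] + len(faults)))
    contiguous_scenarios = scenarios == list(
        range(scenarios[0], scenarios[0] + len(scenarios))
    )
    unique = len(set(faults)) == len(faults) and len(set(scenarios)) == len(scenarios)
    return contiguous_faults and contiguous_scenarios and unique


print(check_pairs(PAIRS))  # True
```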

Context

This work is part of the Konflux ↔ ITBench collaboration (SPAI-414) to contribute new incident scenarios based on real production data from Konflux Web RCA. The 9 fault types cover 51 real production incidents spanning controller crashes, etcd issues, pipeline failures, credential problems, config changes, resource exhaustion, release failures, and monitoring gaps.

Test plan

  • make generate-docs — passes (0 failures, 0 warnings)
  • make validate-docs — passes (0 failures, all schemas validated)
  • make generate-resource-files — passes (all stubs generated)
  • Deploy and test injection on a live cluster with OpenTelemetry Demo
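
The three `make` targets in the test plan can be run in sequence with a small wrapper. This is a sketch under stated assumptions: it assumes the targets are invoked from the repository root and run in the order listed, and the helper names `build_commands` and `run_all` are hypothetical, not part of ITBench.

```python
import subprocess

# Make targets named in the test plan, in the order listed there.
DOC_TARGETS = ["generate-docs", "validate-docs", "generate-resource-files"]


def build_commands(targets=DOC_TARGETS):
    """Build one ['make', <target>] argument list per target."""
    return [["make", target] for target in targets]


def run_all(commands, runner=subprocess.run):
    """Run each command in order and stop at the first failure.

    Returns the failing command, or None if every command exited with 0.
    The runner parameter allows substituting a fake in tests.
    """
    for cmd in commands:
        result = runner(cmd, check=False)
        if result.returncode != 0:
            return cmd
    return None
```

Calling `run_all(build_commands())` from the repository root would return `None` when all three targets succeed. The last test plan item (injection on a live cluster) is not covered by this sketch.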

🤖 Generated with Claude Code

Aliciapet11 and others added 2 commits June 15, 2026 15:56
Add 3 new fault types and scenarios derived from real Konflux production
incidents (Web RCA data) for ITBench evaluation:

- Fault 31 / Scenario 115: Crashing Kubernetes Controller Workload
  Based on 12 real CrashLoopBackOff incidents (Tekton Kueue, Integration Service, etc.)

- Fault 32 / Scenario 116: Exhausted Etcd Database Storage
  Based on 6 real etcd full/quota incidents causing cluster outages

- Fault 33 / Scenario 117: Stalled Pipeline Controller
  Based on 7 real pipeline processing degradation incidents

Each includes: fault definition, Ansible injection task, scenario template,
scenario index, and ground truth files (v1 DSL + v2).

Generated using ITBench automation tools (make generate-docs, validate-docs,
generate-resource-files). Injection tasks and ground truth v1 written manually.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add 6 more fault types and scenarios derived from real Konflux production
incidents, completing the full set of 9 new fault types:

- Fault 34 / Scenario 118: Cluster Availability Loss (9 real incidents)
- Fault 35 / Scenario 119: Breaking Configuration Change (6 real incidents)
- Fault 36 / Scenario 120: Corrupted Kubernetes Secret Credentials (3 real incidents)
- Fault 37 / Scenario 121: Node Resource Exhaustion (3 real incidents)
- Fault 38 / Scenario 122: Failed Release Pipeline Service (3 real incidents)
- Fault 39 / Scenario 123: Monitoring Probe Failure (2 real incidents)

Each includes: fault definition, Ansible injection task, scenario template,
scenario index, and ground truth files (v1 DSL + v2).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>