feat: add 9 Konflux Web RCA incident scenarios (SPAI-414) #854
Open
Aliciapet11 wants to merge 2 commits into
Add 3 new fault types and scenarios derived from real Konflux production incidents (Web RCA data) for ITBench evaluation:

- Fault 31 / Scenario 115: Crashing Kubernetes Controller Workload
  Based on 12 real CrashLoopBackOff incidents (Tekton Kueue, Integration Service, etc.)
- Fault 32 / Scenario 116: Exhausted Etcd Database Storage
  Based on 6 real etcd full/quota incidents causing cluster outages
- Fault 33 / Scenario 117: Stalled Pipeline Controller
  Based on 7 real pipeline processing degradation incidents

Each includes: fault definition, Ansible injection task, scenario template, scenario index, and ground truth files (v1 DSL + v2).

Generated using ITBench automation tools (make generate-docs, validate-docs, generate-resource-files). Injection tasks and ground truth v1 written manually.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add 6 more fault types and scenarios derived from real Konflux production incidents, completing the full set of 9 new fault types:

- Fault 34 / Scenario 118: Cluster Availability Loss (9 real incidents)
- Fault 35 / Scenario 119: Breaking Configuration Change (6 real incidents)
- Fault 36 / Scenario 120: Corrupted Kubernetes Secret Credentials (3 real incidents)
- Fault 37 / Scenario 121: Node Resource Exhaustion (3 real incidents)
- Fault 38 / Scenario 122: Failed Release Pipeline Service (3 real incidents)
- Fault 39 / Scenario 123: Monitoring Probe Failure (2 real incidents)

Each includes: fault definition, Ansible injection task, scenario template, scenario index, and ground truth files (v1 DSL + v2).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
make generate-docs, make validate-docs, make generate-resource-files), 0 validation failures

New Fault Types
- crashing-kubernetes-controller-workload
- exhausted-etcd-database-storage
- stalled-pipeline-controller
- cluster-availability-loss
- breaking-configuration-change
- corrupted-kubernetes-secret-credentials
- node-resource-exhaustion
- failed-release-pipeline-service
- monitoring-probe-failure

New Scenarios (all using OpenTelemetry Demo)
- recommendationservice: CrashLoopBackOff
- etcd: storage exhaustion via ConfigMaps
- kafka: scaled to zero, stalling consumers
- checkout: nodes cordoned, pods evicted
- cart: breaking env var change
- email: corrupted secret credentials
- product-catalog: excessive resource requests
- shipping: invalid container image
- load-generator: monitoring probe scaled to zero

Context
This work is part of the Konflux ↔ ITBench collaboration (SPAI-414) to contribute new incident scenarios based on real production data from Konflux Web RCA. The 9 fault types cover 51 real production incidents spanning controller crashes, etcd issues, pipeline failures, credential problems, config changes, resource exhaustion, release failures, and monitoring gaps.
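As a quick cross-check of the figures above, the per-fault incident counts quoted in the two commit messages can be tallied. This is only a sketch: the tuple layout and the names `FAULTS` and `total_incidents` are illustrative and not part of the PR; the numbers themselves are copied from the commit messages.

```python
# (fault id, scenario id, title, real incidents), copied from the commit messages.
FAULTS = [
    (31, 115, "Crashing Kubernetes Controller Workload", 12),
    (32, 116, "Exhausted Etcd Database Storage", 6),
    (33, 117, "Stalled Pipeline Controller", 7),
    (34, 118, "Cluster Availability Loss", 9),
    (35, 119, "Breaking Configuration Change", 6),
    (36, 120, "Corrupted Kubernetes Secret Credentials", 3),
    (37, 121, "Node Resource Exhaustion", 3),
    (38, 122, "Failed Release Pipeline Service", 3),
    (39, 123, "Monitoring Probe Failure", 2),
]

# Sum the last field (incident count) across all fault types.
total_incidents = sum(count for *_, count in FAULTS)

print(f"{len(FAULTS)} fault types, {total_incidents} incidents")
```

The tally agrees with the 9 fault types and 51 incidents stated in the Context paragraph.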
Test plan
- make generate-docs: passes (0 failures, 0 warnings)
- make validate-docs: passes (0 failures, all schemas validated)
- make generate-resource-files: passes (all stubs generated)

🤖 Generated with Claude Code
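For reviewers who want to repeat the test plan, the three make targets can be run in the listed order, stopping at the first failure. The sketch below is an assumption about how one might script that, not something shipped in the PR: `run_targets` and its `runner` parameter are hypothetical names, and the script assumes it is started from the directory that holds the Makefile.

```python
import subprocess
from typing import Callable, List, Optional, Sequence

# Targets named in the test plan, in the order listed there.
TARGETS = ("generate-docs", "validate-docs", "generate-resource-files")


def run_targets(
    targets: Sequence[str] = TARGETS,
    runner: Optional[Callable[[List[str]], int]] = None,
) -> List[str]:
    """Run each make target in order and stop at the first non-zero exit code.

    Returns the targets that completed successfully.
    `runner` is a hypothetical hook so the command execution can be replaced.
    """
    if runner is None:
        # Default: invoke make and report its exit code.
        runner = lambda cmd: subprocess.run(cmd).returncode
    passed: List[str] = []
    for target in targets:
        if runner(["make", target]) != 0:
            break
        passed.append(target)
    return passed


if __name__ == "__main__":
    done = run_targets()
    print("passed:", ", ".join(done) if done else "none")
```

Passing a custom `runner` keeps the ordering logic separate from the actual `make` invocation, which makes it easy to dry-run the sequence.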