Extend existing failure notifications for tracking flaky CI issues#2974

Open
G-D-Petrov wants to merge 2 commits into master from gpetrov/flaky_ci_issues
Conversation

@G-D-Petrov (Collaborator) commented Mar 17, 2026

Reference Issues/PRs

What does this implement or fix?

Extends the existing failure_notification.yaml workflow to automatically track intermittent CI failures as GitHub issues and enrich Slack notifications with known/new status.

What changed

The workflow now has two jobs:

  1. track-failures (runs on all branches) — when any monitored workflow (Build and Test, Build with conda, Build with analysis tools, Coverity Static Analysis, Installation Tests Execution) fails:

    • Parses failed test names from run logs (GoogleTest [ FAILED ] and pytest FAILED patterns)
    • Identifies failed infrastructure steps via the GitHub API (e.g. Install MongoDB timing out due to network issues)
    • For each failure, searches for an existing open issue — if found, adds a comment with the run link; otherwise creates a new issue
    • Outputs a summary classifying each failure as Known or New
  2. notify-slack (runs on master only, same condition as before) — sends a Slack message enriched with the failure summary, e.g.:

    🔥 Build with conda failure on master
    ⚠️ Known: StorageLockWithAndWithoutRetry.StressManyWriters (#42)
    🚨 New: 3.11 Linux / integration-NoCache / Install MongoDB (issue)
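
The log parsing in job 1 can be sketched as follows. This is a hypothetical illustration, not the workflow's exact code: the sample log lines and the precise regexes are assumptions, standing in for the GoogleTest `[  FAILED  ]` and pytest `FAILED` patterns the description mentions.

```shell
#!/bin/sh
# Fabricated sample of a failed-run log (the real workflow downloads this
# via `gh run view --log-failed`).
cat > /tmp/failed_logs.txt <<'EOF'
[  FAILED  ] StorageLockWithAndWithoutRetry.StressManyWriters (312 ms)
FAILED tests/unit/test_store.py::test_write_roundtrip - AssertionError
some unrelated log line
EOF

# GoogleTest: take the test name after the "[  FAILED  ]" banner
gtest_failures=$(grep -oE '\[  FAILED  \] [A-Za-z0-9_./]+' /tmp/failed_logs.txt \
  | sed 's/^\[  FAILED  \] //' | sort -u)
# pytest: take the node id after a leading "FAILED"
pytest_failures=$(grep -oE '^FAILED [^ ]+' /tmp/failed_logs.txt \
  | sed 's/^FAILED //' | sort -u)

printf '%s\n%s\n' "$gtest_failures" "$pytest_failures"
```

Each extracted name then drives the search-or-create issue step described above.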

Labels

| Label | Colour | Used for |
| --- | --- | --- |
| flaky-test | yellow | Test failures (e.g. Flaky test: StorageLockWithAndWithoutRetry.StressManyWriters) |
| flaky-step | pink | Infrastructure/setup step failures (e.g. Flaky step: 3.11 Linux / integration-NoCache / Install MongoDB) |
| ci-failure | red | Failures where no specific test or step could be identified |

Labels are auto-created on first run if they don't already exist.

Any other comments?

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

@github-actions

Label error. Requires exactly 1 of: patch, minor, major. Found:

```yaml
    types: [completed]

jobs:
  track-failures:
```

The workflow triggers on any branch failure (workflow_run fires regardless of the branch the triggering run ran on). The PR description says "on the master branch" but the if: condition only checks conclusion == 'failure' — it does not filter on branch. This means flaky-test issues will be auto-created for failures on feature branches, release branches, forks, etc., generating noise.

Consider adding a branch filter:

Suggested change:

```yaml
  track-failures:
    if: github.event.workflow_run.conclusion == 'failure' && github.event.workflow_run.head_branch == 'master'
```

```shell
for job_id in $(echo "$failed_jobs_json" | jq -r '.[].id'); do
  job_name=$(echo "$failed_jobs_json" | jq -r ".[] | select(.id == $job_id) | .name")
  step_names=$(gh api "repos/$REPO/actions/jobs/$job_id" \
    --jq '.steps[] | select(.conclusion == "failure") | .name')
```

The log download silently succeeds even when it fails (|| true). If gh run view --log-failed fails (e.g. rate-limit, token permission issue), /tmp/failed_logs.txt may be empty or contain an error message, causing the grep patterns to produce zero results. The workflow would then fall through to the "unparseable failures" path and create a generic issue for every CI failure — this is the highest-noise fallback and should be avoided.

Consider logging a warning when the log download fails so at least the issue body contains diagnostic context.

```shell
echo "$FAILING_TESTS" | while IFS= read -r test_name; do
  [ -z "$test_name" ] && continue

  echo "Processing test: $test_name"
```

Injection risk via ${{ steps.parse.outputs.tests }}.

FAILING_TESTS is populated from parsed log output (test names) and then passed through ${{ steps.parse.outputs.tests }}. GitHub Actions interpolates ${{ ... }} expressions directly into the YAML before the runner executes the step. If a crafted test name contains shell metacharacters or YAML-breaking sequences, this could alter step behaviour.

The correct mitigation is to reference the output only via the environment variable ($FAILING_TESTS), which is already being done inside the run: script — the environment variable assignment itself is the risk. Prefer writing step outputs to a file (e.g. /tmp/failing_tests.txt) and reading that file in the subsequent step, rather than passing through ${{ steps.parse.outputs.tests }} as an env value.

The same pattern applies to FAILED_STEPS at line 197.

```shell
body="### Another failure observed"
body="$body"$'\n\n'"- **Run:** $RUN_URL"
body="$body"$'\n'"- **Commit:** \`${COMMIT_SHA:0:10}\`"
body="$body"$'\n'"- **Date:** $run_date"
```

The GitHub Search API (used by gh issue list --search) can return stale results with a lag of a few minutes. If two CI runs fail in quick succession for the same test, the deduplication search may find no existing issue for both runs, causing two issues to be created with the same title.

A safer deduplication approach would be to search by exact title match using gh issue list --search "is:issue is:open label:flaky-test \"$issue_title\"" and then verify the title server-side with --jq, which is what the code already does — but the underlying search index latency is unavoidable. This is worth noting in the workflow comments so maintainers understand the duplicates-on-simultaneous-failures edge case.
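
The exact-title check the comment refers to can be illustrated in isolation. This is a sketch with fabricated search results (the real code filters `gh issue list` output with `--jq`): a substring search can return near-matches, so only an issue whose title equals the wanted string should count as a duplicate.

```shell
#!/bin/sh
# Fabricated search results: issue 11 is a near-match that a substring
# search would also return, and must be filtered out.
issues='[{"number": 10, "title": "Flaky test: A.B"},
         {"number": 11, "title": "Flaky test: A.B.extended"}]'
wanted='Flaky test: A.B'

match=$(printf '%s' "$issues" | WANTED="$wanted" python3 -c '
import json, os, sys
wanted = os.environ["WANTED"]
# Keep only issues whose title matches exactly, then take the first
hits = [i["number"] for i in json.load(sys.stdin) if i["title"] == wanted]
print(hits[0] if hits else "")
')
printf 'match=%s\n' "$match"
```

Even with exact matching, the search-index lag means two near-simultaneous failures can still each see "no existing issue", which is the unavoidable edge case noted above.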

```yaml
          RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.event.workflow_run.id }}
          REPO: ${{ github.repository }}
          COMMIT_SHA: ${{ github.event.workflow_run.head_sha }}
        run: |
```

The "unparseable failures" step creates a new issue per run with a unique title (CI failure (unparseable): run #$RUN_ID). If the log download frequently fails or the grep patterns never match (e.g. due to a log format change), this will create an unbounded number of issues — one for every CI failure.

Unlike the flaky-test and flaky-step steps, there is no deduplication here. Consider either deduplicating by searching for a generic title (e.g. CI failure (unparseable)) or enforcing a rate-limit / daily cap, or simply skipping issue creation and only writing to $GITHUB_STEP_SUMMARY for this fallback case.

```shell
body="$body"$'\n\n'"---"
body="$body"$'\n'"*This issue was automatically created by the flaky test tracker workflow.*"
body="$body"$'\n'"*Add further failure occurrences as comments below.*"
gh issue create --repo "$REPO" \
```

The issue body hard-codes "on the master branch" but the workflow is not restricted to master (see the branch-filter comment on line 9). This text will be misleading when issues are filed from non-master runs.

Suggested change:

```shell
gh issue create --repo "$REPO" \
body="$body"$'\n\n'"This test has been detected as failing in CI."
```

```shell
# Get failed jobs with their names and IDs
failed_jobs_json=$(gh api "repos/$REPO/actions/runs/$RUN_ID/jobs?filter=latest&per_page=100" \
  --jq '[.jobs[] | select(.conclusion == "failure") | {id, name}]')
```


The API call fetches up to 100 jobs (per_page=100) but does not handle pagination. If a workflow run has more than 100 jobs (unlikely but possible as the matrix grows), failures in jobs beyond position 100 will be silently missed. Consider adding --paginate if the gh api CLI supports it here, or at least document this limitation.
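
To show why merged pagination matters: `gh api --paginate` emits one JSON document per page, so the per-page arrays have to be combined before filtering for failures. A sketch with two fabricated pages, where python3 stands in for the `jq -s 'add'` merge step:

```shell
#!/bin/sh
# Two fabricated pages of the jobs API; each page is a JSON array.
page1='[{"id": 1, "name": "build", "conclusion": "failure"}]'
page2='[{"id": 2, "name": "test", "conclusion": "success"}]'

failed=$(printf '%s\n%s\n' "$page1" "$page2" | python3 -c '
import json, sys
jobs = []
for line in sys.stdin:            # one page per line in this sketch
    if line.strip():
        jobs += json.loads(line)  # merge pages, like jq -s "add"
failed = [j["name"] for j in jobs if j["conclusion"] == "failure"]
print(",".join(failed))
')
printf 'failed jobs: %s\n' "$failed"
```

Without the merge, only failures from the last-processed page would survive into the filtering step.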

@claude

claude bot commented Mar 17, 2026

ArcticDB Code Review Summary - PR 2974 delta review complete. See inline comments for details.

Critical:
  1. Branch filter missing on track-failures (line 9): fires on all branches, not just master.
  2. Unbounded issue creation in the unparseable fallback (line 230): no dedup.
  3. Step output injection via ${{ ... }} interpolation (line 135).
  4. Unescaped vars in the JSON heredoc (line 300): WORKFLOW_NAME, CONCLUSION and BRANCH need JSON escaping.

Warnings:
  5. Log download failures silently swallowed (line 68).
  6. Jobs API not paginated (line 39).
  7. notify-slack fires when track-failures errors (line 266).
  8. notify-slack missing a permissions block.

Code quality: track-failures running on all branches conflicts with the original master-only intent; the relationship with flaky_tests_summary.yml is not addressed.

Comment on lines +260 to +319
```yaml
    runs-on: ubuntu-latest
    needs: [track-failures]
    if: >-
      (github.event.workflow_run.head_branch == 'master' ||
       github.event.workflow_run.head_branch == '') &&
      always() &&
      needs.track-failures.result != 'skipped'
    steps:
      - name: Send Slack notification
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
          WORKFLOW_NAME: ${{ github.event.workflow_run.name }}
          CONCLUSION: ${{ github.event.workflow_run.conclusion }}
          BRANCH: ${{ github.event.workflow_run.head_branch }}
          RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.event.workflow_run.id }}
          REPO: ${{ github.repository }}
          REPO_URL: ${{ github.server_url }}/${{ github.repository }}
          SUMMARY: ${{ needs.track-failures.outputs.slack_summary }}
        run: |
          if [ "$CONCLUSION" = "timed_out" ]; then
            icon=":hourglass:"
          else
            icon=":fire:"
          fi

          # Build the failures section — only include if we have a summary
          failures_block=""
          if [ -n "$SUMMARY" ]; then
            # Escape for JSON
            escaped_summary=$(echo "$SUMMARY" | python3 -c "import sys,json; print(json.dumps(sys.stdin.read())[1:-1])")
            failures_block=",{\"type\":\"section\",\"text\":{\"type\":\"mrkdwn\",\"text\":\"$escaped_summary\"}}"
          fi

          payload=$(cat <<PAYLOAD_END
          {
            "blocks": [
              {
                "type": "section",
                "text": {
                  "type": "mrkdwn",
                  "text": "$icon *$WORKFLOW_NAME* $CONCLUSION on \`$BRANCH\`"
                }
              }$failures_block,
              {
                "type": "context",
                "elements": [
                  {
                    "type": "mrkdwn",
                    "text": "<$REPO_URL|$REPO> | <$RUN_URL|View Failure>"
                  }
                ]
              }
            ]
          }
          PAYLOAD_END
          )

          curl -sf -X POST -H 'Content-type: application/json' \
            --data "$payload" \
            "$SLACK_WEBHOOK_URL"
```

Check warning

Code scanning / CodeQL

Workflow does not contain permissions (Medium)

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {}

Copilot Autofix (AI), 26 days ago

To fix the problem, explicitly define restricted GITHUB_TOKEN permissions for the notify-slack job. Since this job only reads workflow metadata and a prior job’s output and posts to Slack (an external service), it does not need any GitHub write permissions, and likely only needs read access to repository contents (or even no special permissions if it doesn’t call GitHub APIs at all). A safe and clear minimal configuration is to set permissions: contents: read on the notify-slack job.

Concretely, in .github/workflows/failure_notification.yaml, under jobs: notify-slack:, add a permissions: block at the same indentation level as runs-on, needs, and if. Place it between runs-on: ubuntu-latest (line 260) and needs: [track-failures] (line 261), and set contents: read as the single permission. This keeps existing functionality intact while constraining the token according to the principle of least privilege.

Suggested changeset 1
.github/workflows/failure_notification.yaml

Autofix patch. Run the following command in your local git repository to apply this patch:

```shell
cat << 'EOF' | git apply
diff --git a/.github/workflows/failure_notification.yaml b/.github/workflows/failure_notification.yaml
--- a/.github/workflows/failure_notification.yaml
+++ b/.github/workflows/failure_notification.yaml
@@ -258,6 +258,8 @@
 
   notify-slack:
     runs-on: ubuntu-latest
+    permissions:
+      contents: read
     needs: [track-failures]
     if: >-
       (github.event.workflow_run.head_branch == 'master' ||
EOF
```
@G-D-Petrov G-D-Petrov changed the title Add a flow for tracking flaky CI issues Extend existing failure notifications for tracking flaky CI issues Mar 17, 2026
```diff
 jobs:
-  on-failure:
+  track-failures:
     runs-on: ubuntu-latest
```

The track-failures job has no branch restriction — it fires on failures from any branch (feature branches, forks' PRs, etc.). This will create GitHub issues for every flaky test run on any contributor's branch, generating significant noise.

The original workflow explicitly filtered to master or '' (scheduled runs). That filter should be preserved here, or at minimum added as a comment explaining the intentional choice to track all branches.

Suggested change:

```yaml
    runs-on: ubuntu-latest
    if: >-
      (github.event.workflow_run.conclusion == 'failure' ||
       github.event.workflow_run.conclusion == 'timed_out') &&
      (github.event.workflow_run.head_branch == 'master' ||
       github.event.workflow_run.head_branch == '')
```


```shell
echo "Fetching failed jobs for run $RUN_ID..."

failed_jobs_json=$(gh api "repos/$REPO/actions/runs/$RUN_ID/jobs?filter=latest&per_page=100" \
```

The jobs API is called with per_page=100 but without pagination. A large matrix workflow can exceed 100 jobs (ArcticDB's build matrix is already sizeable). Any failures in jobs beyond position 100 will be silently missed, causing the issue tracker to never see those test failures.

Consider using gh api --paginate or at least adding a comment noting the limit. Note that --paginate with --jq requires using jq -s to merge pages, e.g.:

```shell
failed_jobs_json=$(gh api --paginate "repos/$REPO/actions/runs/$RUN_ID/jobs?filter=latest&per_page=100" \
  --jq '[.jobs[] | select(.conclusion == "failure") | {id, name}]' | jq -s 'add // []')
```

```shell
echo "$failed_steps"

# --- Download logs and parse test failures ---
gh run view "$RUN_ID" --repo "$REPO" --log-failed > /tmp/failed_logs.txt 2>&1 || true
```

The || true here silently swallows failures from gh run view --log-failed. If this command fails — due to a rate limit, insufficient token permissions (actions: read is granted, but log access can be restricted on forks), or a transient API error — the file /tmp/failed_logs.txt will either be empty or contain an error message like HTTP 403. The subsequent grep calls will match nothing, both FAILING_TESTS and FAILED_STEPS will be empty, and the workflow will fall through to the unparseable fallback, creating a new ci-failure issue for every such run.

At minimum, log the exit code or file content so failures are visible in the step logs:

```shell
gh run view "$RUN_ID" --repo "$REPO" --log-failed > /tmp/failed_logs.txt 2>&1 || {
  echo "WARNING: gh run view --log-failed failed (exit $?); test name parsing will be skipped"
}
```

```yaml
          RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.event.workflow_run.id }}
          REPO: ${{ github.repository }}
          COMMIT_SHA: ${{ github.event.workflow_run.head_sha }}
          FAILING_TESTS: ${{ steps.parse.outputs.tests }}
```

Security: step output injection via ${{ ... }} expression interpolation.

FAILING_TESTS and FAILED_STEPS are populated from parsed CI log content — test names and step names that ultimately come from source code, commit messages, and workflow definitions. GitHub Actions evaluates ${{ steps.parse.outputs.tests }} by substituting the value directly into the YAML before the runner executes the step.

If a test name or step name contains characters that break the YAML/shell context (e.g. a test name with \n, backticks, or $(...) sequences), this can alter step behaviour. While the workflow_run trigger somewhat limits exposure (only code on this repo can influence test names), it is still best practice to avoid interpolating step outputs directly into env: blocks.

The safe pattern is to write the parsed outputs to a file in the parse step and read that file in the issues step:

```shell
# In parse step:
echo "$failing_tests" > /tmp/failing_tests.txt
echo "$infra_steps"   > /tmp/failed_steps.txt

# In issues step (no ${{ }} interpolation of untrusted content):
FAILING_TESTS=$(cat /tmp/failing_tests.txt)
FAILED_STEPS=$(cat /tmp/failed_steps.txt)
```

This avoids the ${{ }} interpolation path entirely for untrusted content.


```shell
# --- Fallback for unparseable failures ---
if [ -z "$FAILING_TESTS" ] && [ -z "$FAILED_STEPS" ]; then
  issue_title="CI failure (unparseable): run #$RUN_ID"
```

The unparseable-failure fallback creates a new issue for every CI run using a unique title (CI failure (unparseable): run #$RUN_ID). There is no deduplication here.

The primary scenario that triggers this path is when the log download fails (see the || true concern on line 68). If that happens consistently — e.g. due to a log permission change or gh CLI upgrade breaking the --log-failed flag — every CI failure on master will produce a new issue indefinitely.

Consider one of:

  1. Writing to $GITHUB_STEP_SUMMARY only, instead of opening an issue, so humans can investigate without issue spam.
  2. Searching for a generic open issue titled CI failures (unparseable) and adding a comment to it rather than creating a new issue per run.
  3. Capping at a daily or weekly issue (use a date-stamped title like CI failure (unparseable): YYYY-MM-DD).
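
Option 3 could look like the following sketch (not the PR's code): a date-stamped title means the existing search-then-comment dedup path caps the fallback at one issue per UTC day.

```shell
#!/bin/sh
# Hypothetical daily cap for the unparseable fallback: build the title from
# the UTC date instead of the unique run id, so repeated failures on the
# same day dedup onto one issue.
issue_title="CI failure (unparseable): $(date -u +%Y-%m-%d)"
printf '%s\n' "$issue_title"
```

The dedup search would then look for this exact title before creating a new issue, exactly as the flaky-test path already does.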

```yaml
      (github.event.workflow_run.head_branch == 'master' ||
       github.event.workflow_run.head_branch == '') &&
      always() &&
      needs.track-failures.result != 'skipped'
```

The notify-slack condition has a subtle logic gap. The combination of always() and needs.track-failures.result != 'skipped' means this job will also fire when track-failures itself fails (e.g. due to a permissions error, API rate limit, or script bug). In that case needs.track-failures.outputs.slack_summary will be empty, and the Slack message will be sent but with no failure details — potentially confusing on-call engineers.

Consider adding needs.track-failures.result != 'failure' to the condition, or at minimum handle the case where SUMMARY is empty by adding a fallback message indicating that issue tracking itself failed:

```yaml
    if: >-
      (github.event.workflow_run.head_branch == 'master' ||
       github.event.workflow_run.head_branch == '') &&
      always() &&
      needs.track-failures.result != 'skipped' &&
      needs.track-failures.result != 'failure' &&
      needs.track-failures.result != 'cancelled'
```

```
"type": "section",
"text": {
  "type": "mrkdwn",
  "text": "$icon *$WORKFLOW_NAME* $CONCLUSION on \`$BRANCH\`"
```

Shell variables injected unescaped into a JSON string literal inside a heredoc.

$WORKFLOW_NAME, $CONCLUSION, and $BRANCH are substituted directly into the JSON payload string. If any of these contain double quotes, backslashes, or newlines (e.g. a workflow name like Build "with" quotes), the resulting JSON will be malformed and curl will send an invalid payload, silently failing with -f.

The escaped_summary is already being JSON-escaped via python3. Apply the same treatment to the other variables before inserting them:

```shell
esc_workflow=$(printf '%s' "$WORKFLOW_NAME" | python3 -c "import sys,json; print(json.dumps(sys.stdin.read())[1:-1])")
esc_conclusion=$(printf '%s' "$CONCLUSION" | python3 -c "import sys,json; print(json.dumps(sys.stdin.read())[1:-1])")
esc_branch=$(printf '%s' "$BRANCH" | python3 -c "import sys,json; print(json.dumps(sys.stdin.read())[1:-1])")
```

Then use $esc_workflow, $esc_conclusion, $esc_branch in the heredoc.
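
The same `json.dumps` trick can be exercised in isolation. A sketch (the hostile workflow name is a made-up example) showing that embedded quotes come out as valid JSON escapes:

```shell
#!/bin/sh
# JSON-escape a string the same way the workflow escapes escaped_summary:
# json.dumps adds surrounding quotes, [1:-1] strips them again.
escape_json() {
  printf '%s' "$1" | python3 -c 'import sys, json; print(json.dumps(sys.stdin.read())[1:-1])'
}

esc=$(escape_json 'Build "with" quotes')
printf '%s\n' "$esc"

# Embedding the escaped value yields a payload that still parses as JSON;
# prints "valid" if json.load accepts it.
printf '{"text": "%s"}\n' "$esc" \
  | python3 -c 'import sys, json; json.load(sys.stdin); print("valid")'
```

Without this escaping, the unmodified double quotes would terminate the JSON string literal inside the heredoc and malform the payload.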

