Extend existing failure notifications for tracking flaky CI issues#2974
Extend existing failure notifications for tracking flaky CI issues#2974G-D-Petrov wants to merge 2 commits intomasterfrom
Conversation
|
Label error. Requires exactly 1 of: patch, minor, major. Found: |
.github/workflows/flaky_CI_issue.yml
Outdated
| types: [completed] | ||
|
|
||
| jobs: | ||
| track-failures: |
There was a problem hiding this comment.
The workflow triggers on any branch failure (workflow_run fires regardless of the branch the triggering run ran on). The PR description says "on the master branch" but the if: condition only checks conclusion == 'failure' — it does not filter on branch. This means flaky-test issues will be auto-created for failures on feature branches, release branches, forks, etc., generating noise.
Consider adding a branch filter:
| track-failures: | |
| if: github.event.workflow_run.conclusion == 'failure' && github.event.workflow_run.head_branch == 'master' |
.github/workflows/flaky_CI_issue.yml
Outdated
| for job_id in $(echo "$failed_jobs_json" | jq -r '.[].id'); do | ||
| job_name=$(echo "$failed_jobs_json" | jq -r ".[] | select(.id == $job_id) | .name") | ||
| step_names=$(gh api "repos/$REPO/actions/jobs/$job_id" \ | ||
| --jq '.steps[] | select(.conclusion == "failure") | .name') |
There was a problem hiding this comment.
The log download silently succeeds even when it fails (|| true). If gh run view --log-failed fails (e.g. rate-limit, token permission issue), /tmp/failed_logs.txt may be empty or contain an error message, causing the grep patterns to produce zero results. The workflow would then fall through to the "unparseable failures" path and create a generic issue for every CI failure — this is the highest-noise fallback and should be avoided.
Consider logging a warning when the log download fails so at least the issue body contains diagnostic context.
.github/workflows/flaky_CI_issue.yml
Outdated
| echo "$FAILING_TESTS" | while IFS= read -r test_name; do | ||
| [ -z "$test_name" ] && continue | ||
|
|
||
| echo "Processing test: $test_name" |
There was a problem hiding this comment.
Injection risk via ${{ steps.parse.outputs.tests }}.
FAILING_TESTS is populated from parsed log output (test names) and then passed through ${{ steps.parse.outputs.tests }}. GitHub Actions interpolates ${{ ... }} expressions directly into the YAML before the runner executes the step. If a crafted test name contains shell metacharacters or YAML-breaking sequences, this could alter step behaviour.
The correct mitigation is to reference the output only via the environment variable ($FAILING_TESTS), which is already being done inside the run: script — the environment variable assignment itself is the risk. Prefer writing step outputs to a file (e.g. /tmp/failing_tests.txt) and reading that file in the subsequent step, rather than passing through ${{ steps.parse.outputs.tests }} as an env value.
The same pattern applies to FAILED_STEPS at line 197.
.github/workflows/flaky_CI_issue.yml
Outdated
| body="### Another failure observed" | ||
| body="$body"$'\n\n'"- **Run:** $RUN_URL" | ||
| body="$body"$'\n'"- **Commit:** \`${COMMIT_SHA:0:10}\`" | ||
| body="$body"$'\n'"- **Date:** $run_date" |
There was a problem hiding this comment.
The GitHub Search API (used by gh issue list --search) can return stale results with a lag of a few minutes. If two CI runs fail in quick succession for the same test, the deduplication search may find no existing issue for both runs, causing two issues to be created with the same title.
A safer deduplication approach would be to search by exact title match using gh issue list --search "is:issue is:open label:flaky-test \"$issue_title\"" and then verify the title server-side with --jq, which is what the code already does — but the underlying search index latency is unavoidable. This is worth noting in the workflow comments so maintainers understand the duplicates-on-simultaneous-failures edge case.
.github/workflows/flaky_CI_issue.yml
Outdated
| RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.event.workflow_run.id }} | ||
| REPO: ${{ github.repository }} | ||
| COMMIT_SHA: ${{ github.event.workflow_run.head_sha }} | ||
| run: | |
There was a problem hiding this comment.
The "unparseable failures" step creates a new issue per run with a unique title (CI failure (unparseable): run #$RUN_ID). If the log download frequently fails or the grep patterns never match (e.g. due to a log format change), this will create an unbounded number of issues — one for every CI failure.
Unlike the flaky-test and flaky-step steps, there is no deduplication here. Consider either deduplicating by searching for a generic title (e.g. CI failure (unparseable)) or enforcing a rate-limit / daily cap, or simply skipping issue creation and only writing to $GITHUB_STEP_SUMMARY for this fallback case.
.github/workflows/flaky_CI_issue.yml
Outdated
| body="$body"$'\n\n'"---" | ||
| body="$body"$'\n'"*This issue was automatically created by the flaky test tracker workflow.*" | ||
| body="$body"$'\n'"*Add further failure occurrences as comments below.*" | ||
| gh issue create --repo "$REPO" \ |
There was a problem hiding this comment.
The issue body hard-codes "on the master branch" but the workflow is not restricted to master (see the branch-filter comment on line 9). This text will be misleading when issues are filed from non-master runs.
| gh issue create --repo "$REPO" \ | |
| body="$body"$'\n\n'"This test has been detected as failing in CI." |
.github/workflows/flaky_CI_issue.yml
Outdated
| # Get failed jobs with their names and IDs | ||
| failed_jobs_json=$(gh api "repos/$REPO/actions/runs/$RUN_ID/jobs?filter=latest&per_page=100" \ | ||
| --jq '[.jobs[] | select(.conclusion == "failure") | {id, name}]') | ||
|
|
There was a problem hiding this comment.
The API call fetches up to 100 jobs (per_page=100) but does not handle pagination. If a workflow run has more than 100 jobs (unlikely but possible as the matrix grows), failures in jobs beyond position 100 will be silently missed. Consider adding --paginate if the gh api CLI supports it here, or at least document this limitation.
ArcticDB Code Review Summary - PR 2974 delta review complete. See inline comments for details. Critical: (1) branch filter missing on track-failures line 9 fires on all branches not just master, (2) unbounded issue creation in unparseable fallback line 230 no dedup, (3) step output injection via dollar-brace interpolation line 135, (4) unescaped vars in JSON heredoc line 300 WORKFLOW_NAME CONCLUSION BRANCH need JSON escaping. Warnings: (5) log download silently swallowed line 68, (6) jobs API not paginated line 39, (7) notify-slack fires when track-failures errors line 266, (8) notify-slack missing permissions block. Code quality: track-failures running on all branches conflicts with original master-only intent; relationship with flaky_tests_summary.yml not addressed. |
| runs-on: ubuntu-latest | ||
| needs: [track-failures] | ||
| if: >- | ||
| (github.event.workflow_run.head_branch == 'master' || | ||
| github.event.workflow_run.head_branch == '') && | ||
| always() && | ||
| needs.track-failures.result != 'skipped' | ||
| steps: | ||
| - name: Send Slack notification | ||
| env: | ||
| SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }} | ||
| WORKFLOW_NAME: ${{ github.event.workflow_run.name }} | ||
| CONCLUSION: ${{ github.event.workflow_run.conclusion }} | ||
| BRANCH: ${{ github.event.workflow_run.head_branch }} | ||
| RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.event.workflow_run.id }} | ||
| REPO: ${{ github.repository }} | ||
| REPO_URL: ${{ github.server_url }}/${{ github.repository }} | ||
| SUMMARY: ${{ needs.track-failures.outputs.slack_summary }} | ||
| run: | | ||
| if [ "$CONCLUSION" = "timed_out" ]; then | ||
| icon=":hourglass:" | ||
| else | ||
| icon=":fire:" | ||
| fi | ||
|
|
||
| # Build the failures section — only include if we have a summary | ||
| failures_block="" | ||
| if [ -n "$SUMMARY" ]; then | ||
| # Escape for JSON | ||
| escaped_summary=$(echo "$SUMMARY" | python3 -c "import sys,json; print(json.dumps(sys.stdin.read())[1:-1])") | ||
| failures_block=",{\"type\":\"section\",\"text\":{\"type\":\"mrkdwn\",\"text\":\"$escaped_summary\"}}" | ||
| fi | ||
|
|
||
| payload=$(cat <<PAYLOAD_END | ||
| { | ||
| "blocks": [ | ||
| { | ||
| "type": "section", | ||
| "text": { | ||
| "type": "mrkdwn", | ||
| "text": "$icon *$WORKFLOW_NAME* $CONCLUSION on \`$BRANCH\`" | ||
| } | ||
| }$failures_block, | ||
| { | ||
| "type": "context", | ||
| "elements": [ | ||
| { | ||
| "type": "mrkdwn", | ||
| "text": "<$REPO_URL|$REPO> | <$RUN_URL|View Failure>" | ||
| } | ||
| ] | ||
| } | ||
| ] | ||
| } | ||
| PAYLOAD_END | ||
| ) | ||
|
|
||
| curl -sf -X POST -H 'Content-type: application/json' \ | ||
| --data "$payload" \ | ||
| "$SLACK_WEBHOOK_URL" |
Check warning
Code scanning / CodeQL
Workflow does not contain permissions Medium
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 26 days ago
To fix the problem, explicitly define restricted GITHUB_TOKEN permissions for the notify-slack job. Since this job only reads workflow metadata and a prior job’s output and posts to Slack (an external service), it does not need any GitHub write permissions, and likely only needs read access to repository contents (or even no special permissions if it doesn’t call GitHub APIs at all). A safe and clear minimal configuration is to set permissions: contents: read on the notify-slack job.
Concretely, in .github/workflows/failure_notification.yaml, under jobs: notify-slack:, add a permissions: block at the same indentation level as runs-on, needs, and if. Place it between runs-on: ubuntu-latest (line 260) and needs: [track-failures] (line 261), and set contents: read as the single permission. This keeps existing functionality intact while constraining the token according to the principle of least privilege.
| @@ -258,6 +258,8 @@ | ||
|
|
||
| notify-slack: | ||
| runs-on: ubuntu-latest | ||
| permissions: | ||
| contents: read | ||
| needs: [track-failures] | ||
| if: >- | ||
| (github.event.workflow_run.head_branch == 'master' || |
| jobs: | ||
| on-failure: | ||
| track-failures: | ||
| runs-on: ubuntu-latest |
There was a problem hiding this comment.
The track-failures job has no branch restriction — it fires on failures from any branch (feature branches, forks' PRs, etc.). This will create GitHub issues for every flaky test run on any contributor's branch, generating significant noise.
The original workflow explicitly filtered to master or '' (scheduled runs). That filter should be preserved here, or at minimum added as a comment explaining the intentional choice to track all branches.
| runs-on: ubuntu-latest | |
| if: >- | |
| (github.event.workflow_run.conclusion == 'failure' || | |
| github.event.workflow_run.conclusion == 'timed_out') && | |
| (github.event.workflow_run.head_branch == 'master' || | |
| github.event.workflow_run.head_branch == '') |
|
|
||
| echo "Fetching failed jobs for run $RUN_ID..." | ||
|
|
||
| failed_jobs_json=$(gh api "repos/$REPO/actions/runs/$RUN_ID/jobs?filter=latest&per_page=100" \ |
There was a problem hiding this comment.
The jobs API is called with per_page=100 but without pagination. A large matrix workflow can exceed 100 jobs (ArcticDB's build matrix is already sizeable). Any failures in jobs beyond position 100 will be silently missed, causing the issue tracker to never see those test failures.
Consider using gh api --paginate or at least adding a comment noting the limit. Note that --paginate with --jq requires using jq -s to merge pages, e.g.:
failed_jobs_json=$(gh api --paginate "repos/$REPO/actions/runs/$RUN_ID/jobs?filter=latest&per_page=100" \
--jq '[.jobs[] | select(.conclusion == "failure") | {id, name}]' | jq -s 'add // []')| echo "$failed_steps" | ||
|
|
||
| # --- Download logs and parse test failures --- | ||
| gh run view "$RUN_ID" --repo "$REPO" --log-failed > /tmp/failed_logs.txt 2>&1 || true |
There was a problem hiding this comment.
The || true here silently swallows failures from gh run view --log-failed. If this command fails — due to a rate limit, insufficient token permissions (actions: read is granted, but log access can be restricted on forks), or a transient API error — the file /tmp/failed_logs.txt will either be empty or contain an error message like HTTP 403. The subsequent grep calls will match nothing, both FAILING_TESTS and FAILED_STEPS will be empty, and the workflow will fall through to the unparseable fallback, creating a new ci-failure issue for every such run.
At minimum, log the exit code or file content so failures are visible in the step logs:
gh run view "$RUN_ID" --repo "$REPO" --log-failed > /tmp/failed_logs.txt 2>&1 || {
echo "WARNING: gh run view --log-failed failed (exit $?); test name parsing will be skipped"
}| RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.event.workflow_run.id }} | ||
| REPO: ${{ github.repository }} | ||
| COMMIT_SHA: ${{ github.event.workflow_run.head_sha }} | ||
| FAILING_TESTS: ${{ steps.parse.outputs.tests }} |
There was a problem hiding this comment.
Security: step output injection via ${{ ... }} expression interpolation.
FAILING_TESTS and FAILED_STEPS are populated from parsed CI log content — test names and step names that ultimately come from source code, commit messages, and workflow definitions. GitHub Actions evaluates ${{ steps.parse.outputs.tests }} by substituting the value directly into the YAML before the runner executes the step.
If a test name or step name contains characters that break the YAML/shell context (e.g. a test name with \n, backticks, or $(...) sequences), this can alter step behaviour. While the workflow_run trigger somewhat limits exposure (only code on this repo can influence test names), it is still best practice to avoid interpolating step outputs directly into env: blocks.
The safe pattern is to write the parsed outputs to a file in the parse step and read that file in the issues step:
# In parse step:
echo "$failing_tests" > /tmp/failing_tests.txt
echo "$infra_steps" > /tmp/failed_steps.txt
# In issues step (no ${{ }} interpolation of untrusted content):
FAILING_TESTS=$(cat /tmp/failing_tests.txt)
FAILED_STEPS=$(cat /tmp/failed_steps.txt)This avoids the ${{ }} interpolation path entirely for untrusted content.
|
|
||
| # --- Fallback for unparseable failures --- | ||
| if [ -z "$FAILING_TESTS" ] && [ -z "$FAILED_STEPS" ]; then | ||
| issue_title="CI failure (unparseable): run #$RUN_ID" |
There was a problem hiding this comment.
The unparseable-failure fallback creates a new issue for every CI run using a unique title (CI failure (unparseable): run #$RUN_ID). There is no deduplication here.
The primary scenario that triggers this path is when the log download fails (see the || true concern on line 68). If that happens consistently — e.g. due to a log permission change or gh CLI upgrade breaking the --log-failed flag — every CI failure on master will produce a new issue indefinitely.
Consider one of:
- Writing to
$GITHUB_STEP_SUMMARYonly, instead of opening an issue, so humans can investigate without issue spam. - Searching for a generic open issue titled
CI failures (unparseable)and adding a comment to it rather than creating a new issue per run. - Capping at a daily or weekly issue (use a date-stamped title like
CI failure (unparseable): YYYY-MM-DD).
| (github.event.workflow_run.head_branch == 'master' || | ||
| github.event.workflow_run.head_branch == '') && | ||
| always() && | ||
| needs.track-failures.result != 'skipped' |
There was a problem hiding this comment.
The notify-slack condition has a subtle logic gap. The combination of always() and needs.track-failures.result != 'skipped' means this job will also fire when track-failures itself fails (e.g. due to a permissions error, API rate limit, or script bug). In that case needs.track-failures.outputs.slack_summary will be empty, and the Slack message will be sent but with no failure details — potentially confusing on-call engineers.
Consider adding needs.track-failures.result != 'failure' to the condition, or at minimum handle the case where SUMMARY is empty by adding a fallback message indicating that issue tracking itself failed:
if: >-
(github.event.workflow_run.head_branch == 'master' ||
github.event.workflow_run.head_branch == '') &&
always() &&
needs.track-failures.result != 'skipped' &&
needs.track-failures.result != 'cancelled'| "type": "section", | ||
| "text": { | ||
| "type": "mrkdwn", | ||
| "text": "$icon *$WORKFLOW_NAME* $CONCLUSION on \`$BRANCH\`" |
There was a problem hiding this comment.
Shell variables injected unescaped into a JSON string literal inside a heredoc.
$WORKFLOW_NAME, $CONCLUSION, and $BRANCH are substituted directly into the JSON payload string. If any of these contain double quotes, backslashes, or newlines (e.g. a workflow name like Build "with" quotes), the resulting JSON will be malformed and curl will send an invalid payload, silently failing with -f.
The escaped_summary is already being JSON-escaped via python3. Apply the same treatment to the other variables before inserting them:
esc_workflow=$(printf '%s' "$WORKFLOW_NAME" | python3 -c "import sys,json; print(json.dumps(sys.stdin.read())[1:-1])")
esc_conclusion=$(printf '%s' "$CONCLUSION" | python3 -c "import sys,json; print(json.dumps(sys.stdin.read())[1:-1])")
esc_branch=$(printf '%s' "$BRANCH" | python3 -c "import sys,json; print(json.dumps(sys.stdin.read())[1:-1])")Then use $esc_workflow, $esc_conclusion, $esc_branch in the heredoc.
Reference Issues/PRs
What does this implement or fix?
Extends the existing
failure_notification.yamlworkflow to automatically track intermittent CI failures as GitHub issues and enrich Slack notifications with known/new status.What changed
The workflow now has two jobs:
track-failures(runs on all branches) — when any monitored workflow (Build and Test,Build with conda,Build with analysis tools,Coverity Static Analysis,Installation Tests Execution) fails:[ FAILED ]and pytestFAILEDpatterns)Install MongoDBtiming out due to network issues)notify-slack(runs on master only, same condition as before) — sends a Slack message enriched with the failure summary, e.g.:Labels
flaky-testFlaky test: StorageLockWithAndWithoutRetry.StressManyWriters)flaky-stepFlaky step: 3.11 Linux / integration-NoCache / Install MongoDB)ci-failureLabels are auto-created on first run if they don't already exist.
Any other comments?
Checklist
Checklist for code changes...