
Add MCE and PCIe AER sub-checks to check_syslogs #55

Open
gustcol wants to merge 1 commit into facebookresearch:main from gustcol:feature/check-dmesg

@gustcol
Contributor

@gustcol gustcol commented Feb 24, 2026

Summary

  • Add mce and pcie-aer sub-commands to the existing check_syslogs Click group
  • Detect Machine Check Exceptions (MCE) and PCIe Advanced Error Reporting (AER) messages from dmesg
  • Follow the established pattern of xid, link-flaps, and io-errors sub-checks
  • Include documentation pages under check-syslogs/ and test coverage
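
For readers unfamiliar with these kernel messages, here is a minimal sketch of how MCE and PCIe AER lines could be picked out of dmesg output. The regular expressions, the `DmesgCounts` type, and `count_errors` are illustrative assumptions, not the patterns or names used in this PR, and real kernel message formats vary between kernel versions.

```python
import re
from typing import Iterable, NamedTuple

# Illustrative patterns only (assumptions); the PR's actual expressions
# are not shown in this conversation.
MCE_RE = re.compile(r"mce: \[Hardware Error\]|Machine Check Exception", re.IGNORECASE)
AER_RE = re.compile(
    r"\bAER:.*\b(Corrected|Correctable|Uncorrected|Uncorrectable)\b", re.IGNORECASE
)


class DmesgCounts(NamedTuple):
    mce: int
    aer_corrected: int
    aer_uncorrectable: int


def count_errors(lines: Iterable[str]) -> DmesgCounts:
    """Count MCE lines and AER lines, splitting AER by severity keyword."""
    mce = corrected = uncorrectable = 0
    for line in lines:
        if MCE_RE.search(line):
            mce += 1
            continue
        match = AER_RE.search(line)
        if match:
            # "Corrected"/"Correctable" are recoverable; anything else is not.
            if match.group(1).lower().startswith("correct"):
                corrected += 1
            else:
                uncorrectable += 1
    return DmesgCounts(mce, corrected, uncorrectable)


sample = [
    "[   12.3] mce: [Hardware Error]: Machine check events logged",
    "[   13.0] pcieport 0000:00:1c.0: AER: Corrected error received: 0000:01:00.0",
    "[   14.0] pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: 0000:02:00.0",
    "[   15.0] usb 1-1: new high-speed USB device number 2 using xhci_hcd",
]
counts = count_errors(sample)
```

Splitting AER matches by severity keyword is what would let corrected errors be reported separately from uncorrectable ones.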

Test plan

# Run the new tests
python -m pytest gcm/tests/health_checks_tests/test_check_syslogs_mce_pcie.py -v

# Run all check_syslogs tests
python -m pytest gcm/tests/health_checks_tests/test_check_syslogs.py gcm/tests/health_checks_tests/test_check_syslogs_mce_pcie.py -v

# Usage examples
health_checks check-syslogs mce [CLUSTER] app
health_checks check-syslogs pcie-aer [CLUSTER] app

Sample output

MCE check (clean):

OK: No MCE errors detected.

MCE check (errors found):

CRITICAL: 2 MCE error(s) detected.

PCIe AER check (corrected errors):

WARN: 1 PCIe AER corrected error(s) detected.

PCIe AER check (uncorrectable errors):

CRITICAL: 2 PCIe AER error(s) detected, including uncorrectable.
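
The sample output above suggests a simple mapping from error counts to a status line. A hedged sketch of that mapping follows. The function names are hypothetical, the "OK" wording for the PCIe AER case is assumed (it does not appear in the samples), and the sketch assumes the CRITICAL count is the total of corrected and uncorrectable errors.

```python
def mce_status(count: int) -> str:
    """Any MCE is treated as critical, matching the sample output."""
    if count == 0:
        return "OK: No MCE errors detected."
    return f"CRITICAL: {count} MCE error(s) detected."


def pcie_aer_status(corrected: int, uncorrectable: int) -> str:
    """Corrected-only errors warn; any uncorrectable error is critical."""
    if uncorrectable > 0:
        # Assumption: the reported count is the total of both kinds.
        total = corrected + uncorrectable
        return f"CRITICAL: {total} PCIe AER error(s) detected, including uncorrectable."
    if corrected > 0:
        return f"WARN: {corrected} PCIe AER corrected error(s) detected."
    # Assumed wording; the clean PCIe AER case is not shown in the samples.
    return "OK: No PCIe AER errors detected."
```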

@github-actions

CI Commands

The following CI workflows run automatically on every push and pull request:

| Workflow | What it runs |
| --- | --- |
| GPU Cluster Monitoring Python CI | lint, tests, typecheck, format, deb build, pyoxidizer builds |
| Go packages CI | shelper tests, format, lint |

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

| Command | Description | Requires approval? |
| --- | --- | --- |
| /metaci tests | Runs Meta internal integration tests (pytest) | Yes: a maintainer must trigger the command and approve the deployment request |
| /metaci integration tests | Same as above (alias) | Yes |

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.

meta-cla bot commented Feb 24, 2026

Hi @gustcol!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@gustcol gustcol closed this Feb 24, 2026
@gustcol gustcol reopened this Feb 24, 2026
@meta-cla

meta-cla bot commented Feb 24, 2026

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@meta-cla meta-cla bot added the cla signed label Feb 24, 2026
@gustcol gustcol closed this Feb 24, 2026
@gustcol gustcol reopened this Feb 24, 2026
@luccabb
Member

luccabb commented Feb 24, 2026

@gustcol gustcol force-pushed the feature/check-dmesg branch from 53fc6db to 822200d on February 24, 2026 20:14
@gustcol
Contributor Author

gustcol commented Feb 24, 2026

Hi @luccabb, thanks for the feedback!

I've added the documentation page:

  • website/docs/GCM_Health_Checks/health_checks/check-dmesg.md — covers overview, error detection patterns (Xid, MCE, PCIe AER), command-line options, exit conditions, and usage examples

The doc follows the same structure as existing health check pages like xid.md and check-syslogs/README.md.

Let me know if you'd like any adjustments!

@gustcol gustcol force-pushed the feature/check-dmesg branch 2 times, most recently from a0675d8 to bc77d65 on February 25, 2026 00:22
@luccabb
Member

luccabb commented Feb 25, 2026

what do you think of:

  1. keeping XIDs at: https://facebookresearch.github.io/gcm/docs/GCM_Health_Checks/health_checks/check-syslogs/xid/

  2. moving Machine Check Exceptions (MCE), and PCIe AER errors to be their own checks under check-syslogs? https://facebookresearch.github.io/gcm/docs/GCM_Health_Checks/health_checks/check-syslogs/

Implement mce and pcie-aer sub-commands under the existing check_syslogs
Click group for detecting Machine Check Exceptions and PCIe Advanced
Error Reporting messages from dmesg. Follows the established pattern
of xid, link-flaps, and io-errors sub-checks.

Include documentation pages and comprehensive test coverage.
@gustcol gustcol force-pushed the feature/check-dmesg branch from bc77d65 to 9f59a6c on February 25, 2026 02:26
@gustcol gustcol requested a review from A-Kokolis as a code owner February 25, 2026 02:26
@gustcol
Contributor Author

gustcol commented Feb 25, 2026

Hi @luccabb, great suggestion! I've refactored the PR:

  • Removed the standalone check_dmesg command entirely
  • Added mce and pcie-aer as new sub-commands under check_syslogs, following the same pattern as xid, link-flaps, and io-errors

Now the usage looks like:

# MCE check
health_checks check-syslogs mce [CLUSTER] app

# PCIe AER check
health_checks check-syslogs pcie-aer [CLUSTER] app

Each sub-check has its own:

  • Protocol method in Syslog + implementation in SyslogImpl
  • Processing function (process_mce_output, process_pcie_aer_output)
  • HealthCheckName enum entry
  • Feature flag killswitch
  • Documentation page under check-syslogs/
  • Test coverage (7 new tests, 22 total passing)

XIDs remain at their current location in check_syslogs xid as you suggested.
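
To illustrate the "Protocol method + implementation + processing function" split listed above, here is a rough, self-contained sketch. Only the names `Syslog`, `SyslogImpl`, and `process_mce_output` come from this PR. The method name `read_dmesg`, the signatures, the matched markers, and the `FakeSyslog` test double are assumptions made for illustration and may differ from the actual code.

```python
import subprocess
from typing import Protocol


class Syslog(Protocol):
    """Interface the check depends on; the method name is hypothetical."""

    def read_dmesg(self) -> str: ...


class SyslogImpl:
    """Reads the kernel ring buffer by running the dmesg command."""

    def read_dmesg(self) -> str:
        # dmesg may require elevated privileges on some systems.
        result = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=False
        )
        return result.stdout


def process_mce_output(output: str) -> int:
    """Count lines that look like MCE reports (assumed markers)."""
    markers = ("Machine Check Exception", "mce: [Hardware Error]")
    return sum(
        1 for line in output.splitlines() if any(m in line for m in markers)
    )


def run_mce_check(syslog: Syslog) -> int:
    """Glue step: fetch the log text, then hand it to the pure parser."""
    return process_mce_output(syslog.read_dmesg())


class FakeSyslog:
    """Test double that satisfies the Syslog protocol structurally."""

    def __init__(self, text: str) -> None:
        self._text = text

    def read_dmesg(self) -> str:
        return self._text
```

Keeping the parser a pure function of the log text is what makes it testable without access to a real kernel log: a test passes a `FakeSyslog` with canned output instead of `SyslogImpl`.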

@gustcol gustcol changed the title from "Add check_dmesg health check for GPU kernel errors" to "Add MCE and PCIe AER sub-checks to check_syslogs" Feb 25, 2026