Adding monitoring Scripts #101

Open
mpatilgit-hub9 wants to merge 2 commits into IBM:main from mpatilgit-hub9:main
@mpatilgit-hub9
Member

This PR adds monitoring scripts for workflow monitoring.

Signed-off-by: mpatilgit-hub9 <Mahesh.Patil9@ibm.com>
Comment thread .github/workflows/blank.yml Outdated
  build:
    strategy:
      matrix:
        runner: ["ubuntu-24.04-ppc64le", "ubuntu-24.04-ppc64le-p10"]
Member

I think we should monitor all types of workers (default, large, p/z) and not limit to the above two.

Member

"ubuntu-24.04-ppc64le",
"ubuntu-24.04-ppc64le-p10",
"ubuntu-24.04-ppc64le-2xlarge",
"ubuntu-24.04-ppc64le-2xlarge-p10",
"ubuntu-24.04-ppc64le-4xlarge",
"ubuntu-24.04-ppc64le-4xlarge-p10",
"ubuntu-24.04-s390x"

We have tested the runners above and are ready to push the changes to this PR. Hopefully this fulfills our need.
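
As a rough sketch, the runner list above could be expressed in the workflow matrix as follows. The labels are copied verbatim from this comment, and the `build` job and `runner` key come from the snippet under review. The `fail-fast: false` setting, the `runs-on` line, and the single step are assumptions for illustration, not confirmed contents of the PR.

```yaml
jobs:
  build:
    strategy:
      # Assumption: let every runner report independently instead of
      # cancelling the remaining matrix jobs on the first failure.
      fail-fast: false
      matrix:
        runner:
          - "ubuntu-24.04-ppc64le"
          - "ubuntu-24.04-ppc64le-p10"
          - "ubuntu-24.04-ppc64le-2xlarge"
          - "ubuntu-24.04-ppc64le-2xlarge-p10"
          - "ubuntu-24.04-ppc64le-4xlarge"
          - "ubuntu-24.04-ppc64le-4xlarge-p10"
          - "ubuntu-24.04-s390x"
    # Assumption: each job targets the runner named by its matrix entry.
    runs-on: ${{ matrix.runner }}
    steps:
      - run: echo "Running on ${{ matrix.runner }}"
```

Writing the matrix as a block list rather than an inline array keeps later additions to one line per runner in the diff.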

Member Author

Hi Anup,

We have now updated the workflow to include all runner types (default, p/z, 2xlarge, 4xlarge, etc.) as suggested and validated the execution across them.

One observation from testing:
Runners like 2xlarge and 4xlarge tend to have higher queue and execution times compared to standard runners. If we monitor all of them using the same thresholds, it may lead to frequent false alerts from the watchdog.

As a follow-up improvement, we can consider:

  • Splitting heavy runners (2xlarge/4xlarge) into a separate workflow, or
  • Applying relaxed thresholds / lower monitoring frequency for them

This will help reduce alert noise and improve signal quality.

For now, the current implementation covers all runner types as expected. Please let me know if you’d like us to proceed with the split-monitoring approach as well.
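
One way the relaxed-threshold option might look inside a single workflow is a per-runner limit carried in the matrix. This is only a sketch: `max_minutes` is a hypothetical matrix key, the minute values are placeholders rather than measured numbers, and the watchdog's own thresholds are not shown in this PR, so they are not modeled here.

```yaml
jobs:
  build:
    strategy:
      fail-fast: false
      matrix:
        include:
          # Hypothetical limits in minutes; tune them from observed run times.
          - runner: "ubuntu-24.04-ppc64le"
            max_minutes: 10
          - runner: "ubuntu-24.04-ppc64le-2xlarge"
            max_minutes: 30
          - runner: "ubuntu-24.04-ppc64le-4xlarge"
            max_minutes: 30
    runs-on: ${{ matrix.runner }}
    # Limits how long the job may run once it has started.
    # Time spent waiting in the queue would still need a watchdog-side threshold.
    timeout-minutes: ${{ matrix.max_minutes }}
    steps:
      - run: echo "Running on ${{ matrix.runner }}"
```

This keeps all runners in one file, at the cost of a longer matrix; it does not change how often the heavy runners are exercised.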

Member

I like the idea of splitting up the 2x and 4xlarge runners into a separate workflow to reduce false alarms and to prioritize the larger workflow runs actually using this service.
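
A separate workflow for the larger runners might look roughly like the sketch below. The file name, workflow name, and cron cadence are hypothetical, and the trigger is an assumption since the PR's `on:` section is not shown in this thread. The runner labels and the echo step are taken from the discussion above.

```yaml
# Hypothetical file: .github/workflows/monitor-large-runners.yml
name: Monitor large runners

on:
  schedule:
    # Hypothetical cadence: every 6 hours, less often than the standard runners.
    - cron: "0 */6 * * *"
  # Allow manual runs while tuning.
  workflow_dispatch:

jobs:
  build:
    strategy:
      fail-fast: false
      matrix:
        runner:
          - "ubuntu-24.04-ppc64le-2xlarge"
          - "ubuntu-24.04-ppc64le-2xlarge-p10"
          - "ubuntu-24.04-ppc64le-4xlarge"
          - "ubuntu-24.04-ppc64le-4xlarge-p10"
    runs-on: ${{ matrix.runner }}
    steps:
      - uses: actions/checkout@v4
      - name: Run a one-line script
        run: echo "Hello, world! GitHub app is running successfully on ${{ matrix.runner }}"
```

With the heavy runners in their own file, the watchdog can apply different limits per workflow, and monitoring runs compete less often with real jobs for the larger machines.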

    steps:
      - uses: actions/checkout@v4
      - name: Run a one-line script
        run: echo "Hello, world! GitHub app is running successfully on ${{ matrix.runner }}"
Member

Let's think about adding a basic test (I/O, network) rather than just an echo test (though I'm open to adding that in future PRs).

Member

@anup-kodlekere what did you have in mind? I think this could fulfill the monitoring requirement, but it would be nice to do something that mimics a real workflow. I like the idea of an I/O test.
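
To make the I/O and network idea concrete, here is a minimal sketch of steps that could replace or follow the echo step inside the job. It assumes GNU `dd` and `curl` are installed on every runner image, which has not been verified here for the ppc64le and s390x runners. The 64 MiB size, the 30-second limit, and the choice of `https://github.com` as the target are illustrative.

```yaml
steps:
  - uses: actions/checkout@v4
  - name: Basic I/O test
    run: |
      # Write 64 MiB to the runner's temp directory and flush it to disk.
      dd if=/dev/zero of="$RUNNER_TEMP/io-test.bin" bs=1M count=64 conv=fsync
      # Read the file back, then clean up.
      dd if="$RUNNER_TEMP/io-test.bin" of=/dev/null bs=1M
      rm -f "$RUNNER_TEMP/io-test.bin"
  - name: Basic network test
    run: |
      # Fail the step if github.com cannot be fetched within 30 seconds.
      curl --fail --silent --show-error --max-time 30 --output /dev/null https://github.com
```

Both steps fail with a non-zero exit code on error, so the watchdog sees a failed run rather than a misleading green echo.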

Signed-off-by: mpatilgit-hub9 <Mahesh.Patil9@ibm.com>
@mtarsel
Member

mtarsel commented Apr 29, 2026

So can this PR be closed?

4 participants