
Conversation

@NathanGraddon

Integration of the Part 4: Advanced Topics in cuPyNumeric (profiling & debugging) .rst file.

NathanGraddon and others added 24 commits November 26, 2025 11:41
Updated references and formatting in profiling_debugging.rst for clarity and consistency.
Corrected numbering and formatting in profiling debugging documentation.
Updated profiler output images and descriptions for both inefficient and efficient CPU, utility, I/O, system, channel, GPU, and framebuffer results.
Updated formatting for section titles and removed example text.
@copy-pr-bot
Contributor

copy-pr-bot bot commented Nov 28, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@manopapad
Contributor

@ipdemes @shriram-jagan @Jacobfaib could you please take a look over this tutorial that Nathan (Sunita Chandrasekaran's student from the University of Delaware) wrote for us? Also @lightsighter FYI


**3.) After a run completes, in the directory where you ran the command you’ll see:**

- A folder: ``legate_prof/``, a self-contained HTML report
Contributor

I don't think this folder is generated with the recent legate/legate_prof

Author

Yes, you're right, I checked my latest runs. That section is now updated. Thank you.

…script.

Removed unnecessary line breaks and adjusted formatting for clarity.
Updated usage examples to include the --provenance flag for diagnostic commands.
mag1cp1n pushed a commit to mag1cp1n/cupynumeric that referenced this pull request Dec 17, 2025
@dongb
Contributor

dongb commented Jan 20, 2026

@manopapad, what's the status of the review?

Contributor

@shriram-jagan left a comment

This looks really good, please address some of the comments I left.

see:

- Setting up your environment and running `cuPyNumeric <https://docs.nvidia.com/cupynumeric/latest/user/tutorial.html>`_
- Extending cuPyNumeric with `Legate Task <https://docs.nvidia.com/cupynumeric/25.10/user/task.html>`_
Contributor

Author

Great point, fixed!

multi-node clusters. Previous sections covered how to get code running; here
the focus shifts to making workloads production-ready. At scale, success is
not just about adding GPUs or nodes; it requires ensuring that applications
remain efficient, stable, and resilient under load. That means finding
Contributor

not sure what "stable, and resilient under load" means in this context. I'd probably leave out this sentence that defines what success is at scale and instead continue to the next sentence which is more specific and relatable.

Author

I agree, I can see how that would be ambiguous; it's been removed.


* - **What you'll gain:** By combining profiling tools with solid
OOM-handling strategies, you can significantly improve the
efficiency, scalability, and reliability of cuPyNumeric
Contributor

by reliability, do you mean that the library doesn't fail on different processor variants or architectures? (you don't have to update the doc with the definition, maybe just tell me what it is in a comment)

Author

@NathanGraddon Jan 30, 2026

Yeah, it's definitely a broad statement; what I meant by "reliability" here is execution stability. Applying profiling and OOM-handling practices makes cuPyNumeric runs less likely to fail (OOM/job crashes) and reduces stalling/underutilization from memory pressure and tiny tasks, especially at scale, which would in turn make it more reliable.

That said, it doesn't exactly mean the program is reliable in the sense that it would always work across different architectures; that would depend on context such as the specific runtime and hardware environment, which is outside the scope of the profiler and OOM sections.

Please feel free to let me know if you think this part should be altered for clarity or removed; I'd be more than happy to change it!

Contributor

> Applying profiling and OOM-handling practices makes cuPyNumeric runs less likely to fail (OOM/job crashes) and reduces stalling/underutilization from memory pressure and tiny tasks, especially at scale, which would in turn make it more reliable.

I like this part and I understand now what you are trying to convey. Profiling gives you an understanding of how your application is performing at scale. In particular, it helps you understand different metrics -- memory pressure and tiny tasks, like you mentioned, are a couple of them. Can you rephrase the original sentence and make it more specific instead of saying "reliability"? Somehow I don't feel comfortable using the word reliability in the context of profiling when we have an asynchronous runtime underneath.

Author

Yes, I now have this: "What you'll gain: By combining profiling with practical OOM-handling strategies, you can improve efficiency and scaling by identifying memory pressure and over-granular execution, while reducing OOM crashes and runtime stalls across CPUs, GPUs, and multi-node systems."
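
A minimal sketch of the kind of memory-pressure mitigation referred to above (hypothetical code, not taken from the tutorial; the array size, chunk count, and the "process" helper are illustrative). cuPyNumeric is used here as a drop-in NumPy replacement:

import cupynumeric as np

n = 100_000_000
x = np.ones(n)

def process(block):
    # A chain of elementwise ops; each op may allocate a block-sized temporary.
    return np.sqrt(np.exp(block) + 1.0)

# Memory-hungry: temporaries the size of the full array are alive at once.
# result = float(process(x).sum())

# Lower peak memory: a few large chunks bound the size of each temporary,
# without splintering the work into thousands of tiny tasks.
chunk = n // 8
result = 0.0
for start in range(0, n, chunk):
    result += float(process(x[start:start + chunk]).sum())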


**For more detail, see the official references:**

- `Usage — NVIDIA legate <https://docs.nvidia.com/legate/24.11/usage.html>`_
Contributor

https://docs.nvidia.com/legate/latest/manual/usage/index.html

here and elsewhere in this page, please link to "latest" instead of linking to a specific version

Author

Done, all sections should be fixed!


# Multi-GPU/Multi-Node: multiple ranks (pass them all, e.g. N0, N1, N2, etc.)
legate_prof view /path/to/legate_*.prof

Contributor

Should we leave a link to the Legate profiler Stanford page?

Author

I think that could be a good idea. Is this what you were referring to? https://legion.stanford.edu/profiling/index.html

Author

Hey Bo or Manolis, what do you think? If yes, I can go ahead and add it in.

@manopapad @dongb

in many tiny tasks; runtime overhead dominates useful computation.
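
A hedged illustration of that pattern (hypothetical snippet, not the tutorial's example code): slicing the work into many small pieces issues one runtime task per slice, while the single vectorized call gives the runtime one coarse-grained task to schedule.

import cupynumeric as np

x = np.ones(10_000_000)

# Inefficient: every small slice becomes its own runtime task, so task launch
# and scheduling overhead dominates the actual arithmetic.
total = 0.0
for start in range(0, x.shape[0], 1000):
    total += float(x[start:start + 1000].sum())

# Efficient: a single reduction over the whole array keeps the work
# coarse-grained and lets the runtime distribute it across processors.
total = float(x.sum())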

Profiler Output and Interpretation - Inefficient CPU Results
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Contributor

I guess it might be easy to interpret if you tell the user how the data is presented in the profiler -- the profiler's x-axis is time and the y-axis is utilization, and each panel in the profile is some kind of resource, so you essentially see resource utilization in each panel (e.g., mem, processor utilization (cpu/gpu/omp), etc).

Author

Good catch, I added a general interpretation under each profiling section, for inefficient (2x) and efficient (2x). Under the first image in each part:

"Interpretation: The profiler is presented as a timeline. The x-axis is time, the y-axis is organized by resource/utilization lanes. each horizontal lane represents a particular resource stream (CPU workers, GPU Device/Host, runtime/Utility threads, memory pools like Framebuffer/Zerocopy, and copy/Channel). Colored boxes show work on that resource; the box width is how long it ran, gaps indicate idle/waiting, and dense “barcode” slivers usually mean many tiny tasks (high overhead), while long solid blocks indicate fewer, larger tasks (better utilization)."

Contributor

"each horizontal lane" -> "Each horizontal lane". Looks good.

Author

Fixed, thanks!

production-ready code. Profiling turns performance tuning from guesswork into
an intentional, data-driven process that elevates code quality from functional
to excellent.

Contributor

Can you also mention how you can "trace back" dependencies by, say, looking at a task in a panel (GPU utilization) and finding its task ID, and then searching for that ID and looking at other panels (say, utility to see when it got mapped, or channel to see if there was any data movement, etc.)? This is how we can interactively find what operations were associated with a task.

Note that the profiler allows you to search by other keys as well, not just ID.

Author

Absolutely, it's been added in the wrap-up section.

Updated links to the latest NVIDIA Legate documentation.
Added detailed interpretation of profiler timelines for CPU and GPU resources.
Added details about the traceable view feature in profiling, explaining how to use task identifiers to connect performance symptoms to runtime activities.
@shriram-jagan
Contributor

@NathanGraddon, left a few nits. Looks good on my end.

Enhanced the explanation of benefits from profiling tools and OOM-handling strategies, emphasizing memory pressure identification and execution granularity.