Change the cudagraph distribution from linearly to exponentially-decreasing #3509
```diff
@@ -210,7 +210,7 @@ class CUDAGraphBatchDimensionBuilder:
     """

     # Constant for rounding token counts when generating CUDA graph batch dimensions
-    CUDA_GRAPH_ROUNDER = 8
+    CUDA_GRAPH_ROUNDER = 2

     @staticmethod
     def _calculate_cuda_graph_token_counts(
@@ -219,8 +219,9 @@ def _calculate_cuda_graph_token_counts(
         """
         Calculate CUDA graph token counts for a given configuration.

-        This method computes evenly-spaced token counts from step_size up to
-        cuda_graph_max_tokens, ensuring proper rounding and TP alignment.
+        This method computes exponentially-decreasing token counts (powers of 2)
+        from cuda_graph_max_tokens down to CUDA_GRAPH_ROUNDER, ensuring proper
+        rounding and TP alignment.

         Args:
             tp_size: Tensor parallel size (for alignment)
@@ -232,38 +233,52 @@ def _calculate_cuda_graph_token_counts(

         Example:
             >>> _calculate_cuda_graph_token_counts
-            (tp_size=2, num_cuda_graphs=4, cuda_graph_max_tokens=1000)
-            [1000, 752, 504, 256]
+            (tp_size=1, num_cuda_graphs=8, cuda_graph_max_tokens=128)
+            [128, 64, 32, 16, 8, 4, 2, 1]
         """
         assert num_cuda_graphs >= 1, f"num_cuda_graphs must be >= 1, got {num_cuda_graphs}"
         assert (
             cuda_graph_max_tokens > 0
         ), f"cuda_graph_max_tokens must be > 0, got {cuda_graph_max_tokens}"

-        # Cuda graph step size.
-        cuda_graph_step_size = cuda_graph_max_tokens / num_cuda_graphs
-        cuda_graph_step_size = CUDAGraphBatchDimensionBuilder.CUDA_GRAPH_ROUNDER * int(
-            math.ceil(int(cuda_graph_step_size) / CUDAGraphBatchDimensionBuilder.CUDA_GRAPH_ROUNDER)
-        )
-        # Make sure divisible by TP size
-        cuda_graph_step_size = math.ceil(cuda_graph_step_size / tp_size) * tp_size
+        rounder = CUDAGraphBatchDimensionBuilder.CUDA_GRAPH_ROUNDER

-        # round down cuda graph max tokens to be multiple of TP size
+        # Round down cuda graph max tokens to be multiple of TP size
         cuda_graph_max_tokens = (cuda_graph_max_tokens // tp_size) * tp_size

         # Cuda graph token counts.
         if num_cuda_graphs == 1:
-            cuda_graph_token_counts = [cuda_graph_max_tokens]
-        else:
-            cuda_graph_token_counts = list(
-                range(cuda_graph_step_size, cuda_graph_max_tokens, cuda_graph_step_size)
-            )
-            if (
-                len(cuda_graph_token_counts) == 0
-                or cuda_graph_token_counts[-1] != cuda_graph_max_tokens
-            ):
-                cuda_graph_token_counts.append(cuda_graph_max_tokens)
-            cuda_graph_token_counts.reverse()
+            return [cuda_graph_max_tokens]

+        # Exponentially decreasing, stops after num_cuda_graphs entries
+        # or when below the minimum size.
+        # TODO(helenn/lmcafee): Extend upper range of distribution to be linearly-spaced.
+        cuda_graph_token_counts = []
```
```diff
+        val = cuda_graph_max_tokens
+        for _ in range(num_cuda_graphs):
+            # Round down to multiple of rounder, then up to multiple of TP size
+            rounded = max(rounder, (val // rounder) * rounder)
```
Contributor: the old code guaranteed that …

Contributor (author): I will bring this back and sub out a middle graph for it.
```diff
+            rounded = math.ceil(rounded / tp_size) * tp_size
+            if rounded not in cuda_graph_token_counts:
+                cuda_graph_token_counts.append(rounded)
+            val //= 2
+            if val < 1:
+                break
+
+        # Ensure cuda_graph_max_tokens is always included
+        if cuda_graph_token_counts[0] != cuda_graph_max_tokens:
+            cuda_graph_token_counts.insert(0, cuda_graph_max_tokens)
+
+        # Include a (possibly extra) size-1 graph
+        if cuda_graph_token_counts[-1] != tp_size:
+            cuda_graph_token_counts.append(tp_size)
+
+        # Trim from the middle if we exceed num_cuda_graphs requested by the user
+        # Since num_cuda_graphs >= 1, this only runs when we have at least 2 elements.
+        while len(cuda_graph_token_counts) > num_cuda_graphs:
```
Contributor: Does this line need to be … ? Otherwise …

Contributor (author): We have a guarantee that …
```diff
+            cuda_graph_token_counts.pop(-2)
```
Comment on lines +277 to +278:

Contributor: Can you add these lines afterwards: …
```diff
+
+        assert len(cuda_graph_token_counts) <= num_cuda_graphs
+        assert cuda_graph_max_tokens in cuda_graph_token_counts

         return cuda_graph_token_counts
```
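To make the new distribution easy to try end to end, here is a minimal standalone sketch of the logic above. The free function name `calculate_cuda_graph_token_counts` and the `rounder=2` default are illustrative stand-ins for the static method and the `CUDA_GRAPH_ROUNDER` class constant; the behavior is meant to mirror the diff, not replace it.

```python
import math


def calculate_cuda_graph_token_counts(tp_size, num_cuda_graphs, cuda_graph_max_tokens, rounder=2):
    """Exponentially-decreasing CUDA graph sizes, mirroring the new code path."""
    assert num_cuda_graphs >= 1
    assert cuda_graph_max_tokens > 0

    # Round down the max token count to a multiple of TP size.
    cuda_graph_max_tokens = (cuda_graph_max_tokens // tp_size) * tp_size
    if num_cuda_graphs == 1:
        return [cuda_graph_max_tokens]

    # Halve repeatedly; round each candidate down to a multiple of the
    # rounder, then up to a multiple of tp_size, skipping duplicates.
    token_counts = []
    val = cuda_graph_max_tokens
    for _ in range(num_cuda_graphs):
        rounded = max(rounder, (val // rounder) * rounder)
        rounded = math.ceil(rounded / tp_size) * tp_size
        if rounded not in token_counts:
            token_counts.append(rounded)
        val //= 2
        if val < 1:
            break

    # The largest graph must always be present.
    if token_counts[0] != cuda_graph_max_tokens:
        token_counts.insert(0, cuda_graph_max_tokens)

    # Always end with a smallest, tp_size-wide graph.
    if token_counts[-1] != tp_size:
        token_counts.append(tp_size)

    # Trim from the middle until we meet the user's graph budget.
    while len(token_counts) > num_cuda_graphs:
        token_counts.pop(-2)

    return token_counts


print(calculate_cuda_graph_token_counts(tp_size=1, num_cuda_graphs=8, cuda_graph_max_tokens=128))
# -> [128, 64, 32, 16, 8, 4, 2, 1]
print(calculate_cuda_graph_token_counts(tp_size=2, num_cuda_graphs=4, cuda_graph_max_tokens=1000))
# -> [1000, 500, 250, 2]
```

The second call exercises the middle-trim: halving yields [1000, 500, 250, 124], appending the tp_size-wide graph makes five entries, and `pop(-2)` drops 124 to meet the four-graph budget.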
Contributor: I also vote to leave the linear-spaced CGs as an option; there's no harm in doing so since the code is already set up, and we can just default to exponential in the arguments.

One reason for keeping this setting and code is that vLLM uses linear spacing; they just create far more graphs than we do because they can create them so quickly and efficiently, and I think that speaks to how unoptimized our CG system is. So I personally would keep the old option and plan to use it in the future.
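For contrast, a minimal sketch of the linearly-spaced path under discussion, mirroring the removed code in the diff above (the free function `linear_token_counts` and its `rounder=8` default are illustrative; the original used the class constant):

```python
import math


def linear_token_counts(tp_size, num_cuda_graphs, cuda_graph_max_tokens, rounder=8):
    """Linearly-spaced CUDA graph sizes, mirroring the removed code path."""
    # Step size, rounded up to a multiple of the rounder, then of tp_size.
    step = cuda_graph_max_tokens / num_cuda_graphs
    step = rounder * int(math.ceil(int(step) / rounder))
    step = math.ceil(step / tp_size) * tp_size

    # Round down the max token count to a multiple of TP size.
    cuda_graph_max_tokens = (cuda_graph_max_tokens // tp_size) * tp_size

    if num_cuda_graphs == 1:
        return [cuda_graph_max_tokens]
    counts = list(range(step, cuda_graph_max_tokens, step))
    if not counts or counts[-1] != cuda_graph_max_tokens:
        counts.append(cuda_graph_max_tokens)
    counts.reverse()
    return counts


print(linear_token_counts(tp_size=2, num_cuda_graphs=4, cuda_graph_max_tokens=1000))
# -> [1000, 768, 512, 256]
```

Here the step of 250 rounds up to 256, yielding evenly spaced counts at multiples of 256 with the max appended on top.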
Contributor (author): I would be against adding a new flag to toggle the distribution of inference cudagraphs. Inference already has a lot of flags, users don't know how to combine them effectively, and there is no empirical case for keeping the linear distribution around at the moment. I will leave a TODO to re-enable it when someone wants to take it on.
Contributor (author): Also, #3527 already implements the vLLM strategy, orthogonal to this.
Contributor: @mathemakitten on second thought, I am also in favor of slowly phasing out the older strategy, mostly because it's the most stress-tested one we have right now. We could make yours the default while keeping the option to fall back onto the old one.
Contributor: re: #3527, it does use a linear function, but it builds a lot more cudagraphs than our default strategy does.