57 changes: 57 additions & 0 deletions antora/modules/ROOT/nav.adoc
@@ -149,3 +149,60 @@
*** xref:Building_a_Simple_Engine/Advanced_Topics/Robustness2.adoc[Robustness2]
** Appendix
*** xref:Building_a_Simple_Engine/Appendix/appendix.adoc[Appendix]
* Advanced Vulkan Compute
** xref:Advanced_Vulkan_Compute/introduction.adoc[Introduction]
** The Compute Architecture and Execution Model
*** xref:Advanced_Vulkan_Compute/02_Compute_Architecture/01_introduction.adoc[Introduction]
*** xref:Advanced_Vulkan_Compute/02_Compute_Architecture/02_workgroups_and_invocations.adoc[Workgroups and Invocations]
*** xref:Advanced_Vulkan_Compute/02_Compute_Architecture/03_occupancy_and_latency_hiding.adoc[Occupancy and Latency Hiding]
*** xref:Advanced_Vulkan_Compute/02_Compute_Architecture/04_vulkan_1_4_scalar_layouts.adoc[Vulkan 1.4 Scalar Layouts]
** Memory Models and Consistency
*** xref:Advanced_Vulkan_Compute/03_Memory_Models/01_introduction.adoc[Introduction]
*** xref:Advanced_Vulkan_Compute/03_Memory_Models/02_vulkan_memory_model.adoc[The Vulkan Memory Model]
*** xref:Advanced_Vulkan_Compute/03_Memory_Models/03_shared_memory_lds.adoc[Shared Memory (LDS)]
*** xref:Advanced_Vulkan_Compute/03_Memory_Models/04_memory_consistency.adoc[Memory Consistency]
** Subgroup Operations: The Hidden Power
*** xref:Advanced_Vulkan_Compute/04_Subgroup_Operations/01_introduction.adoc[Introduction]
*** xref:Advanced_Vulkan_Compute/04_Subgroup_Operations/02_cross_invocation_communication.adoc[Cross-Invocation Communication]
*** xref:Advanced_Vulkan_Compute/04_Subgroup_Operations/03_subgroup_partitioning.adoc[Subgroup Partitioning]
*** xref:Advanced_Vulkan_Compute/04_Subgroup_Operations/04_non_uniform_indexing.adoc[Non-Uniform Indexing]
** Heterogeneous Ecosystem: OpenCL on Vulkan
*** xref:Advanced_Vulkan_Compute/05_OpenCL_on_Vulkan/01_introduction.adoc[Introduction]
*** xref:Advanced_Vulkan_Compute/05_OpenCL_on_Vulkan/02_setup_and_installation.adoc[Setup and Installation]
*** xref:Advanced_Vulkan_Compute/05_OpenCL_on_Vulkan/03_clspv_pipeline.adoc[The clspv Pipeline]
*** xref:Advanced_Vulkan_Compute/05_OpenCL_on_Vulkan/04_kernel_portability.adoc[Kernel Portability]
*** xref:Advanced_Vulkan_Compute/05_OpenCL_on_Vulkan/05_clvk_and_layering.adoc[clvk and Layering]
** High-Level Abstraction: SYCL and Single-Source C++
*** xref:Advanced_Vulkan_Compute/06_SYCL_and_Single_Source_CPP/01_introduction.adoc[Introduction]
*** xref:Advanced_Vulkan_Compute/06_SYCL_and_Single_Source_CPP/02_setup_and_installation.adoc[Setup and Installation]
*** xref:Advanced_Vulkan_Compute/06_SYCL_and_Single_Source_CPP/03_single_source_gpgpu.adoc[Single-Source GPGPU]
*** xref:Advanced_Vulkan_Compute/06_SYCL_and_Single_Source_CPP/04_vulkan_interoperability.adoc[Vulkan Interoperability]
*** xref:Advanced_Vulkan_Compute/06_SYCL_and_Single_Source_CPP/05_unified_shared_memory_usm.adoc[Unified Shared Memory (USM)]
** Advanced Data Structures on the GPU
*** xref:Advanced_Vulkan_Compute/07_Advanced_Data_Structures/01_introduction.adoc[Introduction]
*** xref:Advanced_Vulkan_Compute/07_Advanced_Data_Structures/02_gpu_resident_trees.adoc[GPU-Resident Trees]
*** xref:Advanced_Vulkan_Compute/07_Advanced_Data_Structures/03_global_atomic_management.adoc[Global Atomic Management]
*** xref:Advanced_Vulkan_Compute/07_Advanced_Data_Structures/04_device_addressable_buffers.adoc[Device-Addressable Buffers]
** Indirect Dispatch and GPU-Driven Pipelines
*** xref:Advanced_Vulkan_Compute/08_GPU_Driven_Pipelines/01_introduction.adoc[Introduction]
*** xref:Advanced_Vulkan_Compute/08_GPU_Driven_Pipelines/02_indirect_dispatch.adoc[Indirect Dispatch]
*** xref:Advanced_Vulkan_Compute/08_GPU_Driven_Pipelines/03_gpu_side_command_generation.adoc[GPU-Side Command Generation]
*** xref:Advanced_Vulkan_Compute/08_GPU_Driven_Pipelines/04_multi_draw_indirect_mdi.adoc[Multi-Draw Indirect (MDI)]
** Asynchronous Compute Orchestration
*** xref:Advanced_Vulkan_Compute/09_Asynchronous_Compute/01_introduction.adoc[Introduction]
*** xref:Advanced_Vulkan_Compute/09_Asynchronous_Compute/02_concurrent_execution.adoc[Concurrent Execution]
*** xref:Advanced_Vulkan_Compute/09_Asynchronous_Compute/03_timeline_semaphores.adoc[Timeline Semaphores]
*** xref:Advanced_Vulkan_Compute/09_Asynchronous_Compute/04_queue_priority.adoc[Queue Priority]
** Cooperative Matrices and Specialized Math
*** xref:Advanced_Vulkan_Compute/10_Specialized_Math/01_introduction.adoc[Introduction]
*** xref:Advanced_Vulkan_Compute/10_Specialized_Math/02_cooperative_matrices.adoc[Cooperative Matrices]
*** xref:Advanced_Vulkan_Compute/10_Specialized_Math/03_mixed_precision.adoc[Mixed Precision]
** Performance Auditing and Optimization
*** xref:Advanced_Vulkan_Compute/11_Performance_Optimization/01_introduction.adoc[Introduction]
*** xref:Advanced_Vulkan_Compute/11_Performance_Optimization/02_instruction_throughput.adoc[Instruction Throughput Analysis]
*** xref:Advanced_Vulkan_Compute/11_Performance_Optimization/03_divergence_audit.adoc[The "Divergence" Audit]
** Diagnostics and AI-Assisted Compute Refinement
*** xref:Advanced_Vulkan_Compute/12_Diagnostics_and_Refinement/01_introduction.adoc[Introduction]
*** xref:Advanced_Vulkan_Compute/12_Diagnostics_and_Refinement/02_compute_validation.adoc[Compute Validation]
*** xref:Advanced_Vulkan_Compute/12_Diagnostics_and_Refinement/03_assistant_led_optimization.adoc[Assistant-Led Optimization]
** xref:Advanced_Vulkan_Compute/conclusion.adoc[Conclusion]
@@ -0,0 +1,45 @@
:pp: {plus}{plus}

= The Compute Architecture and Execution Model: Introduction

== Overview

To write efficient compute kernels, you must look beyond the abstract execution model of "workgroups" and "invocations" and understand how these concepts map to the physical hardware. While Vulkan provides a cross-vendor API, the silicon beneath it from AMD, NVIDIA, and Intel has specific ways of handling your data.

In this chapter, we will bridge the gap between your shader code and the silicon. We'll explore how the 3D grid system you define in `vkCmdDispatch` is sliced, diced, and distributed across the GPU's **Compute Units (CU)** or **Streaming Multiprocessors (SM)**.

=== The Language of Silicon

Before we dive in, let's align our vocabulary. Different vendors use different names for the same concepts:

* **Workgroups** (Vulkan/OpenCL) are often mapped to **Thread Blocks** (CUDA).
* **Invocations** (Vulkan) are simply **Threads**.
* **Subgroups** (Vulkan) are called **Wavefronts** (AMD) or **Warps** (NVIDIA).
* **Compute Units** (AMD) are equivalent to **Streaming Multiprocessors** (NVIDIA).

Understanding these mappings allows you to read hardware-specific documentation and performance guides regardless of which GPU you are targeting.

== Hardware Mapping

When you dispatch a workload, the GPU's hardware command processor breaks the global grid into individual workgroups. These workgroups are the fundamental unit of scheduling.

A critical rule of the GPU execution model is **workgroup atomicity**: once a workgroup is assigned to a physical compute unit, all its invocations stay on that unit until the workgroup completes. They cannot be split across multiple units. This locality is what enables **Shared Memory (LDS - Local Data Share)**—since all threads in a workgroup are physically on the same hardware block, they can share a dedicated, ultra-fast scratchpad memory.

=== Invocations and SIMD

While workgroups are the scheduling unit, the **invocation** is the smallest unit of execution. However, GPUs are **SIMD (Single Instruction, Multiple Data)** machines. They don't execute invocations one by one; instead, they group them into small bundles (Subgroups).

In these bundles, every invocation executes the exact same instruction at the same time, but on different data. This is incredibly efficient for math, but it introduces a major pitfall: **Branch Divergence**. If your code contains an `if` statement where some threads go left and others go right, the hardware must execute *both* paths, masking out the inactive threads for each.
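Branch divergence is easiest to see in a toy model. The C{pp} sketch below simulates a small 8-lane bundle: every lane steps through *both* paths of the branch, and an active-mask decides which lanes actually commit results. The lane count and the explicit mask loops are simplifications for illustration; real hardware uses wider bundles and per-instruction predication.

[source,cpp]
----
#include <array>

// Toy model of an 8-lane SIMD bundle executing a divergent branch.
// Every lane executes BOTH paths; the active-mask decides which
// lanes commit their results.
constexpr int kLanes = 8;

std::array<int, kLanes> runDivergentBranch(const std::array<int, kLanes>& data) {
    std::array<int, kLanes> out{};

    // Build the mask once, as the hardware's branch unit would.
    std::array<bool, kLanes> takeThen{};
    for (int lane = 0; lane < kLanes; ++lane)
        takeThen[lane] = (data[lane] % 2 == 0);

    // Path A ("then"): executed for all lanes, committed only where the mask is set.
    for (int lane = 0; lane < kLanes; ++lane)
        if (takeThen[lane]) out[lane] = data[lane] * 2;

    // Path B ("else"): also executed in full, with the mask inverted.
    for (int lane = 0; lane < kLanes; ++lane)
        if (!takeThen[lane]) out[lane] = data[lane] + 100;

    return out;
}
----

Note that both loops run regardless of the data: the total cost is the *sum* of both paths, which is exactly the penalty divergence imposes on real hardware.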

== Performance Metrics

Throughout this section, we will focus on two key metrics that determine how well you're utilizing the hardware:

1. **Occupancy**: This is the "concurrency" metric. It represents how many active workgroups are residing on a compute unit compared to its theoretical maximum. High occupancy helps **hide latency**—if one bundle is waiting for a memory fetch from slow VRAM, the scheduler can instantly switch to another bundle that's ready to do math.
2. **Bandwidth Efficiency**: This is the "throughput" metric. Modern GPUs have massive memory bandwidth, but it's easily wasted by poor data alignment. We'll see how Vulkan 1.4's **Scalar Layouts** allow us to pack data tightly, ensuring that the shader actually uses every byte fetched from VRAM.
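To make the bandwidth point concrete, here is a small C{pp} sketch of the two layout rules for a struct of consecutive `vec3` members: under std430 a `vec3` is aligned like a `vec4` (16 bytes), while the scalar layout aligns it to its 4-byte components, so consecutive `vec3`s pack tightly. This calculator covers only this one case, not the full layout rules.

[source,cpp]
----
#include <cstddef>

// Round an offset up to the next multiple of `alignment`.
std::size_t alignUp(std::size_t offset, std::size_t alignment) {
    return (offset + alignment - 1) / alignment * alignment;
}

// Size of a struct containing `count` consecutive vec3 members,
// under std430 (vec3 aligned to 16) or scalar layout (aligned to 4).
std::size_t structSizeOfVec3s(int count, bool scalarLayout) {
    const std::size_t vec3Size  = 12;                     // 3 * 4 bytes
    const std::size_t vec3Align = scalarLayout ? 4 : 16;  // scalar vs std430
    std::size_t offset = 0;
    for (int i = 0; i < count; ++i)
        offset = alignUp(offset, vec3Align) + vec3Size;
    // The struct is padded out to its strictest member alignment.
    return alignUp(offset, vec3Align);
}
----

Two `vec3`s cost 32 bytes under std430 but only 24 under the scalar layout: a 25% bandwidth saving for every element fetched.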

== What's Next?

We'll start by diving into the 3D grid system and seeing exactly how it maps to physical hardware. From there, we'll learn how to calculate theoretical occupancy and use engine tools to monitor real-world utilization. Finally, we'll master the scalar block layouts to maximize your data throughput.

xref:../introduction.adoc[Previous: Introduction] | xref:02_workgroups_and_invocations.adoc[Next: Workgroups and Invocations]
@@ -0,0 +1,83 @@
:pp: {plus}{plus}

= Workgroups and Invocations: The 3D Lattice

== Introduction

In the basic compute tutorial, we used a simple one-dimensional dispatch. While that works for simple tasks, it doesn't represent how the GPU actually schedules work. To write high-performance kernels, you need to understand how Vulkan's 3D grid system maps to the physical silicon of the GPU.

The grid system is more than just a convenient way to index into textures; it defines how your workload is subdivided and scheduled across the hardware.

== The Three-Tier Hierarchy

When you define a compute dispatch, you are working with a hierarchy of units. Getting these dimensions right is the first step toward high performance.

1. **Global Dispatch Grid**: This is the entire workload, defined in `vkCmdDispatch(x, y, z)`.
2. **Workgroups**: The global grid is subdivided into workgroups. The GPU's hardware scheduler assigns these workgroups to physical compute units.
3. **Invocations**: Each workgroup contains multiple individual threads, defined by the `local_size` in your shader.

=== Workgroup Locality

In the previous section, we mentioned that a workgroup cannot be split across multiple physical **Compute Units** (CU, on AMD/Intel) or **Streaming Multiprocessors** (SM, on NVIDIA). This means that all invocations within a workgroup are physically executed on the same hardware block.

This locality is a key design constraint. It allows invocations in the same workgroup to share a fast, local memory known as **LDS** (Local Data Share) or **groupshared** memory, but it also means that the size of your workgroup is limited by the physical resources of a single CU/SM. If your workgroup size is too large, the GPU simply won't be able to schedule it.

== The Math of Indexing

Vulkan provides several built-in variables to help you find your place in the grid. In Slang, these are typically passed as parameters to the entry point using semantics like `SV_DispatchThreadID`, `SV_GroupThreadID`, and `SV_GroupID`.

Let's look at how these relate in a typical shader:

[source,slang]
----
[shader("compute")]
[numthreads(16, 16, 1)]
void main(
    uint3 groupID  : SV_GroupID,          // gl_WorkGroupID
    uint3 localID  : SV_GroupThreadID,    // gl_LocalInvocationID
    uint3 globalID : SV_DispatchThreadID  // gl_GlobalInvocationID
) {
    // globalID: The unique index for this thread in the entire grid
    // Formula: globalID = groupID * numthreads + localID
    uint x = globalID.x;
    uint y = globalID.y;
    // Process pixel (x, y)
}
----

Using a 2D or 3D grid makes spatial tasks (like image processing or physics simulations) much cleaner. Instead of manually calculating a 1D index, you can use `.xy` or `.xyz` coordinates that match your data structure.
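The indexing formula can be sanity-checked on the host with plain integer math, one axis at a time. The helper names below (`decompose`, `recompose`) are ours for illustration, not Vulkan API:

[source,cpp]
----
#include <cstdint>

// One axis of the grid hierarchy: which workgroup, and which
// invocation within it.
struct AxisIndex {
    uint32_t group;  // SV_GroupID along this axis
    uint32_t local;  // SV_GroupThreadID along this axis
};

// Split a flat global index into (group, local) for a given local size.
AxisIndex decompose(uint32_t globalID, uint32_t localSize) {
    return { globalID / localSize, globalID % localSize };
}

// The shader-side relation: globalID = groupID * numthreads + localID.
uint32_t recompose(AxisIndex idx, uint32_t localSize) {
    return idx.group * localSize + idx.local;
}
----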

== Choosing Optimal Sizes

A common mistake is choosing workgroup sizes based solely on what "fits" your data. For example, if you're processing a 10x10 image, you might choose a workgroup size of (10, 10, 1).

However, GPUs execute invocations in bundles of 32 or 64—known as **Subgroups**, **Warps** (NVIDIA), or **Wavefronts** (AMD). If your workgroup size is not a multiple of the hardware's native bundle size, you are leaving silicon idle. This is called **internal fragmentation**.
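The waste is easy to quantify. A minimal C{pp} sketch, assuming a fixed hardware bundle width:

[source,cpp]
----
#include <cstdint>

// Lanes wasted to internal fragmentation when a workgroup's thread
// count is not a multiple of the hardware bundle width. For example,
// a (10, 10, 1) workgroup = 100 threads on 32-wide warps occupies
// 4 warps = 128 lanes, leaving 28 lanes permanently idle.
struct Fragmentation {
    uint32_t bundles;    // hardware bundles allocated for the workgroup
    uint32_t idleLanes;  // lanes that never do useful work
};

Fragmentation fragmentation(uint32_t workgroupThreads, uint32_t bundleWidth) {
    uint32_t bundles = (workgroupThreads + bundleWidth - 1) / bundleWidth;
    return { bundles, bundles * bundleWidth - workgroupThreads };
}
----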

=== The Rule of 32/64

* **NVIDIA** GPUs typically prefer multiples of **32** (Warps).
* **AMD** GPUs typically prefer multiples of **64** (Wavefronts), though modern RDNA architectures can also handle 32.
* **Intel** GPUs have variable sizes (8, 16, 32).

A safe, portable choice for many workloads is a workgroup size of **64** (e.g., `8x8`) or **256** (e.g., `16x16` or `8x8x4`). This ensures that most hardware can keep its **SIMD** (Single Instruction, Multiple Data) lanes full.

== Dispatching the Work

When you call `vkCmdDispatch(groupCountX, groupCountY, groupCountZ)`, you are defining how many times the `local_size` block is repeated.

If you have an image of size `width` x `height` and a workgroup size of `16x16`, your dispatch would look like this:

[source,cpp]
----
uint32_t groupCountX = (width + 15) / 16;
uint32_t groupCountY = (height + 15) / 16;
commandBuffer.dispatch(groupCountX, groupCountY, 1);
----

Note the use of "rounding up" (`(width + 15) / 16`). This ensures that if your image size isn't a perfect multiple of 16, you don't miss the last few pixels. Inside the shader, you would then use a bounds check: `if (x < width && y < height)`.
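The same ceiling-division idiom works for any workgroup size, so it is worth wrapping in a helper rather than letting the `+ 15` magic number spread through the codebase (the helper name is our own):

[source,cpp]
----
#include <cstdint>

// Integer ceiling division: how many workgroups of `localSize` are
// needed to cover `elements`, including a possible partial group at
// the end. Generalizes the "(width + 15) / 16" idiom above.
uint32_t groupCount(uint32_t elements, uint32_t localSize) {
    return (elements + localSize - 1) / localSize;
}
----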

== What's Next?

Understanding how workgroups map to hardware is the foundation of GPU compute. But mapping work to hardware is only part of the story; we also need to keep that hardware busy. In the next section, we'll talk about **Occupancy** and how to hide the massive latency of VRAM.

xref:01_introduction.adoc[Previous: Introduction] | xref:03_occupancy_and_latency_hiding.adoc[Next: Occupancy and Latency Hiding]
@@ -0,0 +1,70 @@
:pp: {plus}{plus}

= Occupancy and Latency Hiding: Keeping the GPU Busy

== Introduction

In the previous section, we learned how workgroups are mapped to the GPU's factory floor (the Compute Units or SMs). But simply getting a workgroup onto a CU is only half the battle. If that workgroup is poorly designed, it might only use a fraction of the hardware's potential, leaving expensive silicon sitting idle.

To understand why this happens, we must talk about **Latency** and **Occupancy**.

== The Latency Gap

GPU compute workloads are frequently memory-bound. While a modern GPU can perform trillions of floating-point operations per second (**TFLOPS**), fetching a single piece of data from **VRAM** (Video Random Access Memory) can take hundreds or even thousands of clock cycles. This delay is **memory latency**.

If a bundle of invocations (a warp or wavefront) needs to read from memory, it has to wait. If that CU has only one bundle to run, the entire CU goes silent until the data arrives: a disaster for performance.

The GPU's solution is **Concurrency**. Instead of waiting for one bundle, the CU switches to another bundle that is ready to execute. The more bundles you have "in flight" on a single CU, the better you can hide the latency of memory fetches.

== Defining Occupancy

**Occupancy** is a measure of how many bundles are active on a CU compared to the theoretical maximum. It's often expressed as a percentage.

* **100% Occupancy**: The CU is completely packed with bundles. Whenever one waits for memory, there's almost certainly another one ready to go.
* **Low Occupancy**: Only a few bundles are active. If they all hit a memory fetch at the same time, the CU will stall.

=== The Resource Tug-of-War

You might wonder: "Why not just always dispatch thousands of threads?" The problem is that each Compute Unit has a fixed pool of physical resources. Every thread you add consumes a portion of that pool.

The three primary limiters of occupancy are:

1. **Registers**: Each thread needs a set of registers to store its variables. If your shader uses 128 registers, you can fit fewer threads than if it used 32.
2. **Shared Memory (LDS)**: This memory is shared by the whole workgroup. If your workgroup uses 32KB of LDS and the CU only has 64KB, you can only fit two workgroups on that CU, regardless of how many threads they have.
3. **Thread/Warp Slots**: There is a hard limit on how many threads the hardware scheduler can track at once (e.g., 2048 threads per CU).

|===
| Resource Usage | Impact on Occupancy | Result

| High Register Count
| **Negative**
| Fewer bundles per CU; harder to hide latency.

| High LDS Usage
| **Negative**
| Fewer workgroups per CU; limited concurrency.

| Small Workgroup Size
| **Neutral/Negative**
| May not fill all warp slots; scheduling overhead.
|===

== Calculating Theoretical Occupancy

Most GPU vendors provide tools (like NVIDIA's Nsight or AMD's RGP) that calculate occupancy for you. However, you can estimate it yourself by looking at your shader's resource usage.

If a CU has 64KB of shared memory and your workgroup uses 32KB, your CU can only ever host two workgroups at a time. If your workgroup size is small (say, 64 threads), you'll have 128 threads per CU. If that hardware is capable of tracking 2048 threads, your occupancy is only around 6%.

This is why "fat" shaders (those that use lots of registers or shared memory) often perform poorly unless they are carefully tuned.
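That back-of-the-envelope calculation can be captured in a few lines of C{pp}. The hardware limits below (64KB of LDS, 2048 resident threads per CU) are the illustrative numbers from the text, not the values of any specific GPU, and register pressure, the third limiter, is ignored for brevity:

[source,cpp]
----
#include <algorithm>
#include <cstdint>

// Rough theoretical occupancy from two of the limits discussed above:
// LDS per workgroup and the CU's thread-tracking capacity.
double theoreticalOccupancy(uint32_t ldsPerGroup, uint32_t threadsPerGroup,
                            uint32_t ldsPerCU = 64 * 1024,
                            uint32_t maxThreadsPerCU = 2048) {
    uint32_t groupsByThreads = maxThreadsPerCU / threadsPerGroup;
    // A workgroup using no LDS is limited only by thread slots.
    uint32_t groupsByLds = ldsPerGroup ? ldsPerCU / ldsPerGroup
                                       : groupsByThreads;
    // The tightest limit decides how many workgroups are resident.
    uint32_t residentGroups = std::min(groupsByLds, groupsByThreads);
    return double(residentGroups * threadsPerGroup) / maxThreadsPerCU;
}
----

Plugging in the example from the text (32KB of LDS, 64-thread workgroups) yields 2 resident workgroups, 128 threads, and 6.25% occupancy.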

== Monitoring Utilization

In a real engine, you don't just want to guess. Modern Vulkan engines use performance counters (via the `VK_KHR_performance_query` extension) to monitor hardware utilization in real-time.

By tracking metrics like **ValuUtilization** (AMD) or **SM Active** (NVIDIA), you can see if your kernels are actually keeping the hardware busy. If you see high memory latency but low occupancy, you know you need to optimize your register usage or shared memory footprint.

== What's Next?

Now that we know how to keep the GPU busy, we need to make sure that when it *is* busy, it's being efficient. In the final section of this chapter, we'll look at **Scalar Layouts**—a Vulkan 1.4 feature that allows us to pack our data tightly and maximize the bandwidth we've worked so hard to hide.

xref:02_workgroups_and_invocations.adoc[Previous: Workgroups and Invocations] | xref:04_vulkan_1_4_scalar_layouts.adoc[Next: Vulkan 1.4 Scalar Layouts]