diff --git a/antora/modules/ROOT/nav.adoc b/antora/modules/ROOT/nav.adoc index 89c2dc6f..4e28bb0f 100644 --- a/antora/modules/ROOT/nav.adoc +++ b/antora/modules/ROOT/nav.adoc @@ -149,3 +149,60 @@ *** xref:Building_a_Simple_Engine/Advanced_Topics/Robustness2.adoc[Robustness2] ** Appendix *** xref:Building_a_Simple_Engine/Appendix/appendix.adoc[Appendix] +* Advanced Vulkan Compute +** xref:Advanced_Vulkan_Compute/introduction.adoc[Introduction] +** The Compute Architecture and Execution Model +*** xref:Advanced_Vulkan_Compute/02_Compute_Architecture/01_introduction.adoc[Introduction] +*** xref:Advanced_Vulkan_Compute/02_Compute_Architecture/02_workgroups_and_invocations.adoc[Workgroups and Invocations] +*** xref:Advanced_Vulkan_Compute/02_Compute_Architecture/03_occupancy_and_latency_hiding.adoc[Occupancy and Latency Hiding] +*** xref:Advanced_Vulkan_Compute/02_Compute_Architecture/04_vulkan_1_4_scalar_layouts.adoc[Vulkan 1.4 Scalar Layouts] +** Memory Models and Consistency +*** xref:Advanced_Vulkan_Compute/03_Memory_Models/01_introduction.adoc[Introduction] +*** xref:Advanced_Vulkan_Compute/03_Memory_Models/02_vulkan_memory_model.adoc[The Vulkan Memory Model] +*** xref:Advanced_Vulkan_Compute/03_Memory_Models/03_shared_memory_lds.adoc[Shared Memory (LDS)] +*** xref:Advanced_Vulkan_Compute/03_Memory_Models/04_memory_consistency.adoc[Memory Consistency] +** Subgroup Operations: The Hidden Power +*** xref:Advanced_Vulkan_Compute/04_Subgroup_Operations/01_introduction.adoc[Introduction] +*** xref:Advanced_Vulkan_Compute/04_Subgroup_Operations/02_cross_invocation_communication.adoc[Cross-Invocation Communication] +*** xref:Advanced_Vulkan_Compute/04_Subgroup_Operations/03_subgroup_partitioning.adoc[Subgroup Partitioning] +*** xref:Advanced_Vulkan_Compute/04_Subgroup_Operations/04_non_uniform_indexing.adoc[Non-Uniform Indexing] +** Heterogeneous Ecosystem: OpenCL on Vulkan +*** xref:Advanced_Vulkan_Compute/05_OpenCL_on_Vulkan/01_introduction.adoc[Introduction] +*** xref:Advanced_Vulkan_Compute/05_OpenCL_on_Vulkan/02_setup_and_installation.adoc[Setup and Installation] +*** xref:Advanced_Vulkan_Compute/05_OpenCL_on_Vulkan/03_clspv_pipeline.adoc[The clspv Pipeline] +*** xref:Advanced_Vulkan_Compute/05_OpenCL_on_Vulkan/04_kernel_portability.adoc[Kernel Portability] +*** xref:Advanced_Vulkan_Compute/05_OpenCL_on_Vulkan/05_clvk_and_layering.adoc[clvk and Layering] +** High-Level Abstraction: SYCL and Single-Source C++ +*** xref:Advanced_Vulkan_Compute/06_SYCL_and_Single_Source_CPP/01_introduction.adoc[Introduction] +*** xref:Advanced_Vulkan_Compute/06_SYCL_and_Single_Source_CPP/02_setup_and_installation.adoc[Setup and Installation] +*** xref:Advanced_Vulkan_Compute/06_SYCL_and_Single_Source_CPP/03_single_source_gpgpu.adoc[Single-Source GPGPU] +*** xref:Advanced_Vulkan_Compute/06_SYCL_and_Single_Source_CPP/04_vulkan_interoperability.adoc[Vulkan Interoperability] +*** xref:Advanced_Vulkan_Compute/06_SYCL_and_Single_Source_CPP/05_unified_shared_memory_usm.adoc[Unified Shared Memory (USM)] +** Advanced Data Structures on the GPU +*** xref:Advanced_Vulkan_Compute/07_Advanced_Data_Structures/01_introduction.adoc[Introduction] +*** xref:Advanced_Vulkan_Compute/07_Advanced_Data_Structures/02_gpu_resident_trees.adoc[GPU-Resident Trees] +*** xref:Advanced_Vulkan_Compute/07_Advanced_Data_Structures/03_global_atomic_management.adoc[Global Atomic Management] +*** xref:Advanced_Vulkan_Compute/07_Advanced_Data_Structures/04_device_addressable_buffers.adoc[Device-Addressable Buffers] +** Indirect Dispatch and GPU-Driven Pipelines +*** xref:Advanced_Vulkan_Compute/08_GPU_Driven_Pipelines/01_introduction.adoc[Introduction] +*** xref:Advanced_Vulkan_Compute/08_GPU_Driven_Pipelines/02_indirect_dispatch.adoc[Indirect Dispatch] +*** xref:Advanced_Vulkan_Compute/08_GPU_Driven_Pipelines/03_gpu_side_command_generation.adoc[GPU-Side Command Generation] +*** xref:Advanced_Vulkan_Compute/08_GPU_Driven_Pipelines/04_multi_draw_indirect_mdi.adoc[Multi-Draw Indirect (MDI)] +** Asynchronous Compute Orchestration +*** xref:Advanced_Vulkan_Compute/09_Asynchronous_Compute/01_introduction.adoc[Introduction] +*** xref:Advanced_Vulkan_Compute/09_Asynchronous_Compute/02_concurrent_execution.adoc[Concurrent Execution] +*** xref:Advanced_Vulkan_Compute/09_Asynchronous_Compute/03_timeline_semaphores.adoc[Timeline Semaphores] +*** xref:Advanced_Vulkan_Compute/09_Asynchronous_Compute/04_queue_priority.adoc[Queue Priority] +** Cooperative Matrices and Specialized Math +*** xref:Advanced_Vulkan_Compute/10_Specialized_Math/01_introduction.adoc[Introduction] +*** xref:Advanced_Vulkan_Compute/10_Specialized_Math/02_cooperative_matrices.adoc[Cooperative Matrices] +*** xref:Advanced_Vulkan_Compute/10_Specialized_Math/03_mixed_precision.adoc[Mixed Precision] +** Performance Auditing and Optimization +*** xref:Advanced_Vulkan_Compute/11_Performance_Optimization/01_introduction.adoc[Introduction] +*** xref:Advanced_Vulkan_Compute/11_Performance_Optimization/02_instruction_throughput.adoc[Instruction Throughput Analysis] +*** xref:Advanced_Vulkan_Compute/11_Performance_Optimization/03_divergence_audit.adoc[The "Divergence" Audit] +** Diagnostics and AI-Assisted Compute Refinement +*** xref:Advanced_Vulkan_Compute/12_Diagnostics_and_Refinement/01_introduction.adoc[Introduction] +*** xref:Advanced_Vulkan_Compute/12_Diagnostics_and_Refinement/02_compute_validation.adoc[Compute Validation] +*** xref:Advanced_Vulkan_Compute/12_Diagnostics_and_Refinement/03_assistant_led_optimization.adoc[Assistant-Led Optimization] +** xref:Advanced_Vulkan_Compute/conclusion.adoc[Conclusion] diff --git a/en/Advanced_Vulkan_Compute/02_Compute_Architecture/01_introduction.adoc b/en/Advanced_Vulkan_Compute/02_Compute_Architecture/01_introduction.adoc new file mode 100644 index 00000000..81e70ddb --- /dev/null +++ b/en/Advanced_Vulkan_Compute/02_Compute_Architecture/01_introduction.adoc @@ -0,0 +1,45 @@ +:pp: {plus}{plus} + += The Compute Architecture and Execution Model: Introduction + +== Overview + +To write efficient compute kernels, you must look beyond the abstract execution model of "workgroups" and "invocations" and understand how these concepts map to the physical hardware. While Vulkan provides a cross-vendor API, the silicon beneath it from AMD, NVIDIA, and Intel has specific ways of handling your data. + +In this chapter, we will bridge the gap between your shader code and the silicon. We'll explore how the 3D grid system you define in `vkCmdDispatch` is sliced, diced, and distributed across the GPU's **Compute Units (CU)** or **Streaming Multiprocessors (SM)**. + +=== The Language of Silicon + +Before we dive in, let's align our vocabulary. Different vendors use different names for the same concepts: + +* **Workgroups** (Vulkan/OpenCL) are often mapped to **Thread Blocks** (CUDA). +* **Invocations** (Vulkan) are simply **Threads**. +* **Subgroups** (Vulkan) are called **Wavefronts** (AMD) or **Warps** (NVIDIA). +* **Compute Units** (AMD) are equivalent to **Streaming Multiprocessors** (NVIDIA). + +Understanding these mappings allows you to read hardware-specific documentation and performance guides regardless of which GPU you are targeting. + +== Hardware Mapping + +When you dispatch a workload, the GPU's hardware command processor breaks the global grid into individual workgroups. These workgroups are the fundamental unit of scheduling. + +A critical rule of the GPU execution model is **workgroup atomicity**: once a workgroup is assigned to a physical compute unit, all its invocations will stay on that unit until the workgroup completes. They cannot be split across multiple units. This locality is what enables **Shared Memory (LDS - Local Data Store)**—since all threads in a workgroup are physically on the same hardware block, they can share a dedicated, ultra-fast cache. + +=== Invocations and SIMD + +While workgroups are the scheduling unit, the **invocation** is the smallest unit of execution. However, GPUs are **SIMD (Single Instruction, Multiple Data)** machines. They don't execute invocations one by one; instead, they group them into small bundles (Subgroups). + +In these bundles, every invocation executes the exact same instruction at the same time, but on different data. This is incredibly efficient for math, but it introduces a major pitfall: **Branch Divergence**. If your code contains an `if` statement where some threads go left and others go right, the hardware must execute *both* paths, masking out the inactive threads for each. + +== Performance Metrics + +Throughout this section, we will focus on two key metrics that determine how well you're utilizing the hardware: + +1. **Occupancy**: This is the "concurrency" metric. It represents how many active workgroups are residing on a compute unit compared to its theoretical maximum. High occupancy helps **hide latency**—if one bundle is waiting for a memory fetch from slow VRAM, the scheduler can instantly switch to another bundle that's ready to do math. +2. **Bandwidth Efficiency**: This is the "throughput" metric. Modern GPUs have massive memory bandwidth, but it's easily wasted by poor data alignment. We'll see how Vulkan 1.4's **Scalar Layouts** allow us to pack data tightly, ensuring that the shader actually uses every byte fetched from VRAM. + +== What's Next? + +We'll start by diving into the 3D grid system and seeing exactly how it maps to physical hardware. From there, we'll learn how to calculate theoretical occupancy and use engine tools to monitor real-world utilization. Finally, we'll master the scalar block layouts to maximize your data throughput. + +xref:../introduction.adoc[Previous: Introduction] | xref:02_workgroups_and_invocations.adoc[Next: Workgroups and Invocations] diff --git a/en/Advanced_Vulkan_Compute/02_Compute_Architecture/02_workgroups_and_invocations.adoc b/en/Advanced_Vulkan_Compute/02_Compute_Architecture/02_workgroups_and_invocations.adoc new file mode 100644 index 00000000..269db9f5 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/02_Compute_Architecture/02_workgroups_and_invocations.adoc @@ -0,0 +1,83 @@ +:pp: {plus}{plus} + += Workgroups and Invocations: The 3D Lattice + +== Introduction + +In the basic compute tutorial, we used a simple one-dimensional dispatch. While that works for simple tasks, it doesn't represent how the GPU actually schedules work. To write high-performance kernels, you need to understand how Vulkan's 3D grid system maps to the physical silicon of the GPU. + +The grid system is more than just a convenient way to index into textures; it defines how your workload is subdivided and scheduled across the hardware. + +== The Three-Tier Hierarchy + +When you define a compute dispatch, you are working with a hierarchy of units. Getting these dimensions right is the first step toward high performance. + +1. **Global Dispatch Grid**: This is the entire workload, defined in `vkCmdDispatch(x, y, z)`. +2. **Workgroups**: The global grid is subdivided into workgroups. The GPU's hardware scheduler assigns these workgroups to physical compute units. +3. **Invocations**: Each workgroup contains multiple individual threads, defined by the `local_size` in your shader. + +=== Workgroup Locality + +In the previous section, we mentioned that a workgroup cannot be split across multiple physical **Compute Units** (CU, on AMD/Intel) or **Streaming Multiprocessors** (SM, on NVIDIA). This means that all invocations within a workgroup are physically executed on the same hardware block. + +This locality is a key design constraint. It allows invocations in the same workgroup to share a fast, local memory known as **LDS** (Local Data Store) or **groupshared** memory, but it also means that the size of your workgroup is limited by the physical resources of a single CU/SM. If your workgroup size is too large, the GPU simply won't be able to schedule it. + +== The Math of Indexing + +Vulkan provides several built-in variables to help you find your place in the grid. In Slang, these are typically passed as parameters to the entry point using semantics like `SV_DispatchThreadID`, `SV_GroupThreadID`, and `SV_GroupID`. + +Let's look at how these relate in a typical shader: + +[source,slang] +---- +[numthreads(16, 16, 1)] +void main( + uint3 groupID : SV_GroupID, // gl_WorkGroupID + uint3 localID : SV_GroupThreadID, // gl_LocalInvocationID + uint3 globalID : SV_DispatchThreadID // gl_GlobalInvocationID +) { + // globalID: The unique index for this thread in the entire grid + // Formula: globalID = groupID * numthreads + localID + uint x = globalID.x; + uint y = globalID.y; + + // Process pixel (x, y) +} +---- + +Using a 2D or 3D grid makes spatial tasks (like image processing or physics simulations) much cleaner. Instead of manually calculating a 1D index, you can use `.xy` or `.xyz` coordinates that match your data structure. + +== Choosing Optimal Sizes + +A common mistake is choosing workgroup sizes based solely on what "fits" your data. For example, if you're processing a 10x10 image, you might choose a workgroup size of (10, 10, 1). + +However, GPUs execute invocations in bundles of 32 or 64—known as **Subgroups**, **Warps** (NVIDIA), or **Wavefronts** (AMD). If your workgroup size is not a multiple of the hardware's native bundle size, you are leaving silicon idle. This is called **internal fragmentation**. + +=== The Rule of 32/64 + +* **NVIDIA** GPUs typically prefer multiples of **32** (Warps). +* **AMD** GPUs typically prefer multiples of **64** (Wavefronts), though modern RDNA architectures can also handle 32. +* **Intel** GPUs have variable sizes (8, 16, 32). + +A safe, portable choice for many workloads is a workgroup size of **64** or **256** (e.g., `16x16` or `8x8x4`). This ensures that most hardware can keep its **SIMD** (Single Instruction, Multiple Data) lanes full. + +== Dispatching the Work + +When you call `vkCmdDispatch(groupCountX, groupCountY, groupCountZ)`, you are defining how many times the `local_size` block is repeated. + +If you have an image of size `width` x `height` and a workgroup size of `16x16`, your dispatch would look like this: + +[source,cpp] +---- +uint32_t groupCountX = (width + 15) / 16; +uint32_t groupCountY = (height + 15) / 16; +commandBuffer.dispatch(groupCountX, groupCountY, 1); +---- + +Note the use of "rounding up" (`(width + 15) / 16`). This ensures that if your image size isn't a perfect multiple of 16, you don't miss the last few pixels. Inside the shader, you would then use a bounds check: `if (x < width && y < height)`. + +== What's Next? + +Understanding how workgroups map to hardware is the foundation of GPU compute. But mapping work to hardware is only part of the story; we also need to keep that hardware busy. In the next section, we'll talk about **Occupancy** and how to hide the massive latency of VRAM. + +xref:01_introduction.adoc[Previous: Introduction] | xref:03_occupancy_and_latency_hiding.adoc[Next: Occupancy and Latency Hiding] diff --git a/en/Advanced_Vulkan_Compute/02_Compute_Architecture/03_occupancy_and_latency_hiding.adoc b/en/Advanced_Vulkan_Compute/02_Compute_Architecture/03_occupancy_and_latency_hiding.adoc new file mode 100644 index 00000000..d3844d8b --- /dev/null +++ b/en/Advanced_Vulkan_Compute/02_Compute_Architecture/03_occupancy_and_latency_hiding.adoc @@ -0,0 +1,70 @@ +:pp: {plus}{plus} + += Occupancy and Latency Hiding: Keeping the GPU Busy + +== Introduction + +In the previous section, we learned how workgroups are mapped to the GPU's factory floor (the Compute Units or SMs). But simply getting a workgroup onto a CU is only half the battle. If that workgroup is poorly designed, it might only use a fraction of the hardware's potential, leaving expensive silicon sitting idle. + +To understand why this happens, we must talk about **Latency** and **Occupancy**. + +== The Latency Gap + +GPUs are memory-bound. While a modern GPU can perform trillions of floating-point operations per second (**TFLOPS**), fetching a single piece of data from **VRAM** (Video Random Access Memory) can take hundreds or even thousands of clock cycles. + +If a bundle of invocations (a warp or wavefront) needs to read from memory, it has to wait. If that CU only has one bundle to run, the entire CU goes silent until the data arrives. This is a disaster for performance, and is known as **memory latency**. + +The GPU's solution is **Concurrency**. Instead of waiting for one bundle, the CU switches to another bundle that is ready to execute. The more bundles you have "in flight" on a single CU, the better you can hide the latency of memory fetches. + +== Defining Occupancy + +**Occupancy** is a measure of how many bundles are active on a CU compared to the theoretical maximum. It's often expressed as a percentage. + +* **100% Occupancy**: The CU is completely packed with bundles. Whenever one waits for memory, there's almost certainly another one ready to go. +* **Low Occupancy**: Only a few bundles are active. If they all hit a memory fetch at the same time, the CU will stall. + +=== The Resource Tug-of-War + +You might wonder: "Why not just always dispatch thousands of threads?" The problem is that each Compute Unit has a fixed pool of physical resources. Every thread you add consumes a portion of that pool. + +The three primary limiters of occupancy are: + +1. **Registers**: Each thread needs a set of registers to store its variables. If your shader uses 128 registers, you can fit fewer threads than if it used 32. +2. **Shared Memory (LDS)**: This memory is shared by the whole workgroup. If your workgroup uses 32KB of LDS and the CU only has 64KB, you can only fit two workgroups on that CU, regardless of how many threads they have. +3. **Thread/Warp Slots**: There is a hard limit on how many threads the hardware scheduler can track at once (e.g., 2048 threads per CU). + +|=== +| Resource Usage | Impact on Occupancy | Result + +| High Register Count +| **Negative** +| Fewer bundles per CU; harder to hide latency. + +| High LDS Usage +| **Negative** +| Fewer workgroups per CU; limited concurrency. + +| Small Workgroup Size +| **Neutral/Negative** +| May not fill all warp slots; scheduling overhead. +|=== + +== Calculating Theoretical Occupancy + +Most GPU vendors provide tools (like NVIDIA's Nsight or AMD's RGP) that calculate occupancy for you. However, you can estimate it yourself by looking at your shader's resource usage. + +If a CU has 64KB of shared memory and your workgroup uses 32KB, your CU can only ever host two workgroups at a time. If your workgroup size is small (say, 64 threads), you'll have 128 threads per CU. If that hardware is capable of tracking 2048 threads, your occupancy is only around 6%. + +This is why "fat" shaders (those that use lots of registers or shared memory) often perform poorly unless they are carefully tuned. + +== Monitoring Utilization + +In a real engine, you don't just want to guess. Modern Vulkan engines use performance counters (via the `VK_KHR_performance_query` extension) to monitor hardware utilization in real-time. + +By tracking metrics like **ValuUtilization** (AMD) or **SM Active** (NVIDIA), you can see if your kernels are actually keeping the hardware busy. If you see high memory latency but low occupancy, you know you need to optimize your register usage or shared memory footprint. + +== What's Next? + +Now that we know how to keep the GPU busy, we need to make sure that when it *is* busy, it's being efficient. In the final section of this chapter, we'll look at **Scalar Layouts**—a Vulkan 1.4 feature that allows us to pack our data tightly and maximize the bandwidth we've worked so hard to hide. + +xref:02_workgroups_and_invocations.adoc[Previous: Workgroups and Invocations] | xref:04_vulkan_1_4_scalar_layouts.adoc[Next: Vulkan 1.4 Scalar Layouts] \ No newline at end of file diff --git a/en/Advanced_Vulkan_Compute/02_Compute_Architecture/04_vulkan_1_4_scalar_layouts.adoc b/en/Advanced_Vulkan_Compute/02_Compute_Architecture/04_vulkan_1_4_scalar_layouts.adoc new file mode 100644 index 00000000..d8cc88b2 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/02_Compute_Architecture/04_vulkan_1_4_scalar_layouts.adoc @@ -0,0 +1,120 @@ +:pp: {plus}{plus} + += Vulkan 1.4 Scalar Layouts: Tight Packing + +== Introduction + +In the previous sections, we've focused on keeping the GPU's Compute Units busy through high occupancy. But even if you have thousands of threads active, you can still be held back by **Bandwidth**. + +Every byte you fetch from VRAM is precious. If your data is poorly laid out, you might be fetching bytes you never use. In this final section of the Compute Architecture chapter, we'll look at how Vulkan 1.4's **Scalar Layouts** solve one of the oldest and most frustrating problems in GPU programming: the alignment tax. + +== The Alignment Tax: std140 and std430 + +Historically, OpenGL and Vulkan required you to lay out your buffers using strict alignment rules known as **std140** and **std430** (standard 140/430 layouts). These rules were designed for older hardware that had difficulty reading data that wasn't perfectly aligned to 4-byte or 16-byte boundaries. + +For example, under `std140`, a simple `vec3` (which is three floats) would often be padded to the size of a `vec4`. This means that if you had an array of `vec3`, 25% of your memory bandwidth was being wasted on empty padding! + +[source,slang] +---- +// Under std140: +struct MyData { + float3 position; // 12 bytes + 4 bytes padding + float radius; // 4 bytes +}; +// Total size: 20 bytes (but logically 16) +---- + +`std430` improved this by allowing tighter packing for arrays of scalars and vectors, but it still had strict rules about how nested structures were aligned. + +== Enter GL_EXT_scalar_block_layout + +To solve this, a new extension called **GL_EXT_scalar_block_layout** was introduced. This extension allows you to use a **scalar layout**, which essentially removes all padding between members of a structure or elements of an array. + +In Vulkan 1.4, this functionality is now a core requirement. By using the `scalar` layout, you can ensure that your data structures on the GPU match your C{pp} structures perfectly, byte-for-byte. + +=== Why does this matter? + +It's not just about saving a few bytes of VRAM. It's about **Cache Efficiency**. + +When the GPU fetches data from VRAM, it fetches it in large "cache lines" (often 64 or 128 bytes). If your data is full of padding, each cache line will contain less "real" data. This means you have to perform more memory fetches to get the same amount of information, which directly leads to lower performance. + +=== Slang: Automatic Packing + +If you are using Slang, you don't even need to worry about manual layout qualifiers for most cases. Slang's layout engine handles the `scalar` rules for you when targeting Vulkan 1.4: + +[source,slang] +---- +struct MyData { + float3 position; + float radius; +}; + +[[vk::binding(0, 0)]] +RWStructuredBuffer MyBuffer; +---- + +The `RWStructuredBuffer` in Slang maps to a `Storage Buffer` in Vulkan, and because Slang defaults to natural alignment, it produces the same result as the `scalar` layout in GLSL without the boilerplate. + +=== GLSL: The Manual Struggle + +To truly appreciate the "win" in Vulkan 1.4, let's look at how this same structure would be handled in GLSL under the older `std430` rules vs. the modern `scalar` layout. + +[source,glsl] +---- +// The "Old" Way (std430) +struct MyData { + vec3 position; // 12 bytes + 4 bytes padding (arrays of vec3 are even worse!) + float radius; // 4 bytes +}; + +layout(std430, binding = 0) buffer MyBuffer { + MyData data[]; +}; +---- + +Under `std430`, if you had an array of `MyData`, each `vec3` would be padded to 16 bytes. If you tried to match this with a simple `struct { glm::vec3 p; float r; }` on the CPU, you would likely experience memory corruption because the GPU expects that 4-byte gap between `position` and `radius`. + +Now, look at the Vulkan 1.4 way using the **scalar** layout: + +[source,glsl] +---- +// The "Modern" Way (Vulkan 1.4 / GL_EXT_scalar_block_layout) +#extension GL_EXT_scalar_block_layout : enable + +struct MyData { + vec3 position; // 12 bytes + float radius; // 4 bytes +}; + +layout(scalar, binding = 0) buffer MyBuffer { + MyData data[]; +}; +// Total size of MyData: 16 bytes. No padding! +---- + +By explicitly using `layout(scalar)`, you tell the driver that you want the tighter packing rules. This allows your GLSL code to perfectly match a standard C{pp} struct without any manual `float padding` members. + +=== C{pp} Side Comparison + +To match this on the CPU, you no longer need to manually add `float padding[1]` or use `alignas(16)`. You can simply define your structure naturally: + +[source,cpp] +---- +struct MyData { + glm::vec3 position; + float radius; +}; +// Total size: 16 bytes. No padding! +---- + +If you are using modern languages like Slang, this becomes even easier. Slang defaults to a more natural, C{pp}-like layout, and its Vulkan backend handles the scalar layout details for you automatically when targeting Vulkan 1.4. + +== Conclusion + +We've covered a lot of ground in this chapter. We've seen how workgroups map to silicon, how occupancy helps us hide the massive latency of memory fetches, and how scalar layouts ensure we aren't wasting the bandwidth we've worked so hard to use. + +By understanding these low-level architectural details, you've moved beyond "writing shaders" and started "programming the hardware." + +In the next chapter, we'll take these concepts even further by looking at the **Vulkan Memory Model** and how to safely synchronize data between thousands of threads. + +xref:03_occupancy_and_latency_hiding.adoc[Previous: Occupancy and Latency Hiding] | xref:../03_Memory_Models/01_introduction.adoc[Next: Memory Models and Consistency] \ No newline at end of file diff --git a/en/Advanced_Vulkan_Compute/03_Memory_Models/01_introduction.adoc b/en/Advanced_Vulkan_Compute/03_Memory_Models/01_introduction.adoc new file mode 100644 index 00000000..dbab97ea --- /dev/null +++ b/en/Advanced_Vulkan_Compute/03_Memory_Models/01_introduction.adoc @@ -0,0 +1,34 @@ +:pp: {plus}{plus} + += Memory Models and Consistency: Introduction + +== Overview + +In the previous chapter, we looked at how to keep the GPU busy through high occupancy and tight data layouts. But as you scale your compute dispatches from simple, independent tasks to complex, cooperative algorithms, you'll quickly encounter a much more challenging problem: **Memory Consistency** (ensuring all parts of the GPU see the same data at the same time). + +How do you know that a value written by one invocation is visible to another? What happens when two invocations try to write to the same location at once? How do you share data efficiently between thousands of threads without crashing into a race condition? + +=== The Explicit Nature of Vulkan + +On the CPU, you're used to a world where memory is generally **coherent**. If Thread A writes a value to a variable, Thread B can usually read it shortly after without any special ceremony because the hardware keeps the caches in sync automatically. + +On the GPU, this is **not** the case. With thousands of threads executing concurrently across multiple compute units, each with its own hierarchical caches (**L1** - Level 1 and **L2** - Level 2, etc.), keeping everyone's view of memory in sync is an incredibly expensive task. Vulkan's philosophy is simple: **synchronization is never automatic**. If you want a write to be visible to a read, you must explicitly say so. + +== The Three Pillars of Memory Management + +In this chapter, we'll dive deep into the mechanisms Vulkan provides to manage this complexity: + +1. **The Vulkan Memory Model**: Mastering Availability, Visibility, and Domain operations to create a formal **"Happens-Before"** relationship (a strict ordering of operations) between threads. +2. **Shared Memory (LDS)**: Utilizing a small, ultra-fast, workgroup-local memory for high-speed data exchange and manual caching. +3. **Memory Consistency**: Using Slang's `GroupMemoryBarrierWithGroupSync` vs. fine-grained Vulkan 1.4 barriers to minimize pipeline stalls and maximize throughput. + +== Why This Matters + +Efficient memory synchronization is the difference between a high-performance simulation and a broken, non-deterministic mess. + +* **Over-synchronization**: Your kernels will be slow because every thread is constantly waiting for every other thread. +* **Under-synchronization**: You'll get flickering results, "ghost" data, and hard-to-debug crashes that only appear on certain hardware. + +We'll start by looking at the theoretical foundation: the **Vulkan Memory Model**. While it might seem abstract at first, it is the key to writing portable, robust compute code that works on every GPU from a smartphone to a high-end workstation. + +xref:../02_Compute_Architecture/04_vulkan_1_4_scalar_layouts.adoc[Previous: Scalar Layouts] | xref:02_vulkan_memory_model.adoc[Next: The Vulkan Memory Model] \ No newline at end of file diff --git a/en/Advanced_Vulkan_Compute/03_Memory_Models/02_vulkan_memory_model.adoc b/en/Advanced_Vulkan_Compute/03_Memory_Models/02_vulkan_memory_model.adoc new file mode 100644 index 00000000..c41d85a8 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/03_Memory_Models/02_vulkan_memory_model.adoc @@ -0,0 +1,68 @@ +:pp: {plus}{plus} + += The Vulkan Memory Model: Availability, Visibility, and Domain Operations + +In a GPU, you have thousands of threads running concurrently across multiple compute units, each with its own hierarchical caches (L1 - Level 1, L2 - Level 2, etc.) Because of this, it's not enough to simply write a value to a buffer and expect another thread to see it immediately. + +Instead, you need to follow a formal protocol to ensure that data written by one part of the device is available and visible to another. This protocol is defined by the **Vulkan Memory Model**. + +== The Three Pillars of Synchronization + +When you want to share data between two different operations, (e.g., a write in one thread and a read in another), you need to establish a **Happens-Before** relationship. This is done through three distinct operations: + +1. **Availability Operation**: This ensures that data written to a local cache is "pushed" out to a domain that is accessible to other operations. Think of this as flushing a cache. +2. **Visibility Operation**: This ensures that data that is available in a shared domain is "pulled" into the local cache of the thread that needs to read it. Think of this as invalidating a local cache so it's forced to read fresh data. +3. **Memory Domain**: This is the common "meeting ground" where availability and visibility operations meet (usually the L2 cache or VRAM). + +=== The Flow of Data + +[source,text] +---- +Thread A (Write) -> Availability Operation (Make Available) + | + [Memory Domain] + | +Thread B (Read) <- Visibility Operation (Make Visible) +---- + +== Happens-Before and Execution Barriers + +A "Happens-Before" relationship is established when an execution barrier (like `vkCmdPipelineBarrier2`) is used to synchronize two operations. This barrier specifies: + +* **srcStageMask/dstStageMask**: Which stages of the pipeline must complete before the next stages can begin. +* **srcAccessMask/dstAccessMask**: Which memory operations are being performed (writes vs. reads). + +In Vulkan 1.4, these masks have been simplified and unified with **Synchronization 2**, which we'll use throughout this tutorial. For example, a barrier between two compute dispatches might look like this: + +[source,cpp] +---- +vk::BufferMemoryBarrier2 bufferBarrier { + .srcStageMask = vk::PipelineStageFlagBits2::eComputeShader, + .srcAccessMask = vk::AccessFlagBits2::eShaderWrite, // Availability + .dstStageMask = vk::PipelineStageFlagBits2::eComputeShader, + .dstAccessMask = vk::AccessFlagBits2::eShaderRead // Visibility +}; + +vk::DependencyInfo dependencyInfo { + .bufferMemoryBarrierCount = 1, + .pBufferMemoryBarriers = &bufferBarrier +}; +// ... +commandBuffer.pipelineBarrier2(dependencyInfo); +---- + +== Why You Need to Care + +If you skip these steps, your kernel might appear to work on one GPU but fail on another. This is because different architectures have different cache coherency strategies. NVIDIA's caches might behave differently than AMD's or Intel's. + +The Vulkan Memory Model is your way of telling the driver exactly what you need, so it can emit the minimal set of hardware instructions to keep your data safe without sacrificing performance. + +== Data Races and Undefined Behavior + +When two threads access the same memory location and at least one of them is a write, and they aren't synchronized by a "Happens-Before" relationship, you have a **Data Race**. + +In Vulkan, data races result in **Undefined Behavior**. This doesn't just mean you get the wrong value; it could mean you read old data, partially updated data, or even crash the GPU if the race condition leads to an out-of-bounds access or a malformed pointer. + +Next, we'll see how to apply these concepts to **Shared Memory (LDS)**, which is much faster than global VRAM. + +xref:01_introduction.adoc[Previous: Introduction] | xref:03_shared_memory_lds.adoc[Next: Shared Memory (LDS)] diff --git a/en/Advanced_Vulkan_Compute/03_Memory_Models/03_shared_memory_lds.adoc b/en/Advanced_Vulkan_Compute/03_Memory_Models/03_shared_memory_lds.adoc new file mode 100644 index 00000000..4c8ad667 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/03_Memory_Models/03_shared_memory_lds.adoc @@ -0,0 +1,102 @@ +:pp: {plus}{plus} + += Shared Memory (LDS): High-Speed Data Exchange + +The GPU's main memory (VRAM) is large but relatively slow. For many compute tasks, fetching data from VRAM is the primary bottleneck. To solve this, GPUs provide a small, ultra-fast memory that is local to each workgroup. + +In Vulkan, this is called **Shared Memory**. On physical hardware, it is often referred to as **LDS (Local Data Store)** or **Scratchpad Memory** (a fast, temporary memory for local data). + +== Why Shared Memory? + +Shared memory is your most powerful tool for optimizing memory-bound kernels. It is typically used for: + +* **Manual Caching**: Reading a block of data from VRAM once, storing it in shared memory, and then having all threads in the workgroup read from that fast local copy multiple times. +* **Data Exchange**: Passing data between threads in the same workgroup (e.g., for calculating a **prefix sum**—where each element is the sum of all previous elements—or a **reduction**). +* **Workgroup-Level Reductions**: Finding the maximum or minimum value in a large dataset by first reducing it (combining multiple values into one) within each workgroup. + +== Using Shared Memory in Slang + +In Slang (and HLSL), you declare shared memory using the `groupshared` keyword. Because it is physically local to a Compute Unit, it is shared by all threads in a workgroup but is invisible to other workgroups. + +[source,slang] +---- +groupshared float sharedData[256]; + +[numthreads(256, 1, 1)] +void main(uint3 tid : SV_GroupThreadID) { + // Each thread initializes its own slot in shared memory + sharedData[tid.x] = someBuffer[tid.x]; + + // CRITICAL: We must wait for all threads to finish writing AND make those writes visible! + // GroupMemoryBarrier: Ensures all previous memory writes are complete and visible. + // WithGroupSync: Acts as an execution barrier, waiting for all threads in the group to arrive. + GroupMemoryBarrierWithGroupSync(); + + // Now it is safe to read data written by our neighbors + float neighborValue = sharedData[(tid.x + 1) % 256]; +} +---- + +=== Breaking Down the Cryptic Name + +The function `GroupMemoryBarrierWithGroupSync()` might seem like a mouthful, but its name tells you exactly what it's doing across two different types of synchronization: + +1. **GroupMemoryBarrier**: This is a **Memory Barrier**. It ensures that any writes a thread has made to `groupshared` memory are "pushed" out and made visible to all other threads in the workgroup. Without this, a neighbor might read an old or uninitialized value from your slot in shared memory. +2. **WithGroupSync**: This is an **Execution Barrier**. It forces every thread in the workgroup to stop and wait at this exact line. No thread can proceed to the next instruction until *every* thread in the group has reached this point. + +By combining them, you guarantee that when a thread moves past this line, all its neighbors have finished their work and their data is ready to be read. + +=== GLSL: shared and barrier() + +In GLSL, you use the `shared` keyword to declare your workgroup-local memory. The synchronization is handled by the `barrier()` function, which acts as both an execution barrier and a memory barrier for `shared` memory. + +[source,glsl] +---- +shared float sharedData[256]; + +layout(local_size_x = 256) in; +void main() { + uint tid = gl_LocalInvocationID.x; + + // Each thread initializes its own slot in shared memory + sharedData[tid] = someBuffer[tid]; + + // Wait for all threads to reach this point and make memory visible + barrier(); + + // Now it is safe to read + float neighborValue = sharedData[(tid + 1) % 256]; +} +---- + +The main difference here is Slang's `GroupMemoryBarrierWithGroupSync()`, which is a more descriptive name for the common pattern of combining a memory barrier with an execution sync. + +== Bank Conflicts: The Speed Trap + +Shared memory is organized into **Banks** (parallel memory modules, typically 32 banks). Each bank can handle one request per clock cycle. If your threads access memory in a way that maps to different banks, the operation is performed in parallel at full speed. + +However, if two or more threads in a bundle (subgroup) try to access different addresses that fall within the **same bank**, you get a **Bank Conflict**. The hardware must then serialize these requests, which can double or triple the execution time of that instruction. + +[source,text] +---- +// NO CONFLICT (Fast) +Thread 0 -> Bank 0 +Thread 1 -> Bank 1 +Thread 2 -> Bank 2 + +// BANK CONFLICT (Slow) +Thread 0 -> Bank 0 (Address 0) +Thread 1 -> Bank 0 (Address 32) +---- + +To avoid bank conflicts, aim for linear access patterns where `thread_id` matches `index`. Using a **stride** of 1 (accessing elements one after another) is usually the safest way to ensure full speed. + +== Lifecycle and Scope + +Shared memory is only valid for the lifetime of a single workgroup. When the workgroup completes, its shared memory is discarded. + +Crucially, **shared memory is not coherent between workgroups**. If you need to send data from Workgroup A to Workgroup B, you must write it back to global VRAM and use a proper Vulkan memory barrier as described in the previous section. + +In the next section, we'll see how to balance these barriers to keep your pipeline as full as possible. + +xref:02_vulkan_memory_model.adoc[Previous: The Vulkan Memory Model] | xref:04_memory_consistency.adoc[Next: Memory Consistency] \ No newline at end of file diff --git a/en/Advanced_Vulkan_Compute/03_Memory_Models/04_memory_consistency.adoc b/en/Advanced_Vulkan_Compute/03_Memory_Models/04_memory_consistency.adoc new file mode 100644 index 00000000..97054fa1 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/03_Memory_Models/04_memory_consistency.adoc @@ -0,0 +1,66 @@ +:pp: {plus}{plus} + += Memory Consistency: Slang Barriers and Pipeline Stalls + +In the previous sections, we've explored the "what" and "where" of synchronization. Now, we'll focus on the "how"—specifically, how to balance safety with performance to keep your GPU's pipeline full. + +== The All-In-One Barrier: GroupMemoryBarrierWithGroupSync + +Most developers start with Slang's `GroupMemoryBarrierWithGroupSync()`. This is a high-level function that combines two critical operations: + +1. **Execution Sync**: It forces every thread in the current workgroup to wait at this line. No thread can proceed until its neighbors have also reached the barrier. +2. **Memory Barrier**: It ensures that all memory writes performed by the workgroup (to both shared and global memory) are made available and visible. + +This function is essentially the "Safe Mode" of synchronization. Use it when you need to be 100% sure that all data is ready for the next step of an algorithm. + +=== GLSL: The Explicit Barriers + +In GLSL, you don't have a single "magic" function that does everything. Instead, you have to be explicit about what you are synchronizing. This is where many bugs creep in, but it's also where you can find performance wins. + +[source,glsl] +---- +// The GLSL equivalent of Slang's GroupMemoryBarrierWithGroupSync() +memoryBarrierShared(); // Make shared memory writes available/visible +barrier(); // Wait for all threads to reach this point +---- + +If you are working with **Global Memory** (SSBOs), `barrier()` alone is not enough! You must also call `memoryBarrierBuffer()` to ensure that your writes to the buffer are actually visible to other threads before they proceed past the barrier. + +[source,glsl] +---- +// Ensuring global memory is ready for other threads in the workgroup +memoryBarrierBuffer(); +barrier(); +---- + +Vulkan 1.4 further refines this with **Memory Semantics**, allowing you to specify exactly which "domain" (Uniform, Buffer, Image, or Shared) you are synchronizing, avoiding the "sync everything" penalty of a general barrier. + +== The Cost of Syncing + +Synchronization is not free. Every time you call a barrier, you are essentially telling the GPU: "Stop what you are doing and wait." + +* **Workgroup Barriers** are expensive because they involve many threads (e.g., 256 or 1024). The hardware must track all these threads and ensure they have all reached the same point. +* **Pipeline Stalls**: If some threads finish their work quickly but others are delayed by slow memory fetches, the fast threads sit idle, wasting potential **TFLOPS** (trillions of floating-point operations per second). + +=== Reducing the Impact + +To minimize the performance penalty of synchronization, consider these strategies: + +1. **Batch Your Work**: Try to do as much work as possible between barriers. One large kernel with two barriers is often faster than two small kernels with one barrier each. +2. **Double-Buffering Shared Memory**: Instead of reading and writing to the same shared memory array (which requires a barrier), use two arrays. Write to `A` while reading from `B`, then swap. +3. **Atomic Operations**: For simple tasks like incrementing a global counter, use `InterlockedAdd` (which Slang inherits from HLSL). **Atomic operations** handle synchronization at the hardware level, which is often much faster than a manual barrier because they are "uninterruptible" by other threads. + +== Fine-Grained Control in Vulkan 1.4 + +Modern Vulkan (1.3+) and Synchronization 2 allow for even more granular control. In your shader, you can use more specific barrier types if your language supports them: + +* **`GroupMemoryBarrier()`**: Only synchronizes memory, without forcing an execution sync. +* **Subgroup Barriers**: Synchronizing within a bundle of 32/64 threads (a **subgroup**) is significantly faster than synchronizing an entire workgroup because it doesn't need to involve the GPU's global scheduler. + +== What's Next? + +We've covered the fundamentals of how GPUs execute code and how they manage memory. But there is a hidden layer of performance that many developers miss. + +In the next chapter, we'll dive into **Subgroup Operations**. By learning how to communicate between threads *within* a bundle, we can bypass shared memory altogether and perform high-speed data exchange directly through registers. + +xref:03_shared_memory_lds.adoc[Previous: Shared Memory (LDS)] | xref:../04_Subgroup_Operations/01_introduction.adoc[Next: Why Subgroups Matter] \ No newline at end of file diff --git a/en/Advanced_Vulkan_Compute/04_Subgroup_Operations/01_introduction.adoc b/en/Advanced_Vulkan_Compute/04_Subgroup_Operations/01_introduction.adoc new file mode 100644 index 00000000..3df76ee6 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/04_Subgroup_Operations/01_introduction.adoc @@ -0,0 +1,33 @@ +:pp: {plus}{plus} + += Subgroup Operations: The Hidden Power + +== Introduction + +In the previous chapters, we looked at how to share data between hundreds or even thousands of threads in a workgroup using **Shared Memory (LDS)** and explicit barriers. While powerful, this approach has a significant cost: every barrier forces the GPU to pause and wait, and every access to shared memory consumes precious bandwidth. + +What if you could share data even faster? What if you could exchange values without ever touching VRAM or even the LDS? This is where **Subgroup Operations** come in. They are the "secret sauce" behind many of the most highly optimized GPU algorithms in existence today. + +== Why Subgroups Matter + +A **Subgroup** is a hardware-level bundle of threads (typically 32 on NVIDIA/Intel or 32/64 on AMD) that execute in perfect lockstep on the same SIMD unit. Because the hardware already physically synchronizes these threads, they can communicate with each other using specialized instructions that are often as fast as a single clock cycle. + +In this chapter, we'll explore the hidden power of subgroups: + +1. **Cross-Invocation Communication**: Utilizing Subgroup Shuffles, Broadcasts, and Arithmetic to exchange data directly through registers, bypassing memory entirely. +2. **Subgroup Partitioning**: Implementing "Ballot" and "Match" operations to perform complex branching and data filtering across the entire bundle. +3. **Non-Uniform Indexing**: Leveraging modern Vulkan features to safely access arrays of resources that might be different for every thread in the subgroup. + +== Moving Beyond Barriers + +Subgroup operations allow you to write "barrier-free" kernels for small-scale data exchange. Instead of having every thread in a workgroup wait at a barrier just to share a single float, you can use a subgroup shuffle to pass that value instantly. + +This leads to: + +* **Higher Performance**: No pipeline stalls from waiting threads. +* **Lower Latency**: Data exchange happens at register speeds. +* **Greater Flexibility**: Algorithms can be more "wave-aware," adapting to the hardware's native execution width. + +We'll start by looking at the fundamental building blocks of subgroup communication: **Shuffles** and **Broadcasts**. + +xref:../03_Memory_Models/04_memory_consistency.adoc[Previous: Memory Consistency] | xref:02_cross_invocation_communication.adoc[Next: Cross-Invocation Communication] \ No newline at end of file diff --git a/en/Advanced_Vulkan_Compute/04_Subgroup_Operations/02_cross_invocation_communication.adoc b/en/Advanced_Vulkan_Compute/04_Subgroup_Operations/02_cross_invocation_communication.adoc new file mode 100644 index 00000000..05afea3c --- /dev/null +++ b/en/Advanced_Vulkan_Compute/04_Subgroup_Operations/02_cross_invocation_communication.adoc @@ -0,0 +1,76 @@ +:pp: {plus}{plus} + += Subgroup Shuffles, Broadcasts, and Arithmetic + +== Exchanging Data Without Memory + +In the previous section, we introduced the concept of a **Subgroup** as the hardware's native execution width (e.g., 32 or 64 threads). What makes subgroups truly powerful is the ability to share data between invocations without ever writing to memory. No VRAM, no LDS—just register-to-register communication. + +This is done through three primary categories of operations: **Broadcasts** (sending one value to all), **Shuffles** (swapping values between specific threads), and **Arithmetic (Reductions/Scans)** (performing math across the whole subgroup). + +== Subgroup Broadcasts + +The simplest form of subgroup communication is the **Broadcast**. This allows one thread in the subgroup to share its local value with all other threads in the same subgroup. + +[source,slang] +---- +// Slang example of a subgroup broadcast +float localValue = computeSomeData(); +float sharedValue = WaveReadLaneAt(localValue, 0); // Everyone gets thread 0's value +---- + +In the example above, every thread in the subgroup will now have the same `sharedValue`, which was originally unique to thread 0. This is incredibly useful for sharing "anchor" values or configuration data that only one thread needs to calculate or load. + +=== GLSL: The Subgroup Way + +In GLSL, you use the `subgroup` intrinsics. This requires enabling the proper extension (usually `GL_KHR_shader_subgroup_basic` or `GL_KHR_shader_subgroup_ballot` depending on the operation). + +[source,glsl] +---- +#extension GL_KHR_shader_subgroup_basic : enable + +// The GLSL equivalent of a subgroup broadcast +float localValue = computeSomeData(); +float sharedValue = subgroupBroadcast(localValue, 0); +---- + +== Subgroup Shuffles + +While a broadcast sends one value to everyone, a **Shuffle** allows for more complex patterns. You can think of it as a **permutation,** (a rearrangement) of the registers across the subgroup. + +In Slang, we can use `WaveReadLaneAt` for general indexing, or more specific functions for relative movements. + +[source,slang] +---- +// Every thread "swaps" its value with its neighbor (assuming 32 threads) +uint neighborIdx = (WaveGetLaneIndex() + 1) % 32; +float neighborValue = WaveReadLaneAt(localValue, neighborIdx); +---- + +Modern GPUs also support more specialized shuffles like `WaveReadLaneFirst` and bitwise shuffles. These are often more efficient than a general shuffle because they map directly to hardware data-paths. + +== Subgroup Arithmetic (Reductions and Scans) + +Beyond just moving data, subgroups can perform math across all threads in a single instruction. These are called **Reductions** and **Scans**. + +* **Subgroup Reduction**: Combines values from all threads into a single result (e.g., `WaveActiveSum`, `WaveActiveMin`, `WaveActiveMax`). +* **Subgroup Scan (Inclusive/Exclusive)**: Each thread receives the partial sum (or min/max) of all threads up to its own index. In an **inclusive** scan, the current thread's value is included; in an **exclusive** scan, it is not. + +[source,slang] +---- +// Calculate the sum of all local values in the subgroup +float subgroupTotal = WaveActiveSum(localValue); + +// Each thread gets the sum of all values from threads with a lower ID +float runningSum = WavePrefixSum(localValue); +---- + +These operations are the building blocks of high-performance prefix sums, **stream compaction** (filtering an array to only active elements), and parallel reductions. Instead of writing a complex multi-pass kernel that uses shared memory and barriers, you can often do the same work within a single subgroup in just a few cycles. + +== Choosing the Right Operation + +While it's tempting to use subgroup operations everywhere, remember that they only work within a single subgroup. If you need to share data across an entire 1024-thread workgroup, you will still need to use **Shared Memory (LDS)** to bridge the gap between subgroups. + +However, a "subgroup-first" approach is often the fastest. Perform as much work as possible within the subgroup, and only use LDS when you absolutely must communicate with another subgroup. + +xref:01_introduction.adoc[Previous: Introduction to Subgroups] | xref:03_subgroup_partitioning.adoc[Next: Subgroup Partitioning] \ No newline at end of file diff --git a/en/Advanced_Vulkan_Compute/04_Subgroup_Operations/03_subgroup_partitioning.adoc b/en/Advanced_Vulkan_Compute/04_Subgroup_Operations/03_subgroup_partitioning.adoc new file mode 100644 index 00000000..4a978381 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/04_Subgroup_Operations/03_subgroup_partitioning.adoc @@ -0,0 +1,73 @@ +:pp: {plus}{plus} + += Subgroup Partitioning: Ballot and Match + +== Beyond Lockstep Execution + +In the previous section, we saw how threads in a subgroup can share data. But what happens when threads in the same subgroup want to do different things? This is where **Subgroup Partitioning** comes in. + +On a GPU, all threads in a subgroup (the SIMD bundle) execute the same instruction at the same time. When you have an `if` statement, some threads might take the "true" branch while others take the "false" branch. The hardware handles this by "masking out" the threads that shouldn't execute the current instruction. This is called **Branch Divergence**, and as we discussed in Chapter 2, it can be a major performance killer. + +Subgroup partitioning tools like **Ballot** and **Match** allow you to "see" these masks and use them to optimize your code. + +== Subgroup Ballot + +A **Ballot** operation asks a boolean question to every thread in the subgroup and returns a **bitmask** (a sequence of bits where each bit represents a thread) where each bit represents the answer from one thread. + +[source,slang] +---- +// Does this thread have a valid result? +bool hasResult = computeIsSuccessful(); + +// Get a bitmask of all threads in the subgroup that have a valid result +uint4 activeMask = WaveActiveBallot(hasResult); +---- + +In Slang (and Vulkan SPIR-V), a ballot returns a `uint4` (128 bits) to support subgroups up to 128 threads wide, though 32 or 64 is more common. + +Once you have this mask, you can use bitwise operations to make decisions: + +* `WaveActiveCountBits(hasResult)`: How many threads are active? (Slang provides a convenient shorthand for this) +* `countbits(activeMask)`: Low-level bit count on the mask. +* `WavePrefixCountBits(hasResult)`: What is my **rank** (the number of active threads with a lower index) among active threads? + +This is incredibly useful for **Stream Compaction**. If only 5 threads out of 32 have data to write to a buffer, they can use these operations to calculate exactly which index in the output buffer they should write to, without any atomic operations! + +== Subgroup Match + +While `Ballot` works on booleans, **Match** works on values. It finds all threads in the subgroup that have the *same value* for a given variable. + +[source,slang] +---- +// Every thread has a 'key' (e.g., a material ID or a hash) +uint myKey = ...; + +// Get a mask of all threads that have the same key as me +uint4 sameKeyMask = WaveMatch(myKey); +---- + +This is a specialized operation (often requiring Vulkan 1.1 or specific extensions) that is a game-changer for **Global Atomic Reduction** (combining atomic operations from multiple threads into one). + +Imagine 32 threads all trying to add to the same global counter. Normally, this would result in 32 serialized atomic operations. With `WaveMatch`, the threads can identify which of them are hitting the same address, pick one **"leader"** thread (one thread that acts on behalf of the group) to perform a single atomic add for the whole group, and then distribute the result back. + +== Subgroup Elect + +The simplest form of partitioning is `WaveIsFirstLane()`. it returns `true` for exactly one thread (or **lane**) in the subgroup (usually the one with the lowest active ID) and `false` for all others. + +[source,slang] +---- +if (WaveIsFirstLane()) { + // Only one thread in the subgroup performs this expensive task + performGlobalLogging(); +} +---- + +This is perfect for tasks that only need to happen once per wave, such as writing a debug message or updating a global timestamp. + +== Using Masks for Flow Control + +By combining these operations, you can write "wave-aware" code that adapts to how the threads are branching. Instead of just letting the hardware mask out threads, you can explicitly check the `activeMask` and skip entire blocks of code if no threads are interested, or use the mask to re-order work to minimize divergence. + +In the next section, we'll look at how these same subgroup concepts apply to accessing memory and resources through **Non-Uniform Indexing**. + +xref:02_cross_invocation_communication.adoc[Previous: Shuffles and Broadcasts] | xref:04_non_uniform_indexing.adoc[Next: Non-Uniform Indexing] \ No newline at end of file diff --git a/en/Advanced_Vulkan_Compute/04_Subgroup_Operations/04_non_uniform_indexing.adoc b/en/Advanced_Vulkan_Compute/04_Subgroup_Operations/04_non_uniform_indexing.adoc new file mode 100644 index 00000000..cd39de8f --- /dev/null +++ b/en/Advanced_Vulkan_Compute/04_Subgroup_Operations/04_non_uniform_indexing.adoc @@ -0,0 +1,73 @@ +:pp: {plus}{plus} + += Non-Uniform Indexing: Resource Arrays in Subgroups + +== The Descriptor Limit + +In traditional Vulkan, **descriptor sets** (collections of resources like textures or buffers) are "uniform" across a draw call or dispatch. This means that every thread in a workgroup must access the same resource from a given descriptor set index. + +But what happens if you have an array of textures, and you want thread A to access texture index 5 while thread B in the same subgroup wants texture index 12? In early Vulkan, this would result in undefined behavior or a device crash. + +This is where **Non-Uniform Indexing** comes in. + +== Non-Uniform Indexing (Descriptor Indexing) + +Vulkan's **Descriptor Indexing** (standard since 1.2 and refined in 1.4) allows you to use a variable as an index into a descriptor array. However, because threads in a subgroup execute in lockstep, the hardware needs to know when an index might be different ("non-uniform") across the subgroup. + +In Slang (which inherits from HLSL), we use the `NonUniformResourceIndex` function to tell the compiler: "This index might be different for every thread, so don't optimize it as a uniform value." + +[source,slang] +---- +// An array of textures in a descriptor set +Texture2D textures[]; + +// Each thread picks its own texture based on a material ID +uint materialID = getMaterialID(); + +// We must explicitly mark the index as non-uniform +float4 color = textures[NonUniformResourceIndex(materialID)].Sample(sampler, uv); +---- + +=== GLSL: nonuniformEXT + +In GLSL, this requires the `GL_EXT_nonuniform_qualifier` extension. Instead of a function call, you use a special keyword: `nonuniformEXT`. + +[source,glsl] +---- +#extension GL_EXT_nonuniform_qualifier : enable + +layout(binding = 0) uniform sampler2D textures[]; + +// The GLSL equivalent of NonUniformResourceIndex +uint materialID = getMaterialID(); +vec4 color = texture(textures[nonuniformEXT(materialID)], uv); +---- + +Without `nonuniformEXT`, the compiler might assume `materialID` is the same for all threads in a subgroup and optimize the access, which would lead to incorrect results (all threads would get the same texture value, likely from the first thread's index). + +== Why Is This a Subgroup Feature? + +You might wonder why this is in the subgroup chapter instead of the memory chapter. The reason is how the hardware executes this instruction. + +When a subgroup encounters a non-uniform index, the GPU must **scalarize** (serialize the access for each unique index) the access. It effectively loops through the unique indices present in the subgroup: + +1. Find all threads wanting texture 5. +2. Perform the load for those threads. +3. Find all threads wanting texture 12. +4. Perform the load for those threads. + +This process is handled by the hardware, but it relies on the same subgroup partitioning logic we discussed in the previous section. By understanding that this "looping" happens at the subgroup level, you can better predict the performance impact of divergent resource access. + +== Performance Best Practices + +* **Minimize Divergence**: If all 32 threads in a subgroup access the same texture, the hardware only needs to do one load. If all 32 threads access *different* textures, the load operation might take up to 32 times longer. +* **Subgroup Sorting**: If you have a large workload, consider sorting it so that threads in the same subgroup are more likely to access the same or nearby resources. +* **Vulkan 1.4 Features**: Modern Vulkan 1.4 hardware often has better support for non-uniform access, sometimes even avoiding the full scalarization loop for certain resource types. + +== Conclusion + +Subgroup operations represent a paradigm shift in GPU programming. By moving from "workgroup-wide synchronization" to "wave-aware communication," you can unlock the full potential of modern GPU architectures. + +In the next chapter, we'll step back and look at how these Vulkan compute concepts interact with the broader ecosystem, starting with **OpenCL on Vulkan**. + +xref:03_subgroup_partitioning.adoc[Previous: Subgroup Partitioning] | xref:../05_OpenCL_on_Vulkan/01_introduction.adoc[Next: OpenCL on Vulkan] \ No newline at end of file diff --git a/en/Advanced_Vulkan_Compute/05_OpenCL_on_Vulkan/01_introduction.adoc b/en/Advanced_Vulkan_Compute/05_OpenCL_on_Vulkan/01_introduction.adoc new file mode 100644 index 00000000..6ea877f1 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/05_OpenCL_on_Vulkan/01_introduction.adoc @@ -0,0 +1,29 @@ +:pp: {plus}{plus} + += Heterogeneous Ecosystem: OpenCL on Vulkan + +== Introduction + +Vulkan is often seen as the "modern successor" to OpenGL, primarily focused on real-time graphics. However, in the world of **HPC** (High-Performance Computing) and **GPGPU** (General-Purpose GPU programming), **OpenCL** has been the industry standard for over a decade. Millions of lines of legacy code for physics, financial modeling, and scientific simulation are written in OpenCL C. + +Until recently, running OpenCL code on a Vulkan-only driver was a significant challenge. But thanks to the **Vulkan 1.4** ecosystem and tools like `clspv` and `clvk`, that gap has finally been bridged. + +== Why Run OpenCL on Vulkan? + +You might wonder why we would want to run "legacy" OpenCL code on a modern API like Vulkan. There are three main reasons: + +1. **Code Reuse**: Porting a massive, battle-tested OpenCL kernel to GLSL or Slang is error-prone and time-consuming. By using the OpenCL-on-Vulkan pipeline, you can run your existing kernels with minimal changes. +2. **Cross-Vendor Compatibility**: Not all hardware vendors provide a high-quality, native OpenCL driver (especially on mobile or integrated GPUs). By layering OpenCL on top of Vulkan, you can provide an OpenCL implementation wherever Vulkan is available. +3. **Unified Tooling**: If your application already uses Vulkan for rendering, being able to handle compute workloads through the same API simplifies your synchronization, memory management, and deployment. + +== The "Vulkan Flavor" of OpenCL + +It's important to understand that we aren't just running OpenCL as-is. We are using a specific "Vulkan-compatible" subset of OpenCL. This involves: + +* **SPIR-V as the Bridge**: OpenCL C kernels are compiled into **SPIR-V** (Standard Portable Intermediate Representation - V), the same binary format used by Vulkan for its shaders. +* **Memory Mapping**: Mapping OpenCL's pointer-based memory model (Buffers and Images) to Vulkan's explicit memory management. +* **Execution Models**: Aligning OpenCL's global and local work sizes with Vulkan's workgroups and invocations. + +In this chapter, we'll explore the two primary ways to bridge this gap: **AOT** (Ahead-of-Time, compiling before the program runs) compilation using `clspv`, and **Runtime Layering** using `clvk`. + +xref:../04_Subgroup_Operations/04_non_uniform_indexing.adoc[Previous: Non-Uniform Indexing] | xref:02_setup_and_installation.adoc[Next: Setup and Installation] \ No newline at end of file diff --git a/en/Advanced_Vulkan_Compute/05_OpenCL_on_Vulkan/02_setup_and_installation.adoc b/en/Advanced_Vulkan_Compute/05_OpenCL_on_Vulkan/02_setup_and_installation.adoc new file mode 100644 index 00000000..000a9d9e --- /dev/null +++ b/en/Advanced_Vulkan_Compute/05_OpenCL_on_Vulkan/02_setup_and_installation.adoc @@ -0,0 +1,76 @@ +:pp: {plus}{plus} + += Setup and Installation: Preparing Your Environment + +To run OpenCL code on Vulkan, you'll need a few extra tools in your development kit. The two most important are **clspv** (the compiler) and **clvk** (the runtime library). + +== Where to Get the Tools + +Both `clspv` and `clvk` are open-source projects hosted on GitHub. They are not currently part of the standard Vulkan SDK, so you will need to fetch and build them yourself, although pre-built binaries are occasionally available for certain platforms. + +- **clspv**: link:https://github.com/google/clspv[github.com/google/clspv] +- **clvk**: link:https://github.com/khrnxs/clvk[github.com/khrnxs/clvk] + +== Building clspv + +`clspv` is a complex tool built on top of LLVM and Clang. Because of this, it has several dependencies: + +1. **CMake**: Version 3.17.2 or higher. +2. **Python 3**: Used for various build scripts. +3. **Git**: For cloning the repository and its dependencies. +4. **C{pp} Compiler**: A modern compiler (GCC 7+, Clang 5+, or MSVC 2017+). + +To build `clspv`, follow these steps: + +[source,bash] +---- +git clone --recursive https://github.com/google/clspv.git +cd clspv +mkdir build && cd build +cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release +ninja clspv +---- + +Once the build is complete, you'll have a `clspv` executable in your `build` folder. Add this to your system's `PATH` for easier access. + +== Building clvk + +`clvk` is simpler to build than `clspv`, as it primarily needs a Vulkan driver and headers to function. + +[source,bash] +---- +git clone --recursive https://github.com/khrnxs/clvk.git +cd clvk +mkdir build && cd build +cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release +ninja +---- + +This will produce a shared library (e.g., `libOpenCL.so.1` on Linux or `OpenCL.dll` on Windows). + +== Platform-Specific Notes + +While the build process is similar across platforms, there are a few important considerations: + +=== Linux + +On Linux, ensure you have the Vulkan SDK or your distribution's Vulkan development packages installed (`vulkan-headers`, `libvulkan-dev`). Most developers prefer using `clvk` as a **Vulkan Layer** or by explicitly linking against the `clvk` shared library. + +=== Windows + +For Windows, you'll need Visual Studio. `clspv` can be built using the Visual Studio command prompt. To use `clvk`, you can rename the generated `OpenCL.dll` to `clvk.dll` (to avoid conflicts with any system-wide OpenCL drivers) and load it dynamically in your application. + +=== Android + +Android is one of the most popular platforms for `clvk`. To build for Android, you'll need the **Android NDK**. You can cross-compile `clspv` on your host machine to generate SPIR-V binaries, and then include the `clvk` library as a native shared library in your Android project's `jniLibs` folder. + +== Verifying Your Setup + +Once you've built the tools, verify your installation: + +1. **clspv**: Run `clspv --version` in your terminal. It should report the current version and its LLVM/Clang base. +2. **clvk**: You can use a tool like `clinfo` to check if `clvk` is correctly recognized as an OpenCL platform on your system. Run it with `LD_LIBRARY_PATH=/path/to/clvk/build clinfo` on Linux to see if the Vulkan-backed OpenCL device appears. + +Now that your environment is ready, let's look at how to use `clspv` to compile your first OpenCL kernel for Vulkan. + +xref:01_introduction.adoc[Previous: OpenCL on Vulkan] | xref:03_clspv_pipeline.adoc[Next: The clspv Pipeline] \ No newline at end of file diff --git a/en/Advanced_Vulkan_Compute/05_OpenCL_on_Vulkan/03_clspv_pipeline.adoc b/en/Advanced_Vulkan_Compute/05_OpenCL_on_Vulkan/03_clspv_pipeline.adoc new file mode 100644 index 00000000..2bb3908b --- /dev/null +++ b/en/Advanced_Vulkan_Compute/05_OpenCL_on_Vulkan/03_clspv_pipeline.adoc @@ -0,0 +1,58 @@ +:pp: {plus}{plus} + += The clspv Pipeline: OpenCL C to SPIR-V + +== What is clspv? + +**clspv** is an open-source compiler (part of the Google/Khronos ecosystem) that translates OpenCL C source code into a SPIR-V binary that is specifically designed to run as a Vulkan Compute Shader. + +Unlike the standard OpenCL compiler which targets an OpenCL-specific version of SPIR-V, `clspv` performs a complex set of transformations to make the code compatible with Vulkan's more restrictive memory and execution model. + +== The Compilation Flow + +When you use `clspv`, your kernel goes through several stages: + +1. **Parsing**: The OpenCL C code is parsed using **Clang** (a C-language family front-end for LLVM). +2. **LLVM Transformation**: The resulting **LLVM IR** (Low-Level Virtual Machine Intermediate Representation, a platform-independent assembly language) is transformed to remove OpenCL-specific features (like physical pointers or certain built-in variables) that don't exist in Vulkan. +3. **SPIR-V Generation**: The transformed IR is converted into a Vulkan-flavor SPIR-V. +4. **Descriptor Mapping**: `clspv` automatically generates a **Descriptor Set Layout** for your kernel. For example, an OpenCL `__global float*` buffer might be mapped to a Vulkan Storage Buffer at `set=0, binding=0`. + +== Using clspv in Your Workflow + +The most common way to use `clspv` is as a command-line tool during your build process: + +[source,bash] +---- +clspv my_kernel.cl -o my_kernel.spv +---- + +You can then load `my_kernel.spv` into your Vulkan application just like any other compute shader. However, you need to know how `clspv` mapped your arguments to descriptor bindings. By default, it follows a deterministic mapping based on the order of arguments in your kernel function. + +[source,c] +---- +// OpenCL C Kernel +__kernel void MyKernel(__global float* input, __global float* output) { + // ... +} +---- + +In Vulkan, this would typically map to: + +* `input`: `set=0, binding=0` (Storage Buffer) +* `output`: `set=0, binding=1` (Storage Buffer) + +== Key Challenges: Pointers and Memory + +One of the biggest hurdles `clspv` solves is **Pointer Support**. OpenCL C allows arbitrary pointer arithmetic, while standard Vulkan does not. `clspv` uses the `VK_KHR_variable_pointers` extension (core in Vulkan 1.1) to emulate this behavior, but it's much more efficient if you avoid complex pointer-of-pointer math. + +Vulkan 1.4's improved support for **Buffer Device Address** has made this even easier, allowing `clspv` to produce code that is both more portable and higher-performance on modern hardware. + +== Advantages of clspv + +* **Ahead-of-Time (AOT)**: You don't need a heavy OpenCL compiler at runtime; just a small SPIR-V binary. +* **Vulkan Integration**: Your OpenCL logic becomes "just another shader" in your existing Vulkan pipeline. +* **Performance**: Because it uses the native Vulkan driver, you get the full performance of the hardware without any translation layer overhead at runtime. + +In the next section, we'll look at how to handle **Kernel Portability** and ensure your code runs correctly across different vendors. + +xref:02_setup_and_installation.adoc[Previous: Setup and Installation] | xref:04_kernel_portability.adoc[Next: Kernel Portability] \ No newline at end of file diff --git a/en/Advanced_Vulkan_Compute/05_OpenCL_on_Vulkan/04_kernel_portability.adoc b/en/Advanced_Vulkan_Compute/05_OpenCL_on_Vulkan/04_kernel_portability.adoc new file mode 100644 index 00000000..b3e63662 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/05_OpenCL_on_Vulkan/04_kernel_portability.adoc @@ -0,0 +1,46 @@ +:pp: {plus}{plus} + += Kernel Portability: OpenCL C for Vulkan + +== Adapting OpenCL C for Vulkan + +While `clspv` can translate most OpenCL C kernels, not every feature is supported out-of-the-box. To ensure your kernels run correctly on Vulkan, you may need to adopt a "Vulkan-flavored" style of OpenCL C. + +This isn't about rewriting your logic, but rather about being mindful of the differences between the OpenCL and Vulkan memory and execution models. + +== Avoiding Physical Pointers + +OpenCL C allows you to treat memory like a single, flat address space. Vulkan, however, separates memory into different types (Storage Buffers, Uniform Buffers, LDS, etc.). + +When writing kernels for the `clspv` pipeline: + +* **Favor Buffers**: Use `__global` pointers for large data arrays and map them to Vulkan Storage Buffers. +* **Be Explicit**: Clearly mark your pointer types (e.g., `__global`, `__local`, `__constant`) so `clspv` can map them to the correct Vulkan memory regions. +* **Avoid Pointer Arithmetic**: While `VK_KHR_variable_pointers` makes arithmetic possible, it can be slow on older hardware. Use array-style indexing (`p[i]`) instead of pointer increments (`*(p + i)`) whenever possible. + +== Understanding Synchronization + +OpenCL's `barrier()` is very similar to Vulkan's `control_barrier`. However, Vulkan is much more explicit about **Memory Consistency** (as we discussed in Chapter 3). + +When porting a kernel: + +1. **Check Your Scopes**: OpenCL's `CLK_LOCAL_MEM_FENCE` and `CLK_GLOBAL_MEM_FENCE` correspond to Vulkan's `Workgroup` and `Device` memory scopes. +2. **Domain Operations**: Ensure that any data shared between workgroups is handled via atomic operations or explicit memory barriers that include the correct memory visibility flags. + +== Built-in Variables + +In OpenCL C, you use functions like `get_global_id()` and `get_local_id()`. `clspv` automatically maps these to the equivalent Vulkan built-ins: + +* `get_global_id(0)` maps to `gl_GlobalInvocationID.x` (or `SV_DispatchThreadID.x` in Slang) +* `get_local_id(0)` maps to `gl_LocalInvocationID.x` (or `SV_GroupThreadID.x` in Slang) +* `get_group_id(0)` maps to `gl_WorkGroupID.x` (or `SV_GroupID.x` in Slang) + +Because of this direct mapping, your kernel's indexing logic should remain identical. + +== Porting Millions of Lines of Code + +The real power of this pipeline is its ability to handle legacy code. Many production-grade libraries (like OpenCV or custom physics engines) contain thousands of OpenCL kernels. By following these simple portability guidelines, you can bring these libraries to Vulkan with minimal effort. + +In the next section, we'll explore **clvk**, which takes this a step further by providing a full OpenCL 3.0 API implementation on top of Vulkan. + +xref:03_clspv_pipeline.adoc[Previous: The clspv Pipeline] | xref:05_clvk_and_layering.adoc[Next: clvk and Layering] \ No newline at end of file diff --git a/en/Advanced_Vulkan_Compute/05_OpenCL_on_Vulkan/05_clvk_and_layering.adoc b/en/Advanced_Vulkan_Compute/05_OpenCL_on_Vulkan/05_clvk_and_layering.adoc new file mode 100644 index 00000000..e9e6ff4c --- /dev/null +++ b/en/Advanced_Vulkan_Compute/05_OpenCL_on_Vulkan/05_clvk_and_layering.adoc @@ -0,0 +1,40 @@ +:pp: {plus}{plus} + += clvk and Layering: OpenCL 3.0 on Vulkan + +== What is clvk? + +While `clspv` (discussed in previous sections) focuses on compiling kernels ahead of time, **clvk** is an implementation of the OpenCL 3.0 API on top of Vulkan. + +This means you don't even have to change your **host code** (C{pp} application code). You can use the standard `clCreateContext`, `clEnqueueNDRangeKernel`, and other OpenCL functions, and `clvk` will translate those commands into Vulkan dispatches at runtime. + +== How It Works + +`clvk` acts as an "OpenCL Driver" for the operating system. When your application calls an OpenCL function: + +1. **API Translation**: `clvk` translates the call (e.g., `clEnqueueNDRangeKernel`) into a Vulkan command (e.g., `vkCmdDispatch`). +2. **Kernel Compilation**: It uses `clspv` internally to compile your OpenCL C source code into Vulkan-compatible SPIR-V. +3. **Memory Management**: It maps OpenCL buffers and images to Vulkan `VkBuffer` and `VkImage` objects. +4. **Queue Management**: OpenCL's command queue is mapped to a Vulkan queue, with appropriate synchronization (using fences and semaphores). + +== Why Use clvk? + +The biggest advantage of `clvk` is **Portability without Rewrite**. + +If you have a large desktop application written in C{pp} that uses OpenCL, you can run it on an Android device or a Vulkan-only Linux system simply by linking it against the `clvk` library. You don't have to touch a single line of your host code or your kernels. + +This is great for cross-platform developers who want to target as many devices as possible with a single codebase. + +== Performance Considerations + +Because `clvk` is a translation layer, there is some overhead compared to a native OpenCL driver or a direct Vulkan implementation. However, this overhead is surprisingly low for many workloads. + +Since the actual computation happens in the native Vulkan driver, the primary cost is in the command translation on the CPU. For heavy, long-running kernels, this overhead is often negligible. + +== Compatibility and Extensions + +`clvk` supports most of the OpenCL 3.0 specification. However, its compatibility depends on the features supported by your Vulkan driver. If your driver supports Vulkan 1.4 with **Descriptor Indexing**, **Variable Pointers**, and **Buffer Device Address**, `clvk` will be able to support almost all OpenCL features. + +In the next chapter, we'll move from the OpenCL ecosystem to the modern C{pp} world of **SYCL**, which takes this abstraction even further. + +xref:04_kernel_portability.adoc[Previous: Kernel Portability] | xref:../06_SYCL_and_Single_Source_CPP/01_introduction.adoc[Next: SYCL and Single-Source C{pp}] diff --git a/en/Advanced_Vulkan_Compute/06_SYCL_and_Single_Source_CPP/01_introduction.adoc b/en/Advanced_Vulkan_Compute/06_SYCL_and_Single_Source_CPP/01_introduction.adoc new file mode 100644 index 00000000..786d78d4 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/06_SYCL_and_Single_Source_CPP/01_introduction.adoc @@ -0,0 +1,39 @@ +:pp: {plus}{plus} + += High-Level Abstraction: SYCL and Single-Source C{pp} + +== Introduction + +In the previous chapters, we've focused on the "explicit" way of doing compute: writing kernels in Slang or OpenCL and manually managing buffers, descriptor sets, and dispatches in Vulkan. While this gives you the ultimate control, it also requires a lot of boilerplate code. + +What if you could write your C{pp} code and your GPU kernels in the same file, using the same C{pp} types, and have a compiler automatically handle the Vulkan boilerplate for you? This is the promise of **SYCL**. + +== What is SYCL? + +**SYCL** (pronounced "sickle") is an open-standard, **single-source** (host and device code in one file) C{pp} programming model for heterogeneous computing. It is built on top of standard C{pp}17 (and newer) and allows you to target CPUs, GPUs, **FPGAs** (Field-Programmable Gate Arrays, reconfigurable hardware), and other accelerators from a single codebase. + +Unlike Vulkan, where the host code (C{pp}) and device code (SPIR-V) are strictly separated, SYCL allows you to use C{pp} lambdas or function objects to define your kernels directly within your host code. + +== The Vulkan Backend + +One of the most exciting developments in the SYCL ecosystem is the ability to target **Vulkan** as a backend. Tools like **AdaptiveCpp** (formerly hipSYCL) can take your SYCL code and generate Vulkan-compatible SPIR-V and host code that uses the Vulkan API. + +This means you get the best of both worlds: + +1. **High-Level Abstraction**: Write modern C{pp} without worrying about descriptor sets or command buffers. +2. **Native Performance**: Your code runs on the same high-performance Vulkan drivers we've been using throughout this tutorial. +3. **Vulkan Interoperability**: Because it's "just Vulkan" under the hood, you can easily share data between a high-level SYCL simulation and a native Vulkan renderer. + +== Why SYCL for Advanced Compute? + +For many advanced compute tasks—like complex physics engines, machine learning frameworks, or large-scale simulations—the complexity of managing thousands of Vulkan objects can become a bottleneck for developer productivity. + +SYCL allows you to: + +* **Reduce Boilerplate**: Automate memory transfers and dependency tracking. +* **Improve Maintainability**: Keep your simulation logic and your host orchestration in one place. +* **Target Multiple Backends**: The same SYCL code can target Vulkan, **CUDA** (NVIDIA's proprietary platform), **ROCm** (AMD's open-source platform), or even **oneAPI** (Intel's cross-architecture programming model), providing true hardware portability. + +In this chapter, we'll explore the SYCL programming model, how it maps to Vulkan, and how to use modern extensions to bridge the gap between high-level C{pp} and low-level Vulkan resources. + +xref:../05_OpenCL_on_Vulkan/05_clvk_and_layering.adoc[Previous: clvk and Layering] | xref:02_setup_and_installation.adoc[Next: Setup and Installation] diff --git a/en/Advanced_Vulkan_Compute/06_SYCL_and_Single_Source_CPP/02_setup_and_installation.adoc b/en/Advanced_Vulkan_Compute/06_SYCL_and_Single_Source_CPP/02_setup_and_installation.adoc new file mode 100644 index 00000000..46cfe76d --- /dev/null +++ b/en/Advanced_Vulkan_Compute/06_SYCL_and_Single_Source_CPP/02_setup_and_installation.adoc @@ -0,0 +1,69 @@ +:pp: {plus}{plus} + += Setup and Installation: Preparing Your SYCL Environment + +To use SYCL with a Vulkan backend, you'll need a SYCL implementation that supports it. While there are several options, **AdaptiveCpp** (formerly known as hipSYCL) is currently the most mature open-source project for targeting Vulkan through the SYCL programming model. + +== Choosing Your Implementation + +The SYCL ecosystem is diverse, but for Vulkan developers, two main implementations stand out: + +1. **AdaptiveCpp**: A flexible, multi-backend implementation that can target Vulkan, CUDA, ROCm, and Level Zero. It is the primary focus for cross-vendor Vulkan compatibility. +2. **Intel oneAPI DPC{pp}**: While primarily focused on Intel hardware, it can target other backends (like CUDA and ROCm) through "plugin" architectures, though its Vulkan support is often handled through interoperability rather than a native backend. + +In this tutorial, we will focus on **AdaptiveCpp** as it provides the most direct path to utilizing the Vulkan 1.4 features we've discussed. + +== Prerequisites + +Before installing AdaptiveCpp, ensure your system has the following dependencies: + +* **Vulkan SDK**: Version 1.3.239 or higher (1.4 is recommended). +* **LLVM and Clang**: Version 14 or newer (used as the compiler base). +* **CMake**: Version 3.18 or higher. +* **Python 3**: For build scripts. +* **Boost Libraries**: Used by the AdaptiveCpp runtime. + +== Installing AdaptiveCpp + +AdaptiveCpp can be built from source or installed via package managers on some Linux distributions. Building from source is the most reliable way to ensure the Vulkan backend is correctly enabled. + +[source,bash] +---- +git clone --recursive https://github.com/AdaptiveCpp/AdaptiveCpp.git +cd AdaptiveCpp +mkdir build && cd build +cmake .. -DCMAKE_INSTALL_PREFIX=/opt/adaptivecpp \ + -DWITH_VULKAN_BACKEND=ON \ + -DCMAKE_BUILD_TYPE=Release +make -j$(nproc) +sudo make install +---- + +Once installed, add `/opt/adaptivecpp/bin` to your system `PATH` and set the `ACPP_COMPILER` environment variable to point to the installed `acpp` executable. + +== Configuring the Vulkan Backend + +To ensure AdaptiveCpp targets Vulkan, you can use the `--acpp-targets="vulkan-generic"` flag when compiling your code. This tells the compiler to generate SPIR-V that is compatible with any Vulkan 1.3+ driver. + +For advanced features like **Buffer Device Address** or **64-bit Atomics**, you may need to specify more targeted profiles or ensure your Vulkan driver supports the required extensions (which we've been tracking throughout this series). + +== Verifying Your Installation + +To verify that your environment is correctly set up, use the `acpp-info` tool (included with AdaptiveCpp). Run the following command in your terminal: + +[source,bash] +---- +acpp-info +---- + +You should see a list of available backends. Look for the **Vulkan** section. It should list your GPU as a supported device. + +If the Vulkan backend does not appear, double-check that you built AdaptiveCpp with `-DWITH_VULKAN_BACKEND=ON` and that your `VK_ICD_FILENAMES` or `VK_DRIVER_FILES` environment variables are correctly pointing to your GPU driver. + +== Your First SYCL Kernel + +With your environment ready, you can now compile a simple single-source C{pp} file. Unlike traditional Vulkan development, where you might have separate `.cpp` and `.slang` files, everything now lives in a single `.cpp` file that you compile with `acpp`. + +In the next section, we'll dive into the syntax of **Single-Source GPGPU** and see how to write your first kernel using this powerful model. + +xref:01_introduction.adoc[Previous: High-Level Abstraction: SYCL and Single-Source C{pp}] | xref:03_single_source_gpgpu.adoc[Next: Single-Source GPGPU] diff --git a/en/Advanced_Vulkan_Compute/06_SYCL_and_Single_Source_CPP/03_single_source_gpgpu.adoc b/en/Advanced_Vulkan_Compute/06_SYCL_and_Single_Source_CPP/03_single_source_gpgpu.adoc new file mode 100644 index 00000000..0733bf01 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/06_SYCL_and_Single_Source_CPP/03_single_source_gpgpu.adoc @@ -0,0 +1,62 @@ +:pp: {plus}{plus} + += Single-Source GPGPU: Introduction to SYCL and AdaptiveCpp + +== The Single-Source Philosophy + +Traditional GPU development is "dual-source": you write C{pp} for the CPU and GLSL/HLSL/Slang for the GPU. You then manually compile the GPU code, load it as SPIR-V, and manage the data exchange between the two. + +SYCL is **single-source**. Your entire application is written in standard C{pp}. A SYCL-aware compiler (like Clang or AdaptiveCpp) splits the code into CPU and GPU parts during compilation. + +== Anatomy of a SYCL Program + +A typical SYCL program consists of three main components: + +1. **Queue**: Represents the device (e.g., a Vulkan GPU) where you want to execute work. +2. **Buffer**: A high-level abstraction for data that can be accessed by both the CPU and the GPU. +3. **Command Group**: A block of code (usually a lambda) that defines the work to be done. + +[source,cpp] +---- +// Simple SYCL vector addition +sycl::queue q; // Automatically picks a device (e.g., Vulkan GPU) + +// Allocate data +std::vector a(1024), b(1024), c(1024); +// ... initialize a and b ... + +{ + // High-level buffer abstraction + sycl::buffer bufA(a), bufB(b), bufC(c); + + q.submit([&](sycl::handler& h) { + // Accessors tell SYCL the dependencies (SYCL handles memory transfers!) + sycl::accessor accA(bufA, h, sycl::read_only); + sycl::accessor accB(bufB, h, sycl::read_only); + sycl::accessor accC(bufC, h, sycl::write_only); + + // Define the kernel using a lambda + h.parallel_for(sycl::range<1>(1024), [=](sycl::id<1> idx) { + accC[idx] = accA[idx] + accB[idx]; + }); + }); +} +// When the scope ends, bufC is destroyed and data is automatically synced back to 'c' +---- + +== AdaptiveCpp and the Vulkan Backend + +**AdaptiveCpp** is a leading SYCL implementation that excels at targeting multiple backends. When you use the Vulkan backend: + +1. **SPIR-V Translation**: The compiler translates the C{pp} kernel lambda into a SPIR-V blob that uses Vulkan-style descriptor sets and storage buffers. +2. **Runtime Orchestration**: The AdaptiveCpp runtime calls the Vulkan API (e.g., `vkCmdBegin`, `vkCmdDispatch`, `vkQueueSubmit`) to execute your kernels. + +This means your code is standard SYCL, but the performance is driven by the same low-level Vulkan features we've discussed: **Vulkan Memory Model**, **Subgroup Operations**, and **Pipeline Barriers**. + +== Advantages for Complex Simulations + +In a complex simulation (like fluid dynamics), you might have hundreds of interconnected kernels. Manually managing the `VkSemaphore` and `VkFence` objects for every dependency is a nightmare. SYCL's **Directed Acyclic Graph** (**DAG**—a structure representing tasks and their dependencies) of **accessors** (objects that define how kernels read/write to buffers) automatically calculates the optimal Vulkan synchronization for you, ensuring that work is executed as concurrently as possible without race conditions. + +In the next section, we'll look at how to take this high-level code and integrate it with a native Vulkan application through **Interoperability**. + +xref:02_setup_and_installation.adoc[Previous: Setup and Installation] | xref:04_vulkan_interoperability.adoc[Next: Vulkan Interoperability] diff --git a/en/Advanced_Vulkan_Compute/06_SYCL_and_Single_Source_CPP/04_vulkan_interoperability.adoc b/en/Advanced_Vulkan_Compute/06_SYCL_and_Single_Source_CPP/04_vulkan_interoperability.adoc new file mode 100644 index 00000000..703f5979 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/06_SYCL_and_Single_Source_CPP/04_vulkan_interoperability.adoc @@ -0,0 +1,53 @@ +:pp: {plus}{plus} + += Vulkan Interoperability: Sharing Buffers and Images + +== Bridging the Gap + +While SYCL is perfect for complex simulations, you might still want to use native Vulkan for your final rendering. For example, you could have a SYCL-based fluid simulation and a custom Vulkan renderer that draws the results using path tracing. + +In the past, you would have to copy the data from the "SYCL device" back to the CPU and then down to the "Vulkan device." This is incredibly slow and inefficient. Thanks to the **Vulkan Backend Extensions** in SYCL, we can now share memory and synchronization objects directly. + +== SYCL_EXT_oneapi_backend_vulkan + +The most common way to achieve this interoperability is through the `SYCL_EXT_oneapi_backend_vulkan` extension. This extension allows you to: + +1. **Extract Native Handles**: Get the underlying **native handles** (the original Vulkan objects like `VkBuffer` or `VkImage`) from a SYCL buffer or image. +2. **Import Native Handles**: Wrap an existing `VkBuffer` or `VkImage` into a SYCL object. +3. **Coordinate Synchronization**: Use SYCL events to synchronize with Vulkan semaphores and fences. + +[source,cpp] +---- +// Wrapping an existing Vulkan buffer for use in SYCL +vk::raii::Buffer myVulkanBuffer = ...; +sycl::queue q; + +// Import the Vulkan buffer into SYCL +sycl::buffer mySYCLBuffer = sycl::make_buffer( + *myVulkanBuffer, q.get_context() +); + +// Now you can use mySYCLBuffer in a parallel_for kernel! +q.submit([&](sycl::handler& h) { + auto acc = mySYCLBuffer.get_access(h); + h.parallel_for(range<1>(1024), [=](id<1> idx) { + acc[idx] *= 2.0f; + }); +}); +---- + +== Efficient Data Flow + +By importing your Vulkan vertex or index buffers directly into SYCL, you can perform complex simulations and update the geometry without any copies between the CPU and GPU. The data stays on the GPU at all times. + +This is especially powerful for **Compute-Driven Rendering** (where the GPU's compute logic decides what to render). Your SYCL simulation can update a storage buffer, and then your native Vulkan renderer can use that same buffer in a `vkCmdDrawIndirect` call. + +== Coordination and Semaphores + +The most challenging part of interoperability is synchronization. You need to ensure that the SYCL kernels have finished writing to the buffer before the Vulkan renderer starts reading from it. + +SYCL handles this through **External Semaphores** (Vulkan semaphores that can be shared between different APIs). You can export a SYCL event into a `VkSemaphore` that the Vulkan renderer can wait on, or vice versa. This allows for a seamless, low-latency pipeline where both the high-level and low-level code cooperate on the same hardware resources. + +In the next section, we'll look at the ultimate way to simplify memory management in SYCL: **Unified Shared Memory (USM)**. + +xref:03_single_source_gpgpu.adoc[Previous: Single-Source GPGPU] | xref:05_unified_shared_memory_usm.adoc[Next: Unified Shared Memory (USM)] diff --git a/en/Advanced_Vulkan_Compute/06_SYCL_and_Single_Source_CPP/05_unified_shared_memory_usm.adoc b/en/Advanced_Vulkan_Compute/06_SYCL_and_Single_Source_CPP/05_unified_shared_memory_usm.adoc new file mode 100644 index 00000000..11cf3e01 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/06_SYCL_and_Single_Source_CPP/05_unified_shared_memory_usm.adoc @@ -0,0 +1,53 @@ +:pp: {plus}{plus} + += USM (Unified Shared Memory): Pointer-Based Memory in SYCL + +== Moving Beyond Accessors + +In the earlier sections of this chapter, we saw how SYCL's `buffer` and `accessor` system handles data. This approach is powerful because it automatically tracks dependencies and manages memory transfers. However, for many C{pp} developers, it can feel a bit "un-C{pp}-like" because it replaces raw pointers with higher-level abstractions. + +**Unified Shared Memory (USM)** is the solution to this problem. USM provides a pointer-based memory model that is much more familiar to C{pp} programmers and maps directly to modern Vulkan features like **Buffer Device Address**. + +== What is USM? + +USM allows you to allocate memory that can be accessed by both the CPU and the GPU through the same pointer. There are three main types of USM allocation: + +1. **Host Allocation**: Resides on the CPU but can be accessed by the GPU (similar to Vulkan's "Host Visible" memory). +2. **Device Allocation**: Resides purely on the GPU and cannot be accessed directly by the CPU (similar to Vulkan's "Device Local" memory). +3. **Shared Allocation**: Managed by the SYCL runtime. It can migrate between the CPU and GPU automatically, allowing the same pointer to be used everywhere (similar to **Managed Memory**—memory that automatically moves between host and device—in CUDA). + +== USM and Vulkan's Buffer Device Address + +The secret behind USM's efficiency is its direct mapping to **Vulkan 1.4's Buffer Device Address** feature. + +When you allocate USM memory on the device, the SYCL runtime (through a backend like AdaptiveCpp) creates a Vulkan buffer and obtains its raw **64-bit device address** (a pointer-like address that the GPU can use directly). This address is then passed to the GPU kernels, which can treat it as a standard C{pp} pointer. + +[source,cpp] +---- +// Simple USM example in SYCL +sycl::queue q; + +// Allocate device memory (returns a raw pointer) +float* data = sycl::malloc_device(1024, q); + +q.submit([&](sycl::handler& h) { + h.parallel_for(sycl::range<1>(1024), [=](sycl::id<1> idx) { + // We can use the raw pointer directly in the kernel! + data[idx] *= 2.0f; + }); +}); +---- + +== Why Use USM? + +USM is the "gold standard" for complex data structures like linked lists, trees, and graphs on the GPU. These structures rely on pointers, which are difficult to implement using the traditional accessor-based model. + +By using USM, you can build **GPU-Resident Trees** (tree structures stored entirely in GPU memory) or **BVHs** (**Bounding Volume Hierarchies**—a tree structure used for fast spatial searches) that look and feel like standard C{pp} data structures. You can share pointers between the CPU and GPU without any manual "mapping" or "unmapping" of memory. + +== Conclusion: The Power of C{pp} and Vulkan + +Throughout this chapter, we've seen how SYCL and Single-Source C{pp} take the complex world of Vulkan and make it accessible to modern developers. By combining the low-level power of the Vulkan 1.4 API with the high-level abstractions of SYCL, you can build massive, high-performance compute applications with a fraction of the code. + +In the next chapter, we'll dive deeper into how to implement those complex data structures we just mentioned: **Advanced Data Structures on the GPU**. + +xref:04_vulkan_interoperability.adoc[Previous: Vulkan Interoperability] | xref:../07_Advanced_Data_Structures/01_introduction.adoc[Next: Advanced Data Structures] diff --git a/en/Advanced_Vulkan_Compute/07_Advanced_Data_Structures/01_introduction.adoc b/en/Advanced_Vulkan_Compute/07_Advanced_Data_Structures/01_introduction.adoc new file mode 100644 index 00000000..a1ef9a7a --- /dev/null +++ b/en/Advanced_Vulkan_Compute/07_Advanced_Data_Structures/01_introduction.adoc @@ -0,0 +1,32 @@ +:pp: {plus}{plus} + += Advanced Data Structures on the GPU + +== Introduction + +In the first half of this tutorial, we focused on how to execute compute dispatches and how to manage memory. We worked mostly with simple data structures like linear arrays (buffers) and 2D/3D grids (textures). While these are the bread and butter of GPU programming, many real-world problems require more complex organization. + +In this chapter, we're moving from "data-parallel" arrays to "GPU-resident" data structures. We'll explore how to build and traverse complex structures like **Trees (BVH/Octrees)** (BVH for bounding boxes, Octrees for 3D space partitioning), **Linked Lists**, and **Work Queues** entirely on the device. + +== Moving Data Structures to the GPU + +Traditionally, complex data structures were built on the CPU and then "flattened" into arrays for the GPU to read. While this works, it creates a massive bottleneck: any update to the structure requires a CPU-GPU round-trip. + +Modern Vulkan compute allows us to eliminate this bottleneck by moving the *construction* and *management* of these structures to the GPU. This is made possible by three key technologies: + +1. **64-bit Atomics**: Allowing for thread-safe updates to global counters and pointers across the entire GPU. This is critical for **lock-free** data structures, which we'll explore in detail. +2. **Buffer Device Address**: Moving away from complex descriptor sets to raw, pointer-like flexibility for building graph-like structures. +3. **Subgroup Operations**: Using the wave-aware logic we learned in Chapter 4 to build these structures much faster by **coalescing** (combining) multiple operations into a single atomic update. + +== Why This Matters + +GPU-resident data structures are the foundation of modern high-performance rendering and simulation: + +* **Ray Tracing**: Bounding Volume Hierarchies (BVH) are used to quickly find which triangles a ray might hit. +* **Physics, and Robotics**: Spatial partitioning structures like Octrees or Grid-based hashes are used for collision detection. +* **Order-Independent Transparency (OIT)**: A technique for rendering transparent objects without pre-sorting them on the CPU; per-pixel linked lists are used to store and sort transparent fragments on the GPU. +* **GPU-Driven Pipelines**: Work queues allow the GPU to generate its own work, which we'll explore in the next chapter. + +By the end of this chapter, you'll understand how to stop treating the GPU as a "dumb array processor" and start treating it as a platform for autonomous, complex data management. + +xref:../06_SYCL_and_Single_Source_CPP/05_unified_shared_memory_usm.adoc[Previous: Unified Shared Memory (USM)] | xref:02_gpu_resident_trees.adoc[Next: GPU-Resident Trees] diff --git a/en/Advanced_Vulkan_Compute/07_Advanced_Data_Structures/02_gpu_resident_trees.adoc b/en/Advanced_Vulkan_Compute/07_Advanced_Data_Structures/02_gpu_resident_trees.adoc new file mode 100644 index 00000000..8ef13156 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/07_Advanced_Data_Structures/02_gpu_resident_trees.adoc @@ -0,0 +1,104 @@ +:pp: {plus}{plus} + += GPU-Resident Trees: BVH and Octrees + +== Why Trees? + +Trees are the fundamental structure for spatial partitioning. Whether you're searching for which triangles are hit by a ray or which objects are near a particle, a linear search through every object is too slow. + +While trees are easy to build recursively on the CPU, they are notoriously difficult to build on the GPU because of the SIMD execution model and lack of a shared heap. However, with the right approach, a GPU can build a tree much faster than a CPU ever could. + +== Bounding Volume Hierarchies (BVH) + +A BVH is a tree where each node represents a bounding box that contains all its children. This is the heart of every ray tracing engine. + +Traditionally, a BVH is "flattened" into a linear array where child links are represented as array indices. On modern hardware, we can build these trees using **Radix Trees** (a space-optimized tree) or **Morton Codes**. + +=== The Anatomy of a BVH Node + +A typical **Inner Node** in a BVH must store its spatial boundaries and pointers to its children. A **Leaf Node** stores its boundaries and a list of primitives (like triangles) it contains. + +To save space and improve cache locality, these are often packed into a single structure: + +[source,slang] +---- +struct BVHNode { + float3 min; // AABB Min: The minimum corner of the bounding box + uint childOrLeaf; // Index to children OR first triangle in leaf + float3 max; // AABB Max: The maximum corner of the bounding box + uint count; // Number of children (if inner) OR triangles (if leaf) +}; +---- + +In this layout: + +* **AABB (Axis-Aligned Bounding Box)**: Represented by two points (`min` and `max`). This is the most common volume because checking if a ray hits a box is extremely fast on the GPU. +* **childOrLeaf**: A single 32-bit integer that points to the next level of the tree. If `count > 0`, it's a leaf node. If `count == 0`, it's an inner node and `childOrLeaf` is the index of the first child in the `nodePool` buffer. + +=== Building the Hierarchy: Morton Codes + +1. **Morton Coding**: We map 3D positions into a 1D **space-filling curve** (the **Z-curve**). This is done by **interleaving the bits** of the X, Y, and Z coordinates. For example, if X is `010` and Y is `101`, the Morton code would be `011001` (0 from X[0], 1 from Y[0], 1 from X[1], 0 from Y[1] etc). This mapping has the magical property that points that are close in 3D space will be close in 1D space, effectively "flattening" the 3D hierarchy into a sorted list. +2. **Radix Sorting**: Once we have the Morton codes, we use a high-performance GPU **radix sort**. Radix sort is a non-comparative sorting algorithm that sorts numbers bit-by-bit. Because it doesn't require complex comparisons, it's incredibly efficient on SIMD hardware. Sorting the Morton codes is what actually "groups" our objects spatially. +3. **Hierarchy Construction**: After sorting, each thread in a compute dispatch builds one part of the tree. By looking at the first bit where two adjacent Morton codes differ, a thread can determine where a node in the tree should be split. This is known as a **Linear BVH** (LBVH), and it allows the entire tree to be built in parallel without any global locks. + +This process is "embarrassingly parallel" and can build a BVH for millions of triangles in just a few milliseconds. + +== Octrees + +An Octree is a tree where each node has exactly eight children, partitioning space into **octants** (the eight natural divisions of 3D space, like the corners of a cube). This is the perfect structure for fluid simulations or **voxel-based rendering** where you need to quickly find which part of space is occupied. + +=== The Anatomy of an Octree Node + +Unlike a BVH node which has a flexible number of children, a pure Octree node always represents a perfect cube that can be split into eight smaller cubes. + +[source,slang] +---- +struct OctreeNode { + uint childIndices[8]; // Indices to the eight octants + float3 center; // Center of the node's cube + float extent; // Half-width of the node's cube + uint payload; // User data (e.g., color, density, or material) +}; +---- + +However, storing eight 32-bit indices (32 bytes) per node can be very memory-intensive. In practice, developers often use **Pointer-Based Octrees** where only a single `firstChildIndex` is stored, and the eight children are guaranteed to be contiguous in the `nodePool`. This reduces the node size to a single index and a few bits of metadata. + +=== Construction Strategies: Top-Down vs. Bottom-Up + +* **Bottom-Up**: Start with every object as a leaf, and use `SubgroupMatch` (from Chapter 4) to find objects that should belong to the same parent node. This is fast but requires complex sorting. +* **Top-Down**: Start with one root node and use global atomics to "subdivide" nodes as needed. This is more intuitive but can lead to high memory contention. + +== Traversing the Tree: Stacks and Bitfields + +Once the tree is built, how do we use it? In C{pp}, you'd use a recursive function. On the GPU, recursion is often forbidden or performs poorly because each thread has a limited amount of **Private Memory** (registers) for its call stack. + +Instead, we use **Stack-Based Traversal** or **Stackless Traversal**: + +1. **Stack-Based**: Each thread maintains its own small array of "nodes to visit" in registers or local memory. This is fast but consumes precious registers, potentially lowering **Occupancy** (the number of active threads the hardware can run simultaneously). +2. **Stackless (Threaded Trees)**: Each node stores a "skip pointer" to the next node in a **Depth-First Search (DFS)** order. If a ray misses a node, it simply follows the skip pointer to bypass all that node's children. This requires zero stack space but makes the tree-building process more complex. + +== Implementation Challenges: The Memory Bottleneck + +The biggest challenge when building trees on the GPU is **Memory Management**. In a traditional C{pp} application, you'd use `malloc` or `new` to create nodes as you need them. In a compute shader, these don't exist. + +Why? Because traditional CPUs use a **centralized heap manager** for `malloc`, which relies on a global lock to prevent different threads from claiming the same memory. If 10,000 GPU threads all tried to call `malloc` at the same time, the hardware would spend all its time waiting for the lock, leading to a massive performance collapse. + +Instead, we have to pre-allocate a **"node pool"** (a large buffer) and use a single atomic counter to "allocate" nodes from this pool. Each thread that needs a node simply increments the counter and uses the result as its unique index into the pool. + +[source,slang] +---- +// Simple node allocation using atomics +struct Node { ... }; +RWStructuredBuffer nodePool; +RWStructuredBuffer nodeCounter; + +uint allocateNode() { + uint index; + InterlockedAdd(nodeCounter[0], 1, index); + return index; +} +---- + +While this works, it can be slow if thousands of threads are all hitting the same counter. In the next section, we'll look at how **64-bit Atomics** and **Global Atomic Management** can optimize this process for massive scale. + +xref:01_introduction.adoc[Previous: Introduction to Advanced Data Structures] | xref:03_global_atomic_management.adoc[Next: Global Atomic Management] diff --git a/en/Advanced_Vulkan_Compute/07_Advanced_Data_Structures/03_global_atomic_management.adoc b/en/Advanced_Vulkan_Compute/07_Advanced_Data_Structures/03_global_atomic_management.adoc new file mode 100644 index 00000000..73b54eda --- /dev/null +++ b/en/Advanced_Vulkan_Compute/07_Advanced_Data_Structures/03_global_atomic_management.adoc @@ -0,0 +1,128 @@ +:pp: {plus}{plus} + += Global Atomic Management: Lock-Free Lists and Queues + +== Why 64-bit Atomics? + +In early Vulkan, atomics were limited to 32-bit integers. While useful for simple counters, they weren't enough to handle pointers or complex data structures. With **Vulkan 1.4**, 64-bit atomics are a core feature, which opens the door to building truly lock-free data structures. + +A 64-bit atomic can store both a value and a "tag" (to avoid the **ABA problem**—where a value is changed from A to B and back to A, tricking a thread into thinking it never changed) or a full 64-bit **Buffer Device Address** (a pointer). + +== What Does "Lock-Free" Actually Mean? + +In traditional CPU programming, if two threads want to update the same piece of data, we use a **mutex** (mutual exclusion) to "lock" the data, perform the update, and then "unlock" it. On a GPU, this is a disaster. Because thousands of threads are running in lock-step (**SIMT**—Single Instruction, Multiple Threads), if one thread takes a lock and the others wait, the entire GPU can grind to a halt—a situation called **deadlock**. + +A **Lock-Free** algorithm is one that guarantees that at least one thread in the system will make progress in a finite number of steps. Instead of locking, we use **Atomic Operations**. These are special hardware instructions that perform a "Read-Modify-Write" sequence in a single, uninterruptible step. + +In our linked list example, we use `InterlockedExchange` (or `atomicExchange` in GLSL). This instruction says: "Take this new value, put it in memory, and give me whatever was there before—all without letting any other thread touch that memory location in between." + +Because every thread successfully completes its "exchange" and gets a unique `oldHead`, no thread ever has to wait for another. They all make progress simultaneously. This is the essence of being lock-free on the GPU. + +== Building Lock-Free Linked Lists + +Linked lists are the foundation of many GPU algorithms, particularly for **Order-Independent Transparency (OIT)**. In a per-pixel linked list, every pixel in the framebuffer stores a "head" pointer to a list of transparent fragments that hit that pixel. + +=== The Anatomy of a GPU Linked List + +A GPU-resident linked list consists of three main components: + +1. **The Head Buffer**: A 2D texture or buffer (matching the screen resolution) that stores the index of the first node for each pixel. It is initialized to a "null" value (e.g., `0xFFFFFFFF`). +2. **The Node Pool**: A large linear buffer that stores the actual data for every fragment. +3. **The Atomic Counter**: A single integer used to "allocate" nodes from the pool. + +[source,slang] +---- +struct Node { + float4 color; // Fragment color + float depth; // Fragment depth + uint nextIdx; // Index of the next node in the pool +}; + +RWStructuredBuffer headBuffer; // size: width * height +RWStructuredBuffer nodePool; // size: Max total fragments +RWStructuredBuffer counter; // size: 1 +---- + +When a fragment is processed: + +1. The thread atomically increments the `counter` to get a unique `newNodeIdx`. +2. The thread uses `InterlockedExchange` on the `headBuffer` at its pixel location. It writes `newNodeIdx` and receives the `oldHead`. +3. The thread writes its data and the `oldHead` (as `nextIdx`) into `nodePool[newNodeIdx]`. + +This structure allows thousands of fragments to be added to millions of different lists simultaneously without ever needing a global lock. + +=== Beyond Exchange: Compare-and-Swap (CAS) + +While `InterlockedExchange` is great for simple lists, more complex structures (like thread-safe queues) often need **Compare-and-Swap (CAS)**, exposed as `InterlockedCompareExchange` in Slang. + +CAS works like this: "Only update this memory if its current value matches my 'expected' value." If it doesn't match, it means another thread changed the data first. In that case, our thread must "retry" the operation with the new value. This "loop until success" pattern is common in advanced lock-free programming and is much more efficient than a traditional lock because threads only wait if there is actual contention, and they never leave the hardware scheduler. + +=== GLSL: atomicAdd and 64-bit Atomics + +In GLSL, you use the `atomicAdd` and `atomicExchange` functions. For 64-bit atomics, you must enable the `GL_EXT_shader_atomic_int64` extension. + +[source,glsl] +---- +#extension GL_EXT_shader_atomic_int64 : enable + +layout(binding = 0) buffer HeadBuffer { uint64_t heads[]; }; +layout(binding = 1) buffer Counter { uint64_t count; }; + +void addNode(uint pixelIdx, Node newNode) { + // 64-bit atomic add to a global counter + uint64_t newNodeIdx = atomicAdd(count, 1UL); + + // 64-bit atomic exchange to update the head pointer + uint64_t oldHead = atomicExchange(heads[pixelIdx], newNodeIdx); + + // ... update node and next pointer +} +---- + +While Slang provides a more unified `InterlockedAdd` that works across different bit-widths, GLSL requires being explicit about the extensions and the types (e.g., using `1UL` for 64-bit literals). + +While the example above uses 32-bit indices for simplicity, 64-bit atomics allow you to do this across different buffers or even different memory types using raw pointers. + +== Building Work Queues + +A **Work Queue** is a list of tasks that the GPU needs to perform. In a **GPU-Driven Pipeline**, one compute dispatch might generate a list of objects that need to be culled, and then another dispatch might process that list. + +=== The Anatomy of a Work Queue + +A work queue is essentially a **producer-consumer** structure. On the GPU, this is typically implemented as a **Linear Buffer** with an atomic counter, or a **Ring Buffer** for persistent workloads. + +[source,slang] +---- +struct Task { + uint objectID; + uint drawCommandIdx; +}; + +struct WorkQueue { + RWStructuredBuffer data; // Storage for pending tasks + RWStructuredBuffer counter; // Number of tasks currently in the queue +}; + +void pushTask(WorkQueue queue, Task myTask) { + uint slot; + // Atomic increment to claim a unique slot + InterlockedAdd(queue.counter[0], 1, slot); + + // Check for buffer overflow! + if (slot < MAX_QUEUE_SIZE) { + queue.data[slot] = myTask; + } +} +---- + +By using a global work queue, you can handle variable-sized workloads without ever returning to the CPU. + +== Optimizing Atomics with Subgroups + +Atomics are relatively expensive because they have to be coordinated across the entire GPU. If thousands of threads are all trying to add to the same counter, the hardware will serialize them, leading to a massive performance drop. + +As we discussed in Chapter 4, you can use **Subgroup Operations** to **coalesce** (combine multiple operations into one) these atomics. Instead of every thread calling `InterlockedAdd`, you can have the threads in a subgroup perform a **Subgroup Reduction** to calculate the total amount they need to add, pick one "leader" thread to perform a single atomic add for the whole subgroup, and then distribute the resulting base index to the other threads. + +This simple optimization can improve the throughput of global atomics by 32x or 64x, making complex data structures viable for even the most demanding real-time applications. + +xref:02_gpu_resident_trees.adoc[Previous: GPU-Resident Trees] | xref:04_device_addressable_buffers.adoc[Next: Device-Addressable Buffers] diff --git a/en/Advanced_Vulkan_Compute/07_Advanced_Data_Structures/04_device_addressable_buffers.adoc b/en/Advanced_Vulkan_Compute/07_Advanced_Data_Structures/04_device_addressable_buffers.adoc new file mode 100644 index 00000000..7b3b0920 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/07_Advanced_Data_Structures/04_device_addressable_buffers.adoc @@ -0,0 +1,101 @@ +:pp: {plus}{plus} + += Device-Addressable Buffers: Pointer-like Flexibility + +== The End of Descriptor Set Hell + +If you've spent any time with Vulkan, you know the pain of **Descriptor Sets**. Managing layouts, updating pools, and binding sets before every draw or dispatch call is one of the most boilerplate-heavy parts of the API. + +But what if you didn't have to bind anything? What if you could just pass a raw 64-bit address to your shader and have it access the memory directly, just like a pointer in C{pp}? This is what **Buffer Device Address (BDA)** allows. + +== What is BDA? + +**Buffer Device Address** (available since Vulkan 1.2 and core in 1.4) allows you to query a 64-bit GPU address for any `VkBuffer`. This address is a raw pointer that can be stored in other buffers, passed to shaders via push constants, or even used to build complex, linked data structures across different memory regions. + +To use BDA, you must enable the `bufferDeviceAddress` feature and create your buffers with the `VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT`. + +[source,cpp] +---- +// C++ side: Obtaining a device address +vk::BufferDeviceAddressInfo info { + .buffer = *myBuffer // Extracting the handle from a vk::raii::Buffer +}; +uint64_t myGPUAddress = device.getBufferAddress(info); + +// Pass myGPUAddress to a shader via a push constant! +---- + +== BDA in Shaders + +In Slang or GLSL, you can treat this 64-bit address as a raw pointer. This completely bypasses the need for descriptor sets for many use cases. + +[source,slang] +---- +// Slang example of using BDA +struct MyData { + float4 value; + MyData* next; // A raw BDA pointer! +}; + +// We receive the starting address as a 64-bit integer (uint64_t) +void process(uint64_t startAddress) { + MyData* p = (MyData*)startAddress; + + // We can traverse the structure just like in C++! + while (p != nullptr) { + doSomething(p->value); + p = p->next; + } +} +---- + +=== GLSL: buffer_reference + +In GLSL, this requires the `GL_EXT_buffer_reference` and `GL_EXT_shader_explicit_arithmetic_types_int64` extensions. Instead of raw C{pp} pointers, you use the `buffer_reference` keyword to define "pointers" to buffer blocks. + +[source,glsl] +---- +#extension GL_EXT_buffer_reference : enable +#extension GL_EXT_shader_explicit_arithmetic_types_int64 : enable + +// Define a buffer block as a reference type +layout(buffer_reference, std430) buffer MyData { + vec4 value; + MyData next; // Pointer-like reference to another MyData +}; + +layout(push_constant) uniform Constants { + MyData startPtr; // We receive the 64-bit address as a reference +}; + +void main() { + MyData p = startPtr; + + while (uint64_t(p) != 0) { + doSomething(p.value); + p = p.next; + } +} +---- + +While the Slang syntax is much closer to C{pp}, both produce the same low-level **SPIR-V** instructions for 64-bit address calculation and memory access. + +== Why BDA is a Game-Changer + +1. **Zero Binding Overhead**: You can pass thousands of buffer addresses to a single shader via a single push constant or a "pointer buffer," completely bypassing the CPU cost of managing descriptor pools and sets. +2. **Complex Data Structures**: You can build real linked lists, trees, and graphs where nodes contain actual 64-bit pointers to other nodes, allowing for "pointer chasing" that was previously impossible. +3. **Heterogeneous Programming**: BDA is the foundation for SYCL's **Unified Shared Memory (USM)**. It bridges the gap between the pointer-based world of C{pp} and the explicit world of Vulkan. + +=== The Cost of Freedom: Safety and Performance + +With great power comes great responsibility. Unlike Descriptor Sets, where the Vulkan validation layers can often catch out-of-bounds access, **BDA is raw and unchecked**. If you access an invalid address or go out of bounds, you won't get a helpful error message—you'll likely trigger a **GPU hang** (where the screen freezes) or a "Device Lost" error. + +Performance-wise, BDA is generally as fast as standard buffer access. However, because the hardware doesn't know the size of the buffer being accessed, it can't always perform the same cache optimizations as it does with explicit descriptors. For most advanced compute tasks, the flexibility of raw pointers far outweighs these minor trade-offs. + +== Conclusion + +By combining 64-bit atomics, subgroup operations, and raw buffer device addresses, we have all the tools we need to build complex, autonomous data structures on the GPU. We are no longer limited by the "flat array" model of traditional compute. + +In the next chapter, we'll see how to take this a step further and use these structures to drive the entire rendering pipeline directly from the GPU: **Indirect Dispatch and GPU-Driven Pipelines**. + +xref:03_global_atomic_management.adoc[Previous: Global Atomic Management] | xref:../08_GPU_Driven_Pipelines/01_introduction.adoc[Next: Indirect Dispatch] diff --git a/en/Advanced_Vulkan_Compute/08_GPU_Driven_Pipelines/01_introduction.adoc b/en/Advanced_Vulkan_Compute/08_GPU_Driven_Pipelines/01_introduction.adoc new file mode 100644 index 00000000..00efb882 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/08_GPU_Driven_Pipelines/01_introduction.adoc @@ -0,0 +1,37 @@ +:pp: {plus}{plus} + += Indirect Dispatch and GPU-Driven Pipelines + +== Introduction + +In traditional Vulkan applications, the CPU is the "conductor" of the orchestra. It decides what to draw, how many threads to dispatch, and which resources to bind. The GPU is simply a "performer" that executes the commands the CPU gives it. + +However, as scenes become more complex—with millions of dynamic objects and complex physics—the CPU can no longer keep up. The overhead of the CPU calculating which objects are visible and then recording thousands of command buffers becomes the primary bottleneck. + +In this chapter, we'll explore **GPU-Driven Pipelines**, where the GPU takes over the role of the conductor. + +== Moving Beyond Static Dispatches + +A static dispatch (`vkCmdDispatch`) requires the CPU to know exactly how many workgroups to run. If you're doing something like object culling, the CPU doesn't know how many objects will pass the cull until the GPU has finished its work. + +With **Indirect Dispatch** (`vkCmdDispatchIndirect`), the CPU doesn't provide the dispatch size. Instead, it provides a **Vulkan Buffer** that contains the dispatch parameters. The GPU itself can then write to this buffer, effectively deciding how much work it needs to do. + +== The Autonomous GPU + +GPU-driven pipelines take this even further with features like: + +1. **GPU-Side Command Generation**: Utilizing modern engine features to build entire chains of commands on the GPU, allowing it to "decide" its own execution path. +2. **Multi-Draw Indirect (MDI)**: A feature allowing a single compute dispatch to generate thousands of draw calls, effectively rendering an entire scene without a single CPU-side loop. +3. **Variable-Sized Workloads**: Handling everything from particle systems to high-fidelity culling without any CPU-side intervention. + +== Why This Matters + +By moving the "decision-making" to the GPU, we can: + +* **Eliminate CPU Bottlenecks**: Free up the CPU for AI, game logic, and other tasks. +* **Minimize Latency**: Eliminate the round-trip delay between a GPU's compute analysis and its subsequent rendering. +* **Scale to Millions**: Handle scene complexity that would be impossible with traditional CPU-bound pipelines. + +In this chapter, we'll learn how to build these autonomous pipelines, starting with the fundamental building block: **Indirect Dispatch**. + +xref:../07_Advanced_Data_Structures/04_device_addressable_buffers.adoc[Previous: Device-Addressable Buffers] | xref:02_indirect_dispatch.adoc[Next: Indirect Dispatch] diff --git a/en/Advanced_Vulkan_Compute/08_GPU_Driven_Pipelines/02_indirect_dispatch.adoc b/en/Advanced_Vulkan_Compute/08_GPU_Driven_Pipelines/02_indirect_dispatch.adoc new file mode 100644 index 00000000..8cb65324 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/08_GPU_Driven_Pipelines/02_indirect_dispatch.adoc @@ -0,0 +1,97 @@ +:pp: {plus}{plus} + += Indirect Dispatch: Building Parameters on the GPU + +== The Core of GPU Autonomy + +In a traditional compute pipeline, the CPU calls `vkCmdDispatch(x, y, z)`. The values of `x, y, z` are fixed at the moment the command buffer is recorded. + +But what if the number of workgroups you need depends on the result of a previous compute shader? For example, if you're culling a list of objects, only the GPU knows how many survived. + +**Indirect Dispatch** (`vkCmdDispatchIndirect`) solves this by reading the workgroup counts from a **Vulkan Buffer** (a `VkBuffer`) instead of the command buffer. + +== How It Works + +1. **Preparation**: Create a buffer with the `VK_BUFFER_USAGE_INDIRECT_BUFFER_BIT`. +2. **GPU Update**: Run a "culling" or "analysis" compute shader. This shader calculates the number of workgroups needed for the next step and writes that value into the indirect buffer. +3. **The Dispatch**: The CPU records a call to `vkCmdDispatchIndirect(myIndirectBuffer, offset)`. + +[source,cpp] +---- +// The layout of the data in the indirect buffer (matching vk::DispatchIndirectCommand) +struct IndirectCommand { + uint32_t x; + uint32_t y; + uint32_t z; +}; +---- + +== Writing the Indirect Command from a Shader + +To use this, your compute shader (the "producer") must write to a buffer that matches the `VkDispatchIndirectCommand` layout. + +[source,slang] +---- +// Slang example: Writing the dispatch counts +struct IndirectCommand { + uint3 x; +}; + +[[vk::binding(0, 0)]] +RWStructuredBuffer cmdBuffer; + +[numthreads(1, 1, 1)] +void main() { + uint numWorkgroups = calculateRequiredWorkgroups(); + cmdBuffer[0].x = uint3(numWorkgroups, 1, 1); +} +---- + +=== GLSL: Manual Buffer Layout + +In GLSL, you define a `buffer` block that matches the expected structure. It's crucial to use the correct alignment (`std430`) to ensure the GPU reads the values at the correct offsets. + +[source,glsl] +---- +layout(std430, binding = 0) buffer IndirectBuffer { + uint x; + uint y; + uint z; +} cmd; + +void main() { + uint numWorkgroups = calculateRequiredWorkgroups(); + cmd.x = numWorkgroups; + cmd.y = 1; + cmd.z = 1; +} +---- + +The "win" here is that by using the same buffer in your `vkCmdDispatchIndirect` call, the GPU can autonomously determine its own workload size without any CPU intervention. + +== Synchronization is Key + +Because the GPU is writing to the buffer that it will later read from, you must ensure that the write has finished and is **visible** to the indirect dispatch hardware. + +This requires a **Vulkan Barrier** with the following settings: + +* `srcStage`: `VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT` +* `dstStage`: `VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT` +* `srcAccess`: `VK_ACCESS_SHADER_WRITE_BIT` +* `dstAccess`: `VK_ACCESS_INDIRECT_COMMAND_READ_BIT` + +Failure to include this barrier will result in the GPU reading "garbage" or stale data, leading to incorrect dispatches or even device crashes. + +== Practical Example: Variable-Sized Workloads + +Imagine you have a particle system where particles can die or be born every frame. + +1. **Dispatch 1 (Cull)**: A compute shader iterates over all particles, calculates which ones are alive, and stores their IDs in a "live" buffer. It also increments an atomic counter. +2. **Barrier**: Wait for the cull to finish and make the counter visible to the indirect hardware. +3. **Dispatch 2 (Update)**: Call `vkCmdDispatchIndirect`. The GPU reads the counter and dispatches exactly enough workgroups to update only the alive particles. + +This approach is much more efficient than always dispatching for the "maximum" number of particles, which would result in thousands of idle threads. + +In the next section, we'll look at how the GPU can go beyond just changing its dispatch size and start generating its own **Command Chains**. + +xref:01_introduction.adoc[Previous: Introduction to GPU-Driven Pipelines] | xref:03_gpu_side_command_generation.adoc[Next: GPU-Side Command Generation] diff --git a/en/Advanced_Vulkan_Compute/08_GPU_Driven_Pipelines/03_gpu_side_command_generation.adoc b/en/Advanced_Vulkan_Compute/08_GPU_Driven_Pipelines/03_gpu_side_command_generation.adoc new file mode 100644 index 00000000..a982eba8 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/08_GPU_Driven_Pipelines/03_gpu_side_command_generation.adoc @@ -0,0 +1,41 @@ +:pp: {plus}{plus} + += GPU-Side Command Generation: Autonomous Execution + +== Building Command Chains on the GPU + +In a typical Vulkan application, the CPU records a series of commands: `vkCmdBindPipeline`, `vkCmdBindDescriptorSets`, `vkCmdDispatch`, etc. These commands are "baked" into the command buffer. + +With **GPU-Side Command Generation** (often utilizing the engine's built-in indirect buffers and BDA), we can go a step further. Instead of the CPU deciding the entire sequence of commands, the GPU can build a list of "commands" that it wants to execute. + +This is a key component of **GPU-Driven Pipelines**. The GPU can analyze its own state and decide to: + +1. Dispatch a compute shader to update a set of physics. +2. Dispatch another compute shader to build a BVH. +3. Generate a series of draw calls to render the updated scene. + +== How It Works: The Buffer-First Approach + +In a GPU-driven pipeline, the CPU typically records a "master" compute dispatch. This dispatch iterates over your objects or tasks and writes to a series of **Indirect Command Buffers**: + +* **Dispatch Indirect Buffer**: Stores the `x, y, z` parameters for future compute shaders. +* **Draw Indirect Buffer**: Stores the `vertexCount`, `instanceCount`, etc., for future rendering. +* **Resource Buffer**: Stores the raw pointers (Buffer Device Address) that those shaders and draws will need. + +== Why This Matters for Performance + +When the GPU generates its own commands, the CPU is completely out of the loop. There is no longer any need to: + +* **Re-record Command Buffers**: No CPU overhead for every frame. +* **CPU-Side Culling**: No "back-and-forth" data exchange. +* **Synchronization Overhead**: Synchronization happens entirely within the GPU's command stream. + +== The Power of Autonomy + +This autonomous model allows for **Single-Pass Rendering** (performing culling and drawing in a single GPU pass). Instead of the CPU having to wait for the GPU's culling result to know what to draw, the GPU can cull objects and then draw them in the same command stream. + +This is the standard architecture for many modern, high-end rendering engines. It scales to millions of objects because the cost of "culling and drawing" is independent of the CPU's performance. + +In the next section, we'll look at the final piece of the puzzle: **Multi-Draw Indirect (MDI)**, which bridges our compute analysis to the graphics pipeline. + +xref:02_indirect_dispatch.adoc[Previous: Indirect Dispatch] | xref:04_multi_draw_indirect_mdi.adoc[Next: Multi-Draw Indirect (MDI)] diff --git a/en/Advanced_Vulkan_Compute/08_GPU_Driven_Pipelines/04_multi_draw_indirect_mdi.adoc b/en/Advanced_Vulkan_Compute/08_GPU_Driven_Pipelines/04_multi_draw_indirect_mdi.adoc new file mode 100644 index 00000000..6a0c7602 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/08_GPU_Driven_Pipelines/04_multi_draw_indirect_mdi.adoc @@ -0,0 +1,47 @@ +:pp: {plus}{plus} + += Multi-Draw Indirect (MDI): Bridging Compute to Graphics + +== Bridging the Gap + +Throughout this chapter, we've focused on how to make compute shaders more autonomous. But the final goal of most graphics applications is... well, graphics. We need a way to take the results of our compute-based culling or analysis and turn them into draw calls. + +**Multi-Draw Indirect (MDI)** is the ultimate bridge between the compute and graphics pipelines. It allows a single Vulkan command to execute an arbitrary number of draw calls, where the parameters for each draw call come from a GPU-side buffer. + +== How It Works: The MDI Pipeline + +1. **Cull Phase (Compute)**: A compute shader analyzes your scene (e.g., millions of objects) and decides which ones are visible. It writes the `vertexCount`, `instanceCount`, etc., for each visible object into a large **Indirect Buffer**. +2. **Count Buffer**: The compute shader also keeps an atomic counter of how many objects were visible and writes this count into a separate **Draw Count Buffer**. +3. **The Draw (Graphics)**: The CPU calls `vkCmdDrawIndexedIndirectCount`. This single command tells the GPU to read its own counts and parameters and draw the objects. + +[source,cpp] +---- +// The layout of the data in the MDI buffer (matching vk::DrawIndexedIndirectCommand) +struct IndirectDrawCommand { + uint32_t indexCount; + uint32_t instanceCount; + uint32_t firstIndex; + int32_t vertexOffset; + uint32_t firstInstance; +}; +---- + +== Why MDI is Essential for GPU-Driven Rendering + +Without MDI, the CPU would have to read back the visibility count from the GPU and then record a separate `vkCmdDraw` for every visible object. For a scene with 10,000 visible objects, that would be 10,000 CPU calls and 10,000 command records every frame. + +With MDI, those 10,000 objects are rendered with **one command**. This is how modern engines can handle massive "culling-first" architectures. + +== Best Practices for MDI + +* **Max Draw Count**: Always specify a reasonable maximum draw count in the `vkCmdDrawIndexedIndirectCount` call to prevent the GPU from over-reading its buffers in case of errors. +* **Buffer Alignment**: Ensure that your indirect buffer follows the correct alignment and stride requirements for your hardware. +* **Combine with BDA**: Use **Buffer Device Address** (from Chapter 7) to pass object-specific data (like materials and transforms) to your shaders, bypassing traditional descriptor sets. + +== Conclusion: The Future is GPU-Driven + +By mastering **Indirect Dispatch**, **GPU-Side Command Generation**, and **Multi-Draw Indirect**, you've moved from a traditional "CPU-lead" pipeline to a modern "GPU-driven" architecture. Your applications are now more scalable, lower latency, and more efficient. + +In the next chapter, we'll look at how to coordinate these heavy compute workloads with your graphics rendering using **Asynchronous Compute Orchestration**. + +xref:03_gpu_side_command_generation.adoc[Previous: GPU-Side Command Generation] | xref:../09_Asynchronous_Compute/01_introduction.adoc[Next: Asynchronous Compute Orchestration] diff --git a/en/Advanced_Vulkan_Compute/09_Asynchronous_Compute/01_introduction.adoc b/en/Advanced_Vulkan_Compute/09_Asynchronous_Compute/01_introduction.adoc new file mode 100644 index 00000000..2e76112f --- /dev/null +++ b/en/Advanced_Vulkan_Compute/09_Asynchronous_Compute/01_introduction.adoc @@ -0,0 +1,20 @@ +:pp: {plus}{plus} += Asynchronous Compute Orchestration + +In the earlier chapters, we've focused heavily on making individual kernels as fast as possible. We've optimized memory access, leveraged subgroups, and even built entire data structures on the GPU. But there's a higher level of optimization that often goes overlooked: how we schedule these dispatches alongside the rest of the engine's workload. + +Modern GPUs aren't just single, monolithic processors; they are complex systems with multiple hardware engines capable of working in parallel. To understand asynchronous compute, we first have to understand the physical hardware. A typical high-performance GPU has several specialized engines: + +* **Graphics Engine**: The primary engine, capable of vertex processing, rasterization, and fragment shading, as well as general-purpose compute. +* **Asynchronous Compute Engine (ACE)**: A dedicated scheduler and hardware path for compute dispatches. These can often run entirely in parallel with the graphics engine, using compute units (CUs) or streaming multiprocessors (SMs) that aren't being fully utilized by the graphics workload. +* **Transfer/Copy Engine**: A specialized DMA (Direct Memory Access) engine for moving data between host and device memory without consuming any compute resources. + +Vulkan exposes these hardware engines through **Queue Families**. Each family has a set of **capabilities** (e.g., `VK_QUEUE_GRAPHICS_BIT`, `VK_QUEUE_COMPUTE_BIT`, `VK_QUEUE_TRANSFER_BIT`). While the main graphics queue family usually supports everything, a "Dedicated Compute" or "Async Compute" family might *only* support compute and transfer. + +By using separate compute queues from these dedicated families, we can overlap heavy compute dispatches—like path-trace denoising, physics simulations, or complex AI pathfinding—with the main graphics rendering pass. While the graphics hardware is busy processing geometry and rasterizing triangles, the compute units can be simultaneously crunching numbers for your simulation. + +In this chapter, we're going to move beyond the simple "one queue for all" model. We'll explore how to use Vulkan's **Synchronization 2** (`VK_KHR_synchronization2`) to orchestrate complex, concurrent workloads without causing **pipeline stalls** (where the GPU sits idle waiting for a resource). We'll also look at **Queue Priority**, a feature that allows us to tell the hardware which tasks are truly latency-critical, ensuring that a background simulation doesn't delay a time-sensitive physics update. + +Orchestrating these workloads requires a shift in how we think about the GPU's timeline. It's no longer just a linear sequence of commands, but a multi-lane highway where different types of traffic can move at different speeds, occasionally merging or yielding to ensure the overall throughput is maximized. + +xref:../08_GPU_Driven_Pipelines/04_multi_draw_indirect_mdi.adoc[Previous: Multi-Draw Indirect (MDI)] | xref:02_concurrent_execution.adoc[Next: Concurrent Execution] diff --git a/en/Advanced_Vulkan_Compute/09_Asynchronous_Compute/02_concurrent_execution.adoc b/en/Advanced_Vulkan_Compute/09_Asynchronous_Compute/02_concurrent_execution.adoc new file mode 100644 index 00000000..b72ed05b --- /dev/null +++ b/en/Advanced_Vulkan_Compute/09_Asynchronous_Compute/02_concurrent_execution.adoc @@ -0,0 +1,64 @@ +:pp: {plus}{plus} += Concurrent Execution and Synchronization 2 + +To achieve true parallelism between graphics and compute, we need to talk about the Vulkan timeline. In a standard single-queue setup, everything happens sequentially—you dispatch compute, wait for it to finish, and then begin your graphics work. This is straightforward but inefficient. By using multiple queues, we can submit work to a dedicated **compute queue** and have the GPU execute it alongside a main **graphics queue**. + +The challenge, however, isn't just submitting the work; it's making sure it stays synchronized where it matters. If our compute dispatch is generating a texture that the graphics pass needs, we can't let the graphics start reading until the compute is done. But we *can* let the graphics do everything else—like clearing buffers, processing vertices, or even rasterizing other, unrelated objects. + +This is where **Synchronization 2** (`VK_KHR_synchronization2`) shines. The older Vulkan synchronization was powerful but notoriously complex and difficult to read. It relied on bitmasks for pipeline stages and access types that were often redundant. Synchronization 2 simplifies this by grouping them into more logical structures and, more importantly, it introduces a more robust way to express **dependency chains** across different queues. + +== Async Compute vs. Concurrent Execution +It's important to distinguish between "overlapping" work on a single queue and "asynchronous" work on separate hardware queues. + +On a single queue, the GPU can still overlap work—for example, it can start a new vertex shader while fragment shaders from a previous draw are still finishing. This is **Concurrent Execution**. However, it still follows a single command stream. + +**Asynchronous Compute** uses separate hardware engines (ACE) to feed the compute units (CU/SM). This means the compute engine is pulling commands from a completely different memory stream than the graphics engine. This is where true parallelism happens, allowing the GPU to keep its ALUs (Arithmetic Logic Units) saturated even when the graphics engine is bottlenecked by other factors like the ROPs (Raster Output Processors) or fixed-function geometry hardware. + +== Queue Ownership Transfer: The Handshake +One of the most cryptic, but essential, parts of multi-queue Vulkan is **Queue Ownership Transfer**. Most Vulkan resources (like `VkBuffer` or `VkImage`) are created with a sharing mode of `VK_SHARING_MODE_EXCLUSIVE` by default. This means they are owned by exactly one queue family at a time. + +To move a resource from a compute queue to a graphics queue, you must perform a "handshake" consisting of two parts: + +1. **Release**: A barrier on the **source** queue that "releases" ownership. +2. **Acquire**: A barrier on the **destination** queue that "acquires" ownership. + +If you omit either part, you have **undefined behavior** and potential data corruption. Synchronization 2 makes this explicit by including the `srcQueueFamilyIndex` and `dstQueueFamilyIndex` in the barrier structures. + +[source,cpp] +---- +// ON THE COMPUTE QUEUE (Source) +vk::ImageMemoryBarrier2 releaseBarrier { + .srcStageMask = vk::PipelineStageFlagBits2::eComputeShader, + .srcAccessMask = vk::AccessFlagBits2::eShaderWrite, + .dstStageMask = vk::PipelineStageFlagBits2::eNone, // We don't care about the destination stage yet + .dstAccessMask = vk::AccessFlagBits2::eNone, // Or the access mask + .oldLayout = vk::ImageLayout::eGeneral, + .newLayout = vk::ImageLayout::eShaderReadOnlyOptimal, + .srcQueueFamilyIndex = computeQueueFamilyIndex, + .dstQueueFamilyIndex = graphicsQueueFamilyIndex, + .image = *sharedImage // Extract handle from vk::raii::Image +}; +// ... (subresourceRange setup) + +// ON THE GRAPHICS QUEUE (Destination) +vk::ImageMemoryBarrier2 acquireBarrier { + .srcStageMask = vk::PipelineStageFlagBits2::eNone, // We don't care about the source stage here + .srcAccessMask = vk::AccessFlagBits2::eNone, // Or the access mask + .dstStageMask = vk::PipelineStageFlagBits2::eFragmentShader, + .dstAccessMask = vk::AccessFlagBits2::eShaderRead, + .oldLayout = vk::ImageLayout::eGeneral, + .newLayout = vk::ImageLayout::eShaderReadOnlyOptimal, + .srcQueueFamilyIndex = computeQueueFamilyIndex, + .dstQueueFamilyIndex = graphicsQueueFamilyIndex, + .image = *sharedImage // Extract handle from vk::raii::Image +}; +// ... (subresourceRange setup) +---- + +Notice how the `oldLayout` and `newLayout` must **match exactly** in both barriers. This is a critical requirement. The "Release" barrier ensures the memory is **available** (written out to memory/L2 cache), and the "Acquire" barrier ensures it is **visible** (invalidating read caches on the destination engine). + +The real "magic" happens when we use **Semaphore-based synchronization** (using `VkSemaphore` objects to coordinate work between queues) between queues. We submit our compute workload with a "signal" semaphore, and our graphics workload with a "wait" semaphore. The GPU handles the internal scheduling, stalling the graphics queue only when it reaches the specific pipeline stage that needs the compute result. This allows the GPU's hardware scheduler to keep the compute units busy during the geometry-heavy parts of the graphics pass, effectively "hiding" the cost of the compute work. + +Remember, though, that not all hardware is created equal. Some mobile GPUs have unified hardware for compute and graphics, where "concurrency" might just mean the scheduler interleaved the tasks. High-end desktop GPUs, on the other hand, often have dedicated compute pipes that can run entirely in parallel with the graphics engines. Profiling is your only way to know if your orchestration is truly delivering the performance gains you expect. + +xref:01_introduction.adoc[Previous: Introduction] | xref:03_timeline_semaphores.adoc[Next: Timeline Semaphores] diff --git a/en/Advanced_Vulkan_Compute/09_Asynchronous_Compute/03_timeline_semaphores.adoc b/en/Advanced_Vulkan_Compute/09_Asynchronous_Compute/03_timeline_semaphores.adoc new file mode 100644 index 00000000..5571693c --- /dev/null +++ b/en/Advanced_Vulkan_Compute/09_Asynchronous_Compute/03_timeline_semaphores.adoc @@ -0,0 +1,50 @@ +:pp: {plus}{plus} += Timeline Semaphores: Unified Synchronization + +While binary semaphores (the classic `VkSemaphore`) are useful for simple "wait/signal" relationships between queues, they quickly become a management nightmare in complex asynchronous pipelines. Each binary semaphore can only be signaled once before it must be waited on and reset, leading to a proliferation of semaphore objects that are difficult to track. + +This is why **Timeline Semaphores** (introduced in Vulkan 1.2 and via `VK_KHR_timeline_semaphore`) are a game-changer for asynchronous compute. Instead of a simple boolean "on/off" state, a timeline semaphore contains a monotonically increasing **64-bit integer value**. + +== The Power of a Single Value +With a timeline semaphore, you don't just wait for a semaphore to be signaled; you wait for it to reach a **specific value**. This allows you to represent an entire timeline of work with a single object. For example: + +* **Value 10**: Physics simulation finished. +* **Value 11**: Denoising pass finished. +* **Value 12**: Frame ready for UI composition. + +Different queues can signal the same semaphore with different values, and other queues can wait for exactly the level of progress they need. + +== Wait-Before-Signal (The Host Side) +One of the most powerful features of timeline semaphores is that you can submit a command buffer that waits for a value that **hasn't been reached yet**. In fact, the signal operation doesn't even have to be submitted to the GPU when the wait is submitted. + +This allows the CPU to build complex dependency graphs and submit them all at once to different queues. The GPU hardware will handle the stalls and wake-ups automatically as the counter increments. + +[source,cpp] +---- +// Defining a wait for a specific timeline value +vk::TimelineSemaphoreSubmitInfo timelineInfo { + .waitSemaphoreValueCount = 1, + .pWaitSemaphoreValues = &requiredValue, + .signalSemaphoreValueCount = 1, + .pSignalSemaphoreValues = &newValue +}; + +vk::SubmitInfo submitInfo { + .pNext = &timelineInfo, + .waitSemaphoreCount = 1, + .pWaitSemaphores = &*timelineSemaphore // Extract handle from vk::raii::Semaphore +}; +// ... +---- + +== Host Querying and Waiting +Timeline semaphores also bridge the gap between the GPU and the CPU. The CPU can query the current value of a semaphore at any time using `vkGetSemaphoreCounterValue`. Even better, the CPU can block until a semaphore reaches a certain value using `vkWaitSemaphores`. + +This replaces the need for `VkFence` in many scenarios. Instead of waiting for an entire command buffer to finish (which is what a fence does), the CPU can wait for a specific point in the GPU's timeline. This is incredibly useful for **pipelined resource management**—for example, the CPU can wait for the GPU to reach value `N`, knowing that it's now safe to reuse a buffer that was used by the command that signaled value `N`. + +== Why it matters for Async Compute +In an asynchronous compute setup, you often have multiple streams of work with cross-dependencies. For instance, your physics engine (Compute Queue) might produce data needed by the particle system (Graphics Queue), which in turn produces data needed by the denoiser (Compute Queue). + +Using binary semaphores for this would require a complex web of "Signal A -> Wait A -> Signal B -> Wait B". With timeline semaphores, you simply have a single "Engine Timeline". Every task signals its completion by incrementing the counter, and every dependent task waits for its specific prerequisite value. This drastically simplifies the orchestration logic and reduces the overhead of semaphore management. + +xref:02_concurrent_execution.adoc[Previous: Concurrent Execution] | xref:04_queue_priority.adoc[Next: Queue Priority] diff --git a/en/Advanced_Vulkan_Compute/09_Asynchronous_Compute/04_queue_priority.adoc b/en/Advanced_Vulkan_Compute/09_Asynchronous_Compute/04_queue_priority.adoc new file mode 100644 index 00000000..97164763 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/09_Asynchronous_Compute/04_queue_priority.adoc @@ -0,0 +1,39 @@ +:pp: {plus}{plus} += Queue Priority: Managing Latency-Critical Workloads + +When you're dealing with multiple queues, you're essentially handing the GPU multiple streams of work and letting its hardware scheduler decide how to prioritize them. By default, the scheduler tries to be fair, but in a high-performance engine, "fair" isn't always what you want. You might have a critical physics update that *must* finish before the next frame, while a background path-trace denoiser can afford to take a few extra milliseconds. + +Vulkan provides a way to influence this through **Queue Priority**. When you create a logical device (`VkDeviceCreateInfo`), you specify a `priority` for each queue within a queue family. This value is a floating-point number between 0.0 and 1.0, where 1.0 represents the highest priority. + +[source,cpp] +---- +// Setting up high-priority compute queues +float priorities[] = { 1.0f, 0.1f }; // High priority for physics, low for denoising +vk::DeviceQueueCreateInfo queueCreateInfo { + .queueFamilyIndex = computeQueueFamilyIndex, + .queueCount = 2, + .pQueuePriorities = priorities +}; +---- + +== How the Scheduler Uses Priority +It's important to understand that queue priority is a **hint**, not a guarantee. The exact behavior depends heavily on the hardware's internal scheduler. Most modern GPUs use one of two main strategies: + +1. **Strict Priority**: The scheduler will always pick a task from a higher-priority queue if one is ready. This is great for responsiveness but can lead to **starvation** of low-priority tasks if the high-priority queue is constantly busy. +2. **Weighted Round-Robin**: The scheduler assigns a certain percentage of execution time to each queue based on its priority. For example, a queue with priority 1.0 might get twice as many "scheduling slots" as a queue with priority 0.5. + +High-end desktop GPUs often have sophisticated hardware that can **preempt** a low-priority task (e.g., stop a long-running compute shader) to make room for a high-priority one. However, preemption is not free; it involves saving and restoring the GPU's state, which can take several microseconds. + +== Global Queue Priority +If you're building a system where latency is truly the only thing that matters—like a VR (Virtual Reality) compositor or an AR (Augmented Reality) spatial tracker—you might need even more control. This is where `VK_EXT_global_priority` comes in. This extension allows you to request **Real-Time** priority for a queue. + +Unlike standard queue priority, which only works relative to other queues on your device, global priority tells the driver (and the OS) that your workload is more important than even other applications running on the same GPU. Use this sparingly, as it can cause stuttering in the rest of the system if used incorrectly. + +== Avoiding the "Single-Queue Trap" +A common mistake is to create multiple high-priority queues within the same queue family. If you do this, you've essentially returned to a "first-come, first-served" model. The hardware scheduler can only do its job if you provide clear, distinct priorities. + +Another critical consideration is **Queue Family** (a group of queues with similar capabilities) selection. Some Vulkan implementations offer multiple queue families, each with different capabilities. For example, a "Dedicated Compute" queue family might have specialized hardware for compute dispatches that don't share any resources with the graphics engine, making them more efficient and less likely to cause **pipeline bubbles** (gaps in the GPU's execution timeline). Always check the `VkQueueFamilyProperties` to understand what each queue family offers. + +In practice, managing queue priorities is a balancing act. Used correctly, it's a powerful tool for ensuring that your engine remains responsive and that the most critical tasks are always handled with the urgency they require. This orchestration is the hallmark of a truly advanced Vulkan engine—moving beyond just "doing the work" to "doing the work in the right order at the right time." + +xref:03_timeline_semaphores.adoc[Previous: Timeline Semaphores] | xref:../10_Specialized_Math/01_introduction.adoc[Next: Specialized Math] diff --git a/en/Advanced_Vulkan_Compute/10_Specialized_Math/01_introduction.adoc b/en/Advanced_Vulkan_Compute/10_Specialized_Math/01_introduction.adoc new file mode 100644 index 00000000..b40ba814 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/10_Specialized_Math/01_introduction.adoc @@ -0,0 +1,12 @@ +:pp: {plus}{plus} += Cooperative Matrices and Specialized Math + +In the previous chapters, we've looked at how to move data efficiently and how to orchestrate complex workloads. But what happens when the workload itself is computationally intense? For a long time, GPU compute was synonymous with floating-point operations (FP32). However, modern hardware has evolved to include specialized units designed for a very specific type of work: high-speed linear algebra. + +This is the world of **Cooperative Matrices**. While you might have heard of "Tensor Cores" on NVIDIA or "Matrix Core" on AMD, Vulkan provides a vendor-neutral abstraction for these specialized units through the `VK_KHR_cooperative_matrix` extension (now part of Vulkan 1.4). These units aren't just for machine learning; they are incredibly powerful for any task that involves heavy matrix multiplication and accumulation (the **GEMM**—General Matrix-Matrix Multiplication—operation). + +Whether you're building a fluid simulation that requires solving large systems of linear equations or a signal processing pipeline that relies on complex transforms, Cooperative Matrices can provide a massive throughput boost. By performing small matrix multiplications directly in the hardware's specialized units, you can achieve performance that far exceeds what a standard compute shader loop could deliver. + +In this chapter, we're going to dive into how these specialized math units work. We'll explore how to use the `cooperative_matrix` types in Slang and GLSL, and we'll see how to leverage **Mixed Precision**—using FP16 or Int8 for calculations while maintaining accuracy where it counts. This is about more than just speed; it's about utilizing the full potential of modern GPU silicon for high-performance computing tasks. + +xref:../09_Asynchronous_Compute/04_queue_priority.adoc[Previous: Queue Priority] | xref:02_cooperative_matrices.adoc[Next: Cooperative Matrices] diff --git a/en/Advanced_Vulkan_Compute/10_Specialized_Math/02_cooperative_matrices.adoc b/en/Advanced_Vulkan_Compute/10_Specialized_Math/02_cooperative_matrices.adoc new file mode 100644 index 00000000..3e841ef4 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/10_Specialized_Math/02_cooperative_matrices.adoc @@ -0,0 +1,122 @@ +:pp: {plus}{plus} += Working with Cooperative Matrices + +[NOTE] +==== +This section requires the `VK_KHR_cooperative_matrix` extension and Vulkan 1.3 or higher. In Vulkan 1.4, this extension is promoted to the core API. +==== + +To understand why **Cooperative Matrices** are so powerful, we need to rethink how we approach matrix multiplication on a GPU. In a traditional "naive" loop, each invocation is responsible for calculating one or more elements of the result matrix. This involves a lot of redundant memory reads and is inherently bound by the hardware's standard floating-point throughput. + +Cooperative Matrices change the game by introducing a new way for a group of invocations (a **Subgroup**) to work together on a single matrix multiplication and accumulation (the **GEMM** operation). Instead of individual invocations working in isolation, the entire subgroup "cooperates" to perform the operation. + +== The Concept: Matrix Fragments and Subgroup Scope + +The key to cooperative matrices is the concept of a **Fragment**. When you declare a cooperative matrix type, the data is not stored in a single contiguous array that's accessible to any invocation. Instead, it's **distributed** across the invocations in the subgroup. + +Each invocation only owns a small piece of the matrix. You can think of it as the hardware "sharding" the matrix across its registers. This allows the GPU to use specialized hardware units (like **Tensor Cores** on NVIDIA or **Matrix Cores** on AMD) to perform the math directly on those registers without the overhead of traditional ALU instructions. + +Crucially, the operation happens at the **Subgroup Scope**. This means every invocation in a subgroup must participate in the load, multiply, and store operations simultaneously. If you try to call a cooperative matrix function inside a divergent branch where some members of the subgroup are inactive, you'll likely encounter undefined behavior or a GPU hang. + +The standard GEMM operation performed by these units is: +[latexmath] +++++ +D = A \times B + C +++++ +Where latexmath:[A] is an latexmath:[M \times K] matrix, latexmath:[B] is a latexmath:[K \times N] matrix, and latexmath:[C, D] are latexmath:[M \times N] matrices. + +== Memory Layout: Strides and Majorness + +When loading fragments from memory, you must specify how the matrix is laid out in your buffer. + +1. **Row-Major vs. Column-Major**: Most Vulkan applications prefer **Row-Major** (where elements of a row are contiguous). +2. **Stride**: This is the distance (in elements, not bytes) between the start of one row and the start of the next. For a simple tightly-packed matrix, the stride is equal to the number of columns. + +If your buffer contains a large matrix and you are only loading a small $16 \times 16$ tile, the stride would be the width of the *entire* large matrix. + +== Slang: Tiled Matrix Multiplication + +Slang treats cooperative matrices as first-class types, allowing for expressive tiled algorithms. Here is how you might implement a block of a larger matrix multiply: + +[source,slang] +---- +import slang_vulkan_compute; + +// Matrix dimensions supported by the physical device +const int M = 16; +const int N = 16; +const int K = 16; + +struct Params { + uint64_t addrA, addrB, addrC; + uint32_t strideA, strideB, strideC; + uint32_t totalK; +}; + +ParameterBlock cb; + +[numthreads(32, 1, 1)] // Subgroup size must match hardware expectations +void computeMain(uint3 threadId : SV_GroupThreadID, uint3 groupId : SV_GroupID) { + // Each subgroup handles one (M x N) tile of the output matrix C + CooperativeMatrix acc = 0.0f; + + // Loop over the K dimension in blocks of 'K' + for (uint32_t k = 0; k < cb.totalK; k += K) { + CooperativeMatrix matA; + CooperativeMatrix matB; + + // Load tiles from memory using Buffer Device Address + matA.load((float16*)cb.addrA, getOffsetA(groupId, k), cb.strideA); + matB.load((float16*)cb.addrB, getOffsetB(groupId, k), cb.strideB); + + // Accumulate product: acc = matA * matB + acc + acc = mul(matA, matB) + acc; + } + + // Store the final accumulated tile + acc.store((float*)cb.addrC, getOffsetC(groupId), cb.strideC); +} +---- + +== GLSL: The Low-Level Win + +While Slang makes the code look like standard matrix math, it's helpful to see the GLSL equivalent to understand the "win" that Vulkan 1.4 provides through the `GL_KHR_cooperative_matrix` extension. Note the explicit "Use" types which hint to the compiler how to optimize register allocation. + +[source,glsl] +---- +#extension GL_KHR_cooperative_matrix : enable +#extension GL_EXT_shader_explicit_arithmetic_types_float16 : enable + +// Define the fragments with explicit scopes and uses +layout(constant_id = 0) const int M = 16; +layout(constant_id = 1) const int N = 16; +layout(constant_id = 2) const int K = 16; + +// UseA and UseB are inputs, UseAccumulator is for C and D +coopmat matA; +coopmat matB; +coopmat acc; + +void main() { + // Explicit loading requires byte-offset and row-stride + coopMatLoad(matA, dataA, offsetA, strideA, gl_CooperativeMatrixLayoutRowMajor); + coopMatLoad(matB, dataB, offsetB, strideB, gl_CooperativeMatrixLayoutRowMajor); + + // acc = matA * matB + acc + acc = coopMatMulAdd(matA, matB, acc); + + coopMatStore(acc, dataC, offsetC, strideC, gl_CooperativeMatrixLayoutRowMajor); +} +---- + +== Hardware Constraints and Capabilities + +The physical dimensions (latexmath:[M, N, K]) are not arbitrary. You must query `VkPhysicalDeviceCooperativeMatrixPropertiesKHR` to find supported combinations. + +* **Subgroup Size**: On NVIDIA, these units typically expect a subgroup size of 32. On AMD, it might be 64. Using the wrong subgroup size in your `[numthreads]` will result in a failure to initialize the cooperative matrix types. +* **Precision Trade-offs**: It is standard practice to use `float16` for the input matrices (A and B) to maximize throughput and save bandwidth, while using `float32` for the accumulator (C and D). This "Mixed Precision GEMM" provides the best balance of speed and numerical stability. +* **Alignment**: Memory addresses passed to `.load()` and `.store()` usually require specific alignment (e.g., 16 bytes). Loading from a misaligned address can lead to a device lost error. + +By leveraging these specialized units, you can achieve throughput that is often an order of magnitude higher than what's possible with standard floating-point units. This makes cooperative matrices essential for any performance-critical linear algebra on the GPU. + +xref:01_introduction.adoc[Previous: Introduction] | xref:03_mixed_precision.adoc[Next: Mastering Mixed Precision] diff --git a/en/Advanced_Vulkan_Compute/10_Specialized_Math/03_mixed_precision.adoc b/en/Advanced_Vulkan_Compute/10_Specialized_Math/03_mixed_precision.adoc new file mode 100644 index 00000000..c74521f3 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/10_Specialized_Math/03_mixed_precision.adoc @@ -0,0 +1,103 @@ +:pp: {plus}{plus} += Mastering Mixed Precision: FP16 and Int8 + +In high-performance computing, we've traditionally relied on 32-bit floating-point precision (FP32) for almost everything. But as datasets grow larger and we demand higher throughput, it's worth asking: do we *really* need 32 bits for every single calculation? This is where **Mixed Precision** comes in. + +The core idea is simple: use lower-precision types like **FP16** (half-precision float) or **Int8** (8-bit integer) for the bulk of your calculations, and only use higher precision where it's absolutely necessary. Modern GPU architectures are heavily optimized for these lower-precision types. For example, many GPUs can perform twice as many FP16 operations as FP32 operations in the same amount of time. + +== Why Mixed Precision? + +There are two primary reasons to embrace lower precision: + +1. **Arithmetic Throughput**: Many modern GPUs have "packed" math units. A single 32-bit register can hold two 16-bit values, and the hardware can perform two 16-bit operations in the same cycle it would take for one 32-bit operation. +2. **Memory Bandwidth**: Data is expensive to move. By using 16-bit or 8-bit types, you're effectively doubling or quadrupling the amount of data you can move through the same memory bus. + +== Precision vs. Range: FP16 and BFloat16 + +When dropping from 32-bit to 16-bit, you have to choose what to sacrifice. + +* **FP16 (IEEE 754)**: 1 sign bit, 5 exponent bits, 10 mantissa bits. This provides decent precision but a very limited range (max value ~65,504). +* **BFloat16 (Brain Float)**: 1 sign bit, 8 exponent bits, 7 mantissa bits. This has the *same range* as FP32 but much lower precision. It's often preferred for machine learning because it's more robust to overflows. + +In Vulkan, FP16 is widely supported via the `VK_KHR_shader_float16_int8` extension, while BFloat16 is typically accessed through the `VK_KHR_shader_float_controls` or vendor-specific extensions. + +== Slang: Natural Mixed Precision + +Slang makes it incredibly easy to use mixed precision because it treats `half` and `int8_t` as native types. It handles the low-level conversion instructions for you. + +[source,slang] +---- +// Using half-precision in Slang +void computeMain() { + // 16-bit floats (h suffix) + half a = 1.0h; + half b = 2.0h; + + // Mixed accumulation: Perform 16-bit math, accumulate in 32-bit + float accumulator = 0.0f; + for(int i = 0; i < 100; i++) { + // Explicit cast to float to ensure the addition is 32-bit + accumulator += (float)(a * b); + } +} +---- + +== Int8 and Dot Products (DP4A) + +For even higher throughput, many GPUs support specialized instructions for 8-bit integer math. One of the most common is **DP4A** (Dot Product with 4-way Accumulation). + +The hardware takes two 32-bit registers, each containing four 8-bit values ($x_0, x_1, x_2, x_3$ and $y_0, y_1, y_2, y_3$). It performs: +[latexmath] +++++ +Result = (x_0 \times y_0) + (x_1 \times y_1) + (x_2 \times y_2) + (x_3 \times y_3) + Accumulator +++++ +All of this happens in a single cycle. In Slang, you can trigger this by using `dot` on packed 8-bit vectors: + +[source,slang] +---- +RWStructuredBuffer output; + +void computeMain(uint3 threadId : SV_DispatchThreadID) { + uint32_t packedA = loadPackedA(threadId.x); + uint32_t packedB = loadPackedB(threadId.x); + + // Reinterpret the uint32 as a vector of four 8-bit ints + int8_t4 vecA = BitCast(packedA); + int8_t4 vecB = BitCast(packedB); + + // The dot product intrinsic maps directly to DP4A hardware + int result = dot(vecA, vecB); + output[threadId.x] = result; +} +---- + +== C{pp} Side: Preparing the Data + +To feed these shaders, you must pack your data correctly on the CPU. Since standard C{pp} doesn't have a native 16-bit float type (until C{pp}23's `std::float16_t`), you'll often use a library like `glm` or perform manual bit-packing. + +[source,cpp] +---- +// Example of packing four 8-bit integers into one 32-bit uint +uint32_t packInt8(int8_t a, int8_t b, int8_t c, int8_t d) { + return (uint32_t(a) << 0) | (uint32_t(b) << 8) | + (uint32_t(c) << 16) | (uint32_t(d) << 24); +} + +// Uploading to a Vulkan buffer using RAII +void uploadData(vk::raii::Device& device, const std::vector& packedData) { + vk::BufferCreateInfo createInfo({}, packedData.size() * sizeof(uint32_t), + vk::BufferUsageFlagBits::eStorageBuffer); + vk::raii::Buffer storageBuffer(device, createInfo); + // ... bind memory and copy data ... +} +---- + +== Managing Dynamic Range: Loss Scaling + +The biggest challenge with mixed precision, particularly with **FP16**, is its limited **Dynamic Range**. FP16 has a much smaller range than FP32, which means it's much easier to **overflow** (exceed the maximum value) or **underflow** (become too small to represent). + +Managing this requires a technique known as **Loss Scaling**. You multiply your values by a scaling factor (e.g., 128.0) before performing your low-precision calculations to keep them within a safe range, and then divide by that same factor when you're done. + +By mastering mixed precision, you're not just "squeezing out more performance"; you're being smarter about how you use the hardware's resources. Whether you're optimizing a fluid simulation or a real-time signal processing engine, these techniques are essential for pushing the boundaries of what's possible on modern GPUs. + +xref:02_cooperative_matrices.adoc[Previous: Cooperative Matrices] | xref:../11_Performance_Optimization/01_introduction.adoc[Next: Performance Optimization] diff --git a/en/Advanced_Vulkan_Compute/11_Performance_Optimization/01_introduction.adoc b/en/Advanced_Vulkan_Compute/11_Performance_Optimization/01_introduction.adoc new file mode 100644 index 00000000..bea8d91b --- /dev/null +++ b/en/Advanced_Vulkan_Compute/11_Performance_Optimization/01_introduction.adoc @@ -0,0 +1,28 @@ +:pp: {plus}{plus} += Performance Auditing and Optimization + +We've covered a vast range of advanced Vulkan compute topics—from low-level architecture to high-level abstractions like SYCL. But there's one question that every developer eventually faces: "Is this as fast as it can be?" Answering this question is not about guesswork or intuition; it's about a rigorous, methodical approach to **Performance Auditing**. + +In the world of GPU compute, a "fast" kernel can be held back by many things. It might be waiting on memory (**memory-bound**), it might be overwhelmed by complex arithmetic (**compute-bound**), or it might be suffering from "divergence"—where different invocations in a **subgroup** (or **warp/wavefront**) are forced to take different execution paths, causing the hardware to serialize their work. + +Optimization is not just about writing "clever" code. It's about understanding the **bottlenecks**. If your kernel is memory-bound, adding more arithmetic operations won't slow it down, but it also won't make it faster. Conversely, if you're compute-bound, optimizing your memory access pattern might not yield any noticeable gains. + +== Moving Beyond Naive Optimization + +When we talk about optimization in a massively parallel environment like Vulkan, we need a standard set of metrics and models to guide us. In this chapter, we will introduce: + +* **The Roofline Model**: A fundamental analytical tool that allows us to visualize whether a kernel is limited by the peak bandwidth of **VRAM** (Video Random Access Memory) or the peak throughput of the **ALU** (Arithmetic Logic Unit). +* **Instruction Throughput Analysis**: Understanding the cost of individual **ISA** (Instruction Set Architecture) commands, and how to identify "heavy" operations like double-precision floats or complex transcendental functions. +* **Divergence Audits**: A methodology for identifying where **SIMD** (Single Instruction, Multiple Data) execution breaks down, causing lanes to sit idle while others work. + +We'll move beyond looking at high-level Slang or GLSL code and start thinking about what the hardware actually sees. This involves understanding the **Occupancy** of the **CU** (Compute Unit) or **SM** (Streaming Multiprocessor) and how to minimize **pipeline stalls** caused by memory latency. + +By the end of this chapter, you'll be equipped with the methodology to move from "making it work" to "making it fly." + +== Chapter Roadmap + +1. **Instruction Throughput Analysis**: Learning to identify compute-bound vs. memory-bound kernels using the Roofline Model. +2. **The Divergence Audit**: Techniques for visualizing and refactoring divergent branching logic. + +[horizontal] +*Previous:* xref:../10_Specialized_Math/03_mixed_precision.adoc[Mastering Mixed Precision] | *Next:* xref:02_instruction_throughput.adoc[Instruction Throughput Analysis] diff --git a/en/Advanced_Vulkan_Compute/11_Performance_Optimization/02_instruction_throughput.adoc b/en/Advanced_Vulkan_Compute/11_Performance_Optimization/02_instruction_throughput.adoc new file mode 100644 index 00000000..fbd953e7 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/11_Performance_Optimization/02_instruction_throughput.adoc @@ -0,0 +1,49 @@ +:pp: {plus}{plus} += Analyzing Instruction Throughput: Compute-Bound vs. Memory-Bound + +Every GPU kernel has a "heartbeat"—a rate at which it processes instructions and accesses memory. Understanding this heartbeat is the key to effective optimization. To do this, we need to distinguish between two primary types of bottlenecks: **Compute-Bound** and **Memory-Bound**. + +== The Roofline Model + +A powerful way to visualize this is the **Roofline Model**. Imagine a graph where the x-axis is **Arithmetic Intensity** (the ratio of math operations to memory bytes accessed) and the y-axis is **Performance** (GFLOPS). + +The "roof" of this model is determined by the hardware's peak theoretical performance. + +* If your kernel has low arithmetic intensity (lots of memory access, little math), it's trapped on the "slope" of the roof—it's **Memory-Bound**. +* If your kernel has high arithmetic intensity, it hits the flat part of the roof—it's **Compute-Bound**. + +== Identifying the Bottleneck + +To identify these bottlenecks, you need to look at **Hardware Metrics** using profiling tools like NVIDIA Nsight, AMD Radeon GPU Profiler (RGP), or Intel VTune. + +=== Compute-Bound Kernels +A **Compute-Bound** kernel is one where the hardware's arithmetic units (**ALUs**) are fully occupied. These kernels are characterized by: + +* **High ALU Utilization**: The ALUs are active for a large percentage of the time. +* **Low Memory Throughput**: The memory bus is relatively idle. +* **Fix**: Simplify your math, use **Mixed Precision**, or leverage specialized units like **Cooperative Matrices**. + +=== Memory-Bound Kernels +A **Memory-Bound** kernel is one where the ALUs are often idle, waiting for data from VRAM. These kernels show: + +* **Low ALU Utilization**: The arithmetic units are "stalled" waiting for memory. +* **High VRAM Throughput**: You're hitting the hardware's bandwidth limits. +* **Fix**: Improve **Memory Coalescing**, use **Shared Memory (LDS)** to reuse data, or use **Subgroup Operations** to share data without touching VRAM. + +== Understanding Stall Reasons + +Modern profilers can tell you *why* a wavefront is stalled. Common reasons include: + +* **Instruction Fetch Stall**: The hardware can't fetch the next instruction fast enough (rare for compute). +* **Execution Stall**: The ALUs are busy with a long-running instruction (like a complex transcendental function). +* **Memory Dependency Stall**: The most common stall—the wavefront is waiting for a `load` from VRAM to complete. + +== Latency Hiding and Occupancy + +As we discussed in Chapter 2, the GPU hides memory latency by switching between active wavefronts. This is why **Occupancy** is so important. If you have low occupancy, the GPU might run out of "work" to do while it's waiting for memory, leading to idle ALUs and poor performance. + +However, be careful! Higher occupancy isn't always better. If your occupancy is too high, you might increase **Cache Contention**, where different wavefronts are constantly evicting each other's data from the L1 or L2 caches. Finding the "sweet spot" for occupancy is a critical part of the tuning process. + +Optimization is an iterative process. You profile, identify the bottleneck, apply a targeted fix, and then profile again. This is how you eventually arrive at a truly optimized solution that makes the most of the GPU's massive parallel potential. + +xref:01_introduction.adoc[Previous: Introduction] | xref:03_divergence_audit.adoc[Next: Divergence Audit] diff --git a/en/Advanced_Vulkan_Compute/11_Performance_Optimization/03_divergence_audit.adoc b/en/Advanced_Vulkan_Compute/11_Performance_Optimization/03_divergence_audit.adoc new file mode 100644 index 00000000..e94c7edf --- /dev/null +++ b/en/Advanced_Vulkan_Compute/11_Performance_Optimization/03_divergence_audit.adoc @@ -0,0 +1,64 @@ +:pp: {plus}{plus} += The Divergence Audit: Identifying and Refactoring Branch Divergence + +One of the most insidious performance killers in GPU compute is **Branch Divergence**. To understand why, we need to remember that GPUs operate on groups of invocations (wavefronts or warps) that execute the same instruction in lock-step. When your code includes a branch—like an `if-else` statement—and some invocations in the subgroup take the `if` path while others take the `else` path, the hardware is forced to **serialize** those paths. + +The hardware will execute the `if` path for all relevant invocations (while masking out the others), and then it will execute the `else` path for the remaining invocations (masking out the first group). During this time, the ALUs for the inactive invocations are essentially idle, and you're effectively cutting your GPU's throughput in half. + +== Identifying Divergence + +A **Divergence Audit** is a methodical process for identifying where these "divergent" branches are occurring and refactoring your code to minimize their impact. + +=== Tool-Based Identification +Look for metrics in your profiler like **Active Lane Ratio** or **Instruction Execution Efficiency**. A low ratio indicates that many lanes in your subgroups are being idled by divergent control flow. + +For example, in NVIDIA Nsight, you might look at the "Warp Execution Efficiency" metric. If it's consistently below 50%, you likely have a significant divergence problem. + +=== In-Shader Visualization +You can also use **Subgroup Operations** (Chapter 4) to manually inspect divergence directly in your shader. By using `WaveActiveBallot()`, you can generate a bitmask of which invocations are taking a particular path. + +[source,slang] +---- +// Visualize divergence in your shader +bool local_test = data[globalID.x] > threshold; + +// Ballot tells us exactly which lanes in the subgroup are 'true' +uint4 lane_mask = WaveActiveBallot(local_test); + +// If only some lanes are true, we are divergent! +uint active_lanes = countbits(lane_mask.x) + countbits(lane_mask.y) + + countbits(lane_mask.z) + countbits(lane_mask.w); +---- + +== Refactoring Strategies + +Once you've identified a divergent branch, there are several ways to refactor it. + +=== Strategy 1: Subgroup-Level Branching +If a branch can be evaluated identically for all invocations in a subgroup, the hardware can execute it without any penalty. This is often called "Uniform Branching." + +[source,slang] +---- +// Refactored, subgroup-aware branch +bool local_test = data[globalID.x] > threshold; + +if (WaveActiveAllTrue(local_test)) { + // Fast path: everyone is doing the same work! + do_complex_work_fast(); +} else if (WaveActiveAnyTrue(local_test)) { + // Slow path: only some are doing work, but we only enter if necessary + do_complex_work_slow(); +} +---- + +=== Strategy 2: Replacing Control Flow with Data Flow +A more advanced technique is to **Replace Control Flow with Data Flow**. Instead of using an `if` to choose between two calculations, you can perform both and use a mathematical trick to select the result. This keeps the execution pipeline "saturated" and avoids the serialization penalty of branching. + +Functions like `lerp()`, `clamp()`, and `step()` are your best friends here. In many cases, performing a few extra arithmetic operations is faster than the cost of a divergent branch. + +=== Strategy 3: Work Sorting +If your divergence is caused by processing different types of data (e.g., in a ray tracer where some rays hit a complex material and others hit a simple one), you can use a **sorting pass** to group similar workloads together. By ensuring that all invocations in a subgroup are processing the same type of data, you can eliminate divergence entirely at the cost of the sort. + +By conducting regular divergence audits, you can identify the "hidden" costs in your compute kernels and refactor them into more efficient, SIMD-friendly patterns. This is the difference between code that "just runs" and code that truly masters the GPU's architecture. + +xref:02_instruction_throughput.adoc[Previous: Instruction Throughput Analysis] | xref:../12_Diagnostics_and_Refinement/01_introduction.adoc[Next: Diagnostics and Refinement] diff --git a/en/Advanced_Vulkan_Compute/12_Diagnostics_and_Refinement/01_introduction.adoc b/en/Advanced_Vulkan_Compute/12_Diagnostics_and_Refinement/01_introduction.adoc new file mode 100644 index 00000000..3e0c93b4 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/12_Diagnostics_and_Refinement/01_introduction.adoc @@ -0,0 +1,28 @@ +:pp: {plus}{plus} += Diagnostics and AI-Assisted Compute Refinement: Introduction + +== Overview + +In this final chapter, we're going to explore the modern landscape of Vulkan compute development. As our kernels become more complex and our orchestration more elaborate, the traditional methods of debugging and optimization can sometimes feel inadequate. If you've spent any time writing compute shaders, you know the frustration: your code compiles, your dispatch returns success, but your output buffer is full of zeros—or worse, your entire system hangs with a "Device Lost" error. + +The GPU is often described as a "**Black Box**"—a powerful processor that performs millions of operations in parallel but offers very little visibility into what's actually happening inside. Unlike C{pp} code on the CPU, you can't easily set a breakpoint, step through your logic line by line, or inspect the state of every register. To build robust and efficient compute pipelines, we need a new set of tools and a new way of thinking about the development process. + +== The Diagnostic Pillars + +To pull back the curtain on the GPU, we'll focus on two modern techniques for runtime verification: + +* **GPU-Assisted Validation (GAV)**: This is a powerful feature of the Vulkan validation layers. Instead of just checking if your API calls are valid, GAV actually injects small amounts of diagnostic code directly into your shaders at runtime. This process, known as **instrumentation**, allows the layers to detect errors that would otherwise go completely unnoticed—from **Out-of-Bounds (OOB)** buffer access to invalid pointer dereferences when using **Buffer Device Address (BDA)**. +* **Shader printf**: We'll explore how to use standard `printf` logic inside a shader to "see" the values of your variables across thousands of parallel invocations. While it might seem primitive, in a massively parallel environment, it's often the only way to track down subtle logic errors. + +== AI-Assisted Development + +Finally, we'll look at the emerging role of **AI-Assisted Optimization**. **Large Language Models (LLMs)**—AI models trained on vast amounts of code—are becoming increasingly adept at understanding shader code and suggesting parallel-friendly refactors. + +Whether you're struggling to vectorize a naive loop or looking for a more efficient **Subgroup** pattern (using the **Wave** operations we learned in Chapter 4), an AI assistant can be a valuable partner in your development process. However, as we'll see, the key to using AI effectively is knowing how to "talk" to it using the specific **terms of art** we've mastered in this series—like **LDS (Local Data Store)**, **Barriers**, and **Occupancy**. + +== Chapter Roadmap + +1. **Compute Validation**: Setting up and using GPU-Assisted Validation to catch memory errors and using `printf` for shader debugging. +2. **Assistant-Led Optimization**: Leveraging AI to refactor naive compute kernels into wave-aware, high-performance patterns. + +xref:../11_Performance_Optimization/03_divergence_audit.adoc[Previous: Divergence Audit] | xref:02_compute_validation.adoc[Next: Compute Validation] diff --git a/en/Advanced_Vulkan_Compute/12_Diagnostics_and_Refinement/02_compute_validation.adoc b/en/Advanced_Vulkan_Compute/12_Diagnostics_and_Refinement/02_compute_validation.adoc new file mode 100644 index 00000000..6f364287 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/12_Diagnostics_and_Refinement/02_compute_validation.adoc @@ -0,0 +1,71 @@ +:pp: {plus}{plus} += Compute Validation and GPU-Assisted Debugging + +Debugging a compute shader is notoriously difficult. Unlike CPU code, you can't easily set a breakpoint or step through your logic line by line. Most errors—like an out-of-bounds buffer access—will simply result in garbage data or, in the worst-case scenario, a "Device Lost" error that provides almost no information about what went wrong. + +== GPU-Assisted Validation (GAV) + +This is where **GPU-Assisted Validation** (GAV) comes in. Part of the standard Vulkan Validation Layers, GAV works by injecting small amounts of diagnostic code directly into your shaders at runtime. This **instrumentation** allows the layers to track and report errors that would otherwise be invisible. + +=== Enabling GAV in C++ +To enable GAV, you configure the `vk::ValidationFeaturesEXT` structure when creating your Vulkan instance. + +[source,cpp] +---- +// Enabling GPU-Assisted Validation via RAII +std::vector enabledFeatures = { + vk::ValidationFeatureEnableEXT::eGpuAssisted, + vk::ValidationFeatureEnableEXT::eGpuAssistedReserveBindingSlot +}; + +vk::ValidationFeaturesEXT validationFeatures { + .enabledValidationFeatureCount = static_cast(enabledFeatures.size()), + .pEnabledValidationFeatures = enabledFeatures.data() +}; + +vk::InstanceCreateInfo createInfo { + .pNext = &validationFeatures, + // ... other setup ... +}; +---- + +=== What GAV Detects +* **Out-of-Bounds Access**: If you try to read from `data[100]` when the buffer only has 50 elements, GAV will catch it. +* **Invalid Pointers**: When using **Buffer Device Address (BDA)**, GAV can detect if you're dereferencing a null or invalid pointer. +* **Uninitialized Descriptors**: It ensures that every descriptor your shader touches has been correctly bound and initialized. + +== Shader Printf: Seeing Inside the Kernel + +While GAV is great for catching errors, sometimes you just need to see the values of your variables. This is where `debugPrintfEXT` (from the `GL_EXT_debug_printf` extension) becomes your best friend. + +=== In the Shader (Slang) +Slang supports `printf` directly, which maps to the underlying Vulkan extension. + +[source,slang] +---- +// Using printf in a compute shader +void computeMain(uint3 globalID : SV_DispatchThreadID) { + float some_value = calculate_complex_math(globalID.x); + + if (some_value < 0.0f) { + // Output will appear in your application's debug callback + printf("Thread %d: Warning! Negative value detected: %f\n", globalID.x, some_value); + } +} +---- + +=== In the Host Code +To see the output from `printf`, you must: +1. Enable the `VK_KHR_shader_non_semantic_info` extension on your device. +2. Have a standard **Debug Messenger** callback registered. The output from your shader will arrive as a `VkDebugUtilsMessengerCallbackDataEXT` with a message ID that identifies it as a printf call. + +== Interpreting the Output + +When a validation error or a `printf` occurs, the output can be verbose. Look for: +* **The Shader Module**: Which shader triggered the message. +* **The Instruction Offset**: The specific SPIR-V instruction that failed. +* **The Value**: For `printf`, this is your formatted string. For GAV, it might be the invalid index or pointer address. + +While GAV and `printf` have a significant performance cost, they are indispensable for development. They turn the "black box" of the GPU into a transparent environment where you can build complex, reliable compute pipelines with confidence. + +xref:01_introduction.adoc[Previous: Introduction] | xref:03_assistant_led_optimization.adoc[Next: AI-Assisted Optimization] diff --git a/en/Advanced_Vulkan_Compute/12_Diagnostics_and_Refinement/03_assistant_led_optimization.adoc b/en/Advanced_Vulkan_Compute/12_Diagnostics_and_Refinement/03_assistant_led_optimization.adoc new file mode 100644 index 00000000..a9974b00 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/12_Diagnostics_and_Refinement/03_assistant_led_optimization.adoc @@ -0,0 +1,67 @@ +:pp: {plus}{plus} += AI-Assisted Optimization and Refinement + +The field of GPU compute is evolving rapidly, and the sheer complexity of advanced Vulkan compute can sometimes feel overwhelming. This is where **AI-Assisted Optimization**—leveraging Large Language Models (LLMs) to analyze and refactor your code—is becoming a game-changer. + +== The New Workflow + +As a Vulkan developer, you're no longer alone in the optimization process. Modern AI assistants can act as a second pair of eyes, helping you navigate the pitfalls of SIMD architecture and memory consistency. + +=== Example: From Naive to Wave-Aware +Imagine you have a naive loop that calculates a prefix sum across a buffer. + +[source,slang] +---- +// Naive approach: One thread does all the work in a loop +void computeMain(uint3 globalID : SV_DispatchThreadID) { + if (globalID.x == 0) { + uint total = 0; + for (uint i = 0; i < bufferSize; i++) { + total += data[i]; + data[i] = total; + } + } +} +---- + +An AI assistant can instantly recognize this as a sequential bottleneck and suggest a **Subgroup-Aware** (Wave) refactoring using the techniques we discussed in Chapter 4. + +[source,slang] +---- +// AI-Suggested Refactor: Parallel prefix sum using Wave operations +void computeMain(uint3 globalID : SV_DispatchThreadID) { + uint val = data[globalID.x]; + + // Perform a parallel prefix sum within the subgroup + uint inclusive_sum = WavePrefixSum(val) + val; + + // WaveReadLaneAt allows us to get the total from the last lane + uint subgroup_total = WaveReadLaneAt(inclusive_sum, WaveGetLaneCount() - 1); + + // ... further logic to combine subgroup totals ... + data[globalID.x] = inclusive_sum; +} +---- + +== Effective Prompting for Shaders + +To get the most out of an AI assistant, you need to provide **Context**. Don't just paste your code; explain the constraints: + +* **"Refactor this Slang shader to use Wave operations for better throughput."** +* **"Identify potential bank conflicts in this groupshared memory access pattern."** +* **"How can I use Buffer Device Address to optimize this tree traversal?"** + +By framing your questions with the specific terms of art we've learned in this series—like **Subgroups**, **LDS**, **BDA**, and **Barriers**—you'll receive much more accurate and actionable suggestions. + +== The Golden Rule: Trust but Verify + +It's crucial to remember that an AI assistant is just that—an **assistant**. While it's great for generating suggestions and identifying patterns, you are still the primary architect. + +1. **Verify Correctness**: AI-generated code can sometimes have subtle bugs, especially with complex synchronization. Always run your code through **GPU-Assisted Validation** (Chapter 12, Section 2). +2. **Profile Performance**: A "clever" refactor might actually be slower on certain hardware. Always verify the AI's suggestions with a **Divergence Audit** or an **Instruction Throughput Analysis** (Chapter 11). + +== Closing the Loop + +As we move toward the final conclusion of this series, we've seen how modern tools like GPU-Assisted Validation and AI-led refactoring can transform the compute development workflow. In the next section, we'll summarize everything we've learned and look ahead to the future of high-performance Vulkan compute. + +xref:02_compute_validation.adoc[Previous: Compute Validation] | xref:../conclusion.adoc[Next: Series Conclusion] diff --git a/en/Advanced_Vulkan_Compute/conclusion.adoc b/en/Advanced_Vulkan_Compute/conclusion.adoc new file mode 100644 index 00000000..a126e038 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/conclusion.adoc @@ -0,0 +1,52 @@ +:pp: {plus}{plus} += Advanced Vulkan Compute: Conclusion + +Congratulations on completing the "Advanced Vulkan Compute" tutorial series! You have traveled from the basic concepts of compute shaders to the cutting edge of high-performance GPGPU development in Vulkan 1.4. + +== What We've Learned + +Throughout this series, we have explored the depths of modern GPU compute, moving beyond simple image processing into complex, heterogeneous systems: + +1. **Compute Architecture**: We mastered the mapping between workgroup grids and physical hardware (CUs and SMs), and learned how to maximize occupancy and hide latency. We also utilized Vulkan 1.4's scalar layouts for maximum bandwidth efficiency. +2. **Memory Models**: We demystified the Vulkan Memory Model, mastering availability, visibility, and domain operations to ensure thread safety without sacrificing performance. +3. **Subgroup Power**: We utilized subgroup shuffles, broadcasts, and arithmetic to exchange data at hardware speed, bypassing VRAM and shared memory (LDS) entirely. +4. **Heterogeneous Ecosystems**: We explored bridging legacy code with OpenCL (clspv/clvk) and modernizing development with single-source SYCL (AdaptiveCpp). +5. **Advanced Data Structures**: We implemented GPU-resident trees, lock-free linked lists, and utilized raw Buffer Device Addresses (BDA) for pointer-like flexibility. +6. **GPU-Driven Pipelines**: We transitioned control from the CPU to the GPU using indirect dispatches and autonomous command generation. +7. **Asynchronous Orchestration**: We harnessed the power of multiple hardware engines to run compute concurrently with graphics using Synchronization 2 and Timeline Semaphores. +8. **Specialized Math**: We utilized modern hardware features like Cooperative Matrices and mixed-precision (FP16/Int8) for massive throughput. +9. **Performance & Diagnostics**: We learned to audit our kernels for divergence, analyze throughput with the Roofline model, and debug complex race conditions with GPU-Assisted Validation (GAV). +10. **AI-Assisted Optimization**: We've seen how Large Language Models (LLMs) can act as a bridge between naive, sequential logic and the parallel, subgroup-aware patterns that are necessary for high GPU throughput. + +== Making it Click: The Compute Mindset + +If there is one takeaway from this series, it is this: **The GPU is not just a math coprocessor; it is a parallel throughput machine with its own complex hierarchy.** + +Developing for advanced compute requires a shift in mindset: + +- **Think in Waves**: Always look for opportunities to use subgroup (Wave) operations instead of workgroup-level barriers. +- **Explicit Synchronization**: Don't rely on luck. Use the Vulkan Memory Model and Synchronization 2 to define exactly how and when data becomes visible. +- **Data-First Design**: Design your data structures for the GPU's memory architecture (scalar layouts, LDS bank alignment) before you write a single line of logic. + +== Where to Go From Here + +The world of high-performance computing is vast. Now that you have a solid foundation, consider these paths: + +1. **Deep Dive into Machine Learning**: Use what you've learned about Cooperative Matrices and Mixed Precision to optimize neural network inference or training. +2. **Real-Time Path Tracing**: Combine GPU-Driven pipelines and Asynchronous Compute to build a high-performance ray tracer that handles complex spatial structures entirely on the device. +3. **Physical Simulations**: Implement advanced fluid dynamics (SPH) or rigid body solvers using the lock-free data structures we discussed. +4. **Vulkan Ecosystem**: Contribute to projects like `clspv`, `clvk`, or `AdaptiveCpp`, or build your own high-level compute abstraction. + +== Community and Resources + +As always, you are not alone in this journey. The Vulkan community is filled with experts and enthusiasts: + +- **Khronos Slack/Discord**: Great for technical deep dives into specific extensions. +- **Vulkan Specification**: Your ultimate source of truth for memory models and hardware constraints. +- **Vendor-Specific Documentation**: Explore NVIDIA's Nsight, AMD's RGP, and Intel's GPA documentation for hardware-specific optimization tricks. + +Thank you for following along with this series. We've moved from "making pixels pretty" to harnessing the full parallel power of modern hardware. The only limit now is your imagination. + +Happy Hacking! + +xref:12_Diagnostics_and_Refinement/03_assistant_led_optimization.adoc[Previous: Assistant-Led Optimization] | xref:../00_Introduction.adoc[Back to Home] diff --git a/en/Advanced_Vulkan_Compute/introduction.adoc b/en/Advanced_Vulkan_Compute/introduction.adoc new file mode 100644 index 00000000..6e2e6e23 --- /dev/null +++ b/en/Advanced_Vulkan_Compute/introduction.adoc @@ -0,0 +1,64 @@ +:pp: {plus}{plus} + += Advanced Vulkan Compute: The Power of Parallelism + +== Introduction + +Welcome to the "Advanced Vulkan Compute" tutorial series! This series is designed for developers who have mastered the basics of Vulkan compute shaders and are looking to push the boundaries of what's possible with modern GPU hardware. + +Vulkan is not just a graphics API; it is a powerful, low-level framework for general-purpose GPU programming (GPGPU). While the initial tutorials covered how to dispatch a simple compute shader, this series dives deep into the architecture, memory models, and advanced features that enable high-performance simulations, complex data structures, and heterogeneous execution. + +=== Beyond the Basics + +In a basic compute shader, you might just be multiplying an array of floats. In advanced compute, you are: + +* **Orchestrating thousands of threads** to work together on a single problem. +* **Managing memory consistency** to ensure that data written by one thread is safely read by another. +* **Leveraging specialized hardware** like subgroup shuffles and cooperative matrices to bypass slow **VRAM** (Video Random Access Memory). +* **Building GPU-resident data structures** like **BVH** (Bounding Volume Hierarchies) and Octrees that never need to touch the CPU. + +To do this effectively, you need more than just a passing knowledge of GLSL or Slang; you need to understand the underlying hardware architecture and the Vulkan execution model. + +=== What You Will Learn + +This tutorial series is organized into several key areas: + +1. **Compute Architecture** - Mapping workgroups to Compute Units (CU) and Streaming Multiprocessors (SM), and mastering occupancy. +2. **Memory Models and Consistency** - Understanding the Vulkan Memory Model, shared memory (**LDS** - Local Data Store), and fine-grained synchronization. +3. **Subgroup Operations** - Using cross-invocation communication to avoid VRAM round-trips and maximize **SIMD** (Single Instruction, Multiple Data) throughput. +4. **Heterogeneous Ecosystems** - Running OpenCL C and SYCL code on top of Vulkan using `clspv`, `clvk`, and AdaptiveCpp. +5. **Advanced Data Structures** - Moving complex structures like trees and linked lists entirely to the GPU using 64-bit atomics and **BDA** (Buffer Device Address). +6. **GPU-Driven Pipelines** - Moving command generation and workload management entirely to the GPU for autonomous execution. +7. **Asynchronous Orchestration** - Running compute and graphics concurrently using Synchronization 2 and multiple hardware queues. +8. **Advanced Math & Optimization** - Using Cooperative Matrices for linear algebra and auditing kernels for divergence and throughput. + +=== Prerequisites + +This series assumes you are comfortable with: + +* Standard Vulkan initialization (Instance, Device, Queues). +* Basic Compute Pipelines and Descriptor Sets. +* C{pp}20 and GLSL/Slang shader languages. +* The concepts covered in the xref:11_Compute_Shader.adoc[Compute Shader] chapter of the main tutorial. + +=== How to Use This Tutorial + +Each chapter is designed to be self-contained but builds on the concepts of previous ones. We recommend following them in order if you're new to advanced compute, or jumping to specific sections if you're looking to solve a particular problem. + +Let's dive into the world of high-performance GPU computing! + +== Chapters + +* xref:02_Compute_Architecture/01_introduction.adoc[The Compute Architecture and Execution Model] +* xref:03_Memory_Models/01_introduction.adoc[Memory Models and Consistency] +* xref:04_Subgroup_Operations/01_introduction.adoc[Subgroup Operations: The Hidden Power] +* xref:05_OpenCL_on_Vulkan/01_introduction.adoc[Heterogeneous Ecosystem: OpenCL on Vulkan] +* xref:06_SYCL_and_Single_Source_CPP/01_introduction.adoc[High-Level Abstraction: SYCL and Single-Source C{pp}] +* xref:07_Advanced_Data_Structures/01_introduction.adoc[Advanced Data Structures on the GPU] +* xref:08_GPU_Driven_Pipelines/01_introduction.adoc[Indirect Dispatch and GPU-Driven Pipelines] +* xref:09_Asynchronous_Compute/01_introduction.adoc[Asynchronous Compute Orchestration] +* xref:10_Specialized_Math/01_introduction.adoc[Cooperative Matrices and Specialized Math] +* xref:11_Performance_Optimization/01_introduction.adoc[Performance Auditing and Optimization] +* xref:12_Diagnostics_and_Refinement/01_introduction.adoc[Diagnostics and AI-Assisted Compute Refinement] + +xref:11_Compute_Shader.adoc[Previous: Basic Compute Shaders] | xref:02_Compute_Architecture/01_introduction.adoc[Next: Compute Architecture] \ No newline at end of file