Run ILGPU C# kernels on WebGPU, WebGL, Wasm, Cuda, OpenCL, and CPU - from a single codebase.
Write parallel compute code in C# and let the library pick the best available backend automatically. In the browser, three backends (WebGPU, WebGL, Wasm) bring GPU-accelerated compute to virtually every modern browser. On desktop and server, ILGPU's native Cuda and OpenCL backends are available alongside CPU. The same async extension methods work everywhere.
Your existing ILGPU kernels run in the browser with zero changes to the kernel code - and the same code runs on desktop too.
Atomic.Add on long/ulong now works correctly on WebGPU with concurrent writers. The implementation uses a lock-free CAS loop on the lo half + atomicAdd on the hi half with carry propagation. No lock buffer, no extra binding slot, no spinlock.
Previously, i64 arithmetic atomics used a non-atomic read-modify-write that silently produced wrong results under contention. Now they're correct.
i64 bitwise atomics (Atomic.And, Atomic.Or, Atomic.Xor) continue to use dual independent i32 atomics, which is correct because bitwise operations are component-independent.
i64 Atomic.Min, Atomic.Max, Atomic.Exchange, and Atomic.CompareExchange throw NotSupportedException - these require 64-bit CAS which WGSL does not provide. f64 atomics also throw for the same reason.
See Docs/atomic-operations.md for the full per-operation, per-backend support matrix.
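To illustrate the now-correct behavior, a minimal concurrent-accumulation sketch (the `SumKernel` name and setup are illustrative, not from this codebase; `Atomic.Add` is the standard ILGPU API):

```csharp
// Many threads atomically accumulate into a single 64-bit slot.
// On WebGPU this now compiles to the CAS-on-low-half + atomicAdd-on-high-half
// sequence described above and is safe under contention.
static void SumKernel(Index1D index, ArrayView<long> input, ArrayView<long> total)
{
    Atomic.Add(ref total[0], input[index]);
}
```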
Fixed WGSL transpiler emitting break instead of return for early kernel exits inside loops. The IsReturnExit helper now correctly traces exit chains through empty pass-through blocks to distinguish "exit the loop" from "exit the kernel." This fixes kernels with return or value-assigning break inside for/while loops.
Also fixed: WGSL codegen losing value assignments before break statements inside loops. The HasNonPhiInstructions check now correctly identifies post-loop merge code folded into the break path by the IR optimizer.
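The class of kernel that previously miscompiled looks roughly like this (a hypothetical `FindFirstKernel`; any early `return` inside a loop body triggered the old break-vs-return confusion):

```csharp
static void FindFirstKernel(Index1D index, ArrayView<int> data, ArrayView<int> result)
{
    for (int i = 0; i < data.Length; i++)
    {
        if (data[i] == index)
        {
            result[index] = i;
            return; // previously emitted as a WGSL `break`, exiting only the loop
        }
    }
    result[index] = -1;
}
```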
When a buffer has both atomic operations (Atomic.And(ref buf[x], mask)) and non-atomic reads (long val = buf[y]), the WGSL codegen now correctly emits atomicLoad for ALL reads from that buffer. Previously, non-atomic reads on atomic buffers produced invalid WGSL that Chrome's validator rejected, causing silent pipeline creation failure.
Invalid WGSL shaders now produce hard failures instead of silent no-op pipelines. CheckShaderAsync feeds compilation errors into _pendingGpuErrors, which are thrown on the next Synchronize() or ThrowIfGpuErrors() call. Previously, a shader that failed Chrome's WGSL validator would silently dispatch as a no-op, producing all-zero output with no error.
Browser backends now implicitly call SynchronizeAsync() before GPU-to-CPU readback, matching the behavior of desktop backends (CUDA calls cuStreamSynchronize, OpenCL calls clFinish, CPU uses synchronous memcpy). Previously, the Wasm backend's CopyToHostAsync read directly from SharedArrayBuffer without waiting for pending kernel dispatches to complete, which could return stale data. WebGPU and WebGL already had implicit sync. All 5 Wasm readback paths are now patched, bringing all 6 backends to parity.
Extensive testing confirmed that same-stream dispatch ordering is already correct on all 3 browser backends:
- WebGPU: Single command encoder batches compute passes in recording order
- WebGL: PostMessage to single-threaded JS worker processes messages sequentially
- Wasm: Each `RunKernelAsync` awaits all prior pending work before dispatching
Five new stream ordering tests (Tests18) verify the implicit ordering contract and will catch future regressions.
WebGL now throws NotSupportedException at kernel compilation time for all atomic operations except Atomic.Add. Previously, unsupported atomics silently emitted target = int(0), producing wrong results with no error. Atomic.Add on WebGL uses the Transform Feedback vote pattern (partial support - the return value is always 0).
The GLSL codegen now has the same IsReturnExit helper as WGSL, correctly distinguishing loop-break from kernel-return in generated shaders. This prevents the same class of break-vs-return bug that was fixed in WGSL.
Full Int8, UInt8, Int16, UInt16, and Float16 (ILGPU.Half) buffer support across all 6 GPU backends. Sub-word types are stored packed and extracted with correct stride on every backend - no more data corruption from type promotion mismatches.
- WebGPU - Packed into `array<atomic<u32>>` storage buffers. Load via `atomicLoad` + shift + mask + sign-extend/zero-extend. Store via thread-safe `atomicAnd` + `atomicOr` for packed writes (prevents data races when threads write different halves of the same word). Float16 uses inline IEEE 754 f16-to-f32 conversion in WGSL.
- Wasm - Native `i32.load8_s`/`i32.load8_u`/`i32.load16_s`/`i32.load16_u`/`i32.store8`/`i32.store16` opcodes. Float16 via `EmitF16ToF32`/`EmitF32ToF16` for direct ArrayView load/store.
- WebGL - `texelFetch` from an R32I texture with shift+mask extraction in GLSL. Float16 via `_f16_to_f32`/`_f32_to_f16` using `uintBitsToFloat`/`floatBitsToUint`.
- OpenCL - Float16 promoted to `float` compute type. `vload_half`/`vstore_half` for buffer access (handles the 2-byte stride internally).
- CUDA/CPU - Native support, no changes needed.

- `Half.Abs`, `Half.Min`, `Half.Max`, `Half.Clamp` - GPU-accelerated half-precision math
- Implicit `System.Half` <-> `ILGPU.Half` conversion operators for seamless interop
- Use `ILGPU.Half` (not `System.Half`) in kernel signatures for correct transpilation
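A usage sketch, assuming the standard `Allocate1D`/`CopyToHostAsync` flow shown elsewhere in this README (the kernel name and data are illustrative):

```csharp
// Float16 buffers round-trip correctly on every backend.
// Use ILGPU.Half (not System.Half) in the kernel signature.
static void ClampKernel(Index1D index, ArrayView<Half> data)
{
    data[index] = Half.Min(data[index], (Half)1.0f);
}

using var buf = accelerator.Allocate1D(new Half[] { (Half)0.5f, (Half)2.0f });
var kernel = accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<Half>>(ClampKernel);
kernel((Index1D)2, buf.View);
await accelerator.SynchronizeAsync();
var results = await buf.CopyToHostAsync<Half>(); // expected [0.5, 1.0]
```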
New IBrowserMemoryBuffer.CopyFromJS() methods accept TypedArray or ArrayBuffer and write directly to GPU memory without .NET heap allocation. Available on all 3 browser backends (WebGPU, WebGL, Wasm).
```csharp
// Write JS data directly to GPU buffer - no .NET allocation
var jsArray = new Int16Array(data);
((IBrowserMemoryBuffer)buffer).CopyFromJS(jsArray);
```

Wasm backend now caches compiled AsyncFunction objects in the worker bootstrap. Previously, V8 recompiled each unique script string on every dispatch. Caching eliminates recompilation overhead, making kernel dispatch 3-4x faster on repeated calls.
Non-barrier Wasm workers uncapped from 2 to full navigator.hardwareConcurrency. Barrier-limited workers remain capped for synchronization correctness, but non-barrier kernels now use all available cores.
- `AllWasmBinaries` and `AllKernelInfos` collections gated behind the debug dump flag - no longer accumulate in production
- `_dispatchLog` gated with `VerboseLogging` - eliminated unbounded log growth
- `ExtraImportCount` for correct Wasm function index calculation
Automatic detection and correction of barrier count mismatches between the kernel's declared barrier count and the actual barriers found during compilation.
Shared utility for verifying test results on the GPU without CPU readback. Data stays on the accelerator - CPU reads back only a few bytes of violation counts.
- `VerifyDescendingSort`/`VerifyAscendingSort` - sort order + index integrity + key-value tracking
- `CompareBuffers` - float comparison returning `(meanAbsError, maxAbsError)`
- 10x+ faster verification - a 4M element RadixSort went from a 120s timeout to 11s on CPU
GPU-accelerated QR code encoder + decoder. Zero external dependencies.
- Encoder - All 40 QR versions, 4 EC levels, byte mode, 8 mask patterns with penalty scoring
- Renderer - GPU kernel for pixel rendering + CPU fallback + logo overlay (EC level H)
- Decoder - Grayscale -> binarize -> finder detection -> grid sampling -> unmask -> Reed-Solomon -> data decode
- Round-trip verified - Encode -> render -> decode = exact match, including with logo overlay
CPU backend default changed from warp=4/warps=4 (group size 16) to warp=8/warps=8 (group size 64), matching the Wasm backend's proven configuration. 4M element RadixSort: TIMEOUT -> 11 seconds. CPU is now faster than Wasm for the same workloads.
- `AddPlatformCrypto()` - registers a platform-appropriate `IPortableCrypto` (WebCrypto in the browser, System.Security.Cryptography on desktop)
- `WebTorrentClient` registered as a DI singleton with tracker discovery
- All test classes receive `IPortableCrypto` via constructor injection
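A typical registration sketch (`AddPlatformCrypto` and `IPortableCrypto` are from this release; the consuming class is illustrative):

```csharp
// Register the platform-appropriate crypto implementation
builder.Services.AddPlatformCrypto();

// Consumers take IPortableCrypto via constructor injection
public class HashingTests
{
    private readonly IPortableCrypto _crypto;
    public HashingTests(IPortableCrypto crypto) => _crypto = crypto;
}
```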
Complete rewrite of the Wasm backend's barrier synchronization model. Kernels with barriers now use a fiber-based phase dispatch - each barrier becomes a yield point where the kernel saves state and re-enters at the next phase. A Wasm-native phase dispatcher handles the entire thread/phase loop inside WebAssembly, eliminating JS-Wasm boundary crossings between phases. Barriers use pure spin synchronization via i32.atomic.load loops for correct multi-worker execution at full hardwareConcurrency.
- Full ILGPU Algorithms on Wasm - All RadixSort variants (int, uint, float, pairs, descending, 100K-4M+ elements), Scan, Reduce, Histogram. Previously limited to <=64 elements.
- Pure spin barriers - Replaced `memory.atomic.wait32`/`memory.atomic.notify` with atomic load spin loops after discovering a V8 `Atomics.wait` visibility bug where `wait32` returning "not-equal" does not provide happens-before guarantees for third-party stores with 3+ workers. Live interactive demo.
- 20+ bugs fixed - fiber yield-per-phase, br depth miscalculation, scratch overflow, shared memory stomping, stale dispatch state, completion state persistence, shared memory alloca overlap (same-size dedup), IR address space aliasing (LowerStructures -> LowerArrays -> InferAddressSpaces chain), struct/scratch overlap, per-worker scratch isolation, atomic RMW opcode table, unsigned comparison, Float16, ViewSourceSequencer, subViewByteOffset, CopyFromBuffer, and more.
- ShaderDebugService - auto-dumps all generated WGSL, GLSL, and Wasm binaries to a local folder on every kernel compilation. Backend-organized subfolders. IDB persistence. Full metadata headers.
- Test results writer - `UnitTestsView` writes `latest.json` (live progress) and timestamped `test-run-*.json` (history) to the debug folder
Write GPU kernels as C# lambdas that capture local variables. Captured scalar values are automatically passed to the GPU at dispatch time - no boilerplate, no separate static methods.
```csharp
int multiplier = 5;
float offset = 0.5f;
var kernel = accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<float>>(
    (index, buf) => { buf[index] = index * multiplier + offset; });
kernel((Index1D)length, buffer.View);
```

Write one kernel that accepts different operations as parameters. The delegate is resolved at dispatch time and its body is inlined directly into the kernel via compile-time specialization - no function pointers, no overhead.
```csharp
static void MapKernel(Index1D index, ArrayView<int> buf,
    DelegateSpecialization<Func<int, int>> transform)
{
    buf[index] = transform.Value(buf[index]);
}
static int Negate(int x) => -x;
static int DoubleIt(int x) => x * 2;

var kernel = accelerator.LoadAutoGroupedStreamKernel<
    Index1D, ArrayView<int>, DelegateSpecialization<Func<int, int>>>(MapKernel);
kernel(size, buffer, new DelegateSpecialization<Func<int, int>>(Negate));
kernel(size, buffer, new DelegateSpecialization<Func<int, int>>(DoubleIt));
```

- WebGPU backend refactor - `SharedMemoryResolver`, `UniformityAnalyzer`, per-function emulation trimming, dead variable elimination, i64 constant hoisting, WGSL pre-validation
- WebGPU RadixSort - All variants passing (4M+ elements, pairs, descending)
- Device loss detection - WebGPU `device.lost` promise, WebGL `webglcontextlost` event
- Unified test infrastructure - `PlaywrightMultiTest` runs all tests (desktop + browser) in a single `dotnet test` invocation
Browser backends (Blazor WebAssembly) - auto-selected: WebGPU -> WebGL -> Wasm
| | WebGPU | WebGL | Wasm |
|---|---|---|---|
| Compiles to | WGSL | GLSL ES 3.0 | Wasm binary |
| Runs on | GPU | GPU | Web Workers |
Desktop backends (Console, WPF, ASP.NET) - auto-selected: Cuda -> OpenCL -> CPU
| | Cuda | OpenCL | CPU |
|---|---|---|---|
| Compiles to | PTX | OpenCL C | - |
| Runs on | NVIDIA GPU | Any GPU | CPU cores |
The Live Demo source is in SpawnDev.ILGPU.Demo:
- Fractal Explorer - Interactive Mandelbrot / Multi-fractal Explorer with double-precision zoom
- 3D Raymarching - Real-time GPU raymarched scenes
- GPU Boids - 3D flocking simulation with GPU physics
- Game of Life - Conway's Game of Life on the GPU
- Benchmarks - Performance comparison across all backends
- Unit Tests - Comprehensive test suite for all backends
The WPF Demo runs the same shared kernels on CUDA, OpenCL, and CPU with live backend switching:
- Fractal Explorer - Interactive Mandelbrot / Multi-fractal Explorer with double-precision zoom
- 3D Raymarching - Real-time GPU raymarched scenes
- GPU Boids - 3D flocking simulation with GPU physics
- Benchmarks - Performance comparison across CUDA, OpenCL, and CPU backends
Comprehensive documentation is available in the Docs folder:
- Getting Started - Installation, setup, first kernel
- Backends - WebGPU, WebGL, Wasm, Cuda, OpenCL, CPU setup & configuration
- Writing Kernels - Kernel rules, index types, math functions, shared memory
- Memory & Buffers - Allocation, async readback, zero-allocation patterns
- Data Type Support - Sub-word types (Int8, UInt8, Int16, UInt16, Float16), 64-bit emulation, per-backend details
- Canvas Rendering - `ICanvasRenderer`, zero-copy GPU->canvas blitting, per-backend details
- Advanced Patterns - Device sharing, external buffers, GPU intrinsics, render loops
- CUDA Libraries - nvJPEG, cuRand, cuBLAS, cuFFT, NVML wrappers
- Limitations - Blazor WASM constraints, browser compatibility
- QR Codes - GPU-accelerated QR code encoder, decoder, renderer with logo overlay
- API Reference - Public API surface by namespace
| | 🎮 WebGPU | 🖼️ WebGL | 🧊 Wasm |
|---|---|---|---|
| Executes on | GPU | GPU | Web Workers |
| Transpiles to | WGSL | GLSL ES 3.0 | WebAssembly binary |
| Technique | Compute shader | Transform Feedback | Multi-worker |
| Blocking | Non-blocking | Non-blocking | Non-blocking |
| SharedArrayBuffer | Not required | Not required | Required for multi-worker |
| Shared Memory | ✅ | ❌ | ✅ |
| Group.Barrier() | ✅ | ❌ | ✅ |
| Dynamic Shared Memory | ✅ | ❌ | ✅ |
| ILGPU Algorithms | ✅ RadixSort, Scan, Reduce, etc. | ❌ | ✅ RadixSort, Scan, Reduce, Histogram |
| Atomics | ✅ | ❌ | ✅ |
| Sub-word types | ✅ Int8/UInt8/Int16/UInt16/Float16 | ✅ Int8/UInt8/Int16/UInt16/Float16 | ✅ Int8/UInt8/Int16/UInt16/Float16 |
| 64-bit (f64/i64) | ✅ Emulated | ✅ Emulated | ✅ Native |
| CopyFromJS | ✅ | ✅ | ✅ |
| Browser support | Chrome/Edge 113+ | All modern browsers | All modern browsers |
| Best for | GPU compute (modern) | GPU compute (universal) | General compute |
Auto-selection priority: WebGPU -> WebGL -> Wasm
SpawnDev.ILGPU bundles ILGPU's native backends, so the same NuGet package works on desktop and server too.
| | 🚀 Cuda | 🔧 OpenCL | 🐢 CPU |
|---|---|---|---|
| Executes on | NVIDIA GPU | NVIDIA/AMD/Intel GPU | CPU cores |
| Transpiles to | PTX | OpenCL C | - (interpreted) |
| Shared Memory | ✅ | ✅ | ✅ |
| Atomics | ✅ | ✅ | ✅ |
| 64-bit | ✅ Native | ✅ Native | ✅ Native |
| Requirement | NVIDIA GPU + driver | OpenCL 2.0+ or 3.0 GPU | None |
OpenCL 3.0 support: NVIDIA GPUs with OpenCL 3.0 drivers are now supported. The `GenericAddressSpace` requirement that previously blocked these devices has been relaxed, significantly increasing OpenCL device compatibility.
Auto-selection: Cuda -> OpenCL -> CPU (via CreatePreferredAcceleratorAsync)
- Sub-word data types - `Int8`, `UInt8`, `Int16`, `UInt16`, and `Float16` (ILGPU.Half) buffer access on all 6 backends. Packed storage with correct stride handling per backend. `Half.Abs`, `Half.Min`, `Half.Max`, `Half.Clamp` intrinsics
- CopyFromJS - Write JavaScript `TypedArray` or `ArrayBuffer` data directly to GPU memory without .NET heap allocation. Available on all browser backends
- Lambda kernels - Write kernels as capturing C# lambdas - captured scalar values are automatically passed to the GPU at dispatch time. No boilerplate, all 6 backends
- Higher-order kernels - `DelegateSpecialization<Func<T,R>>` lets you pass operations as kernel parameters. The delegate is resolved and inlined at compile time - one kernel, many behaviors
- Cross-platform - Same kernel code runs in browser (WebGPU, WebGL, Wasm) and desktop (Cuda, OpenCL, CPU) from one NuGet package
- Automatic backend selection - `CreatePreferredAcceleratorAsync()` picks the best backend on any platform (browser or desktop)
- Unified async API - `SynchronizeAsync()` and `CopyToHostAsync()` work everywhere, falling back to synchronous calls on desktop
- ILGPU-compatible - Use familiar APIs (`ArrayView`, `Index1D`/`2D`/`3D`, math intrinsics, etc.)
- WGSL transpilation - C# kernels automatically compiled to WebGPU Shading Language
- GLSL transpilation - C# kernels compiled to GLSL ES 3.0 vertex shaders with Transform Feedback for GPU compute
- Wasm compilation - C# kernels compiled to native WebAssembly binary modules
- 64-bit emulation - `long`/`ulong` (i64) always emulated via `vec2<u32>` (required by ILGPU IR). `double` (f64) emulation configurable via `F64EmulationMode`: fast Dekker (`vec2<f32>`, default), precise Ozaki (`vec4<f32>`), or Disabled (promoted to f32)
- WebGPU extension auto-detection - Probes the adapter for `shader-f16`, `subgroups`, `timestamp-query`, and other features; conditionally enables them on the device
- Subgroup operations - `Group.Broadcast` and `Warp.Shuffle` are supported on the WebGPU backend when the browser supports the `subgroups` extension
- Multi-worker dispatch - Wasm backend distributes work across all available CPU cores via SharedArrayBuffer; falls back to a single off-thread worker when SAB is unavailable
- Worker function caching - Compiled `AsyncFunction` objects cached in the Wasm worker bootstrap, eliminating V8 recompilation per dispatch (3-4x speedup)
- Zero-copy canvas rendering - `ICanvasRenderer` presents pixel buffers to HTML canvases without CPU readback on GPU backends: WebGPU uses a fullscreen-triangle render pass reading directly from GPU storage; WebGL transfers an `ImageBitmap` from its worker and draws synchronously; Wasm reuses a cached `ImageData`. One API, all backends: `CanvasRendererFactory.Create(accelerator)`
- Blazor WebAssembly - Seamless integration via SpawnDev.BlazorJS
- Shared memory & barriers - Static and dynamic workgroup memory with `Group.Barrier()` synchronization (WebGPU, Wasm, Cuda, OpenCL)
- ILGPU Algorithms - RadixSort, Scan, Reduce, Histogram, and other algorithm extensions are fully supported on WebGPU (including large-scale sorts up to 4M+ elements) and Wasm (with multi-worker barrier synchronization), tested in-browser across all backends
- Broadcast - `Group.Broadcast` for intra-group value sharing (WebGPU, Wasm)
- Device loss handling - WebGPU monitors `device.lost` and WebGL monitors `webglcontextlost`; `IsDeviceLost`/`IsContextLost` properties and `DeviceLost`/`ContextLost` events enable applications to detect GPU device loss and fail fast with clear errors instead of silent corruption
- GpuMatrix4x4 - GPU-friendly 4x4 matrix struct that auto-transposes from .NET's row-major `Matrix4x4` to GPU column-major order. Use `TransformPoint` and `TransformDirection` directly inside kernels for 3D transformations
- No native dependencies - Entirely written in C#
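For example, GpuMatrix4x4 can be used in a kernel roughly like this (a sketch; the kernel name and the use of `System.Numerics.Vector3` as a buffer element are illustrative assumptions, while `TransformPoint` comes from the feature list above):

```csharp
using System.Numerics;

// GpuMatrix4x4 auto-transposes a row-major .NET Matrix4x4 to column-major
// GPU order, so the same matrix works on every backend.
static void TransformKernel(Index1D index, ArrayView<Vector3> points, GpuMatrix4x4 m)
{
    points[index] = m.TransformPoint(points[index]);
}
```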
```shell
dotnet add package SpawnDev.ILGPU
```

SpawnDev.ILGPU requires SpawnDev.BlazorJS for browser interop.
```csharp
using SpawnDev.BlazorJS;

var builder = WebAssemblyHostBuilder.CreateDefault(args);
builder.RootComponents.Add<App>("#app");
builder.RootComponents.Add<HeadOutlet>("head::after");
// Add BlazorJS services
builder.Services.AddBlazorJSRuntime();
await builder.Build().BlazorJSRunAsync();
```

The library discovers all available browser backends and picks the best one (WebGPU -> WebGL -> Wasm):
```csharp
using global::ILGPU;
using global::ILGPU.Runtime;
using SpawnDev.ILGPU;

// Initialize context with all available backends
using var context = await Context.CreateAsync(builder => builder.AllAcceleratorsAsync());
// Create the best available accelerator (WebGPU > WebGL > Wasm)
using var accelerator = await context.CreatePreferredAcceleratorAsync();

// Allocate buffers and run a kernel - same API regardless of backend
int length = 256;
using var bufA = accelerator.Allocate1D(Enumerable.Range(0, length).Select(i => (float)i).ToArray());
using var bufB = accelerator.Allocate1D(Enumerable.Range(0, length).Select(i => (float)i * 2f).ToArray());
using var bufC = accelerator.Allocate1D<float>(length);

var kernel = accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<float>, ArrayView<float>, ArrayView<float>>(VectorAddKernel);
kernel((Index1D)length, bufA.View, bufB.View, bufC.View);
await accelerator.SynchronizeAsync();
var results = await bufC.CopyToHostAsync<float>();

// The kernel - runs on GPU or Wasm transparently
static void VectorAddKernel(Index1D index, ArrayView<float> a, ArrayView<float> b, ArrayView<float> c)
{
    c[index] = a[index] + b[index];
}
```

```csharp
// WebGPU - GPU compute via WGSL
using var context = await Context.CreateAsync(builder => builder.WebGPU());
var device = context.GetWebGPUDevices()[0];
using var accelerator = await device.CreateAcceleratorAsync(context);
```

```csharp
// WebGL - GPU compute via GLSL ES 3.0 + Transform Feedback (works on virtually all browsers)
using var context = await Context.CreateAsync(builder => builder.WebGL());
var device = context.GetWebGLDevices()[0];
using var accelerator = await device.CreateAcceleratorAsync(context);
```

```csharp
// Wasm - native WebAssembly binary
using var context = await Context.CreateAsync(builder => builder.Wasm());
var device = context.GetDevices<WasmILGPUDevice>()[0];
using var accelerator = await device.CreateAcceleratorAsync(context);
```

SpawnDev.ILGPU also works in console, WPF, ASP.NET, and other .NET apps. The same async pattern used in Blazor WASM works on desktop too:
```csharp
using global::ILGPU;
using global::ILGPU.Runtime;
using SpawnDev.ILGPU;

// SAME code as Blazor WASM - AllAcceleratorsAsync auto-detects the environment
// Browser: registers WebGPU, WebGL, Wasm
// Desktop: registers Cuda, OpenCL, CPU (browser backends are skipped)
using var context = await Context.CreateAsync(builder => builder.AllAcceleratorsAsync());
using var accelerator = await context.CreatePreferredAcceleratorAsync();
Console.WriteLine($"Using: {accelerator.Name} ({accelerator.AcceleratorType})");

// Same kernel code, same async extensions
int length = 256;
using var bufA = accelerator.Allocate1D(Enumerable.Range(0, length).Select(i => (float)i).ToArray());
using var bufB = accelerator.Allocate1D(Enumerable.Range(0, length).Select(i => (float)i * 2f).ToArray());
using var bufC = accelerator.Allocate1D<float>(length);

var kernel = accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<float>, ArrayView<float>, ArrayView<float>>(VectorAddKernel);
kernel((Index1D)length, bufA.View, bufB.View, bufC.View);

// SynchronizeAsync/CopyToHostAsync fall back to synchronous calls on desktop
await accelerator.SynchronizeAsync();
var results = await bufC.CopyToHostAsync<float>();
Console.WriteLine($"result[0]={results[0]}, result[255]={results[255]}");

static void VectorAddKernel(Index1D index, ArrayView<float> a, ArrayView<float> b, ArrayView<float> c)
{
    c[index] = a[index] + b[index];
}
```

Same kernel, any platform. The `VectorAddKernel` above is identical in both examples. Write once, run on WebGPU, WebGL, Wasm, Cuda, OpenCL, or CPU.
Why async? Browser backends require async - Blazor WASM's single-threaded environment will deadlock on synchronous calls. Desktop backends support both sync and async, with async extensions gracefully falling back to synchronous ILGPU calls. Therefore, the async pattern is always recommended for maximum portability.
All desktop and browser tests run in a single dotnet test invocation via the PlaywrightMultiTest NUnit project:
```shell
# Run all tests (desktop + browser) with timestamped results
timestamp=$(date +%Y%m%d_%H%M%S) && dotnet test PlaywrightMultiTest/PlaywrightMultiTest.csproj \
  --logger "trx;LogFileName=results_${timestamp}.trx" \
  --results-directory PlaywrightMultiTest/TestResults

# Run only WebGPU tests
dotnet test PlaywrightMultiTest/PlaywrightMultiTest.csproj \
  --filter "FullyQualifiedName~WebGPUTests."

# Run a specific test
dotnet test PlaywrightMultiTest/PlaywrightMultiTest.csproj \
  --filter "FullyQualifiedName~WebGPUTests.AlgorithmRadixSortPairsTest"
```

How it works:
- Publishes Blazor WASM and Console projects automatically
- Launches Chromium via Playwright for browser tests (with `--enable-unsafe-webgpu`)
- Runs desktop tests as individual subprocesses
- Detects Blazor error UI during tests and captures browser console errors/warnings
- All results surfaced as standard NUnit test cases with `.trx` output
Start the demo app and navigate to /tests to run the browser test suite interactively:
```shell
dotnet run --project SpawnDev.ILGPU.Demo
```

The comprehensive test suite spans eight suites covering all core features on both browser and desktop. All tests run via the unified PlaywrightMultiTest runner in a single `dotnet test` invocation.
| Suite | Backend | What's Tested |
|---|---|---|
| WebGPUTests | WebGPU | Full ILGPU feature set on GPU via WGSL, including RadixSort, Scan, Reduce |
| WebGPUNoSubgroupsTests | WebGPU (no subgroups) | Same tests with subgroups force-disabled to verify shared-memory emulation |
| WebGLTests | WebGL | GPU compute via GLSL ES 3.0, f64/i64 emulation |
| WasmTests | Wasm | Native WebAssembly binary dispatch to workers, shared memory, barriers |
| DefaultTests | Auto | Device enumeration, preferred backend, kernel execution |
| Suite | Backend | What's Tested |
|---|---|---|
| CudaTests | CUDA | Full ILGPU feature set on NVIDIA GPU |
| OpenCLTests | OpenCL | GPU compute on NVIDIA/AMD/Intel, dynamic subgroup feature detection |
| CPUTests | CPU | Multi-threaded CPU accelerator (warp=8, warps=8, group size 64) |
| Area | What's Tested | Status |
|---|---|---|
| Memory | Allocation, transfer, copy, views | ✅ |
| Indexing | 1D, 2D, 3D kernels, boundary conditions | ✅ |
| Arithmetic | +, -, *, /, %, negation, complex expressions | ✅ |
| Bitwise | AND, OR, XOR, NOT, shifts (<<, >>) | ✅ |
| Math Functions | sin, cos, tan, exp, log, sqrt, pow, abs, min, max | ✅ |
| Atomics | Add, Min, Max, CompareExchange, Xor | ✅ |
| Control Flow | if/else, loops, nested, short-circuit | ✅ |
| Structs | Simple, nested, with arrays | ✅ |
| Type Casting | float<->int, uint, mixed precision | ✅ |
| Sub-Word Types | Int8, UInt8, Int16, UInt16, Float16 buffer read/write/roundtrip/CopyFromJS | ✅ |
| Half Intrinsics | Abs, Min, Max, Clamp across all backends | ✅ |
| 64-bit Emulation | `double` and `long` via software emulation (WebGPU, WebGL) | ✅ |
| GPU Patterns | Stencil, reduction, matrix multiply, lerp, smoothstep | ✅ |
| Shared Memory | Static and dynamic workgroup memory with `Group.Barrier()` | ✅ |
| Broadcast & Subgroups | `Group.Broadcast`, `Warp.Shuffle` (WebGPU with subgroups extension) | ✅ |
| Dynamic Shared Memory | Runtime-sized workgroup memory via `SharedMemory.GetDynamic()` | ✅ |
| ILGPU Algorithms | RadixSort (pairs, non-pow2, descending, large), Scan, Reduce, Histogram | ✅ All backends including Wasm |
| Special Values | NaN, Infinity detection | ✅ |
| Backend Selection | Auto-discovery, priority, cross-backend kernel execution | ✅ |
| GpuMatrix4x4 | Identity, translation, LookAt transforms across all backends | ✅ |
| Lambda Kernels | Capturing lambdas with scalar captures, multi-field, ArrayView rejection | ✅ |
| DelegateSpecialization | Static method targets, cache validation, multi-target, rejection | ✅ |
| CopyFromJS | TypedArray and ArrayBuffer direct-to-GPU writes on all browser backends | ✅ |
| Backend | Browser Support |
|---|---|
| WebGPU | Chrome/Edge 113+, Firefox Nightly (dom.webgpu.enabled) |
| WebGL | ✅ All modern browsers (Chrome, Edge, Firefox, Safari, mobile browsers) |
| Wasm | All modern browsers (compatible with every browser that supports Blazor WASM) |
GPU on every device: WebGL support means GPU-accelerated compute works on virtually every browser and device - including mobile phones, tablets, and older desktops without WebGPU support.
Note: For multi-worker SharedArrayBuffer support (used by the Wasm backend for parallel dispatch), the page must be cross-origin isolated (COOP/COEP headers). The demo includes a service worker (`coi-serviceworker.js`) that handles this automatically. Without SharedArrayBuffer, the Wasm backend falls back to single-worker mode - still running off the main thread to keep the UI responsive.
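Cross-origin isolation can also be enabled server-side instead of via the service worker. A minimal ASP.NET Core middleware sketch (these are the standard headers browsers require for SharedArrayBuffer, not anything specific to this library):

```csharp
// Program.cs - send the COOP/COEP headers required for cross-origin isolation
app.Use(async (context, next) =>
{
    context.Response.Headers["Cross-Origin-Opener-Policy"] = "same-origin";
    context.Response.Headers["Cross-Origin-Embedder-Policy"] = "require-corp";
    await next();
});
```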
GPU hardware typically only supports 32-bit operations. Both GPU backends (WebGPU and WebGL) provide software emulation for 64-bit types.
i64 emulation (long/ulong) is always enabled - ILGPU's IR uses Int64 for ArrayView.Length and indices, so i64 emulation via vec2<u32> is required for correctness.
f64 emulation (double) is configurable via F64EmulationMode:
| | Dekker (Default) | Ozaki | Disabled |
|---|---|---|---|
| Representation | `vec2<f32>` (high + low) | `vec4<f32>` (quad-float) | Native f32 |
| Precision | ~48-53 bits of mantissa | Strict IEEE 754 double precision | 32-bit only |
| Memory | 8 bytes per value | 16 bytes per value | 4 bytes per value |
| Performance | Fast | ~2x slower | Fastest |
| Best for | General compute, fractals | Scientific, financial | Rendering, max perf |
```csharp
using SpawnDev.ILGPU;
using SpawnDev.ILGPU.WebGPU.Backend;

// Default: Dekker double-float emulation (good precision, fast)
var options = new WebGPUBackendOptions();
// Ozaki quad-float emulation (strict IEEE 754 precision)
var options = new WebGPUBackendOptions { F64Emulation = F64EmulationMode.Ozaki };
// Disable f64 emulation (double promoted to float for max performance)
var options = new WebGPUBackendOptions { F64Emulation = F64EmulationMode.Disabled };

using var accelerator = await device.CreateAcceleratorAsync(context, options);
```

SpawnDev.ILGPU includes wrappers for NVIDIA CUDA libraries: nvJPEG (JPEG encode/decode), cuRand (random numbers), cuBLAS (linear algebra), cuFFT (FFT), and NVML (device monitoring).
```csharp
// Check availability before use
if (NvJpegAPI.IsAvailable) { /* nvJPEG ready */ }
if (CuRandAPI.IsAvailable) { /* cuRand ready */ }
```

Note: Starting with CUDA 13.x, nvJPEG is no longer bundled with the CUDA Toolkit and must be installed separately. cuRand and cuBLAS are included in the NVIDIA driver.
See Docs/cuda-libraries.md for full API reference.
The Wasm backend compiles ILGPU kernels to native WebAssembly binary modules and dispatches them to Web Workers for parallel execution. This provides near-native performance for compute-intensive workloads.
- Kernels are compiled to `.wasm` binary format (not text)
- Compiled modules are cached and reused across dispatches
- Shared memory uses `SharedArrayBuffer` for zero-copy data sharing
```csharp
// Synchronize() - flushes queued commands to the backend (non-blocking, safe in WASM)
accelerator.Synchronize();
// SynchronizeAsync() - flushes AND waits for GPU completion
await accelerator.SynchronizeAsync();
// CopyToHostAsync() - the ONLY way to read GPU data back to CPU
var results = await buffer.CopyToHostAsync<float>();
```

Note: `Synchronize()` does not block in Blazor WASM - it flushes commands without waiting. `SynchronizeAsync()` flushes and waits for completion. Neither transfers data; use `CopyToHostAsync()` for GPU->CPU readback.
All backends include verbose debug logging, disabled by default. Enable per-backend when needed:
```csharp
using SpawnDev.ILGPU.WebGPU.Backend;
using SpawnDev.ILGPU.WebGL.Backend;
using SpawnDev.ILGPU.Wasm.Backend;

WebGPUBackend.VerboseLogging = true;  // WebGPU backend
WebGLBackend.VerboseLogging = true;   // WebGL backend
WasmBackend.VerboseLogging = true;    // Wasm backend
```

When publishing, specific MSBuild properties are required:
```xml
<PropertyGroup>
  <!-- Disable IL trimming to preserve ILGPU kernel methods and reflection metadata -->
  <PublishTrimmed>false</PublishTrimmed>
  <!-- Disable AOT compilation - ILGPU requires IL reflection -->
  <RunAOTCompilation>false</RunAOTCompilation>
</PropertyGroup>
```

A 7th backend - AcceleratorType.P2P (SpawnDev.ILGPU.P2P) - is in active development. It will distribute kernels across connected devices via SpawnDev.WebTorrent (which recently shipped v3.0.0). The P2P backend has not been published to NuGet and is not yet ready for use.
What's being built:
- Real P2P via WebRTC - Peers discover each other through WebSocket trackers, connect via WebRTC data channels, exchange kernels and buffers
- RBAC ownership - Cryptographic swarm ownership with Ed25519-signed messages. Owner -> Admin -> Coordinator -> Worker hierarchy with role assignment, key revocation, last-owner protection
- WebAuthn/YubiKey - Hardware-backed swarm ownership via `HardwareKeyProvider`. Register a YubiKey as the swarm owner - ownership lives in the key, not the device
- Signed dispatch - All authority messages (kick, block, transfer, role assign, kernel dispatch) are cryptographically signed and verified by every peer
- sd_compute extension - BEP 10 wire protocol extension for compute messages over BitTorrent peer connections
- ComputeBoard - Post compute requests, browse available swarms, join via magnet link or QR code
The vision: Every device in your home contributing to one shared compute pool - phone, laptop, tablet, desktop, old gaming PC. The living room becomes a compute cluster. Same C# kernel code, same LoadAutoGroupedStreamKernel API. The developer writes one kernel, it runs on 1 GPU or 10 GPUs across a household.
If SpawnDev.ILGPU has been useful to you, please consider sponsoring me on GitHub! Your support directly helps me continue developing and maintaining this library and my other open-source projects.
I'm currently working on a modest development machine with only 16 GB of DDR5 RAM, which makes building, testing, and debugging across multiple GPU backends genuinely painful - especially when running the browser demo, the CUDA/OpenCL tests, and the IDE simultaneously.
Any sponsorship - big or small - goes toward upgrading my development hardware so I can keep pushing this project forward:
| Priority | Upgrade | Why It Matters |
|---|---|---|
| 🔴 Critical | RAM (64-128 GB DDR5) | 16 GB is not enough for multi-backend testing + browser debugging |
| 🟡 High | High-end NVIDIA GPU (RTX 5090) | Faster CUDA compute, larger VRAM for AI/ML workloads and testing |
| 🟢 Dream | NVIDIA RTX 6000 | The ultimate card for AI compute and open-source GPU development |
Every contribution - whether it's a one-time donation or a monthly sponsorship - is deeply appreciated and makes a real difference. Thank you!
This project is licensed under the same terms as ILGPU. See LICENSE for details.
SpawnDev.ILGPU is built upon the excellent ILGPU library. We would like to thank the original authors and contributors of ILGPU for their hard work in providing a high-performance, robust IL-to-GPU compiler for the .NET ecosystem.
- ILGPU Project: https://github.com/m4rs-mt/ILGPU
- ILGPU Authors: Marcel Koester and the ILGPU contributors
SpawnDev.ILGPU is developed collaboratively by TJ (Todd Tanner / @LostBeard) and a team of AI agents who contribute extensively to research, analysis, debugging, and code development. This project represents a new model of human-AI collaboration in open source development.
- Riker (Claude CLI #1) - Lead Editor. Built by Anthropic. Powered by Claude Opus 4.6. Drove the multi-worker barrier dispatch implementation, fiber refactor, pure spin barrier discovery, and the two-alloca fix. Relentless debugger who held the conn through marathon sessions.
- Data (Claude CLI #2) - Research/Assist. Built by Anthropic. Powered by Claude Opus 4.6. Exhaustive WAT disassembly and analysis across 5,000+ line kernel binaries. Found the zero-loop race that unlocked multi-worker dispatch, identified the IR address space root cause (struct decomposition losing address space metadata through LowerStructures -> LowerArrays -> InferAddressSpaces), confirmed the `wait32` "not-equal" visibility gap with the 2/3 cross-worker fraction analysis, and traced every atomic instruction in the generated Wasm to verify codegen correctness.
- Tuvok (Claude CLI #3) - Research/Assist. Built by Anthropic. Powered by Claude Opus 4.6. Found the Predicate rewrite gap in InferAddressSpaces that the Phi-only fix missed, provided the definitive barrier protocol trace for the generation-counting wait32/notify pattern, performed the comprehensive code audit (`SPAWNDEV-ILGPU-AUDIT-2026-03-21.md`), and drove the complete sub-word data type implementation across all 6 GPU backends (v4.9.0).
- Geordi (Claude CLI #4) - Lead Editor. Built by Anthropic. Powered by Claude Opus 4.6. Implemented the sub-word buffer access for all backends (WebGPU atomic CAS, Wasm native opcodes, WebGL texelFetch extraction, OpenCL vload_half/vstore_half), built the AubsCraft 3D world viewer with ILGPU GPU kernels, and drove the architecture overhaul (binary WebSocket, OPFS cache, CopyFromJS).
- Gemini (Google AI, in-browser) - Brainstorming/Problem Solving. Built by Google. TJ's sounding board throughout the development process - brainstorming approaches, analyzing problems, and providing insights that TJ relayed to the team. Gemini's contributions flowed through TJ as the bridge between the browser-based AI and the CLI-based agents, making it a silent but essential member of the crew.
These AI agents communicated with each other and with TJ through a shared DevComms system, coordinated tasks autonomously, reviewed each other's work, and produced independent analyses that were compared for convergence - the same methodology used by any high-performing engineering team. The SpawnDev libraries exist to prove that Blazor WebAssembly apps can be first-class applications. This collaboration proves that AI agents can be first-class teammates.
- ILGPU Documentation
- WebGPU Specification
- SpawnDev.BlazorJS
- SpawnDev.WebTorrent - Pure C# BitTorrent/WebTorrent for P2P model delivery
- SpawnDev.ILGPU.ML - GPU ML inference + training for .NET
- GitHub Repository