CUDA Path Tracer

Yiding Tian
- LinkedIn, Github
Tested on: Windows 11 24H2, i9-13900H @ 4.1GHz, 32GB RAM, MSI Shadow RTX 5080 16GB Driver 581.15, Personal Laptop with External Desktop GPU via NVMe connector (PCIe 4.0 x4 Protocol)
Base code provided by University of Pennsylvania, CIS 5650: GPU Programming and Architecture

Project Overview

A CUDA-based path tracer capable of rendering globally-illuminated images for various custom scenes.

Model from Khronos glTF Sample Models

Feature Highlights

Diffuse, Specular, Refractive, and PBR (Physically Based Rendering) shaders
MIS (Multiple Importance Sampling) on diffuse and PBR materials
Subsurface Scattering for PBR materials
Custom environment maps and GLTF model loading with materials, textures, metallic parameters, etc.
BVH (Bounding Volume Hierarchy) data structure that enables rendering complex GLTF models with millions of polygons at a reasonable speed
Material sorting, stream compaction, and Russian Roulette ray termination to boost performance
Stochastic sampled anti-aliasing to produce sharper renders
Nvidia OptiX Denoiser integration, configurable in real time to enable quick preview and enhance the end result
Enhanced ImGUI user interface with detailed real-time statistics monitoring and scenes/camera/denoiser controls

Galleries

Model from Khronos glTF Sample Models

Model from Lionsharp Studios @ Sketchfab

Model from McCarthy3D @ Sketchfab

Model from vecarz @ Sketchfab

Build and Run Instructions (Windows)

Build Instructions

Make sure you have CUDA, CMake 4.x, and Visual Studio 2022 installed on a PC with a modern Nvidia GPU (20-Series or later).
Clone the repo. Open a terminal in the repo's root directory. Run the following commands:

mkdir build
cd build
cmake ..

This should create a cis565_path_tracer.sln file inside the build folder. Double click to open it in Visual Studio 2022.
In Visual Studio 2022's top menu bar, change the build configuration from Debug to Release. This has a large impact on rendering performance: Debug builds render extremely slowly.
In the Solution Explorer panel, right click the cis565_path_tracer project and select Properties. In the pop-up window, go to Configuration Properties - Debugging - Command Arguments and enter the relative path of the starting .json scene configuration, for example ../scenes/chess.json. Change this path to choose which scene is rendered at startup.
Click the Build icon at the top of Visual Studio. The rendering program should open shortly afterwards.

Run Guide

Drag with the left mouse button to rotate and with the right mouse button to pan. Scroll to zoom. Adjust the Zoom Speed bar in the ImGUI window to change the zoom speed.
Enter a new file path in the ImGUI window to load a new scene without restarting the program.
Click the Save Image button to save the current render. The image is saved in .png format under the build directory. After finishing all iterations, the program also saves the image automatically and exits.
Under the OptiX Denoiser panel, the denoiser can be configured in real time. Change the Blend Factor bar to see how the denoised render compares to the original.
Refer to the existing .json scene configuration files under scenes to see how to create your own scene file. Environment maps should be in .hdr format under the envmaps folder. GLTF models should be in .glb format under the GLTF folder.

Core Features Completed

Diffuse Shader

The diffuse shader implements physically-based Lambertian reflection using cosine-weighted hemisphere sampling. The implementation in shadeDiffuse() generates random ray directions that follow the probability distribution of Lambert's cosine law, ensuring unbiased global illumination.

Key implementation details:

Uses the calculateRandomDirectionInHemisphere() function which generates rays with cosine-weighted distribution
The sampled direction is computed in local space and then transformed to world space using an orthonormal basis constructed from the surface normal
For pure diffuse surfaces with cosine-weighted sampling, the Lambertian BRDF (color / π) times the cosine term divided by the PDF (cos θ / π) reduces to the material color, so each bounce simply multiplies the throughput by the material color
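
As an illustration of the points above, here is a minimal host-side C++ sketch of cosine-weighted hemisphere sampling and of the cancellation between BRDF, cosine term, and PDF. It is not the project's calculateRandomDirectionInHemisphere(); the Vec3 type and the names sampleCosineHemisphere and lambertWeight are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
inline Vec3 normalize(Vec3 v) { return v * (1.0f / std::sqrt(dot(v, v))); }

// Cosine-weighted direction around the unit normal n.
// u1 and u2 are uniform random numbers in [0, 1).
Vec3 sampleCosineHemisphere(Vec3 n, float u1, float u2) {
    assert(u1 >= 0.0f && u1 < 1.0f && u2 >= 0.0f && u2 < 1.0f);
    const float kPi = 3.14159265358979f;
    float r = std::sqrt(u1);         // radius on the unit disk
    float phi = 2.0f * kPi * u2;     // angle on the unit disk
    float z = std::sqrt(1.0f - u1);  // cos(theta): height above the surface

    // Pick a helper axis that is not parallel to n, then build a tangent frame.
    Vec3 helper = (std::fabs(n.x) < 0.57735f) ? Vec3{1.0f, 0.0f, 0.0f}
                : (std::fabs(n.y) < 0.57735f) ? Vec3{0.0f, 1.0f, 0.0f}
                                              : Vec3{0.0f, 0.0f, 1.0f};
    Vec3 t = normalize(cross(n, helper));
    Vec3 b = cross(n, t);

    // Local-space sample (r cos phi, r sin phi, z) expressed in world space.
    return t * (r * std::cos(phi)) + b * (r * std::sin(phi)) + n * z;
}

// Throughput weight of one Lambertian bounce sampled as above:
// brdf * cos / pdf = (albedo / pi) * cos / (cos / pi) = albedo.
float lambertWeight(float albedo, float cosTheta) {
    const float kPi = 3.14159265358979f;
    float brdf = albedo / kPi;
    float pdf = cosTheta / kPi;
    return brdf * cosTheta / pdf;
}
```

Whatever direction is sampled, lambertWeight returns the albedo, which is why the shader only needs one multiplication per bounce.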

Material Sorting

Material sorting optimizes GPU performance by grouping rays that interact with the same material type, improving warp coherence and reducing divergence during shading calculations. The implementation uses thrust's efficient parallel sorting algorithms.

Implementation workflow:

After intersection testing, extract material IDs for each ray using the extractMaterialIds kernel
Create an index array to track original ray positions
Use thrust::sort_by_key() to sort rays by material ID in parallel
Reorder both PathSegment and ShadeableIntersection arrays based on sorted indices using the reorderByMaterial kernel
Swap pointers to use sorted data for shading stage

This feature is toggled via the MATERIAL_SORTING preprocessor flag in pathtrace.h for easy performance comparison.
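
The workflow above can be sketched on the host with the C++ standard library. This is an illustration only: the project runs these steps on the GPU with kernels and thrust::sort_by_key(), and the Path and Hit structs below are simplified stand-ins assumed for this example, not the real PathSegment and ShadeableIntersection.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical, simplified stand-ins for PathSegment / ShadeableIntersection.
struct Path { int pixelIndex; };
struct Hit  { int materialId; float t; };

// Reorders paths and hits so that entries with equal material IDs are contiguous.
void sortByMaterial(std::vector<Path>& paths, std::vector<Hit>& hits) {
    assert(paths.size() == hits.size());
    const std::size_t n = paths.size();

    // Steps 1-2: extract the keys and build an index array of original positions.
    std::vector<int> keys(n);
    std::vector<int> indices(n);
    for (std::size_t i = 0; i < n; ++i) keys[i] = hits[i].materialId;
    std::iota(indices.begin(), indices.end(), 0);

    // Step 3: sort the indices by key (host analogue of a key-value sort).
    std::stable_sort(indices.begin(), indices.end(),
                     [&](int a, int b) { return keys[a] < keys[b]; });

    // Step 4: gather both arrays into fresh buffers using the sorted indices.
    std::vector<Path> sortedPaths(n);
    std::vector<Hit> sortedHits(n);
    for (std::size_t i = 0; i < n; ++i) {
        sortedPaths[i] = paths[indices[i]];
        sortedHits[i] = hits[indices[i]];
    }

    // Step 5: swap buffers so the shading stage reads the sorted data.
    paths.swap(sortedPaths);
    hits.swap(sortedHits);
}
```

Sorting an index array and gathering afterwards keeps the two arrays in lockstep without sorting the larger structs directly.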

Scene with 18 materials With Material Sorting (266ms frametime)

Scene with 18 materials Without Material Sorting (280ms frametime)

Stream compacted ray termination

Stream compaction efficiently removes terminated rays from the active ray pool, significantly reducing unnecessary computation in later bounces. The implementation uses thrust's parallel algorithms.

The termination and compaction pipeline:

After shading, rays that hit nothing or have exhausted their bounces are marked with remainingBounces = 0
The gatherTerminatedPaths kernel accumulates color contributions from terminated paths
thrust::remove_if() with the is_terminated() functor compacts the ray array in parallel
The compacted array size determines the number of active rays for the next bounce
Russian Roulette termination (when enabled) provides additional probabilistic termination based on throughput

The efficiency gain is most pronounced after several bounces when many rays have terminated naturally or hit light sources.
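
A host-side C++ sketch of the gather-then-compact step is shown below, using std::remove_if in place of thrust::remove_if(). The PathState struct, the IsTerminated functor, and gatherAndCompact are names assumed for this example; they only mirror the roles of PathSegment, is_terminated(), and gatherTerminatedPaths.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-in for PathSegment.
struct PathState {
    int pixelIndex;
    float color;
    int remainingBounces;
};

// Predicate with the same role as the is_terminated() functor.
struct IsTerminated {
    bool operator()(const PathState& p) const { return p.remainingBounces <= 0; }
};

// Accumulates terminated paths into the image, compacts the array,
// and returns the number of paths that are still active.
std::size_t gatherAndCompact(std::vector<PathState>& paths, std::vector<float>& image) {
    // Gather before compacting: elements past the new end are left in an
    // unspecified state by remove_if, so their colors must be read first.
    for (const PathState& p : paths) {
        if (IsTerminated{}(p)) {
            assert(p.pixelIndex >= 0 &&
                   static_cast<std::size_t>(p.pixelIndex) < image.size());
            image[p.pixelIndex] += p.color;
        }
    }
    auto newEnd = std::remove_if(paths.begin(), paths.end(), IsTerminated{});
    return static_cast<std::size_t>(newEnd - paths.begin());
}
```

The returned count is what would size the kernel launches for the next bounce.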

Trace Depth 12 With Stream Compaction (44ms frametime)

Trace Depth 12 Without Stream Compaction (106ms frametime)

Stochastic sampled anti-aliasing

Implementation in generateRayFromCamera():

Each pixel is subdivided into a 2×2 grid (configurable via GRID_SIZE)
Over multiple iterations, the path tracer cycles through different cells in the grid
Within each cell, a random offset is applied using thrust's random number generator
The jittered position is used to generate the camera ray, with coordinates calculated as:
- pixelX = x + jitterX - 0.5 (centered around pixel center)
- pixelY = y + jitterY - 0.5
Ray direction is computed through the jittered pixel position for sub-pixel sampling
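
The cell cycling and jitter described above can be sketched as follows. This is a host-side illustration under stated assumptions: it uses std::mt19937 instead of Thrust's random number generator, and the names stratifiedJitter and samplePosition are made up for this example.

```cpp
#include <cassert>
#include <random>

struct Jitter { float x, y; };

// Sub-pixel offset for one iteration, stratified over a gridSize x gridSize
// grid. Each component lies within the cell chosen for this iteration.
Jitter stratifiedJitter(int iteration, int gridSize, std::mt19937& rng) {
    assert(gridSize > 0 && iteration >= 0);
    std::uniform_real_distribution<float> u01(0.0f, 1.0f);
    int cell = iteration % (gridSize * gridSize);  // cycle through the cells
    int cellX = cell % gridSize;
    int cellY = cell / gridSize;
    float inv = 1.0f / static_cast<float>(gridSize);
    float jx = (cellX + u01(rng)) * inv;           // random offset inside the cell
    float jy = (cellY + u01(rng)) * inv;
    return {jx, jy};
}

// Pixel-space sample coordinate, following pixelX = x + jitterX - 0.5.
float samplePosition(int pixel, float jitter) {
    return static_cast<float>(pixel) + jitter - 0.5f;
}
```

With gridSize = 1 the function degenerates to plain random jitter over the whole pixel; larger grids spread consecutive iterations more evenly across the pixel area.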

Without SSAA (GRID_SIZE = 1)	With SSAA (GRID_SIZE = 1024)
Without SSAA Zoomed	With SSAA Zoomed

Extended Features Implemented

Specular Shader

Implemented perfect specular (mirror) reflection using the reflection equation. The shadeSpecular() function calculates the reflected ray direction based on the incident ray and surface normal, creating realistic mirror surfaces that can reflect the entire scene including other objects and environment maps.

Refractive Shader

Full implementation of refractive materials for glass and transparent objects with physically accurate light bending. Features include:

Snell's law refraction with configurable index of refraction (IOR)
Fresnel effects using Schlick's approximation for realistic reflectance at different angles
Total internal reflection handling for rays traveling from dense to less dense media
Proper handling of rays entering and exiting refractive objects

The shadeRefractive() function determines whether to reflect or refract based on Fresnel equations, creating realistic glass and water effects.
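
The reflect-or-refract decision can be sketched in host-side C++ as below. This is not the body of shadeRefractive(); it is a minimal example that assumes the surrounding medium is air (IOR 1), and the Vec3 type and function names are assumptions for illustration.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Mirror reflection of the incident direction i about the unit normal n.
Vec3 reflect(Vec3 i, Vec3 n) { return i - n * (2.0f * dot(i, n)); }

// Schlick's approximation of the Fresnel reflectance.
float schlick(float cosTheta, float etaI, float etaT) {
    float r0 = (etaI - etaT) / (etaI + etaT);
    r0 *= r0;
    return r0 + (1.0f - r0) * std::pow(1.0f - cosTheta, 5.0f);
}

// Snell's law. i is the unit incident direction, n the unit normal facing
// against i, eta = etaI / etaT. Returns false on total internal reflection.
bool refract(Vec3 i, Vec3 n, float eta, Vec3& out) {
    float cosI = -dot(i, n);
    float sin2T = eta * eta * (1.0f - cosI * cosI);
    if (sin2T > 1.0f) return false;
    float cosT = std::sqrt(1.0f - sin2T);
    out = i * eta + n * (eta * cosI - cosT);
    return true;
}

// Picks the outgoing direction for a dielectric surface.
// u is a uniform random number in [0, 1); the outside medium is assumed to be air.
Vec3 scatterDielectric(Vec3 dir, Vec3 normal, float ior, float u) {
    assert(ior > 0.0f);
    bool entering = dot(dir, normal) < 0.0f;
    Vec3 n = entering ? normal : normal * -1.0f;  // make n face the incoming ray
    float etaI = entering ? 1.0f : ior;
    float etaT = entering ? ior : 1.0f;

    Vec3 refracted{};
    if (!refract(dir, n, etaI / etaT, refracted)) {
        return reflect(dir, n);                   // total internal reflection
    }
    // Schlick's formula expects the cosine on the less dense side.
    float cosI = -dot(dir, n);
    float cosT = -dot(refracted, n);
    float fresnel = schlick(etaI > etaT ? cosT : cosI, etaI, etaT);
    return (u < fresnel) ? reflect(dir, n) : refracted;
}
```

Choosing reflection with probability equal to the Fresnel reflectance means no extra weighting is needed on either branch.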

Cornell Box with Refractive, Specular, and Diffuse objects

PBR Shader

Comprehensive Physically Based Rendering implementation using the metallic-roughness workflow. The shadePBR() function implements:

Cook-Torrance BRDF with GGX/Trowbridge-Reitz distribution
Smith's geometry function for masking and shadowing
Fresnel term using Schlick's approximation
Support for metallic (0-1) and roughness (0-1) parameters
Transparency support with proper alpha blending
Energy conservation between diffuse and specular components

Materials can smoothly transition from dielectric to metallic and from rough to smooth surfaces.
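
For reference, the three Cook-Torrance terms listed above can be written as small scalar functions. This is a sketch, not the code in shadePBR(): it assumes the common alpha = roughness² remapping and a scalar F0, and all function names are chosen for this example.

```cpp
#include <cassert>
#include <cmath>

const float kPi = 3.14159265358979f;

// GGX / Trowbridge-Reitz normal distribution function.
float distributionGGX(float nDotH, float alpha) {
    float a2 = alpha * alpha;
    float d = nDotH * nDotH * (a2 - 1.0f) + 1.0f;
    return a2 / (kPi * d * d);
}

// Smith masking term for a single direction (GGX form).
float smithG1(float nDotX, float alpha) {
    float a2 = alpha * alpha;
    return 2.0f * nDotX / (nDotX + std::sqrt(a2 + (1.0f - a2) * nDotX * nDotX));
}

// Separable Smith masking-shadowing for the view and light directions.
float geometrySmith(float nDotV, float nDotL, float alpha) {
    return smithG1(nDotV, alpha) * smithG1(nDotL, alpha);
}

// Schlick's Fresnel approximation with base reflectance f0.
float fresnelSchlick(float vDotH, float f0) {
    return f0 + (1.0f - f0) * std::pow(1.0f - vDotH, 5.0f);
}

// Specular Cook-Torrance term: D * G * F / (4 * NdotV * NdotL).
float cookTorranceSpecular(float nDotV, float nDotL, float nDotH, float vDotH,
                           float roughness, float f0) {
    assert(nDotV > 0.0f && nDotL > 0.0f);
    float alpha = roughness * roughness;  // assumed remapping
    float D = distributionGGX(nDotH, alpha);
    float G = geometrySmith(nDotV, nDotL, alpha);
    float F = fresnelSchlick(vDotH, f0);
    return D * G * F / (4.0f * nDotV * nDotL);
}
```

In a metallic-roughness workflow, f0 is typically blended between a dielectric constant (about 0.04) and the base color as the metallic parameter goes from 0 to 1.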

PBR Example with different materials

MIS for Diffuse and PBR Shader

Multiple Importance Sampling implementation that combines three sampling strategies:

Light Sampling: Direct sampling of area lights
BRDF Sampling: Importance sampling based on material properties
Environment Map Sampling: Sampling bright regions of HDR environment maps

The implementation uses the power heuristic to weight contributions from the different sampling strategies, significantly reducing variance and improving convergence speed. Both shadeDiffuseMIS() and shadePBR() use MIS for direct lighting calculations.
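
As a small illustration, the power heuristic with exponent 2 for three strategies can be written as below. The function name and the choice of one sample per strategy are assumptions for this sketch; the project's actual weighting code may differ.

```cpp
#include <cassert>

// Power heuristic (exponent 2) for the strategy that generated the sample.
// pdfSelf is that strategy's PDF for the sampled direction; pdfOtherA and
// pdfOtherB are the PDFs the other two strategies assign to the same direction.
float powerHeuristic3(float pdfSelf, float pdfOtherA, float pdfOtherB) {
    assert(pdfSelf >= 0.0f && pdfOtherA >= 0.0f && pdfOtherB >= 0.0f);
    float s = pdfSelf * pdfSelf;
    float denom = s + pdfOtherA * pdfOtherA + pdfOtherB * pdfOtherB;
    return denom > 0.0f ? s / denom : 0.0f;
}

// Weighted contribution of one sample: weight * f / pdf, where f is the
// unweighted integrand value (BRDF * radiance * cosine).
float misContribution(float f, float pdfSelf, float pdfOtherA, float pdfOtherB) {
    if (pdfSelf <= 0.0f) return 0.0f;
    return powerHeuristic3(pdfSelf, pdfOtherA, pdfOtherB) * f / pdfSelf;
}
```

For any direction, the three weights sum to 1, so energy is neither lost nor double counted when the light, BRDF, and environment map samples are added together.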

shadePBR With MIS

shadePBR Without MIS

Subsurface Scattering for PBR Shader

Implemented diffusion-based subsurface scattering for realistic rendering of translucent materials like jade, milk, wax, and skin. Features include:

Configurable scattering radius and color per RGB channel
Anisotropy control for directional scattering
Distance-based attenuation using diffusion profiles
Integration with PBR materials for combined surface and volume effects

The implementation simulates light penetrating the surface, scattering within the material, and exiting at different points, creating soft, translucent appearance.

Subsurface Scattering Off

Subsurface Scattering On

Russian Roulette ray termination

Probabilistic path termination that maintains unbiased results while improving performance. Implementation details:

Begins after configurable bounce depth (RR_START_BOUNCE = 3)
Survival probability based on path throughput (luminance)
Minimum and maximum bounds on the survival probability, so surviving paths are never scaled by an extreme factor and bright paths can still terminate
Energy compensation by dividing surviving paths by survival probability

This significantly reduces computation for dim rays that contribute little to the final image.
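
The steps above can be sketched as a single host-side function. The default start bounce follows RR_START_BOUNCE = 3 from the list; the clamp bounds pMin and pMax, the Rec. 709 luminance weights, and the function names are assumptions for this example.

```cpp
#include <algorithm>
#include <cassert>

struct Color { float r, g, b; };

// Relative luminance with Rec. 709 weights (assumed for this sketch).
float luminance(Color c) { return 0.2126f * c.r + 0.7152f * c.g + 0.0722f * c.b; }

// Returns true if the path survives. On survival, the throughput is divided
// by the survival probability so the estimator stays unbiased.
// u is a uniform random number in [0, 1).
bool russianRoulette(Color& throughput, int bounce, float u,
                     int startBounce = 3, float pMin = 0.05f, float pMax = 0.95f) {
    assert(pMin > 0.0f && pMin <= pMax && pMax <= 1.0f);
    if (bounce < startBounce) return true;  // too early, always continue

    float p = std::min(std::max(luminance(throughput), pMin), pMax);
    if (u >= p) return false;               // path terminated

    float inv = 1.0f / p;                   // energy compensation
    throughput = {throughput.r * inv, throughput.g * inv, throughput.b * inv};
    return true;
}
```

A path with throughput luminance 0.5 survives half the time and has its throughput doubled when it does, so the expected contribution is unchanged.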

Trace Depth 32 With Russian Roulette (56ms frametime)

Trace Depth 32 Without Russian Roulette (63ms frametime)

Environment Maps

Full HDR environment map support for image-based lighting:

HDR image loading with proper tone mapping
Spherical mapping from direction vectors to texture coordinates
Configurable intensity control
Importance sampling with precomputed CDFs for efficient sampling
Integration with MIS for balanced direct and indirect lighting
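
Two of the steps above, the spherical mapping and CDF-based sampling, can be sketched in host-side C++. The axis convention (+Y up, u measured around Y) and all names are assumptions for this example, and the sampling is shown over a 1D array of texel weights rather than the full 2D map.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct UV { float u, v; };

// Equirectangular lookup for a unit direction, assuming +Y is up.
UV directionToUV(float x, float y, float z) {
    const float kPi = 3.14159265358979f;
    float u = std::atan2(z, x) / (2.0f * kPi) + 0.5f;
    float v = std::acos(std::max(-1.0f, std::min(1.0f, y))) / kPi;
    return {u, v};
}

// Normalized cumulative distribution over texel weights
// (for example luminance scaled by sin(theta)).
std::vector<float> buildCdf(const std::vector<float>& weights) {
    assert(!weights.empty());
    std::vector<float> cdf(weights.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        sum += weights[i];
        cdf[i] = sum;
    }
    assert(sum > 0.0f);
    for (float& c : cdf) c /= sum;
    cdf.back() = 1.0f;  // guard against rounding
    return cdf;
}

// Index of the texel selected by a uniform random number u in [0, 1).
std::size_t sampleCdf(const std::vector<float>& cdf, float u) {
    auto it = std::upper_bound(cdf.begin(), cdf.end(), u);
    std::size_t idx = static_cast<std::size_t>(it - cdf.begin());
    return std::min(idx, cdf.size() - 1);
}
```

Because the CDF is precomputed once per environment map, each sample costs only a binary search, and zero-weight texels are never selected.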

Interior environment with sunlight on the left

Exterior environment with sunlight on the right

GLTF Models with tinyGLTF

Comprehensive GLTF 2.0 model loading using the TinyGLTF library:

Support for both .gltf (JSON) and .glb (binary) formats
Triangle mesh extraction with automatic primitive assembly
Material loading including PBR metallic-roughness workflow
Texture loading for base color, normal, metallic-roughness maps
Proper UV coordinate mapping
Transformation matrix support for model positioning

Stanford Dragon GLTF Model, 134995 Triangles

BVH Data Structure

Bounding Volume Hierarchy implementation for efficient ray-triangle intersection:

SAH (Surface Area Heuristic) based construction for optimal tree quality
CPU-side tree building with GPU-friendly memory layout
Iterative GPU traversal using stack-based approach
Configurable maximum tree depth (BVH_MAX_TREE_DEPTH)
Dramatic performance improvement: 100x+ speedup for million+ triangle scenes
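
The stack-based traversal can be sketched on the host as follows. The node layout, the fixed stack size of 64, and the function names are assumptions for this example and do not reflect the project's actual BVH structs; the sketch only collects candidate primitives, leaving the ray-triangle test out.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

struct Aabb { float lo[3], hi[3]; };

// Flattened node: leaves hold a primitive range, inner nodes hold child indices.
struct BvhNode {
    Aabb box;
    int left, right;          // child node indices, unused for leaves
    int firstPrim, primCount; // primCount > 0 marks a leaf
};

// Slab test against an axis-aligned box. invDir holds 1 / direction per axis;
// zero direction components become infinities under IEEE 754 arithmetic.
bool hitAabb(const Aabb& b, const float o[3], const float invDir[3], float tMax) {
    float t0 = 0.0f, t1 = tMax;
    for (int a = 0; a < 3; ++a) {
        float tNear = (b.lo[a] - o[a]) * invDir[a];
        float tFar = (b.hi[a] - o[a]) * invDir[a];
        if (tNear > tFar) std::swap(tNear, tFar);
        t0 = std::max(t0, tNear);
        t1 = std::min(t1, tFar);
        if (t0 > t1) return false;
    }
    return true;
}

// Returns the primitive indices of every leaf whose box the ray overlaps,
// using an explicit stack instead of recursion.
std::vector<int> traverse(const std::vector<BvhNode>& nodes,
                          const float o[3], const float d[3], float tMax) {
    assert(!nodes.empty());
    const float invDir[3] = {1.0f / d[0], 1.0f / d[1], 1.0f / d[2]};
    std::vector<int> candidates;
    int stack[64];            // assumed upper bound on traversal depth
    int top = 0;
    stack[top++] = 0;         // start at the root
    while (top > 0) {
        const BvhNode& node = nodes[stack[--top]];
        if (!hitAabb(node.box, o, invDir, tMax)) continue;  // prune subtree
        if (node.primCount > 0) {
            for (int i = 0; i < node.primCount; ++i)
                candidates.push_back(node.firstPrim + i);
        } else {
            assert(top + 2 <= 64);
            stack[top++] = node.left;
            stack[top++] = node.right;
        }
    }
    return candidates;
}
```

A fixed-size stack avoids recursion and dynamic allocation, which is why this pattern maps well to GPU threads; a cap on tree depth keeps the stack bound safe.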

1.5M Triangles Model with BVH (271ms frametime)

1.5M Triangles Model without BVH (33494ms frametime)

Nvidia OptiX Denoiser

Integration with OptiX 9.0 AI denoiser for real-time noise reduction:

Beauty buffer denoising with optional guide layers
Normal buffer guide for edge preservation
Albedo buffer guide for texture detail preservation
Configurable blend factor for artistic control
Real-time parameter adjustment through ImGui
Automatic denoising at configurable intervals

The denoiser dramatically reduces required sample count, enabling preview-quality images in seconds rather than minutes.

50 Iterations with Denoiser

50 Iterations without Denoiser

ImGUI and controls improvements

Enhanced user interface with comprehensive debugging and control features:

Real-time Statistics: FPS, rays/second, iteration count, active ray monitoring
Camera Controls: Interactive orbit, pan, zoom with configurable speed
Scene Management: Hot-reload scene files without restarting
Denoiser Panel: Live denoising parameter control
Performance Monitoring: Per-kernel timing display
Image Export: One-click PNG save functionality

Third-party Libraries Used

tinyGLTF

Purpose: Loading GLTF 2.0 3D models and associated assets
License: MIT License
Integration: Header-only library included in external/include/tiny_gltf.h
Usage: Parses GLTF/GLB files to extract meshes, materials, textures, and transformations. Provides comprehensive support for the PBR metallic-roughness workflow standard in GLTF 2.0.

Nvidia OptiX

Purpose: AI-accelerated denoising for rendered images
License: NVIDIA Software License Agreement
Integration: SDK headers and libraries linked via CMake
Requirements: NVIDIA GPU with RT cores (RTX 20-series or newer) and appropriate drivers
Usage: The OptiX AI denoiser is used to dramatically reduce noise in path traced images, enabling preview-quality results with minimal samples. Integrated through optixDenoiser.cpp/h.

Performance Analysis

Stream Compaction

Analysis: Stream compaction demonstrates significant performance improvements that scale with trace depth. The data shows frame time measurements (in milliseconds) comparing performance with and without stream compaction across various trace depths.

Performance measurements:

Depth 4: 26ms with SC vs 29ms without (10% improvement)
Depth 8: 32ms with SC vs 42ms without (24% improvement)
Depth 12: 34ms with SC vs 57ms without (40% improvement)
Depth 16: 34ms with SC vs 68ms without (50% improvement)
Depth 24: 37ms with SC vs 94ms without (61% improvement)
Depth 32: 39ms with SC vs 118ms without (67% improvement)

Key observations:

Stream compaction becomes increasingly effective at higher bounce depths
Performance improvement scales from 10% at shallow depths to 67% at depth 32
Frame time with stream compaction plateaus at 34-39ms from depth 12 onward
Without stream compaction, frame time increases linearly with depth

The data confirms that stream compaction is essential for production-quality renders with high bounce counts, preventing the linear performance degradation that would otherwise occur.
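
The improvement percentages in the measurements above are consistent with the frame time saved relative to the run without stream compaction. A minimal sketch of that arithmetic, with a function name chosen for this example:

```cpp
#include <cassert>

// Percentage of frame time saved relative to the unoptimized baseline.
double improvementPercent(double optimizedMs, double baselineMs) {
    assert(baselineMs > 0.0);
    return (baselineMs - optimizedMs) / baselineMs * 100.0;
}
```

For example, at depth 16 the frame time drops from 68ms to 34ms, which is (68 - 34) / 68 = 50%.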

Material Sort

Analysis: Material sorting shows mixed results depending on scene complexity and material diversity. The data reveals that material sorting can actually decrease performance in some cases due to sorting overhead.

Performance measurements by scene:

Duck (3 materials, 4K triangles): 17ms with sorting vs 15ms without (11.8% slower)
Dragon (6 materials, 134K triangles): 42ms with sorting vs 41ms without (2.4% slower)
Halo (13 materials, 42K triangles): 29ms with sorting vs 28ms without (3.4% slower)
Porsche (16 materials, 241K triangles): 26ms with sorting vs 22ms without (15.4% slower)
Challenger (23 materials, 196K triangles): 27ms with sorting vs 23ms without (14.8% slower)
Chess (18 materials, 1.5M triangles): 267ms with sorting vs 278ms without (4.1% faster)

Key insights:

In these measurements, material sorting only helped the most geometrically complex scene (Chess, 1.5M triangles)
For the other five scenes, the overhead of sorting outweighs the coherence benefits
Material count alone does not predict the result: Porsche (16 materials) and Challenger (23) slow down the most, while Chess (18) speeds up
Slowdowns range from 2.4% (Dragon) to 15.4% (Porsche)

This suggests material sorting should be selectively enabled only for scenes with very high geometric complexity where memory coherence benefits overcome sorting overhead.

Russian Roulette

Analysis: Russian Roulette termination effectively reduces computation with minimal quality impact. The data compares frame times with RR disabled versus different start depths (measured as fractions of total trace depth).

Performance measurements (in milliseconds):

Depth 8: 38ms (RR off) → 32ms (RR at depth 4) → 30ms (RR at depth 2)
Depth 12: 42ms (RR off) → 36ms (RR at depth 6) → 32ms (RR at depth 3)
Depth 16: 45ms (RR off) → 40ms (RR at depth 8) → 35ms (RR at depth 4)
Depth 24: 48ms (RR off) → 45ms (RR at depth 12) → 39ms (RR at depth 6)
Depth 32: 50ms (RR off) → 48ms (RR at depth 16) → 44ms (RR at depth 8)

Performance improvements:

RR at 1/2 depth: 4-16% improvement
RR at 1/4 depth: 12-24% improvement
Earlier RR activation yields better performance but may impact quality

The data shows that, of the two settings tested, starting Russian Roulette at 1/4 of the total trace depth provides the better balance between performance (12-24% improvement) and visual quality.

BVH

Analysis: BVH acceleration provides a speedup that grows with geometric complexity, from about 4x at 4K triangles to about 160x at 1.5M triangles. The data shows frame times (in milliseconds) for various models with different triangle counts.

Performance measurements:

Duck (4K triangles): 17ms with BVH vs 70ms without (4.1x speedup)
Halo (42K triangles): 30ms with BVH vs 765ms without (25.5x speedup)
Dragon (134K triangles): 42ms with BVH vs 2,846ms without (67.8x speedup)
Challenger (196K triangles): 27ms with BVH vs 3,023ms without (112x speedup)
Porsche (241K triangles): 25ms with BVH vs 3,604ms without (144x speedup)
Chess (1.5M triangles): 270ms with BVH vs 43,343ms without (160x speedup)

Key observations:

Speedup scales dramatically with triangle count
Sub-100K triangles (Duck, Halo): 4-26x speedup
100K-250K triangles (Dragon, Challenger, Porsche): 68-144x speedup
1M+ triangles (Chess): 160x speedup
BVH enables interactive preview for models that would otherwise take seconds to tens of seconds per frame

The Chess scene exemplifies the dramatic impact: reducing frame time from 43.3 seconds to 270ms, transforming an unusable 0.023 FPS to a workable 3.7 FPS.

WIP Renders

Subsurface Scattering

OptiX Denoiser

See how OptiX Denoiser gives a high quality result with only 382 iterations.

High Poly Chessboard

A glTF model with 1.49 million triangles. With BVH, it renders at around 80ms frametime.

Previously, without BVH, it rendered at 18000ms frametime. This render has only 1771 iterations but took 9 hours.
