- Yiding Tian
- Tested on: Windows 11 24H2, i9-13900H @ 4.1GHz, 32GB RAM, MSI Shadow RTX 5080 16GB Driver 581.15, Personal Laptop with External Desktop GPU via NVMe connector (PCIe 4.0 x4 Protocol)
- Base code provided by University of Pennsylvania, CIS 5650: GPU Programming and Architecture
A CUDA-based path tracer capable of rendering globally-illuminated images for various custom scenes.
Model from Khronos glTF Sample Models
- Diffuse, Specular, Refractive, and PBR (Physically Based Rendering) shaders
- MIS (Multiple Importance Sampling) on diffuse and PBR materials
- Subsurface Scattering for PBR materials
- Custom environment maps and GLTF models loading with materials/textures/metallic etc.
- BVH (Bounding Volume Hierarchy) data structure that enables rendering complex GLTF models with millions of polygons at a reasonable speed
- Material sorting, stream compaction, and Russian Roulette ray termination to boost performance
- Stochastic sampled anti-aliasing to produce sharper renders
- Nvidia OptiX Denoiser integration, configurable in real time to enable quick preview and enhance the end result
- Enhanced ImGUI user interface with detailed real-time statistics monitoring and scenes/camera/denoiser controls
Model from Khronos glTF Sample Models
Model from Lionsharp Studios @ Sketchfab
Model from McCarthy3D @ Sketchfab
- Make sure to have CUDA, CMake 4.x, and Visual Studio 2022 installed on your PC with a modern Nvidia GPU (20-Series or later)
- Clone the repo. Open a terminal in the repo's root directory. Run the following commands:
mkdir build
cd build
cmake ..
- This should create a `cis565_path_tracer.sln` file inside the `build` folder. Double click to open it in Visual Studio 2022.
- In Visual Studio 2022's top menu bar, change the build mode from `Debug` to `Release`. This has a large impact on rendering performance: leaving it in `Debug` mode results in extremely slow rendering.
- In the `Solution Explorer` panel on the left side of Visual Studio, right click on the `cis565_path_tracer` project and select `Properties`. In the pop-up window, go to `Configuration Properties` - `Debugging` - `Command Arguments` and enter the relative file path of the starting `.json` scene configuration, for example `../scenes/chess.json`. Change this path to choose which scene is rendered at startup.
- Click the Build icon at the top of Visual Studio. The rendering program should open shortly.
- Drag with the left mouse button to rotate, drag with the right mouse button to pan, and scroll to zoom. Adjust the `Zoom Speed` bar in the ImGUI window to change the zoom speed.
- Enter a new file path in the ImGUI window to load a new scene without restarting the program.
- Click the `Save Image` button to save the current render. The image is saved under the `build` directory in `.png` format. When all iterations finish, the program saves the image and exits automatically.
- Under the `OptiX Denoiser` panel, the denoiser can be configured in real time. Change the `Blend Factor` bar to see how the denoised render compares to the original.
- Refer to the existing `.json` scene configuration files under `scenes` to see how to create your own scene file. Environment maps should be in `.hdr` format under the `envmaps` folder. GLTF models should be in `.glb` format under the `GLTF` folder.
The diffuse shader implements physically-based Lambertian reflection using cosine-weighted hemisphere sampling. The implementation in shadeDiffuse() generates random ray directions that follow the probability distribution of Lambert's cosine law, ensuring unbiased global illumination.
Key implementation details:
- Uses the `calculateRandomDirectionInHemisphere()` function, which generates rays with a cosine-weighted distribution
- The sampled direction is computed in local space and then transformed to world space using an orthonormal basis constructed from the surface normal
- For pure diffuse surfaces with cosine-weighted sampling, the BRDF and PDF terms cancel out mathematically, simplifying the calculation to just multiplying by the material color
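The sampling steps above can be sketched on the host in plain C++. This is a minimal illustration, not the project's kernel code: `Vec3`, `sampleCosineHemisphere`, and the basis construction are hypothetical names and choices made for this example.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
static Vec3 normalize(Vec3 v) {
    float len = std::sqrt(dot(v, v));
    assert(len > 0.0f);
    return {v.x / len, v.y / len, v.z / len};
}

// Cosine-weighted direction around the unit normal n.
// u1 and u2 are uniform random numbers in [0, 1).
Vec3 sampleCosineHemisphere(Vec3 n, float u1, float u2) {
    // Local space: sample a unit disk and project up, so pdf = cos(theta) / pi.
    const float kTwoPi = 6.2831853f;
    float r = std::sqrt(u1);
    float lx = r * std::cos(kTwoPi * u2);
    float ly = r * std::sin(kTwoPi * u2);
    float lz = std::sqrt(1.0f - u1);  // cos(theta)

    // Orthonormal basis (t, b, n): pick a coordinate axis that is not parallel to n.
    Vec3 axis = std::fabs(n.x) < 0.57735f ? Vec3{1.0f, 0.0f, 0.0f}
              : std::fabs(n.y) < 0.57735f ? Vec3{0.0f, 1.0f, 0.0f}
                                          : Vec3{0.0f, 0.0f, 1.0f};
    Vec3 t = normalize(cross(n, axis));
    Vec3 b = cross(n, t);

    // Local to world.
    return {lx * t.x + ly * b.x + lz * n.x,
            lx * t.y + ly * b.y + lz * n.y,
            lx * t.z + ly * b.z + lz * n.z};
}
// With BRDF = albedo / pi and pdf = cos(theta) / pi, the path weight
// BRDF * cos(theta) / pdf reduces to albedo, which is the cancellation noted above.
```

The cosine of the angle between the returned direction and `n` equals `sqrt(1 - u1)`, so directions near the normal are sampled more often than grazing ones.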

Material sorting optimizes GPU performance by grouping rays that interact with the same material type, improving warp coherence and reducing divergence during shading calculations. The implementation uses thrust's efficient parallel sorting algorithms.
Implementation workflow:
- After intersection testing, extract material IDs for each ray using the `extractMaterialIds` kernel
- Create an index array to track original ray positions
- Use `thrust::sort_by_key()` to sort rays by material ID in parallel
- Reorder both `PathSegment` and `ShadeableIntersection` arrays based on sorted indices using the `reorderByMaterial` kernel
- Swap pointers to use sorted data for shading stage
This feature is toggled via the MATERIAL_SORTING preprocessor flag in pathtrace.h for easy performance comparison.
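The workflow can be mirrored on the CPU to show the data movement. The sketch below swaps `thrust::sort_by_key()` for `std::stable_sort` over an index array so that it runs without CUDA; `Path`, `Hit`, and `sortByMaterial` are simplified, hypothetical stand-ins for the real structures and kernels.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical, reduced versions of the per-ray records.
struct Path { int pixelIndex; };
struct Hit { int materialId; };

// Reorders both arrays so that rays sharing a material are contiguous.
void sortByMaterial(std::vector<Path>& paths, std::vector<Hit>& hits) {
    assert(paths.size() == hits.size());

    // Index array that remembers each ray's original position.
    std::vector<int> order(hits.size());
    std::iota(order.begin(), order.end(), 0);

    // Sort the indices by material ID (the key).
    std::stable_sort(order.begin(), order.end(), [&](int a, int b) {
        return hits[a].materialId < hits[b].materialId;
    });

    // Gather both arrays through the sorted indices, then swap them in.
    std::vector<Path> sortedPaths(paths.size());
    std::vector<Hit> sortedHits(hits.size());
    for (std::size_t i = 0; i < order.size(); ++i) {
        sortedPaths[i] = paths[order[i]];
        sortedHits[i] = hits[order[i]];
    }
    paths.swap(sortedPaths);
    hits.swap(sortedHits);
}
```

Sorting an index array rather than the records themselves keeps the sort key small, at the cost of one extra gather pass over both arrays.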
Scene with 18 materials With Material Sorting (266ms frametime) |
Scene with 18 materials Without Material Sorting (280ms frametime) |
Stream compaction efficiently removes terminated rays from the active ray pool, significantly reducing unnecessary computation in later bounces. The implementation uses thrust's parallel algorithms.
The termination and compaction pipeline:
- After shading, rays that hit nothing or have exhausted their bounces are marked with `remainingBounces = 0`
- The `gatherTerminatedPaths` kernel accumulates color contributions from terminated paths
- `thrust::remove_if()` with the `is_terminated()` functor compacts the ray array in parallel
- The compacted array size determines the number of active rays for the next bounce
- Russian Roulette termination (when enabled) provides additional probabilistic termination based on throughput
The efficiency gain is most pronounced after several bounces when many rays have terminated naturally or hit light sources.
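A serial sketch of the gather-then-compact step is shown below. It uses `std::remove_if`, which has the same "move survivors to the front, return the new end" contract as `thrust::remove_if()`. The `Path` layout (a single float for color) and the function name are assumptions made for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reduced path record; a real one would carry a ray and an RGB color.
struct Path { int pixelIndex; int remainingBounces; float color; };

// Accumulates finished paths into the image, compacts the first numActive
// entries, and returns how many paths are still active for the next bounce.
std::size_t gatherAndCompact(std::vector<Path>& paths, std::size_t numActive,
                             std::vector<float>& image) {
    assert(numActive <= paths.size());
    auto first = paths.begin();
    auto last = first + static_cast<std::ptrdiff_t>(numActive);

    // Gather: terminated paths write their contribution before being dropped.
    for (auto it = first; it != last; ++it) {
        if (it->remainingBounces <= 0) image[it->pixelIndex] += it->color;
    }

    // Compact: surviving paths keep their relative order at the front.
    auto newEnd = std::remove_if(first, last, [](const Path& p) {
        return p.remainingBounces <= 0;
    });
    return static_cast<std::size_t>(newEnd - first);
}
```

The returned count would be the launch size for the next bounce, which is why later bounces get cheaper as more paths terminate.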
Trace Depth 12 With Stream Compaction (44ms frametime) |
Trace Depth 12 Without Stream Compaction (106ms frametime) |
Implementation in generateRayFromCamera():
- Each pixel is subdivided into a 2×2 grid (configurable via `GRID_SIZE`)
- Over multiple iterations, the path tracer cycles through different cells in the grid
- Within each cell, a random offset is applied using thrust's random number generator
- The jittered position is used to generate the camera ray, with coordinates calculated as:
  - `pixelX = x + jitterX - 0.5` (centered around pixel center)
  - `pixelY = y + jitterY - 0.5`
- Ray direction is computed through the jittered pixel position for sub-pixel sampling
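The jitter computation can be sketched as a small host function. The cell-cycling order (row-major, indexed by `iter % (gridSize * gridSize)`) is an assumption for illustration, and `u`, `v` stand in for the two uniform random numbers that the kernel would draw.

```cpp
#include <cassert>

struct Sample { float x, y; };

// Sub-pixel sample position for pixel (x, y) at a given iteration.
// u and v are uniform random numbers in [0, 1).
Sample jitteredPixelPosition(int x, int y, int iter, int gridSize, float u, float v) {
    assert(gridSize > 0 && iter >= 0);

    // Pick the grid cell for this iteration (assumed row-major cycling).
    int cell = iter % (gridSize * gridSize);
    int cellX = cell % gridSize;
    int cellY = cell / gridSize;

    // Random offset inside the cell, expressed as a fraction of the pixel.
    float jitterX = (cellX + u) / gridSize;
    float jitterY = (cellY + v) / gridSize;

    // Center the sample footprint on the pixel center.
    return {x + jitterX - 0.5f, y + jitterY - 0.5f};
}
```

Because each iteration is confined to one cell, consecutive iterations are guaranteed to cover different parts of the pixel instead of clustering by chance.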
Without SSAA (GRID_SIZE = 1) |
With SSAA (GRID_SIZE = 1024) |
Without SSAA Zoomed |
With SSAA Zoomed |
Implemented perfect specular (mirror) reflection using the reflection equation. The shadeSpecular() function calculates the reflected ray direction based on the incident ray and surface normal, creating realistic mirror surfaces that can reflect the entire scene including other objects and environment maps.
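For reference, the reflection equation is `r = d - 2 (d · n) n`, where `d` is the incident direction and `n` the unit surface normal. A minimal host-side sketch, with a hypothetical `Vec3` type:

```cpp
#include <cassert>

struct Vec3 { float x, y, z; };

static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Mirror reflection of incident direction d about the unit normal n.
Vec3 reflectDir(Vec3 d, Vec3 n) {
    float k = 2.0f * dot(d, n);
    return {d.x - k * n.x, d.y - k * n.y, d.z - k * n.z};
}
```

A ray heading down and to the right, `(1, -1, 0)`, hitting a floor with normal `(0, 1, 0)` leaves as `(1, 1, 0)`: the component along the normal flips and the tangential component is unchanged.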
Full implementation of refractive materials for glass and transparent objects with physically accurate light bending. Features include:
- Snell's law refraction with configurable index of refraction (IOR)
- Fresnel effects using Schlick's approximation for realistic reflectance at different angles
- Total internal reflection handling for rays traveling from dense to less dense media
- Proper handling of rays entering and exiting refractive objects
The shadeRefractive() function determines whether to reflect or refract based on Fresnel equations, creating realistic glass and water effects.
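The pieces listed above (Snell's law, Schlick's approximation, and total internal reflection) can be sketched as follows. The function names and the `Vec3` type are hypothetical, and how the project chooses between reflection and refraction per sample is not shown in the source; a common approach is to reflect when a uniform random number falls below the Fresnel reflectance.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Schlick's approximation of Fresnel reflectance.
// cosTheta: cosine of the angle between the ray and the normal on the incident side.
float schlickReflectance(float cosTheta, float etaI, float etaT) {
    float r0 = (etaI - etaT) / (etaI + etaT);
    r0 *= r0;
    float m = 1.0f - cosTheta;
    return r0 + (1.0f - r0) * m * m * m * m * m;
}

// Snell's law. d: unit incident direction, n: unit normal facing against d,
// eta: etaI / etaT. Returns false on total internal reflection.
bool refractDir(Vec3 d, Vec3 n, float eta, Vec3& out) {
    float cosI = -dot(d, n);
    float sin2T = eta * eta * (1.0f - cosI * cosI);
    if (sin2T > 1.0f) return false;  // no transmitted ray exists
    float cosT = std::sqrt(1.0f - sin2T);
    float k = eta * cosI - cosT;
    out = {eta * d.x + k * n.x, eta * d.y + k * n.y, eta * d.z + k * n.z};
    return true;
}
```

When a ray exits the object, the caller would flip the normal and invert `eta` (for example 1.5 instead of 1/1.5 for glass to air), which is the case where `refractDir` can report total internal reflection.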
Comprehensive Physically Based Rendering implementation using the metallic-roughness workflow. The shadePBR() function implements:
- Cook-Torrance BRDF with GGX/Trowbridge-Reitz distribution
- Smith's geometry function for masking and shadowing
- Fresnel term using Schlick's approximation
- Support for metallic (0-1) and roughness (0-1) parameters
- Transparency support with proper alpha blending
- Energy conservation between diffuse and specular components
Materials can smoothly transition from dielectric to metallic and from rough to smooth surfaces.
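The three Cook-Torrance terms can be written as scalar functions. This is a sketch under common conventions that the source does not confirm for this project: `alpha = roughness * roughness`, the separable Smith form `G = G1(l) * G1(v)`, and a dielectric base reflectance of 0.04 as in the glTF metallic-roughness model. All function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

const float kPi = 3.14159265f;

// GGX / Trowbridge-Reitz normal distribution.
float ggxD(float nDotH, float alpha) {
    float a2 = alpha * alpha;
    float d = nDotH * nDotH * (a2 - 1.0f) + 1.0f;
    return a2 / (kPi * d * d);
}

// Smith masking term for GGX, for one direction (view or light).
float smithG1(float nDotX, float alpha) {
    float a2 = alpha * alpha;
    return 2.0f * nDotX / (nDotX + std::sqrt(a2 + (1.0f - a2) * nDotX * nDotX));
}

// Schlick Fresnel for one color channel.
float schlickF(float vDotH, float f0) {
    float m = 1.0f - vDotH;
    return f0 + (1.0f - f0) * m * m * m * m * m;
}

// Base reflectance: 0.04 for dielectrics, the base color for metals.
float baseReflectance(float baseColor, float metallic) {
    return 0.04f * (1.0f - metallic) + baseColor * metallic;
}

// Specular BRDF value for one channel: D * G * F / (4 (n.l)(n.v)).
float cookTorranceSpecular(float nDotL, float nDotV, float nDotH, float vDotH,
                           float roughness, float f0) {
    assert(nDotL > 0.0f && nDotV > 0.0f);
    float alpha = roughness * roughness;
    float g = smithG1(nDotL, alpha) * smithG1(nDotV, alpha);
    return ggxD(nDotH, alpha) * g * schlickF(vDotH, f0) / (4.0f * nDotL * nDotV);
}
```

Energy conservation between the lobes is typically handled by scaling the diffuse term by `(1 - F) * (1 - metallic)`, so that light reflected specularly is not also counted as diffuse.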
Multiple Importance Sampling implementation that combines three sampling strategies:
- Light Sampling: Direct sampling of area lights
- BRDF Sampling: Importance sampling based on material properties
- Environment Map Sampling: Sampling bright regions of HDR environment maps
The implementation uses the power heuristic to weight contributions from the different sampling strategies, which significantly reduces variance and improves convergence speed. Both `shadeDiffuseMIS()` and `shadePBR()` use MIS for direct lighting calculations.
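As a sketch of the weighting, the power heuristic with exponent 2 gives the chosen strategy the weight `p_chosen^2 / sum(p_i^2)`. The function below is a hypothetical helper that handles two or three strategies; the source does not show how the project's own weights are computed.

```cpp
#include <cassert>
#include <cmath>

// Power heuristic (exponent 2) for the strategy that generated the sample.
// pdfChosen: pdf of the sample under the strategy that produced it.
// pdfOtherA, pdfOtherB: pdf of the same sample under the other strategies
// (leave pdfOtherB at 0 when only two strategies are combined).
float powerHeuristic(float pdfChosen, float pdfOtherA, float pdfOtherB = 0.0f) {
    float chosen2 = pdfChosen * pdfChosen;
    float sum2 = chosen2 + pdfOtherA * pdfOtherA + pdfOtherB * pdfOtherB;
    return sum2 > 0.0f ? chosen2 / sum2 : 0.0f;
}
```

For one sample direction, the weights across all strategies sum to 1, so a contribution is never double-counted: with pdfs 1 and 3, the two weights are 0.1 and 0.9.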
shadePBR With MIS |
shadePBR Without MIS |
Implemented diffusion-based subsurface scattering for realistic rendering of translucent materials like jade, milk, wax, and skin. Features include:
- Configurable scattering radius and color per RGB channel
- Anisotropy control for directional scattering
- Distance-based attenuation using diffusion profiles
- Integration with PBR materials for combined surface and volume effects
The implementation simulates light penetrating the surface, scattering within the material, and exiting at different points, creating soft, translucent appearance.
Subsurface Scattering Off |
Subsurface Scattering On |
Probabilistic path termination that maintains unbiased results while improving performance. Implementation details:
- Begins after a configurable bounce depth (`RR_START_BOUNCE = 3`)
- Survival probability based on path throughput (luminance)
- Minimum and maximum survival probability bounds to limit the variance introduced by very large compensation weights
- Energy compensation by dividing the throughput of surviving paths by the survival probability, which keeps the estimator unbiased
This significantly reduces computation for dim rays that contribute little to the final image.
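A host-side sketch of the termination test is below. The clamp bounds `minP` and `maxP` are hypothetical values chosen for illustration, the luminance weights are the Rec. 709 coefficients, and `u` stands in for a uniform random number.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Rgb { float r, g, b; };

// Returns true if the path survives. Surviving paths have their throughput
// divided by the survival probability so the expected value is unchanged.
bool russianRoulette(Rgb& throughput, int bounce, int startBounce, float u,
                     float minP = 0.05f, float maxP = 0.95f) {
    if (bounce < startBounce) return true;  // too early, always continue

    float luminance = 0.2126f * throughput.r + 0.7152f * throughput.g +
                      0.0722f * throughput.b;
    float pSurvive = std::min(std::max(luminance, minP), maxP);
    assert(pSurvive > 0.0f);

    if (u >= pSurvive) return false;  // terminate this path

    throughput.r /= pSurvive;
    throughput.g /= pSurvive;
    throughput.b /= pSurvive;
    return true;
}
```

A path with throughput 0.5 survives half the time and carries weight 1.0 when it does, so its average contribution stays at 0.5.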
Trace Depth 32 With Russian Roulette (56ms frametime) |
Trace Depth 32 Without Russian Roulette (63ms frametime) |
Full HDR environment map support for image-based lighting:
- HDR image loading with proper tone mapping
- Spherical mapping from direction vectors to texture coordinates
- Configurable intensity control
- Importance sampling with precomputed CDFs for efficient sampling
- Integration with MIS for balanced direct and indirect lighting
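Two of these steps lend themselves to a short sketch: mapping a direction to equirectangular texture coordinates, and drawing an index from a precomputed CDF. The y-up axis convention and all names here are assumptions for illustration; the project's own mapping may be oriented differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Uv { float u, v; };

// Unit direction to equirectangular (u, v), assuming +y is up.
Uv directionToEquirect(float dx, float dy, float dz) {
    const float kPi = 3.14159265f;
    float u = std::atan2(dz, dx) / (2.0f * kPi) + 0.5f;
    float v = std::acos(std::min(1.0f, std::max(-1.0f, dy))) / kPi;
    return {u, v};
}

// Normalized running sum of non-negative weights (for example pixel luminance).
std::vector<float> buildCdf(const std::vector<float>& weights) {
    std::vector<float> cdf(weights.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        sum += weights[i];
        cdf[i] = sum;
    }
    assert(sum > 0.0f);
    for (float& c : cdf) c /= sum;
    return cdf;
}

// Index whose CDF interval contains u in [0, 1); bright entries are picked more often.
std::size_t sampleCdf(const std::vector<float>& cdf, float u) {
    auto it = std::upper_bound(cdf.begin(), cdf.end(), u);
    std::size_t index = static_cast<std::size_t>(it - cdf.begin());
    return std::min(index, cdf.size() - 1);
}
```

The probability of picking entry `i` is its weight divided by the total, and that value is the discrete pdf that would feed into the MIS weight for the environment sample.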
Interior environment with sunlight on the left |
Exterior environment with sunlight on the right |
Comprehensive GLTF 2.0 model loading using the TinyGLTF library:
- Support for both `.gltf` (JSON) and `.glb` (binary) formats
- Triangle mesh extraction with automatic primitive assembly
- Material loading including PBR metallic-roughness workflow
- Texture loading for base color, normal, metallic-roughness maps
- Proper UV coordinate mapping
- Transformation matrix support for model positioning
Bounding Volume Hierarchy implementation for efficient ray-triangle intersection:
- SAH (Surface Area Heuristic) based construction for optimal tree quality
- CPU-side tree building with GPU-friendly memory layout
- Iterative GPU traversal using stack-based approach
- Configurable maximum tree depth (`BVH_MAX_TREE_DEPTH`)
- Dramatic performance improvement: 100x+ speedup for million+ triangle scenes
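The iterative, stack-based traversal can be sketched on the host as follows. The flat node layout, the fixed stack size of 64, and the function names are hypothetical; the sketch only collects candidate triangles from the leaves the ray reaches and leaves out the ray-triangle test itself.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

struct Aabb { float min[3], max[3]; };
struct Ray { float origin[3], invDir[3]; };  // invDir = 1 / direction, per axis

// Hypothetical flat node: a leaf has triCount > 0, an inner node has two children.
struct BvhNode { Aabb box; int left, right, firstTri, triCount; };

// Slab test: does the ray hit the box within [0, tMax]?
bool hitAabb(const Ray& r, const Aabb& b, float tMax) {
    float t0 = 0.0f, t1 = tMax;
    for (int a = 0; a < 3; ++a) {
        float tNear = (b.min[a] - r.origin[a]) * r.invDir[a];
        float tFar = (b.max[a] - r.origin[a]) * r.invDir[a];
        if (tNear > tFar) std::swap(tNear, tFar);
        t0 = std::max(t0, tNear);
        t1 = std::min(t1, tFar);
        if (t0 > t1) return false;
    }
    return true;
}

// Returns the indices of triangles stored in leaves whose boxes the ray hits.
std::vector<int> traverse(const std::vector<BvhNode>& nodes, const Ray& ray, float tMax) {
    std::vector<int> candidates;
    int stack[64];
    int top = 0;
    stack[top++] = 0;  // start at the root
    while (top > 0) {
        const BvhNode& node = nodes[stack[--top]];
        if (!hitAabb(ray, node.box, tMax)) continue;  // prune the whole subtree
        if (node.triCount > 0) {
            for (int i = 0; i < node.triCount; ++i) candidates.push_back(node.firstTri + i);
        } else {
            assert(top + 2 <= 64);
            stack[top++] = node.left;
            stack[top++] = node.right;
        }
    }
    return candidates;
}
```

An explicit stack avoids recursion, which is costly on the GPU, and a bounded tree depth is what lets the stack have a fixed size.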
1.5M Triangles Model with BVH (271ms frametime) |
1.5M Triangles Model without BVH (33494ms frametime) |
Integration with OptiX 9.0 AI denoiser for real-time noise reduction:
- Beauty buffer denoising with optional guide layers
- Normal buffer guide for edge preservation
- Albedo buffer guide for texture detail preservation
- Configurable blend factor for artistic control
- Real-time parameter adjustment through ImGui
- Automatic denoising at configurable intervals
The denoiser dramatically reduces required sample count, enabling preview-quality images in seconds rather than minutes.
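The blend factor amounts to a per-pixel linear interpolation between the noisy and the denoised image. The sketch below follows the convention documented for the OptiX denoiser's `blendFactor` parameter, where 0 yields the fully denoised image and 1 yields the unmodified input; whether the UI slider is passed through unchanged or inverted is not stated in the source, so treat the direction as an assumption.

```cpp
#include <cassert>

struct Rgb { float r, g, b; };

// Linear blend of one pixel: blendFactor 0 -> denoised, 1 -> original.
Rgb blendDenoised(Rgb original, Rgb denoised, float blendFactor) {
    assert(blendFactor >= 0.0f && blendFactor <= 1.0f);
    float keep = blendFactor;
    float replace = 1.0f - blendFactor;
    return {denoised.r * replace + original.r * keep,
            denoised.g * replace + original.g * keep,
            denoised.b * replace + original.b * keep};
}
```

Intermediate values retain some of the original grain, which can help when the denoiser smooths away fine texture at low sample counts.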
50 Iterations with Denoiser |
50 Iterations without Denoiser |
Enhanced user interface with comprehensive debugging and control features:
- Real-time Statistics: FPS, rays/second, iteration count, active ray monitoring
- Camera Controls: Interactive orbit, pan, zoom with configurable speed
- Scene Management: Hot-reload scene files without restarting
- Denoiser Panel: Live denoising parameter control
- Performance Monitoring: Per-kernel timing display
- Image Export: One-click PNG save functionality
Purpose: Loading GLTF 2.0 3D models and associated assets
License: MIT License
Integration: Header-only library included in external/include/tiny_gltf.h
Usage: Parses GLTF/GLB files to extract meshes, materials, textures, and transformations. Provides comprehensive support for the PBR metallic-roughness workflow standard in GLTF 2.0.
Purpose: AI-accelerated denoising for rendered images
License: NVIDIA Software License Agreement
Integration: SDK headers and libraries linked via CMake
Requirements: NVIDIA GPU with RT cores (RTX 20-series or newer) and appropriate drivers
Usage: The OptiX AI denoiser is used to dramatically reduce noise in path traced images, enabling preview-quality results with minimal samples. Integrated through optixDenoiser.cpp/h.
Analysis: Stream compaction demonstrates significant performance improvements that scale with trace depth. The data shows frame time measurements (in milliseconds) comparing performance with and without stream compaction across various trace depths.
Performance measurements:
- Depth 4: 26ms with SC vs 29ms without (10% improvement)
- Depth 8: 32ms with SC vs 42ms without (24% improvement)
- Depth 12: 34ms with SC vs 57ms without (40% improvement)
- Depth 16: 34ms with SC vs 68ms without (50% improvement)
- Depth 24: 37ms with SC vs 94ms without (61% improvement)
- Depth 32: 39ms with SC vs 118ms without (67% improvement)
Key observations:
- Stream compaction becomes increasingly effective at higher bounce depths
- Performance improvement scales from 10% at shallow depths to 67% at depth 32
- Frame time with stream compaction plateaus around 34-39ms from depth 12 onward
- Without stream compaction, frame time increases linearly with depth
The data confirms that stream compaction is essential for production-quality renders with high bounce counts, preventing the linear performance degradation that would otherwise occur.
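The improvement percentages listed above are consistent with the reduction in frame time relative to the run without stream compaction. A small helper (hypothetical name) makes the arithmetic explicit:

```cpp
#include <cassert>
#include <cmath>

// Percentage reduction in frame time relative to the baseline (feature off).
double improvementPercent(double withMs, double withoutMs) {
    assert(withoutMs > 0.0);
    return 100.0 * (withoutMs - withMs) / withoutMs;
}
```

For depth 4 this gives 100 * (29 - 26) / 29, about 10.3%, and for depth 32 it gives 100 * (118 - 39) / 118, about 66.9%, matching the rounded 10% and 67% in the list.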
Analysis: Material sorting shows mixed results depending on scene complexity and material diversity. The data reveals that material sorting can actually decrease performance in some cases due to sorting overhead.
Performance measurements by scene:
- Duck (3 materials, 4K triangles): 17ms with sorting vs 15ms without (11.8% slower)
- Dragon (6 materials, 134K triangles): 42ms with sorting vs 41ms without (2.4% slower)
- Halo (13 materials, 42K triangles): 29ms with sorting vs 28ms without (3.4% slower)
- Porsche (16 materials, 241K triangles): 26ms with sorting vs 22ms without (15.4% slower)
- Challenger (23 materials, 196K triangles): 27ms with sorting vs 23ms without (14.8% slower)
- Chess (18 materials, 1.5M triangles): 267ms with sorting vs 278ms without (4.1% faster)
Key insights:
- Material sorting only benefited the most geometrically complex scene tested, Chess (1.5M triangles, 18 materials), which ran 4.1% faster
- For every other scene, the overhead of sorting outweighed the coherence benefits
- Material count alone does not predict a benefit: Porsche (16 materials) and Challenger (23 materials) show the largest slowdowns, around 15%, while Chess (18 materials) improves
This suggests material sorting should be selectively enabled only for scenes with very high geometric complexity where memory coherence benefits overcome sorting overhead.
Analysis: Russian Roulette termination effectively reduces computation with minimal quality impact. The data compares frame times with RR disabled against RR starting at 1/2 and at 1/4 of the total trace depth.
Performance measurements (in milliseconds):
- Depth 8: 38ms (RR off) → 32ms (RR at depth 4) → 30ms (RR at depth 2)
- Depth 12: 42ms (RR off) → 36ms (RR at depth 6) → 32ms (RR at depth 3)
- Depth 16: 45ms (RR off) → 40ms (RR at depth 8) → 35ms (RR at depth 4)
- Depth 24: 48ms (RR off) → 45ms (RR at depth 12) → 39ms (RR at depth 6)
- Depth 32: 50ms (RR off) → 48ms (RR at depth 16) → 44ms (RR at depth 8)
Performance improvements:
- RR at 1/2 depth: 4-16% improvement
- RR at 1/4 depth: 12-24% improvement
- Earlier RR activation yields better performance but may impact quality
Of the two settings tested, starting Russian Roulette at 1/4 of the total trace depth provides the better balance between performance (12-24% improvement) and visual quality.
Analysis: BVH acceleration provides speedups that grow with geometric complexity, from about 4x on the smallest model to about 160x on the largest. The data shows frame times (in milliseconds) for various models with different triangle counts.
Performance measurements:
- Duck (4K triangles): 17ms with BVH vs 70ms without (4.1x speedup)
- Halo (42K triangles): 30ms with BVH vs 765ms without (25.5x speedup)
- Dragon (134K triangles): 42ms with BVH vs 2,846ms without (67.8x speedup)
- Challenger (196K triangles): 27ms with BVH vs 3,023ms without (112x speedup)
- Porsche (241K triangles): 25ms with BVH vs 3,604ms without (144x speedup)
- Chess (1.5M triangles): 270ms with BVH vs 43,343ms without (160x speedup)
Key observations:
- Speedup scales dramatically with triangle count
- Under 150K triangles (Duck, Halo, Dragon): 4-68x speedup
- 196K-241K triangles (Challenger, Porsche): 112-144x speedup
- 1M+ triangles (Chess): 160x speedup
- BVH enables interactive preview for models that would otherwise take seconds to tens of seconds per frame
The Chess scene exemplifies the dramatic impact: reducing frame time from 43.3 seconds to 270ms, transforming an unusable 0.023 FPS to a workable 3.7 FPS.
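The speedup and FPS figures follow directly from the frame times. The two helpers below (hypothetical names) reproduce the arithmetic:

```cpp
#include <cassert>
#include <cmath>

// How many times faster the accelerated run is.
double speedup(double withMs, double withoutMs) {
    assert(withMs > 0.0);
    return withoutMs / withMs;
}

// Frames per second for a given frame time in milliseconds.
double framesPerSecond(double frameMs) {
    assert(frameMs > 0.0);
    return 1000.0 / frameMs;
}
```

For the Chess scene, 43,343 / 270 is about 160.5, 1000 / 43,343 is about 0.023 FPS, and 1000 / 270 is about 3.7 FPS.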
See how OptiX Denoiser gives a high quality result with only 382 iterations.

A glTF model with 1.49 million triangles. With BVH, it renders at around 80ms frametime.

Previously, without BVH, it rendered at around 18000ms frametime. This render has only 1771 iterations but took 9 hours.