Skip to content

eliasimow/Project1-CUDA-Flocking

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

48 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Untitled video - Made with Clipchamp (4)

Boids CUDA Implementation

  • Eli Asimow
  • LinkedIn, personal website
  • Tested on: Windows 11, AMD Ryzen 7 7435HS @ 2.08GHz 16GB, Nvidia GeForce RTX 4060 GPU

Overview

This project is a complete implementation of the boids algorithm utilizing Cuda for performance optimization. Three methods for that Cuda utilization, Naive, Scattered and Coherent Grid, are included. They operate at different performances levels in different contexts, but with Coherent Grid the simulation can manage well over 1,000,000 boids in a real time environment (1,000 FPS). Additional rules are included for user enjoyement, including a revolving planetary effect and bouncing walls.

Performance

Naive, Scattered and Coherent Performance (FPS) (1)

As the boid count increases, performance deteriorates across all three implementations. However, the rate of change varies. The ultimate collection of optimizations of the coherent grid implementation slowly show their worth as the boid count grows. By 50,000, coherent was the clear best performer. By 1,000,000, coherent was the only viable real time algorithm left.

It’s important to note that both scattered and naive had moments of superiority as well. After metrics analysis, we can conclude that the new sort algorithms and memory overhead are the primary factors here. At exceedingly small boid counts, running additional kernels to sort our boid indices has a greater performance cost than the performance saved by scattered and coherent. So, at 1,000 boids, naive can actually be the best option. Likewise, at 10,000, although the grid cell sort of coherent and scattered make them superior to naive, coherent’s memory adjacency is still relatively less useful than the sort algorithms and additional buffer that scattered can skip.

Naive, Scattered and Coherent Performance (FPS) Naive, Scattered and Coherent Block Size Performance (FPS)

I was surprised to see negligible effects on performance when modifying block size. Perhaps the block size ultimately doesn’t matter outside of a hard ceiling and floor. Because of the lack of general I/O thread freezes or global memory overhead, most threads will be executing continually throughout the kernel. Because of this, perhaps it's irrelevant which block they're a part of.

More expected is the difference between coherent and scattered grid performance. It makes sense that, as the number of boids in neighboring cells increases, the importance of memory adjacency for the speed of accessing them increases accordingly.

Lastly, my favorite optimization was realizing that 27 neighboring grid cells actually outperforms 8 neighbor cells! Grid cell size causes this counterintuitive result. In order to have only 8 cells checked, we must have a grid cell length of 2 * the maximum distance threshold. This results in a total neighbor check side length of 2 * 2d = 4d. With 27 neighboring cells, however, we can have each grid cell sized at the smaller 1 * maximum distance threshold. This results in a side length of just 3 * 1d = 3d, smaller than in the 8 cells implementation. Because 27’s radius check is ultimately smaller, that means more boids are filtered out of this step and the kernel can compute the velocity step relatively quicker.

Bonus Work!

Untitled video - Made with Clipchamp

I really fell in love with adding new rules to the system and experimented quite a bit over the week. The above gif is my favorite, and was accomplished with three changes:

  1. Walls which push the particles away from the boundary
  2. A universal gravity effect that pulls the particles down
  3. A sinusoidal time gravity affected, modulated by particle distance, drawing particles towards the grid center and then dropping them down.

I'll leave you with some other results I found while messing around. Thank you!

Untitled video - Made with Clipchamp (7) Untitled video - Made with Clipchamp (6) Untitled video - Made with Clipchamp (5)

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Cuda 45.2%
  • C++ 39.3%
  • CMake 13.3%
  • Makefile 1.1%
  • GLSL 1.1%