- Eli Asimow
- LinkedIn, personal website
- Tested on: Windows 11, AMD Ryzen 7 7435HS @ 2.08GHz 16GB, Nvidia GeForce RTX 4060 GPU
This project is a complete implementation of the boids algorithm utilizing Cuda for performance optimization. Three methods for that Cuda utilization, Naive, Scattered and Coherent Grid, are included. They operate at different performances levels in different contexts, but with Coherent Grid the simulation can manage well over 1,000,000 boids in a real time environment (1,000 FPS). Additional rules are included for user enjoyement, including a revolving planetary effect and bouncing walls.
As the boid count increases, performance deteriorates across all three implementations. However, the rate of change varies. The ultimate collection of optimizations of the coherent grid implementation slowly show their worth as the boid count grows. By 50,000, coherent was the clear best performer. By 1,000,000, coherent was the only viable real time algorithm left.
It’s important to note that both scattered and naive had moments of superiority as well. After metrics analysis, we can conclude that the new sort algorithms and memory overhead are the primary factors here. At exceedingly small boid counts, running additional kernels to sort our boid indices has a greater performance cost than the performance saved by scattered and coherent. So, at 1,000 boids, naive can actually be the best option. Likewise, at 10,000, although the grid cell sort of coherent and scattered make them superior to naive, coherent’s memory adjacency is still relatively less useful than the sort algorithms and additional buffer that scattered can skip.
I was surprised to see negligible effects on performance when modifying block size. Perhaps the block size ultimately doesn’t matter outside of a hard ceiling and floor. Because of the lack of general I/O thread freezes or global memory overhead, most threads will be executing continually throughout the kernel. Because of this, perhaps it's irrelevant which block they're a part of.
More expected is the difference between coherent and scattered grid performance. It makes sense that, as the number of boids in neighboring cells increases, the importance of memory adjacency for the speed of accessing them increases accordingly.
Lastly, my favorite optimization was realizing that 27 neighboring grid cells actually outperforms 8 neighbor cells! Grid cell size causes this counterintuitive result. In order to have only 8 cells checked, we must have a grid cell length of 2 * the maximum distance threshold. This results in a total neighbor check side length of 2 * 2d = 4d. With 27 neighboring cells, however, we can have each grid cell sized at the smaller 1 * maximum distance threshold. This results in a side length of just 3 * 1d = 3d, smaller than in the 8 cells implementation. Because 27’s radius check is ultimately smaller, that means more boids are filtered out of this step and the kernel can compute the velocity step relatively quicker.
I really fell in love with adding new rules to the system and experimented quite a bit over the week. The above gif is my favorite, and was accomplished with three changes:
- Walls which push the particles away from the boundary
- A universal gravity effect that pulls the particles down
- A sinusoidal time gravity affected, modulated by particle distance, drawing particles towards the grid center and then dropping them down.
I'll leave you with some other results I found while messing around. Thank you!




