diff --git a/README.md b/README.md
index d63a6a1..5cf4a2c 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,53 @@
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture,
 Project 1 - Flocking**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Alan Qiao
+* Tested on: Windows 11 22H2, i7-10750H @ 2.60GHz, 16GB, GTX 1650 (Personal)
 
-### (TODO: Your README)
+# Flocking Simulation
 
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+![](images/BoidsDemo.gif)
+
+This project showcases a basic flocking simulation based on Reynolds Boids algorithm. The algorithm is implemented in three ways: naive, scattered uniform grid, coherent uniform grid.
+
+### Naive Implementation
+This version implements the algorithm as a kernel function that checks every boid in the system for interaction with the selected boid. In the configuration of this simulation, the maximum distance at which an interaction occurs is small relative to the size of the universe. This results in most of those checks contributing nothing to the movement of the selected boid as they are too far away.
+
+### Scattered Uniform Grid
+This version splits the universe into a uniform grid of cubes. In each cycle, the kernel function only checks boids in the grid cells that intersect the sphere around the selected boid for which an interaction can occur. This effectively reduces the number of unnecessary checks. The boids are assigned to grid cells by pointers and the relative location of their position and velocity data remain unmoved in the respective buffers. Hence the grid is considered scattered in the sense that the data relevant to each grid cell is scattered all over the memory.
+
+### Coherent Uniform Grid
+The version is nearly identical to the Scattered Uniform Grid implementation except that the position and velocity buffers are sorted so that the boids in each grid cell appear consecutively in memory. This reduces the additional overhead of reading a pointer index from global memory in order to access the position and velocity buffers.  
+It turns out that the cost of a parallelized unstable sort, which occurs once per iteration, is much cheaper than the cost of accessing global memory one extra time of every pair of interacting boids to get its index for accessing data buffers. Considering that there are up to $n^2$ pairs of interacting boids, it is likely that even if Thrust sorts in global memory, the number of global memory accesses by a nearly linear sorting algorithm is lower.
+
+## Performance Analysis
+In this section, the two primary metrics used are Frames Per Second (FPS) of the visualization with vertical synchronization turned off, and the Runtime in Milliseconds per Simulation Cycle (ms/cycle).
+
+#### Frames Per Second (FPS)
+FPS is a relatively straightforward measure where a higher FPS means a better performance. Note however that FPS does include the time of generating the visualization, which may not be an accurate representation of the performance of the flock simulation itself.  
+
+#### Milliseconds per Simulation Cycle (ms/cycle)
+To isolate the simulation time, the simulation step was wrapped in a CudaEvent to record the exact duration of one iteration of the simulation in milliseconds. This result was averaged over 300 cycles to account for some variance between the simulation time of different ticks. A lower ms/cycle indicates better performance.
+
+### Changing the Number of Boids
+![](images/fps_num_boids.png)
+Note: All simulations used a Block Size of 128.
+
+It is clear that as the number of boids increase, the performance of all implementations decrease due to an increase in the number of computations required per simulation cycle. The naive implementation sees the fastest drop in FPS as $n^2$ pairs of boids, the most out of the three implemntations, are checked for interaction in each cycle.
+
+![](images/ms_num_boids.png)
+
+It is more evident from studying the ms/cycle that both Uniform Grid implementations sees a similar growth pattern in simulation time. This is expected as the number of checks computed in both implementations are the same. The Scattered Uniform Grid implementation consistently takes longer than the Coherent Uniform Grid implementation, which supports the theory that the performance enhancement comes from reducing the number of global memory reads.
+
+### Changing the Block Size
+![](images/fps_block_size.png)
+Note: All simulations used 5000 boids.
+
+![](images/ms_block_size.png)
+
+In general, block size should be chosen to effectively maximize the number of active warps so that all the SMs are in use and that enough warps are available to hide latencies. Given that GTX 1650 only have 14 SMs, the program needs to have 14 blocks to ensure that all the SMs are being used. With 5000 boids, this means the block size cannot exceed 357. This could explain the slight decrease in performance for a block size of 512 and 1024, although this decrease is rather insignificant.  
+On the other hand, choosing block size that is too small will also hurt performance. The GTX 1650 supports a max of 16 blocks per SM and max of 1024 threads per SM. This means that if the block size is smaller than 64, then there will be unused threads, leading to less warps available for filling in the latency gaps from memory reads. This effect is clearly seen with a noticeable performance improvement from blocksize of 16 up to 64, and then tapering at 64.
+
+### Changing the Cell Width
+For the Uniform Grid implementations, one last experiment was made with the size of the cells. The two settings considered were the interaction distance and double the interaction distance. In the former case, the boids in at most 27 cells would need to be checked, while the latter would check at most 8 cells. However, checking less cells is not necessarily better as not all boids in those cells actually interact with the selected boid. The total volume covered by the 8 cells, $(4 \times maxDist)^3$, is greater than the total volume covered by the 27 cells, $(3 \times maxDist)^3$, which means there is a higher chance that the 8 cells would contain more boids, and hence potentially more boids that do not interact with the selected boid.  
+Indeed, this is what was observed. In two simulations with 5000 boids and a block size of 128 using the Coherent Uniform Grid implementation, setting the cell width to double the max interaction distance resulted in 687 FPS and 0.26 ms/cycle, whereas setting the cell width to the max interaction distance resulted in 700 FPS and 0.26 ms/cycle. The effective is small, however, as difference in number of boids for either cell width is relatively small.
\ No newline at end of file
diff --git a/images/BoidsDemo.gif b/images/BoidsDemo.gif
new file mode 100644
index 0000000..a88b20d
Binary files /dev/null and b/images/BoidsDemo.gif differ
diff --git a/images/fps_block_size.png b/images/fps_block_size.png
new file mode 100644
index 0000000..5978a41
Binary files /dev/null and b/images/fps_block_size.png differ
diff --git a/images/fps_num_boids.png b/images/fps_num_boids.png
new file mode 100644
index 0000000..d602da2
Binary files /dev/null and b/images/fps_num_boids.png differ
diff --git a/images/ms_block_size.png b/images/ms_block_size.png
new file mode 100644
index 0000000..3ff8f58
Binary files /dev/null and b/images/ms_block_size.png differ
diff --git a/images/ms_num_boids.png b/images/ms_num_boids.png
new file mode 100644
index 0000000..0ac2eea
Binary files /dev/null and b/images/ms_num_boids.png differ
diff --git a/src/kernel.cu b/src/kernel.cu
index 74dffcb..af2182e 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -85,6 +85,7 @@ int *dev_gridCellEndIndices;   // to this cell?
 
 // TODO-2.3 - consider what additional buffers you might need to reshuffle
 // the position and velocity data to be coherent within cells.
+glm::vec3* dev_pos2;
 
 // LOOK-2.1 - Grid parameters based on simulation parameters.
 // These are automatically computed for you in Boids::initSimulation
@@ -157,7 +158,8 @@ void Boids::initSimulation(int N) {
   checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!");
 
   // LOOK-2.1 computing grid params
-  gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
+  gridCellWidth = std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
+  // gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
   int halfSideCount = (int)(scene_scale / gridCellWidth) + 1;
   gridSideCount = 2 * halfSideCount;
 
@@ -169,6 +171,24 @@ void Boids::initSimulation(int N) {
   gridMinimum.z -= halfGridWidth;
 
   // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
+  cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+  cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+  dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+  dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+
+  cudaMalloc((void**)&dev_pos2, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_pos2 failed!");
+
   cudaDeviceSynchronize();
 }
 
@@ -230,10 +250,41 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities)
 * in the `pos` and `vel` arrays.
 */
 __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) {
-  // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
-  // Rule 2: boids try to stay a distance d away from each other
-  // Rule 3: boids try to match the speed of surrounding boids
-  return glm::vec3(0.0f, 0.0f, 0.0f);
+  glm::vec3 pCenter(0.f, 0.f, 0.f);
+  int rule1Neighbors = 0;
+  glm::vec3 pRepell(0.f, 0.f, 0.f);
+  glm::vec3 pVel(0.f, 0.f, 0.f);
+  int rule3Neighbors = 0;
+  glm::vec3 deltaVel(0.f, 0.f, 0.f);
+  for (int i = 0; i < N; i++) {
+    if (i == iSelf)
+      continue;
+    float distance = glm::length(pos[iSelf] - pos[i]);
+    // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+    if (distance < rule1Distance) {
+      pCenter += pos[i];
+      rule1Neighbors++;
+    }
+    // Rule 2: boids try to stay a distance d away from each other
+    if (distance < rule2Distance) {
+      pRepell -= pos[i] - pos[iSelf];
+    }
+    // Rule 3: boids try to match the speed of surrounding boids
+    if (distance < rule3Distance) {
+      pVel += vel[i];
+      rule3Neighbors++;
+    }
+  }
+  if (rule1Neighbors > 0) {
+    pCenter /= rule1Neighbors;
+    deltaVel += (pCenter - pos[iSelf]) * rule1Scale;
+  }
+  deltaVel += pRepell * rule2Scale;
+  if (rule3Neighbors > 0) {
+    pVel /= rule3Neighbors;
+    deltaVel += pVel * rule3Scale;
+  }
+  return deltaVel;
 }
 
 /**
@@ -242,9 +293,20 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po
 */
 __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
   glm::vec3 *vel1, glm::vec3 *vel2) {
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) {
+    return;
+  }
   // Compute a new velocity based on pos and vel1
+  glm::vec3 newVel = vel1[index] + computeVelocityChange(N, index, pos, vel1);
   // Clamp the speed
+  float speedRatio = glm::length(newVel) / maxSpeed;
+  if (speedRatio > 1) {
+    newVel /= speedRatio;
+  }
   // Record the new velocity into vel2. Question: why NOT vel1?
+  // Need to keep vel1 constant for calculating velocity change on other boids from prev snapshot
+  vel2[index] = newVel;
 }
 
 /**
@@ -289,6 +351,13 @@ __global__ void kernComputeIndices(int N, int gridResolution,
     // - Label each boid with the index of its grid cell.
     // - Set up a parallel array of integer indices as pointers to the actual
     //   boid data in pos and vel1/vel2
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= N) {
+    return;
+  }
+  glm::vec3 relPos = (pos[index] - gridMin) * inverseCellWidth;
+  gridIndices[index] = gridIndex3Dto1D((int)(relPos.x), (int)(relPos.y), (int)(relPos.z), gridResolution);
+  indices[index] = index;
 }
 
 // LOOK-2.1 Consider how this could be useful for indicating that a cell
@@ -306,6 +375,31 @@ __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
   // Identify the start point of each cell in the gridIndices array.
   // This is basically a parallel unrolling of a loop that goes
   // "this index doesn't match the one before it, must be a new cell!"
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= N) {
+    return;
+  }
+  if (index == 0) {
+    gridCellStartIndices[particleGridIndices[index]] = 0;
+    gridCellEndIndices[particleGridIndices[index]] = 0;
+  }
+  else if (particleGridIndices[index] != particleGridIndices[index-1]) {
+    gridCellStartIndices[particleGridIndices[index]] = index;
+    gridCellEndIndices[particleGridIndices[index - 1]] = index - 1;
+  }
+}
+
+__global__ void kernReshuffleBoidBuffers(int N, const int* particleArrayIndices,
+  const glm::vec3* pos1, const glm::vec3* vel1, glm::vec3* pos2, glm::vec3* vel2) {
+  // Reshuffle Position and Velocity buffers to match particleArrayIndices order
+  // Return reshuffled in pos2 and vel2
+  // Ensures Pos and Vel are in order of gridCell
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= N) {
+    return;
+  }
+  pos2[index] = pos1[particleArrayIndices[index]];
+  vel2[index] = vel1[particleArrayIndices[index]];
 }
 
 __global__ void kernUpdateVelNeighborSearchScattered(
@@ -322,6 +416,75 @@ __global__ void kernUpdateVelNeighborSearchScattered(
   // - Access each boid in the cell and compute velocity change from
   //   the boids rules, if this boid is within the neighborhood distance.
   // - Clamp the speed change before putting the new speed in vel2
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= N) {
+    return;
+  }
+  
+  glm::vec3 pCenter(0.f);
+  glm::vec3 pRepell(0.f);
+  glm::vec3 pVel(0.f);
+  int rule1Neighbors = 0;
+  int rule3Neighbors = 0;
+  glm::vec3 deltaVel(0.f);
+
+  float maxCheckDist = imax(imax(rule1Distance, rule2Distance), rule3Distance) * inverseCellWidth;
+  glm::vec3 cell3D = (pos[index] - gridMin) * inverseCellWidth;
+  // check grid cube that encompasses all of rule interaction sphere
+  int xMax = imin((int)(cell3D.x + maxCheckDist), gridResolution - 1);
+  int xMin = imax((int)(cell3D.x - maxCheckDist), 0);
+  int yMax = imin((int)(cell3D.y + maxCheckDist), gridResolution - 1);
+  int yMin = imax((int)(cell3D.y - maxCheckDist), 0);
+  int zMax = imin((int)(cell3D.z + maxCheckDist), gridResolution - 1);
+  int zMin = imax((int)(cell3D.z - maxCheckDist), 0);
+  for (int z = zMin; z <= zMax; z++) {
+    for (int y = yMin; y <= yMax; y++) {
+      for (int x = xMin; x <= xMax; x++) {
+        int curCell = gridIndex3Dto1D(x, y, z, gridResolution);
+        int startInd = gridCellStartIndices[curCell];
+        int endInd = gridCellEndIndices[curCell];
+        if (startInd < 0) {
+          // -1 startInd means no boids in cell
+          continue;
+        }
+        for (int i = startInd; i <= endInd; i++) {
+          int ind = particleArrayIndices[i];
+          float distance = glm::length(pos[index] - pos[ind]);
+          // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+          if (distance < rule1Distance) {
+            pCenter += pos[ind];
+            rule1Neighbors += 1;
+          }
+          // Rule 2: boids try to stay a distance d away from each other
+          if (distance < rule2Distance) {
+            pRepell -= pos[ind] - pos[index];
+          }
+          // Rule 3: boids try to match the speed of surrounding boids
+          if (distance < rule3Distance) {
+            pVel += vel1[ind];
+            rule3Neighbors += 1;
+          }
+        }
+      }
+    }
+  }
+  if (rule1Neighbors > 0) {
+    pCenter /= rule1Neighbors;
+    deltaVel += (pCenter - pos[index]) * rule1Scale;
+  }
+  deltaVel += pRepell * rule2Scale;
+  if (rule3Neighbors > 0) {
+    pVel /= rule3Neighbors;
+    deltaVel += pVel * rule3Scale;
+  }
+
+  glm::vec3 newVel = vel1[index] + deltaVel;
+  // Clamp the speed
+  float speedRatio = glm::length(newVel) / maxSpeed;
+  if (speedRatio > 1) {
+    newVel /= speedRatio;
+  }
+  vel2[index] = newVel;
 }
 
 __global__ void kernUpdateVelNeighborSearchCoherent(
@@ -341,6 +504,74 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
   // - Access each boid in the cell and compute velocity change from
   //   the boids rules, if this boid is within the neighborhood distance.
   // - Clamp the speed change before putting the new speed in vel2
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= N) {
+    return;
+  }
+
+  glm::vec3 pCenter(0.f);
+  glm::vec3 pRepell(0.f);
+  glm::vec3 pVel(0.f);
+  int rule1Neighbors = 0;
+  int rule3Neighbors = 0;
+  glm::vec3 deltaVel(0.f);
+
+  float maxCheckDist = imax(imax(rule1Distance, rule2Distance), rule3Distance) * inverseCellWidth;
+  glm::vec3 cell3D = (pos[index] - gridMin) * inverseCellWidth;
+  // check grid cube that encompasses all of rule interaction sphere
+  int xMax = imin((int)(cell3D.x + maxCheckDist), gridResolution - 1);
+  int xMin = imax((int)(cell3D.x - maxCheckDist), 0);
+  int yMax = imin((int)(cell3D.y + maxCheckDist), gridResolution - 1);
+  int yMin = imax((int)(cell3D.y - maxCheckDist), 0);
+  int zMax = imin((int)(cell3D.z + maxCheckDist), gridResolution - 1);
+  int zMin = imax((int)(cell3D.z - maxCheckDist), 0);
+  for (int z = zMin; z <= zMax; z++) {
+    for (int y = yMin; y <= yMax; y++) {
+      for (int x = xMin; x <= xMax; x++) {
+        int curCell = gridIndex3Dto1D(x, y, z, gridResolution);
+        int startInd = gridCellStartIndices[curCell];
+        int endInd = gridCellEndIndices[curCell];
+        if (startInd < 0) {
+          // -1 startInd means no boids in cell
+          continue;
+        }
+        for (int i = startInd; i <= endInd; i++) {
+          float distance = glm::length(pos[index] - pos[i]);
+          // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+          if (distance < rule1Distance) {
+            pCenter += pos[i];
+            rule1Neighbors += 1;
+          }
+          // Rule 2: boids try to stay a distance d away from each other
+          if (distance < rule2Distance) {
+            pRepell -= pos[i] - pos[index];
+          }
+          // Rule 3: boids try to match the speed of surrounding boids
+          if (distance < rule3Distance) {
+            pVel += vel1[i];
+            rule3Neighbors += 1;
+          }
+        }
+      }
+    }
+  }
+  if (rule1Neighbors > 0) {
+    pCenter /= rule1Neighbors;
+    deltaVel += (pCenter - pos[index]) * rule1Scale;
+  }
+  deltaVel += pRepell * rule2Scale;
+  if (rule3Neighbors > 0) {
+    pVel /= rule3Neighbors;
+    deltaVel += pVel * rule3Scale;
+  }
+
+  glm::vec3 newVel = vel1[index] + deltaVel;
+  // Clamp the speed
+  float speedRatio = glm::length(newVel) / maxSpeed;
+  if (speedRatio > 1) {
+    newVel /= speedRatio;
+  }
+  vel2[index] = newVel;
 }
 
 /**
@@ -349,6 +580,10 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
 void Boids::stepSimulationNaive(float dt) {
   // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
   // TODO-1.2 ping-pong the velocity buffers
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  kernUpdateVelocityBruteForce << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_pos, dev_vel1, dev_vel2);
+  kernUpdatePos << <fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_pos, dev_vel2);
+  std::swap(dev_vel1, dev_vel2);
 }
 
 void Boids::stepSimulationScatteredGrid(float dt) {
@@ -364,6 +599,18 @@ void Boids::stepSimulationScatteredGrid(float dt) {
   // - Perform velocity updates using neighbor search
   // - Update positions
   // - Ping-pong buffers as needed
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  kernComputeIndices << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum, 
+    gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+  thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+  dim3 fullCellBlocksPerGrid((gridCellCount + blockSize - 1) / blockSize);
+  kernResetIntBuffer << <fullCellBlocksPerGrid, blockSize >> > (numObjects, dev_gridCellStartIndices, -1);
+  kernResetIntBuffer << <fullCellBlocksPerGrid, blockSize >> > (numObjects, dev_gridCellEndIndices, -1);
+  kernIdentifyCellStartEnd << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+  kernUpdateVelNeighborSearchScattered<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, 
+    gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2);
+  kernUpdatePos << <fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_pos, dev_vel2);
+  std::swap(dev_vel1, dev_vel2);
 }
 
 void Boids::stepSimulationCoherentGrid(float dt) {
@@ -382,6 +629,19 @@ void Boids::stepSimulationCoherentGrid(float dt) {
   // - Perform velocity updates using neighbor search
   // - Update positions
   // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  kernComputeIndices << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum,
+    gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+  thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+  dim3 fullCellBlocksPerGrid((gridCellCount + blockSize - 1) / blockSize);
+  kernResetIntBuffer << <fullCellBlocksPerGrid, blockSize >> > (numObjects, dev_gridCellStartIndices, -1);
+  kernResetIntBuffer << <fullCellBlocksPerGrid, blockSize >> > (numObjects, dev_gridCellEndIndices, -1);
+  kernIdentifyCellStartEnd << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+  kernReshuffleBoidBuffers << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleArrayIndices, dev_pos, dev_vel1, dev_pos2, dev_vel2);
+  kernUpdateVelNeighborSearchCoherent << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum, gridInverseCellWidth,
+    gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_pos2, dev_vel2, dev_vel1);
+  kernUpdatePos << <fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_pos2, dev_vel1);
+  std::swap(dev_pos2, dev_pos);
 }
 
 void Boids::endSimulation() {
@@ -390,6 +650,11 @@ void Boids::endSimulation() {
   cudaFree(dev_pos);
 
   // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+  cudaFree(dev_particleGridIndices);
+  cudaFree(dev_particleArrayIndices);
+  cudaFree(dev_gridCellStartIndices);
+  cudaFree(dev_gridCellEndIndices);
+  cudaFree(dev_pos2);
 }
 
 void Boids::unitTest() {
diff --git a/src/main.cpp b/src/main.cpp
index b82c8c6..22b450e 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -14,8 +14,8 @@
 
 // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
 #define VISUALIZE 1
-#define UNIFORM_GRID 0
-#define COHERENT_GRID 0
+#define UNIFORM_GRID 1
+#define COHERENT_GRID 1
 
 // LOOK-1.2 - change this to adjust particle count in the simulation
 const int N_FOR_VIS = 5000;
@@ -183,7 +183,7 @@ void initShaders(GLuint * program) {
   //====================================
   // Main loop
   //====================================
-  void runCUDA() {
+  void runCUDA(cudaEvent_t& start, cudaEvent_t& stop, float* milliseconds) {
     // Map OpenGL buffer object for writing from CUDA on a single GPU
     // No data is moved (Win & Linux). When mapped to CUDA, OpenGL should not
     // use this buffer
@@ -196,6 +196,7 @@ void initShaders(GLuint * program) {
     cudaGLMapBufferObject((void**)&dptrVertVelocities, boidVBO_velocities);
 
     // execute the kernel
+    cudaEventRecord(start);
     #if UNIFORM_GRID && COHERENT_GRID
     Boids::stepSimulationCoherentGrid(DT);
     #elif UNIFORM_GRID
@@ -203,6 +204,7 @@ void initShaders(GLuint * program) {
     #else
     Boids::stepSimulationNaive(DT);
     #endif
+    cudaEventRecord(stop);
 
     #if VISUALIZE
     Boids::copyBoidsToVBO(dptrVertPositions, dptrVertVelocities);
@@ -210,12 +212,23 @@ void initShaders(GLuint * program) {
     // unmap buffer object
     cudaGLUnmapBufferObject(boidVBO_positions);
     cudaGLUnmapBufferObject(boidVBO_velocities);
+
+    cudaEventSynchronize(stop);
+    cudaEventElapsedTime(milliseconds, start, stop);
   }
 
   void mainLoop() {
     double fps = 0;
     double timebase = 0;
     int frame = 0;
+    cudaEvent_t start, stop;
+    int eventCount = 0;
+    float milliseconds = 0;
+    float totalMs = 0;
+    float avgMs = 0;
+
+    cudaEventCreate(&start);
+    cudaEventCreate(&stop);
 
     Boids::unitTest(); // LOOK-1.2 We run some basic example code to make sure
                        // your CUDA development setup is ready to go.
@@ -232,13 +245,23 @@ void initShaders(GLuint * program) {
         frame = 0;
       }
 
-      runCUDA();
+      runCUDA(start, stop, &milliseconds);
+
+      totalMs += milliseconds;
+      eventCount++;
+      if (eventCount > 300) {
+        avgMs = totalMs / eventCount;
+        totalMs = 0;
+        eventCount = 0;
+      }
 
       std::ostringstream ss;
       ss << "[";
       ss.precision(1);
-      ss << std::fixed << fps;
-      ss << " fps] " << deviceName;
+      ss << std::fixed << fps << " fps] ";
+      ss << "[";
+      ss.precision(10);
+      ss << std::fixed << avgMs << " ms] " << deviceName;
       glfwSetWindowTitle(window, ss.str().c_str());
 
       glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);