diff --git a/README.md b/README.md
index d63a6a1..e2eb23c 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,74 @@
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture,
 Project 1 - Flocking**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Aditya Hota
+  * [LinkedIn](https://www.linkedin.com/in/aditya-hota)
+* Tested on: Windows 11, i7-8750H @ 2.20 GHz 20 GB, GTX 1050 Ti with Max-Q Design 6 GB (personal laptop)
 
-### (TODO: Your README)
+# Overview
+This project involved implementing several algorithms to simulate movements of Boid particles. We started off with a naive implementation, then made optimizations to improve parallelism and reduce the number of memory accesses.
 
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+# Screenshots
+Video
+<br>
+<img src="images/RunningBoid.gif" width="750">
+
+Early in execution
+<br>
+<img src="images/RunningBoidEarly.png" width="750">
+
+Later on in execution
+<br>
+<img src="images/RunningBoidLate.png" width="750">
+
+# Functionality
+Boids move according to three rules. Each boid is updated by its own GPU thread, so the rules are evaluated for many boids in parallel.
+
+1) Boids try to fly towards the centre of mass of neighboring boids.
+2) Boids try to keep a small distance away from other objects (including other boids).
+3) Boids try to match velocity with nearby boids.
+
+Pseudocode for the algorithms was provided in the [instructions](INSTRUCTION.md).
+
+## 1.2 Naive implementation
+The first version of the implementation involved each boid naively checking all other boids to determine how the boid in question should change its velocity.
+
+## 2.1 Uniform grids
+A boid is only influenced by other boids near it, so we do not need to check all other N-1 boids every time step. Instead, we can narrow the search to the boids that are close to the one in question.
+
+We do this by creating a uniform grid of cells, representing quantized parts of the overall space. Then, for each boid, we figure out which cell the boid is in. We sort the array of boid indices by cell index, then store the start and end positions of each cell within the sorted array. Boids in the same cell will have their position and velocity array indices stored between that cell's start and end positions.
+
+Lastly, when finding the closest neighbors for each boid, we determine which cells are near it. We do this by checking whether the boid is in the lower or upper half of its cell along each dimension, and including the neighboring cell closest in that direction. Then, we use the uniform grid of cells to see which boids are in those cells. Only these boids are used to influence the current boid's velocity.
+
+## 2.3 Uniform grids with memory coherency
+In the [uniform grids](#21-uniform-grids) section, for each boid we stored the index of where we can find its position and velocity. Now, we reorder the position and velocity arrays to match the sorted cell order, so the index into the sorted cell array can be used directly. This reduces the number of indirections and random accesses that we have to perform.
+
+# Performance Analysis
+FPS is used as the primary metric for performance analysis. The frame rate is uncapped (v-sync off), which means the simulation runs as fast as possible.
+
+I added code to track the average FPS over the complete execution of the program. Each average FPS value was measured over 15 seconds of execution.
+
+## Graphs
+### Increasing Boids (with visualization)
+As shown in the graph below, increasing the number of boids causes the FPS to decrease at every level of optimization. However, the naive solution drops off much more steeply, consistent with its O(N^2) pairwise checks per step, whereas both uniform grids decrease approximately linearly. Since the memory coherency optimization has fewer memory accesses, its performance decreases at a slower rate.
+
+<img src="images/Graph_FPS_Boids_Vis.png" width="750">
+
+### Increasing Boids (without visualization)
+We see the same trends as above, but just with higher average frame rates since the boids do not need to be rendered.
+
+<img src="images/Graph_FPS_Boids_NoVis.png" width="750">
+
+### Increasing Block Size
+We see that there is no impact to the FPS when the block size is changed.
+
+<img src="images/Graph_FPS_BlockSize.png" width="750">
+
+## Questions
+1) Increasing the number of boids causes the FPS to decrease because there are more checks that need to be done and more threads to run in each kernel launch. Since there are more boids, the naive solution has to check more boids, and in the optimized solutions each boid is likely to have more close neighbors.
+
+2) Changing the block size and number of blocks does not change the FPS because we are only changing how the threads are grouped into blocks and divided across the GPU SMs. The total number of threads stays the same, and each SM executes threads in warps of 32 and can keep many warps resident at once, so for the block sizes tested the GPU was likely kept similarly busy in every configuration.
+
+3) There was a performance improvement with the coherent uniform grid. This is because we are not going through an additional index array to find the position and velocity of each boid, which removes one dependent memory read per neighbor. Furthermore, the memory accesses to position and velocity are more contiguous, so they are more likely to be coalesced and served from cache.
+
+4) Changing the number of neighboring cells checked to 27 actually improved performance, because the cells are smaller and so is the total volume checked. The search volume fits the neighborhood more tightly, so we may end up having fewer boids to check than before.
\ No newline at end of file
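The grid construction described in the README's "2.1 Uniform grids" section can be sketched on the host in plain C++. This is a minimal illustration, not the CUDA code from this change: `Vec3`, `UniformGrid`, and `buildUniformGrid` are hypothetical names, and `std::stable_sort` stands in for the `thrust::sort_by_key` call so that the result is deterministic. It assumes every position lies inside the grid.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

struct Vec3 { float x, y, z; };  // hypothetical stand-in for glm::vec3

// Flatten a 3D cell coordinate into a 1D cell index (x varies fastest).
int gridIndex3Dto1D(int x, int y, int z, int gridResolution) {
    return x + y * gridResolution + z * gridResolution * gridResolution;
}

struct UniformGrid {
    std::vector<int> particleArrayIndices;  // boid indices, sorted by cell
    std::vector<int> particleGridIndices;   // cell index of each sorted slot
    std::vector<int> cellStart;             // first slot of each cell, -1 if empty
    std::vector<int> cellEnd;               // last slot of each cell (inclusive), -1 if empty
};

UniformGrid buildUniformGrid(const std::vector<Vec3>& pos, Vec3 gridMin,
                             float cellWidth, int gridResolution) {
    const int n = static_cast<int>(pos.size());
    const int cellCount = gridResolution * gridResolution * gridResolution;
    const float inverseCellWidth = 1.0f / cellWidth;

    // Step 1: label each boid with the 1D index of the cell it falls in.
    std::vector<int> cellOf(n);
    for (int i = 0; i < n; ++i) {
        int cx = static_cast<int>(std::floor((pos[i].x - gridMin.x) * inverseCellWidth));
        int cy = static_cast<int>(std::floor((pos[i].y - gridMin.y) * inverseCellWidth));
        int cz = static_cast<int>(std::floor((pos[i].z - gridMin.z) * inverseCellWidth));
        cellOf[i] = gridIndex3Dto1D(cx, cy, cz, gridResolution);
    }

    // Step 2: sort boid indices by cell so boids sharing a cell are adjacent.
    UniformGrid grid;
    grid.particleArrayIndices.resize(n);
    std::iota(grid.particleArrayIndices.begin(), grid.particleArrayIndices.end(), 0);
    std::stable_sort(grid.particleArrayIndices.begin(), grid.particleArrayIndices.end(),
                     [&cellOf](int a, int b) { return cellOf[a] < cellOf[b]; });
    grid.particleGridIndices.resize(n);
    for (int slot = 0; slot < n; ++slot) {
        grid.particleGridIndices[slot] = cellOf[grid.particleArrayIndices[slot]];
    }

    // Step 3: a slot starts a cell if its cell differs from the previous slot's,
    // and ends a cell if its cell differs from the next slot's.
    grid.cellStart.assign(cellCount, -1);
    grid.cellEnd.assign(cellCount, -1);
    for (int slot = 0; slot < n; ++slot) {
        const int cell = grid.particleGridIndices[slot];
        if (slot == 0 || cell != grid.particleGridIndices[slot - 1]) {
            grid.cellStart[cell] = slot;
        }
        if (slot == n - 1 || cell != grid.particleGridIndices[slot + 1]) {
            grid.cellEnd[cell] = slot;
        }
    }
    return grid;
}
```

A neighbor search would then visit slots `cellStart[c]` through `cellEnd[c]` for each candidate cell `c`, skipping cells whose start is `-1`.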
diff --git a/images/Graph_FPS_BlockSize.png b/images/Graph_FPS_BlockSize.png
new file mode 100644
index 0000000..f158c4e
Binary files /dev/null and b/images/Graph_FPS_BlockSize.png differ
diff --git a/images/Graph_FPS_Boids_NoVis.png b/images/Graph_FPS_Boids_NoVis.png
new file mode 100644
index 0000000..d2ee843
Binary files /dev/null and b/images/Graph_FPS_Boids_NoVis.png differ
diff --git a/images/Graph_FPS_Boids_Vis.png b/images/Graph_FPS_Boids_Vis.png
new file mode 100644
index 0000000..c89d04d
Binary files /dev/null and b/images/Graph_FPS_Boids_Vis.png differ
diff --git a/images/RunningBoid.gif b/images/RunningBoid.gif
new file mode 100644
index 0000000..9e5c853
Binary files /dev/null and b/images/RunningBoid.gif differ
diff --git a/images/RunningBoidEarly.png b/images/RunningBoidEarly.png
new file mode 100644
index 0000000..4ba72f1
Binary files /dev/null and b/images/RunningBoidEarly.png differ
diff --git a/images/RunningBoidLate.png b/images/RunningBoidLate.png
new file mode 100644
index 0000000..7697729
Binary files /dev/null and b/images/RunningBoidLate.png differ
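The neighbor search kernels in `src/kernel.cu` decide, per axis, whether to look at the cell below or the cell above the boid's own cell. The sketch below isolates that decision for a single axis in host C++. `CellRange` and `neighborRange` are hypothetical names introduced only for illustration, and the half-cell test assumes the cell width is twice the largest rule distance (the "2x width grids" setup mentioned in the code comments).

```cpp
#include <cassert>
#include <cmath>

// Inclusive range of cell coordinates to visit along one axis.
struct CellRange {
    int lo;
    int hi;
};

// gridCoord is the continuous grid coordinate along one axis,
// i.e. (position - gridMin) * inverseCellWidth.
CellRange neighborRange(float gridCoord, int gridResolution) {
    const int cell = static_cast<int>(std::floor(gridCoord));
    const float frac = gridCoord - static_cast<float>(cell);

    CellRange range{cell, cell};
    // Lower half of the cell: the neighborhood can reach into the cell below.
    if (frac <= 0.5f && cell > 0) {
        range.lo = cell - 1;
    }
    // Upper half of the cell: the neighborhood can reach into the cell above.
    if (frac > 0.5f && cell < gridResolution - 1) {
        range.hi = cell + 1;
    }
    return range;
}
```

Applying this to x, y, and z gives at most 2 cells per axis, so at most 8 cells in total, and fewer at the grid boundary where the out-of-range neighbor is clamped away.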
diff --git a/src/kernel.cu b/src/kernel.cu
index 74dffcb..4035c4d 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -83,8 +83,9 @@ thrust::device_ptr<int> dev_thrust_particleGridIndices;
 int *dev_gridCellStartIndices; // What part of dev_particleArrayIndices belongs
 int *dev_gridCellEndIndices;   // to this cell?
 
-// TODO-2.3 - consider what additional buffers you might need to reshuffle
+// DONE-2.3 - consider what additional buffers you might need to reshuffle
 // the position and velocity data to be coherent within cells.
+glm::vec3* dev_pos2;
 
 // LOOK-2.1 - Grid parameters based on simulation parameters.
 // These are automatically computed for you in Boids::initSimulation
@@ -168,7 +169,22 @@ void Boids::initSimulation(int N) {
   gridMinimum.y -= halfGridWidth;
   gridMinimum.z -= halfGridWidth;
 
-  // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
+  // DONE-2.1 DONE-2.3 - Allocate additional buffers here.
+  cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+  cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+  cudaMalloc((void**)&dev_pos2, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_pos2 failed!");
+
   cudaDeviceSynchronize();
 }
 
@@ -210,8 +226,8 @@ __global__ void kernCopyVelocitiesToVBO(int N, glm::vec3 *vel, float *vbo, float
 void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) {
   dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
 
-  kernCopyPositionsToVBO << <fullBlocksPerGrid, blockSize >> >(numObjects, dev_pos, vbodptr_positions, scene_scale);
-  kernCopyVelocitiesToVBO << <fullBlocksPerGrid, blockSize >> >(numObjects, dev_vel1, vbodptr_velocities, scene_scale);
+  kernCopyPositionsToVBO <<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, vbodptr_positions, scene_scale);
+  kernCopyVelocitiesToVBO <<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_vel1, vbodptr_velocities, scene_scale);
 
   checkCUDAErrorWithLine("copyBoidsToVBO failed!");
 
@@ -230,21 +246,89 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities)
 * in the `pos` and `vel` arrays.
 */
 __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) {
-  // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
-  // Rule 2: boids try to stay a distance d away from each other
-  // Rule 3: boids try to match the speed of surrounding boids
-  return glm::vec3(0.0f, 0.0f, 0.0f);
+    glm::vec3 pos_self = pos[iSelf];
+    glm::vec3 velocity_change(0.0f);
+
+    glm::vec3 perceived_center(0.0f);
+    glm::vec3 c(0.0f);
+    glm::vec3 perceived_velocity(0.0f);
+
+    unsigned int num_neighbors_r1 = 0;
+    unsigned int num_neighbors_r3 = 0;
+
+    for (int i = 0; i < N; i++)
+    {
+        if (i == iSelf)
+        {
+            continue;
+        }
+
+        glm::vec3 pos_other = pos[i];
+        float dist_to_other = glm::length(pos_other - pos_self);
+
+        // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+        if (dist_to_other < rule1Distance)
+        {
+            perceived_center += pos_other;
+            num_neighbors_r1++;
+        }
+
+        // Rule 2: boids try to stay a distance d away from each other
+        if (dist_to_other < rule2Distance)
+        {
+            c -= (pos_other - pos_self);
+        }
+
+        // Rule 3: boids try to match the speed of surrounding boids
+        if (dist_to_other < rule3Distance)
+        {
+            perceived_velocity += vel[i];
+            num_neighbors_r3++;
+        }
+    }
+
+    // Calculate contributions for each rule
+    // Rule 1
+    if (num_neighbors_r1 > 0)
+    {
+        velocity_change += (perceived_center / (float) num_neighbors_r1 - pos_self) * rule1Scale;
+    }
+
+    // Rule 2
+    velocity_change += c * rule2Scale;
+
+    // Rule 3
+    if (num_neighbors_r3 > 0)
+    {
+        velocity_change += (perceived_velocity / (float) num_neighbors_r3) * rule3Scale;
+    }
+
+    return velocity_change;
 }
 
 /**
-* TODO-1.2 implement basic flocking
+* DONE-1.2 implement basic flocking
 * For each of the `N` bodies, update its position based on its current velocity.
 */
 __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
-  glm::vec3 *vel1, glm::vec3 *vel2) {
-  // Compute a new velocity based on pos and vel1
-  // Clamp the speed
-  // Record the new velocity into vel2. Question: why NOT vel1?
+    glm::vec3 *vel1, glm::vec3 *vel2) {
+    // Compute Boid associated with thread
+    int index = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (index >= N) {
+        return;
+    }
+
+    // Compute a new velocity based on pos and vel1
+    glm::vec3 new_velocity = vel1[index] + computeVelocityChange(N, index, pos, vel1);
+
+    // Clamp the speed
+    if (glm::length(new_velocity) > maxSpeed)
+    {
+        new_velocity = maxSpeed * glm::normalize(new_velocity);
+    }
+
+    // Record the new velocity into vel2. Question: why NOT vel1?
+    vel2[index] = new_velocity;
 }
 
 /**
@@ -283,12 +367,22 @@ __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) {
 }
 
 __global__ void kernComputeIndices(int N, int gridResolution,
-  glm::vec3 gridMin, float inverseCellWidth,
-  glm::vec3 *pos, int *indices, int *gridIndices) {
-    // TODO-2.1
+    glm::vec3 gridMin, float inverseCellWidth,
+    glm::vec3 *pos, int *indices, int *gridIndices) {
+    // DONE-2.1
+    // Compute index of Boid
+    int index = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (index >= N) {
+        return;
+    }
+
     // - Label each boid with the index of its grid cell.
+    glm::vec3 cell_index_3D = glm::floor(inverseCellWidth * (pos[index] - gridMin));
+    gridIndices[index] = gridIndex3Dto1D(cell_index_3D.x, cell_index_3D.y, cell_index_3D.z, gridResolution);
+
     // - Set up a parallel array of integer indices as pointers to the actual
     //   boid data in pos and vel1/vel2
+    indices[index] = index;
 }
 
 // LOOK-2.1 Consider how this could be useful for indicating that a cell
@@ -302,26 +396,161 @@ __global__ void kernResetIntBuffer(int N, int *intBuffer, int value) {
 
 __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
   int *gridCellStartIndices, int *gridCellEndIndices) {
-  // TODO-2.1
-  // Identify the start point of each cell in the gridIndices array.
-  // This is basically a parallel unrolling of a loop that goes
-  // "this index doesn't match the one before it, must be a new cell!"
+    // DONE-2.1
+    // Compute index of Boid
+    int index = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (index >= N) {
+        return;
+    }
+
+    // Identify the start point of each cell in the gridIndices array.
+    // This is basically a parallel unrolling of a loop that goes
+    // "this index doesn't match the one before it, must be a new cell!"
+    int particle_grid_index = particleGridIndices[index];
+    if ((index == 0) || (particle_grid_index != particleGridIndices[index - 1]))
+    {
+        gridCellStartIndices[particle_grid_index] = index;
+    }
+    if ((index == N - 1) || (particle_grid_index != particleGridIndices[index + 1]))
+    {
+        gridCellEndIndices[particle_grid_index] = index;
+    }
 }
 
 __global__ void kernUpdateVelNeighborSearchScattered(
-  int N, int gridResolution, glm::vec3 gridMin,
-  float inverseCellWidth, float cellWidth,
-  int *gridCellStartIndices, int *gridCellEndIndices,
-  int *particleArrayIndices,
-  glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
-  // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce
-  // the number of boids that need to be checked.
-  // - Identify the grid cell that this particle is in
-  // - Identify which cells may contain neighbors. This isn't always 8.
-  // - For each cell, read the start/end indices in the boid pointer array.
-  // - Access each boid in the cell and compute velocity change from
-  //   the boids rules, if this boid is within the neighborhood distance.
-  // - Clamp the speed change before putting the new speed in vel2
+    int N, int gridResolution, glm::vec3 gridMin,
+    float inverseCellWidth, float cellWidth,
+    int *gridCellStartIndices, int *gridCellEndIndices,
+    int *particleArrayIndices,
+    glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
+    // DONE-2.1 - Update a boid's velocity using the uniform grid to reduce
+    // the number of boids that need to be checked.
+    // - Identify the grid cell that this particle is in
+    // - Identify which cells may contain neighbors. This isn't always 8.
+    // - For each cell, read the start/end indices in the boid pointer array.
+    // - Access each boid in the cell and compute velocity change from
+    //   the boids rules, if this boid is within the neighborhood distance.
+    // - Clamp the speed change before putting the new speed in vel2
+
+    int index = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (index >= N)
+    {
+        return;
+    }
+
+    glm::vec3 pos_self = pos[index];
+    glm::vec3 grid_cell = inverseCellWidth * (pos_self - gridMin);
+    glm::vec3 grid_cell_int = glm::floor(grid_cell);
+    glm::vec3 grid_cell_frac = grid_cell - grid_cell_int;
+
+    // The following vectors store whether neighbors in the positive and negative direction should be checked
+    glm::ivec3 check_neg;
+    glm::ivec3 check_pos;
+
+    check_neg.x = (grid_cell_frac.x <= 0.5f && grid_cell_int.x > 0) ? 1 : 0;
+    check_neg.y = (grid_cell_frac.y <= 0.5f && grid_cell_int.y > 0) ? 1 : 0;
+    check_neg.z = (grid_cell_frac.z <= 0.5f && grid_cell_int.z > 0) ? 1 : 0;
+    check_pos.x = (grid_cell_frac.x > 0.5f  && grid_cell_int.x < gridResolution - 1) ? 1 : 0;
+    check_pos.y = (grid_cell_frac.y > 0.5f  && grid_cell_int.y < gridResolution - 1) ? 1 : 0;
+    check_pos.z = (grid_cell_frac.z > 0.5f  && grid_cell_int.z < gridResolution - 1) ? 1 : 0;
+
+    // Per-rule accumulators, explicitly zero-initialized because a default-constructed
+    // glm::vec3 is not guaranteed to be zeroed in every GLM version
+    glm::vec3 velocity_change(0.0f);
+    glm::vec3 perceived_center(0.0f);
+    glm::vec3 c(0.0f);
+    glm::vec3 perceived_velocity(0.0f);
+
+    int num_neighbors_r1 = 0;
+    int num_neighbors_r3 = 0;
+
+    for (int z = grid_cell_int.z - check_neg.z; z <= grid_cell_int.z + check_pos.z; z++)
+    {
+        for (int y = grid_cell_int.y - check_neg.y; y <= grid_cell_int.y + check_pos.y; y++)
+        {
+            for (int x = grid_cell_int.x - check_neg.x; x <= grid_cell_int.x + check_pos.x; x++)
+            {
+                int neighbor_grid_cell_1D = gridIndex3Dto1D(x, y, z, gridResolution);
+
+                if (gridCellStartIndices[neighbor_grid_cell_1D] == -1)
+                {
+                    continue;
+                }
+
+                for (int cell_index = gridCellStartIndices[neighbor_grid_cell_1D]; cell_index <= gridCellEndIndices[neighbor_grid_cell_1D]; cell_index++)
+                {
+                    int index_other = particleArrayIndices[cell_index];
+                    glm::vec3 pos_other = pos[index_other];
+
+                    if (index_other != index) {
+                        float dist_to_other = glm::distance(pos_other, pos_self);
+
+                        // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+                        if (dist_to_other < rule1Distance)
+                        {
+                            perceived_center += pos_other;
+                            num_neighbors_r1++;
+                        }
+
+                        // Rule 2: boids try to stay a distance d away from each other
+                        if (dist_to_other < rule2Distance)
+                        {
+                            c -= (pos_other - pos_self);
+                        }
+
+                        // Rule 3: boids try to match the speed of surrounding boids
+                        if (dist_to_other < rule3Distance) {
+                            perceived_velocity += vel1[index_other];
+                            num_neighbors_r3++;
+                        }
+                    }
+                }
+            }
+        }
+    }
+
+    // Calculate contributions for each rule
+    // Rule 1
+    if (num_neighbors_r1 > 0)
+    {
+        velocity_change += (perceived_center / (float)num_neighbors_r1 - pos_self) * rule1Scale;
+    }
+
+    // Rule 2
+    velocity_change += c * rule2Scale;
+
+    // Rule 3
+    if (num_neighbors_r3 > 0)
+    {
+        velocity_change += (perceived_velocity / (float)num_neighbors_r3) * rule3Scale;
+    }
+
+    // - Clamp the speed change before putting the new speed in vel2
+    glm::vec3 new_velocity = vel1[index] + velocity_change;
+    if (glm::length(new_velocity) > maxSpeed)
+    {
+        new_velocity = maxSpeed * glm::normalize(new_velocity);
+    }
+    vel2[index] = new_velocity;
+}
+
+__global__ void kernReshuffleParticlePosVelData(
+    int N, glm::vec3* pos1, glm::vec3* pos2, glm::vec3* vel1, glm::vec3* vel2,
+    int* particleArrayIndices) {
+    // Swaps the position and velocity data values to correspond to the cell
+    // indices rather than the boid indices
+
+    // Calculate index of Boid
+    int index = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (index >= N)
+    {
+        return;
+    }
+
+    // Get sorted particle index value
+    int particle_array_index = particleArrayIndices[index];
+    pos2[index] = pos1[particle_array_index];
+    vel2[index] = vel1[particle_array_index];
 }
 
 __global__ void kernUpdateVelNeighborSearchCoherent(
@@ -329,7 +558,7 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
   float inverseCellWidth, float cellWidth,
   int *gridCellStartIndices, int *gridCellEndIndices,
   glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
-  // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered,
+  // DONE-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered,
   // except with one less level of indirection.
   // This should expect gridCellStartIndices and gridCellEndIndices to refer
   // directly to pos and vel1.
@@ -341,33 +570,163 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
   // - Access each boid in the cell and compute velocity change from
   //   the boids rules, if this boid is within the neighborhood distance.
   // - Clamp the speed change before putting the new speed in vel2
+
+    int index = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (index >= N)
+    {
+        return;
+    }
+
+    glm::vec3 pos_self = pos[index];
+    glm::vec3 grid_cell = inverseCellWidth * (pos_self - gridMin);
+    glm::vec3 grid_cell_int = glm::floor(grid_cell);
+    glm::vec3 grid_cell_frac = grid_cell - grid_cell_int;
+
+    // The following vectors store whether neighbors in the positive and negative direction should be checked
+    glm::ivec3 check_neg;
+    glm::ivec3 check_pos;
+
+    check_neg.x = (grid_cell_frac.x <= 0.5f && grid_cell_int.x > 0) ? 1 : 0;
+    check_neg.y = (grid_cell_frac.y <= 0.5f && grid_cell_int.y > 0) ? 1 : 0;
+    check_neg.z = (grid_cell_frac.z <= 0.5f && grid_cell_int.z > 0) ? 1 : 0;
+    check_pos.x = (grid_cell_frac.x > 0.5f  && grid_cell_int.x < gridResolution - 1) ? 1 : 0;
+    check_pos.y = (grid_cell_frac.y > 0.5f  && grid_cell_int.y < gridResolution - 1) ? 1 : 0;
+    check_pos.z = (grid_cell_frac.z > 0.5f  && grid_cell_int.z < gridResolution - 1) ? 1 : 0;
+
+    // Per-rule accumulators, explicitly zero-initialized because a default-constructed
+    // glm::vec3 is not guaranteed to be zeroed in every GLM version
+    glm::vec3 velocity_change(0.0f);
+    glm::vec3 perceived_center(0.0f);
+    glm::vec3 c(0.0f);
+    glm::vec3 perceived_velocity(0.0f);
+
+    int num_neighbors_r1 = 0;
+    int num_neighbors_r3 = 0;
+
+    for (int z = grid_cell_int.z - check_neg.z; z <= grid_cell_int.z + check_pos.z; z++)
+    {
+        for (int y = grid_cell_int.y - check_neg.y; y <= grid_cell_int.y + check_pos.y; y++)
+        {
+            for (int x = grid_cell_int.x - check_neg.x; x <= grid_cell_int.x + check_pos.x; x++)
+            {
+                int neighbor_grid_cell_1D = gridIndex3Dto1D(x, y, z, gridResolution);
+
+                if (gridCellStartIndices[neighbor_grid_cell_1D] == -1)
+                {
+                    continue;
+                }
+
+                for (int cell_index = gridCellStartIndices[neighbor_grid_cell_1D]; cell_index <= gridCellEndIndices[neighbor_grid_cell_1D]; cell_index++)
+                {
+                    glm::vec3 pos_other = pos[cell_index];
+
+                    if (cell_index != index) {
+                        float dist_to_other = glm::distance(pos_other, pos_self);
+
+                        // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+                        if (dist_to_other < rule1Distance)
+                        {
+                            perceived_center += pos_other;
+                            num_neighbors_r1++;
+                        }
+
+                        // Rule 2: boids try to stay a distance d away from each other
+                        if (dist_to_other < rule2Distance)
+                        {
+                            c -= (pos_other - pos_self);
+                        }
+
+                        // Rule 3: boids try to match the speed of surrounding boids
+                        if (dist_to_other < rule3Distance) {
+                            perceived_velocity += vel1[cell_index];
+                            num_neighbors_r3++;
+                        }
+                    }
+                }
+            }
+        }
+    }
+
+    // Calculate contributions for each rule
+    // Rule 1
+    if (num_neighbors_r1 > 0)
+    {
+        velocity_change += (perceived_center / (float)num_neighbors_r1 - pos_self) * rule1Scale;
+    }
+
+    // Rule 2
+    velocity_change += c * rule2Scale;
+
+    // Rule 3
+    if (num_neighbors_r3 > 0)
+    {
+        velocity_change += (perceived_velocity / (float)num_neighbors_r3) * rule3Scale;
+    }
+
+    // - Clamp the speed change before putting the new speed in vel2
+    glm::vec3 new_velocity = vel1[index] + velocity_change;
+    if (glm::length(new_velocity) > maxSpeed)
+    {
+        new_velocity = maxSpeed * glm::normalize(new_velocity);
+    }
+    vel2[index] = new_velocity;
 }
 
 /**
 * Step the entire N-body simulation by `dt` seconds.
 */
 void Boids::stepSimulationNaive(float dt) {
-  // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
-  // TODO-1.2 ping-pong the velocity buffers
+    // DONE-1.2 - use the kernels you wrote to step the simulation forward in time.
+    dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+    kernUpdateVelocityBruteForce <<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2);
+    kernUpdatePos <<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+
+    // DONE-1.2 ping-pong the velocity buffers
+    glm::vec3* temp = dev_vel2;
+    dev_vel2 = dev_vel1;
+    dev_vel1 = temp;
 }
 
 void Boids::stepSimulationScatteredGrid(float dt) {
-  // TODO-2.1
-  // Uniform Grid Neighbor search using Thrust sort.
-  // In Parallel:
-  // - label each particle with its array index as well as its grid index.
-  //   Use 2x width grids.
-  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
-  //   are welcome to do a performance comparison.
-  // - Naively unroll the loop for finding the start and end indices of each
-  //   cell's data pointers in the array of boid indices
-  // - Perform velocity updates using neighbor search
-  // - Update positions
-  // - Ping-pong buffers as needed
+    // DONE-2.1
+    // Uniform Grid Neighbor search using Thrust sort.
+    // In Parallel:
+    // - label each particle with its array index as well as its grid index.
+    //   Use 2x width grids.
+    // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
+    //   are welcome to do a performance comparison.
+    // - Naively unroll the loop for finding the start and end indices of each
+    //   cell's data pointers in the array of boid indices
+    // - Perform velocity updates using neighbor search
+    // - Update positions
+    // - Ping-pong buffers as needed
+
+    dim3 fullBlocksPerGridBoids((numObjects + blockSize - 1) / blockSize);
+    dim3 fullBlocksPerGridCells((gridCellCount + blockSize - 1) / blockSize);
+
+    kernComputeIndices<<<fullBlocksPerGridBoids, blockSize>>>(
+        numObjects, gridSideCount, gridMinimum, gridInverseCellWidth,
+        dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+
+    dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+    dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+    thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
+    kernResetIntBuffer<<<fullBlocksPerGridCells, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+
+    kernIdentifyCellStartEnd<<<fullBlocksPerGridBoids, blockSize>>>(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+    kernUpdateVelNeighborSearchScattered<<<fullBlocksPerGridBoids, blockSize>>>(
+        numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth,
+        dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2);
+    kernUpdatePos<<<fullBlocksPerGridBoids, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+
+    glm::vec3* temp = dev_vel2;
+    dev_vel2 = dev_vel1;
+    dev_vel1 = temp;
 }
 
 void Boids::stepSimulationCoherentGrid(float dt) {
-  // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
+  // DONE-2.3 - start by copying Boids::stepSimulationNaiveGrid
   // Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
   // In Parallel:
   // - Label each particle with its array index as well as its grid index.
@@ -382,6 +741,30 @@ void Boids::stepSimulationCoherentGrid(float dt) {
   // - Perform velocity updates using neighbor search
   // - Update positions
   // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+
+    dim3 fullBlocksPerGridBoids((numObjects + blockSize - 1) / blockSize);
+    dim3 fullBlocksPerGridCells((gridCellCount + blockSize - 1) / blockSize);
+
+    kernComputeIndices<<<fullBlocksPerGridBoids, blockSize>>>(
+        numObjects, gridSideCount, gridMinimum, gridInverseCellWidth,
+        dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+
+    dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+    dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+    thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
+    kernResetIntBuffer<<<fullBlocksPerGridCells, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+
+    kernIdentifyCellStartEnd<<<fullBlocksPerGridBoids, blockSize>>>(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+    kernReshuffleParticlePosVelData<<<fullBlocksPerGridBoids, blockSize>>>(numObjects, dev_pos, dev_pos2, dev_vel1, dev_vel2, dev_particleArrayIndices);
+    kernUpdateVelNeighborSearchCoherent<<<fullBlocksPerGridBoids, blockSize>>>(
+        numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth,
+        dev_gridCellStartIndices, dev_gridCellEndIndices, dev_pos2, dev_vel2, dev_vel1);
+    kernUpdatePos<<<fullBlocksPerGridBoids, blockSize>>>(numObjects, dt, dev_pos2, dev_vel1);
+
+    glm::vec3* temp = dev_pos2;
+    dev_pos2 = dev_pos;
+    dev_pos = temp;
 }
 
 void Boids::endSimulation() {
@@ -389,7 +772,13 @@ void Boids::endSimulation() {
   cudaFree(dev_vel2);
   cudaFree(dev_pos);
 
-  // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+  // DONE-2.1 DONE-2.3 - Free any additional buffers here.
+  cudaFree(dev_gridCellStartIndices);
+  cudaFree(dev_gridCellEndIndices);
+  cudaFree(dev_particleArrayIndices);
+  cudaFree(dev_particleGridIndices);
+
+  cudaFree(dev_pos2);
 }
 
 void Boids::unitTest() {
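The difference between the scattered and coherent layouts in `src/kernel.cu` comes down to one gather step, which can be sketched in host C++. The names below (`Vec3`, `gatherByCell`, `sumXScattered`, `sumXCoherent`) are hypothetical and exist only for this illustration. The two summation helpers read the same boids, but the coherent version skips the lookup through the index array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };  // hypothetical stand-in for glm::vec3

// Gather per-boid data into cell-sorted order: out[slot] = data[indices[slot]].
// This mirrors what a reshuffle kernel does with one thread per slot.
std::vector<Vec3> gatherByCell(const std::vector<Vec3>& data,
                               const std::vector<int>& particleArrayIndices) {
    std::vector<Vec3> coherent(particleArrayIndices.size());
    for (std::size_t slot = 0; slot < particleArrayIndices.size(); ++slot) {
        coherent[slot] = data[particleArrayIndices[slot]];
    }
    return coherent;
}

// Scattered access: every read goes through the index array first.
float sumXScattered(const std::vector<Vec3>& pos,
                    const std::vector<int>& particleArrayIndices,
                    int start, int end) {
    float total = 0.0f;
    for (int slot = start; slot <= end; ++slot) {
        total += pos[particleArrayIndices[slot]].x;
    }
    return total;
}

// Coherent access: the cell's boids occupy consecutive slots, read directly.
float sumXCoherent(const std::vector<Vec3>& coherentPos, int start, int end) {
    float total = 0.0f;
    for (int slot = start; slot <= end; ++slot) {
        total += coherentPos[slot].x;
    }
    return total;
}
```

The gather costs one extra pass over the data per step, and in exchange every neighbor read in the inner loop becomes a direct, contiguous access.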
diff --git a/src/main.cpp b/src/main.cpp
index b82c8c6..be74ded 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -14,13 +14,16 @@
 
 // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
 #define VISUALIZE 1
-#define UNIFORM_GRID 0
-#define COHERENT_GRID 0
+#define UNIFORM_GRID 1
+#define COHERENT_GRID 1
 
 // LOOK-1.2 - change this to adjust particle count in the simulation
 const int N_FOR_VIS = 5000;
 const float DT = 0.2f;
 
+unsigned long long fps_cum_sum = 0; // running sum of sampled FPS values
+unsigned long long fps_cum_cnt = 0; // number of samples accumulated
+
 /**
 * C main function.
 */
@@ -255,9 +258,15 @@ void initShaders(GLuint * program) {
 
       glfwSwapBuffers(window);
       #endif
+
+      fps_cum_sum += (unsigned long long) fps;
+      fps_cum_cnt++;
     }
     glfwDestroyWindow(window);
     glfwTerminate();
+
+    unsigned long long avg_fps = (fps_cum_cnt > 0) ? (fps_cum_sum / fps_cum_cnt) : 0; // avoid dividing by zero if no frames ran
+    std::cout << "Avg FPS: " << avg_fps << std::endl;
   }