diff --git a/README.md b/README.md
index d63a6a1..fcc36c6 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,55 @@
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture,
 Project 1 - Flocking**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Gene Liu
+  * [LinkedIn](https://www.linkedin.com/in/gene-l-3108641a3/)
+* Tested on: Windows 10, i7-9750H @ 2.60GHz, 16GB RAM, GTX 1650 Max-Q 4096MB (personal laptop)
+  * SM 7.5
 
-### (TODO: Your README)
+### Screenshots/Gifs
 
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+![](images/5kboid.gif)
+
+5k boids, 128 block size, 100 scene scale
+
+![](images/50kboid.gif)
+
+50k boids, 128 block size, 100 scene scale
+
+![](images/500kboid.gif)
+
+500k boids, 128 block size, 200 scene scale
+
+### Performance Analysis
+
+Performance analysis was done by comparing average FPS of each simulation. The following graphs show sim FPS as a result of varying boid count, with a constant block size of 128 and scene scale of 100, with data points at 5000, 10000, 50000, 100000, and 500000 boids for each scheme.
+
+<img src="/images/fpsvsboidnumnv.jpg" alt="image" width="50%" height="auto">
+
+<img src="/images/fpsvsboidnumv.jpg" alt="image" width="50%" height="auto">
+
+Increasing the boid count decreased the FPS no matter the parllelization scheme, which is as expected. This is as an increased number of particles to track and update increases computation and hence reduces the frame rate. The coherent gridding performed the best, with uniform gridding and finally the naive parallelization following. This is also expected, as each method provides improvements upon the next.
+
+The FPS compared to varying block sizes was also analyzed, with the same boid number of 50000, no visualization, and 100 scene scale. Data points were gathered at block sizes of 16, 32, 64, 128, 256, 512, and 1024.
+
+<img src="/images/fpsvsblocksizenv.jpg" alt="image" width="50%" height="auto">
+
+The results were once again as expected, with performance not gaining much after block size reaches 128. The three parallelization schemes showed varying performance levels in the expected order. Block size not improving performance can be explained as the warp size is constant at 32, limiting the degree of parallelization(the number of threads that can run at the same time), no matter the overall block size. At lower values at and around 32, the performance decrease can be attributed to not fully using parallelization and instead serialization operations.
+
+### Questions
+
+* For each implementation, how does changing the number of boids affect performance? Why do you think this is?
+
+No matter the implementation, as the number of boids increased, the performance decreased. This makes sense as increasing the number of particles increases the computation needed to be done in updating velocities and checking neighbors, hence increasing the runtime of each loop and thus decreasing the frame rate. This computation eventually becomes the bottleneck and directly results in the decreased frame rate.
+
+* For each implementation, how does changing the block count and block size affect performance? Why do you think this is?
+
+Increasing the block size and count(keeping the total thread count constant) increases the performance of the simulation with diminishing returns. Increasing block size up to about 128 threads per block does drastically improve performance, likely due to fully utilizing the parallelization of the GPU. At lower block sizes, core cycles are wasted as there are not enough threads in the block to schedule at the same time or cover memory read latency. Once beyond 128 though, as the total number of threads performing computation stays the same and since we are fully utilizing the GPU cores and scheduling, and so not much performance is gained.
+
+* For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not?
+
+Yes, there was a visible performance improvement with the coherent unifrom grid compared to the uniform one. This is expected and due to the reduction of random memory accesses and increasing contiguous access, which is faster with regards to the GPU shared memory. This is relevant as all the data arrays live in GPU memory as they are arrays, and so contiguous memory access helps to reduce these memory read times. While the coherent uniform grid has an additional computation step in reordering/reshuffling the position and velocity arrays to match the sorted ones, this runtime is likely covered by the memory read improvements, resulting in overall faster sim performance.
+
+* Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not? Be careful: it is insufficient (and possibly incorrect) to say that 27-cell is slower simply because there are more cells to check!
+
+Halving the cell width(to the same as the neighborhood distance) and checking 27 neighboring cells as a result actually improves performance compared to the original simulation. Although we do have to check more cells, each cell has half the side length and so 1/8 the area of a previous cell. Thus, the ratio of volume for the 27 cell check vs the 8 cell check is (27/8)/8 = 27/64, which means that these 27 cells constitute a lower volume than the previous 8 cells. On average, there are thus less boids in these checked cells(assuming a uniform boid distribution) and so the velocity update likely performs fewer loops, resulting in better performance.
diff --git a/images/500kboid.gif b/images/500kboid.gif
new file mode 100644
index 0000000..dd183fc
Binary files /dev/null and b/images/500kboid.gif differ
diff --git a/images/50kboid.gif b/images/50kboid.gif
new file mode 100644
index 0000000..a6e37ba
Binary files /dev/null and b/images/50kboid.gif differ
diff --git a/images/5kboid.gif b/images/5kboid.gif
new file mode 100644
index 0000000..41fb8a8
Binary files /dev/null and b/images/5kboid.gif differ
diff --git a/images/fpsvsblocksizenv.jpg b/images/fpsvsblocksizenv.jpg
new file mode 100644
index 0000000..2ceb436
Binary files /dev/null and b/images/fpsvsblocksizenv.jpg differ
diff --git a/images/fpsvsboidnumnv.jpg b/images/fpsvsboidnumnv.jpg
new file mode 100644
index 0000000..b6df81d
Binary files /dev/null and b/images/fpsvsboidnumnv.jpg differ
diff --git a/images/fpsvsboidnumv.jpg b/images/fpsvsboidnumv.jpg
new file mode 100644
index 0000000..8d7919a
Binary files /dev/null and b/images/fpsvsboidnumv.jpg differ
diff --git a/src/kernel.cu b/src/kernel.cu
index 74dffcb..852ab98 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -85,6 +85,7 @@ int *dev_gridCellEndIndices;   // to this cell?
 
 // TODO-2.3 - consider what additional buffers you might need to reshuffle
 // the position and velocity data to be coherent within cells.
+glm::vec3 *dev_tmp_buffer;
 
 // LOOK-2.1 - Grid parameters based on simulation parameters.
 // These are automatically computed for you in Boids::initSimulation
@@ -169,6 +170,24 @@ void Boids::initSimulation(int N) {
   gridMinimum.z -= halfGridWidth;
 
   // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
+  cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+  cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+  cudaMalloc((void**)&dev_tmp_buffer, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_tmp_buffer failed!");
+
+  dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+  dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+
   cudaDeviceSynchronize();
 }
 
@@ -233,7 +252,36 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po
   // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
   // Rule 2: boids try to stay a distance d away from each other
   // Rule 3: boids try to match the speed of surrounding boids
-  return glm::vec3(0.0f, 0.0f, 0.0f);
+    const glm::vec3 self_pos = pos[iSelf];
+    glm::vec3 com = glm::vec3(0), repel = glm::vec3(0), nearby_vel = glm::vec3(0);
+    int r1_nn = 0, r3_nn = 0;
+    for (int i = 0; i < N; i++) {
+        const glm::vec3 i_pos = pos[i];
+        const float dist = glm::distance(self_pos, i_pos);
+        if (i == iSelf) continue;
+        if (dist < rule1Distance) {
+            com += i_pos;
+            r1_nn++;
+        }
+        if (dist < rule2Distance) {
+            repel -= (i_pos - self_pos);
+        }
+        if (dist < rule3Distance) {
+            nearby_vel += vel[i];
+            r3_nn++;
+        }
+    }
+    glm::vec3 final_vel = glm::vec3(0);
+    if (r1_nn > 0) {
+        com /= r1_nn;
+        final_vel += ((com - self_pos) * rule1Scale);
+    }
+    if (r3_nn > 0) {
+        nearby_vel /= r3_nn;
+        final_vel += (nearby_vel * rule3Scale);
+    }
+    final_vel += (repel * rule2Scale);
+    return final_vel;
 }
 
 /**
@@ -242,9 +290,18 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po
 */
 __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
   glm::vec3 *vel1, glm::vec3 *vel2) {
-  // Compute a new velocity based on pos and vel1
-  // Clamp the speed
-  // Record the new velocity into vel2. Question: why NOT vel1?
+    int index = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (index >= N) {
+        return;
+    }
+    // Compute a new velocity based on pos and vel1
+    glm::vec3 new_vel = vel1[index] + computeVelocityChange(N, index, pos, vel1);
+    // Clamp the speed
+    if (glm::length(new_vel) > maxSpeed) {
+        new_vel = maxSpeed * glm::normalize(new_vel);
+    }
+    // Record the new velocity into vel2. Question: why NOT vel1?
+    vel2[index] = new_vel;
 }
 
 /**
@@ -286,9 +343,17 @@ __global__ void kernComputeIndices(int N, int gridResolution,
   glm::vec3 gridMin, float inverseCellWidth,
   glm::vec3 *pos, int *indices, int *gridIndices) {
     // TODO-2.1
+    int index = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (index >= N) {
+        return;
+    }
     // - Label each boid with the index of its grid cell.
+    glm::vec3 b_pos = pos[index];
+    glm::ivec3 grid_index = floor((b_pos - gridMin) * inverseCellWidth);
+    gridIndices[index] = gridIndex3Dto1D(grid_index.x, grid_index.y, grid_index.z, gridResolution);
     // - Set up a parallel array of integer indices as pointers to the actual
     //   boid data in pos and vel1/vel2
+    indices[index] = index;
 }
 
 // LOOK-2.1 Consider how this could be useful for indicating that a cell
@@ -306,6 +371,18 @@ __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
   // Identify the start point of each cell in the gridIndices array.
   // This is basically a parallel unrolling of a loop that goes
   // "this index doesn't match the one before it, must be a new cell!"
+    int index = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (index >= N) {
+        return;
+    }
+    if (index == N - 1) gridCellEndIndices[particleGridIndices[index]] = index;
+    else {
+        if (index == 0) gridCellStartIndices[particleGridIndices[index]] = index;
+        if (particleGridIndices[index] != particleGridIndices[index + 1]) {
+            gridCellEndIndices[particleGridIndices[index]] = index;
+            gridCellStartIndices[particleGridIndices[index + 1]] = index + 1;
+        }
+    }
 }
 
 __global__ void kernUpdateVelNeighborSearchScattered(
@@ -314,14 +391,69 @@ __global__ void kernUpdateVelNeighborSearchScattered(
   int *gridCellStartIndices, int *gridCellEndIndices,
   int *particleArrayIndices,
   glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
+    int index = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (index >= N) {
+        return;
+    }
   // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce
   // the number of boids that need to be checked.
   // - Identify the grid cell that this particle is in
   // - Identify which cells may contain neighbors. This isn't always 8.
+    const glm::vec3 b_pos = pos[index];
+    const float neighborhood_dist = imax(imax(rule1Distance, rule2Distance), rule3Distance);
+    const glm::ivec3 grid_lb = floor((b_pos - glm::vec3(neighborhood_dist) - gridMin) * inverseCellWidth);
+    const glm::ivec3 grid_ub = floor((b_pos + glm::vec3(neighborhood_dist) - gridMin) * inverseCellWidth);
   // - For each cell, read the start/end indices in the boid pointer array.
   // - Access each boid in the cell and compute velocity change from
   //   the boids rules, if this boid is within the neighborhood distance.
   // - Clamp the speed change before putting the new speed in vel2
+    glm::vec3 com = glm::vec3(0), repel = glm::vec3(0), nearby_vel = glm::vec3(0);
+    int r1_nn = 0, r3_nn = 0;
+    //oob grid check
+    //inverse loop for mem locality
+    for (int z = imax(0, grid_lb.z); z <= imin(grid_ub.z, gridResolution-1); z++) {
+        for (int y = imax(0, grid_lb.y); y <= imin(grid_ub.y, gridResolution-1); y++) {
+            for (int x = imax(0, grid_lb.x); x <= imin(grid_ub.x, gridResolution-1); x++) {
+                const int grid_index = gridIndex3Dto1D(x, y, z, gridResolution);
+                //loop through boids in cell
+                //empty cell check
+                if (gridCellStartIndices[grid_index] == -1) continue;
+                for (int cell_index = gridCellStartIndices[grid_index]; cell_index <= gridCellEndIndices[grid_index]; cell_index++) {
+                    int boid_index = particleArrayIndices[cell_index];
+                    if (boid_index == index) continue;
+                    const glm::vec3 new_b_pos = pos[boid_index];
+                    const float dist = glm::distance(b_pos, new_b_pos);
+                    if (dist > neighborhood_dist) continue;
+                    if (dist < rule1Distance) {
+                        com += new_b_pos;
+                        r1_nn++;
+                    }
+                    if (dist < rule2Distance) {
+                        repel -= (new_b_pos - b_pos);
+                    }
+                    if (dist < rule3Distance) {
+                        nearby_vel += vel1[boid_index];
+                        r3_nn++;
+                    }
+                }
+            }
+        }
+    }
+    glm::vec3 accel = glm::vec3(0);
+    if (r1_nn > 0) {
+        com /= r1_nn;
+        accel += ((com - b_pos) * rule1Scale);
+    }
+    if (r3_nn > 0) {
+        nearby_vel /= r3_nn;
+        accel += (nearby_vel * rule3Scale);
+    }
+    accel += (repel * rule2Scale);
+    glm::vec3 final_vel = vel1[index] + accel;
+    if (glm::length(final_vel) > maxSpeed) {
+        final_vel = maxSpeed * glm::normalize(final_vel);
+    }
+    vel2[index] = final_vel;
 }
 
 __global__ void kernUpdateVelNeighborSearchCoherent(
@@ -329,18 +461,72 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
   float inverseCellWidth, float cellWidth,
   int *gridCellStartIndices, int *gridCellEndIndices,
   glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
+    int index = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (index >= N) {
+        return;
+    }
   // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered,
   // except with one less level of indirection.
   // This should expect gridCellStartIndices and gridCellEndIndices to refer
   // directly to pos and vel1.
   // - Identify the grid cell that this particle is in
   // - Identify which cells may contain neighbors. This isn't always 8.
+    const glm::vec3 b_pos = pos[index];
+    const float neighborhood_dist = imax(imax(rule1Distance, rule2Distance), rule3Distance);
+    const glm::ivec3 grid_lb = floor((b_pos - glm::vec3(neighborhood_dist) - gridMin) * inverseCellWidth);
+    const glm::ivec3 grid_ub = floor((b_pos + glm::vec3(neighborhood_dist) - gridMin) * inverseCellWidth);
   // - For each cell, read the start/end indices in the boid pointer array.
   //   DIFFERENCE: For best results, consider what order the cells should be
   //   checked in to maximize the memory benefits of reordering the boids data.
   // - Access each boid in the cell and compute velocity change from
   //   the boids rules, if this boid is within the neighborhood distance.
   // - Clamp the speed change before putting the new speed in vel2
+    glm::vec3 com = glm::vec3(0), repel = glm::vec3(0), nearby_vel = glm::vec3(0);
+    int r1_nn = 0, r3_nn = 0;
+    //oob grid check
+    //inverse loop for mem locality
+    for (int z = imax(0, grid_lb.z); z <= imin(grid_ub.z, gridResolution - 1); z++) {
+        for (int y = imax(0, grid_lb.y); y <= imin(grid_ub.y, gridResolution - 1); y++) {
+            for (int x = imax(0, grid_lb.x); x <= imin(grid_ub.x, gridResolution - 1); x++) {
+                const int grid_index = gridIndex3Dto1D(x, y, z, gridResolution);
+                //loop through boids in cell
+                //empty cell check
+                if (gridCellStartIndices[grid_index] == -1) continue;
+                for (int cell_index = gridCellStartIndices[grid_index]; cell_index <= gridCellEndIndices[grid_index]; cell_index++) {
+                    const glm::vec3 new_b_pos = pos[cell_index];
+                    const float dist = glm::distance(b_pos, new_b_pos);
+                    //cant check by boid index anymore, check if this boid is self through dist == 0(exact pos)
+                    if (dist == 0 || dist > neighborhood_dist) continue;
+                    if (dist < rule1Distance) {
+                        com += new_b_pos;
+                        r1_nn++;
+                    }
+                    if (dist < rule2Distance) {
+                        repel -= (new_b_pos - b_pos);
+                    }
+                    if (dist < rule3Distance) {
+                        nearby_vel += vel1[cell_index];
+                        r3_nn++;
+                    }
+                }
+            }
+        }
+    }
+    glm::vec3 accel = glm::vec3(0);
+    if (r1_nn > 0) {
+        com /= r1_nn;
+        accel += ((com - b_pos) * rule1Scale);
+    }
+    if (r3_nn > 0) {
+        nearby_vel /= r3_nn;
+        accel += (nearby_vel * rule3Scale);
+    }
+    accel += (repel * rule2Scale);
+    glm::vec3 final_vel = vel1[index] + accel;
+    if (glm::length(final_vel) > maxSpeed) {
+        final_vel = maxSpeed * glm::normalize(final_vel);
+    }
+    vel2[index] = final_vel;
 }
 
 /**
@@ -348,40 +534,97 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
 */
 void Boids::stepSimulationNaive(float dt) {
   // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
+    dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+    kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2);
+    kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel1);
   // TODO-1.2 ping-pong the velocity buffers
+    std::swap(dev_vel1, dev_vel2);
 }
 
 void Boids::stepSimulationScatteredGrid(float dt) {
+    //for ops with size = numObjects
+    dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+    //for ops with size = gridCellCount
+    dim3 blocks_per_grid_grid((gridCellCount + blockSize - 1) / blockSize);
   // TODO-2.1
   // Uniform Grid Neighbor search using Thrust sort.
   // In Parallel:
   // - label each particle with its array index as well as its grid index.
   //   Use 2x width grids.
+    kernResetIntBuffer << <blocks_per_grid_grid, blockSize >> > (gridCellCount, dev_gridCellStartIndices, -1);
+    kernResetIntBuffer << <blocks_per_grid_grid, blockSize >> > (gridCellCount, dev_gridCellEndIndices, -1);
+    
+    kernComputeIndices << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+  
   // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
   //   are welcome to do a performance comparison.
+    thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
   // - Naively unroll the loop for finding the start and end indices of each
   //   cell's data pointers in the array of boid indices
+    kernIdentifyCellStartEnd << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+    
   // - Perform velocity updates using neighbor search
+    kernUpdateVelNeighborSearchScattered << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices,
+        dev_gridCellEndIndices, dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2);
+
   // - Update positions
-  // - Ping-pong buffers as needed
+    kernUpdatePos << <fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_pos, dev_vel1);
+
+   // - Ping-pong buffers as needed
+    std::swap(dev_vel1, dev_vel2);
+}
+
+__global__ void kernReorderBuffer(int N, int* order, glm::vec3* src, glm::vec3* tgt) {
+    int index = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (index < N) {
+        tgt[index] = src[order[index]];
+    }
 }
 
 void Boids::stepSimulationCoherentGrid(float dt) {
   // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
   // Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
-  // In Parallel:
-  // - Label each particle with its array index as well as its grid index.
-  //   Use 2x width grids
-  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
-  //   are welcome to do a performance comparison.
-  // - Naively unroll the loop for finding the start and end indices of each
-  //   cell's data pointers in the array of boid indices
-  // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
-  //   the particle data in the simulation array.
-  //   CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
   // - Perform velocity updates using neighbor search
   // - Update positions
   // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+
+    //for ops with size = numObjects
+    dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+    //for ops with size = gridCellCount
+    dim3 blocks_per_grid_grid((gridCellCount + blockSize - 1) / blockSize);
+
+    kernResetIntBuffer << <blocks_per_grid_grid, blockSize >> > (gridCellCount, dev_gridCellStartIndices, -1);
+    kernResetIntBuffer << <blocks_per_grid_grid, blockSize >> > (gridCellCount, dev_gridCellEndIndices, -1);
+
+    kernComputeIndices << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+
+    // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
+    //   are welcome to do a performance comparison.
+    thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
+    // - Naively unroll the loop for finding the start and end indices of each
+    //   cell's data pointers in the array of boid indices
+    kernIdentifyCellStartEnd << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+
+    // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
+    //   the particle data in the simulation array.
+    //   CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
+    // dev_particleArrayIndices contains order
+    kernReorderBuffer << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleArrayIndices, dev_pos, dev_tmp_buffer);
+    std::swap(dev_pos, dev_tmp_buffer);
+    kernReorderBuffer << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleArrayIndices, dev_vel1, dev_tmp_buffer);
+    std::swap(dev_vel1, dev_tmp_buffer);
+
+    // - Perform velocity updates using neighbor search
+    kernUpdateVelNeighborSearchCoherent << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices,
+        dev_gridCellEndIndices, dev_pos, dev_vel1, dev_vel2);
+
+    // - Update positions
+    kernUpdatePos << <fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_pos, dev_vel1);
+
+    // - Ping-pong buffers as needed
+    std::swap(dev_vel1, dev_vel2);
 }
 
 void Boids::endSimulation() {
@@ -390,6 +633,10 @@ void Boids::endSimulation() {
   cudaFree(dev_pos);
 
   // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+  cudaFree(dev_particleArrayIndices);
+  cudaFree(dev_particleGridIndices);
+  cudaFree(dev_gridCellStartIndices);
+  cudaFree(dev_gridCellEndIndices);
 }
 
 void Boids::unitTest() {
diff --git a/src/main.cpp b/src/main.cpp
index b82c8c6..fafcc7b 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -14,11 +14,11 @@
 
 // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
 #define VISUALIZE 1
-#define UNIFORM_GRID 0
-#define COHERENT_GRID 0
+#define UNIFORM_GRID 1
+#define COHERENT_GRID 1
 
 // LOOK-1.2 - change this to adjust particle count in the simulation
-const int N_FOR_VIS = 5000;
+const int N_FOR_VIS = 50000;
 const float DT = 0.2f;
 
 /**