diff --git a/README.md b/README.md
index d63a6a1..b154ec5 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,53 @@
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture,
 Project 1 - Flocking**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Helena Zhang
+* Tested on: Windows 11, i7-10750 @ 2.6GHz 16GB, GeForce RTX 2060 6GB
 
-### (TODO: Your README)
+### Demo
+![](images/few.gif)
 
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+Coherent Grid, 10000 boids, 128 Block Size, 100 Scene Size, Single Cell Width
+
+
+![](images/many.gif)
+
+Coherent Grid, 500000 boids, 128 Block Size, 200 Scene Size, Single Cell Width
+
+
+### Performance Analysis
+The three implementations differ drastically in performance, as shown in the graphs below. All FPS data was collected from runs with **128 Block Size, 100 Scene Size, Single Cell Width**.
+
+
+![](images/Implementation_Vis.jpg)
+![](images/Implementation.jpg)
+
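+Some patterns are easier to see when the FPS is plotted against a log-scaled X axis, such as the steep drop in the naive implementation's frame rate, whose per-frame work grows quadratically with the boid count because every boid checks every other boid.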
+
+
+![](images/Implementation_VisLog.jpg)
+![](images/ImplementationLog.jpg)
+
+There is an unexpected decrease in performance at 10K boids for all implementations. If an explanation or fix is found, this section will be updated accordingly.
+
+Block size (the number of threads per block) also affects performance: small block sizes perform worse, especially below 32 threads per block. Performance peaks at 64 threads per block and gradually worsens beyond that. This data was collected from runs with **500k boids, 200 Scene Size, Single Cell Width**.
+
+
+![](images/BlockSize.jpg)
+
+
+### Performance Questions
+* For each implementation, how does changing the number of boids affect performance? Why do you think this is?
+   * All three implementations ran slower as the number of boids increased. Each boid has more candidates to check, either every other boid in the scene (naive) or the boids in its neighboring grid cells (uniform and coherent grids). More boids also mean more warps to schedule and more data to read from global memory, which slows the program down.
+
+
+* For each implementation, how does changing the block count and block size affect performance? Why do you think this is?
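+   * Block size noticeably affected performance mainly when it was below 32; the block count follows from the block size, since enough blocks are launched to cover every boid. 32 is the warp size in CUDA, so a block with fewer than 32 threads leaves part of its warp idle, and with fewer active threads there is less work available to hide the latency of slower operations such as reads from global memory.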
+
+
+* For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not?
+   * There is a significant performance improvement going from the uniform grid to the coherent grid. This is the expected outcome: the coherent grid eliminates the extra global memory read that looks up a neighboring boid's index before its position and velocity can be accessed, and boids in the same cell become contiguous in memory.
+
+
+* Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not? Be careful: it is insufficient (and possibly incorrect) to say that 27-cell is slower simply because there are more cells to check!
+   * Checking 27 smaller cells is more efficient than checking 8 larger cells. A large cell with double the side length has 2 * 2 * 2 = 8 times the volume of a small cell, so checking 8 large cells covers the same volume as 8 * 8 = 64 small cells, compared with only 27 small cells in the single-width search. The single-width search therefore covers less space and tests fewer boids that are out of range.
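Review note on the 27-cell vs 8-cell answer above: the volume argument can be checked with a few lines of arithmetic. The sketch below is plain host-side C++ and is not code from this project. `searchVolume` and `expectedCandidates` are hypothetical helpers, and the uniform boid density is an assumption made only for illustration.

```cpp
#include <cassert>

// Volume scanned by a neighbor search that visits `cellCount` cubic cells
// whose side length is `cellWidth`.
double searchVolume(int cellCount, double cellWidth) {
    assert(cellCount >= 0 && cellWidth > 0.0);
    return cellCount * cellWidth * cellWidth * cellWidth;
}

// Expected number of boids inside a scanned volume, ASSUMING the boids are
// spread uniformly with `density` boids per unit volume (illustrative only;
// real flocks cluster, so actual counts will differ).
double expectedCandidates(double volume, double density) {
    return volume * density;
}
```

With the max rule distance taken as 1 unit, `searchVolume(27, 1.0)` is 27 cubic units, while `searchVolume(8, 2.0)` is 64 cubic units. Under the uniform-density assumption the 8-cell search tests roughly 64 / 27, or about 2.4 times, as many candidate boids.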
diff --git a/images/BlockSize.jpg b/images/BlockSize.jpg
new file mode 100644
index 0000000..7392a71
Binary files /dev/null and b/images/BlockSize.jpg differ
diff --git a/images/Implementation.jpg b/images/Implementation.jpg
new file mode 100644
index 0000000..549c1c9
Binary files /dev/null and b/images/Implementation.jpg differ
diff --git a/images/ImplementationLog.jpg b/images/ImplementationLog.jpg
new file mode 100644
index 0000000..9bc0a3b
Binary files /dev/null and b/images/ImplementationLog.jpg differ
diff --git a/images/Implementation_Vis.jpg b/images/Implementation_Vis.jpg
new file mode 100644
index 0000000..5b16c55
Binary files /dev/null and b/images/Implementation_Vis.jpg differ
diff --git a/images/Implementation_VisLog.jpg b/images/Implementation_VisLog.jpg
new file mode 100644
index 0000000..1ed08a0
Binary files /dev/null and b/images/Implementation_VisLog.jpg differ
diff --git a/images/few.gif b/images/few.gif
new file mode 100644
index 0000000..c0873fb
Binary files /dev/null and b/images/few.gif differ
diff --git a/images/many.gif b/images/many.gif
new file mode 100644
index 0000000..10c4a16
Binary files /dev/null and b/images/many.gif differ
diff --git a/src/kernel.cu b/src/kernel.cu
index 74dffcb..4cfac85 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -52,7 +52,7 @@ void checkCUDAError(const char *msg, int line = -1) {
 #define maxSpeed 1.0f
 
 /*! Size of the starting area in simulation space. */
-#define scene_scale 100.0f
+#define scene_scale 200.0f
 
 /***********************************************
 * Kernel state (pointers are device pointers) *
@@ -86,6 +86,8 @@ int *dev_gridCellEndIndices;   // to this cell?
 // TODO-2.3 - consider what additional buffers you might need to reshuffle
 // the position and velocity data to be coherent within cells.
 
+glm::vec3* dev_pos_sorted; // positions reshuffled into grid-cell order (coherent grid)
+
 // LOOK-2.1 - Grid parameters based on simulation parameters.
 // These are automatically computed for you in Boids::initSimulation
 int gridCellCount;
@@ -157,7 +159,7 @@ void Boids::initSimulation(int N) {
   checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!");
 
   // LOOK-2.1 computing grid params
-  gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
+  gridCellWidth = std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
   int halfSideCount = (int)(scene_scale / gridCellWidth) + 1;
   gridSideCount = 2 * halfSideCount;
 
@@ -169,6 +171,25 @@ void Boids::initSimulation(int N) {
   gridMinimum.z -= halfGridWidth;
 
   // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
+
+  cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+  cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+  dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+  dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+
+  cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+  cudaMalloc((void**)&dev_pos_sorted, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_pos_sorted failed!");
+
   cudaDeviceSynchronize();
 }
 
@@ -233,7 +254,61 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po
   // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
   // Rule 2: boids try to stay a distance d away from each other
   // Rule 3: boids try to match the speed of surrounding boids
-  return glm::vec3(0.0f, 0.0f, 0.0f);
+	glm::vec3 selfPos = pos[iSelf];
+	glm::vec3 selfVel = vel[iSelf];
+
+	// Info for rule 1: perceived center and # neighbors
+	glm::vec3 joinCenter(0.0f);
+	int n1 = 0;
+
+	// Info for rule 2: separation center
+	glm::vec3 sepCenter(0.0f);
+
+	// Info for rule 3: neighboring velocity
+	glm::vec3 nVelocity(0.0f);
+	int n3 = 0;
+
+	for (int i = 0; i < N; i++) {
+		if (i == iSelf) continue;
+		// valid neighbor info
+		glm::vec3 iPos = pos[i];
+		glm::vec3 iVel = vel[i];
+
+		float dist = glm::distance(selfPos, iPos);
+
+		// Rule 1: flock to neighbors:
+		if (dist <= rule1Distance) {
+			joinCenter += iPos;
+			n1++;
+		}
+
+		// Rule 2: separation
+		if (dist <= rule2Distance) {
+			sepCenter -= iPos - selfPos;
+		}
+
+		// Rule 3: match velocity
+		if (dist <= rule3Distance) {
+			nVelocity += iVel;
+			n3++;
+		}
+	}
+
+	glm::vec3 newVel(0.0f);
+
+	if (n1 > 0) {
+		joinCenter /= n1;
+		newVel += rule1Scale * (joinCenter - selfPos);
+	}
+
+	newVel += rule2Scale * sepCenter;
+
+	if (n3 > 0) {
+		nVelocity /= n3;
+		newVel += rule3Scale * nVelocity;
+	}
+
+	return newVel;
 }
 
 /**
@@ -242,9 +317,19 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po
 */
 __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
   glm::vec3 *vel1, glm::vec3 *vel2) {
-  // Compute a new velocity based on pos and vel1
-  // Clamp the speed
-  // Record the new velocity into vel2. Question: why NOT vel1?
+	int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+	if (index >= N) {
+		return;
+	}
+
+	glm::vec3 newVel = vel1[index] + computeVelocityChange(N, index, pos, vel1);
+
+	if (glm::length(newVel) > maxSpeed) { // vec3::length() is the component count (3), not the magnitude
+		newVel = maxSpeed * glm::normalize(newVel);
+	}
+
+	vel2[index] = newVel;
 }
 
 /**
@@ -289,12 +374,21 @@ __global__ void kernComputeIndices(int N, int gridResolution,
     // - Label each boid with the index of its grid cell.
     // - Set up a parallel array of integer indices as pointers to the actual
     //   boid data in pos and vel1/vel2
+
+    const int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+    if (index < N) {
+        // find grid cell
+        const glm::ivec3 gridPos = floor((pos[index] - gridMin) * inverseCellWidth);
+        indices[index] = index;
+        gridIndices[index] = gridIndex3Dto1D(gridPos.x, gridPos.y, gridPos.z, gridResolution);
+    }
 }
 
 // LOOK-2.1 Consider how this could be useful for indicating that a cell
 //          does not enclose any boids
 __global__ void kernResetIntBuffer(int N, int *intBuffer, int value) {
-  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  const int index = (blockIdx.x * blockDim.x) + threadIdx.x;
   if (index < N) {
     intBuffer[index] = value;
   }
@@ -306,6 +400,26 @@ __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
   // Identify the start point of each cell in the gridIndices array.
   // This is basically a parallel unrolling of a loop that goes
   // "this index doesn't match the one before it, must be a new cell!"
+
+    const int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+    if (index < N) {
+        const int gridNum = particleGridIndices[index];
+        if (index == 0) {
+            gridCellStartIndices[gridNum] = index;
+        }
+
+        if (index == N - 1) {
+            gridCellEndIndices[gridNum] = index;
+        }
+        else {
+            const int nextGridNum = particleGridIndices[index + 1];
+            if (gridNum != nextGridNum) {
+                gridCellEndIndices[gridNum] = index;
+                gridCellStartIndices[nextGridNum] = index + 1;
+            }
+        }
+    }
 }
 
 __global__ void kernUpdateVelNeighborSearchScattered(
@@ -314,14 +428,104 @@ __global__ void kernUpdateVelNeighborSearchScattered(
   int *gridCellStartIndices, int *gridCellEndIndices,
   int *particleArrayIndices,
   glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
-  // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce
-  // the number of boids that need to be checked.
-  // - Identify the grid cell that this particle is in
-  // - Identify which cells may contain neighbors. This isn't always 8.
-  // - For each cell, read the start/end indices in the boid pointer array.
-  // - Access each boid in the cell and compute velocity change from
-  //   the boids rules, if this boid is within the neighborhood distance.
-  // - Clamp the speed change before putting the new speed in vel2
+	// TODO-2.1 - Update a boid's velocity using the uniform grid to reduce
+	  // the number of boids that need to be checked.
+	  // - Identify the grid cell that this particle is in
+	const int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+	if (index < N) {
+		const glm::vec3 selfPos = pos[index];
+		const glm::vec3 selfVel = vel1[index];
+
+		// Info for rule 1: perceived center and # neighbors
+		glm::vec3 joinCenter(0.0f);
+		int n1 = 0;
+
+		// Info for rule 2: separation center
+		glm::vec3 sepCenter(0.0f);
+
+		// Info for rule 3: neighboring velocity
+		glm::vec3 nVelocity(0.0f);
+		int n3 = 0;
+
+		// - Identify which cells may contain neighbors. This isn't always 8.
+		const glm::ivec3 minGridPos = floor((selfPos - glm::vec3(cellWidth) - gridMin) * inverseCellWidth);
+		const glm::ivec3 maxGridPos = floor((selfPos + glm::vec3(cellWidth) - gridMin) * inverseCellWidth);
+
+		for (int z = glm::max(0, minGridPos.z); z <= glm::min(gridResolution - 1, maxGridPos.z); z++) {
+			for (int y = glm::max(0, minGridPos.y); y <= glm::min(gridResolution - 1, maxGridPos.y); y++) {
+				for (int x = glm::max(0, minGridPos.x); x <= glm::min(gridResolution - 1, maxGridPos.x); x++) {
+					const int cell = gridIndex3Dto1D(x, y, z, gridResolution);
+					// - For each cell, read the start/end indices in the boid pointer array.
+					const int start = gridCellStartIndices[cell];
+					if (start == -1) continue;
+					const int end = gridCellEndIndices[cell];
+					for (int i = start; i <= end; i++) {
+						// - Access each boid in the cell and compute velocity change from
+						//   the boids rules, if this boid is within the neighborhood distance.
+						const int boid = particleArrayIndices[i];
+
+						if (boid == index) continue;
+						// valid neighbor info
+						const glm::vec3 iPos = pos[boid];
+						const glm::vec3 iVel = vel1[boid];
+
+						const float dist = glm::distance(selfPos, iPos);
+
+						// Rule 1: flock to neighbors:
+						if (dist <= rule1Distance) {
+							joinCenter += iPos;
+							n1++;
+						}
+
+						// Rule 2: separation
+						if (dist <= rule2Distance) {
+							sepCenter -= iPos - selfPos;
+						}
+
+						// Rule 3: match velocity
+						if (dist <= rule3Distance) {
+							nVelocity += iVel;
+							n3++;
+						}
+					}
+				}
+			}
+		}
+
+		glm::vec3 newVel = selfVel;
+
+		if (n1 > 0) {
+			joinCenter /= n1;
+			newVel += rule1Scale * (joinCenter - selfPos);
+		}
+
+		newVel += rule2Scale * sepCenter;
+
+		if (n3 > 0) {
+			nVelocity /= n3;
+			newVel += rule3Scale * nVelocity;
+		}
+
+		// - Clamp the speed change before putting the new speed in vel2
+
+		if (glm::length(newVel) > maxSpeed) { // vec3::length() is the component count (3), not the magnitude
+			newVel = maxSpeed * glm::normalize(newVel);
+		}
+
+		vel2[index] = newVel;
+	}
+}
+
+__global__ void kernShuffle(
+	int N, int* ordering, glm::vec3* unsorted_set, glm::vec3* sorted_set) {
+	// use the mapping of grid index to boid index from ordering;
+	// place content from unsorted_set to sorted_set
+	const int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+	if (index < N) {
+		sorted_set[index] = unsorted_set[ordering[index]];
+	}
 }
 
 __global__ void kernUpdateVelNeighborSearchCoherent(
@@ -329,59 +533,172 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
   float inverseCellWidth, float cellWidth,
   int *gridCellStartIndices, int *gridCellEndIndices,
   glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
-  // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered,
-  // except with one less level of indirection.
-  // This should expect gridCellStartIndices and gridCellEndIndices to refer
-  // directly to pos and vel1.
-  // - Identify the grid cell that this particle is in
-  // - Identify which cells may contain neighbors. This isn't always 8.
-  // - For each cell, read the start/end indices in the boid pointer array.
-  //   DIFFERENCE: For best results, consider what order the cells should be
-  //   checked in to maximize the memory benefits of reordering the boids data.
-  // - Access each boid in the cell and compute velocity change from
-  //   the boids rules, if this boid is within the neighborhood distance.
-  // - Clamp the speed change before putting the new speed in vel2
+	const int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+	if (index < N) {
+		const glm::vec3 selfPos = pos[index];
+		const glm::vec3 selfVel = vel1[index];
+
+		// Info for rule 1: perceived center and # neighbors
+		glm::vec3 joinCenter(0.0f);
+		int n1 = 0;
+
+		// Info for rule 2: separation center
+		glm::vec3 sepCenter(0.0f);
+
+		// Info for rule 3: neighboring velocity
+		glm::vec3 nVelocity(0.0f);
+		int n3 = 0;
+
+		// - Identify which cells may contain neighbors. This isn't always 8.
+		const glm::ivec3 minGridPos = floor((selfPos - glm::vec3(cellWidth) - gridMin) * inverseCellWidth);
+		const glm::ivec3 maxGridPos = floor((selfPos + glm::vec3(cellWidth) - gridMin) * inverseCellWidth);
+		for (int z = glm::max(0, minGridPos.z); z <= glm::min(gridResolution - 1, maxGridPos.z); z++) {
+			for (int y = glm::max(0, minGridPos.y); y <= glm::min(gridResolution - 1, maxGridPos.y); y++) {
+				for (int x = glm::max(0, minGridPos.x); x <= glm::min(gridResolution - 1, maxGridPos.x); x++) {
+					const int cell = gridIndex3Dto1D(x, y, z, gridResolution);
+					const int start = gridCellStartIndices[cell];
+					if (start == -1) continue;
+					const int end = gridCellEndIndices[cell];
+					for (int i = start; i <= end; i++) {
+						if (i == index) continue;
+						// valid neighbor info
+						const glm::vec3 iPos = pos[i];
+						const glm::vec3 iVel = vel1[i];
+
+						const float dist = glm::distance(selfPos, iPos);
+
+						// Rule 1: flock to neighbors:
+						if (dist <= rule1Distance) {
+							joinCenter += iPos;
+							n1++;
+						}
+
+						// Rule 2: separation
+						if (dist <= rule2Distance) {
+							sepCenter -= iPos - selfPos;
+						}
+
+						// Rule 3: match velocity
+						if (dist <= rule3Distance) {
+							nVelocity += iVel;
+							n3++;
+						}
+					}
+				}
+			}
+		}
+
+		glm::vec3 newVel = selfVel;
+
+		if (n1 > 0) {
+			joinCenter /= n1;
+			newVel += rule1Scale * (joinCenter - selfPos);
+		}
+
+		newVel += rule2Scale * sepCenter;
+
+		if (n3 > 0) {
+			nVelocity /= n3;
+			newVel += rule3Scale * nVelocity;
+		}
+
+		if (glm::length(newVel) > maxSpeed) { // vec3::length() is the component count (3), not the magnitude
+			newVel = maxSpeed * glm::normalize(newVel);
+		}
+
+		vel2[index] = newVel;
+	}
+	// TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered,
+	// except with one less level of indirection.
+	// This should expect gridCellStartIndices and gridCellEndIndices to refer
+	// directly to pos and vel1.
+	// - Identify the grid cell that this particle is in
+	// - Identify which cells may contain neighbors. This isn't always 8.
+	// - For each cell, read the start/end indices in the boid pointer array.
+	//   DIFFERENCE: For best results, consider what order the cells should be
+	//   checked in to maximize the memory benefits of reordering the boids data.
+	// - Access each boid in the cell and compute velocity change from
+	//   the boids rules, if this boid is within the neighborhood distance.
+	// - Clamp the speed change before putting the new speed in vel2
 }
 
 /**
 * Step the entire N-body simulation by `dt` seconds.
 */
 void Boids::stepSimulationNaive(float dt) {
-  // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
-  // TODO-1.2 ping-pong the velocity buffers
+	dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+
+	// TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
+	kernUpdateVelocityBruteForce << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_pos, dev_vel1, dev_vel2);
+	kernUpdatePos << <fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_pos, dev_vel2);
+
+	// TODO-1.2 ping-pong the velocity buffers
+	std::swap(dev_vel1, dev_vel2);
 }
 
 void Boids::stepSimulationScatteredGrid(float dt) {
-  // TODO-2.1
-  // Uniform Grid Neighbor search using Thrust sort.
-  // In Parallel:
-  // - label each particle with its array index as well as its grid index.
-  //   Use 2x width grids.
-  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
-  //   are welcome to do a performance comparison.
-  // - Naively unroll the loop for finding the start and end indices of each
-  //   cell's data pointers in the array of boid indices
-  // - Perform velocity updates using neighbor search
-  // - Update positions
-  // - Ping-pong buffers as needed
+	dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+
+	// TODO-2.1
+	// Uniform Grid Neighbor search using Thrust sort.
+	// In Parallel:
+	// - label each particle with its array index as well as its grid index.
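+	//   Cell width is 1x the max rule distance here (see initSimulation).
+	kernResetIntBuffer << <dim3((gridCellCount + blockSize - 1) / blockSize), blockSize >> > (gridCellCount, dev_gridCellStartIndices, -1); // one thread per cell, not per boid
+	kernResetIntBuffer << <dim3((gridCellCount + blockSize - 1) / blockSize), blockSize >> > (gridCellCount, dev_gridCellEndIndices, -1);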
+	kernComputeIndices << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+
+	// - Unstable key sort using Thrust. A stable sort isn't necessary, but you
+	//   are welcome to do a performance comparison.
+
+	thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
+	// - Naively unroll the loop for finding the start and end indices of each
+	//   cell's data pointers in the array of boid indices
+	kernIdentifyCellStartEnd << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+
+	// - Perform velocity updates using neighbor search
+	kernUpdateVelNeighborSearchScattered << < fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2);
+
+	// - Update positions
+	kernUpdatePos << <fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_pos, dev_vel2);
+
+	// - Ping-pong buffers as needed
+	std::swap(dev_vel1, dev_vel2);
 }
 
 void Boids::stepSimulationCoherentGrid(float dt) {
-  // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
-  // Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
-  // In Parallel:
-  // - Label each particle with its array index as well as its grid index.
-  //   Use 2x width grids
-  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
-  //   are welcome to do a performance comparison.
-  // - Naively unroll the loop for finding the start and end indices of each
-  //   cell's data pointers in the array of boid indices
-  // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
-  //   the particle data in the simulation array.
-  //   CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
-  // - Perform velocity updates using neighbor search
-  // - Update positions
-  // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+	const dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+	// TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
+	// Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
+	// In Parallel:
+	// - Label each particle with its array index as well as its grid index.
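+	//   Cell width is 1x the max rule distance here (see initSimulation).
+	kernResetIntBuffer << <dim3((gridCellCount + blockSize - 1) / blockSize), blockSize >> > (gridCellCount, dev_gridCellStartIndices, -1); // buffers hold gridCellCount entries, not numObjects
+	kernResetIntBuffer << <dim3((gridCellCount + blockSize - 1) / blockSize), blockSize >> > (gridCellCount, dev_gridCellEndIndices, -1);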
+	kernComputeIndices << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+
+	// - Unstable key sort using Thrust. A stable sort isn't necessary, but you
+	//   are welcome to do a performance comparison.
+	thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
+	// - Naively unroll the loop for finding the start and end indices of each
+	//   cell's data pointers in the array of boid indices
+	kernIdentifyCellStartEnd << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+
+	// - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
+	//   the particle data in the simulation array.
+	//   CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
+	kernShuffle << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleArrayIndices, dev_pos, dev_pos_sorted);
+	kernShuffle << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleArrayIndices, dev_vel1, dev_vel2);
+	// 
+	// - Perform velocity updates using neighbor search
+	kernUpdateVelNeighborSearchCoherent << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_pos_sorted, dev_vel2, dev_vel1);
+	// - Update positions
+	kernUpdatePos << <fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_pos_sorted, dev_vel1);
+	// - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+	std::swap(dev_pos, dev_pos_sorted);
 }
 
 void Boids::endSimulation() {
@@ -390,6 +707,12 @@ void Boids::endSimulation() {
   cudaFree(dev_pos);
 
   // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+
+  cudaFree(dev_gridCellEndIndices);
+  cudaFree(dev_gridCellStartIndices);
+  cudaFree(dev_particleArrayIndices);
+  cudaFree(dev_particleGridIndices);
+  cudaFree(dev_pos_sorted);
 }
 
 void Boids::unitTest() {
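Review note on `kernIdentifyCellStartEnd` in the diff above: a sequential reference version can help when checking the kernel's output on small inputs. This is only a sketch. `CellRanges` and `identifyCellStartEnd` are hypothetical names, and the function assumes the cell indices are already sorted in ascending order, as they would be after `thrust::sort_by_key`. Empty cells keep -1, the same sentinel the diff writes with `kernResetIntBuffer`.

```cpp
#include <cassert>
#include <vector>

// Inclusive [start, end] range of sorted-array positions for each grid cell.
// A value of -1 marks a cell that contains no boids.
struct CellRanges {
    std::vector<int> start;
    std::vector<int> end;
};

// sortedCellIds[i] is the grid cell of the i-th boid after sorting by cell.
CellRanges identifyCellStartEnd(const std::vector<int>& sortedCellIds,
                                int cellCount) {
    CellRanges ranges{std::vector<int>(cellCount, -1),
                      std::vector<int>(cellCount, -1)};
    const int n = static_cast<int>(sortedCellIds.size());
    for (int i = 0; i < n; ++i) {
        const int cell = sortedCellIds[i];
        assert(cell >= 0 && cell < cellCount);
        // A cell starts where the previous entry belongs to a different cell.
        if (i == 0 || sortedCellIds[i - 1] != cell) {
            ranges.start[cell] = i;
        }
        // A cell ends where the next entry belongs to a different cell.
        if (i == n - 1 || sortedCellIds[i + 1] != cell) {
            ranges.end[cell] = i;
        }
    }
    return ranges;
}
```

Each loop iteration only inspects its own entry and its two neighbors, so the same logic maps to one GPU thread per boid. Copying the device buffers back to the host and comparing them against this function is one way to test the kernel.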
diff --git a/src/main.cpp b/src/main.cpp
index b82c8c6..3bed56a 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -14,12 +14,14 @@
 
 // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
 #define VISUALIZE 1
-#define UNIFORM_GRID 0
-#define COHERENT_GRID 0
+#define UNIFORM_GRID 1
+#define COHERENT_GRID 1
+
+#define FPS_TESTING 1
 
 // LOOK-1.2 - change this to adjust particle count in the simulation
-const int N_FOR_VIS = 5000;
-const float DT = 0.2f;
+const int N_FOR_VIS = 500000;
+const float DT = 0.5f;
 
 /**
 * C main function.
@@ -217,6 +219,11 @@ void initShaders(GLuint * program) {
     double timebase = 0;
     int frame = 0;
 
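+    double testing_timebase = 0; // FPS test: reference time in glfwGetTime() seconds
+    double avg_fps = 0;          // FPS test: average over the 10-second measurement window
+    bool avg_found = false;      // FPS test: set once the average has been printed
+    int accum_frame = 0;         // FPS test: frames counted after the 5-second warm-up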
+
     Boids::unitTest(); // LOOK-1.2 We run some basic example code to make sure
                        // your CUDA development setup is ready to go.
 
@@ -224,6 +231,7 @@ void initShaders(GLuint * program) {
       glfwPollEvents();
 
       frame++;
+      
       double time = glfwGetTime();
 
       if (time - timebase > 1.0) {
@@ -232,6 +240,14 @@ void initShaders(GLuint * program) {
         frame = 0;
       }
 
+      if (time - testing_timebase > 5.0) {
+          accum_frame++;
+          if (time - testing_timebase > 15.0 && !avg_found) {
+              avg_found = true;
+              avg_fps = accum_frame / 10.000;
+              std::cout << "!!! Average FPS over 10 sec: " << avg_fps << std::endl;
+          }
+      }
       runCUDA();
 
       std::ostringstream ss;
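Review note on the FPS measurement added to the main loop above: the same warm-up-then-average logic can be isolated so that it is testable without a window or a GPU. The sketch below is an assumption about how that might look, not code from this project. `FpsAverager` is a hypothetical class, and times are passed in explicitly instead of being read from `glfwGetTime()`.

```cpp
#include <cassert>

// Counts frames after a warm-up period and reports the average frame rate
// once the measurement window has elapsed. Times are seconds since start.
class FpsAverager {
public:
    FpsAverager(double warmupSeconds, double windowSeconds)
        : warmup_(warmupSeconds), window_(windowSeconds) {
        assert(warmupSeconds >= 0.0 && windowSeconds > 0.0);
    }

    // Call once per frame. Returns true exactly once: on the first frame
    // whose timestamp lies past the end of the measurement window.
    bool onFrame(double now) {
        if (done_ || now <= warmup_) {
            return false;  // still warming up, or already reported
        }
        ++frames_;
        if (now > warmup_ + window_) {
            average_ = frames_ / window_;
            done_ = true;
            return true;
        }
        return false;
    }

    double average() const { return average_; }

private:
    double warmup_;
    double window_;
    double average_ = 0.0;
    int frames_ = 0;
    bool done_ = false;
};
```

As in the diff, the frame that crosses the end of the window is included in the count, so the reported average is slightly high at very low frame rates. With `FpsAverager(5.0, 10.0)` the warm-up and window lengths match the 5 and 15 second thresholds used above.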
diff --git a/src/main.hpp b/src/main.hpp
index 88e9df7..da0fe7e 100644
--- a/src/main.hpp
+++ b/src/main.hpp
@@ -47,7 +47,7 @@ double lastX;
 double lastY;
 float theta = 1.22f;
 float phi = -0.70f;
-float zoom = 4.0f;
+float zoom = 6.0f;
 glm::vec3 lookAt = glm::vec3(0.0f, 0.0f, 0.0f);
 glm::vec3 cameraPosition;