diff --git a/README.md b/README.md index 0e38ddb1..12bb4fdb 100644 --- a/README.md +++ b/README.md @@ -3,12 +3,175 @@ CUDA Stream Compaction **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* Christina Qiu + * [LinkedIn](https://www.linkedin.com/in/christina-qiu-6094301b6/), [personal website](https://christinaqiu3.github.io/), [twitter](), etc. +* Tested on: Windows 11, Intel Core i7-13700H @ 2.40GHz, 16GB RAM, NVIDIA GeForce RTX 4060 Laptop GPU (Personal laptop) -### (TODO: Your README) +## Overview + +This is an implementation of scan (prefix sum) and stream compaction algorithms on both the CPU and GPU using CUDA. + +This project includes four parts: + +### Part 1: CPU Scan & Stream Compaction +* Implements a basic exclusive prefix sum (scan) on the CPU. +* Two stream compaction implementations: + * Without scan: simple loop that filters non-zero values. + * With scan and scatter: mimics the parallel approach by mapping, scanning, and scattering. +* Used for correctness testing and performance comparison against GPU implementations. +* Runtime O(n) + +### Part 2: Naive GPU Scan Algorithm +* Implements a naive parallel scan on the GPU using CUDA. +* Iteratively applies scan logic for each depth level (d) in multiple kernel launches. +* Not work-efficient and not in-place. +* Demonstrates basic GPU memory handling and parallel loop structure. +* Runtime O(log n) kernel launches, total work O(n log n) + * At each iteration, a full kernel with n threads is launched + +### Part 3: Work-Efficient GPU Scan & Stream Compaction +* Implements the Blelloch (work-efficient) scan algorithm using the upsweep and downsweep phases. +* Handles non-power-of-two input sizes by padding to the next power of two. +* Adds GPU stream compaction using: + * Map step (0/1 flags for zero vs. non-zero), + * Scan on flags, + * Scatter to final output. +* Much faster and scalable compared to the naive implementation. +* Runtime O(log n) kernel launches, total work O(n) + * This is because each kernel does fewer threads of work as d increases. + * At step (d = 0), launched (threads = n/2), per thread (work = constant) + * At step (d = 1), launched (threads = n/4), per thread (work = constant) + * At step (d = logn-1), launched (threads = 1), per thread (work = constant) + * Thus the total work across all kernels: (n/2) + (n/4) + (n/8) + ... + 1 = O(n) + +### Part 4: Using Thrust's Implementation +* Leverages Thrust, a high-performance parallel algorithms library built on CUDA. +* Implements scan using thrust::exclusive_scan. +* Simplifies GPU programming and enables performance benchmarking against custom implementations. + +### Output of Scan & Stream Compaction Tests +(Array Size = 2^24) +``` +**************** +** SCAN TESTS ** +**************** + [ 7 32 41 34 34 45 20 39 29 27 7 38 38 ... 10 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 28.6697ms (std::chrono Measured) + [ 0 7 39 80 114 148 193 213 252 281 308 315 353 ... 410882306 410882316 ] +==== cpu scan, non-power-of-two ==== + elapsed time: 30.8242ms (std::chrono Measured) + [ 0 7 39 80 114 148 193 213 252 281 308 315 353 ... 410882253 410882290 ] + passed +==== naive scan, power-of-two ==== + elapsed time: 14.7507ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 14.6412ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 6.67194ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 6.49674ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 1.34758ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 1.26874ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 0 2 2 1 2 2 3 0 0 3 3 3 3 ... 1 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 38.4214ms (std::chrono Measured) + [ 2 2 1 2 2 3 3 3 3 3 2 2 2 ... 2 1 ] + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 38.1998ms (std::chrono Measured) + [ 2 2 1 2 2 3 3 3 3 3 2 2 2 ... 3 2 ] + passed +==== cpu compact with scan ==== + elapsed time: 88.2723ms (std::chrono Measured) + [ 2 2 1 2 2 3 3 3 3 3 2 2 2 ... 2 1 ] + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 11.8516ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 11.6095ms (CUDA Measured) + passed +``` + +## Runtime and Performance Analysis + +(note: I tested Array Sizes up to 2^24 because 2^25 is where computation breaks down on my laptop) + +1. Hypothesis: The GPU implementations of scan (Naive, Work-Efficient, and Thrust) will outperform the serial CPU scan as the array size increases. The Thrust library implementation will be the fastest on large data sizes due to its highly optimized CUDA backend. The naive GPU scan will be slower than the work-efficient implementation due to redundant work and synchronization overhead. + +### Size of Array v. Runtime of Scan (Power of 2) Graph + +blockSize = 128 + +![Data](img/graph1_t.png) +![Graph](img/graph1_v.png) + +### Size of Array v. Runtime of Scan (Non Power of 2) Graph + +blockSize = 128 + +![Data](img/graph2_t.png) +![Graph](img/graph2_v.png) + +Conclusion: + +CPU scan runtime increases exponentially as the array size increases, due to larger data sizes needing to access memory beyond the L1/L2 caches. + +For smaller inputs, Naive GPU scan can actually outperform the more complex approaches because it uses fewer threads and has less overhead. However, as the input size grows, the naive approach becomes inefficient due to poor memory access patterns. Specifically, threads must access data locations increasingly farther apart, resulting in uncoalesced memory transactions that hurt bandwidth and performance. Additionally, many threads become idle in later stages, reducing hardware utilization. + +Work Efficient GPU scan performs two kernels per iteration (upsweep and downsweep). Despite more frequent kernel invocations, this method excels for larger inputs because it minimizes idle threads and accesses memory in a way that favors coalescing. The upsweep and downsweep phases ensure that each element is processed only a logarithmic number of times with well-structured memory access, improving throughput and scaling much better than the naive approach. As a result, this approach ultimately surpasses both the naive GPU scan and the serial CPU scan in speed for large array sizes. + +## + +2. Hypothesis: The CPU implementation of stream compaction is expected to become significantly slower as the array size increases due to its sequential nature and growing cache/memory pressure. On the other hand, the work-efficient GPU implementation should demonstrate much better scalability, especially on larger arrays, due to parallel processing and improved memory access patterns. + +### Size of Array v. Runtime of Stream Compaction (Power of 2) Graph + +blockSize = 128 + +![Data](img/graph3_t.png) +![Graph](img/graph3_v.png) + +### Size of Array v. Runtime of Stream Compaction (Non Power of 2) Graph + +blockSize = 128 + +![Data](img/graph4_t.png) +![Graph](img/graph4_v.png) + +Conclusion: As expected, the CPU implementation’s runtime increases rapidly with array size (particularly beyond 2^21). In contrast, the work-efficient GPU implementation shows much better scaling, with runtimes increasing much more slowly as array size grows. For smaller sizes (e.g. 2^18 - 2^20), the CPU is actually slightly faster, likely due to lower kernel launch overhead. But this quickly flips as data size increases. This shows why GPU acceleration is critical for real-time or high-volume applications involving stream compaction. + +## + +2. Hypothesis: The choice of block size in a GPU kernel greatly affects performance due to how it maps to the hardware's available threads and memory resources. Smaller block sizes may lead to underutilized GPU cores, while overly large blocks can cause increased register and shared memory pressure, reducing occupancy. We expect optimal performance at moderate block sizes, such as 128 or 256, which strike a balance between these factors. + +### Blocksize v. Runtime of Scan (Power of 2) Graph + +Array Size = 2^20 + +![Data](img/graph5_t.png) +![Graph](img/graph5_v.png) + +### Blocksize v. Runtime of Scan (Non Power of 2) Graph + +Array Size = 2^20 + +![Data](img/graph6_t.png) +![Graph](img/graph6_v.png) + +Conclusion: The optimal block size for scan is mid-range (128). Runtimes degrade with both very small and very large blocks, due to either insufficient parallelism or limited warp scheduling capacity. Non-power-of-two inputs introduce negligible overhead, indicating that the scan logic (including padding and bounds checking) is both correct and efficient. -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) diff --git a/img/graph1_t.png b/img/graph1_t.png new file mode 100644 index 00000000..4f55c0a2 Binary files /dev/null and b/img/graph1_t.png differ diff --git a/img/graph1_v.png b/img/graph1_v.png new file mode 100644 index 00000000..86f8b31b Binary files /dev/null and b/img/graph1_v.png differ diff --git a/img/graph2_t.png b/img/graph2_t.png new file mode 100644 index 00000000..25401f40 Binary files /dev/null and b/img/graph2_t.png differ diff --git a/img/graph2_v.png b/img/graph2_v.png new file mode 100644 index 00000000..42b1d32f Binary files /dev/null and b/img/graph2_v.png differ diff --git a/img/graph3_t.png b/img/graph3_t.png new file mode 100644 index 00000000..ea363a75 Binary files /dev/null and b/img/graph3_t.png differ diff --git a/img/graph3_v.png b/img/graph3_v.png new file mode 100644 index 00000000..ea56a307 Binary files /dev/null and b/img/graph3_v.png differ diff --git a/img/graph4_t.png b/img/graph4_t.png new file mode 100644 index 00000000..5645216c Binary files /dev/null and b/img/graph4_t.png differ diff --git a/img/graph4_v.png b/img/graph4_v.png new file mode 100644 index 00000000..a48e5549 Binary files /dev/null and b/img/graph4_v.png differ diff --git a/img/graph5_t.png b/img/graph5_t.png new file mode 100644 index 00000000..e75a4f1b Binary files /dev/null and b/img/graph5_t.png differ diff --git a/img/graph5_v.png b/img/graph5_v.png new file mode 100644 index 00000000..40e4cc7d Binary files /dev/null and b/img/graph5_v.png differ diff --git a/img/graph6_t.png b/img/graph6_t.png new file mode 100644 index 00000000..6fdb1fa6 Binary files /dev/null and b/img/graph6_t.png differ diff --git a/img/graph6_v.png b/img/graph6_v.png new file mode 100644 index 00000000..eadd1700 Binary files /dev/null and b/img/graph6_v.png differ diff --git a/src/main.cpp b/src/main.cpp index 3d5c8820..19655349 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -13,7 +13,7 @@ #include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array +const int SIZE = 1 << 20; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two int *a = new int[SIZE]; int *b = new int[SIZE]; diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 2ed6d630..c6726262 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -24,6 +24,11 @@ namespace StreamCompaction { */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { // TODO + int i = threadIdx.x + blockIdx.x * blockDim.x; + if (i >= n) return; + if (i < n) { + bools[i] = (idata[i] != 0) ? 1 : 0; + } } /** @@ -33,6 +38,11 @@ namespace StreamCompaction { __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { // TODO + int i = threadIdx.x + blockIdx.x * blockDim.x; + if (i >= n) return; + if (i < n && bools[i]) { + odata[indices[i]] = idata[i]; + } } } diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 719fa115..8826c58c 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -19,7 +19,10 @@ namespace StreamCompaction { */ void scan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO + odata[0] = 0; + for (int i = 1; i < n; i++) { + odata[i] = idata[i - 1] + odata[i - 1]; + } timer().endCpuTimer(); } @@ -30,9 +33,15 @@ namespace StreamCompaction { */ int compactWithoutScan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO + int count = 0; + for (int i = 0; i < n; i++) { + if (idata[i] != 0) { + odata[count] = idata[i]; + count++; + } + } timer().endCpuTimer(); - return -1; + return count; } /** @@ -41,10 +50,35 @@ namespace StreamCompaction { * @returns the number of elements remaining after compaction. */ int compactWithScan(int n, int *odata, const int *idata) { + + int* map = new int[n]; + int* scanArr = new int[n]; timer().startCpuTimer(); - // TODO + + // MAP + for (int i = 0; i < n; i++) { + map[i] = (idata[i] != 0) ? 1 : 0; + } + + // SCAN + scanArr[0] = 0; + for (int i = 1; i < n; i++) { + scanArr[i] = scanArr[i - 1] + map[i - 1]; + } + + // SCATTER + int count = 0; + for (int i = 0; i < n; i++) { + if (map[i] != 0 && scanArr[i] < n) { + odata[scanArr[i]] = idata[i]; + count++; + } + } + delete[] map; + delete[] scanArr; + timer().endCpuTimer(); - return -1; + return count; } } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 2db346ee..491dcd9d 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -3,6 +3,10 @@ #include "common.h" #include "efficient.h" +#define checkCUDAErrorWithLine(msg) checkCUDAError(msg, __LINE__) + +#define BLOCK_SIZE 512 + namespace StreamCompaction { namespace Efficient { using StreamCompaction::Common::PerformanceTimer; @@ -12,13 +16,128 @@ namespace StreamCompaction { return timer; } + __global__ void kernUpsweep(int n, int d, int* data) { + int index = threadIdx.x + blockIdx.x * blockDim.x; + int stride = 1 << (d + 1); + int idx = index * stride + (stride - 1); + + if (idx < n) { + int left = idx - (1 << d); + data[idx] += data[left]; + } + } + + __global__ void kernDownsweep(int n, int d, int* data) { + int index = threadIdx.x + blockIdx.x * blockDim.x; + int stride = 1 << (d + 1); + int idx = index * stride + (stride - 1); + + if (idx < n) { + int left = idx - (1 << d); + int t = data[left]; + data[left] = data[idx]; + data[idx] += t; + } + } + + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ + // This can be done in place + // there won't be a case where one thread writes to and another thread reads from the same location in the array. + // Test non-power-of-two-sized arrays. + // your intermediate array sizes will need to be rounded to the next power of two. void scan(int n, int *odata, const int *idata) { + int* dev_data; + int paddedN = 1 << ilog2ceil(n); // power of two + size_t byteSize = paddedN * sizeof(int); + + cudaMalloc((void**)&dev_data, byteSize); + + // Copy input to dev_data, pad with 0 if needed for non-power-of-two + cudaMemset(dev_data, 0, byteSize); + cudaMemcpy(dev_data, idata, n * sizeof(int), cudaMemcpyHostToDevice); + + int blockSize = 128; timer().startGpuTimer(); - // TODO + + // === UPSWEEP === + for (int d = 0; (1 << (d + 1)) <= paddedN; ++d) { + int numThreads = paddedN / (1 << (d + 1)); + int numBlocks = (numThreads + blockSize - 1) / blockSize; + + if (numThreads > 0) { + kernUpsweep << > > (paddedN, d, dev_data); + cudaDeviceSynchronize(); + } + } + + // Set last element to 0 for downsweep + cudaMemset(dev_data + paddedN - 1, 0, sizeof(int)); + cudaDeviceSynchronize(); + + // === DOWNSWEEP === + for (int d = ilog2ceil(paddedN) - 1; d >= 0; --d) { + int numThreads = paddedN / (1 << (d + 1)); + int numBlocks = (numThreads + blockSize - 1) / blockSize; + + if (numThreads > 0) { + kernDownsweep << > > (paddedN, d, dev_data); + cudaDeviceSynchronize(); + } + } + timer().endGpuTimer(); + + // Copy result back + cudaMemcpy(odata, dev_data, n * sizeof(int), cudaMemcpyDeviceToHost); + + cudaFree(dev_data); + } + + // copy to remove timer issues + void scanCopy(int n, int* odata, const int* idata) { + int* dev_data; + int paddedN = 1 << ilog2ceil(n); // power of two + size_t byteSize = paddedN * sizeof(int); + + cudaMalloc((void**)&dev_data, byteSize); + + // Copy input to dev_data, pad with 0 if needed for non-power-of-two + cudaMemset(dev_data, 0, byteSize); + cudaMemcpy(dev_data, idata, n * sizeof(int), cudaMemcpyHostToDevice); + + int blockSize = 128; + //timer().startGpuTimer(); + + // === UPSWEEP === + for (int d = 0; (1 << (d + 1)) <= paddedN; ++d) { + int numThreads = paddedN / (1 << (d + 1)); + int numBlocks = (numThreads + blockSize - 1) / blockSize; + + if (numThreads > 0) + kernUpsweep << > > (paddedN, d, dev_data); + } + + // Set last element to 0 for downsweep + cudaMemset(dev_data + paddedN - 1, 0, sizeof(int)); + + // === DOWNSWEEP === + for (int d = ilog2ceil(paddedN) - 1; d >= 0; --d) { + int numThreads = paddedN / (1 << (d + 1)); + int numBlocks = (numThreads + blockSize - 1) / blockSize; + + if (numThreads > 0) + kernDownsweep << > > (paddedN, d, dev_data); + } + + //timer().endGpuTimer(); + + // Copy result back + cudaMemcpy(odata, dev_data, n * sizeof(int), cudaMemcpyDeviceToHost); + + cudaFree(dev_data); } /** @@ -29,12 +148,67 @@ namespace StreamCompaction { * @param odata The array into which to store elements. * @param idata The array of elements to compact. * @returns The number of elements remaining after compaction. - */ + */ int compact(int n, int *odata, const int *idata) { + int* dev_idata; + int* dev_bools; + int* dev_indices; + int* dev_odata; + + int paddedN = 1 << ilog2ceil(n); // power of two + + size_t byteSize = n * sizeof(int); + + cudaMalloc((void**)&dev_idata, byteSize); + cudaMalloc((void**)&dev_bools, byteSize); + cudaMalloc((void**)&dev_indices, byteSize); + cudaMalloc((void**)&dev_odata, byteSize); + checkCUDAErrorWithLine("cudaMalloc failed!"); + + cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice); + + int blockSize = 128; + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + timer().startGpuTimer(); - // TODO + + // === Step 1: Map to Boolean === + StreamCompaction::Common::kernMapToBoolean << > > (n, dev_bools, dev_idata); + checkCUDAErrorWithLine("After kernMapToBoolean"); + cudaDeviceSynchronize(); + + // === Step 2: Scan on bools to get indices === + scanCopy(n, dev_indices, dev_bools); + checkCUDAErrorWithLine("After scan"); + cudaDeviceSynchronize(); + + // === Step 3: Scatter === + StreamCompaction::Common::kernScatter << > > (n, dev_odata, dev_idata, dev_bools, dev_indices); + checkCUDAErrorWithLine("After scatter"); + cudaDeviceSynchronize(); + timer().endGpuTimer(); - return -1; + + + // Get total number of non-zero elements + int count = 0; + cudaMemcpy(&count, dev_indices + (n - 1), sizeof(int), cudaMemcpyDeviceToHost); + + int lastBool = 0; + cudaMemcpy(&lastBool, dev_bools + (n - 1), sizeof(int), cudaMemcpyDeviceToHost); + + int total = count + lastBool; + + // Copy compacted result back to host + cudaMemcpy(odata, dev_odata, total * sizeof(int), cudaMemcpyDeviceToHost); + + // Cleanup + cudaFree(dev_idata); + cudaFree(dev_bools); + cudaFree(dev_indices); + cudaFree(dev_odata); + + return total; } } } diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 43088769..0950cc6e 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -3,6 +3,8 @@ #include "common.h" #include "naive.h" +#define checkCUDAErrorWithLine(msg) checkCUDAError(msg, __LINE__) + namespace StreamCompaction { namespace Naive { using StreamCompaction::Common::PerformanceTimer; @@ -11,15 +13,62 @@ namespace StreamCompaction { static PerformanceTimer timer; return timer; } + // TODO: __global__ + __global__ void kernNaiveScan(int n, int offset, int* odata, int* idata) { + if (n <= 0) return; + + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) return; + + if (index >= (1 << offset)) { + odata[index] = idata[index - (1 << offset)] + idata[index]; + } else { + odata[index] = idata[index]; + } + } /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ + // ilog2ceil(n) separate kernel invocations + // can't generally operate on an array in-place on the GPU + // create two device arrays. Swap them at each iteration: + // read from A and write to B, read from B and write to A, and so on. + // Be sure to test non - power - of - two - sized arrays. void scan(int n, int *odata, const int *idata) { + int* dev_in; + int* dev_out; + int byteSize = n * sizeof(int); + + cudaMalloc((void**)&dev_in, byteSize); + checkCUDAErrorWithLine("cudaMalloc dev_in failed!"); + cudaMalloc((void**)&dev_out, byteSize); + checkCUDAErrorWithLine("cudaMalloc dev_out failed!"); + + cudaMemcpy(dev_in + 1, idata, (n - 1) * sizeof(int), cudaMemcpyHostToDevice); // shift input right by one + cudaMemset(dev_in, 0, sizeof(int)); // setting first elem to 0 for exclusive + + int numThreads = 64; + int numBlocks = (n + numThreads - 1) / numThreads; + + int logn = ilog2ceil(n); + timer().startGpuTimer(); - // TODO + for (int i = 0; i < logn; ++i) { + kernNaiveScan<<>>(n, i, dev_out, dev_in); + + std::swap(dev_in, dev_out); + } + timer().endGpuTimer(); + + + // After logn iterations, the final result is in dev_in (after last swap) + cudaMemcpy(odata, dev_in, byteSize, cudaMemcpyDeviceToHost); + + cudaFree(dev_in); + cudaFree(dev_out); } } } diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index 1def45e7..fb3f0c5a 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -18,11 +18,19 @@ namespace StreamCompaction { * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); // TODO use `thrust::exclusive_scan` // example: for device_vectors dv_in and dv_out: // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + thrust::device_vector dv_in(idata, idata + n); // implicitly copied to GPU + thrust::device_vector dv_out(n); + + // Time only the scan + timer().startGpuTimer(); + thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); timer().endGpuTimer(); + + // Copy back the result to odata + thrust::copy(dv_out.begin(), dv_out.end(), odata); } } }