CIS565-Fall-2022 · Li-Jia-Jun · Sep 19, 2022 · Sep 19, 2022 · Sep 21, 2022 · Sep 21, 2022
diff --git a/README.md b/README.md
@@ -3,12 +3,141 @@ CUDA Stream Compaction
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+Jiajun Li
 
-### (TODO: Your README)
+Linkedin: [link](https://www.linkedin.com/in/jiajun-li-5063a4217/)
+
+Tested on: Windows 10, i7-12700 @ 2.10GHz, 32GB, RTX3080 12GB
+
+CUDA Compute Capability: 8.6
+
+## **Overview**
+
+In this project, different scan methods and some of scan applications are implemented:
+
+CPU side:
+  1. CPU navie scan
+  2. CPU compaction using CPU navie scan
+  3. CPU navie radix sort using CPU navie scan
+
+GPU side:
+  1. GPU navie scan
+  2. GPU efficient scan with threads reduction
+  3. GPU efficient stream compaction using GPU efficient scan
+
+Full explanations of each method can be found in [GPU Gem 3 Ch 39](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda).
+
+The project also includes thrust::scan as a benchmark in performance analysis.
+
+## **Project Setup**
+
+This project included the following changes from the original project:
+
+1. Add ```radix_sort.h``` and ```radix_sort.cu``` to ```stream_compaction/CMakeLists```
+
+2. Add radix sort test code in ```src\main.cpp```
+
+## **Output example**
+
+```
+****************
+** SCAN TESTS **
+****************
+    [  45  11  41  13  34  22   1  22   6   5   3   5  21 ...   9   0 ]
+==== cpu scan, power-of-two ====
+  elapsed time: 0.0519ms    (std::chrono Measured)
+    [   0  45  56  97 110 144 166 167 189 195 200 203 208 ... 801487 801496 ]
+==== cpu scan, non-power-of-two ====
+  elapsed time: 0.0519ms    (std::chrono Measured)
+    [   0  45  56  97 110 144 166 167 189 195 200 203 208 ... 801379 801414 ]
+    passed
+==== naive scan, power-of-two ====
+  elapsed time: 0.453632ms    (CUDA Measured)
+    passed
+==== naive scan, non-power-of-two ====
+  elapsed time: 0.403456ms    (CUDA Measured)
+    passed
+==== work-efficient scan, power-of-two ====
+  elapsed time: 0.31744ms    (CUDA Measured)
+    passed
+==== work-efficient scan, non-power-of-two ====
+  elapsed time: 0.08192ms    (CUDA Measured)
+    passed
+==== thrust scan, power-of-two ====
+  elapsed time: 0.044032ms    (CUDA Measured)
+    passed
+==== thrust scan, non-power-of-two ====
+  elapsed time: 0.045056ms    (CUDA Measured)
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [   1   0   3   0   3   0   2   3   1   3   3   1   1 ...   0   0 ]
+==== cpu compact without scan, power-of-two ====
+  elapsed time: 0.0657ms    (std::chrono Measured)
+    [   1   3   3   2   3   1   3   3   1   1   2   3   1 ...   3   1 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+  elapsed time: 0.0648ms    (std::chrono Measured)
+    [   1   3   3   2   3   1   3   3   1   1   2   3   1 ...   3   3 ]
+    passed
+==== cpu compact with scan ====
+  elapsed time: 0.1507ms    (std::chrono Measured)
+    [   1   3   3   2   3   1   3   3   1   1   2   3   1 ...   3   1 ]
+    passed
+==== work-efficient compact, power-of-two ====
+  elapsed time: 0.17408ms    (CUDA Measured)
+    passed
+==== work-efficient compact, non-power-of-two ====
+  elapsed time: 0.171008ms    (CUDA Measured)
+    passed
+
+*****************************
+** RADIX SORT TESTS **
+*****************************
+==== cpu radix sort, power-of-two ====
+  elapsed time: 0.4766ms    (std::chrono Measured)
+    [   0   0   0   0   0   0   0   0   0   0   0   0   0 ...  49  49 ]
+==== cpu radix sort, non-power-of-two ====
+  elapsed time: 0.4721ms    (std::chrono Measured)
+    [   0   0   0   0   0   0   0   0   0   0   0   0   0 ...  49  49 ]
+```
+
+
+## **Performance Analysis**
+
+In all the following analysis, less time is better.
+
+### **Scan**
+
+![](img/Scan.png)
+
+* Work efficient scan out performs cpu scan when number of elements is greater than 2^16.
+
+* Work efficient scan roughly align with thrust scan when number of elements is greater than 2^18.
+
+* Navie GPU scan is always slower than navie CPU scan. This is because GPU method accesses data trhough global memory, which is considerably costy.
+
+
+### **Stream Compaction**
+
+![](img/StreamCompaction.png)
+
+* Using scan will make it slower in the CPU implementation because scan introduces more iterations over array.
+
+* Work efficient scan starts to out perform cpu scan when number of elements is greater than 2^18.
+
+* For work efficient scan, it performs slightly better when the number of elements is not power of two.
+
+### **Radix Sort**
+
+![](img/RadixSort.png)
+
+### **Future Improvement**
+
+1. Implement parallel radix sort and compare it with navie radix sort.
+
+2. Make GPU scans even more efficient by using share memory.
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
 
diff --git a/img/RadixSort.png b/img/RadixSort.png
diff --git a/img/Scan.png b/img/Scan.png
diff --git a/img/StreamCompaction.png b/img/StreamCompaction.png
diff --git a/src/main.cpp b/src/main.cpp
@@ -11,9 +11,10 @@
 #include <stream_compaction/naive.h>
 #include <stream_compaction/efficient.h>
 #include <stream_compaction/thrust.h>
+#include <stream_compaction/radix_sort.h>
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 20; // feel free to change the size of array
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
 int *a = new int[SIZE];
 int *b = new int[SIZE];
@@ -28,6 +29,7 @@ int main(int argc, char* argv[]) {
     printf("****************\n");
 
     genArray(SIZE - 1, a, 50);  // Leave a 0 at the end to test that edge case
+
     a[SIZE - 1] = 0;
     printArray(SIZE, a, true);
 
@@ -67,18 +69,32 @@ int main(int argc, char* argv[]) {
     //printArray(SIZE, c, true);
     printCmpResult(NPOT, b, c);
 
+    zeroArray(SIZE, c);
+    printDesc("naive scan with shared memory, power-of-two");
+    StreamCompaction::Naive::scan2(SIZE, c, a);
+    printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+    //printArray(SIZE, c, false);
+    printCmpResult(SIZE, b, c);
+
+    zeroArray(SIZE, c);
+    printDesc("naive scan with shared memory, non-power-of-two");
+    StreamCompaction::Naive::scan2(NPOT, c, a);
+    printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+    //printArray(SIZE, c, true);
+    printCmpResult(NPOT, b, c);
+
     zeroArray(SIZE, c);
     printDesc("work-efficient scan, power-of-two");
     StreamCompaction::Efficient::scan(SIZE, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    //printArray(SIZE, c, false);
     printCmpResult(SIZE, b, c);
 
     zeroArray(SIZE, c);
     printDesc("work-efficient scan, non-power-of-two");
     StreamCompaction::Efficient::scan(NPOT, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(NPOT, c, true);
+    //printArray(NPOT, c, false);
     printCmpResult(NPOT, b, c);
 
     zeroArray(SIZE, c);
@@ -147,6 +163,42 @@ int main(int argc, char* argv[]) {
     //printArray(count, c, true);
     printCmpLenResult(count, expectedNPOT, b, c);
 
+
+    printf("\n");
+    printf("*****************************\n");
+    printf("** RADIX SORT TESTS **\n");
+    printf("*****************************\n");
+
+    genArray(SIZE - 1, a, 50);
+
+    zeroArray(SIZE, c);
+    printDesc("cpu radix sort, power-of-two");
+    StreamCompaction::Radix_Sort::radix_sort_cpu(SIZE, 6, c, a);
+    count = SIZE;
+    printElapsedTime(StreamCompaction::Radix_Sort::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+    printArray(count, c, true);
+
+    zeroArray(SIZE, c);
+    printDesc("cpu radix sort, non-power-of-two");
+    count = NPOT;
+    StreamCompaction::Radix_Sort::radix_sort_cpu(NPOT, 6, c, a);
+    printElapsedTime(StreamCompaction::Radix_Sort::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+    printArray(count, c, true);
+
+    zeroArray(SIZE, c);
+    printDesc("parallel radix sort, power-of-two");
+    StreamCompaction::Radix_Sort::radix_sort_parallel(SIZE, 6, c, a);
+    count = SIZE;
+    printElapsedTime(StreamCompaction::Radix_Sort::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+    printArray(count, c, true);
+
+    zeroArray(SIZE, c);
+    printDesc("parallel radix sort, non-power-of-two");
+    count = NPOT;
+    StreamCompaction::Radix_Sort::radix_sort_parallel(NPOT, 6, c, a);
+    printElapsedTime(StreamCompaction::Radix_Sort::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+    printArray(count, c, true);
+
     system("pause"); // stop Win32 console from closing on exit
     delete[] a;
     delete[] b;

diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt
@@ -4,6 +4,7 @@ set(headers
     "naive.h"
     "efficient.h"
     "thrust.h"
+    "radix_sort.h"
     )
 
 set(sources
@@ -12,6 +13,7 @@ set(sources
     "naive.cu"
     "efficient.cu"
     "thrust.cu"
+    "radix_sort.cu"
     )
 
 list(SORT headers)

diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
@@ -1,4 +1,5 @@
 #include "common.h"
+#include <device_launch_parameters.h>
 
 void checkCUDAErrorFn(const char *msg, const char *file, int line) {
     cudaError_t err = cudaGetLastError();
@@ -24,6 +25,12 @@ namespace StreamCompaction {
          */
         __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
             // TODO
+            int k = (blockDim.x * blockIdx.x) + threadIdx.x;
+
+            if (k >= n)
+                return;
+
+            bools[k] = idata[k] > 0 ? 1 : 0;
         }
 
         /**
@@ -33,6 +40,16 @@ namespace StreamCompaction {
         __global__ void kernScatter(int n, int *odata,
                 const int *idata, const int *bools, const int *indices) {
             // TODO
+
+            int k = (blockDim.x * blockIdx.x) + threadIdx.x;
+
+            if (k >= n)
+                return;
+
+            if (bools[k] > 0)
+            {
+                odata[indices[k]] = idata[k];
+            }
         }
 
     }

diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
@@ -19,10 +19,22 @@ namespace StreamCompaction {
          */
         void scan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+
+            doScan(n, odata, idata);
+
             timer().endCpuTimer();
         }
 
+        void doScan(int n, int* odata, const int* idata)
+        {
+            // Exclusive scan
+            odata[0] = 0;
+            for (int i = 1; i < n; ++i)
+            {
+                odata[i] = idata[i - 1] + odata[i - 1];
+            }
+        }
+
         /**
          * CPU stream compaction without using the scan function.
          *
@@ -31,8 +43,18 @@ namespace StreamCompaction {
         int compactWithoutScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
             // TODO
+
+            int count = 0;
+            for (int i = 0; i < n; ++i)
+            {
+                if (idata[i] > 0)
+                {
+                    odata[count++] = idata[i];
+                }
+            }
+
             timer().endCpuTimer();
-            return -1;
+            return count;
         }
 
         /**
@@ -41,10 +63,39 @@ namespace StreamCompaction {
          * @returns the number of elements remaining after compaction.
          */
         int compactWithScan(int n, int *odata, const int *idata) {
+
+            int* tmpInputData = new int[n];
+            int* tmpOutputData = new int[n];
+
             timer().startCpuTimer();
-            // TODO
+
+            // Transfer idata to 0,1 set
+            for (int i = 0; i < n; ++i)
+            {
+                tmpInputData[i] = idata[i] > 0 ? 1 : 0;
+            }
+
+            // Exclusive scan
+            doScan(n, tmpOutputData, tmpInputData);
+
+            // Final array size
+            int count = tmpOutputData[n - 1] + tmpInputData[n - 1];
+
+            // Scatter
+            for (int i = 0; i < n; ++i)
+            {
+                if (tmpInputData[i] > 0)
+                {
+                    odata[tmpOutputData[i]] = idata[i];
+                }
+            }
+
             timer().endCpuTimer();
-            return -1;
+
+            delete[] tmpInputData;
+            delete[] tmpOutputData;
+
+            return count;
         }
     }
 }
diff --git a/stream_compaction/cpu.h b/stream_compaction/cpu.h
@@ -6,6 +6,8 @@ namespace StreamCompaction {
     namespace CPU {
         StreamCompaction::Common::PerformanceTimer& timer();
 
+        void doScan(int n, int* odata, const int* idata);
+
         void scan(int n, int *odata, const int *idata);
 
         int compactWithoutScan(int n, int *odata, const int *idata);