Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
28 commits
Select commit Hold shift + click to select a range
958588d
CPU side done
Liang-Hao-Quan Sep 13, 2022
9569f15
naive scan done, but there is a better way for buffer swapping -- act…
Liang-Hao-Quan Sep 13, 2022
c82718b
Got rid of cudaMemcpy
Liang-Hao-Quan Sep 14, 2022
e91ced3
Small optimization on the CPU side
Liang-Hao-Quan Sep 14, 2022
64f0aec
Efficient scan is buggy
Liang-Hao-Quan Sep 14, 2022
d02eceb
work-efficient power of two passed
Liang-Hao-Quan Sep 15, 2022
39688c5
Got rid of using pow() but still failing non-power-of-two
Liang-Hao-Quan Sep 15, 2022
1815a26
Work-Efficient Scan Done
Liang-Hao-Quan Sep 15, 2022
2b05423
Encountered a "illegal memory access" for cudaMemset when implementin…
Liang-Hao-Quan Sep 15, 2022
70e4711
Fixed the illegal memory access bug but the result count is incorrect
Liang-Hao-Quan Sep 15, 2022
4fcbc02
Added multiple temp arrays to debug and found the bug...a very stupid…
Liang-Hao-Quan Sep 16, 2022
0bf2ca0
Small Optimization - Work-Efficient Compaction done
Liang-Hao-Quan Sep 16, 2022
38c9207
Thrust done
Liang-Hao-Quan Sep 16, 2022
8726093
Covered an edge case -- Di shared this with me
Liang-Hao-Quan Sep 17, 2022
6d22ff6
Added a test case for Radix sort on the CPU side
Liang-Hao-Quan Sep 17, 2022
ac5e2f1
Radix sort implemented but buggy
Liang-Hao-Quan Sep 17, 2022
bd39350
Added some lines for debugging.
Liang-Hao-Quan Sep 18, 2022
3f9bee3
Radix done
Liang-Hao-Quan Sep 18, 2022
7cd98f4
Radix non power of 2 failed
Liang-Hao-Quan Sep 18, 2022
0812891
Radix Sort Done
Liang-Hao-Quan Sep 18, 2022
ecaf4c5
Update README.md
Liang-Hao-Quan Sep 18, 2022
9b63d9a
Fixed a bug for naive scan...
Liang-Hao-Quan Sep 18, 2022
f2c0c61
Merge branch 'main' of https://github.com/LEO-CGGT/Project2-Stream-Co…
Liang-Hao-Quan Sep 18, 2022
ee90e8a
Added images, optimized thrust impl
Liang-Hao-Quan Sep 19, 2022
338de3f
Update README.md
Liang-Hao-Quan Sep 19, 2022
0606291
Added Table of Contents
Liang-Hao-Quan Sep 19, 2022
16766b7
Update README.md
Liang-Hao-Quan Sep 19, 2022
3c73ded
Update README.md
Liang-Hao-Quan Sep 19, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
214 changes: 205 additions & 9 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,14 +1,210 @@
CUDA Stream Compaction
======================

# CUDA Stream Compaction
**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Haoquan Liang
* [LinkedIn](https://www.linkedin.com/in/leohaoquanliang/)
* Tested on: Windows 10, Ryzen 7 5800X 8 Core 3.80 GHz, NVIDIA GeForce RTX 3080 Ti 12 GB

# Table of Contents
[Features](#features)
[Performance Analysis](#perf_anal)
[Extra Credit](#ec)
[Debugging Notes/Example](#debug)
[Output Sample](#output)
[Future Improvement](#future)

<a name="features"/>

# Features
* CPU Scan
* CPU Compaction with/without Scan
* GPU Naive Scan
* GPU Work-efficient Scan
* GPU Compaction with Scan
* Thrust Scan/Compaction
* GPU Scan Optimization (Extra Credit)
* Radix Sort (Extra Credit)

<a name="perf_anal"/>

# Performance Analysis
The data is generated in Release mode
## Scan
![Scan power-of-two](img/scan-pot.png)

![Scan non-power-of-two](img/scan-npot.png)
Thrust's implementation is always the most efficient method. GPU work-efficient scan starts outperforming the CPU when the input size is greater than 2^20. GPU naive method is always the least efficient method.
This is totally expected, as there are more overheads on the GPU side, and the benefits of parallelism only starts to show up when the input size is big enough.

## Compaction
![Compaction power-of-two](img/compact-pot.png)

![Compaction non-power-of-two](img/compact-npot.png)
CPU with scan is the least efficient in this case, which makes sense since it actually did a lot of extra work without the benefit of parallelism. GPU work-efficient compaction starts outperforming the regular CPU implementation when the input size is greater than 2^18. This is early than the scan case, because compaction can take a lot of advantages from parallelism.

## Sort
![Sort power-of-two](img/sort-pot.png)

![Sort non-power-of-two](img/sort-npot.png)
The CPU side uses the std::sort, which uses the Introsort algorithm that has a time complexity of O(N log(N)). It is already a very efficient sorting algorithm, but the GPU's Radix sort still manages to beat it when the input size is more than 2^23.

<a name="ec"/>

# Extra Credit
### Why is My GPU Approach So Slow?
There are many reasons for the GPU approach to be slower, and following are some of my ideas:
* There are way more computation overheads for the GPU scan, which is due to the nature of the algorithm being used. There isn't really any way to improve it.
* We are using Global memory in our implementation, instead of Shared Memory that is a lot faster.
* In both Up/Down Sweep, when d value is high, there will be a lot of threads idling. I added code to make sure that if the data at that index won't be checked for the current d value, it should exit early. There are further improvements that can be done by improving the locality of the data.
* Block size can affect the performance. I tried varying the block size based on the input size, setting a very small block size, and setting a very big block size. Unfortunately, all the above methods decreased the performance. I also checked the block size used by Thrust, and it seems to be using 128. So I used 128 in my implementation too. It will be interesting to dive deeper and learn the reasons in the future.

### Radix Sort
I implemented the radix sort and wrote a test case using the std::sort on the CPU side. The algorithm works perfectly and start outperforming the CPU side when input is greater than 2^23.
In main.cpp, I constructed array a (power-of-two size) and array b (non-power-of-two size), and I called StreamCompaction::Efficient::radixSort() to generate the result, and print whether the result matches the result from the CPU.
This is the result
```
*****************************
** RADIX SORT TESTS **
*****************************
[ 71 23 76 5 97 9 58 85 81 38 37 70 2 ... 80 0 ]
==== cpu std::sort, power-of-two ====
elapsed time: 1086.39ms (std::chrono Measured)
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ]
==== cpu std::sort, non-power-of-two ====
elapsed time: 1067.79ms (std::chrono Measured)
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ]
==== radix sort, power-of-two ====
elapsed time: 803.583ms (CUDA Measured)
passed
==== radix sort, non-power-of-two ====
elapsed time: 785.894ms (CUDA Measured)
passed
```
for these lines:
```
printf("\n");
printf("*****************************\n");
printf("** RADIX SORT TESTS **\n");
printf("*****************************\n");
// Sort Test
genArray(SIZE - 1, a, 100);
printArray(SIZE, a, true);

zeroArray(SIZE, b);
printDesc("cpu std::sort, power-of-two");
StreamCompaction::CPU::sort(SIZE, b, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(SIZE, b, true);

zeroArray(SIZE, c);
printDesc("cpu std::sort, non-power-of-two");
StreamCompaction::CPU::sort(NPOT, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(NPOT, c, true);

zeroArray(SIZE, d);
printDesc("radix sort, power-of-two");
StreamCompaction::Efficient::radixSort(SIZE, d, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printCmpResult(SIZE, b, d);

zeroArray(SIZE, d);
printDesc("radix sort, non-power-of-two");
StreamCompaction::Efficient::radixSort(NPOT, d, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printCmpResult(NPOT,c, d);

```

<a name="debug"/>

# Debugging Notes/Example
For this project, I mainly used the watch and memory tool from the VS debugger. I want to clearly see the changes of the array at each step, so I copy the result from the Kernel to a local CPU array, use the watch to find out its memory, and then use the memory tool to see if it's the desired result.
Following is an example:
![Sort Debugging](img/debug_radix_sort6.png)
My Radix sort is generating a wrong result, so I use the above method to go through each step. At step 5, the result wasn't expected (it should have all the 0s and 2s followed by 1s and 3s but instead they are all mixed together). I figured out it's the KernScatter function that didn't generate the correct result, so I went to check for it and find the bug.

<a name="output"/>

# Output Sample

For input size 2^26:
```

****************
** SCAN TESTS **
****************
[ 21 17 24 5 47 20 8 18 12 37 41 44 21 ... 25 0 ]
==== cpu scan, power-of-two ====
elapsed time: 36.0359ms (std::chrono Measured)
[ 0 21 38 62 67 114 134 142 160 172 209 250 294 ... 1643665856 1643665881 ]
==== cpu scan, non-power-of-two ====
elapsed time: 35.3576ms (std::chrono Measured)
[ 0 21 38 62 67 114 134 142 160 172 209 250 294 ... 1643665797 1643665811 ]
passed
==== naive scan, power-of-two ====
elapsed time: 53.4917ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 47.9857ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 24.1193ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 21.9279ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.766976ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.86528ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 3 2 1 3 3 0 3 2 2 1 1 0 2 ... 2 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 128.336ms (std::chrono Measured)
[ 3 2 1 3 3 3 2 2 1 1 2 3 2 ... 2 2 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 122.404ms (std::chrono Measured)
[ 3 2 1 3 3 3 2 2 1 1 2 3 2 ... 2 2 ]
passed
==== cpu compact with scan ====
elapsed time: 271.842ms (std::chrono Measured)
[ 3 2 1 3 3 3 2 2 1 1 2 3 2 ... 2 2 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 33.268ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 33.4285ms (CUDA Measured)
passed

### (TODO: Your README)
*****************************
** RADIX SORT TESTS **
*****************************
[ 34 42 27 4 19 27 7 9 22 13 49 20 41 ... 8 0 ]
==== cpu std::sort, power-of-two ====
elapsed time: 946.368ms (std::chrono Measured)
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 49 49 ]
==== cpu std::sort, non-power-of-two ====
elapsed time: 908.604ms (std::chrono Measured)
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 49 49 ]
==== radix sort, power-of-two ====
elapsed time: 846.838ms (CUDA Measured)
passed
==== radix sort, non-power-of-two ====
elapsed time: 815.581ms (CUDA Measured)
passed
```

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
<a name="future"/>

# Future Improvement
* Using Shared Memory for GPU scan
* Improve the locality of the array to help retiring more idling threads
* Learn more about Thrust's implementation (why it is so efficient)
Binary file added img/compact-npot.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/compact-pot.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/debug_radix_sort0.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/debug_radix_sort1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/debug_radix_sort2.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/debug_radix_sort3.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/debug_radix_sort4.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/debug_radix_sort5.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/debug_radix_sort6.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/debug_radix_sort7--changed-D.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/debug_radix_sort8-still-wrong.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/debug_radix_sort9-yes.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/scan-npot.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/scan-pot.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/sort-npot.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/sort-pot.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
38 changes: 37 additions & 1 deletion src/main.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -13,11 +13,13 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 26; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
int *c = new int[SIZE];
int* d = new int[SIZE];


int main(int argc, char* argv[]) {
// Scan tests
Expand Down Expand Up @@ -147,8 +149,42 @@ int main(int argc, char* argv[]) {
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

printf("\n");
printf("*****************************\n");
printf("** RADIX SORT TESTS **\n");
printf("*****************************\n");
// Sort Test
genArray(SIZE - 1, a, 50);
printArray(SIZE, a, true);

zeroArray(SIZE, b);
printDesc("cpu std::sort, power-of-two");
StreamCompaction::CPU::sort(SIZE, b, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(SIZE, b, true);

zeroArray(SIZE, c);
printDesc("cpu std::sort, non-power-of-two");
StreamCompaction::CPU::sort(NPOT, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(NPOT, c, true);

zeroArray(SIZE, d);
printDesc("radix sort, power-of-two");
StreamCompaction::Efficient::radixSort(SIZE, d, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printCmpResult(SIZE, b, d);

zeroArray(SIZE, d);
printDesc("radix sort, non-power-of-two");
StreamCompaction::Efficient::radixSort(NPOT, d, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printCmpResult(NPOT,c, d);


system("pause"); // stop Win32 console from closing on exit
delete[] a;
delete[] b;
delete[] c;
delete[] d;
}
20 changes: 16 additions & 4 deletions stream_compaction/common.cu
Original file line number Diff line number Diff line change
Expand Up @@ -23,16 +23,28 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n) {
return;
}
bools[index] = idata[index] == 0 ? 0 : 1;
}

/**
* Performs scatter on an array. That is, for each element in idata,
* if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
__global__ void kernScatter(int n, int* odata,
const int* idata, const int* bools, const int* indices) {
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n) {
return;
}

if (bools[index]) {
odata[indices[index]] = idata[index];
}

}

}
Expand Down
59 changes: 53 additions & 6 deletions stream_compaction/cpu.cu
Original file line number Diff line number Diff line change
@@ -1,6 +1,5 @@
#include <cstdio>
#include "cpu.h"

#include "common.h"

namespace StreamCompaction {
Expand All @@ -19,7 +18,12 @@ namespace StreamCompaction {
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int identity = 0;
odata[0] = identity; // exclusive scan
for (int i = 1; i < n; ++i)
{
odata[i] = odata[i - 1] + idata[i - 1];
}
timer().endCpuTimer();
}

Expand All @@ -30,9 +34,17 @@ namespace StreamCompaction {
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int oIndex = 0;
for (int i = 0; i < n; ++i)
{
if (idata[i] != 0)
{
odata[oIndex] = idata[i];
oIndex++;
}
}
timer().endCpuTimer();
return -1;
return oIndex; // the number of elements remaining
}

/**
Expand All @@ -41,10 +53,45 @@ namespace StreamCompaction {
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
int* keepData = new int[n];
// Step 1 and Step 2
timer().startCpuTimer();
int identity = 0;
odata[0] = identity; // exclusive scan

for (int i = 0; i < n; ++i)
{
keepData[i] = (idata[i] == 0)? 0 : 1;
if (i > 0)
odata[i] = odata[i - 1] + keepData[i - 1];

}

// Step 3
int oIndex = 0;
for (int i = 0; i < n; ++i)
{
if (keepData[i])
{
odata[oIndex] = idata[i];
oIndex++;
}
}
timer().endCpuTimer();
delete[] keepData;
return oIndex;
}

/*
* CPU sort using std::sort
*/
void sort(int n, int* odata, const int* idata)
{
memcpy(odata, idata, n * sizeof(int));
timer().startCpuTimer();
// TODO
std:: sort(odata, odata + n);
timer().endCpuTimer();
return -1;
}

}
}
Loading