Project 8: Dot Products on the GPU
Log onto one of the computers hosting a GPU (rocketcalc-n0, -n1, -n2, or -n3), e.g.
ssh -X rocketcalc.cs.colby.edu
You can use XEmacs to edit files here. Notice that you need to use the -X flag for X-forwarding instead of -Y.
In this project, you will write a CUDA C program to find the dot product of two vectors. The goal of the project is to become familiar with shared memory and synchronization issues in CUDA C.
- Read section 5.3.1 in CUDA by Example.
- Write a single CUDA C program that computes the dot product of two vectors. For the overall program, follow the strategy outlined in the book's example. However, use the "naive" strategy for summing the products. Also, write CPU code to verify that the answer is correct.
- Now write a version that uses the tree structure to sum the products (i.e., use the book's code snippet). It is OK to assume NUM_THREADS_PER_BLOCK is a power of 2.
- Time the two versions. Is the naive version slower? How large does NUM_THREADS_PER_BLOCK need to be before the tree version is faster than the naive version?
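The steps above can be sketched roughly as follows. This is a minimal illustration and not the book's code: it assumes the "naive" strategy means that thread 0 of each block adds up the block's shared-memory entries serially, and the names dot_kernel, gpu_dot, cpu_dot, the use_tree flag, and the values of NUM_BLOCKS and NUM_THREADS_PER_BLOCK are all hypothetical. Error checking on the CUDA calls is omitted for brevity.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define NUM_THREADS_PER_BLOCK 256  // assumed value; a power of 2, as the tree version requires
#define NUM_BLOCKS 32              // assumed value

// Each block writes one partial sum. use_tree selects the summation strategy.
__global__ void dot_kernel(const float *a, const float *b, float *partial,
                           int n, int use_tree)
{
    __shared__ float cache[NUM_THREADS_PER_BLOCK];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float temp = 0.0f;
    // Grid-stride loop: each thread accumulates its share of the products.
    for (int i = tid; i < n; i += blockDim.x * gridDim.x)
        temp += a[i] * b[i];
    cache[threadIdx.x] = temp;
    __syncthreads();  // every product must be in shared memory before summing

    if (use_tree) {
        // Tree reduction: the number of active threads halves at each step.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                cache[threadIdx.x] += cache[threadIdx.x + stride];
            __syncthreads();  // outside the inner if, so every thread reaches it
        }
        if (threadIdx.x == 0)
            partial[blockIdx.x] = cache[0];
    } else if (threadIdx.x == 0) {
        // "Naive" summation (assumed meaning): thread 0 adds every entry serially.
        float sum = 0.0f;
        for (int i = 0; i < blockDim.x; i++)
            sum += cache[i];
        partial[blockIdx.x] = sum;
    }
}

// CPU reference used to verify the GPU answer.
float cpu_dot(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

// Host wrapper: copies the inputs over, launches the kernel, and adds the
// per-block partial sums on the CPU.
float gpu_dot(const float *a, const float *b, int n, int use_tree)
{
    assert(n > 0);
    size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_partial;
    float h_partial[NUM_BLOCKS];
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_partial, NUM_BLOCKS * sizeof(float));
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);
    dot_kernel<<<NUM_BLOCKS, NUM_THREADS_PER_BLOCK>>>(d_a, d_b, d_partial, n, use_tree);
    cudaMemcpy(h_partial, d_partial, NUM_BLOCKS * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_partial);
    float sum = 0.0f;
    for (int i = 0; i < NUM_BLOCKS; i++)
        sum += h_partial[i];
    return sum;
}
```

A main function would fill two host arrays, call gpu_dot with use_tree set to 0 and then to 1, and compare each result against cpu_dot. For general floating-point data the GPU and CPU sums may differ slightly because the additions happen in a different order, so comparing within a small tolerance is probably more appropriate than testing for exact equality.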
Develop a strategy (similar to the binary tree reduction) that will work for any NUM_THREADS_PER_BLOCK value.
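One possible strategy, sketched here as an illustration rather than as the expected answer: start the stride at half of the smallest power of two that is at least the block size, and skip any addition whose partner index falls outside the block. The names block_sum_any_size, dot_kernel_any, and gpu_dot_any are hypothetical, the block size is passed at launch time through dynamically sized shared memory, and error checking is omitted.

```cuda
#include <cassert>
#include <cstdlib>
#include <cuda_runtime.h>

// Sums cache[0 .. blockDim.x-1] for any block size. Must be called by every
// thread in the block because it contains __syncthreads().
__device__ float block_sum_any_size(float *cache)
{
    int size = (int)blockDim.x;
    int stride = 1;
    while (stride < size)  // smallest power of two >= block size
        stride *= 2;
    for (stride /= 2; stride > 0; stride /= 2) {
        // The second test skips partners that lie past the end of the block.
        if ((int)threadIdx.x < stride && (int)threadIdx.x + stride < size)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }
    return cache[0];
}

__global__ void dot_kernel_any(const float *a, const float *b, float *partial, int n)
{
    extern __shared__ float cache[];  // blockDim.x floats, sized at launch
    float temp = 0.0f;
    for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n;
         i += blockDim.x * gridDim.x)
        temp += a[i] * b[i];
    cache[threadIdx.x] = temp;
    __syncthreads();
    float sum = block_sum_any_size(cache);
    if (threadIdx.x == 0)
        partial[blockIdx.x] = sum;
}

// Host wrapper taking the grid and block sizes as run-time arguments.
float gpu_dot_any(const float *a, const float *b, int n, int blocks, int threads)
{
    assert(n > 0 && blocks > 0 && threads > 0);
    size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_partial;
    float *h_partial = (float *)malloc(blocks * sizeof(float));
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);
    // Third launch parameter: bytes of dynamic shared memory per block.
    dot_kernel_any<<<blocks, threads, threads * sizeof(float)>>>(d_a, d_b, d_partial, n);
    cudaMemcpy(h_partial, d_partial, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    float sum = 0.0f;
    for (int i = 0; i < blocks; i++)
        sum += h_partial[i];
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_partial);
    free(h_partial);
    return sum;
}
```

With a block size of 5, for example, the first step uses a stride of 4 and only thread 0 adds (entry 4 into entry 0); the remaining steps then proceed exactly as in the power-of-2 case. Other designs, such as padding the shared array with zeros up to the next power of two, would also work.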
Alternatively, make clever use of GPU timers to determine how many blocks are running at a time. You may use any program you like for this. Explain your results and how you interpret them.
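A timing experiment of this kind might be set up with CUDA events, as in the sketch below. The kernel busy_kernel and the helper time_launch_ms are hypothetical names; the kernel simply performs a fixed amount of arithmetic per thread so that every block takes about the same time. The idea, stated as an assumption rather than a guarantee, is that the elapsed time should stay roughly flat while all blocks fit on the GPU at once and rise when an extra round of blocks is needed. Error checking is omitted.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Does a fixed amount of arithmetic per thread. The result is stored so the
// compiler cannot remove the loop.
__global__ void busy_kernel(float *out, int iters)
{
    float x = (float)threadIdx.x;
    for (int i = 0; i < iters; i++)
        x = x * 0.999f + 0.5f;
    if (threadIdx.x == 0)
        out[blockIdx.x] = x;
}

// Returns the elapsed GPU time, in milliseconds, for one kernel launch.
float time_launch_ms(int blocks, int threads, int iters)
{
    assert(blocks > 0 && threads > 0);
    float *d_out;
    cudaMalloc((void **)&d_out, blocks * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    busy_kernel<<<blocks, threads>>>(d_out, iters);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the kernel and the stop event finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return ms;
}
```

A driver could call time_launch_ms in a loop over increasing block counts and print each time, after one untimed warm-up launch, since the first launch typically includes start-up overhead. The same event calls could also be placed around the dot product kernels when comparing the naive and tree versions.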
Writeup and Handin
To hand in your project, you will gather all of the necessary files into a proj08_
- Create a file named README.txt for your project write-up. Include the analysis outlined earlier. The more thorough the analysis, the higher your grade will be.
- You should hand in all code necessary to run your solutions. Place all necessary source files (.h, .c, or .cu if you use that extension) and your Makefile in the directory. Stephanie will probably want to compile and run the code. It should be possible to do so without looking for any more files.
Tar up the directory and email the tarball to Stephanie.