Project 10: We Can't Get Enough of Fish Schooling on the GPU
In this project, you will be finishing up your CUDA C version of the fish-schooling simulations. The goal of the project is to run lots of simulations, time them, and examine the output (i.e., determine how r affects elongation and polarization for multiple values of r, and how that changes with fish-school size).
- In kernel.cu, write computeElongation and computePolarization so that they work on the device (I say this under the assumption that you aren't using disp_sim in this directory).
- In kernel.cu, write a version of fillRandomOcean (call it something like fillRandomOceanOnGPU) that uses CUDA's random number generator curand.
- To use the library, you will need to include <curand_kernel.h>. To ensure that different threads and different blocks get different pseudorandom numbers, the generator needs a seed plus a distinct sequence number for each thread, so fillRandomOceanOnGPU will need to take an unsigned long long seed as input. It will be called by a kernel function and must run on the device, so it needs __device__ before its definition. We will talk about this in class, but here is the documentation for the CURAND library (Chapter 3 is the part you want). I call curand_init so that all fish in all oceans use the same seed (which I set by calling time(NULL) on the host). Each thread uses its thread id (blockDim.x*blockIdx.x+threadIdx.x) for the sequence number and 0 for the offset. The job of each thread is to fill in information for its goal fish.
- You will also need to set your LD_LIBRARY_PATH environment variable to include the library. Add this line to your .cshrc file:
setenv LD_LIBRARY_PATH /usr/local/cuda/lib64
- You will need to add -lcurand to your LFLAGS in your Makefile.
- In kernel.cu, write a kernel (i.e., a __global__ function) named runRandomSimulationWithStatistics. It should take as input an unsigned long long seed, along with r, numFrames, and pointers to the elongation and polarization arrays. The code should declare the same shared memory arrays as the simulation kernel from last week, but instead of copying an initial ocean from device global memory, it will create the ocean by calling fillRandomOceanOnGPU, putting the data directly into shared memory. It can then proceed like last week's version, but it must also compute the elongation and the polarization, and place those results in the appropriate slots in the elongation and polarization arrays.
- Make a file named collect_stats.cu, and put runSimulations in it. This file plays a role similar to the role collect_stats.c played in the pthreads project - it encapsulates all the code that gets the simulations running. In this case, it means runSimulations calls the runRandomSimulationWithStatistics kernel for NUM_OCEANS simulations with NUM_FISH in each. Add a definition for a NUM_OCEANS macro to fish_sizes.h (both collect_stats and sim_stats_nr will want it).
- Write sim_stats_nr.cu. You can basically copy sim_stats_nr.c from the pthreads version. To use get_time_sec, take it out of utils.h, put it into a file named my_timing.cu, and include a file named my_timing.cuh. Also, because kernel launches are asynchronous, remember to call cudaDeviceSynchronize() after the kernel call and before reading the stop time; otherwise you will time only the launch, not the kernel.
- Run the code for oceans of 100 fish, and compare the timing results to those of the pthreads version. Which is faster? How many oceans can you simulate without seeing a significant performance hit?
- Increase the number of fish in the ocean and describe the effects of r on polarization and elongation -- does r have different effects, depending on the ocean size?
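The curand setup, the shared-memory fill, and the timing advice from the steps above can be sketched together as follows. This is only a minimal illustration under stated assumptions, not the project solution: the DemoFish struct, demoKernel, wallSeconds, runDemo, and the NUM_FISH and NUM_OCEANS values are all hypothetical stand-ins (the real sizes come from fish_sizes.h, and get_time_sec would replace wallSeconds). The kernel skips the simulation itself and just records the mean x coordinate of each ocean so there is something to copy back.

```cuda
#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>
#include <curand_kernel.h>

#define NUM_FISH 100    /* assumed value; the real one lives in fish_sizes.h */
#define NUM_OCEANS 64   /* assumed value */

/* Hypothetical fish layout, for illustration only. */
struct DemoFish { float x, y, heading; };

/* Each thread fills in its own fish. All threads share one seed; the global
   thread id is the sequence number and the offset is 0. */
__device__ void fillRandomOceanOnGPU(DemoFish *ocean, unsigned long long seed)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    curandState state;
    curand_init(seed, tid, 0, &state);
    /* curand_uniform returns a float in (0, 1]. */
    ocean[threadIdx.x].x = curand_uniform(&state);
    ocean[threadIdx.x].y = curand_uniform(&state);
    ocean[threadIdx.x].heading = 6.2831853f * curand_uniform(&state);
}

/* One block per ocean, one thread per fish. */
__global__ void demoKernel(unsigned long long seed, float *meanX)
{
    __shared__ DemoFish ocean[NUM_FISH];
    fillRandomOceanOnGPU(ocean, seed);
    __syncthreads();   /* every fish must exist before anyone reads the ocean */
    /* ... the simulation frames would run here ... */
    if (threadIdx.x == 0) {
        float sum = 0.0f;
        for (int i = 0; i < NUM_FISH; i++) sum += ocean[i].x;
        meanX[blockIdx.x] = sum / NUM_FISH;   /* one result slot per ocean */
    }
}

/* Stand-in for get_time_sec: wall-clock time in seconds. */
static double wallSeconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

/* Launches the kernel and returns the elapsed seconds, or -1.0 on error.
   hostMeanX must have room for NUM_OCEANS floats. */
double runDemo(unsigned long long seed, float *hostMeanX)
{
    float *devMeanX = NULL;
    size_t bytes = NUM_OCEANS * sizeof(float);
    if (cudaMalloc((void **)&devMeanX, bytes) != cudaSuccess) return -1.0;

    double start = wallSeconds();
    demoKernel<<<NUM_OCEANS, NUM_FISH>>>(seed, devMeanX);
    cudaError_t err = cudaGetLastError();                  /* launch errors */
    if (err == cudaSuccess) err = cudaDeviceSynchronize(); /* wait for the kernel */
    double elapsed = wallSeconds() - start;

    if (err == cudaSuccess)
        err = cudaMemcpy(hostMeanX, devMeanX, bytes, cudaMemcpyDeviceToHost);
    cudaFree(devMeanX);
    return (err == cudaSuccess) ? elapsed : -1.0;
}
```

On the host, something like `runDemo((unsigned long long)time(NULL), results)` would seed the run from the clock (with <ctime> included). Note that the stop time is read only after cudaDeviceSynchronize() returns; without that call the measurement would cover just the launch.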
Perform a particularly in-depth analysis of the code or the results. I would really like to know if we can tell how many blocks are actually run at the same time. If you can figure that out, I will be particularly pleased.
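One possible starting point for the question of concurrent blocks is the CUDA runtime's occupancy query. The sketch below asks how many blocks of a given kernel can be resident on one multiprocessor and multiplies by the number of multiprocessors. It is a theoretical upper bound based on the kernel's resource usage, not a measurement of what actually ran. The names placeholderKernel, DemoFish, and estimateResidentBlocks, and the NUM_FISH value, are assumptions made for illustration; you would pass your own kernel and block size instead.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_FISH 100   /* assumed value; the real one lives in fish_sizes.h */

/* Hypothetical fish layout, for illustration only. */
struct DemoFish { float x, y, heading; };

/* Placeholder with a shared-memory footprint similar in spirit to a
   one-ocean-per-block simulation kernel. */
__global__ void placeholderKernel(float *out)
{
    __shared__ DemoFish ocean[NUM_FISH];
    ocean[threadIdx.x].x = (float)threadIdx.x;
    ocean[threadIdx.x].y = 0.0f;
    ocean[threadIdx.x].heading = 0.0f;
    __syncthreads();
    if (threadIdx.x == 0) out[blockIdx.x] = ocean[NUM_FISH - 1].x;
}

/* Returns an upper bound on how many blocks of placeholderKernel can be
   resident on the whole device at once, or -1 on error. */
int estimateResidentBlocks(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;

    int blocksPerSM = 0;
    /* Last argument is dynamic shared memory per block; static shared memory
       declared inside the kernel is accounted for automatically. */
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, placeholderKernel, NUM_FISH, 0) != cudaSuccess)
        return -1;

    printf("%s: %d multiprocessors, up to %d resident blocks each\n",
           prop.name, prop.multiProcessorCount, blocksPerSM);
    return blocksPerSM * prop.multiProcessorCount;
}
```

Comparing this bound against the number of oceans at which your timings start to climb may help support or refute a guess about how many blocks really run at once.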
Improve the performance. The current design leaves much to be desired. Can you speed it up?
Writeup and Handin
To hand in your project, you will gather all of the necessary files into a proj10 directory under your "turnin" directory on the gpu host.
- Create a file named README.txt for your project write-up. Include a description of the process you used to determine that your code produces correct results. Also include the analysis outlined earlier. The more thorough the analysis, the higher your grade will be.
- You should hand in all code necessary to run your solutions. Place all necessary .h, .c, .cu, .cuh, and Makefile files in the proj10 directory. Stephanie will probably want to compile and run the code. It should be possible to do so without looking for any more files. Zip up the directory and mail it to Stephanie.