CS 336: Project #8

Fish Schooling on the GPU

Log onto one of the computers hosting a GPU (gpu0, gpu1, gpu2, or gpu3), e.g.

ssh -X gpu0.cs.colby.edu

You can use XEmacs to edit files here. Notice that you need to use the -X flag for X-forwarding instead of -Y.

In this project, you will be writing a CUDA C version of the fish-schooling simulations. The goal of the project is to implement a substantial program which uses CUDA C to make it run quickly. We will be discussing the design in class, but I want to have the details written down so that you can refer to them.

  1. Read the rest of the paper on fish-schooling simulation. (The last time you read it, you were instructed to skip the parts on the GPU implementation; this time, read those parts too.)
  2. Zip up your sequential implementation of the fish-schooling code and scp it to the GPU host. Put it in a directory other than the proj08 directory (you won't be turning it in). This may take a little work, because you will probably first want to copy fishStep from your p-threads version, since it has all the latest and greatest features. Then compile and run enough programs to convince yourself you have something against which you can test the GPU code you will write for this project.
  3. There are a few issues related to re-writing the fish simulation code on the GPU.
    1. All user-written code called by the kernel must be in the same compilation unit as the kernel, because there is no device-side linker. That means we need to be very careful with our support files, like my_math. We will place all kernels into a file called kernels.cu and include my_math.cu instead of my_math.h. This has implications for linking: no program can link both my_math.o and kernels.o, because both will contain the code for the math functions.
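    As a sketch of this arrangement (the kernel name here is a placeholder, and the include target follows the instruction above):

    ```
    /* kernels.cu -- sketch; exampleKernel is a placeholder name */
    #include "my_math.cu"   /* the implementation, not the header:
                               there is no device-side linker */

    __global__ void exampleKernel(float *out) {
        /* device code here may call any function defined in my_math.cu */
        out[threadIdx.x] = 0.0f;
    }
    ```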
    2. All mallocs need to be cast to the appropriate type. On the CPU, the compiler was forgiving; on the GPU, it isn't. So change any line like this:
             float *a = malloc(sizeof(float)*N);
      to this:
             float *a = (float *) malloc(sizeof(float)*N);
    3. We want to create four kinds of files:
      1. dual_support.cu: contains routines that should be callable from both host and device functions (note this includes the test functions). The macro defining NUM_FISH should be in dual_support.cuh.
      2. kernels.cu: contains all kernels. It will include dual_support.cu
      3. host support code: various files containing routines that should be callable from the host only (e.g. utils.h, fishIO.h, etc.)
      4. main programs: any main program should include kernels.h and dual_support.cuh (if it needs it) and host support files. It should link in kernels.o and host_support.o, but not dual_support.o because kernels.o is a superset of dual_support.o.
  4. Get the math support functions working on the GPU. We need to place the math functions into dual_support.cu because both device and host code need to call them. Below is the schematic for the files that will be involved in getting the math code onto the GPU and getting some testing code written. They are color-coded: blue is for files that have host-only code, purple is for files with code that runs both places, and red is for files with device-only code. I also demonstrate that dual_support and kernels are joined together as one object (because of the shortcomings of the device-side linker).

    figure showing test_my_math files

    1. Copy all functions from my_math.c into dual_support.cu and put the __device__ and __host__ specifiers with all function declarations
    2. Copy all functions from test_my_math.c into dual_support.cu and put the __device__ and __host__ specifiers with all function declarations
    3. Change the name of the “main” function to “test_math”, remove its arguments, and add __device__ and __host__ to its definition.
    4. Add prototypes for all math functions and test_math to dual_support.cuh.
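    The header might look like this sketch (the math function name below is a placeholder for whatever my_math actually provides):

    ```
    /* dual_support.cuh -- sketch; randomFloat is a placeholder name */
    #ifndef DUAL_SUPPORT_CUH
    #define DUAL_SUPPORT_CUH

    __device__ __host__ float randomFloat(float min, float max);  /* placeholder */
    __device__ __host__ void test_math(void);

    #endif
    ```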
    5. Delete my_math.c, my_math.h, and test_my_math.c
    6. Create kernels.cu
      1. include “dual_support.cu” (note that it is the body and not the header!)
      2. Add a function to kernels.cu called test_math_kernel. It has no input or output – it should just call “test_math”.
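      A minimal sketch of steps 6.1 and 6.2 together:

      ```
      /* kernels.cu -- sketch */
      #include "dual_support.cu"   /* the body, not the header */

      __global__ void test_math_kernel(void) {
          test_math();   /* run the math tests on the device */
      }
      ```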
    7. Create kernels.cuh. Place the prototype for test_math_kernel in it
    8. Create a file named test_my_math.cu
      1. include kernels.cuh
      2. include dual_support.cuh
      3. Write a main program that executes the kernel “test_math_kernel” in 1 thread, synchronizes the threads, and then calls test_math. This makes test_math run on the GPU and then on the CPU.
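      A sketch of that main program, with error checking omitted (older CUDA code may use cudaThreadSynchronize instead of cudaDeviceSynchronize):

      ```
      /* test_my_math.cu -- sketch of the test driver */
      #include <stdio.h>
      #include "kernels.cuh"
      #include "dual_support.cuh"

      int main(void) {
          test_math_kernel<<<1, 1>>>();   /* run the tests on the GPU, 1 thread */
          cudaDeviceSynchronize();        /* wait for the kernel to finish */
          test_math();                    /* then run the same tests on the CPU */
          return 0;
      }
      ```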
    9. Create a file fish_sizes.h and place in it the macro defining NUM_FISH. (Be sure to include the FISH_SIZES_H include-guard macro as well.) Although none of our code uses NUM_FISH yet, the Makefile assumes fish_sizes.h is there. So, we might as well write it now.
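    A sketch of that header (the NUM_FISH value here is an arbitrary example; pick whatever school size you are testing with):

    ```
    /* fish_sizes.h -- sketch; the NUM_FISH value is an example */
    #ifndef FISH_SIZES_H
    #define FISH_SIZES_H

    #define NUM_FISH 256

    #endif
    ```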
    10. Download Stephanie’s Makefile.
    11. Make test_my_math and run it. Fix any bugs ☺
  5. Get a simple simulation working on the GPU. Let’s start by getting a simulation that uses an initial configuration defined on the CPU and which copies the final configuration back to the CPU. We will need more files - most of which are host-support files. Below is the diagram of files for this program:

    figure showing sim_simple files

    1. First, we need our support routines. Copy fishStep and correctDirection to kernels.cu. Put __device__ before them because they will run on the GPU. Then fix up fishStep so that it handles only one “goal” fish (instead of looping over all goal fish, thread threadIdx.x should handle goal fish threadIdx.x). Remarkably, nothing else needs to change. Here is a good thought exercise: Why don’t we need to call __syncthreads() in fishStep?
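    The reshaped fishStep might look like this in outline (the parameter names are assumptions based on the arrays used later in this handout):

    ```
    /* sketch: each thread updates the one goal fish matching its thread id */
    __device__ void fishStep(float *Px, float *Py, float *Vx, float *Vy,
                             float *newPx, float *newPy, float *newVx, float *newVy,
                             float r) {
        int i = threadIdx.x;   /* this thread's goal fish; replaces the outer loop */
        /* inner loop over every other fish j, applying the schooling rules and
           writing the result to newPx[i], newPy[i], newVx[i], newVy[i],
           exactly as in the sequential version */
    }
    ```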
    2. Get more host-only support code. Copy my_timing.h, my_timing.c, fishIO.h, fishIO.c, initial_schools.h, and initial_schools.c to the proj08 directory. Change fishIO.h and initial_schools.h to fishIO.cuh and initial_schools.cuh. Likewise, change fishIO.c and initial_schools.c to fishIO.cu and initial_schools.cu. Update initial_schools.cu so that it includes dual_support.cuh instead of my_math.h. Update fishIO.cu so that it casts the result of the malloc calls to the appropriate pointer type. Also add a return statement (return 1) to writeFrame.
    3. Now let’s write our simulation kernel. This belongs in kernels.cu (with a prototype in kernels.cuh). It will take an initial configuration from the CPU (i.e. it will expect it to be in global device memory), run it, and then return the final configuration to the CPU (i.e. it will leave it in global device memory). This kernel is designed for debugging purposes. It is allowed to assume that it is running with one block and NUM_FISH threads. (Note: now kernels.cu needs to include fish_sizes.h). Here is an outline of the function:
      // run simulation with initial configuration supplied by CPU
      // (the data are in global memory, in variables globalPx, globalPy,
      // globalVx, and globalVy).
      // Return the final configuration in those same variables.
      __global__ void runSimulationForDebug(
              float *globalPx, float *globalPy, 
              float *globalVx, float *globalVy, 
              float r, int numFrames) {
        // Declare shared vectors for Px, Py, Vx, and Vy. Since we need
        // two copies of the ocean (current and new), we will use variables
        // Px1, Px2, Py1, Py2, ... , Vy2
        __shared__ float Px1[NUM_FISH];
        // etc.
        // Now make pointers to these arrays so that it is easy to swap them after
        // each call to fishStep
        float *Px = Px1;
        // ...
        float *newVy = Vy2;
        // Let's just use our within-block thread id to figure out which fish is "ours"
        // (i.e. we assume that globalPx, etc. is of length NUM_FISH).
        int id = threadIdx.x;
        // It is our job to copy the appropriate information from the global arrays to the
        // shared arrays.
        Px[id] = globalPx[id];
        // ...
        Vy[id] = globalVy[id];
        // Before we can start the simulation, I need to make sure my compatriots have
        // finished their copying.
        // Now, loop over numFrames.
        // In the loop, call fishStep, then sync the threads, then swap the pointers (just like
        // you did in the sequential version)
        // Finally, put the info about my fish back into the global arrays.
        globalPx[id] = Px[id];
        // ...
        globalVy[id] = Vy[id];
      }
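    The simulation loop hinted at in the comments above might be sketched like this (only one of the four pointer swaps is shown):

    ```
    __syncthreads();   /* make sure every thread has finished copying in */
    for (int frame = 0; frame < numFrames; frame++) {
        fishStep(Px, Py, Vx, Vy, newPx, newPy, newVx, newVy, r);
        __syncthreads();              /* all fish updated before anyone reads on */
        float *tmp = Px;  Px = newPx;  newPx = tmp;   /* swap current and new */
        /* ...same swap for Py, Vx, and Vy... */
    }
    ```

    Note that the swaps themselves need no barrier: the pointers are per-thread local variables, and every thread performs the same swap.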
    4. Finally, let’s write the main program. Create a file named sim_simple.cu that will take r and numFrames as command-line inputs, create an initial schooling configuration on the CPU, copy it to the GPU, and then call a simulation kernel. It should call the kernel on one block with NUM_FISH threads. The final configuration should then be copied back to the host. You can then dump it to the screen or to a file. I would include fishIO.cuh to do that. Make a dumpFrame function similar to the writeFrame function that uses printf instead of fprintf.
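    A sketch of sim_simple.cu, with the initializer left as a comment (its name depends on your initial_schools code) and error checking omitted:

    ```
    /* sim_simple.cu -- sketch */
    #include <stdio.h>
    #include <stdlib.h>
    #include "kernels.cuh"
    #include "fish_sizes.h"

    int main(int argc, char *argv[]) {
        float r = (float) atof(argv[1]);
        int numFrames = atoi(argv[2]);
        size_t bytes = NUM_FISH * sizeof(float);

        /* initial configuration on the CPU */
        float *px, *py, *vx, *vy;
        /* ... call your initial_schools routine to allocate and fill
           px, py, vx, vy ... */

        /* copy to the GPU (shown for Px only; repeat for the other three) */
        float *dPx, *dPy, *dVx, *dVy;
        cudaMalloc((void **) &dPx, bytes);
        cudaMemcpy(dPx, px, bytes, cudaMemcpyHostToDevice);
        /* ... */

        runSimulationForDebug<<<1, NUM_FISH>>>(dPx, dPy, dVx, dVy, r, numFrames);
        cudaDeviceSynchronize();

        /* copy the final configuration back and dump it */
        cudaMemcpy(px, dPx, bytes, cudaMemcpyDeviceToHost);
        /* ... dumpFrame(px, py, vx, vy); ... */
        return 0;
    }
    ```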
    5. Make sim_simple and convince yourself (and Stephanie) that it is working.
    6. Determine how large a school you can simulate accurately on the GPU. Compute it by hand first, then run simulations to determine whether your prediction (based on GPU memory and thread limits) is correct.
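    One way to frame the back-of-the-envelope computation. The device limits below are assumptions for illustration; query your GPU's actual values with cudaGetDeviceProperties:

    ```c
    #include <stdio.h>

    int main(void) {
        /* hypothetical device limits -- substitute your GPU's actual values */
        long sharedBytesPerBlock = 16384;   /* 16 KB of shared memory (assumption) */
        long maxThreadsPerBlock  = 512;     /* per-block thread limit (assumption) */

        /* the debug kernel keeps 8 float arrays of length NUM_FISH in shared memory */
        long memLimit = sharedBytesPerBlock / (8 * (long) sizeof(float));
        long limit = memLimit < maxThreadsPerBlock ? memLimit : maxThreadsPerBlock;

        printf("shared-memory limit: %ld fish\n", memLimit);
        printf("predicted maximum school size: %ld fish\n", limit);
        return 0;
    }
    ```

    Whichever bound is smaller (shared memory or the one-thread-per-fish block limit) caps the school size for this debug kernel.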
    7. Explain the purpose of each call to __syncthreads. I.e. tell me what could go wrong if it weren't there.


  6. Get computeElongation and computePolarization working in the new set-up. Report the result of calling them in sim_simple. Next week, we will be running those functions on the device, so I would do that now, which means your kernel will need to call them, which means it will need to return their values.

  7. Find a more efficient strategy for running the simulations.

  8. Perform a particularly in-depth analysis of the code.

Writeup and Handin

To hand in your project, you will gather all of the necessary files into a proj08 directory under your turnin directory on the GPU host.

  1. Create a file named README.txt for your project write-up. Include a description of the process you used to determine that your code produces correct results. Also include the analysis outlined earlier. The more thorough the analysis, the higher your grade will be.
  2. You should hand in all code necessary to run your solutions. Place all necessary .cu, .cuh, .h, and Makefile files in the proj08 directory. Stephanie will probably want to compile and run the code. It should be possible to do so without looking for any more files.

Email Stephanie to let her know your project is ready for grading.