CS 4370/6370 Fall 2023 Program Assignment # 2 – Tiled Matrix Multiplication Solution in CUDA

R K Gaur
Oct 9, 2022
5 min read

Updated: Jul 11

CS 4370/6370 Fall 2023 Program Assignment # 2

1. The objective

The objective of this programming assignment is to implement tiled matrix multiplication, and get a better understanding of shared memory.

2. Submission

A team can have up to 3 students. All students in the same team will receive the same grade.

a. Each team only submits one copy of your programming assignment, including cuda source program, readme, and a report.

b. Each team member needs to submit the list of the names of all team members.

3. Project description: Tiled Matrix Multiplication

In this project, you will develop a complete CUDA program for tiled matrix multiplication. You will multiply two two-dimensional matrices A,B on the device GPU. After the device matrix multiplication is invoked, your program will also compute the correct solution matrix using the CPU, and compare that solution with the device-computed solution. If it matches (within a certain tolerance, i.e., 0.000001), then it will print out "Test PASSED" to the screen before exiting.

The pseudo code for matrix multiplication on the CPU is as follows:

void MatrixMulOnHost(float* M, float* N, float* P, int Width)‏

{

for (int row = 0; row < Width; ++row)‏

for (int col = 0; col < Width; ++col) {

double sum = 0;

for (int k = 0; k < Width; ++k) {

float a = M[row * Width + k];

float b = N[k * Width + col];

sum += a * b;

}

P[row * Width + col] = sum;

}

The pseudo code for tiled matrix multiplication on the GPU device is as follows:

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)

{

__shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];

__shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];

int bx = blockIdx.x; int by = blockIdx.y;

int tx = threadIdx.x; int ty = threadIdx.y;

// Identify the row and column of the Pd element to work on

int Row = by * TILE_WIDTH + ty;

int Col = bx * TILE_WIDTH + tx;

float Pvalue = 0;

// Loop over the Md and Nd tiles required to compute the Pd element

for (int m = 0; m < Width/TILE_WIDTH; ++m) {

// Coolaborative loading of Md and Nd tiles into shared memory

ds_M[ty][tx] = d_M[Row*Width + m*TILE_WIDTH+tx];

ds_N[ty][tx] = d_N[Col+(m*TILE_WIDTH+ty)*Width];

__syncthreads();

for (int k = 0; k < TILE_WIDTH; ++k)

Pvalue += ds_M[ty][k] * ds_N[k][tx];

__synchthreads();

}

d_P[Row*Width+Col] = Pvalue;

}

void main() {

dim3 dimBlock(blocksize,blocksize);

dim3 dimGrid( ceiling (double (N) /dimBlock.x), ceiling (double (N) /dimBlock.y) );

MatrixMulKernel <<<dimGrid,dimBlock>>>(a,b,c,N);

}

Use the following pseudo code for matrix initialization.

int *a, *b, *c;

A=malloc(sizeof(int)*N*N; //N is the size

//then malloc for b and c

Int init =1325;

For (i=0;i<N;i++){

For (j=0;j<N;j++){

Init=3125*init%6553;

A[i][j]=(init-1000)%6553;

B[i][j]=Init%251;

}

Use the following matrix size and thread block size (the number of threads in each block).

Use the following pseudo code for result verification, with a within a specified tolerance, i.e., 0.000001.

bool verify(float *A, float *B, float *C, int width) {

const float relativeTolerance = 1e-6; // 1e-6 = 0.000001

for(int row = 0; row < width; ++row) {

for(int col = 0; col < width; ++col) {

float sum = 0;

for(unsigned int k = 0; k < width; ++k) {

sum += A[row*width + k]*B[k*width + col];

}

float relativeError = (sum - C[row*width + col])/sum;

if (relativeError > relativeTolerance

|| relativeError < -relativeTolerance) {

printf("TEST FAILED\n\n");

return false;

}

printf("TEST PASSED\n\n");

return true;

}

Record your runtime with respect to different input matrix sizes as shown in the following table, and compute the speed up using the GPU computation time, and the CPU computation time.

Optional: You can also include the memory transfer time between CPU and GPU in the GPU computation time (In that case, it might be fair to also include the time for matrix initialization in the CPU computation time), and re-compute the speedup.

I will post a sample program for you to reference how to measure CPU running time, GPU execution time, and the memory transfer time between CPU and GPU. The method used in the sample program is not the only way to measure the running time on CPU. If you use different method to measure CPU execution time, it is totally acceptable. (Don’t use CUDA event to measure CPU execution time.)

Explore the sidebar on the left. You can add media to add color to your post. You can also change the cover image, optimize the SEO Settings, add categories or tags, and much more. Spend some time discovering each tab and learn how you can enhance your future posts.

Requirements:

1. In order to use the cuda compiler environment installed under the cs unix server, fry.cs.wright.edu, you need to connect to this unix server remotely using a secure shell client, such as putty. You can remotely connect to this unix server, fry.cs.wright.edu, on campus from a Wright State computer or use your own laptop connecting to the WSU wifi network named “WSU-Secure”. Note that you cannot remotely connect to this computer using ssh using computers outside Wright State University without installing VPN or use the campus “WSU_EZ_CONNECT” wifi network. If you want to connect to this server remotely off campus, you need to install VPN on your computer first. If you want to edit your cuda source programs under windows, download notepad++. Then edit your source programs using notepad++. After you finish editing the cuda source programs, using the secure file transfer client (WinSCP, you can download it online, and install it on your personal computer) to transfer your cuda source programs to fry.cs.wright.edu.

2. You must submit an ELECTRONIC COPY of your source program through Pilot before the due date. If for some reason Pilot is unavailable, submit your source code by email to meilin.liu@wright.edu.

3. Submit all your source codes, a README file, a report, and any other required files. It is required that you explain how to compile and run your programs clearly in the README file. In the report, please report whether your programs have all the functionalities required in the project description. In your report, please state clearly any functionalities not implemented in your program. If your program works correctly, please include screenshots in your report. Your submitted file name should use your last name as part of the file name, for example, Liu_Project1.cpp, Liu_Project1_Report, Liu_Project1_ReadMe, etc. All the submitted project files should have: Course Number / Course Title, your name, group member’s name, prof.’s name, date, and the project name. If you did not include these required contents in your submitted files, then 5 points will be deducted.

4. The grader or the instructor will test your programs under CUDA environment, on the linux server, fry.cs.wright.edu. Before you submit your program, please connect to this server using your campus ID to test your program (I have demoed how to compile and execute a cuda program on this server. If you have questions, let me know).

5. The programming assignment is individual. You must finish the project by yourself. If you allow others to copy your programs or answers, you will get the same punishment as those who copy yours.

How to use CUDA on fry.cs.wright.edu

First using putty or other secure shell clients to connect to fry.cs.wright.edu using your campus id (for example, w123abc), then run the following command:

srun -p a100 --gres=gpu:1 --pty bash

or srun -p p100 --gres=gpu:1 --pty bash

This command will request access to a gpu node and launch a bash shell on it.

Then you can compile a cuda program vectadd.cu using the following command under the directory where your source cuda program is located.

nvcc vectadd.cu -o vectadd

Then you can execute vectadd using the following command under the directory where the generated executable file (of your cuda source program), vectadd, is located.

./vectadd