
# CS 4370/6370 Fall 2023 Program Assignment # 2 – Tiled Matrix Multiplication Solution in CUDA

Updated: Oct 3


1. The objective

The objective of this programming assignment is to implement tiled matrix multiplication and to gain a better understanding of shared memory.

2. Submission

A team can have up to 3 students. All students in the same team will receive the same grade.

a. Each team submits only one copy of the programming assignment, including the CUDA source program, a readme, and a report.

b. Each team member needs to submit the list of the names of all team members.

3. Project description: Tiled Matrix Multiplication

In this project, you will develop a complete CUDA program for tiled matrix multiplication. You will multiply two two-dimensional matrices A and B on the GPU device. After the device matrix multiplication finishes, your program will also compute the correct solution matrix on the CPU and compare that solution with the device-computed solution. If the solutions match (within a specified tolerance, i.e., 0.000001), the program prints "Test PASSED" to the screen before exiting.

The pseudo code for matrix multiplication on the CPU is as follows:

```c
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int row = 0; row < Width; ++row) {
        for (int col = 0; col < Width; ++col) {
            double sum = 0;
            for (int k = 0; k < Width; ++k) {
                float a = M[row * Width + k];  // row-major: element (row, k) of M
                float b = N[k * Width + col];  // row-major: element (k, col) of N
                sum += a * b;
            }
            P[row * Width + col] = sum;
        }
    }
}
```

The pseudo code for tiled matrix multiplication on the GPU device is as follows:

```cuda
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the d_P element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;

    // Loop over the d_M and d_N tiles required to compute the d_P element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaboratively load one tile of d_M and one tile of d_N into shared memory
        ds_M[ty][tx] = d_M[Row*Width + m*TILE_WIDTH + tx];
        ds_N[ty][tx] = d_N[(m*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();  // wait until the whole tile is loaded

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += ds_M[ty][k] * ds_N[k][tx];
        __syncthreads();  // wait before the tile is overwritten in the next iteration
    }

    d_P[Row*Width + Col] = Pvalue;
}

int main() {
    dim3 dimBlock(blocksize, blocksize);
    dim3 dimGrid((int)ceil((double)N / dimBlock.x), (int)ceil((double)N / dimBlock.y));
    MatrixMulKernel<<<dimGrid, dimBlock>>>(a, b, c, N);
}
```
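The `main()` above is only a skeleton: it omits device memory allocation and host–device transfers. A minimal sketch of a full host driver might look like the following (variable names such as `h_A`/`d_A` and the tile width of 16 are illustrative assumptions, not prescribed by the assignment):

```cuda
#define TILE_WIDTH 16

int main(void)
{
    int N = 1024;                              // one of the assigned matrix widths
    size_t bytes = (size_t)N * N * sizeof(float);

    float *h_A = (float*)malloc(bytes);
    float *h_B = (float*)malloc(bytes);
    float *h_C = (float*)malloc(bytes);
    // ... initialize h_A and h_B with the generator described below ...

    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    dim3 dimGrid((N + TILE_WIDTH - 1) / TILE_WIDTH,
                 (N + TILE_WIDTH - 1) / TILE_WIDTH);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, N);
    cudaDeviceSynchronize();

    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    // ... run the CPU multiply and verify() here ...

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
```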

Use the following pseudo code for matrix initialization.

```c
float *a, *b, *c;
a = (float*)malloc(sizeof(float) * N * N);  // N is the matrix width
// then malloc for b and c the same way

int init = 1325;
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        init = 3125 * init % 6553;
        a[i * N + j] = (init - 1000) % 6553;
        b[i * N + j] = init % 251;
    }
}
```

Use the following matrix size and thread block size (the number of threads in each block). Use the following pseudo code for result verification, within a specified tolerance, i.e., 0.000001.

```c
bool verify(float *A, float *B, float *C, int width) {
    const float relativeTolerance = 1e-6;  // 1e-6 = 0.000001

    for (int row = 0; row < width; ++row) {
        for (int col = 0; col < width; ++col) {
            float sum = 0;
            for (int k = 0; k < width; ++k) {
                sum += A[row*width + k] * B[k*width + col];
            }
            float relativeError = (sum - C[row*width + col]) / sum;
            if (relativeError > relativeTolerance ||
                relativeError < -relativeTolerance) {
                printf("TEST FAILED\n\n");
                return false;
            }
        }
    }
    printf("TEST PASSED\n\n");
    return true;
}
```

Record your runtime for the different input matrix sizes shown in the following table, and compute the speedup from the GPU computation time and the CPU computation time.

Optional: You can also include the memory transfer time between CPU and GPU in the GPU computation time (In that case, it might be fair to also include the time for matrix initialization in the CPU computation time), and re-compute the speedup.

I will post a sample program showing how to measure the CPU running time, the GPU execution time, and the memory-transfer time between the CPU and GPU. The method used in the sample program is not the only way to measure running time on the CPU; using a different method to measure CPU execution time is perfectly acceptable. (Do not use CUDA events to measure CPU execution time.)
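Until the sample program is posted, one common approach (an assumption, not the required method) is `clock_gettime` for the CPU side and CUDA event timers for the kernel; the variable names below are illustrative:

```cuda
#include <time.h>

// CPU timing with clock_gettime (any wall-clock method is acceptable)
struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
MatrixMulOnHost(h_A, h_B, h_ref, N);
clock_gettime(CLOCK_MONOTONIC, &t1);
double cpu_ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;

// GPU timing with CUDA events (do NOT use these for CPU code)
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, N);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float gpu_ms = 0;
cudaEventElapsedTime(&gpu_ms, start, stop);

// speedup = CPU time / GPU time
```

To also count transfer time, record the start event before the host-to-device `cudaMemcpy` calls and the stop event after the device-to-host copy.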

Requirements:

2. You must submit an ELECTRONIC COPY of your source program through Pilot before the due date. If for some reason Pilot is unavailable, submit your source code by email to meilin.liu@wright.edu.

4. The grader or the instructor will test your programs in the CUDA environment on the Linux server fry.cs.wright.edu. Before you submit your program, connect to this server using your campus ID and test your program there (I have demoed how to compile and execute a CUDA program on this server; if you have questions, let me know).

5. The programming assignment is individual. You must finish the project by yourself. If you allow others to copy your programs or answers, you will get the same punishment as those who copy yours.

How to use CUDA on fry.cs.wright.edu

First, use PuTTY or another secure-shell client to connect to fry.cs.wright.edu with your campus ID (for example, w123abc), then run the following command:

```shell
srun -p a100 --gres=gpu:1 --pty bash
```

or

```shell
srun -p p100 --gres=gpu:1 --pty bash
```

This command requests access to a GPU node and launches a bash shell on it.

Then you can compile a CUDA program, vectadd.cu, using the following command in the directory where your CUDA source program is located.

Then you can execute vectadd using the following command in the directory containing the generated executable.
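The exact commands were not included above; on a typical CUDA installation the compile-and-run steps would look something like this (flags are illustrative, check the class demo for the exact invocation):

```shell
# Compile vectadd.cu into an executable named vectadd
nvcc -o vectadd vectadd.cu

# Run it from the same directory
./vectadd
```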
