CS 4370/6370 Program Assignment 2 - Tiled Matrix Multiplication Solution in CUDA
- R K Gaur

- Jul 11
Updated: Sep 4
1. The Objective
The goal of this programming assignment is to implement tiled matrix multiplication. This will help you gain a better understanding of shared memory.
2. Submission Guidelines
A team can consist of up to three students. All members of the same team will receive the same grade.
Each team must submit one copy of the programming assignment. This includes the CUDA source program, a README file, and a report.
Each team member must submit a list of all team members' names.
3. Project Description: Tiled Matrix Multiplication
In this project, you will develop a complete CUDA program for tiled matrix multiplication. You will multiply two two-dimensional matrices, A and B, on the GPU (the device). After the device computation completes, your program will compute the reference solution matrix on the CPU and compare it with the device-computed solution. If the two match within a tolerance of 0.000001, the program will print "Test PASSED" before exiting.
Pseudo Code for Matrix Multiplication on the CPU
```c
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int row = 0; row < Width; ++row)
        for (int col = 0; col < Width; ++col) {
            double sum = 0;  // accumulate in double for accuracy
            for (int k = 0; k < Width; ++k) {
                float a = M[row * Width + k];  // element in row `row` of M
                float b = N[k * Width + col];  // element in column `col` of N
                sum += a * b;
            }
            P[row * Width + col] = (float)sum;
        }
}
```
Pseudo Code for Tiled Matrix Multiplication on the GPU
```c
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    // Identify the row and column of the P element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;
    float Pvalue = 0;
    // Loop over the M and N tiles required to compute the P element
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // Collaborative loading of M and N tiles into shared memory
        ds_M[ty][tx] = d_M[Row * Width + m * TILE_WIDTH + tx];
        ds_N[ty][tx] = d_N[(m * TILE_WIDTH + ty) * Width + Col];
        __syncthreads();  // wait until both tiles are fully loaded
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += ds_M[ty][k] * ds_N[k][tx];
        __syncthreads();  // wait before the tiles are overwritten
    }
    d_P[Row * Width + Col] = Pvalue;
}
int main() {
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    dim3 dimGrid((int)ceil((double)N / dimBlock.x), (int)ceil((double)N / dimBlock.y));
    MatrixMulKernel<<<dimGrid, dimBlock>>>(a, b, c, N);
    return 0;
}
```
Matrix Initialization Pseudo Code
```c
int *a, *b, *c;
a = (int*)malloc(sizeof(int) * N * N); // N is the matrix width
// Then malloc for b and c the same way
int init = 1325;
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        init = 3125 * init % 6553;
        a[i * N + j] = (init - 1000) % 6553;
        b[i * N + j] = init % 251;
    }
}
```
Matrix Size and Thread Block Size

Result Verification Pseudo Code
```c
bool verify(float* A, float* B, float* C, int width) {
    const float relativeTolerance = 1e-6; // 1e-6 = 0.000001
    for (int row = 0; row < width; ++row) {
        for (int col = 0; col < width; ++col) {
            float sum = 0;
            for (int k = 0; k < width; ++k) {
                sum += A[row * width + k] * B[k * width + col];
            }
            float relativeError = (sum - C[row * width + col]) / sum;
            if (relativeError > relativeTolerance || relativeError < -relativeTolerance) {
                printf("TEST FAILED\n\n");
                return false;
            }
        }
    }
    printf("TEST PASSED\n\n");
    return true;
}
```
Record your runtime for different input matrix sizes, as shown in the following table. Compute the speedup as the CPU computation time divided by the GPU computation time.
Optional Considerations
You can include the memory transfer time between the CPU and GPU in the GPU computation time. In that case, it might be fair to also include the time for matrix initialization in the CPU computation time and recompute the speedup.
I will provide a sample program for reference on how to measure CPU running time, GPU execution time, and memory transfer time between CPU and GPU. The method used in the sample program is not the only way to measure running time on the CPU. If you use a different method to measure CPU execution time, it is acceptable. (Do not use CUDA events to measure CPU execution time.)
Requirements
To use the CUDA compiler environment installed on the CS Unix server, fry.cs.wright.edu, connect to the server remotely using a secure shell client such as PuTTY. On campus, you can connect from a Wright State computer or from your laptop on the WSU wifi network "WSU-Secure." Note that you cannot connect using SSH from outside Wright State University without installing VPN or using the campus "WSU_EZ_CONNECT" wifi network; if you want to connect remotely from off campus, install the VPN on your computer first. If you want to edit your CUDA source programs under Windows, you can use an editor such as Notepad++, then transfer the files to fry.cs.wright.edu with a secure file transfer client such as WinSCP.
You must submit an ELECTRONIC COPY of your source program through Pilot before the due date. If Pilot is unavailable, submit your source code via email to meilin.liu@wright.edu.
Submit all source code, a README file, a report, and any other required files. Clearly explain how to compile and run your programs in the README file. In your report, state whether your programs have all the functionalities required in the project description; if any functionality is not implemented, state that clearly. If your program works correctly, include screenshots in your report. Your submitted file names should include your last name, e.g., Liu_Project1.cpp, Liu_Project1_Report, Liu_Project1_ReadMe, etc. All submitted project files should include: the course number/course title, your name, your group members' names, the professor's name, the date, and the project name. If you do not include these required contents, you will lose 5 points.
The grader or instructor will test your programs in the CUDA environment on the Linux server, fry.cs.wright.edu. Before submitting your program, connect to this server using your campus ID to test it (I have demonstrated how to compile and execute a CUDA program on this server. If you have questions, let me know).
This programming assignment is individual. You must complete the project by yourself. If you allow others to copy your programs or answers, you will face the same punishment as those who copied yours.
How to Use CUDA on fry.cs.wright.edu
First, connect to fry.cs.wright.edu using your campus ID (for example, w123abc) with PuTTY or another secure shell client. Then run the following command:
```bash
srun -p a100 --gres=gpu:1 --pty bash
```
or
```bash
srun -p p100 --gres=gpu:1 --pty bash
```
This command requests access to a GPU node and launches a bash shell.
Next, compile a CUDA program, `vectadd.cu`, using the following command in the directory where your source CUDA program is located:
```bash
nvcc vectadd.cu -o vectadd
```
Then run the generated executable, `vectadd`, from the directory where it was created:
```bash
./vectadd
```