**Question 1 (10 points).** For the two-dimensional matrix addition kernel C = A + B, assume that the matrices are all 1024×1024 and that each thread calculates one output element. Also assume that the thread block size is 16×16. If a particular thread has blockIdx.x = 5, blockIdx.y = 8, threadIdx.x = 6, and threadIdx.y = 7, what are the row and column of the output array element this thread will compute?
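The index arithmetic behind this question can be sketched on the host. The following is a minimal C++ illustration, assuming the usual convention that the y indices select the row and the x indices select the column; `Dim2`, `Element`, and `outputElement` are hypothetical names introduced here, not part of CUDA.

```cpp
#include <cassert>

// Hypothetical host-side stand-ins for CUDA's built-in index variables.
struct Dim2 { int x; int y; };
struct Element { int row; int col; };

// Mirrors the usual in-kernel mapping (an assumed convention):
//   row = blockIdx.y * blockDim.y + threadIdx.y
//   col = blockIdx.x * blockDim.x + threadIdx.x
Element outputElement(Dim2 blockIdx, Dim2 blockDim, Dim2 threadIdx) {
    // A thread index must lie inside its block.
    assert(threadIdx.x >= 0 && threadIdx.x < blockDim.x);
    assert(threadIdx.y >= 0 && threadIdx.y < blockDim.y);
    return { blockIdx.y * blockDim.y + threadIdx.y,
             blockIdx.x * blockDim.x + threadIdx.x };
}
```

Calling `outputElement` with the block index, block dimensions, and thread index from the question gives the element that thread would compute under this convention.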

**Question 2 (5 points).** If a CUDA device’s SM can take up to 1536 threads and up to 4 thread blocks, which of the following block configurations would result in the largest number of threads in the SM?

(1) 128 threads per block

(2) 256 threads per block

(3) 512 threads per block

(4) 1024 threads per block
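One way to compare the configurations is to compute how many whole blocks fit under both limits. The sketch below is a hedged host-side C++ illustration; `residentThreads` is a hypothetical helper, and it assumes the thread and block limits named in the question are the only constraints (real devices also limit registers and shared memory).

```cpp
#include <algorithm>
#include <cassert>

// Threads resident on one SM when every block has `threadsPerBlock` threads.
// Assumption: only the two limits from the question apply.
int residentThreads(int threadsPerBlock, int maxThreadsPerSM, int maxBlocksPerSM) {
    assert(threadsPerBlock > 0);
    // Only whole blocks can be scheduled, so integer division rounds down.
    int blocksByThreads = maxThreadsPerSM / threadsPerBlock;
    // The block-count limit may be the tighter of the two.
    int blocks = std::min(blocksByThreads, maxBlocksPerSM);
    return blocks * threadsPerBlock;
}
```

Evaluating `residentThreads(size, 1536, 4)` for each listed block size shows which limit binds in each case.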

**Question 3 (5 points).** For vector addition, assume that the vector length is 2000, each thread calculates one output element, and the thread block size is 512 threads. How many thread blocks will be launched? How many threads will be in the grid?
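The block count for a one-thread-per-element launch is commonly obtained by ceiling division. A minimal host-side C++ sketch follows; `LaunchSize` and `launchSize` are hypothetical names used only for illustration.

```cpp
#include <cassert>

struct LaunchSize { int blocks; int threads; };

// Smallest number of blocks that covers every element, and the total
// number of threads that launch creates (some may be idle at the tail).
LaunchSize launchSize(int numElements, int threadsPerBlock) {
    assert(numElements > 0 && threadsPerBlock > 0);
    // Ceiling division: round up so the last partial block is included.
    int blocks = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    return { blocks, blocks * threadsPerBlock };
}
```

The difference between `threads` and `numElements` is the number of threads that the kernel's range check must disable.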

**Question 4 (20 points).** We are to process a 62×80 picture (80 pixels in the x or horizontal direction, 62 pixels in the y or vertical direction) with PictureKernel(). That is, m’s value is 62 and n’s value is 80.

    __global__ void PictureKernel(float* d_Pin, float* d_Pout, int n, int m) {
        // Calculate the row # of the d_Pin and d_Pout element to process
        int Row = blockIdx.y*blockDim.y + threadIdx.y;

        // Calculate the column # of the d_Pin and d_Pout element to process
        int Col = blockIdx.x*blockDim.x + threadIdx.x;

        // each thread computes one element of d_Pout if in range
        if ((Row < m) && (Col < n)) {
            d_Pout[Row*n+Col] = 2*d_Pin[Row*n+Col];
        }
    }

(1) Assume that we decided to use 16×16 thread blocks. That is, each block is organized as a 2D 16×16 array of threads. How many warps will be generated during the execution of the kernel? How many warps will have control divergence?

(2) If we are to process an 80×61 picture (61 pixels in the x or horizontal direction and 80 pixels in the y or vertical direction), how many warps will have control divergence?

(3) If we are to process a 76×61 picture (61 pixels in the x direction and 76 pixels in the y direction), how many warps will have control divergence?
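Warp counts of this kind can be checked by brute force on the host. The sketch below is a hedged C++ illustration, not device code: it assumes a warp size of 32, assumes threads in a block are linearized with threadIdx.x varying fastest, and counts a warp as divergent when some but not all of its threads pass the `(Row < m) && (Col < n)` test. `WarpCounts` and `countWarps` are hypothetical names.

```cpp
#include <cassert>

struct WarpCounts { int total; int divergent; };

// n = picture width (x), m = picture height (y), bx/by = block dimensions.
WarpCounts countWarps(int n, int m, int bx, int by) {
    const int warpSize = 32;  // assumed warp size
    assert(n > 0 && m > 0 && bx > 0 && by > 0);
    int gridX = (n + bx - 1) / bx;  // blocks needed to cover the width
    int gridY = (m + by - 1) / by;  // blocks needed to cover the height
    int threadsPerBlock = bx * by;
    int warpsPerBlock = (threadsPerBlock + warpSize - 1) / warpSize;
    WarpCounts counts = { gridX * gridY * warpsPerBlock, 0 };

    for (int blockY = 0; blockY < gridY; ++blockY) {
        for (int blockX = 0; blockX < gridX; ++blockX) {
            for (int w = 0; w < warpsPerBlock; ++w) {
                int lanes = 0;   // threads that exist in this warp
                int active = 0;  // threads that pass the range check
                for (int lane = 0; lane < warpSize; ++lane) {
                    int linear = w * warpSize + lane;
                    if (linear >= threadsPerBlock) break;  // partial last warp
                    ++lanes;
                    int row = blockY * by + linear / bx;
                    int col = blockX * bx + linear % bx;
                    if (row < m && col < n) ++active;
                }
                // Divergent only if the warp's threads disagree on the branch.
                if (active > 0 && active < lanes) ++counts.divergent;
            }
        }
    }
    return counts;
}
```

Note that a warp whose threads all fail the range check is not counted as divergent, since every thread takes the same path.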

**Question 5 (10 points).** Assume the following simple matrix multiplication kernel:

    __global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
    {
        int Row = blockIdx.y*blockDim.y+threadIdx.y;
        int Col = blockIdx.x*blockDim.x+threadIdx.x;

        if ((Row < Width) && (Col < Width)) {
            float Pvalue = 0;
            for (int k = 0; k < Width; ++k) {
                Pvalue += M[Row*Width+k] * N[k*Width+Col];
            }
            P[Row*Width+Col] = Pvalue;
        }
    }

If we launch the kernel with a block size of 32×32 on a 1000×1000 matrix, how many warps will have control divergence?

**Question 6 (15 points).** What is the maximum speedup possible according to Amdahl's Law for a program that is 20% inherently serial and 80% parallelizable using N processors for the values of N below? Show your work.

(a) N=10 processors.

(b) N=30 processors.

(c) What is the maximum speedup possible?
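Amdahl's Law is usually written as speedup = 1 / (s + (1 - s) / N), where s is the serial fraction and N is the number of processors. The following is a minimal C++ sketch of that formula and its limit as N grows without bound; `amdahlSpeedup` and `amdahlLimit` are hypothetical helper names.

```cpp
#include <cassert>

// Speedup predicted by Amdahl's Law for a serial fraction s on N processors.
double amdahlSpeedup(double serialFraction, int processors) {
    assert(serialFraction >= 0.0 && serialFraction <= 1.0);
    assert(processors >= 1);
    return 1.0 / (serialFraction + (1.0 - serialFraction) / processors);
}

// Upper bound on speedup as N grows without bound: the parallel term
// vanishes, leaving 1 / s.
double amdahlLimit(double serialFraction) {
    assert(serialFraction > 0.0 && serialFraction <= 1.0);
    return 1.0 / serialFraction;
}
```

Parts (a) and (b) correspond to `amdahlSpeedup` with the given N, and part (c) corresponds to `amdahlLimit`.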

CS 4370/6370 Fall 2022 Homework #1 Solution
