Python – PyCUDA kernel function

PyCUDA kernel function

I’m new to PyCUDA and am looking through some examples on the PyCUDA website. I’m trying to figure out the logic behind some lines of code, and I’d appreciate it if someone explained the idea behind them.

The following code snippet is from the PyCUDA website. Inside the kernel definition, I don’t understand this line:

int idx = threadIdx.x + threadIdx.y*4;

How is this line used to calculate the index into the array? Why are threadIdx.x and threadIdx.y added together, and why is threadIdx.y multiplied by 4?

Also, for the function call to the GPU, why is block defined as (5,5,1)? Since it is an array of 5×5 elements, my understanding is that the block size should be (5,5) rather than (5,5,1).

import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
a = numpy.random.randn(5,5)
a = a.astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
mod = SourceModule("""
__global__ void doubleMatrix(float *a)
{
int idx = threadIdx.x + threadIdx.y*4;
a[idx] *= 2;
}
""")
func = mod.get_function("doubleMatrix")
func(a_gpu, block=(5,5,1))
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print ("ORIGINAL MATRIX")
print a
print ("DOUBLED MATRIX AFTER PyCUDA EXECUTION")
print a_doubled

Solution

The example you posted appears to come from (or to be plagiarized in) a book called “Python Parallel Programming Cookbook,” which I had never heard of until five minutes ago. Honestly, if I were the author of that book, I would be ashamed to have included such an appallingly broken example.

Here is a small modification of the code you posted, along with its output:

import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
a = numpy.random.randn(5,5)
a = a.astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
mod = SourceModule("""
__global__ void doubleMatrix(float *a)
{
     int idx = threadIdx.x + threadIdx.y*4;
     a[idx] *= 2.f;
}
""")
func = mod.get_function("doubleMatrix")
func(a_gpu, block=(5,5,1))
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print a_doubled - 2.0*a

[Warning: Python 2 syntax].

In [2]: %run matdouble.py
[[ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.         -0.62060976  0.49836278 -1.60820103  1.71903515]]

i.e. the code does not work as expected, and this is probably the source of your confusion.
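
To see why, note that threadIdx.x and threadIdx.y each run from 0 to 4 in a (5,5,1) block, so idx = threadIdx.x + threadIdx.y*4 only ever produces flat offsets 0 through 20: the last four elements of the 5×5 array are never touched (hence the nonzero entries in the last row above), and some offsets are hit by two threads at once. A small numpy-only sketch of the flat-index arithmetic (my illustration, not part of the original answer):

import numpy

a = numpy.arange(25, dtype=numpy.float32).reshape(5, 5)
flat = a.ravel()      # the linear memory layout the kernel sees
lda = a.shape[-1]     # leading dimension of the row-major array: 5

# The correct row-major flat index multiplies the row by the leading dimension:
assert flat[2 + 3 * lda] == a[3, 2]

# The kernel multiplies by 4 instead, so thread (x=2, y=3) writes to flat
# offset 2 + 3*4 = 14, which is actually element a[2, 4], not a[3, 2]:
print(flat[2 + 3 * 4])   # 14.0, i.e. a[2, 4]

# Offsets reachable with the wrong stride stop at 20, so elements
# 21..24 (the last four of the 5x5 array) are never doubled:
print(max(x + y * 4 for x in range(5) for y in range(5)))   # 20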

The correct way to index a multidimensional array such as a numpy array stored in linear memory is described in this very recent answer. Any sensible programmer would write the kernel in your example something like this:

__global__ void doubleMatrix(float *a, int lda)
{
     int idx = threadIdx.x + threadIdx.y * lda;
     a[idx] *= 2.f;
}

so as to pass the leading dimension of the array as an argument to the kernel (in this case it should be 5, not 4). Doing so results in the following:

import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
a = numpy.random.randn(5,5)
a = a.astype(numpy.float32)        # the kernel works on 32-bit floats
a_gpu = cuda.mem_alloc(a.nbytes)   # allocate device memory
cuda.memcpy_htod(a_gpu, a)         # copy host -> device
mod = SourceModule("""
__global__ void doubleMatrix(float *a, int lda)
{
     int idx = threadIdx.x + threadIdx.y * lda;
     a[idx] *= 2.f;
}
""")
func = mod.get_function("doubleMatrix")
lda = numpy.int32(a.shape[-1])     # leading dimension of the row-major array: 5
func(a_gpu, lda, block=(5,5,1))
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu) # copy device -> host
print a_doubled - 2.0*a

which produces the expected result:

In [3]: %run matdouble.py
[[ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]]
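
As an aside, PyCUDA can also manage the host/device transfers for you: wrapping the array in pycuda.driver.InOut copies it to the device, runs the kernel, and copies the result back into the same array. A minimal sketch, reusing the corrected func and assuming the same setup as above:

a = numpy.random.randn(5, 5).astype(numpy.float32)
lda = numpy.int32(a.shape[-1])
# InOut handles the memcpy_htod / memcpy_dtoh pair; 'a' is doubled in place:
func(cuda.InOut(a), lda, block=(5, 5, 1))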

For the function call to the GPU, why is block defined as (5,5,1)? Since it is an array of 5×5 elements, in my understanding the block size should be (5,5) instead of (5,5,1).

In CUDA, all blocks implicitly have three dimensions. A block size of (5,5) is the same as a block size of (5,5,1); the last dimension can be ignored because it is one (i.e. all threads in the block have threadIdx.z = 0). The trap you shouldn’t fall into is confusing the dimensions of a CUDA block or grid with the dimensions of the input array. It is sometimes convenient to make them the same, but it is neither necessary nor always advisable to do so. For this example (assuming row-major storage order), a correctly written, BLAS-style general-purpose kernel might look like this:

__global__ void doubleMatrix(float *a, int m, int n, int lda)
{
     int col = threadIdx.x + blockIdx.x * blockDim.x;
     int row = threadIdx.y + blockIdx.y * blockDim.y;

     for (int r = row; r < m; r += blockDim.y * gridDim.y) {
         for (int c = col; c < n; c += blockDim.x * gridDim.x) {
             int idx = c + r * lda;
             a[idx] *= 2.f;
         }
     }
}

[Note: Written in the browser, not compiled or tested].

Here, any legal block and grid dimensions will correctly process an input array of any size, as long as its total number of elements fits into a signed 32-bit integer. If you run too many threads, some of them will do nothing. If you run too few, some threads will process multiple array elements. If you run a grid with the same dimensions as the input array, each thread processes exactly one element, which is the intent of the example you are studying. If you want to read about how to choose the most suitable block and grid sizes, I recommend you start here.
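
For completeness, here is a sketch of how the BLAS-style kernel above could be launched from PyCUDA with block and grid dimensions that deliberately do not match the array shape; the 100×37 array and the (2,2,1) grid are arbitrary choices of mine for illustration, not part of the original answer:

import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy

mod = SourceModule("""
__global__ void doubleMatrix(float *a, int m, int n, int lda)
{
     int col = threadIdx.x + blockIdx.x * blockDim.x;
     int row = threadIdx.y + blockIdx.y * blockDim.y;

     for (int r = row; r < m; r += blockDim.y * gridDim.y) {
         for (int c = col; c < n; c += blockDim.x * gridDim.x) {
             a[c + r * lda] *= 2.f;
         }
     }
}
""")
func = mod.get_function("doubleMatrix")

a = numpy.random.randn(100, 37).astype(numpy.float32)
expected = 2.0 * a
m = numpy.int32(a.shape[0])
n = numpy.int32(a.shape[1])
lda = numpy.int32(a.shape[-1])

# A 16x16 block on a 2x2 grid is only 1024 threads for 3700 elements,
# so each thread walks over several elements via the grid-stride loops:
func(cuda.InOut(a), m, n, lda, block=(16, 16, 1), grid=(2, 2, 1))
print(numpy.allclose(a, expected))   # True: every element was doubled once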
