Python – How do I know if my embarrassingly parallel task is a good fit for a GPU?

How do I know if my embarrassingly parallel task is a good fit for a GPU?

In other words, are tasks that do fairly light computation per row, over a very large number of rows, simply a poor fit for the GPU?

I’m going to do some data processing on a table where the rows are independent, so the task is embarrassingly parallel. I have a GPU, so… a match made in heaven? The task is very similar to the example below, which computes a moving average of the entries in each row (rows are independent):

import numpy as np
from numba import guvectorize

@guvectorize(['void(float64[:], intp[:], float64[:])'], '(n),()->(n)')
def move_mean(a, window_arr, out):
    window_width = window_arr[0]
    asum = 0.0
    count = 0
    # growing window until it reaches full width
    for i in range(window_width):
        asum += a[i]
        count += 1
        out[i] = asum / count
    # full-width window: running sum, add the newest element, drop the oldest
    for i in range(window_width, len(a)):
        asum += a[i] - a[i - window_width]
        out[i] = asum / count

arr = np.arange(2000000, dtype=np.float64).reshape(200000, 10)
print(arr)
print(move_mean(arr, 3))

As in this example, the work per row is not very mathematical. Instead, it iterates over the row and performs some sums, assignments, and other scattered operations, with a bit of conditional logic mixed in.

I tried using guvectorize from the Numba library with the GPU target to run it on an Nvidia GPU. It works correctly, but I don’t get any speedup.

Is this type of task suitable for GPUs in principle? That is, if I dig deeper into Numba and start tuning threads, blocks, memory management, or the algorithm itself, should I theoretically see a speedup? Or does the problem simply not fit the architecture at all?

The answers to the following questions seem to indicate that it doesn’t fit, but I’m not quite convinced yet:

numba – guvectorize barely faster than jit

and numba guvectorize target='parallel' slower than target='cpu'

Solution

Your task is clearly memory-bound. That doesn’t mean the GPU can’t help, but the benefit may be less straightforward than for a compute-bound task.

Let’s look at common configurations and do some math:

  1. CPU–RAM memory bandwidth: approximately 24 GB/s
  2. CPU–GPU transfer bandwidth: approximately 8 GB/s
  3. GPU–device memory bandwidth: approximately 180 GB/s

Suppose the task requires transferring 24 GB of data. Then we get the following optimal times (whether and how these times can actually be reached is another question!):

  1. Scenario: CPU only. Time = 24 GB / 24 GB/s = 1 second.
  2. Scenario: the data must be transferred from the CPU to the GPU (24 GB / 8 GB/s = 3 seconds) and processed there (24 GB / 180 GB/s ≈ 0.13 seconds), for a total of about 3.1 seconds.
  3. Scenario: the data is already on the device, so only 24 GB / 180 GB/s ≈ 0.13 seconds are needed.
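The arithmetic above can be sketched in a few lines; the bandwidth figures are the illustrative assumptions used in this answer, not measurements:

```python
# Back-of-the-envelope optimal times for moving 24 GB through each path.
data_gb = 24.0
cpu_ram_bw = 24.0   # GB/s, CPU <-> RAM (assumed)
pcie_bw = 8.0       # GB/s, CPU <-> GPU transfer (assumed)
gpu_ram_bw = 180.0  # GB/s, GPU <-> device memory (assumed)

t_cpu_only = data_gb / cpu_ram_bw                               # scenario 1
t_transfer_and_gpu = data_gb / pcie_bw + data_gb / gpu_ram_bw   # scenario 2
t_gpu_resident = data_gb / gpu_ram_bw                           # scenario 3

print(t_cpu_only, t_transfer_and_gpu, t_gpu_resident)
```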

As you can see, there is potential for a speedup, but only in the third scenario, when the data is already resident on the GPU.

However, achieving maximum bandwidth is a challenging undertaking.

For example, when processing a matrix row by row on the CPU, you want the data in row-major (C) order to make full use of the L1 cache: reading one double actually loads a whole cache line of 8 doubles, and you don’t want them evicted before the remaining 7 have been processed.

On the GPU, on the other hand, you want memory accesses to be coalesced: thread 0 should access address 0, thread 1 address 1, and so on. For that, the data must be in column-major (Fortran) order.
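A minimal NumPy sketch of the two layouts; the values are identical, only the in-memory order differs:

```python
import numpy as np

# The same matrix in row-major (C) and column-major (Fortran) layout.
arr_c = np.arange(12, dtype=np.float64).reshape(3, 4)  # C order by default
arr_f = np.asfortranarray(arr_c)                       # copy in Fortran order

# In C order a row is contiguous in memory (friendly to per-row CPU loops);
# in Fortran order a column is contiguous (what coalesced GPU access across
# threads prefers when thread k handles row k).
print(arr_c.flags['C_CONTIGUOUS'], arr_f.flags['F_CONTIGUOUS'])
```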


One more thing to consider: how you measure performance. Your test array is only about 2 MB, small enough to fit in the L3 cache. L3 bandwidth depends on how many cores take part in the computation, but it is at least around 100 GB/s: not much slower than the GPU, and possibly much faster when the work is parallelized across CPU cores.

You need larger datasets so you aren’t fooled by caching effects.
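To sketch the point, here is a pure-NumPy version of the same moving mean (a cumulative-sum formulation of the kernel’s running sum) that can be timed on arrays much larger than any L3 cache; the array size below is an assumption chosen to exceed typical cache sizes:

```python
import time
import numpy as np

def move_mean_np(a, window):
    """Row-wise moving mean via cumulative sums (the same running-sum idea
    as the guvectorize kernel, vectorized in NumPy)."""
    c = np.cumsum(a, axis=1)
    out = np.empty_like(a)
    # growing window for the first `window` columns
    out[:, :window] = c[:, :window] / np.arange(1.0, window + 1)
    # full-width window: difference of cumulative sums
    out[:, window:] = (c[:, window:] - c[:, :-window]) / window
    return out

# ~160 MB of float64: large enough that the timing reflects RAM bandwidth
# rather than L3 cache bandwidth on most machines.
a = np.random.rand(2_000_000, 10)
t0 = time.perf_counter()
result = move_mean_np(a, 3)
print(f"{time.perf_counter() - t0:.3f} s")
```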


A somewhat off-topic comment: your algorithm is not very robust from a numerical point of view.

If the window width is 3, as in your example, but a row has about 10**4 elements, then the value for the last element is the result of roughly 10**4 additions and subtractions, each of which adds a rounding error. By comparison, computing that same window mean “naively” takes only three additions.

Of course, it may not matter (with only 10 elements per row, as in your example), but one day it may bite you…
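A small sketch of the effect: compare the running-sum mean for the last window of one long row against re-summing just those three values (the row length and random seed below are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(10**5)  # one long "row"
window = 3

# Running-sum moving mean, as in the kernel: the last value is the result
# of ~10**5 additions and subtractions, each contributing rounding error.
asum = a[:window].sum()
for i in range(window, len(a)):
    asum += a[i] - a[i - window]
running_last = asum / window

# "Naive" mean of the last window: only three additions, no accumulated error.
naive_last = a[-window:].mean()
print(abs(running_last - naive_last))
```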
