Linux – Threaded implementations degrade performance


I implemented a small program in C to calculate PI using the Monte Carlo method (mainly out of personal interest and for practice). After implementing the basic code structure, I added a command-line option that allows the calculation to be run in threads.

I expected a huge increase in speed, but I was disappointed. First, to make the command-line semantics clear: the final number of iterations used to approximate PI is the product of the -iterations and -threads values passed on the command line. Leaving -threads blank defaults it to 1 thread, so the calculation runs in the main thread.

The following tests were performed with a total of 80 million iterations.

On Windows 7 64-bit (an Intel Core2Duo machine):

Windows Stats

Compiled with Cygwin GCC 4.5.3: gcc-4 pi.c -o pi.exe -O3

On Ubuntu/Linaro 12.04 (8-core AMD):

Linux Stats

Compiled with GCC 4.6.3: gcc pi.c -lm -lpthread -o3 -o pi

Performance

On Windows, the threaded version is a few milliseconds faster than the non-threaded one. Honestly, I expected better performance. On Linux, ouch! Did I do something wrong? Why does it take 2000% of the time? Of course a lot depends on the implementation, so here it is. An excerpt from after the command-line arguments have been parsed and the calculation begins:

    // Begin computation.
    clock_t t_start, t_delta;
    double pi = 0;
    int i;

    if (args.threads == 1) {
        // Single-threaded path: run everything in the main thread.
        t_start = clock();
        pi = pi_mc(args.iterations);
        t_delta = clock() - t_start;
    }
    else {
        pthread_t* threads = malloc(sizeof(pthread_t) * args.threads);
        if (!threads) {
            return alloc_failed();
        }

        struct PIThreadData* values = malloc(sizeof(struct PIThreadData) * args.threads);
        if (!values) {
            free(threads);
            return alloc_failed();
        }

        // Spawn one worker per requested thread, each doing args.iterations.
        t_start = clock();
        for (i = 0; i < args.threads; i++) {
            values[i].iterations = args.iterations;
            values[i].out = 0.0;
            pthread_create(threads + i, NULL, pi_mc_threaded, values + i);
        }
        // Collect and sum the per-thread estimates.
        for (i = 0; i < args.threads; i++) {
            pthread_join(threads[i], NULL);
            pi += values[i].out;
        }
        t_delta = clock() - t_start;

        free(threads);
        threads = NULL;
        free(values);
        values = NULL;

        // Average the estimates from all threads.
        pi /= (double) args.threads;
    }

whereas pi_mc_threaded() is implemented as:

    struct PIThreadData {
        int iterations;   // number of iterations this thread should run
        double out;       // the thread's PI estimate
    };

    void* pi_mc_threaded(void* ptr) {
        struct PIThreadData* data = ptr;
        data->out = pi_mc(data->iterations);
        return NULL;      // the result is passed back through data->out
    }

You can find the full source code at http://pastebin.com/jptBTgwr.
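For context, a Monte Carlo PI estimator of this kind draws random points in the unit square and counts how many land inside the quarter circle, which means rand() is called in every iteration of the hot loop. The following is only a minimal sketch of such a function under that assumption, not necessarily the pi_mc() from the pastebin:

    #include <stdlib.h>

    // Sketch of a Monte Carlo PI estimator. The key detail for what follows
    // is that rand() is called twice per iteration, inside the hot loop.
    double pi_mc(int iterations) {
        int inside = 0;
        for (int i = 0; i < iterations; i++) {
            double x = (double) rand() / RAND_MAX;
            double y = (double) rand() / RAND_MAX;
            if (x * x + y * y <= 1.0)
                inside++;
        }
        return 4.0 * (double) inside / (double) iterations;
    }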

Question

Why is that? Why this extreme difference on Linux? I expected the calculation to take at most about 3/4 of the original time. Of course, I may simply be using the pthread library incorrectly; it would be great to get clarification on how to do this correctly.

Solution

The problem is that in the glibc implementation, rand() is nothing more than a call to __random(), which looks like this:

    long int
    __random ()
    {
      int32_t retval;

      __libc_lock_lock (lock);

      (void) __random_r (&unsafe_state, &retval);

      __libc_lock_unlock (lock);

      return retval;
    }

Each call takes a global lock around __random_r(), the function that does the actual work.

So as soon as you have multiple threads using rand(), you make each thread wait for the others on almost every call to rand(). It should be much faster to use random_r() directly, with a separate buffer in each thread.
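A minimal sketch of that approach, assuming glibc (random_r(), initstate_r() and struct random_data are GNU extensions), could rework the thread data and worker function so that every thread owns its own generator state and never touches the shared lock:

    #define _GNU_SOURCE          // random_r()/initstate_r() are nonstandard glibc functions
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define PRNG_STATE_SIZE 64   // initstate_r() requires at least 8 bytes

    struct PIThreadData {
        int iterations;
        double out;
        char state_buf[PRNG_STATE_SIZE];   // private PRNG state buffer
        struct random_data rand_state;
    };

    void* pi_mc_threaded(void* ptr) {
        struct PIThreadData* data = ptr;
        int inside = 0;

        // random_data must be zeroed before initstate_r(); the seed here is
        // just a placeholder to give each thread a different stream.
        memset(&data->rand_state, 0, sizeof data->rand_state);
        initstate_r((unsigned) (uintptr_t) data, data->state_buf,
                    PRNG_STATE_SIZE, &data->rand_state);

        for (int i = 0; i < data->iterations; i++) {
            int32_t rx, ry;
            random_r(&data->rand_state, &rx);   // no shared lock: state is private
            random_r(&data->rand_state, &ry);
            double x = (double) rx / RAND_MAX;
            double y = (double) ry / RAND_MAX;
            if (x * x + y * y <= 1.0)
                inside++;
        }

        data->out = 4.0 * (double) inside / (double) data->iterations;
        return NULL;
    }

With per-thread state the threads no longer serialize on glibc's internal lock; a real program would also choose the seeds more carefully than this placeholder does.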
