Threaded implementations degrade performance
I implemented a small program in C to calculate PI using the Monte Carlo method (mainly out of personal interest and for training). After implementing the basic code structure, I added a command-line option that allows the calculation to be performed across threads.
I expected a big speedup, but was disappointed. The command-line summary should make the setup clear: the final number of iterations used to approximate PI is the product of the -iterations and -threads values passed on the command line. Leaving -threads blank defaults it to 1 thread, so the calculation runs in the main thread.
The following tests performed a total of 80 million iterations.
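For example (the flag syntax here is assumed from the description above, not taken from the program's actual usage text), such a run could look like:

./pi -iterations 10000000 -threads 8

i.e. 10,000,000 iterations in each of 8 threads, for 80 million in total.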
On Windows 7 64-bit (Intel Core2 Duo machine):
Compiled with Cygwin GCC 4.5.3: gcc-4 pi.c -o pi.exe -O3
On Ubuntu/Linaro 12.04 (8-core AMD):
Compiled with GCC 4.6.3: gcc pi.c -lm -lpthread -O3 -o pi
Performance
On Windows, the threaded version is a few milliseconds faster than the non-threaded version. I expected better performance, to be honest. On Linux, ouch! What is going on there? Why does it take 2000% of the time? A lot depends on the implementation, of course, so here is an excerpt, starting right after command-line argument parsing is done and the calculation begins:
// Begin computation.
clock_t t_start, t_delta;
double pi = 0;
int i; /* declared here so the excerpt stands on its own */

if (args.threads == 1) {
    t_start = clock();
    pi = pi_mc(args.iterations);
    t_delta = clock() - t_start;
}
else {
    pthread_t* threads = malloc(sizeof(pthread_t) * args.threads);
    if (!threads) {
        return alloc_failed();
    }

    struct PIThreadData* values = malloc(sizeof(struct PIThreadData) * args.threads);
    if (!values) {
        free(threads);
        return alloc_failed();
    }

    t_start = clock();

    /* Spawn one worker per thread; each writes its estimate to values[i].out. */
    for (i = 0; i < args.threads; i++) {
        values[i].iterations = args.iterations;
        values[i].out = 0.0;
        pthread_create(threads + i, NULL, pi_mc_threaded, values + i);
    }

    /* Wait for all workers and accumulate their results. */
    for (i = 0; i < args.threads; i++) {
        pthread_join(threads[i], NULL);
        pi += values[i].out;
    }
    t_delta = clock() - t_start;

    free(threads);
    threads = NULL;
    free(values);
    values = NULL;

    /* Each worker produced an independent estimate of PI; average them. */
    pi /= (double) args.threads;
}
where pi_mc_threaded() is implemented as:
struct PIThreadData {
    int iterations;
    double out;
};

void* pi_mc_threaded(void* ptr) {
    struct PIThreadData* data = ptr;
    data->out = pi_mc(data->iterations);
    return NULL; /* result is returned through data->out */
}
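The excerpt does not show pi_mc() itself. As a rough, hypothetical sketch of what a rand()-based estimator of this kind usually looks like (the actual implementation is in the link below):

/* Hypothetical sketch only -- the real pi_mc() is in the linked source.
 * Draws random points in the unit square and counts hits inside the
 * quarter circle; needs <stdlib.h> for rand() and RAND_MAX. */
double pi_mc(int iterations) {
    int i;
    long hits = 0;
    for (i = 0; i < iterations; i++) {
        double x = rand() / (double) RAND_MAX;
        double y = rand() / (double) RAND_MAX;
        if (x * x + y * y <= 1.0)
            hits++;
    }
    return 4.0 * (double) hits / (double) iterations;
}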
You can find the full source code at http://pastebin.com/jptBTgwr.
Question
Why is that? Why is there such an extreme difference on Linux? I expected the calculation to take something like 3/4 of the original time. Could it be that I am simply using the pthread library incorrectly? A clarification on how to do this correctly would be very welcome.
Solution
The problem is that in the glibc implementation, rand() simply calls __random(), and that
long int
__random ()
{
    int32_t retval;

    __libc_lock_lock (lock);

    (void) __random_r (&unsafe_state, &retval);

    __libc_lock_unlock (lock);

    return retval;
}
takes a lock around each call to __random_r(), the function that does the actual work.
So as soon as you have multiple threads using rand(), you make each thread wait for the others on almost every call to rand(). Using random_r() directly with your own private buffer in each thread should be much faster.
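As a minimal sketch of that change (not the asker's actual fix), reusing the PIThreadData struct from the question and glibc's random_r()/initstate_r() interface, which is a GNU extension; the seed choice here is arbitrary:

#define _GNU_SOURCE   /* expose glibc's random_r()/initstate_r() */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define STATE_LEN 64  /* initstate_r() needs at least 8 bytes of state */

struct PIThreadData {
    int iterations;
    double out;
    struct random_data rdata;  /* per-thread PRNG state, no shared lock */
    char statebuf[STATE_LEN];  /* backing storage for that state */
};

void* pi_mc_threaded(void* ptr) {
    struct PIThreadData* data = ptr;
    long hits = 0;
    int i;

    /* struct random_data must be zeroed before initstate_r() is called. */
    memset(&data->rdata, 0, sizeof data->rdata);
    initstate_r((unsigned) (uintptr_t) ptr,  /* arbitrary seed; the pointer differs per thread */
                data->statebuf, sizeof data->statebuf, &data->rdata);

    for (i = 0; i < data->iterations; i++) {
        int32_t rx, ry;
        random_r(&data->rdata, &rx);  /* no global lock taken here */
        random_r(&data->rdata, &ry);
        double x = rx / (double) RAND_MAX;  /* random_r() yields [0, 2^31-1] */
        double y = ry / (double) RAND_MAX;
        if (x * x + y * y <= 1.0)
            hits++;
    }
    data->out = 4.0 * (double) hits / (double) data->iterations;
    return NULL;
}

Because each thread now draws from its own private generator state, the calls no longer serialize on glibc's internal lock, and adding threads should actually speed up the computation.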