
## Parallelizing simple “for” loops

I implemented a heap-free, lock-free and wait-free(?) scheduler for parallelizing simple independent “for” loops. For example, if we have a piece of code

```
data_type *data;
for (int i = 0; i < N; ++i)
	do_work(data, i);
```

where each iteration is largely independent of the others, we can process the loop with 4 threads:

```
data_type *data;
kt_for(4, do_work, data, N);
```

The 4 threads will finish at about the same time even if individual iterations take very different amounts of time to process.

The original task-stealing algorithm uses deques, but in our simpler case the deque can be implicit. Task pushing and stealing can then be achieved in a wait-free manner with the atomic fetch-and-add operation, making the scheduler highly scalable to many threads with little overhead.

To evaluate the efficiency of kt_for(), I parallelize the for loop in main() of the following code, which essentially computes the colors of the Mandelbrot set on an 800×600 canvas:

```
#include <stdlib.h>

typedef struct {
	int max_iter, w, h;
	double xmin, xmax, ymin, ymax;
	int *k;
} global_t;

static void compute(void *_g, int i, int tid)
{
	global_t *g = (global_t*)_g;
	double x, x0 = g->xmin + (g->xmax - g->xmin) * (i%g->w) / g->w;
	double y, y0 = g->ymin + (g->ymax - g->ymin) * (i/g->w) / g->h;
	int k;
	x = x0, y = y0;
	for (k = 0; k < g->max_iter; ++k) {
		double z = x * y;
		x *= x; y *= y;
		if (x + y >= 4) break;
		x = x - y + x0;
		y = z + z + y0;
	}
	g->k[i] = k;
}

int main(int argc, char *argv[])
{
	int i, tot, n_threads = 2;
	global_t global = { 10240*100, 800, 600, -2., -1.2, -1.2, 1.2, 0 };
	tot = global.w * global.h;
	global.k = calloc(tot, sizeof(int));
	/* serial version; the parallel version replaces this loop with
	 * a single kt_for(n_threads, compute, &global, tot) call */
	for (i = 0; i < tot; ++i) compute(&global, i, 0);
	free(global.k);
	return 0;
}
```

The complete source code is on github. Here are the wall-clock times in seconds (gcc-4.7.2 and icc-13.1.3 on machine1; gcc-4.3.2 on machine2):

|                       | kt_for | Cilk+ | OpenMP |
|-----------------------|-------:|------:|-------:|
| 1 CPU, machine1, gcc  |   29.4 |       |        |
| 2 CPU, machine1, gcc  |   16.0 |       |   17.5 |
| 4 CPU, machine1, gcc  |    8.6 |       |        |
| 1 CPU, machine1, icc  |   26.8 |       |        |
| 2 CPU, machine1, icc  |   14.7 |  16.3 |        |
| 4 CPU, machine1, icc  |    8.3 |   9.5 |        |
| 1 CPU, machine2, gcc  |   60.9 |       |        |
| 2 CPU, machine2, gcc  |   30.7 |       |        |
| 31 CPU, machine2, gcc |    2.2 |       |        |

In this example, kt_for() is faster than both Cilk+ and OpenMP, and it also scales well to tens of CPU cores. Nonetheless, it should be noted that Cilk+ and OpenMP are much more versatile than my 50-line library; the microbenchmark may also over-emphasize the scheduler overhead. Please take the result with a grain of salt.

### One Response

1. **John Mikes**, on November 11, 2013 at 11:21 am

ktf_worker seems to spin on __sync_fetch_and_add until it manages to get some work. That means it is not wait-free.

“A wait-free implementation of a concurrent data object is one that guarantees that any process can complete any operation in a finite number of steps, regardless of the execution speeds of the other processes.”

That’s not the case here: one ktf_worker may take an unbounded number of steps (actually bounded by the input, but we’re not interested in that for this definition) while it tries to deconflict getting work from its queue.