I implemented a heap-free, lock-free and wait-free(?) scheduler for parallelizing simple independent “for” loops. For example, if we have a piece of code
```c
data_type *data;
for (int i = 0; i < N; ++i)
	do_work(data, i);
```
where each iteration is largely independent of the others, we can process the loop with 4 threads:
```c
data_type *data;
kt_for(4, do_work, data, N);
```
The 4 threads will finish at about the same time even if individual iterations take very different amounts of time to process.
The scheduler uses a simplified task-stealing algorithm to balance the load across threads. Initially, given m threads, kt_for() assigns the i-th task/iteration to thread i%m. If a thread finishes earlier than the others, it steals a task from the most loaded thread. Thus, as long as enough tasks remain, no thread sits idle.
The original task-stealing algorithm uses deques, but in our simpler case the deque can be implicit. Task pushing and stealing are achieved in a wait-free manner with the atomic fetch-and-add operation, which makes the scheduler highly scalable to many threads with little overhead.
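To make the implicit-deque idea concrete, here is a minimal sketch in C11. It is illustrative, not the actual kt_for() source: the names (`ktf_worker_t`, `steal_work`), the use of `<stdatomic.h>` rather than compiler builtins, and the fixed 64-thread cap are my assumptions; only the call shape kt_for(n_threads, func, data, n) follows the usage shown above. Each worker owns an atomic counter that acts as the head of its implicit deque: worker t claims tasks t, t+m, t+2m, … by fetch-and-add on its own counter, and a worker that runs out steals by doing fetch-and-add on the counter of the most loaded (smallest-counter) worker.

```c
#include <limits.h>
#include <pthread.h>
#include <stdatomic.h>

typedef struct ktf_shared ktf_shared_t;

typedef struct {
	atomic_long i;          /* next task index for this worker; advances by m */
	ktf_shared_t *t;
	int tid;
} ktf_worker_t;

struct ktf_shared {
	int m;                  /* number of threads */
	long n;                 /* number of tasks */
	void (*func)(void*, long, int);
	void *data;
	ktf_worker_t *w;
};

static long steal_work(ktf_shared_t *t)
{
	int i, min_i = 0;
	long k, min = LONG_MAX;
	for (i = 0; i < t->m; ++i) {    /* the smallest counter has the most tasks left */
		long v = atomic_load(&t->w[i].i);
		if (v < min) min = v, min_i = i;
	}
	k = atomic_fetch_add(&t->w[min_i].i, t->m);
	return k < t->n ? k : -1;       /* -1: every implicit deque is drained */
}

static void *ktf_worker(void *data)
{
	ktf_worker_t *w = (ktf_worker_t*)data;
	long i;
	/* drain our own implicit deque first ... */
	while ((i = atomic_fetch_add(&w->i, w->t->m)) < w->t->n)
		w->t->func(w->t->data, i, w->tid);
	/* ... then keep stealing until no tasks remain anywhere */
	while ((i = steal_work(w->t)) >= 0)
		w->t->func(w->t->data, i, w->tid);
	return 0;
}

void kt_for(int m, void (*func)(void*, long, int), void *data, long n)
{
	enum { MAX_THREADS = 64 };      /* fixed cap keeps the sketch heap-free */
	ktf_shared_t t;
	ktf_worker_t w[MAX_THREADS];
	pthread_t tid[MAX_THREADS];
	int i;
	if (m > MAX_THREADS) m = MAX_THREADS;
	t.m = m, t.n = n, t.func = func, t.data = data, t.w = w;
	for (i = 0; i < m; ++i)
		atomic_init(&w[i].i, i), w[i].t = &t, w[i].tid = i;
	for (i = 0; i < m; ++i) pthread_create(&tid[i], 0, ktf_worker, &w[i]);
	for (i = 0; i < m; ++i) pthread_join(tid[i], 0);
}
```

Because a counter only ever moves forward under fetch-and-add, every task index is claimed by exactly one thread and no thread ever blocks on a lock; the only contention is an occasional concurrent fetch-and-add on a victim's counter.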
To evaluate the efficiency of kt_for(), I parallelized the serial loop in main() in the following code, which essentially computes the color of the Mandelbrot set on an 800×600 canvas:
```c
#include <stdlib.h>

typedef struct {
	int max_iter, w, h;
	double xmin, xmax, ymin, ymax;
	int *k;
} global_t;

static void compute(void *_g, int i, int tid) // tid is unused in the serial version
{
	global_t *g = (global_t*)_g;
	double x, x0 = g->xmin + (g->xmax - g->xmin) * (i%g->w) / g->w;
	double y, y0 = g->ymin + (g->ymax - g->ymin) * (i/g->w) / g->h;
	int k;
	x = x0, y = y0;
	for (k = 0; k < g->max_iter; ++k) { // standard escape-time iteration
		double z = x * y;
		x *= x; y *= y;
		if (x + y >= 4) break;
		x = x - y + x0;
		y = z + z + y0;
	}
	g->k[i] = k; // number of iterations before escaping; determines the color
}

int main(int argc, char *argv[])
{
	int i, tot, n_threads = 2;
	global_t global = { 10240*100, 800, 600, -2., -1.2, -1.2, 1.2, 0 };
	tot = global.w * global.h;
	global.k = calloc(tot, sizeof(int));
	for (i = 0; i < tot; ++i) compute(&global, i, 0); // the loop to parallelize
	free(global.k);
	return 0;
}
```
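With the scheduler in place, parallelizing this program amounts to replacing the serial loop in main(). A hypothetical two-line change, assuming the kt_for(n_threads, func, data, n) call shape shown at the top:

```c
/* before: serial */
for (i = 0; i < tot; ++i) compute(&global, i, 0);
/* after: run the same iterations on n_threads threads */
kt_for(n_threads, compute, &global, tot);
```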
The complete source code is available at GitHub. Here are the wall-clock times in seconds (gcc-4.7.2 and icc-13.1.3 on machine1; gcc-4.3.2 on machine2):
| Configuration | kt_for (s) | Cilk+ (s) | OpenMP (s) |
|---|---|---|---|
| 1 CPU, machine1, gcc | 29.4 | | |
| 2 CPU, machine1, gcc | 16.0 | | 17.5 |
| 4 CPU, machine1, gcc | 8.6 | | |
| 1 CPU, machine1, icc | 26.8 | | |
| 2 CPU, machine1, icc | 14.7 | 16.3 | |
| 4 CPU, machine1, icc | 8.3 | 9.5 | |
| 1 CPU, machine2, gcc | 60.9 | | |
| 2 CPU, machine2, gcc | 30.7 | | |
| 31 CPU, machine2, gcc | 2.2 | | |
In this example, kt_for() is faster than both Cilk+ and OpenMP, and it also scales well to tens of CPU cores. Nonetheless, it should be noted that Cilk+ and OpenMP are far more versatile than my 50-line library, and a microbenchmark like this may over-emphasize scheduler overhead. Please take the results with a grain of salt.

