Archive for the ‘Uncategorized’ Category

I implemented a heap-free, lock-free and wait-free(?) scheduler for parallelizing simple independent “for” loops. For example, suppose we have a piece of code

data_type *data;
for (int i = 0; i < N; ++i)
    do_work(data, i);

where each iteration is largely independent of the others, we can process the loop with 4 threads:

data_type *data;
kt_for(4, do_work, data, N);

The 4 threads will finish at about the same time even if individual iterations take very different amounts of time to process.

The scheduler uses a simplified task-stealing algorithm to balance the load across threads. Initially, given m threads, kt_for() assigns the i-th task/iteration to thread i%m. A thread that finishes its own tasks early steals a task from the most heavily loaded thread. Thus, as long as enough tasks remain, no thread sits idle.

The original task-stealing algorithm uses deques, but in this simpler setting the deque can be implicit. Task pushing and stealing can then be achieved in a wait-free manner with an atomic fetch-and-add operation, which keeps the scheduler scalable to many threads with little overhead.

To evaluate the efficiency of kt_for(), I parallelized the main loop in the following code, which essentially computes the color of the Mandelbrot set on an 800×600 canvas:

#include <stdlib.h>

typedef struct {
	int max_iter, w, h;
	double xmin, xmax, ymin, ymax;
	int *k;
} global_t;

static void compute(void *_g, int i, int tid)
{
	global_t *g = (global_t*)_g;
	double x, x0 = g->xmin + (g->xmax - g->xmin) * (i%g->w) / g->w;
	double y, y0 = g->ymin + (g->ymax - g->ymin) * (i/g->w) / g->h;
	int k;
	x = x0, y = y0;
	for (k = 0; k < g->max_iter; ++k) {
		double z = x * y;
		x *= x; y *= y;
		if (x + y >= 4) break;
		x = x - y + x0;
		y = z + z + y0; 
	}
	g->k[i] = k;
}

int main(int argc, char *argv[])
{
	int i, tot, n_threads = 2;
	global_t global = { 10240*100, 800, 600, -2., -1.2, -1.2, 1.2, 0 };
	tot = global.w * global.h;
	global.k = calloc(tot, sizeof(int));
	for (i = 0; i < tot; ++i) compute(&global, i, 0);
	free(global.k);
	return 0;
}

The complete source code is available on GitHub. Here are the wall-clock times in seconds (gcc-4.7.2 and icc-13.1.3 on machine1; gcc-4.3.2 on machine2):

                        kt_for   Cilk+   OpenMP
1 CPU, machine1, gcc    29.4
2 CPU, machine1, gcc    16.0             17.5
4 CPU, machine1, gcc     8.6
1 CPU, machine1, icc    26.8
2 CPU, machine1, icc    14.7     16.3
4 CPU, machine1, icc     8.3      9.5
1 CPU, machine2, gcc    60.9
2 CPU, machine2, gcc    30.7
31 CPU, machine2, gcc    2.2

In this example, kt_for() is faster than both Cilk+ and OpenMP, and it also scales well to tens of CPU cores. Nonetheless, it should be noted that Cilk+ and OpenMP are much more versatile than my 50-line library, and this microbenchmark may over-emphasize the scheduler overhead. Please take the results with a grain of salt.


The Mandelbrot set is perhaps the most popular example of a fractal. There are thousands of implementations plotting the Mandelbrot set in different languages using different techniques. I re-implemented the Mandelbrot set mainly for fun and for learning OpenGL, GLSL and the HTML5 canvas. Here are the four implementations:

  1. glfractal.c, in C and plain OpenGL. This program demonstrates basic OpenGL usage. It has an interesting feature: it dumps the current coordinates as a short hex string so that we can come back to the same position later – when you find a beautiful area in the Mandelbrot set, you can hardly find it again without such a feature.
  2. glslfractal.c, in C and GLSL using single-precision floating-point numbers. This program is modified from the source code on this page. It runs much faster than the CPU-only version, but you cannot zoom in very far due to the loss of precision.
  3. glslfractale.c, in C and GLSL using emulated double-precision numbers. This program aims to alleviate the precision problem, but the result is not as satisfactory as I hoped.
  4. HTML5+JavaScript (a live web page!). It also supports hex dump/restore. You can click the following links to see a few areas I like; you may need to wait a few seconds while your browser computes the image on the fly: image 1, image 2, image 3 and image 4. Enjoy.

There is also an unfinished webGL implementation. It is supposed to be as fast as the C+GLSL version.


Most C programmers know that members of a C struct have to be aligned in memory. Take the following struct as an example:

typedef struct {
  unsigned key;
  unsigned char val;
} UnpackedStruct;

The two members of this struct take 5 bytes in total. However, the struct is padded to a multiple of the alignment of its widest member (the 4-byte “key”), so “sizeof(UnpackedStruct)” returns 8: 3 bytes are wasted per element. This waste of memory is the key reason why my khash library keeps keys and values in two separate arrays, even though doing so incurs more cache misses.

Khash was initially written about 10 years ago, when I was young and foolish. I later learned that with gcc/clang it is possible to byte-pack the struct:

typedef struct {
  unsigned key;
  unsigned char val;
}  __attribute__ ((__packed__)) PackedStruct;

With this, “sizeof(PackedStruct)” returns 5. Then why does gcc not pack structs by default? Is it because unaligned memory access hurts performance? A Google search pointed me to this question on StackOverflow. There was a discussion, but no clear conclusion.

Hash tables have become the bottleneck of my recent work, so I decided to revisit the question: do packed structs hurt performance on x86_64 CPUs? As usual, I ran a very simple benchmark: with khash, I insert/delete 50 million (uint32_t,uint8_t) integer pairs stored in either the packed or the unpacked struct shown above, and check whether the performance differs. The following table shows the CPU time on my x86_64 laptop:

Key type        Value type   Size per elem   CPU seconds
unsigned        uint8_t      5               10.249
UnpackedStruct  N/A          8               9.429
PackedStruct    N/A          5               9.287

The table says it all: on x86_64 CPUs, an array of packed structs does not hurt performance in comparison to an array of unpacked structs. With both gcc and clang, the packed struct is in fact consistently faster, perhaps because it takes less space, which helps cache performance. The source code can be found here.

Finally, it should be noted that x86 CPUs have been optimized for unaligned memory access. On other CPUs the results may be very different; perhaps that is why gcc does not pack structs by default.


A quick post. I implemented matrix multiplication in Dart. Dart takes 12 seconds to multiply two 500×500 matrices, whereas LuaJIT does the same job in less than 0.2 seconds and Perl takes 26 seconds. This suggests that Dart fails to JIT-compile the critical loop even though I am using explicit typing. Dart is not there yet to match the performance of V8 and LuaJIT. The source code is appended; it looks almost the same as C++.

mat_transpose(a)
{
	int m = a.length, n = a[0].length; // m rows and n cols
	var b = new List(n);
	for (int j = 0; j < n; ++j) b[j] = new List<double>(m);
	for (int i = 0; i < m; ++i)
		for (int j = 0; j < n; ++j)
			b[j][i] = a[i][j];
	return b;
}

mat_mul(a, b)
{
	int m = a.length, n = a[0].length, s = b.length, t = b[0].length;
	if (n != s) return null;
	var x = new List(m), c = mat_transpose(b);
	for (int i = 0; i < m; ++i) {
		x[i] = new List<double>(t);
		for (int j = 0; j < t; ++j) {
			double sum = 0.0;
			var ai = a[i], cj = c[j];
			for (int k = 0; k < n; ++k) sum += ai[k] * cj[k];
			x[i][j] = sum;
		}
	}
	return x;
}

mat_gen(int n)
{
	var a = new List(n);
	double t = 1.0 / n / n;
	for (int i = 0; i < n; ++i) {
		a[i] = new List<double>(n);
		for (int j = 0; j < n; ++j)
			a[i][j] = t * (i - j) * (i + j);
	}
	return a;
}

main()
{
	int n = 500;
	var a = mat_gen(n);
	var b = mat_gen(n);
	var c = mat_mul(a, b);
	print(c[n~/2][n~/2]);
}


As sorting is the bottleneck of my application, I decided to optimize the code further with reference to this wonderful article. The optimized version is about 40% faster than my original one and is now about 2.5 times as fast as STL’s std::sort on random 32-bit integer arrays. It is slightly slower than the implementation in that article, but much faster on sorted arrays.

For sorting large integer arrays, radix sort is king. It is much faster than the standard comparison-based algorithms and is arguably simpler.

#include <stdint.h> // for uint64_t

#define rstype_t uint64_t // type of the array
#define rskey(x) (x) // specify how to get the integer from rstype_t

#define RS_MIN_SIZE 64 // for an array smaller than this, use insertion sort

typedef struct {
  rstype_t *b, *e; // begin and end of each bucket
} rsbucket_t;

void rs_insertsort(rstype_t *beg, rstype_t *end) // insertion sort
{
  rstype_t *i;
  for (i = beg + 1; i < end; ++i)
    if (rskey(*i) < rskey(*(i - 1))) {
      rstype_t *j, tmp = *i;
      for (j = i; j > beg && rskey(tmp) < rskey(*(j-1)); --j)
        *j = *(j - 1);
      *j = tmp;
    }
}
// sort between [$beg, $end); take radix from ">>$s&((1<<$n_bits)-1)"
void rs_sort(rstype_t *beg, rstype_t *end, int n_bits, int s)
{
  rstype_t *i;
  int size = 1<<n_bits, m = size - 1;
  rsbucket_t *k, b[size], *be = b + size; // b[] keeps all the buckets

  for (k = b; k != be; ++k) k->b = k->e = beg;
  for (i = beg; i != end; ++i) ++b[rskey(*i)>>s&m].e; // count radix
  for (k = b + 1; k != be; ++k) // set start and end of each bucket
    k->e += (k-1)->e - beg, k->b = (k-1)->e;
  for (k = b; k != be;) { // in-place classification based on radix
    if (k->b != k->e) { // the bucket has not been fully processed
      rsbucket_t *l;
      if ((l = b + (rskey(*k->b)>>s&m)) != k) { // different bucket
        rstype_t tmp = *k->b, swap;
        do { // swap until we find an element in bucket $k
          swap = tmp; tmp = *l->b; *l->b++ = swap;
          l = b + (rskey(tmp)>>s&m);
        } while (l != k);
        *k->b++ = tmp; // push the found element to $k
      } else ++k->b; // move to the next element in the bucket
    } else ++k; // move to the next bucket
  }
  for (b->b = beg, k = b + 1; k != be; ++k) k->b = (k-1)->e; // reset k->b
  if (s) { // if $s is non-zero, we need to sort buckets
    s = s > n_bits? s - n_bits : 0;
    for (k = b; k != be; ++k)
      if (k->e - k->b > RS_MIN_SIZE) rs_sort(k->b, k->e, n_bits, s);
      else if (k->e - k->b > 1) rs_insertsort(k->b, k->e);
  }
}

void radix_sort(rstype_t *beg, rstype_t *end)
{
  if (end - beg <= RS_MIN_SIZE) rs_insertsort(beg, end);
  else rs_sort(beg, end, 8, sizeof(rskey(*beg)) * 8 - 8);
}

EDIT: I just found this implementation. It is as fast as mine and simpler. Recommended.


I have recently been working on an algorithm that, surprisingly, spends more than half of its time sorting huge, partially ordered arrays of 64-bit integer pairs (one integer for the key, the other for the value). Naturally, I wanted to optimize this sorting step. Initially, I tried my implementation of introsort; the program took about 90 seconds on a sample data set. I then switched to my iterative mergesort in the same library, which took 55 seconds – I guess mergesort wins because the arrays are partially ordered. However, my mergesort requires a temporary array of the same size, and as the arrays are huge, allocating it is unacceptable for real data. Implementing an in-place mergesort seems quite challenging, so I turned to radix sort, which I had not implemented before.

My radix sort implementation is here. It is not written as a library, but it should be easy to adapt to other data types. The C program is quite simple and not much different from existing ones.

How about the performance? With radix sort, my program takes 35 seconds while using little extra working space – a 100% speedup from replacing introsort with integer-only radix sort. To evaluate the performance of radix sort separately, I put the code in my old ksort_test.cc. Here are the CPU seconds spent sorting 50 million random or sorted integers:

Algorithm       Sorted?   Mac CPU (sec)   Linux CPU (sec)
STL introsort   No        4.9             5.1
STL introsort   Yes       0.9             1.1
STL stablesort  No        6.7             6.1
STL stablesort  Yes       2.0             2.0
STL heapsort    No        54.1            32.2
STL heapsort    Yes       4.5             4.2
libc qsort      No        11.3            9.7
ac’s radix      No        1.9             2.0
ac’s radix      Yes       0.8             0.9
ac’s combsort   No        11.9            11.5
ac’s introsort  No        5.5             5.7
ac’s introsort  Yes       7.4             6.8
ac’s mergesort  No        6.1             6.6
ac’s mergesort  Yes       2.1             2.3
ac’s heapsort   No        54.4            31.8
ac’s heapsort   Yes       4.9             6.6
ph’s heapsort   No        54.6            30.8
ph’s quicksort  No        5.6             5.8
ph’s mergesort  No        7.0             6.9

Please see my old post for information on the other algorithms and implementations. You can also clone my klib repository and play with ksort_test.cc yourself.


Recently I saw Linus’ post on Gnome 3.4, where he complained that it is hard to make fonts smaller. This reminded me of my recent experience installing Mint/Ubuntu in a virtual machine: I had exactly the same problem. In Unity I wanted smaller fonts, but found that I had to install a 3rd-party tweak tool that is not part of Ubuntu. I deleted the VM immediately.

Linus blamed Gnome for not being customizable enough. I blame Unity, too (in Xfce it is easy to change font sizes), but I think the more fundamental problem is the default settings of Gnome3/Unity/Xfce. On a Mac, I have always been happy with the default system fonts and font sizes (I changed the font settings in only a couple of applications). I have not heard Windows users complain about their system defaults, either.

No matter how fancy a user interface looks from afar, if the fonts are ugly or the wrong size, it is a complete failure. Why haven’t UI designers realized this simple fact even today? It amazes me that they can live with such big, ugly fonts – and love them so much that they even disallow end users from changing them!

Sorry for the rant.


The polls were conducted on Hacker News and are still ongoing. The following is a snapshot of the results (generated by this ugly Perl script). The links to the two polls: like and dislike.

What is interesting is that this is so far the only poll I have seen on “disliked” languages.


According to the TIOBE index, D quickly gained popularity around 2007, coincident with the releases of D1, Tango, GDC and LDC. Its popularity has receded since 2009, again coincident with the departure of several key D developers from the community. Looking back, I think the biggest mistake was made by the Tango developers, who chose not to be compatible with the official standard library at the runtime level. I understand that the Tango runtime is probably more efficient, but asking developers to drop the official library is rare, unreasonable and a little impolite. In my view, the incompatibility also made it much harder to port Tango to D2, which further frustrated D developers. On the other hand, my feeling is that Tango played a key role in attracting C++ and Java programmers; D was probably flourishing thanks to Tango, and when Tango stopped evolving, D’s TIOBE ranking fell back to the pre-Tango days. In addition, the tension between Tango and Phobos/DMD did not arise without reason: the closedness of Phobos and DMD at the beginning, the slow adoption of 64-bit platforms and the D1-to-D2 transition all led to complaints among developers. While I believe the D1-to-D2 transition was, however painful, the right direction overall, the other interfering factors should and could have been avoided. Had they been resolved well, I am sure the transition would have been smoother, too – solutions are easier to find in a friendly atmosphere.

Looking forward, I see that D2 has got over some major obstacles but may also face new challenges. On the up side, D2 and Phobos are more open; D2 is in many ways a better language than D1; official 64-bit support is finished (at least on Linux); Phobos is much more powerful; and Tango has just been ported to D2 without the old incompatibility with Phobos. On the down side, however, D2/Phobos is still fast-moving – I do not know whether a D program written today will still compile a year later. The Tango development team has shrunk considerably. D has probably frustrated quite a few early adopters, and it will not be easy to win them back.

As an outsider, I think the first priority for D is to stabilize the compiler and the standard library quickly, to convince everyone that programs written now will still compile years later. Without this guarantee, D2 will remain a niche/hobby language. As a rant, I also think the core development team should at least make D2+Phobos easier to compile and install. I am not bad at installing software, but it took me a couple of hours to figure out the relationships between the source trees and how to install and configure everything without root permission. Many newcomers will simply leave without looking back.

I have always thought D is the most promising modern programming language, successfully combining extreme efficiency with expressiveness. I continue to hope D can take flight – I have been hoping so for 5 years. Whether D flourishes again or gradually fades away, its history will make a worthy case study for the open source community.


Installing the D compiler

I am a long-time fan of the D programming language, though I have never seriously learned it. One of the many big and small reasons I do not use D is that the official compiler, dmd, is hard to install on the old Linux servers I routinely access. Although a binary has been available for some time, it requires a recent glibc. I tried to compile the source code, but the compiler is split across several source trees and the documentation is quite lousy – it is not as if I can download a tarball, unpack it and run ‘./configure; make; make install’.

The situation has not changed much since then – the binary still does not work on those servers – but I hope I can help those who have the same problem. I wrote a Makefile that compiles the D compiler: it grabs the source code from github and builds it. If the binary does not work on your old machine, you can compile D as follows (you need a C++ compiler and git, of course):

mkdir -p dmd2; cd dmd2
git clone git://gist.github.com/1848272.git
mv 1848272/compile-d.mak .
make -f compile-d.mak
export PATH="`pwd`/bin:$PATH"

The whole procedure takes only a few minutes, much faster than compiling gcc/g++. I hope it helps.

