Archive for the ‘Uncategorized’ Category

The best solution is pdftops from Poppler, a somewhat successor of xpdf (see also this article). It preserves the fonts in PDF and produces a small and proper vector graph. To compile poppler on OSX 10.9, I need to edit “configure” and remove compiling option “-fno-check-new” as clang does not support this option.

Following the answer from this page, I have also tried a few other options. InkScape generates a small vector EPS, but it loses some features. Convert from ImageMagick outputs a bitmap EPS, which defeats the goal of vector graphs.

Interestingly, directly using the “gs” command from GhostScript seems to generate a vector EPS, but using the pdf2ps script produces an EPS with bitmap fonts. It turns out that the difference is caused by “-dNOCACHE”, which is surprising. Anyway, even though “gs” works, it generates a much larger EPS in comparison to pdftops. The winner is still pdftops from xpdf/poppler, at least in my case.

Read Full Post »

Gv apparently calls pkg-config during configuration. When pkg-config or the pkg-config file for Xaw3D is not found, it will fall back to another configuration which does not work on Mac.

As Mac does not come with pkg-config by default, you need to first install it. You also need to specify where to find the pkg-config file for Xaw3D:

export PKG_CONFIG_PATH=/usr/X11/lib/pkgconfig/
./configure --x-includes=/usr/X11/include/ --x-libraries=/usr/X11/lib/ --enable-SIGCHLD-fallback

Read Full Post »

Several years ago I implemented knetfile for accessing remote files on ftp and http as if they are local (see also this blog post). I have been using the implementation for a while and the end users like the feature. However, with the increasing use of https among file sharing and cloud computing providers, supporting secured connection becomes more important. Several users have requested this feature. As a response, I implemented a new library kurl on top of libcurl.

Kurl is inspired by and learns from fopen.c, an example from the curl source code package. It supports random access and uses fixed-length buffer. It also fixes an issue where we may be waiting too long for select(). The APIs largely resemble knetfile, zlib and stdio. The following is a small example:

#include <stdio.h>
#include "kurl.h"
int main() {
  kurl_t *fp;
  unsigned char buf[256];
  fp = kurl_open("https://github.com", 0);
  kurl_seek(fp, 100, SEEK_SET);
  kurl_read(fp, buf, 256);
  return 0;

In addition, kurl.c also comes with a simple main() function to achieve the basic curl functionality, which can be compiled with:

gcc -g -Wall -O2 -lcurl -DKURL_MAIN kurl.c -o kurl

Here are a little more details about kurl:

  • Two-file library. No installation.
  • The only dependency is libcurl, though libcurl may further depend on other libraries: e.g. openssl for https; libssh2 for sftp.
  • Directly accesses files in S3 with
    kurl_open("s3://bucket/object", 0)

    AWS credentials are either provided to kurl_open(), or by default read from ~/.awssecret (AccessKeyId and SecretKey on two lines; see Tim Kay’s aws tool for details).

  • Compilable with C++ compilers.
  • Buffered reading with a fixed buffer length. No potential buffer bloat.

Read Full Post »

I implemented a heap-free, lock-free and wait-free(?) scheduler for parallelizing simple independent “for” loops. For example, if we have a piece of code

data_type *data;
for (int i = 0; i < N; ++i)
    do_work(data, i);

where each cycle is largely independent of other cycles, we can process the loop with 4 threads:

data_type *data;
kt_for(4, do_work, data, N);

The 4 threads will end at about the same time even if each cycle takes very different time to process.

The scheduler uses a simplified task stealing algorithm to balance the load of each thread. Initially, given m threads, kt_for() assigns the i-th task/cycle to thread i%m. If a thread finishes earlier than other threads, the thread will steal a task from the most loaded thread. Thus as long as there remain enough tasks, no threads will be idle.

The original task stealing algorithm uses deques, but in our simpler case, the deque can be implicit. Task pushing and stealing can be achieved in a wait-free manner with the atomic fetch-and-add operation, making the scheduler highly scalable to many threads with little overhead.

To evaluate the efficiency of kt_for(), I parallelize the loop at line 32 in the following code that essentially computes the color of the Mandelbrot set in a 800×600 canvas:

#include <stdlib.h>

typedef struct {
	int max_iter, w, h;
	double xmin, xmax, ymin, ymax;
	int *k;
} global_t;

static void compute(void *_g, int i, int tid)
	global_t *g = (global_t*)_g;
	double x, x0 = g->xmin + (g->xmax - g->xmin) * (i%g->w) / g->w;
	double y, y0 = g->ymin + (g->ymax - g->ymin) * (i/g->w) / g->h;
	int k;
	x = x0, y = y0;
	for (k = 0; k < g->max_iter; ++k) {
		double z = x * y;
		x *= x; y *= y;
		if (x + y >= 4) break;
		x = x - y + x0;
		y = z + z + y0; 
	g->k[i] = k;

int main(int argc, char *argv[])
	int i, tot, n_threads = 2;
	global_t global = { 10240*100, 800, 600, -2., -1.2, -1.2, 1.2, 0 };
	tot = global.w * global.h;
	global.k = calloc(tot, sizeof(int));
	for (i = 0; i < tot; ++i) compute(&global, i, 0);
	return 0;

The complete source code is at github. Here is the wall-clock time (gcc-4.7.2 and icc-13.1.3 on machine1; gcc-4.3.2 on machine2):

kt_run Cilk OpenMP
1 CPU, machine1, gcc 29.4
2 CPU, machine1, gcc 16.0 17.5
4 CPU, machine1, gcc 8.6
1 CPU, machine1, icc 26.8
2 CPU, machine1, icc 14.7 16.3
4 CPU, machine1, icc 8.3 9.5
1 CPU, machine2, gcc 60.9
2 CPU, machine2, gcc 30.7
31 CPU, machine2, gcc 2.2

On this example, kt_for() is faster than both Cilk+ and OpenMP, and also scales well to tens of CPU cores. Nonetheless, it should be noted that Cilk+ and OpenMP are much more versatile than my 50-line library; the microbenchmark may also over-emphasize the scheduler overhead. Please take the result with a grain of salt.

Read Full Post »

The Mandelbrot set is THE most popular example of fractal. There are thousands of implementations to plot the Mandelbrot set in different languages using different techniques. I re-implemented Mandelbrot set mainly for fun and for learning OpenGL, GLSL and HTML5 canvas. Here are the four implementations:

  1. glfractal.c in C and plain OpenGL. This program demonstrates how to use the basic OpenGL. It has an interesting feature of dumping the coordinates in a short HEX string such that we can come back to the position later – when you find a beautiful area in the Mandelbrot set, you can hardly find it again without this feature.
  2. glslfractal.c in C and GLSL using single-precision floating point numbers. This program is modified from the source code on this page. It runs much faster than the CPU-only version. However, you cannot zoom in too much due to the loss of precision.
  3. glslfractale.c in C and GLSL using emulated double-precision numbers. This program aims to alleviate the precision problem, but the result is not so satisfactory.
  4. HTML5+javascript (a live web page!). It also supports HEX dump/restore. You can click the following links to see a few areas I like. You may need to wait for a few seconds as your browser computes the image on the fly: image 1, image 2, image 3 and image 4. Enjoy.

There is also an unfinished webGL implementation. It is supposed to be as fast as the C+GLSL version.
Image 3

Read Full Post »

Most C programmers know that in a C struct, members have to be aligned in memory. Take the following struct as an example:

typedef struct {
  unsigned key;
  unsigned char val;
} UnpackedStruct;

The two members of this struct take 5 bytes in total. However, because “val” has to be aligned with the longer “key”, “sizeof(UnpackedStruct)” returns 8. 3 bytes are wasted in this struct. Waste of memory is the key reason why my khash library uses two separate arrays to keep keys and values even though this leads to more cache misses.

Khash was initially written about 10 years ago when I was young and foolish. I later learned that with gcc/clang, it is possible to byte-pack the struct:

typedef struct {
  unsigned key;
  unsigned char val;
}  __attribute__ ((__packed__)) PackedStruct;

With this, “sizeof(PackedStruct)” returns 5. Then why gcc does not use this by default? Is it because unaligned memory hurt performance? Google search pointed me to this question on StackOverflow. There was a discussion, but no clear conclusions.

Hash table has become the bottleneck of my recent works, so I decided to revisit the question: does packed struct hurt performance on x86_64 CPUs? As usual, I did a very simple benchmark: with khash, I insert/delete 50 million (uint32_t,uint8_t) integer pairs stored in either packed or unpacked struct shown above and see if the performance is different. The following table shows the CPU time on my x86_64 laptop:

Key type Value type size per elem CPU seconds
Unsigned uint8_t 5 10.249
UnpackedStruct N/A 8 9.429
PackedStruct N/A 5 9.287

The table says it all: on x64 CPUs, a packed struct array does not hurt performance in comparison to an unpacked struct array. With both gcc and clang, packed struct is consistently faster, perhaps because packed struct takes smaller space, which might help cache performance. The source code can be found here.

At last, it should be noted that x86 CPUs have been optimized for unaligned memory access. On other CPUs, the results may be very different. Perhaps that is why gcc does not pack struct by default.

Read Full Post »

A quick post. I implemented matrix multiplication in Dart. It takes Dart 12 seconds to multiply two 500×500 matrices. In contrast, LuaJIT does the same job in less than 0.2 seconds. Perl takes 26 seconds. This means that Dart fails to JIT the critical loop even though I am trying to use explicit typing. Dart is not quite there yet to match the performance of V8 and LuaJIT. The source code is appended. It looks almost the same as C++.

	int m = a.length, n = a[0].length; // m rows and n cols
	var b = new List(n);
	for (int j = 0; j < n; ++j) b[j] = new List<double>(m);
	for (int i = 0; i < m; ++i)
		for (int j = 0; j < n; ++j)
			b[j][i] = a[i][j];
	return b;

mat_mul(a, b)
	int m = a.length, n = a[0].length, s = b.length, t = b[0].length;
	if (n != s) return null;
	var x = new List(m), c = mat_transpose(b);
	for (int i = 0; i < m; ++i) {
		x[i] = new List<double>(t);
		for (int j = 0; j < t; ++j) {
			double sum = 0;
			var ai = a[i], cj = c[j];
			for (int k = 0; k < n; ++k) sum += ai[k] * cj[k];
			x[i][j] = sum;
	return x;

mat_gen(int n)
	var a = new List(n);
	double t = 1.0 / n / n;
	for (int i = 0; i < n; ++i) {
		a[i] = new List<double>(n);
		for (int j = 0; j < n; ++j)
			a[i][j] = t * (i - j) * (i + j);
	return a;

	int n = 500;
	var a = mat_gen(n);
	var b = mat_gen(n);
	var c = mat_mul(a, b);

Read Full Post »

Older Posts »