The best solution for converting PDF to vector EPS is pdftops from Poppler, a successor of sorts to xpdf (see also this article). It preserves the fonts in the PDF and produces a small, proper vector graphic. To compile Poppler on OS X 10.9, I needed to edit “configure” and remove the compiler option “-fno-check-new”, as clang does not support it.

Following the answer from this page, I have also tried a few other options. Inkscape generates a small vector EPS, but it loses some features. ImageMagick's convert outputs a bitmap EPS, which defeats the purpose of vector graphics.

Interestingly, directly using the “gs” command from Ghostscript seems to generate a vector EPS, but using the pdf2ps script produces an EPS with bitmap fonts. Surprisingly, the difference turns out to be caused by “-dNOCACHE”. Anyway, even though “gs” works, it generates a much larger EPS than pdftops does. The winner is still pdftops from xpdf/Poppler, at least in my case.

Gv apparently calls pkg-config during configuration. When pkg-config or the pkg-config file for Xaw3D is not found, it will fall back to another configuration which does not work on Mac.

As Mac does not come with pkg-config by default, you need to first install it. You also need to specify where to find the pkg-config file for Xaw3D:

export PKG_CONFIG_PATH=/usr/X11/lib/pkgconfig/
./configure --x-includes=/usr/X11/include/ --x-libraries=/usr/X11/lib/ --enable-SIGCHLD-fallback

Several years ago I implemented knetfile for accessing remote files over FTP and HTTP as if they were local (see also this blog post). I have been using the implementation for a while, and end users like the feature. However, with the increasing use of HTTPS among file-sharing and cloud-computing providers, supporting secure connections has become more important. Several users have requested this feature. In response, I implemented a new library, kurl, on top of libcurl.

Kurl is inspired by and learns from fopen.c, an example in the curl source package. It supports random access and uses a fixed-length buffer. It also fixes an issue where we may wait too long in select(). The APIs largely resemble those of knetfile, zlib and stdio. The following is a small example:

#include <stdio.h>
#include "kurl.h"
int main() {
  kurl_t *fp;
  unsigned char buf[256];
  fp = kurl_open("https://github.com", 0);
  kurl_seek(fp, 100, SEEK_SET);
  kurl_read(fp, buf, 256);
  kurl_close(fp);
  return 0;
}

In addition, kurl.c also comes with a simple main() function to achieve the basic curl functionality, which can be compiled with:

gcc -g -Wall -O2 -DKURL_MAIN kurl.c -o kurl -lcurl

Here are a few more details about kurl:

  • Two-file library. No installation.
  • The only dependency is libcurl, though libcurl may further depend on other libraries: e.g. openssl for https; libssh2 for sftp.
  • Directly accesses files in S3 with
    kurl_open("s3://bucket/object", 0)

    AWS credentials are either provided to kurl_open(), or by default read from ~/.awssecret (AccessKeyId and SecretKey on two lines; see Tim Kay’s aws tool for details).

  • Compilable with C++ compilers.
  • Buffered reading with a fixed buffer length. No potential buffer bloat.

I implemented a heap-free, lock-free and wait-free(?) scheduler for parallelizing simple independent “for” loops. For example, if we have a piece of code

data_type *data;
for (int i = 0; i < N; ++i)
    do_work(data, i);

where each cycle is largely independent of other cycles, we can process the loop with 4 threads:

data_type *data;
kt_for(4, do_work, data, N);

The 4 threads will finish at about the same time even if each cycle takes a very different amount of time to process.

The scheduler uses a simplified task stealing algorithm to balance the load of each thread. Initially, given m threads, kt_for() assigns the i-th task/cycle to thread i%m. If a thread finishes earlier than other threads, the thread will steal a task from the most loaded thread. Thus as long as there remain enough tasks, no threads will be idle.

The original task stealing algorithm uses deques, but in our simpler case, the deque can be implicit. Task pushing and stealing can be achieved in a wait-free manner with the atomic fetch-and-add operation, making the scheduler highly scalable to many threads with little overhead.
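
The stealing scheme described above can be sketched in C roughly as follows. This is a minimal illustration using C11 atomics and POSIX threads, not the actual kt_for() source: the names sketch_for, sketch_pool_t, sketch_worker_t, sketch_steal and mark_done are made up for this example, and the callback takes a long index for convenience. Each worker owns an implicit queue that is just a cursor advanced by fetch-and-add, and the worker with the smallest cursor is treated as the most loaded one.

```c
#include <limits.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

struct sketch_pool;

typedef struct {
	struct sketch_pool *pool;
	atomic_long next; /* next unclaimed task index in this worker's implicit queue */
} sketch_worker_t;

typedef struct sketch_pool {
	int n_threads;
	long n_tasks;
	sketch_worker_t *w;
	void (*func)(void *data, long i, int tid);
	void *data;
} sketch_pool_t;

/* Claim one task from the worker with the smallest cursor, i.e. the one
 * with the most tasks left. Returns -1 if that queue is already empty. */
static long sketch_steal(sketch_pool_t *p)
{
	int j, victim = 0;
	long k, min = LONG_MAX;
	for (j = 0; j < p->n_threads; ++j) {
		long cur = atomic_load(&p->w[j].next);
		if (cur < min) min = cur, victim = j;
	}
	k = atomic_fetch_add(&p->w[victim].next, p->n_threads);
	return k < p->n_tasks ? k : -1;
}

static void *sketch_worker(void *arg)
{
	sketch_worker_t *w = (sketch_worker_t*)arg;
	sketch_pool_t *p = w->pool;
	int tid = (int)(w - p->w);
	long i;
	/* First drain our own queue: tasks tid, tid+m, tid+2m, ... */
	while ((i = atomic_fetch_add(&w->next, p->n_threads)) < p->n_tasks)
		p->func(p->data, i, tid);
	/* Then help the most loaded worker until nothing is left to steal */
	while ((i = sketch_steal(p)) >= 0)
		p->func(p->data, i, tid);
	return NULL;
}

/* Run func(data, i, tid) for every i in [0, n_tasks) on n_threads threads */
void sketch_for(int n_threads, void (*func)(void*, long, int), void *data, long n_tasks)
{
	int j;
	sketch_pool_t p = { n_threads, n_tasks, NULL, func, data };
	pthread_t *tid = malloc(n_threads * sizeof(pthread_t));
	p.w = malloc(n_threads * sizeof(sketch_worker_t));
	for (j = 0; j < n_threads; ++j) {
		p.w[j].pool = &p;
		atomic_init(&p.w[j].next, j); /* worker j starts at task j */
	}
	for (j = 0; j < n_threads; ++j)
		pthread_create(&tid[j], NULL, sketch_worker, &p.w[j]);
	for (j = 0; j < n_threads; ++j)
		pthread_join(tid[j], NULL);
	free(p.w); free(tid);
}

/* Hypothetical example task: count how many times item i was processed */
void mark_done(void *data, long i, int tid)
{
	(void)tid;
	((int*)data)[i] += 1;
}
```

Calling sketch_for(4, mark_done, hits, 1000) on a zeroed array of 1000 ints should leave every element equal to 1, because each index is handed out by exactly one fetch-and-add. Error handling for malloc() and pthread_create() is omitted to keep the sketch short.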

To evaluate the efficiency of kt_for(), I parallelize the loop at line 32 (the “for” loop over all pixels in main()) in the following code, which essentially computes the color of the Mandelbrot set on an 800×600 canvas:

#include <stdlib.h>

typedef struct {
	int max_iter, w, h;
	double xmin, xmax, ymin, ymax;
	int *k;
} global_t;

static void compute(void *_g, int i, int tid)
{
	global_t *g = (global_t*)_g;
	double x, x0 = g->xmin + (g->xmax - g->xmin) * (i%g->w) / g->w;
	double y, y0 = g->ymin + (g->ymax - g->ymin) * (i/g->w) / g->h;
	int k;
	x = x0, y = y0;
	for (k = 0; k < g->max_iter; ++k) {
		double z = x * y;
		x *= x; y *= y;
		if (x + y >= 4) break;
		x = x - y + x0;
		y = z + z + y0; 
	}
	g->k[i] = k;
}

int main(int argc, char *argv[])
{
	int i, tot, n_threads = 2;
	global_t global = { 10240*100, 800, 600, -2., -1.2, -1.2, 1.2, 0 };
	tot = global.w * global.h;
	global.k = calloc(tot, sizeof(int));
	for (i = 0; i < tot; ++i) compute(&global, i, 0);
	free(global.k);
	return 0;
}

The complete source code is at github. Here is the wall-clock time (gcc-4.7.2 and icc-13.1.3 on machine1; gcc-4.3.2 on machine2):

kt_run Cilk OpenMP
1 CPU, machine1, gcc 29.4
2 CPU, machine1, gcc 16.0 17.5
4 CPU, machine1, gcc 8.6
1 CPU, machine1, icc 26.8
2 CPU, machine1, icc 14.7 16.3
4 CPU, machine1, icc 8.3 9.5
1 CPU, machine2, gcc 60.9
2 CPU, machine2, gcc 30.7
31 CPU, machine2, gcc 2.2

On this example, kt_for() is faster than both Cilk+ and OpenMP, and also scales well to tens of CPU cores. Nonetheless, it should be noted that Cilk+ and OpenMP are much more versatile than my 50-line library; the microbenchmark may also over-emphasize the scheduler overhead. Please take the result with a grain of salt.

The Mandelbrot set is THE most popular example of a fractal. There are thousands of implementations that plot the Mandelbrot set in different languages using different techniques. I re-implemented the Mandelbrot set mainly for fun and for learning OpenGL, GLSL and HTML5 canvas. Here are the four implementations:

  1. glfractal.c in C and plain OpenGL. This program demonstrates how to use the basic OpenGL. It has an interesting feature of dumping the coordinates in a short HEX string such that we can come back to the position later – when you find a beautiful area in the Mandelbrot set, you can hardly find it again without this feature.
  2. glslfractal.c in C and GLSL using single-precision floating point numbers. This program is modified from the source code on this page. It runs much faster than the CPU-only version. However, you cannot zoom in too much due to the loss of precision.
  3. glslfractale.c in C and GLSL using emulated double-precision numbers. This program aims to alleviate the precision problem, but the result is not so satisfactory.
  4. HTML5+javascript (a live web page!). It also supports HEX dump/restore. You can click the following links to see a few areas I like. You may need to wait for a few seconds as your browser computes the image on the fly: image 1, image 2, image 3 and image 4. Enjoy.

There is also an unfinished WebGL implementation. It is supposed to be as fast as the C+GLSL version.

Most C programmers know that in a C struct, members have to be aligned in memory. Take the following struct as an example:

typedef struct {
  unsigned key;
  unsigned char val;
} UnpackedStruct;

The two members of this struct take 5 bytes in total. However, because the 4-byte “key” must stay aligned when such structs are stored in an array, the compiler pads the struct to a multiple of 4 bytes and “sizeof(UnpackedStruct)” returns 8. Three bytes are wasted per element. This waste of memory is the key reason why my khash library uses two separate arrays to keep keys and values, even though this leads to more cache misses.

Khash was initially written about 10 years ago when I was young and foolish. I later learned that with gcc/clang, it is possible to byte-pack the struct:

typedef struct {
  unsigned key;
  unsigned char val;
}  __attribute__ ((__packed__)) PackedStruct;

With this, “sizeof(PackedStruct)” returns 5. Then why does gcc not do this by default? Is it because unaligned memory access hurts performance? A Google search pointed me to this question on StackOverflow. There was a discussion, but no clear conclusion.
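
To see the padding directly, here is a small sketch that reports the size, alignment and trailing padding of both layouts. It assumes gcc or clang (for the packed attribute) and a C11 compiler (for _Alignof). The type and function names are made up for this example, and fixed-width types replace “unsigned” so that the member sizes do not depend on the platform.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Same members as UnpackedStruct above, with fixed-width types */
typedef struct {
	uint32_t key;
	uint8_t val;
} unpacked_sketch_t;

/* Same members, byte-packed with the gcc/clang attribute */
typedef struct {
	uint32_t key;
	uint8_t val;
} __attribute__ ((__packed__)) packed_sketch_t;

/* Number of bytes the compiler appends after the last member */
size_t unpacked_tail_padding(void)
{
	return sizeof(unpacked_sketch_t)
		- (offsetof(unpacked_sketch_t, val) + sizeof(uint8_t));
}

/* Print size and alignment of both variants */
void print_layout(void)
{
	printf("unpacked: size=%zu align=%zu tail padding=%zu\n",
		sizeof(unpacked_sketch_t), (size_t)_Alignof(unpacked_sketch_t),
		unpacked_tail_padding());
	printf("packed:   size=%zu align=%zu\n",
		sizeof(packed_sketch_t), (size_t)_Alignof(packed_sketch_t));
}
```

On a typical x86_64 build, the unpacked variant should report a size of 8 with 3 bytes of tail padding, and the packed variant a size of 5 with an alignment of 1.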

Hash tables have become the bottleneck of my recent work, so I decided to revisit the question: do packed structs hurt performance on x86_64 CPUs? As usual, I did a very simple benchmark: with khash, I insert/delete 50 million (uint32_t,uint8_t) integer pairs stored in either the packed or the unpacked struct shown above and see whether the performance differs. The following table shows the CPU time on my x86_64 laptop:

Key type        Value type  Size per elem (bytes)  CPU seconds
unsigned        uint8_t     5                      10.249
UnpackedStruct  N/A         8                      9.429
PackedStruct    N/A         5                      9.287

The table says it all: on x86_64 CPUs, a packed struct array does not hurt performance in comparison to an unpacked struct array, at least in this benchmark. With both gcc and clang, the packed struct is consistently faster, perhaps because it takes less space, which might help cache performance. The source code can be found here.

Finally, it should be noted that x86 CPUs have been optimized for unaligned memory access. On other CPUs, the results may be very different. Perhaps that is why gcc does not pack structs by default.

Introduction

About two years ago I evaluated the performance of ~20 compilers and interpreters on Sudoku solving, matrix multiplication, pattern matching and dictionary operations. Two years later, I decided to update a small part of the benchmark, the Sudoku solving. I chose this problem because it is practically and algorithmically interesting, and simple enough to be easily ported to multiple languages. Meanwhile, I am also adding two new programming languages: Mozilla’s Rust and Google’s Dart. They are probably the most promising languages announced in the past two years.

Results

In this small benchmark, I am implementing Sudoku solvers in multiple programming languages. The algorithm, adapted from Guenter Stertenbrink’s solver, was first implemented in C and then ported to other languages. The C source code briefly describes the method. For more information about Sudoku solving in general, please see my other post.

Before I show the results, there are a couple of caveats to note:

  • Solving Sudoku is NP-hard. The choice of the solving algorithm will dramatically affect the speed. For example, my Rust implementation is ~2500 times faster than the one in the Rust official repository. For a language benchmark, we must implement exactly the same algorithm.
  • I am mostly familiar with C but am pretty much a newbie in other programming languages. I am sure some implementations are not optimal. If you can improve the code, please send me a pull request. I am happy to replace with a better version.
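
To make the first caveat concrete, here is a naive backtracking solver of the kind a fast algorithm is usually compared against. It is NOT the algorithm used in this benchmark, only a baseline sketch. The function names are made up for this example, and the puzzle format (81 characters, with “.” or “0” for an empty cell) is an assumption.

```c
#include <string.h>

/* Return nonzero if digit d (1..9) can go at row r, column c of grid g.
 * g holds 81 cells in row-major order; 0 marks an empty cell. */
static int sk_can_place(const int g[81], int r, int c, int d)
{
	int i, br = r / 3 * 3, bc = c / 3 * 3;
	for (i = 0; i < 9; ++i) {
		if (g[r * 9 + i] == d) return 0;                     /* same row */
		if (g[i * 9 + c] == d) return 0;                     /* same column */
		if (g[(br + i / 3) * 9 + bc + i % 3] == d) return 0; /* same 3x3 box */
	}
	return 1;
}

/* Fill the empty cells from position pos onward by trial and error.
 * Returns 1 on success and 0 if no digit fits (the caller backtracks). */
static int sk_solve_from(int g[81], int pos)
{
	int d;
	while (pos < 81 && g[pos] != 0) ++pos; /* skip cells that are already filled */
	if (pos == 81) return 1;
	for (d = 1; d <= 9; ++d) {
		if (sk_can_place(g, pos / 9, pos % 9, d)) {
			g[pos] = d;
			if (sk_solve_from(g, pos + 1)) return 1;
		}
	}
	g[pos] = 0; /* undo before backtracking */
	return 0;
}

/* Solve an 81-character puzzle and write the 81-character solution plus a
 * terminating NUL to out. Returns 1 if a solution was found, 0 otherwise. */
int sk_solve(const char *puzzle, char out[82])
{
	int g[81], i;
	if (strlen(puzzle) != 81) return 0;
	for (i = 0; i < 81; ++i)
		g[i] = (puzzle[i] >= '1' && puzzle[i] <= '9') ? puzzle[i] - '0' : 0;
	if (!sk_solve_from(g, 0)) return 0;
	for (i = 0; i < 81; ++i) out[i] = (char)('0' + g[i]);
	out[81] = '\0';
	return 1;
}
```

This baseline tries cells in fixed order and does no constraint propagation, so on hard puzzles it may explore far more branches than a solver that always picks the most constrained choice first. That gap is why a language benchmark has to hold the algorithm fixed.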

The following table shows the CPU time for solving 20 hard Sudokus repeated 500 times (thus 10000 Sudokus in total). The programs, which are freely available, were compiled and run on my Mac laptop with a 2.66GHz Core i7 CPU.

Compiler/VM  Version            Language    Option          CPU time (sec)
clang        425.0.27 (3.2svn)  C           -O2             8.92
llvm-gcc     4.2.1              C           -O2             9.23
dmd          2.062              D2          -O -release     11.54
                                            -noboundscheck  11.47
rust         0.6                Rust        --opt-level 3   11.51
java         1.6.0_37           Java        -d64            11.57
go           1.1beta 20130406   Go          (default)       14.96
                                            -gcflags -B     13.78
dart         0.4.4.4-r20810     Dart                        21.42
v8           3.16.3             Javascript                  28.19
luajit       2.0.1              Lua                         30.66
pypy         2.0-beta-130405    Python                      44.29

In this small benchmark, C still takes the crown for speed. Other statically typed languages are about twice as slow, but Rust and D are very close to C. It is pretty amazing that Rust, as a new language, is that performant given that the developers have not put much effort into speed so far.

Among dynamically typed languages, Dart, V8 and LuaJIT are similar in speed, about 3 times as slow as C. A factor of 3 is arguably not much for many applications. I really hope that some day I can use a handy dynamically typed language for most programming. PyPy is slower here, but it is more than twice as fast as the version from two years ago.

Update

  • I forgot to use “-release” with dmd. The new result looks much better. Sorry for my mistake.
  • Mac ships gcc-4.2.1 only due to licensing issues. I have just tried both gcc 4.7.2 and gcc 4.8 from MacPorts. The executables compiled by them take 0.99 second to run, slower than gcc-4.2.1.
  • Updated to the latest Go compiled from the repository.
  • Updated the Python implementation (thanks to Rob Smallshire).
  • Updated the Dart implementation (thanks to jwendel).
  • Updated the Rust implementation (thanks to dotdash).
  • Made input 10 times larger to reduce the fraction of time spent on VM startup. Dart/V8/LuaJIT have short VM startup time, but Java is known to have a long startup.
  • Updated the Go implementation (thanks to Sébastien Paolacci).
  • Updated the Python implementation.