Archive for the ‘development’ Category

Array and hash table are probably the most important data structures. Some programming languages such as Perl, Lua and Javascript, almost build the language core on top of the two data structures. While array is straightforward to implement, hash table is not. This is why we have paid continuous efforts in improving the hash table performance. This blog post reviews recent techniques not commonly found in classical textbooks.

Open addressing vs. chaining

This is not an advanced topic at all, but it is worth emphasizing: for small keys, open addressing hash tables are consistently faster and smaller than a standard chaining based hash tables. C++11 requires std::unordered_map to use chaining, which means if you want an efficient hash table for lots of small keys, choose another library. Some of the techniques below are applied to open addressing only.

Secondary hash functions

A hash function is bad if it often maps distinct keys to the same bucket. A hash function can also be bad if it follows a pattern. One example is the identity hash function, which maps any integer to itself. When you insert N adjacent integers to the table, inserting an integer colliding with one of the existing numbers may trigger an O(N) operation, much slower than the expected O(1). To reduce the effect of such hash functions, we can introduce a second hash function that maps one integer to another more random one. This blog post recommends the following:

static inline uint64_t fibonacci_hash(uint64_t hash) {
    return hash * 11400714819323198485llu;

This belongs to the larger category of multiplicative hash functions. It is a good choice on modern CPUs that implement fast integer multiplications.

Using a secondary hash function is like a safe guard. When users choose good hash functions, this secondary function only wastes time, a little bit.

Caching hash values

When we use long strings as keys, comparing two keys may take significant time. This comparison is often unnecessary. Note that the hash of a string is a good summary of the string. If two strings are different, their hashes are often different. We can cache the hash and only compare two keys when their hashes are equal. It is possible to implement the idea with any hash table implementations. We only need to change the key type like

typedef struct {
  uint64_t hash;
  char *str;
} HashedStr;
#define hs_hash_func(a) ((a).hash)
#define hs_equal(a, b) ((a).hash == (b).hash && \
                        strcmp((a).str, (b).str) == 0)
static void hs_fill(HashedStr *p, const char *str) {
  p->str = strdup(str);
  p->hash = my_string_hash_func(p->str);

Writing all these in user’s code is a little complicated. Some hashtable libraries provide options to cache hashes inside the library. It is a handy feature.

Quadratic probing and power-of-2 table size

This is not an advanced technique, either, but it seems that not everyone knows the following. The textbook I used over 15 years ago mentioned that quadratic probing may never visit some cells. To see that, you can run this:

void main(void) {
  int i, b = 10, n = 1<<b, *c = (int*)calloc(n, sizeof(int));
  for (i = 0; i < n; ++i) {
    int x = i * i & (n - 1);
    if (c[x]++) printf("hit: %d\n", i);

You will see 852 "hit" lines. This means even if the table has empty slots, quadratic probing may not find a place to put a new element. The wiki said: “there is no guarantee of finding an empty cell once the table gets more than half full, or even before the table gets half full if the table size is not prime.”

If you go to that wiki page, you will find the phrase ahead of the quoted sequence is “With the exception of the triangular number case for a power-of-two-sized hash table”. This was added in 2012. By “triangular”, we mean to change line 4 above to:

    int x = i * (i + 1) / 2 & (n - 1);

When you run the program again, you won’t see any “hit” lines. You can find a proof here, which is in fact an exercise in Knuth’s book. In all, the “half-full limitation” is largely a myth.

Robin Hood & Hopscotch hashing

Robin Hood hashing and Hopscotch hashing can be considered as extensions to Cuckoo hashing. Different from traditional solutions to hash collisions, they may displace a key in the hash table if the probe length is too long.

In the words of wiki, with Robin Hood hashing, “a new key may displace a key already inserted, if its probe count is larger than that of the key at the current position”. It reduces the variance in searching keys and makes the table still efficient under a high load factor. Robin Hood hashing is gaining popularity. Several of the fastest hash table libraries, including Rust’s standard library, is using this strategy.

However, Robin Hood hashing is not universally better. First, insertion may be a little slower due to swaps of keys. Second, with an extra counter, each bucket is larger, which partly cancels the advantage under high load. In my benchmark, Robin Hood hashing is not obviously better on that particular task. A Google’s Abseil developer also commented that they tried Robin Hood hashing, but found it is not that impressive.

Hopscotch hashing generally follows a similar philosophy. I will not go into the very details. I just point out in my benchmark, this strategy is not clearly better, either (see this figure).

Swiss table

Swiss table is the name of Google’s new hash table absl::flat_hash_map and is explained in this video. It uses a meta-table to indicate if a bucket is empty or has been deleted before. khash.h uses a similar table, but Swiss table does it better: it uses two bits one bit to keep empty/deleted and six seven bits to cache hash values, such that most of time it can find the right bucket without querying the main bucket table. And because this meta-table is small (one byte per element), we can query 16 cells with a few SSE instructions.

I thought Swiss table could easily beat my khash.h at the cost of a little bit more memory. However, it doesn’t. I will look into this at some point.

Apparently inspired by the Swiss table, ska::bytell_hash_map also employes a one-byte-per-element meta-table, but instead of caching 6-bit of hash values, it uses the lower seven bits to calculate the distance to the next bucket (details remain unknown). This implementation achieves very good space-time balance.

Concluding remarks

There is not a universally best hash table library. Each library has to choose a balance between space and speed. I am yet to see a library that beats the rest in both aspects. As a matter of fact, there is probably not a fastest hash table library, either. Strategies fast at query may be slow at insertions; strategies fast for large keys may be overkilling for small keys.

However, some hash tables can be consistently faster and smaller than others. According to my recent evaluation, ska::flat_hash_map, ska::bytell_hash_map, tsl::robin_map and tsl::hopscotch_map are wise choices to C++11 programmers, at least for small keys. They are fast, standalone and relatively simple. Google’s absl::flat_hash_map is ok, but I thought it could be faster. Google’s dense_hash_map and my khash.h remain top options for C++98 and C, respectively.

Update: Swiss table caches 7 bits of hash in the meta-table, not 6 bits. Fixed a few typos.


Read Full Post »


Many command-line tools need to parse command-line arguments. In C, one of the most widely used functions for this purpose is getopt() and its GNU extension getopt_long(). However, these functions have two major issues. First, they are not portable. getopt is part of the POSIX standard but not the C standard; getopt_long is not part of any standards. In addition, getopt may behave differently depending on whether GNU extension is enabled. Using these functions can be tricky. Second, both functions rely on global variables, which may interfere with more complex use cases (e.g. sub-commands).

These limitations motivated the development of several other argument parsing libraries. While these libraries often have cleaner APIs and more functionality, most of them lack some getopt_long features. This blog post reviews several argument parsing libraries in C/C++ and introduces my own getopt replacement at the end.

Argument parsing libraries in C/C++

The following table lists common features in argument parsing libraries. Stars indicates getopt_long features.

Feature Explanation
post* Parse options after non-option/positional arguments
compact* When appropriate, “-a -b foo” can be written as “-abfoo”
mulocc* Keep track of an option occurring multiple times
order* Keep track of the order of options
oparg* A long option may optionally take an argument
type Built-in argument type checking and parsing
fmt Print formatted help messages
wchar Support multi-byte characters

The table below shows the feature sets of several command-line argument parsing libraries. Only libraries supporting both short and long options are considered (stars indicate 1- or 2-file libraries):

library lang post compact mulocc order oparg type fmt wchar
getopt_long C/C++ Y Y Y Y Y N N maybe
argh* C++11 semi N N N N N N ?
argp C/C++ Y Y Y Y ? N Y ?
argparse* C/C++ Y Y N N ? Y Y ?
args* C++11 Y Y Y N ? Y Y ?
argtable* C/C++ Y Y Y N ? Y Y ?
cxxopts* C++11 Y Y Y N ? Y Y ?
CLI11 C++11 Y Y switch N N Y Y ?
gopt* C/C++ Y Y switch N Y N N N
ketopt* C/C++ Y Y Y Y Y N N N
tclap C++ ? N N N ? Y Y ?

Notably, many libraries discard the relative order between options, arguably the least important getopt feature. They often add type checking and automatic help message formatting. I think type checking comes in handy, but message formatting is not as valuable because I prefer my own format over theirs.

The list in the table is of course incomplete. Some important ones that are missing include Boost’s Program_options and Google’s gflags, both of which are much heavier libraries. I haven’t spent enough time on them. If you have relevant information on them or your favorite library that is missing, or you think the table is wrong, please help me to improve it. Thanks in advance!

Ketopt: my single-header argument parsing library

I occasionally care about the order of options, a feature missing from most non-getopt libraries (argp has it but is not portable). In the end, I developed my own library ketopt (examples here, including one on sub-command). It is implemented in ANSI C and doesn’t invoke heap allocations. Ketopt has a similar API to getopt_long except that 1) ketopt doesn’t use any global variables and 2) ketopt has an explicit function argument to allow options placed after non-option arguments. Developers who are familiar with getopt_long should be able to learn ketopt quickly.


Command-line argument parsing is relatively simple (ketopt has <100 LOCs), but implementing it by yourself is tricky, in particular if you want to match the features in getopt_long. My ketopt is largely a portable getopt_long without global variables. In addition to mine, you may consider gopt in C. It is small, easy to use and supports key getopt_long features. For C++ programmers, cxxopts is a decent choice. It is feature rich, close to getopt_long, and has similar APIs to Boost’s Program_options and Python’s argparse.

I strongly discourage the use of libraries deviating too much from getopt (e.g. argh and tclap). Most end users expect getopt behaviors. When your tool acts differently, it will confuse users. Command-line interface is one the first things users experience. Please get it right.

Read Full Post »

Vector and matrix arithmetic (e.g. vector dot and matrix multiplication) are the basic to linear algebra and are also widely used in other fields such as deep learning. It is easy to implement vector/matrix arithmetic, but when performance is needed, we often resort to a highly optimized BLAS implementation, such as ATLAS and OpenBLAS. Are these libraries much faster than our own implementations? Is it worth introducing a dependency to BLAS if you only need basic vector/matrix arithmetic? The following post may give you some hints.


In this github repository, I implemented matrix multiplication in seven different ways, including a naive implementation, several optimized implementations with cache miss reduction, SSE and loop blocking, and two implementations on top of OpenBLAS. The following table shows the timing of multiplying two 2000×2000 or 4000×4000 random matrices on my personal Mac laptop and a remote linux server (please see the source code repo for details):







7.53 sec

188.85 sec

77.45 sec


6.66 sec

55.48 sec

9.73 sec
sdot w/o hints


6.66 sec

55.04 sec

9.70 sec
sdot with hints


2.41 sec

29.47 sec

2.92 sec
SSE sdot


1.36 sec

21.79 sec

2.92 sec
SSE+tiling sdot


1.11 sec

10.84 sec

1.90 sec
OpenBLAS sdot


2.69 sec

28.87 sec

5.61 sec
OpenBLAS sgemm


0.63 sec

4.91 sec

0.86 sec

7.43 sec

165.74 sec


0.61 sec

4.76 sec

You can see that a naive implementation of matrix multiplication is quite slow. Simply transposing the second matrix may greatly improve the performance when the second matrix does not fit to the CPU cache (the linux server has a 35MB cache, which can hold a 2000×2000 float matrix in cache, but not a 4000×4000 matrix). Transpose also enables vectorization of the inner loop. This leads to significant performance boost (SSE sdot vs Transposed). Loop blocking further reduces cache misses and timing for large matrices. However, OpenBLAS’ matrix multiplication (sgemm) is still the king of performance, twice as fast as my best hand-written implementation and tens of times faster than a naive implementation. OpenBLAS is fast mostly due to its advanced techniques to minimize cache misses.

As side notes, “sdot with hints” partially unrolls the inner loop. It gives a hint to the compiler that the loop may be vectorized. Clang on Mac can fully vectorize this loop, achieving the same speed of explicit vectorization. Gcc-4.4 seems not as good. The Intel compiler vectorizes the loop even without this hint (see the full table in README). Interestingly, the OpenBLAS sdot implementation is slower than my explicit vectorization on both Linux and Mac. I haven’t figured out the reason. I speculate that it may be related to cache optimization.

As to C++ libraries, Eigen has similar performance to OpenBLAS. The native uBLAS implementation in Boost is quite primitive, nearly as slow as the most naive implementation. Boost should ditch uBLAS. Even in the old days, it was badly implemented.


  • For multiplying two large matrices, sophisticated BLAS libraries, such as OpenBLAS, are tens of times faster than the most naive implementation.
  • With transposing, SSE (x86 only) and loop blocking, we can achieve half of the speed of OpenBLAS’ sgemm while still maintaining relatively simple code. If you want to avoid a BLAS dependency, this is the way to go.
  • For BLAS level-1 routines (vector arithmetic), an implementation with SSE vectorization may match or sometimes exceeds the performance of OpenBLAS.
  • If you prefer a C++ interface and are serious about performance, don’t use uBLAS; use Eigen instead.

Read Full Post »


About two years ago I evaluated the performance of ~20 compilers and interpreters on sudoku solving, matrix multiplication, pattern matching and dictionary operations. Two years later, I decide update a small part of the benchmark on Sudoku solving. I choose this problem because it is practically and algorithmically interesting, and simple enough to be easily ported to multiple languages. Meanwhile, I am also adding two new programming languages: Mozilla’s Rust and Google’s Dart. They are probably the most promising languages announced in the past two years.


In this small benchmark, I am implementing Sudoku solvers in multiple programming languages. The algorithm, adapted from Guenter Stertenbrink’s solver, was first implemented in C and then ported to other languages. The C source code briefly describes the method. For more information about Sudoku solving in general, please see my other post.

Before I show the results, there are a couple of caveats to note:

  • Solving Sudoku is NP-hard. The choice of the solving algorithm will dramatically affect the speed. For example, my Rust implementation is ~2500 times faster than the one in the Rust official repository. For a language benchmark, we must implement exactly the same algorithm.
  • I am mostly familiar with C but am pretty much a newbie in other programming languages. I am sure some implementations are not optimal. If you can improve the code, please send me a pull request. I am happy to replace with a better version.

The following table shows the CPU time for solving 20 hard Sudokus repeated 50 500 times (thus 1000 10000 Sudokus in total). The programs, which are freely available, are compiled and run on my Mac laptop with a 2.66GHz Core i7 CPU.





CPU time (sec)

425.0.27 (3.2svn)










-O -release




–opt-level 3






1.1beta 20130406


-gcflags -B













In this small benchmark, C still takes the crown of speed, Other statically typed languages are about twice as slow but Rust and D are very close to C. It is pretty amazing that Rust as a new language is that performant given the developers have not put too much efforts on speed so far.

Among dynamically typed languages, Dart, V8 and LuaJIT are similar in speed, about 3 times as slow as C. 3 times is arguably not much to many applications. I really hope some day I can use a handy dynamically typed language for most programming. Pypy is slower here, but it is more than twice as fast as the version two years ago.

Related resources


  • I forgot to use `-release’ with dmd. The new result looks much better. Sorry for my mistake.
  • Mac ships gcc-4.2.1 only due to licensing issues. I have just tried both gcc 4.7.2 and gcc 4.8 from MacPorts. The executables compiled by them take 0.99 second to run, slower than gcc-4.2.1.
  • Updated to the latest Go compiled from the repository.
  • Updated the Python implementation (thanks to Rob Smallshire).
  • Updated the Dart implementation (thanks to jwendel).
  • Updated the Rust implementation (thanks to dotdash).
  • Made input 10 times larger to reduce the fraction of time spent on VM startup. Dart/V8/LuaJIT have short VM startup time, but Java is known to have a long startup.
  • Updated the Go implementation (thanks to Sébastien Paolacci).
  • Updated the Python implementation.

Read Full Post »

Weekend project: K8 revived

Around a weekend two years ago, I wrote a Javascript shell, K8, based on Google’s V8 Javascript engine. It aimed to provide basic file I/O that was surprisingly lacking from nearly all Javascript shells that time. I have spent little time on that project since then. K8 is not compatible with the latest V8 any more.

Two years later, the situation of Javascript shells has not been changed much. Most of them, including Dart, still lack usable file I/O for general-purpose text processing, one of the most fundamental functionality in other programming languages from the low-level C to Java/D to the high-level Perl/Python. Web developers seem to follow a distinct programming paradigm in comparison to typical Unix programmers and programmers in my field.

This weekend, I revived K8, partly as an exercise and partly as my response to the appropriate file I/O APIs in Javascript. K8 is written in a 600-line C++ file. It is much smaller than other JS shells, but it provides features that I need most but lack from Javascript and other JS shells. You can find the docuemtation from K8 github. I will only show an example:

var x = new Bytes(), y = new Bytes();
x.set('foo'); x.set([0x20,0x20]); x.set('bar'); x.set('F', 0); x[3]=0x2c;
print(x.toString())   // output: 'Foo, bar'
y.set('BAR'); x.set(y, 5)
print(x)              // output: 'Foo, BAR'
x.destroy(); y.destroy()

if (arguments.length) { // read and print file
  var x = new Bytes(), s = new iStream(new File(arguments[0]));
  while (s.readline(x) >= 0) print(x)
  s.close(); x.destroy();

Read Full Post »

I have played with Dart a little bit. Although overall the language is interesting and full of potentials, it indeed has some rough edges.

1) Flawed comma operator.

main() {
	int i = 1, j = 2;
	i = 2, j = 3;

Dart will accept line 2 but report a syntax error at line 3. In C and Java, the line is perfectly legitimate.

2) Non-zero integers are different from “true”.

main() {
	if (1) print("true");
	else print("false");

The above program will output “false”, which will surprise most C/Java/Lua/Perl programmers.

3) No “real” dynamic arrays.

main() {
	var a = [];
	a[0] = 1;

Dart will report a run-time error at line 3. Most scripting languages will automatically expand an array. I know disabling this feature helps to prevent errors, but I always feel it is very inconvenient.

4) No easy ways to declare a constant-sized array. As Dart does not automatically expand arrays, to declare an array of size 10, you have to do this:

main() {
	var a = new List(10);

It is more verbose than “int a[10]” in C.

5) No on-stack replacement (OSR). I discussed this point in my last post, but I still feel it is necessary to emphasize again: if you do not know well how Dart works or are not careful enough, the bottleneck of your code may be interpreted but not compiled and then you will experience bad performance. The Dart developers argued that Dart is tuned for real-world performance, but in my view, if a language does not work well with micro-benchmarks, it has a higher chance to deliever bad performance in larger applications.

6) Lack of C/Perl-like file reading. The following is the Dart way to read a file by line:

main() {
	List<String> argv = new Options().arguments;
	var fp = new StringInputStream(new File(argv[0]).openInputStream(), Encoding.ASCII);
	fp.onLine = () {

Note that you have to use callback to achieve that. This is firstly verbose in comparison to other scripting languages and more importantly, it is very awkward to work with multiple files at the same time. I discussed the motivation of the design with Dart developers and buy their argument that such APIs are useful for event driven server applications. However, for most programmers in my field whose routine work is text processing for huge files, lack of C-like I/O is a showstopper. Although the Dart developers pointed out that openSync() and readSyncList() are closer to C APIs, openSync() does not work on STDIN (and Dart’s built-in STDIN still relies on callback). APIs are also significantly are lacking. For example, dart:io provides APIs to read the entire file as lines, but no APIs to read a single line. In my field, the golden rule is to always avoid reading the entire file into memory. This is probably the recommendation for most text processing.

On file I/O, Dart is not alone. Node.js and most shells built upon Javascript barely provides usable file I/O APIs. It is clear that these developers do not understand the needs of a large fraction of UNIX developers and bioinformatics developers like me.

Summary: Generally, Dart is designed for web development and for server-side applications. The design of Dart (largely the design of APIs) does not fit well for other applications. In principle, nothing stops Dart becoming a widely used general-purpose programming language like Python, but in this trend, it will only be a replacement, at the best, of javascript, or perhaps more precisely, node.js. At the same time, I admit I know little about server-side programming. It would be really good if different camps of programmers work together to come to a really great programming language. Dart is not there, at least not yet.

Read Full Post »

Dart: revisiting matrix multiplication

First of all, I know little about JIT and VM. Some of what I said below may well be wrong, so read this blog post with a grain of salt.

My previous microbenchmark showed that dart is unable to optimize the code to achieve comparable speed to LuaJIT. Vyacheslav Egorov commented that the key reason is that I have not “warmed up” the code. John McCutchan further wrote an article about how to perform microbenchmarks, also emphasizing the importance of warm-up. After reading a thread on the Dart mailing list, I know better about the importance of warm-up now.

If I am right, JIT can be classified as more traditional method JIT whereby the VM compiles a method/function to machine code at a time, and tracing JIT whereby the VM may optimize a single loop. There is a long discussion on Lambda the Ultimate about them. Typically, method JIT needs to identify hot functions and then compile AFTER the method call is finished. What I did not know previously is On Stack Replacement (OSR), with which we are able to compile a method (or part of the method?) to machine code while it is running. This in some way blurs the boundary between method JIT and tracing JIT.

Among popular JIT implementations, V8 and Java use method JIT with OSR, while Pypy and LuaJIT use tracing JIT. They are all able to perform well for matrix multiplication even if the hot method is called only once. In my previous post, Dart has bad performance because it uses method JIT but without OSR. It is unable to optimize the hot function while it is being executed. The Dart development team argued that the lack of OSR is because implementing OSR is complicated and “experience with Javascript and Java programs has shown that it very rarely benefits real applications.”

I hold the opposite opinion, strongly. There is no clear distinction between benchmarks and real applications. It is true that in web development, a program rarely spends more than a few seconds in a function called only once, but there are more real applications than web. In my daily work, I may need to do Smith-Waterman alignment between two long sequences or to compute the first few eigenvalues of a huge positive matrix. The core functions will be called only once. I have also written many one-off scripts having only a main function. Without OSR, Dart won’t perform better than Perl/Python, either, I guess. If the Dart development team want Dart to be widely adopted in addition to web development, OSR will be a key feature (well, a general-purpose language may not be the goal of Dart, which would be a pity!). I wholeheartedly hope they can implement OSR in future.

Fortunately, before OSR gets implemented in Dart (if ever), there is a simpler and more practical solution than warm-up: hoisting the content in the hot loop into a function to allow Dart compiles that function to machine code after it is called for a few times (though to do this, you need to know which loop is hot).

At the end of the post is an updated implementation of matrix multiplication, where “mat_mul1()” and “mat_mul2()” have the same functionality but differ in the use of function. The new implementation (mat_mul2) multiplies two 500×500 matrices in 1.0 second, as opposed to 14 seconds by the old one (mat_mul1). This is still much slower than LuaJIT (0.2 second) and V8 (0.3 second), but I would expect Dart to catch up in the future. Actually Vyacheslav commented that a nightly build might have already achieved or approached that.

SUMMARY: Dart as of now only compiles an entire method to machine code, but it cannot compile the method while it is running. Therefore, if the hot method is called only once, it will not be compiled and you will experience bad performance. An effective solution is to hoist the content of the hot loop to a separate function such that Dart can compile the function after it is executed a few times.

	int m = a.length, n = a[0].length; // m rows and n cols
	var b = new List(n);
	for (int j = 0; j < n; ++j) b[j] = new List<double>(m);
	for (int i = 0; i < m; ++i)
		for (int j = 0; j < n; ++j)
			b[j][i] = a[i][j];
	return b;

mat_mul1(a, b)
	int m = a.length, n = a[0].length, s = b.length, t = b[0].length;
	if (n != s) return null;
	var x = new List(m), c = mat_transpose(b);
	for (int i = 0; i < m; ++i) {
		x[i] = new List<double>(t);
		for (int j = 0; j < t; ++j) {
			double sum = 0.0;
			for (int k = 0; k < n; ++k) sum += a[i][k] * c[j][k];
			x[i][j] = sum;
	return x;

mat_mul2(a, b)
	inner_loop(t, n, ai, c)
		var xi = new List<double>(t);
		for (int j = 0; j < t; ++j) {
			double sum = 0.0;
			for (int k = 0; k < n; ++k) sum += ai[k] * c[j][k];
			xi[j] = sum;
		return xi;

	int m = a.length, n = a[0].length, s = b.length, t = b[0].length;
	if (n != s) return null;
	var x = new List(m), c = mat_transpose(b);
	for (int i = 0; i < m; ++i)
		x[i] = inner_loop(t, n, a[i], c);
	return x;

mat_gen(int n)
	var a = new List(n);
	double t = 1.0 / n / n;
	for (int i = 0; i < n; ++i) {
		a[i] = new List<double>(n);
		for (int j = 0; j < n; ++j)
			a[i][j] = t * (i - j) * (i + j);
	return a;

	int n = 500;
	var a = mat_gen(n), b = mat_gen(n);
	var c = mat_mul2(a, b);

Read Full Post »

Older Posts »