A circular doubly linked list (cdlist in brief) is a doubly linked list where the last node links back to the first node. An intrusive cdlist doesn’t invoke heap allocation (aka malloc) in the library code. The Linux kernel famously implements an intrusive cdlist. That implementation is quite long, which hides the rationale behind the code and can make it hard to understand even with well-written explanations (e.g. this).

This article gives a much simpler implementation of a basic intrusive cdlist with only push() and pop() operations. The library code, named “cdlist.h”, consists of fewer than 30 lines of code:

#pragma once // or use the #ifndef guard
#include <stddef.h> // for offsetof()

typedef struct cl_head_s {
	struct cl_head_s *prev, *next;
} cl_head_t;

static inline void cl_push(cl_head_t **h, cl_head_t *p, int push_back) {
	if (*h) p->prev = *h, p->next = (*h)->next, (*h)->next = p, p->next->prev = p;
	else *h = p, p->prev = p->next = p;
	if (push_back) *h = p;
}

static inline cl_head_t *cl_pop(cl_head_t **h, int pop_back) {
	cl_head_t *p, *q;
	if (*h == 0) return 0;
	p = pop_back? *h : (*h)->next, q = p->prev;
	if (p == q) *h = 0;
	else q->next = p->next, q->next->prev = q, *h = q;
	return p;
}

// Given a pointer to a struct member, get the pointer to the struct
#define cl_container_of(ptr, type, mb) ((type*)((char*)(ptr) - offsetof(type, mb)))

This header only implements the topology of a cdlist, but doesn’t specify how to store data. The following “test.c” shows how to use this library:

#include <stdlib.h> // for malloc()
#include <stdio.h>  // for printf()
#include "cdlist.h"

typedef struct { int x; cl_head_t head; } my_elem_t;

static inline my_elem_t *my_elem_create(int x) {
	my_elem_t *p = (my_elem_t*)malloc(sizeof(*p));
	p->x = x;
	return p;
}

int main(void) {
	cl_head_t *head = 0;
	cl_push(&head, &my_elem_create(3)->head, 1);
	cl_push(&head, &my_elem_create(4)->head, 1);
	cl_push(&head, &my_elem_create(2)->head, 0);
	cl_push(&head, &my_elem_create(5)->head, 1);
	cl_push(&head, &my_elem_create(1)->head, 0);
	while (head) {
		cl_head_t *p = cl_pop(&head, 1);
		my_elem_t *q = cl_container_of(p, my_elem_t, head);
		printf("out: %d\n", q->x);
		free(q); // user code manages memory
	}
	return 0;
}

Line 5 defines the struct that holds data. It has a “cl_head_t” member variable: the cdlist library “intrudes” on the definition of user data types. Line 14 initializes an empty cdlist, which is simply a NULL pointer. Lines 15–19 add data to the list. Notably, we are adding pointers to the “cl_head_t” member variable, not pointers to the data, so we end up with a list of “cl_head_t” objects. Line 22 gets a pointer to “my_elem_t” from a pointer to “cl_head_t”. The trick here is that the offset between a struct pointer and a pointer to its member is fixed and can be computed by the offsetof macro.

With this intrusive list, users have to take care of memory allocation. The advantage is this is more flexible. Users may allocate “my_elem_t” objects from heap or stack or freely choose their own allocators. The downside is intrusive lists are harder to use, as users have to manage memory by themselves. To me, this flexibility is more important than convenience. I generally recommend intrusive lists over non-intrusive ones.
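To make the flexibility concrete, here is a minimal sketch that keeps all list nodes on the stack and walks the list without popping. cl_push() and cl_container_of() are copied from “cdlist.h” above so that the block is self-contained; my_sum() and my_stack_demo() are hypothetical helpers that are not part of the library.

```c
#include <stddef.h> // for offsetof()

typedef struct cl_head_s {
    struct cl_head_s *prev, *next;
} cl_head_t;

// same logic as cl_push() in "cdlist.h"; *h points to the last node
static inline void cl_push(cl_head_t **h, cl_head_t *p, int push_back) {
    if (*h) p->prev = *h, p->next = (*h)->next, (*h)->next = p, p->next->prev = p;
    else *h = p, p->prev = p->next = p;
    if (push_back) *h = p;
}

#define cl_container_of(ptr, type, mb) ((type*)((char*)(ptr) - offsetof(type, mb)))

typedef struct { int x; cl_head_t head; } my_elem_t;

// hypothetical helper: sum x from front to back without modifying the list
static int my_sum(const cl_head_t *h) {
    int sum = 0;
    const cl_head_t *p;
    if (h == 0) return 0;          // an empty list is a NULL pointer
    p = h->next;                   // the first node follows the last one
    do {
        sum += cl_container_of(p, my_elem_t, head)->x;
        p = p->next;
    } while (p != h->next);        // stop after one full circle
    return sum;
}

// hypothetical demo: all nodes live on the stack; no malloc()/free() involved
static int my_stack_demo(void) {
    my_elem_t a[3];
    cl_head_t *head = 0;
    int i;
    for (i = 0; i < 3; ++i) {
        a[i].x = i + 1;
        cl_push(&head, &a[i].head, 1);
    }
    return my_sum(head);           // 1 + 2 + 3
}
```

The nodes must outlive the list, so a stack-allocated list like this one should not escape the function that owns the array.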

TL;DR: With linear probing, we can delete elements from an open addressing hash table without tombstones. Here are the C and the C++ implementations.


When implementing a hash table based on open addressing, we usually set a tombstone for each deleted element, which indicates a bucket used to have an element. These tombstones maintain proper probe chains in the presence of hash collisions. They are critical to the correctness of open-addressing hash tables. My khash.h and most hash table implementations use tombstones.

However, the tombstone strategy is problematic. These tombstones waste memory and may increase the frequency of expensive rehashing. To alleviate these adverse effects, the Facebook F14 hash table implements reference-counted tombstones. Such tombstones are removed from the hash table when they are at the end of probe chains. Note that F14 still uses tombstones. It just removes them more effectively.

Deletions without tombstones for linear probing

For a long time, I thought tombstones were inevitable. I was wrong. A recent reddit post pointed out that the wiki linear probing page has already offered a no-tombstone solution for years. The basic idea is not complex: when we delete an element, we move the next element to the deleted location if it was pushed away from there by a hash collision; we repeat this process until we come to an empty bucket. In C++, this algorithm only takes ~10 lines.

The current wiki page describes the idea well, but it is incomplete: you can’t implement the algorithm with the description there. Fortunately, a Google search directed me to an old StackOverflow answer which gives the correct pseudocode. Unlike F14 or robin-hood hashing, this algorithm doesn’t require any additional storage, not even a single bit.
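The idea can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the code in khashl: the table has a fixed size of 16 buckets and never resizes, keys are 32-bit integers, and all names (tb_t, tb_home, tb_get, tb_put, tb_del) are hypothetical. One bucket is always kept empty so that every probe terminates.

```c
#include <stdint.h>

#define TB_BITS 4
#define TB_SIZE (1u << TB_BITS)  // 16 buckets, fixed
#define TB_MASK (TB_SIZE - 1)

typedef struct {
    uint32_t key[TB_SIZE];
    uint8_t used[TB_SIZE];       // 1 if the bucket holds a key; there is no "deleted" state
    unsigned n;                  // number of keys in the table
} tb_t;

static inline unsigned tb_home(uint32_t key) { // home bucket of a key
    return (key * 2654435769u) >> (32 - TB_BITS);
}

// return the bucket holding key, or TB_SIZE if the key is absent
static unsigned tb_get(const tb_t *t, uint32_t key) {
    unsigned i;
    for (i = tb_home(key); t->used[i]; i = (i + 1) & TB_MASK)
        if (t->key[i] == key) return i;
    return TB_SIZE;
}

// insert key; return 0 if the table is too full (this sketch never resizes)
static int tb_put(tb_t *t, uint32_t key) {
    unsigned i;
    if (tb_get(t, key) != TB_SIZE) return 1;   // already present
    if (t->n == TB_SIZE - 1) return 0;         // keep one bucket empty
    for (i = tb_home(key); t->used[i]; i = (i + 1) & TB_MASK) {}
    t->key[i] = key, t->used[i] = 1, ++t->n;
    return 1;
}

// delete key without leaving a tombstone; return 0 if the key is absent
static int tb_del(tb_t *t, uint32_t key) {
    unsigned i = tb_get(t, key), j, k;
    if (i == TB_SIZE) return 0;
    // i is the hole; scan the rest of the probe chain up to the next empty bucket
    for (j = (i + 1) & TB_MASK; t->used[j]; j = (j + 1) & TB_MASK) {
        k = tb_home(t->key[j]);
        // key[j] stays if its home k lies cyclically in (i, j]:
        // a lookup starting at k would then never pass through the hole
        if (i < j ? (i < k && k <= j) : (i < k || k <= j)) continue;
        t->key[i] = t->key[j], i = j;          // fill the hole; the hole moves to j
    }
    t->used[i] = 0, --t->n;
    return 1;
}
```

A table is initialized by zeroing it, e.g. `tb_t t = {{0}, {0}, 0};`. After any sequence of tb_put() and tb_del() calls, every remaining key is still reachable by plain linear probing from its home bucket.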


I implemented the algorithm in a new experimental hash table library khashl.h along with its C++ version khashl.hpp. I started to use Fibonacci hashing and optional hash value caching as described in my earlier post. It uses one bit per bucket to indicate whether the bucket is empty.


I modified my earlier benchmark to evaluate deletions. The new benchmark feeds each hash table library a list of random integers. We insert an integer if it is absent from the table; we delete an integer if it is already in the table. For the largest dataset, there are 50 million input integers but there are only 6.2 million integers left in the final table. There are plenty of deletions.

The timing and memory of several hash table libraries are shown below:

In the figure, each library is associated with six points corresponding to 10, 18, 26, 34, 42 and 50 million input integers. The red circle line shows khashl, the new implementation. It has a lower memory footprint across the board.

Interestingly, khashl is slower than my older khash.h. This may be caused by a combination of two effects. First, due to the presence of tombstones, khash.h has to double the number of buckets, resulting in fewer collisions. It implicitly trades memory for speed. Second, khashl may need to move multiple elements upon a deletion. Copying buckets can be slow. That said, the new deletion algorithm is only a little slower than khash.h and is faster than many other libraries for this particular task. It might also become faster than khash under a different payload (e.g. large batch of deletions). In addition, khashl has simpler insertion. It is faster than khash even if no deletions are involved.


Considering the clear advantage in memory, I think the new deletion algorithm without tombstones is overall better than traditional algorithms. It should become the preferred way to implement hash tables. I am surprised that I only found this algorithm a couple of days ago.

Appendix: comments on other libraries

  • I am only showing fast libraries in the plot. Adding slow libraries will squeeze all the current points into a corner, making them harder to see.
  • Many libraries are also evaluated in another benchmark.
  • Abseil/absl hash map was not very impressive in my earlier benchmark, but the recent version seems better.
  • phmap is largely a more portable version of older Abseil hash map. It is not as fast as Abseil nowadays.
  • Consistent with the other benchmark, emilib is the fastest on inserting 32-bit integers. It implements a relatively simple hash table with linear probing. Emilib is faster than khashl possibly because 1) it uses a small load factor of 67% (vs 75% with khashl) and 2) it puts the empty/deleted bits inside each bucket, which may help cache efficiency. Emilib is very slow on deletions. I am not sure why.

Update on 2019-12-28: added absl::flat_hash_map and removed the rant about Abseil.

Arrays and hash tables are probably the most important data structures. Some programming languages, such as Perl, Lua and JavaScript, almost build the language core on top of these two data structures. While an array is straightforward to implement, a hash table is not. This is why we have made continuous efforts to improve hash table performance. This blog post reviews recent techniques not commonly found in classical textbooks.

Open addressing vs. chaining

This is not an advanced topic at all, but it is worth emphasizing: for small keys, open addressing hash tables are consistently faster and smaller than standard chaining-based hash tables. C++11 requires std::unordered_map to use chaining, which means if you want an efficient hash table for lots of small keys, choose another library. Some of the techniques below apply to open addressing only.

Secondary hash functions

A hash function is bad if it often maps distinct keys to the same bucket. A hash function can also be bad if it follows a pattern. One example is the identity hash function, which maps any integer to itself. When you insert N adjacent integers to the table, inserting an integer colliding with one of the existing numbers may trigger an O(N) operation, much slower than the expected O(1). To reduce the effect of such hash functions, we can introduce a second hash function that maps one integer to another more random one. This blog post recommends the following:

static inline uint64_t fibonacci_hash(uint64_t hash) {
    return hash * 11400714819323198485llu;
}

This belongs to the larger category of multiplicative hash functions. It is a good choice on modern CPUs that implement fast integer multiplications.

Using a secondary hash function is like a safeguard. When users choose good hash functions, this secondary function only wastes a little bit of time.
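For a table with a power-of-2 number of buckets, a multiplicative hash is usually turned into a bucket index by keeping the high bits of the product, because the low bits of a product depend only on the low bits of the input. The sketch below shows this; fib_bucket() is a hypothetical helper name, and the assumption is that the table has 2^bits buckets with 1 <= bits <= 63.

```c
#include <stdint.h>

// same multiplier as above: 2^64 divided by the golden ratio
static inline uint64_t fibonacci_hash(uint64_t hash) {
    return hash * 11400714819323198485llu;
}

// hypothetical helper: map a hash to one of 2^bits buckets (1 <= bits <= 63)
// by keeping the top "bits" bits of the 64-bit product
static inline uint64_t fib_bucket(uint64_t hash, int bits) {
    return fibonacci_hash(hash) >> (64 - bits);
}
```

With bits = 3, the adjacent inputs 1 and 2 land in buckets 4 and 1 respectively, instead of in neighboring buckets as they would with the identity hash.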

Caching hash values

When we use long strings as keys, comparing two keys may take significant time. This comparison is often unnecessary. Note that the hash of a string is a good summary of the string. If two strings are different, their hashes are often different. We can cache the hash and only compare two keys when their hashes are equal. It is possible to implement the idea with any hash table implementations. We only need to change the key type like

typedef struct {
  uint64_t hash;
  char *str;
} HashedStr;
#define hs_hash_func(a) ((a).hash)
#define hs_equal(a, b) ((a).hash == (b).hash && \
                        strcmp((a).str, (b).str) == 0)
static void hs_fill(HashedStr *p, const char *str) {
  p->str = strdup(str);
  p->hash = my_string_hash_func(p->str);
}

Writing all these in user’s code is a little complicated. Some hashtable libraries provide options to cache hashes inside the library. It is a handy feature.

Quadratic probing and power-of-2 table size

This is not an advanced technique, either, but it seems that not everyone knows the following. The textbook I used over 15 years ago mentioned that quadratic probing may never visit some cells. To see that, you can run this:

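#include <stdio.h>
#include <stdlib.h>

int main(void) {
  int i, b = 10, n = 1<<b, *c = (int*)calloc(n, sizeof(int));
  for (i = 0; i < n; ++i) {
    int x = i * i & (n - 1); // slot visited by the i-th quadratic probe
    if (c[x]++) printf("hit: %d\n", i);
  }
  free(c);
  return 0;
}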
You will see 852 "hit" lines. This means even if the table has empty slots, quadratic probing may not find a place to put a new element. The wiki said: “there is no guarantee of finding an empty cell once the table gets more than half full, or even before the table gets half full if the table size is not prime.”

If you go to that wiki page, you will find the phrase ahead of the quoted sentence is “With the exception of the triangular number case for a power-of-two-sized hash table”. This was added in 2012. By “triangular”, we mean to change the line computing x above to:

    int x = i * (i + 1) / 2 & (n - 1);

When you run the program again, you won’t see any “hit” lines. You can find a proof here, which is in fact an exercise in Knuth’s book. In all, the “half-full limitation” is largely a myth.
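The two probing schemes can be compared directly by counting how many distinct slots each one visits. The following is a small sketch with hypothetical helper names: count_visited() repeats the experiment above for either scheme, and next_slot() shows how triangular probing is typically computed incrementally in a lookup loop, by growing the step by one at each probe.

```c
#include <stdlib.h>

// count distinct slots visited by n probes in a table of n = 2^b slots (b <= 15);
// triangular != 0 uses offsets i*(i+1)/2, otherwise i*i
static int count_visited(int b, int triangular) {
    int i, cnt = 0, n = 1 << b;
    unsigned char *seen = (unsigned char*)calloc(n, 1);
    if (seen == 0) return -1;
    for (i = 0; i < n; ++i) {
        int x = (triangular? i * (i + 1) / 2 : i * i) & (n - 1);
        if (!seen[x]) seen[x] = 1, ++cnt;
    }
    free(seen);
    return cnt;
}

// incremental form of triangular probing: offsets 1, 3, 6, 10, ... from the start
static inline unsigned next_slot(unsigned slot, unsigned *step, unsigned mask) {
    return (slot + ++*step) & mask;
}
```

For b = 10, count_visited(10, 0) returns 172, i.e. 1024 minus the 852 hits reported above, while count_visited(10, 1) returns 1024: triangular probing reaches every slot.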

Robin Hood & Hopscotch hashing

Robin Hood hashing and Hopscotch hashing can be considered as extensions to Cuckoo hashing. Different from traditional solutions to hash collisions, they may displace a key in the hash table if the probe length is too long.

In the words of wiki, with Robin Hood hashing, “a new key may displace a key already inserted, if its probe count is larger than that of the key at the current position”. It reduces the variance in searching keys and keeps the table efficient under a high load factor. Robin Hood hashing is gaining popularity. Several of the fastest hash table libraries, including Rust’s standard library, are using this strategy.

However, Robin Hood hashing is not universally better. First, insertion may be a little slower due to swaps of keys. Second, with an extra counter, each bucket is larger, which partly cancels the advantage under high load. In my benchmark, Robin Hood hashing is not obviously better on that particular task. A developer of Google’s Abseil also commented that they tried Robin Hood hashing, but found it not that impressive.
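The displacement rule quoted above, and the extra per-bucket counter it needs, can be sketched as follows. This is a minimal illustration and not the code of any library mentioned here: the table has a fixed 16 buckets, supports insertion and lookup only, and all names (rh_bucket_t, rh_home, rh_put, rh_has) are hypothetical.

```c
#include <stdint.h>

#define RH_BITS 4
#define RH_SIZE (1u << RH_BITS)  // 16 buckets, fixed
#define RH_MASK (RH_SIZE - 1)

typedef struct {
    uint32_t key;
    uint8_t dist; // probe count + 1; 0 marks an empty bucket (the extra counter)
} rh_bucket_t;

static inline unsigned rh_home(uint32_t key) {
    return (key * 2654435769u) >> (32 - RH_BITS);
}

// insert a key that is assumed absent into a table that is assumed not full
static void rh_put(rh_bucket_t *t, uint32_t key) {
    unsigned i = rh_home(key);
    uint8_t dist = 1;
    for (;; i = (i + 1) & RH_MASK, ++dist) {
        if (t[i].dist == 0) {            // empty bucket: place the key here
            t[i].key = key, t[i].dist = dist;
            return;
        }
        if (t[i].dist < dist) {          // resident is closer to its home: displace it
            uint32_t tk = t[i].key;
            uint8_t td = t[i].dist;
            t[i].key = key, t[i].dist = dist;
            key = tk, dist = td;         // keep going with the displaced key
        }
    }
}

// lookup stops early once the resident is closer to home than we are
static int rh_has(const rh_bucket_t *t, uint32_t key) {
    unsigned i = rh_home(key);
    uint8_t dist = 1;
    for (; t[i].dist >= dist; i = (i + 1) & RH_MASK, ++dist)
        if (t[i].key == key) return 1;
    return 0;
}
```

The early exit in rh_has() is where the reduced variance pays off: an unsuccessful lookup does not need to reach an empty bucket.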

Hopscotch hashing generally follows a similar philosophy. I will not go into the details. I will just point out that in my benchmark, this strategy is not clearly better, either (see this figure).

Swiss table

Swiss table is the name of Google’s new hash table absl::flat_hash_map and is explained in this video. It uses a meta-table to indicate if a bucket is empty or has been deleted before. khash.h uses a similar table, but Swiss table does it better: it uses one bit to keep empty/deleted and seven bits to cache hash values, such that most of the time it can find the right bucket without querying the main bucket table. And because this meta-table is small (one byte per element), we can query 16 cells with a few SSE instructions.
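The meta-table lookup can be illustrated with a portable scalar sketch. The byte encoding below (high bit set for empty or deleted, low seven bits of the hash otherwise) and all names are assumptions made for illustration; they are not taken from the Abseil source. A real implementation would replace the loop in ctrl_match() with a 16-byte SSE2 comparison followed by a move-mask.

```c
#include <stdint.h>

#define CTRL_EMPTY   0x80 // assumed encoding: high bit set, bucket never used
#define CTRL_DELETED 0xFE // assumed encoding: high bit set, bucket was deleted

// the cached part of a hash: seven bits, so the high bit of a full byte is 0
static inline uint8_t ctrl_h2(uint64_t hash) {
    return (uint8_t)(hash & 0x7f);
}

// scalar stand-in for the SIMD compare: bit i of the result is set
// if metadata byte i equals h2, i.e. bucket i is a candidate for the key
static inline uint16_t ctrl_match(const uint8_t ctrl[16], uint8_t h2) {
    uint16_t mask = 0;
    int i;
    for (i = 0; i < 16; ++i)
        if (ctrl[i] == h2) mask |= (uint16_t)(1u << i);
    return mask;
}
```

Only the buckets flagged in the returned mask need a full key comparison in the main table. Since empty and deleted bytes have the high bit set, they can never match a seven-bit hash fragment.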

I thought Swiss table could easily beat my khash.h at the cost of a little bit more memory. However, it doesn’t. I will look into this at some point.

Apparently inspired by the Swiss table, ska::bytell_hash_map also employs a one-byte-per-element meta-table, but instead of caching 7 bits of hash values, it uses the lower seven bits to calculate the distance to the next bucket (details remain unknown). This implementation achieves a very good space-time balance.

Concluding remarks

There is not a universally best hash table library. Each library has to choose a balance between space and speed. I have yet to see a library that beats the rest in both aspects. As a matter of fact, there is probably not a fastest hash table library, either. Strategies fast at queries may be slow at insertions; strategies fast for large keys may be overkill for small keys.

However, some hash tables can be consistently faster and smaller than others. According to my recent evaluation, ska::flat_hash_map, ska::bytell_hash_map, tsl::robin_map and tsl::hopscotch_map are wise choices for C++11 programmers, at least for small keys. They are fast, standalone and relatively simple. Google’s absl::flat_hash_map is ok, but I thought it could be faster. Google’s dense_hash_map and my khash.h remain top options for C++98 and C, respectively.

Update: Swiss table caches 7 bits of hash in the meta-table, not 6 bits. Fixed a few typos.


One of the most frustrating experiences with Julia is that many tutorials you find online don’t work any more because the language has changed so much. Creating a new package is one of the most fundamental tasks in a language, but it took me quite a while to figure that out. In the end, I managed to submit Getopt to Julia’s central registry. It implements a Python-like getopt in 70 lines, much shorter than ArgParse.jl.

This blog post explains what I have learned in this process. Here, a line starting with “sh>” indicates a shell command line, “julia>” denotes the Julia REPL and “pkg>” denotes the pkg mode, which can be entered by typing “]” in REPL.

Creating a package

To create a package repository, you may:

sh> julia -e 'using Pkg; Pkg.generate("Foo")' # or in the pkg mode
sh> mv Foo Foo.jl
sh> cd Foo.jl

The first command line creates a “Foo” directory, a “Foo/src/Foo.jl” file and a “Project.toml” file. We renamed this directory to “Foo.jl” because this is the convention. In the “Foo.jl” directory, you can add dependencies with

(v1.0) pkg> activate .                # enter virtual environment
(Foo) pkg> add Test                   # module for unit tests
sh> rm Manifest.toml                  # we don't need this
sh> echo 'julia 1.0' > REQUIRE        # but we need this
sh> mkdir -p test && touch test/runtests.jl  # tests go here

This updates the “[deps]” section of “Project.toml”. You probably need the “Test” package because apparently without it, you can’t run tests for your new package. Now, you can edit “Foo.jl/src/Foo.jl” to write the actual library code. Remember to read the documentation on Modules.

Deploying the package

You can’t import “Foo” yet because your new package is not added to your local registry. In the “Foo.jl” directory, you have to run the following first

(v1.0) pkg> dev .

It is always a good idea to write tests. To do that, edit “test/runtests.jl” following the tutorial on Unit Test. My Getopt.jl test is here. It is not a good example but may give you a basic idea. Tests can be run with

(v1.0) pkg> test Foo

Registering the package

After you push “Foo.jl” to github, others can install your package with

sh> julia -e 'using Pkg; Pkg.add(PackageSpec(url="https://github.com/your/Foo.jl"))'

They can’t install with the package name because it is not in Julia’s central registry yet. To register your package, you’d better use attobot, a GitHub App that automatically sends pull requests to METADATA.jl. For a new package, it asks you to wait for three days. Someone else (I guess a human) will merge the PR, which will be automatically synchronized to the Julia registry after several hours. At this point, the world will be able to install your package with

sh> julia -e 'using Pkg; Pkg.add("Foo")'

Your package won’t be found in the Julia package list because that page is outdated. Julia doesn’t have built-in package search. The best place to discover packages seems to be Julia Observer. It has issues, too. For example, it doesn’t tell you which Julia versions a package supports, and it is very slow.

Concluding remarks

Getopt is the first package I developed in Julia. The workflow described here might not be optimal. I will update this post once I learn a better solution.


Many command-line tools need to parse command-line arguments. In C, one of the most widely used functions for this purpose is getopt() and its GNU extension getopt_long(). However, these functions have two major issues. First, they are not portable. getopt is part of the POSIX standard but not the C standard; getopt_long is not part of any standard. In addition, getopt may behave differently depending on whether the GNU extension is enabled. Using these functions can be tricky. Second, both functions rely on global variables, which may interfere with more complex use cases (e.g. sub-commands).
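For readers who have not used it, a typical getopt_long() loop looks like the sketch below. The option set (“-v”, “-n INT”, “--verbose”, “--num INT”) and the names my_opts_t and my_parse() are made up for illustration. Note how the parser communicates through the globals optarg and optind, which is the second issue mentioned above.

```c
#include <getopt.h> // getopt_long() is a GNU/BSD extension, not ISO C
#include <stdlib.h> // for atoi()

typedef struct {
    int verbose;           // number of times -v/--verbose occurs
    int n;                 // argument of -n/--num
    const char *first_arg; // first non-option argument, or NULL
} my_opts_t;

// hypothetical parser for "prog [-v] [-n INT] [file...]"; returns -1 on error
static int my_parse(int argc, char *argv[], my_opts_t *o) {
    static const struct option longopts[] = {
        { "verbose", no_argument,       0, 'v' },
        { "num",     required_argument, 0, 'n' },
        { 0, 0, 0, 0 }
    };
    int c;
    o->verbose = 0, o->n = 0, o->first_arg = 0;
    while ((c = getopt_long(argc, argv, "vn:", longopts, 0)) >= 0) {
        if (c == 'v') ++o->verbose;             // repeated options are counted
        else if (c == 'n') o->n = atoi(optarg); // optarg is a global variable
        else return -1;                         // '?': unknown option or missing argument
    }
    if (optind < argc) o->first_arg = argv[optind]; // optind is a global, too
    return 0;
}
```

Because the scanning position lives in global state, parsing a second argument vector in the same process, as a sub-command would, requires resetting that state by hand, and the way to do so differs between C libraries.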

These limitations motivated the development of several other argument parsing libraries. While these libraries often have cleaner APIs and more functionality, most of them lack some getopt_long features. This blog post reviews several argument parsing libraries in C/C++ and introduces my own getopt replacement at the end.

Argument parsing libraries in C/C++

The following table lists common features in argument parsing libraries. Stars indicate getopt_long features.

Feature   Explanation
post*     Parse options after non-option/positional arguments
compact*  When appropriate, “-a -b foo” can be written as “-abfoo”
mulocc*   Keep track of an option occurring multiple times
order*    Keep track of the order of options
oparg*    A long option may optionally take an argument
type      Built-in argument type checking and parsing
fmt       Print formatted help messages
wchar     Support multi-byte characters

The table below shows the feature sets of several command-line argument parsing libraries. Only libraries supporting both short and long options are considered (stars indicate 1- or 2-file libraries):

library      lang   post  compact  mulocc  order  oparg  type  fmt  wchar
getopt_long  C/C++  Y     Y        Y       Y      Y      N     N    maybe
argh*        C++11  semi  N        N       N      N      N     N    ?
argp         C/C++  Y     Y        Y       Y      ?      N     Y    ?
argparse*    C/C++  Y     Y        N       N      ?      Y     Y    ?
args*        C++11  Y     Y        Y       N      ?      Y     Y    ?
argtable*    C/C++  Y     Y        Y       N      ?      Y     Y    ?
cxxopts*     C++11  Y     Y        Y       N      ?      Y     Y    ?
CLI11        C++11  Y     Y        switch  N      N      Y     Y    ?
gopt*        C/C++  Y     Y        switch  N      Y      N     N    N
ketopt*      C/C++  Y     Y        Y       Y      Y      N     N    N
tclap        C++    ?     N        N       N      ?      Y     Y    ?

Notably, many libraries discard the relative order between options, arguably the least important getopt feature. They often add type checking and automatic help message formatting. I think type checking comes in handy, but message formatting is not as valuable because I prefer my own format over theirs.

The list in the table is of course incomplete. Some important ones that are missing include Boost’s Program_options and Google’s gflags, both of which are much heavier libraries. I haven’t spent enough time on them. If you have relevant information on them or your favorite library that is missing, or you think the table is wrong, please help me to improve it. Thanks in advance!

Ketopt: my single-header argument parsing library

I occasionally care about the order of options, a feature missing from most non-getopt libraries (argp has it but is not portable). In the end, I developed my own library ketopt (examples here, including one on sub-command). It is implemented in ANSI C and doesn’t invoke heap allocations. Ketopt has a similar API to getopt_long except that 1) ketopt doesn’t use any global variables and 2) ketopt has an explicit function argument to allow options placed after non-option arguments. Developers who are familiar with getopt_long should be able to learn ketopt quickly.


Command-line argument parsing is relatively simple (ketopt has <100 LOCs), but implementing it by yourself is tricky, in particular if you want to match the features in getopt_long. My ketopt is largely a portable getopt_long without global variables. In addition to mine, you may consider gopt in C. It is small, easy to use and supports key getopt_long features. For C++ programmers, cxxopts is a decent choice. It is feature rich, close to getopt_long, and has similar APIs to Boost’s Program_options and Python’s argparse.

I strongly discourage the use of libraries deviating too much from getopt (e.g. argh and tclap). Most end users expect getopt behaviors. When your tool acts differently, it will confuse users. The command-line interface is one of the first things users experience. Please get it right.

TL;DR: The code is available in klib/kavl.h with a toy example in comments and at the end of this post. kavl_test.c tests correctness. Insertion performance is evaluated separately.


I need a container which, upon each insertion, tells me the number of objects smaller than the inserted one. A natural choice is a binary search tree. We store at each node the number of objects descended from the node. On insertion, we sum over the numbers on nodes immediately to the left of the search path to find the answer. This algorithm sounds easy but is not implemented in existing libraries. In addition, I also want to learn how AVL trees and intrusive containers work down to every detail. So, here we go.
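The counting idea can be shown on a plain binary search tree. The sketch below deliberately leaves out AVL rebalancing and is not how kavl.h is written; rk_node_t and rk_insert() are hypothetical names. Each node stores the size of its subtree, itself included, and whenever the search path turns right, the current node and its whole left subtree are smaller than the new key.

```c
#include <stddef.h>

typedef struct rk_node_s {
    int key;
    unsigned size; // number of nodes in the subtree rooted here, itself included
    struct rk_node_s *left, *right;
} rk_node_t;

// insert the caller-allocated node p into an unbalanced BST; return the number
// of keys smaller than p->key, or -1 if the key is present (p is not inserted)
static int rk_insert(rk_node_t **root, rk_node_t *p) {
    rk_node_t **q, *x;
    int smaller = 0;
    // first pass: reject duplicates before touching any subtree size
    for (x = *root; x; x = p->key < x->key? x->left : x->right)
        if (x->key == p->key) return -1;
    // second pass: descend, update sizes and count smaller keys
    for (q = root; *q; ) {
        x = *q;
        ++x->size;                 // p will end up in the subtree of x
        if (p->key < x->key) q = &x->left;
        else {                     // x and its left subtree are all smaller than p
            smaller += 1 + (x->left? x->left->size : 0);
            q = &x->right;
        }
    }
    p->size = 1, p->left = p->right = 0;
    *q = p;
    return smaller;
}
```

Without rebalancing, the worst case is linear in the number of nodes, which is why a balanced tree such as an AVL tree is needed in practice; rotations then have to keep the size fields up to date as well.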


An intrusive container is a container that requires each object in it to have one or multiple predefined member variables. Such a container intrudes the object definition – this is how it is named.

Implementation overview

kavl.h is broadly similar to khash.h. It requires you to expand a macro to insert the actual implementation into your code before using it. As an intrusive container, kavl.h doesn’t call the malloc() equivalent inside the library. In fact, it doesn’t even depend on libc. Like my other container implementations, kavl.h strives for performance. It avoids recursion, and doesn’t keep a pointer to the parent node – this saves space at the cost of code complexity.

A popular way to implement intrusive containers is to use offsetof, as is described in this blog post. This strategy avoids all the macro magic, but makes it impossible to inline simple comparisons. It is less efficient.

The advantage of intrusive containers

A non-intrusive container allocates memory inside the library. It is non-trivial (if possible at all) to replace the allocator used in the library. A true intrusive container lets you allocate memory in whatever way you prefer. You can opt for a custom heap allocator, a memory pool or even allocate on the stack, which may help performance a little if used correctly.

In addition, when storing strings or other variable-length data, an intrusive tree/list may save one heap allocation per node. In the case of kavl.h, you can define a tree node with a flexible array member:

struct my_node {
    int len;
    KAVL_HEAD(struct my_node) head;
    char str[];
};

This way, you can allocate the node along with the string, which again may help performance.
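A single-allocation constructor for such a node might look like the sketch below. To keep the block self-contained without kavl.h, a plain “next” pointer stands in for the KAVL_HEAD() member; the struct name and my_str_node_new() are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

// stand-in for the node above: "next" replaces the KAVL_HEAD() member
struct my_str_node {
    int len;
    struct my_str_node *next;
    char str[]; // C99 flexible array member; must be the last member
};

// hypothetical constructor: one malloc() holds both the node and its string
static struct my_str_node *my_str_node_new(const char *s) {
    size_t len = strlen(s);
    struct my_str_node *p = (struct my_str_node*)malloc(sizeof(*p) + len + 1);
    if (p == 0) return 0;
    p->len = (int)len;
    p->next = 0;
    memcpy(p->str, s, len + 1); // copy the string including the trailing '\0'
    return p;
}
```

A single free(p) releases both the node and the string, so there is one fewer allocation to track per node.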

The disadvantage

With an intrusive container, you have to take care of all memory management. This is inconvenient and opens doors to potential memory leaks. At least in C, the APIs of intrusive containers are less intuitive and harder to understand, requiring users to have a deeper knowledge in language features.

The myth

The Boost library argues that intrusive containers are faster with less stress on memory management. They tried to prove this with a benchmark. That goes a little too far. Intrusive lists shine there mainly because their programs “allocate” list nodes from a pre-allocated vector. In practice, we still have to allocate each node individually on heap when deletions are involved or when we can’t preallocate all nodes. Intrusive containers can be faster, but most often they are not. Even when they are faster, the performance gap is small.

It is believed among C programmers that intrusive data structures are a great way to achieve generic programming. This is only partially true. First, of common containers, only lists and binary search trees (BSTs) can be made truly intrusive in the sense that they need no heap allocation inside the libraries. Dynamic chaining-based hash tables still have to allocate the bucket array on heap, and they are often slower than open-addressing hash tables and should be avoided anyway. Second, only intrusive lists, the least useful data structure, can be implemented efficiently without ugly macros everywhere. For BSTs, we still have to use the macro magic to achieve the performance of type-specific code. Intrusive containers are not a general solution to generic programming in C.


To most developers, non-intrusive containers are the better choice. However, when you implement a memory allocator or when you micro-manage memory for the best performance, you will appreciate the flexibility of intrusive containers. Combined with a simple memory pool, kavl.h does speed up my program in the end.

Code example

The following implements the AVL tree example on wiki.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "kavl.h"

struct my_node {
    char key;
    KAVL_HEAD(struct my_node) head;
};
#define my_cmp(p, q) (((q)->key < (p)->key) - ((p)->key < (q)->key))
KAVL_INIT(my, struct my_node, head, my_cmp)

int main(void) {
    const char *str = "MNOLKQOPHIA"; // from wiki, except a duplicate
    struct my_node *root = 0;
    int i, l = strlen(str);
    for (i = 0; i < l; ++i) {        // insert in the input order
        struct my_node *q, *p = malloc(sizeof(*p));
        p->key = str[i];
        q = kavl_insert(my, &root, p, 0);
        if (p != q) free(p);         // if already present, free
    }
    kavl_itr_t(my) itr;
    kavl_itr_first(my, root, &itr);  // place at first
    do {                             // traverse
        const struct my_node *p = kavl_at(&itr);
        free((void*)p);              // free node
    } while (kavl_itr_next(my, &itr));
    return 0;
}

Update on 2018-09-29: updated ska::flat_hash_map and tsl::hopscotch_map to the latest versions. Added absl::flat_hash_map, ska::bytell_hash_map and tsl::robin_map. Updated texts accordingly.

I evaluated multiple hash table libraries nearly 10 years ago. A lot has changed since then: hash table is officially part of C++, my khash library is about twice as fast, and more advanced algorithms/heuristics are being applied to hash table implementations. Where are we now? Is unordered_map in C++11 the preferred choice? What hash table library should we use? This post aims to find partial answers to these questions.

In this micro-benchmark (source code here), we have N 32-bit integers with ~25% of them distinct. The task is to count the occurrences of each distinct integer with a hash table. It is inspired by real applications in my routine work. I will show the result first and then discuss all the caveats.


In the figure above, each connected line represents a library. Each line harbors 6 dots, corresponding to N=10,18,26,34,42,50 million, respectively. I used multiple numbers to show the effect of rehashing. The X-axis measures CPU time and Y-axis measures peak memory, including temporary swapping space used for rehashing.

10 years ago, Google’s dense_hash_map was significantly faster than all its peers. It is still among the fastest in this benchmark. When we consider the speed-memory balance, the more sophisticated probing algorithms such as Hopscotch hashing (used by tsl::hopscotch_map), Robin Hood hashing (by ska::flat_hash_map and tsl::robin_map) and Swiss table (by absl::flat_hash_map) are not better. I speculate this is partially because they need to store extra data in each bucket, which cancels some of their advantages under high load. In addition, these advanced hashing methods are better at queries. My benchmark always invokes insertions, though 75% of the time no new elements are added.

It bugs me that the official unordered_map implementation in GCC-6.3 is that inefficient. In fact, it is slower and uses more memory than SGI’s ancient ext/hash_map and tr1/unordered_map – both of them are still available in GCC. All these libraries use chaining to resolve collisions, which is apparently required by the C++11 spec. It is unfortunate that the C++ standard committee ruled out open addressing. Nearly all the hash table benchmarks indicate open addressing is significantly faster on small keys. As to C libraries, uthash is the most popular, but its performance lags far behind others. When you need a large hash table, ska::*_hash_map and tsl::*_map are the better choices if you prefer C++11 APIs; Google dense_hash and khash remain top options after 10 years.

Additional notes:

  • All implementations use the same integer hash function. Switching to the identity hash function has little effect on performance.
  • I haven’t tuned the maximum load factor and the growth factor. They may affect the balance between time and space.
  • Libraries in the benchmark use different memory allocators. For example, khash uses glibc’s malloc that supports realloc, unordered_map naturally uses std::allocator and Google dense_map/sparsepp are using their own allocators. I suspect that memory allocators play a role in performance. More testing needed.
  • There are several other hashtable benchmarks about tsl::*_map, ska::flat_hash_map and ska::bytell_hash_map. These are all good. TommyDS shows a benchmark where it performs the best. That is a bad one because it doesn’t put data into the table (as TommyDS can’t do that). Opic hashtable also has a benchmark. It seems to ignore the effect of rehashing. I thought to evaluate it but couldn’t get it working.
  • The source code repo evaluates several more libraries. Their results can be found in “__logs/*.tgz”.
  • For demonstration purposes, I have translated khash.h to a C++ single header. Khash implements a fairly naive algorithm. It may not work well with other types of data.
  • Benchmark programs were run on a fresh “standard-1” machine from Google Cloud. The results on a local Linux server are a little different:


On CPU dispatch

Modern x86 CPUs implement advanced instruction sets, such as SSE and AVX, which may greatly help performance. However, when distributing precompiled binaries (think about Debian, CentOS, Anaconda, etc.), we often prefer to fall back on older instruction sets for the sake of portability. Is there a way to dynamically choose CPU instruction sets at runtime such that we can achieve performance and portability at the same time? Yes, the answer is CPU dispatch. For a program that supports CPU dispatch, we typically compile it on a recent CPU to generate a fat(ish) binary that contains multiple implementations of a function or a code block with different instruction sets. At run time, the program chooses among these internal implementations based on the features of the CPU it is running on. I first heard of “CPU dispatch” from an Intel developer a few years ago. Unfortunately, googling “CPU dispatch” does not give me much relevant information immediately even today. This post aims to briefly explain the strategies to implement CPU dispatch in C/C++.

On x86, my preferred way to implement CPU dispatch is to detect the supported SIMD instruction sets via CPUID, which can be retrieved with x86 assembly, or with the __cpuid intrinsics specific to MS VC++. The following shows an example.

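#include <stdio.h>
#ifdef _MSC_VER
#include <intrin.h> // for __cpuidex()
#endif

#define SIMD_SSE     0x1
#define SIMD_SSE2    0x2
#define SIMD_SSE3    0x4
#define SIMD_SSE4_1  0x8
#define SIMD_SSE4_2  0x10
#define SIMD_AVX     0x20
#define SIMD_AVX2    0x40
#define SIMD_AVX512F 0x80

// Execute CPUID for the given leaf/subleaf; r[] receives EAX, EBX, ECX and EDX
static void x86_cpuid(unsigned leaf, unsigned subleaf, unsigned r[4]) {
#ifdef _MSC_VER
  int t[4];
  __cpuidex(t, (int)leaf, (int)subleaf);
  r[0] = t[0], r[1] = t[1], r[2] = t[2], r[3] = t[3];
#else
  asm volatile("cpuid" : "=a" (r[0]), "=b" (r[1]), "=c" (r[2]), "=d" (r[3]) : "a" (leaf), "c" (subleaf));
#endif
}

unsigned x86_simd(void) {
  unsigned r[4], max_leaf, flag = 0;
  x86_cpuid(0, 0, r);
  max_leaf = r[0]; // highest basic leaf supported by this CPU
  x86_cpuid(1, 0, r); // leaf 1: SSE* bits are in EDX and ECX, the AVX bit in ECX
  if (r[3]>>25&1) flag |= SIMD_SSE;
  if (r[3]>>26&1) flag |= SIMD_SSE2;
  if (r[2]>>0 &1) flag |= SIMD_SSE3;
  if (r[2]>>19&1) flag |= SIMD_SSE4_1;
  if (r[2]>>20&1) flag |= SIMD_SSE4_2;
  if (r[2]>>28&1) flag |= SIMD_AVX;
  if (max_leaf >= 7) {
    x86_cpuid(7, 0, r); // leaf 7, subleaf 0: AVX2 and AVX-512F bits are in EBX
    if (r[1]>>5 &1) flag |= SIMD_AVX2;
    if (r[1]>>16&1) flag |= SIMD_AVX512F;
  }
  return flag;
}

int main(void) {
  printf("%x\n", x86_simd());
  return 0;
}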

It is known to work with gcc-4.4, icc-15.0, clang-8.0 and msvc-14.0, so it is fairly portable.

The second way is to use a GCC built-in: __builtin_cpu_supports(). This function tests whether the CPU the program is running on supports certain instruction sets. It is a relatively new function, only available in recent C compilers. I can confirm it is working with gcc-4.9 on Linux and clang-8.1.0 on Mac. Clang-8.0.0 has this built-in but is buggy: it compiles but can’t link. Intel C compiler (ICC) v15.0 has a similar problem. MS VC++ doesn’t support this function. The IBM compiler appears to have a similar built-in, though it only tests Power-related instruction sets. On x86, this second approach is simpler but less portable.
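As a rough sketch of how this built-in can drive dispatch, the code below keeps two versions of a dot product in one binary and picks one at run time through a function pointer. The names sdot_fn, sdot_scalar(), sdot_avx() and sdot_select() are made up for this example. It assumes GCC or Clang on x86, where the target attribute lets a single function use AVX intrinsics without compiling the whole file with -mavx; on other compilers or architectures it simply falls back to the scalar version.

```c
#include <stddef.h>

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
#include <immintrin.h> // AVX intrinsics
#endif

typedef float (*sdot_fn)(size_t n, const float *x, const float *y);

// Portable fallback that runs on any CPU
static float sdot_scalar(size_t n, const float *x, const float *y) {
  float s = 0.0f;
  size_t i;
  for (i = 0; i < n; ++i) s += x[i] * y[i];
  return s;
}

#ifdef HAVE_X86_DISPATCH
// AVX version: only this function is compiled with AVX enabled
static __attribute__((target("avx"))) float sdot_avx(size_t n, const float *x, const float *y) {
  __m256 acc = _mm256_setzero_ps();
  float t[8], s;
  size_t i, n8 = n & ~(size_t)7; // largest multiple of 8 not exceeding n
  for (i = 0; i < n8; i += 8)
    acc = _mm256_add_ps(acc, _mm256_mul_ps(_mm256_loadu_ps(x + i), _mm256_loadu_ps(y + i)));
  _mm256_storeu_ps(t, acc);
  s = t[0] + t[1] + t[2] + t[3] + t[4] + t[5] + t[6] + t[7];
  for (; i < n; ++i) s += x[i] * y[i]; // leftover elements
  return s;
}
#endif

// Choose an implementation based on what the running CPU supports
sdot_fn sdot_select(void) {
#ifdef HAVE_X86_DISPATCH
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx")) return sdot_avx;
#endif
  return sdot_scalar;
}
```

A caller would typically run sdot_select() once at startup, store the returned pointer and call through it afterwards, so the feature test is not repeated in the hot path.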

Icc has a similar built-in with an interesting name: _may_i_use_cpu_feature(). Icc alternatively allows creating multiple versions of a function with a compiler extension __declspec(cpu_dispatch()). Gcc-4.8+ has a similar feature, though for C++ only. I don’t like these methods because they are not portable at all.

By the way, there were some interesting discussions on supporting CPU dispatch in the C++ standard. The thread covers several strategies mentioned here. It went down, though.

What is KANN?

See the GitHub repo page. In short, KANN is a flexible 4-file deep learning library, supporting convolutional neural networks (CNNs), recurrent neural networks (RNNs) and non-standard topologies addressable with differentiable computation graphs.

Why a new library?

The initial motivation is that I wanted to understand how deep learning frameworks work, down to the very details. The best way is to implement one by myself. After I got the basic working, I realized the code may be of use to other C/C++ programmers who prefer an efficient and flexible library without carrying all the non-trivial dependencies of mainstream frameworks. So, here we go.

Comparison to other deep learning frameworks

Theano and Tensorflow, with a code base many times larger than KANN, are definitely more powerful than KANN. Importantly, they can take advantage of GPUs and even distributed computing, while KANN cannot. On the other hand, KANN comes close in flexibility and can be faster in the multi-threading mode for CPU-only training. KANN also has no extra dependencies by default, which makes it easy to deploy.

Tiny-dnn is a popular lightweight framework in C++. Importing pre-trained Caffe models is its particular strength that KANN lacks. However, tiny-dnn does not support RNNs and has difficulties in constructing non-standard models (e.g. variational autoencoders). It is several times slower than KANN and mainstream frameworks. Tiny-dnn also requires a C++11 compiler, which is not available everywhere yet (e.g. on CentOS 6).


KANN does not support GPUs right now. For MLPs and RNNs with no more than a couple of hundred hidden neurons, multi-threaded KANN is actually no slower than GPU-based implementations, because such small matrix multiplications do not saturate the capacity of the GPU. However, for CNNs and large RNNs, I have seen GPU-based implementations outperforming KANN by a factor of 5. The performance gap is probably larger with bigger networks.

KANN lacks some important operators, such as batch normalization (BN). A direct implementation of the original BN method is tricky as training needs an extra step different from normal training. It seems that Caffe et al are implementing a variant of BN with running average, but I am not so sure.

KANN does not support bidirectional RNNs and seq2seq models out of the box. In principle, these models can be constructed with KANN by manually chaining RNN blocks, but I have not tried.


If you are looking for a tiny, standalone, performant, open source library in C/C++ that supports common components including MLP, CNN and RNN, and has the flexibility and extensibility close to mainstream deep learning frameworks, KANN might be your only viable choice as of now.

Vector and matrix arithmetic (e.g. vector dot product and matrix multiplication) is basic to linear algebra and is also widely used in other fields such as deep learning. It is easy to implement vector/matrix arithmetic, but when performance is needed, we often resort to a highly optimized BLAS implementation, such as ATLAS or OpenBLAS. Are these libraries much faster than our own implementations? Is it worth introducing a dependency on BLAS if you only need basic vector/matrix arithmetic? The following post may give you some hints.


In this GitHub repository, I implemented matrix multiplication in seven different ways, including a naive implementation, several optimized implementations with cache miss reduction, SSE and loop blocking, and two implementations on top of OpenBLAS. The following table shows the timing of multiplying two 2000×2000 or 4000×4000 random matrices on my personal Mac laptop and a remote Linux server (please see the source code repo for details):

| Implementation  | Linux, 2000×2000 | Linux, 4000×4000 | Mac, 2000×2000 |
|-----------------|------------------|------------------|----------------|
| Naive           | 7.53 sec         | 188.85 sec       | 77.45 sec      |
| Transposed      | 6.66 sec         | 55.48 sec        | 9.73 sec       |
| sdot w/o hints  | 6.66 sec         | 55.04 sec        | 9.70 sec       |
| sdot with hints | 2.41 sec         | 29.47 sec        | 2.92 sec       |
| SSE sdot        | 1.36 sec         | 21.79 sec        | 2.92 sec       |
| SSE+tiling sdot | 1.11 sec         | 10.84 sec        | 1.90 sec       |
| OpenBLAS sdot   | 2.69 sec         | 28.87 sec        | 5.61 sec       |
| OpenBLAS sgemm  | 0.63 sec         | 4.91 sec         | 0.86 sec       |
| uBLAS           | 7.43 sec         | 165.74 sec       |                |
| Eigen           | 0.61 sec         | 4.76 sec         |                |

You can see that a naive implementation of matrix multiplication is quite slow. Simply transposing the second matrix may greatly improve the performance when the second matrix does not fit in the CPU cache (the Linux server has a 35MB cache, which can hold a 2000×2000 float matrix, but not a 4000×4000 matrix). Transposing also enables vectorization of the inner loop, which leads to a significant performance boost (SSE sdot vs Transposed). Loop blocking further reduces cache misses and timing for large matrices. However, OpenBLAS’ matrix multiplication (sgemm) is still the king of performance, twice as fast as my best hand-written implementation and tens of times faster than a naive implementation. OpenBLAS is fast mostly due to its advanced techniques to minimize cache misses.
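The three cache-related ideas above can be sketched in plain C as follows. This is a simplified illustration written for this post, not the code in the repository: the function names are made up, matrices are square, row-major and stored in flat arrays, and the block size blk is a free parameter that would need tuning.

```c
#include <stdlib.h>
#include <string.h>

// Naive: c = a * b. The inner loop walks b down a column, i.e. with stride n.
void mat_mul_naive(int n, const float *a, const float *b, float *c) {
  int i, j, k;
  for (i = 0; i < n; ++i)
    for (j = 0; j < n; ++j) {
      float s = 0.0f;
      for (k = 0; k < n; ++k) s += a[(size_t)i*n + k] * b[(size_t)k*n + j];
      c[(size_t)i*n + j] = s;
    }
}

// Return a newly allocated transpose of the n-by-n matrix b, or NULL on failure
static float *mat_transpose(int n, const float *b) {
  int i, j;
  float *bt = (float*)malloc((size_t)n * n * sizeof(float));
  if (bt == 0) return 0;
  for (i = 0; i < n; ++i)
    for (j = 0; j < n; ++j) bt[(size_t)j*n + i] = b[(size_t)i*n + j];
  return bt;
}

// Transposed: each inner loop is a dot product of two contiguous rows
int mat_mul_transposed(int n, const float *a, const float *b, float *c) {
  int i, j, k;
  float *bt = mat_transpose(n, b);
  if (bt == 0) return -1;
  for (i = 0; i < n; ++i)
    for (j = 0; j < n; ++j) {
      float s = 0.0f;
      for (k = 0; k < n; ++k) s += a[(size_t)i*n + k] * bt[(size_t)j*n + k];
      c[(size_t)i*n + j] = s;
    }
  free(bt);
  return 0;
}

// Loop blocking on top of transposition: process blk-by-blk tiles so that
// the rows being combined are reused while they are still in cache
int mat_mul_blocked(int n, int blk, const float *a, const float *b, float *c) {
  int i, j, k, ii, jj, kk;
  float *bt = mat_transpose(n, b);
  if (bt == 0) return -1;
  memset(c, 0, (size_t)n * n * sizeof(float));
  for (ii = 0; ii < n; ii += blk)
    for (jj = 0; jj < n; jj += blk)
      for (kk = 0; kk < n; kk += blk) {
        int ie = ii + blk < n? ii + blk : n;
        int je = jj + blk < n? jj + blk : n;
        int ke = kk + blk < n? kk + blk : n;
        for (i = ii; i < ie; ++i)
          for (j = jj; j < je; ++j) {
            float s = 0.0f;
            for (k = kk; k < ke; ++k) s += a[(size_t)i*n + k] * bt[(size_t)j*n + k];
            c[(size_t)i*n + j] += s; // accumulate the partial dot product of this tile
          }
      }
  free(bt);
  return 0;
}
```

The transpose costs O(n²) extra time and memory, which is small next to the O(n³) multiplication. Note that the blocked version sums in a different order from the other two, so results on real data may differ in the last few bits.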

As a side note, “sdot with hints” partially unrolls the inner loop. It gives a hint to the compiler that the loop may be vectorized. Clang on Mac can fully vectorize this loop, achieving the same speed as explicit vectorization. Gcc-4.4 seems not as good. The Intel compiler vectorizes the loop even without this hint (see the full table in README). Interestingly, the OpenBLAS sdot implementation is slower than my explicit vectorization on both Linux and Mac. I haven’t figured out the reason. I speculate that it may be related to cache optimization.
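The kind of hint meant here can be sketched like this. It is a reconstruction for illustration and may differ from the code in the repository; sdot_plain() and sdot_unrolled() are hypothetical names.

```c
// Plain dot product. With a single accumulator, the compiler has to add the
// products in order unless it is allowed to reorder floating-point sums.
float sdot_plain(int n, const float *x, const float *y) {
  int i;
  float s = 0.0f;
  for (i = 0; i < n; ++i) s += x[i] * y[i];
  return s;
}

// Partially unrolled dot product. The eight accumulators are independent of
// each other, so the loop body maps directly onto SIMD lanes.
float sdot_unrolled(int n, const float *x, const float *y) {
  int i, n8 = n & ~7; // largest multiple of 8 not exceeding n (n >= 0)
  float s = 0.0f, t[8] = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
  for (i = 0; i < n8; i += 8) {
    t[0] += x[i+0] * y[i+0];
    t[1] += x[i+1] * y[i+1];
    t[2] += x[i+2] * y[i+2];
    t[3] += x[i+3] * y[i+3];
    t[4] += x[i+4] * y[i+4];
    t[5] += x[i+5] * y[i+5];
    t[6] += x[i+6] * y[i+6];
    t[7] += x[i+7] * y[i+7];
  }
  for (; i < n; ++i) s += x[i] * y[i]; // leftover elements
  return s + t[0] + t[1] + t[2] + t[3] + t[4] + t[5] + t[6] + t[7];
}
```

Whether a given compiler actually vectorizes the unrolled loop depends on its version and optimization flags, which matches the difference seen between Clang and Gcc-4.4 above.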

As to C++ libraries, Eigen has similar performance to OpenBLAS. The native uBLAS implementation in Boost is quite primitive, nearly as slow as the most naive implementation. Boost should ditch uBLAS. Even in the old days, it was badly implemented.


  • For multiplying two large matrices, sophisticated BLAS libraries, such as OpenBLAS, are tens of times faster than the most naive implementation.
  • With transposing, SSE (x86 only) and loop blocking, we can achieve half of the speed of OpenBLAS’ sgemm while still maintaining relatively simple code. If you want to avoid a BLAS dependency, this is the way to go.
  • For BLAS level-1 routines (vector arithmetic), an implementation with SSE vectorization may match or sometimes exceed the performance of OpenBLAS.
  • If you prefer a C++ interface and are serious about performance, don’t use uBLAS; use Eigen instead.