I evaluated multiple hash table libraries nearly 10 years ago. A lot have been changed since then: hash table is officially part of C++, my khash library is about twice as fast, and more advanced algorithms/heuristics are being applied to hash table implementations. Where are we now? Is unordered_map in C++11 the preferred choice? What hash table library should we use? This post aims to find partial answers to these questions.

In this micro-benchmark (source code here), we have N 32-bit integers with ~25% of them are distinct. The task is to find the occurrence of each distinct integer with a hash table. It is inspired by real applications in my routine work. I will show the result first and then discuss all the caveats.


In the figure above, each connected line represents a library. Each line harbors 6 dots, corresponding to N=10,18,26,34,42,50 million, respectively. I used multiple numbers to show the effect of rehashing. The X-axis measures CPU time and Y-axis measures peak memory, including temporary swapping space used for rehashing.

10 years ago, Google’s dense_hash_map was significantly faster than all the peers. It is still the fastest in this benchmark, but others are close in speed. When considering speed-memory balance, the more sophisticated probing algorithms such as Hopscotch hashing (used by hopscotch-map) and Robin Hood hashing (by flat_hash_map) are not better. I guess this is partially because they need to store extra data in each bucket, which cancels some of their advantages under high load. In addition, these advanced hashing methods are better at query, but my benchmark also involves insertions often.

It bugs me that the official unordered_map implementation in GCC-6.3 is that inefficient. In fact, it is slower and uses more memory than SGI’s ancient ext/hash_map and tr1/unordered_map – both of them are still available in GCC. All these libraries use chaining to resolve collisions, which is apparently required by the C++11 spec. It is unfortunate that the C++ standard committee ruled out open addressing. Nearly all the hash table benchmarks indicate open addressing is significantly faster on small keys. As to C libraries, uthash is the most popular, but its performance lags far behind others. When you need a large hash table, flat_hash_map and hopscotch-map are the better choices if you prefer C++11 APIs; Google dense_hash and khash remain top options after 10 years.

Additional notes:

  • Benchmark programs were run on a fresh “standard-1” machine from Google Cloud.
  • I haven’t tuned the maximum load factor and the growth factor. They may affect the balance between speed and space.
  • Libraries in the benchmark use different memory allocators. For example, khash uses glibc’s malloc that supports realloc, unordered_map naturally uses std::allocator and Google dense_map/sparsepp are using their own allocators. I suspect that memory allocators play a role in performance. More testing needed.
  • TommyDS shows a benchmark where it performs the best. It doesn’t. The developer only uses the hash table to store pointers and puts the actual data in a separate array. Real applications rarely work that way. When we put data into the hash table, it becomes much slower and larger due to unnecessary malloc calls.
  • Glib hash table uses a similar algorithm to TommyDS. It can optionally treat integers as pointers and avoids unnecessary malloc. However, this hack doesn’t work in general and even with it, Glib is twice as slow as top performers.

On CPU dispatch

Modern x86 CPUs implement advanced instruction sets, such as SSE and AVX, which may greatly help performance. However, when distributing precompiled binaries (think about Debian, CentOS, AnaConda, etc), we often prefer to fall back on older instruction sets for the sake of portability. Is there a way to dynamically choose CPU instruction sets at runtime such that we can achieve performance and portability at the same time? Yes, the answer is CPU dispatch. For a program that supports CPU dispatch, we typically compile it on a recent CPU to generate a fat(ish) binary that contains multiple implementations of a function or a code block with different instruction sets. When we run, the program dynamically chooses internal implementations based on the CPU features. I first heard of “CPU dispatch” from an Intel developer a few years ago. Unfortunately, googling “CPU dispatch” does not give me much relevant information immediately even today. This post aims to briefly explain the strategies to implement CPU dispatch in C/C++.

On x86, my preferred way to implement CPU dispatch is to detect the supported SIMD instruction sets via CPUID, which can be retrieved with x86 assembly, or with the __cpuid intrinsics specific to MS VC++. The following shows an example.

#include <stdio.h>

#define SIMD_SSE     0x1
#define SIMD_SSE2    0x2
#define SIMD_SSE3    0x4
#define SIMD_SSE4_1  0x8
#define SIMD_SSE4_2  0x10
#define SIMD_AVX     0x20
#define SIMD_AVX2    0x40
#define SIMD_AVX512F 0x80

unsigned x86_simd(void) {
  unsigned eax, ebx, ecx, edx, flag = 0;
#ifdef _MSC_VER
  int cpuid[4];
  __cpuid(cpuid, 1);
  eax = cpuid[0], ebx = cpuid[1], ecx = cpuid[2], edx = cpuid[3];
  asm volatile("cpuid" : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx) : "a" (1));
  if (edx>>25&1) flag |= SIMD_SSE;
  if (edx>>26&1) flag |= SIMD_SSE2;
  if (ecx>>0 &1) flag |= SIMD_SSE3;
  if (ecx>>19&1) flag |= SIMD_SSE4_1;
  if (ecx>>20&1) flag |= SIMD_SSE4_2;
  if (ecx>>28&1) flag |= SIMD_AVX;
  if (ebx>>5 &1) flag |= SIMD_AVX2;
  if (ebx>>16&1) flag |= SIMD_AVX512F;
  return flag;
int main() {
  printf("%x\n", x86_simd());
  return 0;

It is known to work with gcc-4.4, icc-15.0, clang-8.0 and msvc-14.0, fairly portable.

The second way is to use a GCC built-in: __builtin_cpu_supports(). This function tests if CPU the program is running on supports certain instruction sets. It is a new function only available to recent C compilers. I can confirm it is working with gcc-4.9 on Linux and clang-8.1.0 on Mac. Clang-8.0.0 has this built-in but is buggy: it compiles but can’t link. Intel C compiler (ICC) v15.0 has a similar problem. MS VC++ doesn’t support this function. The IBM compiler appears to has a similar built-in, though it only tests Power-related instruction sets. On x86, this second approach is simpler but less portable.

Icc has a similar built-in with an interesting name: _may_i_use_cpu_feature(). Icc alternatively allows to creates multiple versions of a function with a compiler extension __declspec(cpu_dispatch()). Gcc-4.8+ has a similar feature, though for C++ only. I don’t like these methods because they are not portable at all.

By the way, there were some interesting discussions on supporting CPU dispatch in the C++ standard. The thread covers serval strategies mentioned here. It went down, though.

What is KANN?

See the GitHub repo page. In short, KANN is a flexible 4-file deep learning library, supporting convolutional neural networks (CNNs), recurrent neural networks (RNNs) and non-standard topologies addressable with differentiable computation graphs.

Why a new library?

The initial motivation is that I wanted to understand how deep learning frameworks work, down to the very details. The best way is to implement one by myself. After I got the basic working, I realized the code may be of use to other C/C++ programmers who prefer an efficient and flexible library without carrying all the non-trivial dependencies of mainstream frameworks. So, here we go.

Comparison to other deep learning frameworks

Theano and Tensorflow, with a code base many times larger than KANN, are definitely more powerful than KANN. Importantly, they can take the advantage of GPUs and even distributed computing, while KANN not. On the other hand, KANN comes close in flexibility and can be faster in the multi-threading mode for CPU-only training. KANN also has no extra dependencies by default, which makes it easy to deploy.

Tiny-dnn is a popular lightweight framework in C++. Importing pre-trained Caffe models is its particular strength that KANN lacks. However, tiny-dnn does not support RNNs and has difficulties in constructing non-standard model (e.g. variational autoencoder). It is several times slower than KANN and mainstream frameworks. Tiny-dnn also requires a C++11 compiler, which is not available everywhere yet (e.g. on CentOS 6).


KANN does not support GPU right now. For MLPs and RNNs with no more than a couple of hundred hidden neurons, multi-threaded KANN is actually no slower than GPU-based implementations, because small matrix multiplications have not saturated the capacity of GPU yet. However, for CNNs and large RNNs, I have seen GPU-based implementations outperforming KANN by a factor of 5. The performance gap is probably larger with bigger networks.

KANN lacks some important operators, such as batch normalization (BN). A direct implementation of the original BN method is tricky as training needs an extra step different from normal training. It seems that Caffe et al are implementing a variant of BN with running average, but I am not so sure.

KANN does not support bidirectional RNNs and seq2seq models out of box. In principle, these models can be constructed with KANN by manually chaining RNN blocks, but I have not tried.


If you are looking for a tiny, standalone, performant, open source library in C/C++ that supports common components including MLP, CNN and RNN, and has the flexibility and extensibility close to mainstream deep learning frameworks, KANN might be your only viable choice as of now.

Vector and matrix arithmetic (e.g. vector dot and matrix multiplication) are the basic to linear algebra and are also widely used in other fields such as deep learning. It is easy to implement vector/matrix arithmetic, but when performance is needed, we often resort to a highly optimized BLAS implementation, such as ATLAS and OpenBLAS. Are these libraries much faster than our own implementations? Is it worth introducing a dependency to BLAS if you only need basic vector/matrix arithmetic? The following post may give you some hints.


In this github repository, I implemented matrix multiplication in seven different ways, including a naive implementation, several optimized implementations with cache miss reduction, SSE and loop blocking, and two implementations on top of OpenBLAS. The following table shows the timing of multiplying two 2000×2000 or 4000×4000 random matrices on my personal Mac laptop and a remote linux server (please see the source code repo for details):







7.53 sec

188.85 sec

77.45 sec


6.66 sec

55.48 sec

9.73 sec
sdot w/o hints


6.66 sec

55.04 sec

9.70 sec
sdot with hints


2.41 sec

29.47 sec

2.92 sec
SSE sdot


1.36 sec

21.79 sec

2.92 sec
SSE+tiling sdot


1.11 sec

10.84 sec

1.90 sec
OpenBLAS sdot


2.69 sec

28.87 sec

5.61 sec
OpenBLAS sgemm


0.63 sec

4.91 sec

0.86 sec

7.43 sec

165.74 sec


0.61 sec

4.76 sec

You can see that a naive implementation of matrix multiplication is quite slow. Simply transposing the second matrix may greatly improve the performance when the second matrix does not fit to the CPU cache (the linux server has a 35MB cache, which can hold a 2000×2000 float matrix in cache, but not a 4000×4000 matrix). Transpose also enables vectorization of the inner loop. This leads to significant performance boost (SSE sdot vs Transposed). Loop blocking further reduces cache misses and timing for large matrices. However, OpenBLAS’ matrix multiplication (sgemm) is still the king of performance, twice as fast as my best hand-written implementation and tens of times faster than a naive implementation. OpenBLAS is fast mostly due to its advanced techniques to minimize cache misses.

As side notes, “sdot with hints” partially unrolls the inner loop. It gives a hint to the compiler that the loop may be vectorized. Clang on Mac can fully vectorize this loop, achieving the same speed of explicit vectorization. Gcc-4.4 seems not as good. The Intel compiler vectorizes the loop even without this hint (see the full table in README). Interestingly, the OpenBLAS sdot implementation is slower than my explicit vectorization on both Linux and Mac. I haven’t figured out the reason. I speculate that it may be related to cache optimization.

As to C++ libraries, Eigen has similar performance to OpenBLAS. The native uBLAS implementation in Boost is quite primitive, nearly as slow as the most naive implementation. Boost should ditch uBLAS. Even in the old days, it was badly implemented.


  • For multiplying two large matrices, sophisticated BLAS libraries, such as OpenBLAS, are tens of times faster than the most naive implementation.
  • With transposing, SSE (x86 only) and loop blocking, we can achieve half of the speed of OpenBLAS’ sgemm while still maintaining relatively simple code. If you want to avoid a BLAS dependency, this is the way to go.
  • For BLAS level-1 routines (vector arithmetic), an implementation with SSE vectorization may match or sometimes exceeds the performance of OpenBLAS.
  • If you prefer a C++ interface and are serious about performance, don’t use uBLAS; use Eigen instead.

The best solution is pdftops from Poppler, a somewhat successor of xpdf (see also this article). It preserves the fonts in PDF and produces a small and proper vector graph. To compile poppler on OSX 10.9, I need to edit “configure” and remove compiling option “-fno-check-new” as clang does not support this option.

Following the answer from this page, I have also tried a few other options. InkScape generates a small vector EPS, but it loses some features. Convert from ImageMagick outputs a bitmap EPS, which defeats the goal of vector graphs.

Interestingly, directly using the “gs” command from GhostScript seems to generate a vector EPS, but using the pdf2ps script produces an EPS with bitmap fonts. It turns out that the difference is caused by “-dNOCACHE”, which is surprising. Anyway, even though “gs” works, it generates a much larger EPS in comparison to pdftops. The winner is still pdftops from xpdf/poppler, at least in my case.

Gv apparently calls pkg-config during configuration. When pkg-config or the pkg-config file for Xaw3D is not found, it will fall back to another configuration which does not work on Mac.

As Mac does not come with pkg-config by default, you need to first install it. You also need to specify where to find the pkg-config file for Xaw3D:

export PKG_CONFIG_PATH=/usr/X11/lib/pkgconfig/
./configure --x-includes=/usr/X11/include/ --x-libraries=/usr/X11/lib/ --enable-SIGCHLD-fallback

Several years ago I implemented knetfile for accessing remote files on ftp and http as if they are local (see also this blog post). I have been using the implementation for a while and the end users like the feature. However, with the increasing use of https among file sharing and cloud computing providers, supporting secured connection becomes more important. Several users have requested this feature. As a response, I implemented a new library kurl on top of libcurl.

Kurl is inspired by and learns from fopen.c, an example from the curl source code package. It supports random access and uses fixed-length buffer. It also fixes an issue where we may be waiting too long for select(). The APIs largely resemble knetfile, zlib and stdio. The following is a small example:

#include <stdio.h>
#include "kurl.h"
int main() {
  kurl_t *fp;
  unsigned char buf[256];
  fp = kurl_open("https://github.com", 0);
  kurl_seek(fp, 100, SEEK_SET);
  kurl_read(fp, buf, 256);
  return 0;

In addition, kurl.c also comes with a simple main() function to achieve the basic curl functionality, which can be compiled with:

gcc -g -Wall -O2 -lcurl -DKURL_MAIN kurl.c -o kurl

Here are a little more details about kurl:

  • Two-file library. No installation.
  • The only dependency is libcurl, though libcurl may further depend on other libraries: e.g. openssl for https; libssh2 for sftp.
  • Directly accesses files in S3 with
    kurl_open("s3://bucket/object", 0)

    AWS credentials are either provided to kurl_open(), or by default read from ~/.awssecret (AccessKeyId and SecretKey on two lines; see Tim Kay’s aws tool for details).

  • Compilable with C++ compilers.
  • Buffered reading with a fixed buffer length. No potential buffer bloat.