As a Perl programmer, I make heavy use of hash tables, and I have kept the habit in C/C++ programming. So what C/C++ hash libraries are available, and how do they compare to each other? In this post, I will give a brief review of hash libraries and present a small benchmark showing their practical performance.
Hash table libraries
In C++, the most widely used hash table implementation is hash_map/set in SGI STL, which ships with the GCC compiler. Note that hash_map/set is SGI’s extension and not part of the C++ standard. TR1 (Technical Report 1) attempts to standardize hash tables: it provides unordered_map/set with an API similar to hash_map/set, and most TR1 routines have been available since gcc-4.0. Google sparsehash is another C++ hash table template library with a similar API. It provides two implementations, one optimized for speed and the other for memory.
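All of these expose an STL-style interface. For example, a minimal sketch using TR1 under a TR1-capable GCC (sparsehash’s containers look almost identical, except that dense_hash_map additionally requires set_empty_key()):

```cpp
#include <tr1/unordered_map>
#include <string>
#include <cstdio>

int main() {
    std::tr1::unordered_map<std::string, int> h; // hash_map and
    h["apple"] = 1;                              // sparse_hash_map expose
    h["pear"] = 2;                               // nearly the same API
    std::tr1::unordered_map<std::string, int>::iterator it = h.find("apple");
    if (it != h.end())
        std::printf("%s => %d\n", it->first.c_str(), it->second);
    return 0;
}
```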
In contrast, there are few good C libraries around. I have tried SunriseDD, uthash, glibc hash table, hashit, Christopher Clark’s hashtable, glib hash table and ghthash. SunriseDD sounds like a great library that implements a lock-free hash table. However, I am not sure how to install or use it, although the code itself is well documented. Uthash is a single header file. It is quite complex to use and incompatible with C++. It also lacks basic APIs such as counting the number of elements in the hash table. Glibc hash and hashit seem to implement only static hash tables, and glibc hash does not even have a deletion operation. Only glib hash, CC’s hashtable and ghthash implement most of the common operations, and even they have weaknesses in comparison to the C++ implementations (see below).
Design of the benchmark
The benchmark comprises two experiments. In the first experiment, a random integer array of 5 million elements is generated, containing about 1.25 million distinct keys. Each element is then tested for presence in the hash: if it is already there, it is removed; otherwise, it is inserted. 625,792 distinct keys remain in the hash after this process. To test performance on string input, I convert the integers to strings with sprintf(). The core loop looks roughly like the sketch below.
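A minimal sketch of experiment 1, using a TR1 container for concreteness and rand() as a stand-in for the actual generator (the exact key counts in the benchmark differ):

```cpp
#include <tr1/unordered_set>
#include <cstdio>
#include <cstdlib>

int main() {
    std::tr1::unordered_set<unsigned> h;
    for (int i = 0; i < 5000000; ++i) {
        unsigned x = (unsigned)(rand() % 1250000); // ~1.25M distinct keys
        std::tr1::unordered_set<unsigned>::iterator it = h.find(x);
        if (it != h.end()) h.erase(it); // present: remove it
        else h.insert(x);               // absent: insert it
        // (written with an explicit find() for clarity; see the one-lookup
        // variant in the notes below)
    }
    std::printf("%lu keys in the hash\n", (unsigned long)h.size());
    return 0;
}
```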
The second experiment was designed by Craig Silverstein, the author of sparsehash, and I am using his source code. It tests the performance of insertion into a zero-sized hash, insertion into a preallocated hash, replacement, query, query of an empty hash, and removal.
Results
The following table gives the results in the first experiment:
| Library | Mac int CPU (s) | Mac str CPU (s) | Mac peak mem (MB) | Linux int CPU (s) | Linux str CPU (s) | Linux peak mem (MB) |
|---|---|---|---|---|---|---|
| glib | 1.904 | 2.436 | 11.192 | 3.490 | 4.720 | 24.968 |
| ghthash | 2.593 | 2.869 | 29.0/39.0 | 3.260 | 3.460 | 61.232 |
| CC’s hashtable | 2.740 | 3.424 | 59.756 | 3.040 | 4.050 | 129.020 |
| TR1 | 1.371 | 2.571 | 16.140 | 1.750 | 3.300 | 28.648 |
| STL hash_set | 1.631 | 2.698 | 14.592 | 2.070 | 3.430 | 25.764 |
| google-sparse | 2.957 | 6.098 | 4.800 | 2.560 | 6.930 | 5.42/8.54 |
| google-dense | 0.700 | 2.833 | 24.616 | 0.550 | 2.820 | 24.7/49.3 |
| khash (C++) | 1.089 | 2.372 | 6.772 | 1.100 | 2.900 | 6.88/13.1 |
| khash (C) | 0.987 | 2.294 | 6.780 | 1.140 | 2.940 | 6.91/13.1 |
| STL set (RB) | 5.898 | 12.978 | 19.868 | 7.840 | 18.620 | 29.388 |
| kbtree (C) | 3.080 | 13.413 | 3.268 | 4.260 | 17.620 | 4.86/9.59 |
Notes:
Notes:
- Please be aware that changing the size of the input data may change the speed and memory rankings. The speed of a library may also vary by up to 10% between two runs.
- CPU time is measured in seconds. Memory denotes the peak memory, measured in MB.
- For string hash, only the pointer to a string is inserted. Memory in the table does not count the space used by strings.
- If two numbers are given for memory, the first is for integer keys and the second for string keys.
- For all C++ libraries and khash.h, one operation suffices to achieve “insert if absent; delete otherwise” (see the sketch after these notes). Glib and ghthash require two operations, which puts these two libraries at a disadvantage.
- The speed may also be influenced by the efficiency of the hash functions. Khash and glib use the same hash function; TR1/SGI-STL/google-hash use another. Fortunately, in my experiment the two string hash functions have quite similar performance, and so the benchmark reflects the performance of the hash libraries as a whole rather than just the hash functions.
- For glib and ghthash, what is inserted is the pointer to the integer instead of the integer itself.
- Ghthash supports dynamic hash tables, but the results do not seem correct when this feature is switched on, so I am using a fixed-size hash table. This favours ghthash.
- CC’s hashtable forcibly frees a key when deleting it, which none of the other libraries do. This behaviour adds overhead in both speed and memory in my benchmark (but probably not in other applications). Its memory is measured for integer keys.
- This simple benchmark does not test the strengths and weaknesses of splay trees.
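For reference, here is the one-lookup insert-or-delete step in C++ (a sketch; insert() returns a pair whose second member tells whether the key was newly inserted):

```cpp
#include <tr1/unordered_set>
#include <utility>

// "insert if absent; delete otherwise" with a single hash lookup
void flip(std::tr1::unordered_set<unsigned> &h, unsigned x) {
    std::pair<std::tr1::unordered_set<unsigned>::iterator, bool> p = h.insert(x);
    if (!p.second) h.erase(p.first); // key was already present: remove it
}
```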
And here is the result of the second experiment:
| Library | grow (ns) | pred/grow (ns) | replace (ns) | fetch (ns) | fetchnull (ns) | remove (ns) | Memory (MB) |
|---|---|---|---|---|---|---|---|
| TR1 | 194.2 | 183.9 | 30.7 | 15.6 | 15.2 | 83.4 | 224.6 |
| STL hash_map | 149.0 | 110.5 | 35.6 | 11.5 | 14.0 | 87.2 | 204.2 |
| STL map | 289.9 | 289.9 | 141.3 | 134.3 | 7.0 | 288.6 | 236.8 |
| google-sparse | 417.2 | 237.6 | 89.5 | 84.0 | 12.1 | 100.4 | 85.4 |
| google-dense | 108.4 | 39.4 | 17.8 | 8.3 | 2.8 | 18.0 | 256.0 |
| khash (C++) | 111.2 | 99.2 | 26.1 | 11.5 | 3.0 | 17.4 | 198.0 |
Notes:
- CPU time is measured in nanoseconds per operation. Memory is measured by TCmalloc; it is the difference in memory before and after the allocation of the hash table, not the peak memory.
- In this experiment, integers are inserted in order and there are no collisions in the hash table.
- All these libraries provide similar APIs.
Discussions
- Speed and memory. The larger the hash table, the fewer the collisions and the faster the operations. For the same hash library, increasing memory always increases speed. When comparing two libraries, both speed and memory should therefore be considered.
- C vs. C++. All the C++ implementations have similar APIs and are very easy to use with any type of key. Both C libraries, ghthash and glib, can only keep pointers to the keys, which complicates the API and increases memory usage, especially on 64-bit systems where a pointer takes 8 bytes. In general, the C++ libraries are preferred over the C ones. Surprisingly, on 32-bit Mac OS X, glib outperforms TR1 and STL on string input. This might indicate that the glib implementation itself is very efficient, and that it is only the lack of functionality in C that hurts its performance.
- Generic programming in C. Except for my khash.h, all the other C hash libraries use void* to achieve generic typing. Using void* is fine for strings, but causes overhead for integers. This is why all the C libraries except khash.h are slower than the C++ libraries on integer keys, yet comparable on string keys (see the sketch after this list).
- Open addressing vs. chaining. Khash and the google hashes implement open addressing, while the rest implement chaining. In open addressing, each bucket takes the size of a key plus 0.25 byte; google sparsehash further compresses each unused bucket to 1 bit, achieving high memory efficiency. In chaining, the memory overhead of each bucket is at least 4 bytes on 32-bit machines and 8 bytes on 64-bit machines. However, chaining is less affected when the hash table is nearly full. In practice, open addressing and chaining occupy similar memory at similar speed. Khash takes less peak memory mainly due to its rehashing technique, which reduces memory usage. As far as speed is concerned, chaining may need fewer key comparisons: we can see this from the fact that the speed of chaining approaches that of open addressing on string keys but is much slower on integer keys.
- Memory usage of search trees. The B-tree is the winner here: each element in the B-tree needs only one additional pointer. When there are enough elements, a B-tree is at least half full, and on average around 75% full. So on 64-bit systems, a B-tree with N elements needs about N*8/0.75 ≈ 10.7N bytes of additional memory. A splay tree needs N*8*2 = 16N bytes of extra space. The RB tree is the worst.
- Other issues. a) Google hash becomes unbearably slow when I try to put a lot of strings in the hash table; none of the other libraries have this problem. b) Google hash performs more key comparisons than khash. This can be seen from the fact that google-dense is clearly faster on integer keys, where comparisons are cheap, but only comparable to khash on string keys, where each comparison is expensive.
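To illustrate the typing difference, a minimal khash sketch (assuming khash.h is on the include path; the generated table stores the integer keys inline, with no per-key allocation or void* indirection):

```cpp
#include <cstdio>
#include "khash.h"

KHASH_SET_INIT_INT(i32) // expands to a hash set specialized for int keys

int main() {
    int ret;
    khash_t(i32) *h = kh_init(i32);
    kh_put(i32, h, 42, &ret); // keys live directly in the bucket array,
    kh_put(i32, h, 7, &ret);  // not behind a pointer as in glib/ghthash
    std::printf("%d keys\n", (int)kh_size(h));
    kh_destroy(i32, h);
    return 0;
}
```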
Concluding remarks
- C++ hash libraries are much easier to use than C libraries. This is definitely a place where C++ is preferred over C.
- The TR1 hash implementation is no faster than the STL (SGI) implementation. Either may outperform the other under particular inputs or settings.
- SGI hash_map is faster and takes less memory than STL map. Unless ordering is important, hash_map is a better container than map.
- Google hash is a worthy choice once we understand why it becomes slow with many string keys.
- My khash library, a single-file macro-based header usable from both C and C++, achieves a good balance between speed and memory. All my source code is available on the Programs page.
Update
- A C interface can be elegant, too, if we implement it cleverly. See this post.
- I realized that we need just one lookup to achieve “insert if absent; delete otherwise”. This further improves the speed of all the C++ libraries.
- I have analyzed the google dense hash table in this post, which explains why it is faster than khash on integer keys but close to or slower than khash on string keys.
- This thread directed me to the gcc hashtable and the cocom hashtable. They are more or less independent of the rest of their source trees, but it would still take time to extract the code, so I have not benchmarked them; I am just keeping a record here.
- The Python dictionary is in fact a hash table. The dictnotes.txt file in that directory gives some quite interesting discussion of how to implement a hash table efficiently.
- The hashlib library. It is a bit hard to use and I cannot get it running correctly. Possibly I have not provided a proper second hash function for rehashing.
- Added results for STL set (based on a red-black tree) and John-Mark Gurney’s B-tree implementation (JG’s btree). Both are considerably slower than the hash tables. Of course, search trees provide more functionality than hash tables, and every nice thing comes at a price. I have also tried Jason Evans’s and Niels Provos’ red-black tree implementations. On integer keys, JE’s takes 6.110 seconds on Mac-Intel using 18.884 MB of memory, and NP’s takes 6.611 seconds using the same amount of memory. This performance is close to that of STL set. They appear to be slower mainly due to the additional malloc/free calls I have to make under their APIs. Unlike hash tables, which can be implemented in a variety of ways, a red-black tree is usually implemented in just one way (well, there can be more; see also Jason’s blog). And so I only show the performance of STL set as a representative.
- Replaced JG’s B-tree with a modified version. The new version is both faster and more lightweight.
First of all, thank you for posting such interesting information about hash sets and sorting. I love your site. I have a couple of questions.
In the concluding remarks, you stated that Google hash is really slow due to numerous comparisons, yet from the benchmarks it seems that Google hash is overall faster than khash. If memory is not a consideration, is Google hash the better hash set for strings?
Also, it was not clear what the units are for Memory. Is that in megabytes?
Is there a way to check whether a key exists in your hash set and, if it does not, add the key and value without incurring an additional lookup?
Finally, does your code work on windows?
Thanks!
One other question, I’m thinking about using your hashset, but I need to be able to save the data incrementally as it is being updated. I was thinking about using a memory mapped file for this. Any thoughts on this?
Thank you for your interest. Actually, I meant to insert/delete 10 million string keys in the first experiment, but it seemed to take ages for google hash to finish these operations. I have not read google hash’s source code, so I do not know why this happens. In the end I had to present results on 5 million keys. Apart from this issue, google-dense is the better choice when memory is not critical.
Memory is in megabytes.
The C++ interface is quite similar to, though less elegant than, STL hash_set. You can simply insert the key; there is only one lookup. If the key is already in the hash, you have:
h->insert(key).second == false
C version: You can call kh_put(int, h, key, &ret). If ret equals zero, the key is already in the hash.
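For example, with a map initialized by KHASH_MAP_INIT_INT (a sketch; here the khash name token is “int”):

```cpp
#include "khash.h"
KHASH_MAP_INIT_INT(int, int) // int keys -> int values

// insert (key, val) unless key already exists;
// returns 1 if newly inserted, 0 if the key was already present
static int put_once(khash_t(int) *h, int key, int val) {
    int ret;
    khint_t k = kh_put(int, h, key, &ret); // a single hash lookup
    if (ret == 0) return 0;                // ret == 0: key already in the hash
    kh_value(h, k) = val;                  // new slot: set the value
    return 1;
}
```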
I have not tested it on Windows as I do not have a Windows machine, but I think it should work, as I am not using anything specific to *nix.
In your second comment, could you explain a bit more about “save the data incrementally as it is being updated”?
Thanks,
-ac
I’m trying to do a few things efficiently.
The value for a key changes periodically, so if a key exists, I don’t just want to not insert it, but rather I’d like to update the value if it’s different.
Second, the hash table I’m creating is huge. So I need to be able to make it persistent. That way, if the computer reboots, I’ll be able to recover from the filesystem the last changes. I can’t afford to constantly write out the hash table. It’s got to be saved out as it gets changed.
Any ideas? Thanks!
With all the libraries in the benchmark except glib, you can achieve query+insertion with one lookup. For example, all the C++ libraries and khash.h return an iterator upon insertion; you can update the value pointed to by the iterator, so only one lookup is required.
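For instance (a sketch with an STL-style container and integer keys/values):

```cpp
#include <tr1/unordered_map>
#include <utility>

// update-or-insert with a single lookup: insert() returns an iterator to
// either the new element or the pre-existing one
void upsert(std::tr1::unordered_map<int, int> &h, int key, int val) {
    std::pair<std::tr1::unordered_map<int, int>::iterator, bool> p =
        h.insert(std::make_pair(key, val));
    if (!p.second) p.first->second = val; // key existed: overwrite the value
}
```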
No hash library can directly dump a hash table to disk. But it would not be too hard (still hard) to implement this for khash and google-hash, as they use open addressing to resolve collisions. I can implement this functionality if you really need it.
In addition to a hash table, you might also consider libsqlite, creating a SQL database for your data. Although working with an on-disk database is less efficient than an in-memory hash table, the memory footprint is much smaller. Sqlite is very robust and convenient. You could also use a B-tree library, but my experience is that it is slower than libsqlite, at least on the data I have tried.
If you want to dump the hash table to disk, please also let me know the types of the keys and values.
-ac
The keys are strings and the values are PODs (plain old data). So basically, with a POD, you can do a straight dump from memory to disk with memcpy.
I considered using sqlite, but it’s too slow for my needs.
Ideally, the khash constructor would let me specify an optional disk file for reading and writing contents as I update the hash table. The performance for reading and writing should be identical, minus a tiny amount of overhead for calling memcpy.
Actually in your case, I would prefer to manually manage those strings, instead of relying on a new hash table implementation. Here is a possible strategy.
You can generate two files: one holding all the PODs and the other holding the keys and their offsets in the large POD file. You need to build two hash tables: the first has strings as keys and offsets as values, and the second has offsets as keys and PODs as values. When your program is invoked, you load the second file into the first hash table. Upon query, you look up the key in the first hash table to get the offset and then query the second hash table. If the offset exists there, you get the POD directly; if not, you insert the offset and load the corresponding POD from disk into the second hash table. To insert a new POD, you incrementally append it to the first file. When your program finishes, you dump all the keys and offsets in the first hash table to the second file. Loading/saving a few million short string keys can be done in a few seconds.
The second hash table works as a cache. If your program does not traverse all the PODs, this cache saves a lot of memory by leaving unused PODs on disk. You may worry about having to query two hash tables, but you can preallocate the second hash table, and the overhead of the additional query is trivial.
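A rough sketch of this scheme (all names and the POD layout are hypothetical; error handling omitted):

```cpp
#include <tr1/unordered_map>
#include <string>
#include <cstdio>
#include <utility>

struct Pod { char payload[64]; }; // hypothetical plain-old-data record

typedef std::tr1::unordered_map<std::string, long> Index; // key -> offset
typedef std::tr1::unordered_map<long, Pod> Cache;         // offset -> POD

// append a new POD to the data file (assumed open in "r+b"/"a+b" mode) and
// record its offset; fflush() after every write keeps the file recoverable
void put_pod(FILE *podf, Index &idx, const std::string &key, const Pod &p) {
    std::fseek(podf, 0, SEEK_END);
    long off = std::ftell(podf);
    std::fwrite(&p, sizeof(Pod), 1, podf);
    std::fflush(podf);
    idx[key] = off;
}

// fetch a POD by key, reading it from the data file on a cache miss
const Pod *get_pod(FILE *podf, const Index &idx, Cache &cache,
                   const std::string &key) {
    Index::const_iterator i = idx.find(key);
    if (i == idx.end()) return 0; // unknown key
    Cache::iterator c = cache.find(i->second);
    if (c == cache.end()) {       // not cached: load from disk
        Pod p;
        std::fseek(podf, i->second, SEEK_SET);
        std::fread(&p, sizeof(Pod), 1, podf);
        c = cache.insert(std::make_pair(i->second, p)).first;
    }
    return &c->second;
}
```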
What do you think?
Hi attractivechaos,
Your hash table topic is very interesting. I am also in the midst of thinking about implementing fast external storage for dumping hash table data. Your idea is quite good, but I am still looking for a better solution. Btw, have you tried to implement it so far? Thanks.
I am a bit busy now and have not planned to implement this idea in the very near future. However, I think it would not be hard to implement using existing hash table implementations. Also, my current idea is probably not optimal; I came up with it in a few minutes. You might also have a look at how MySQL implements on-disk hash tables. By default, MySQL uses a B-tree for indexing, but it also supports hash tables.
Btw, to Paul: if your program does not use the POD data frequently, you might consider compressing each POD to save memory and disk space. Decompressing a POD should be very fast; the LZO library claims to be only ~4X slower than memcpy(). I have not tried it, though, as I do not have similar applications.
We have similar thinking. I had a similar idea: use a file for storing the data, with the offset built into the POD, so that we don’t need the second hash table or a second lookup. Unfortunately, I can’t save the hash table at the end, because my program may terminate unexpectedly and I need to recover without losing data. That is why I need to save the data incrementally as I add to the hash table. Regarding LZO, it may make sense to use it, depending on how much memory it saves. I think a persistent hash table, where the data is saved as it is inserted or removed, is fundamentally a useful container.
By the way, it would be nice to be able to set a flag in the constructor for khash to optionally use LZO compression for values.
With my strategy, you can save the key-value pair immediately when you find the key is not in the hash table, which suits your aim. Just remember to call fflush() (or the C++ equivalent) after every write operation. I would still prefer using two hash tables, as this amounts to good caching. In addition, as I said before, unless you traverse the hash table, the second table is very sparse and queries on it are very cheap.
By “unexpectedly”, do you mean your program will get killed, the machine will be restarted, or something else? If it gets killed, you could consider taking action on receiving the signal. Bear in mind that the kill might happen while your program is writing data, in which case you will end up with a truncated file.
As a matter of philosophy in designing a template/macro library, no .cpp/.c files should be part of the library, and irrelevant functionality should be excluded, so I would not add LZO compression. In any case, from its documentation, LZO sounds quite easy to use.
In a comparison of C++ containers I saw about a year ago, which covered an STL implementation, some Borland containers, Qt and something I forget, Qt was the fastest more or less across the board. This was on Windows, FWIW.
So, I’d like to see Qt in the comparison. It has containers tuned for actual existing machines unlike the STL and I’d really like to see how well it does. Qt is a fine cross-platform C++ toolkit – you can get it under the GPL license from trolltech.com.
Boost’s intrusive containers will probably bring a speed advantage, too.
You should really make sure that the hash functions used are the same, which is not hard at all. I’ve seen that switching to a different hash function can yield at least a 20% improvement for STL hash_set/map on string keys (dictionary words and phrases etc.).
What hash function were you using? I thought the STL and glib string hash functions, which are quite similar to each other, were among the most efficient ones. I have tried supplying the glib string hash function to the C++ implementations; it makes almost no difference. It is a little hard to change the hash function for some C implementations, but I did try to use the same one.
On integer keys, as my input is random, the best hash function is the integer itself. STL hash, glib and khash do this by default. Some C libraries, such as ghthash, treat an integer as fixed-length data and hash it anyway. I did not change this behaviour.
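For reference, a sketch of an X31-style string hash along the lines of what khash and glib use (the exact seed and constants may differ between library versions):

```cpp
// multiply-by-31 string hash: h = h*31 + c, using (h << 5) - h == h*31
static inline unsigned x31_string_hash(const char *s) {
    unsigned h = (unsigned char)*s;
    if (h) for (++s; *s; ++s) h = (h << 5) - h + (unsigned char)*s;
    return h;
}
```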
Very nice comparison. Where can I find the benchmark code you have used to test the hash libs?
I would recommend to check this updated page:
http://attractivechaos.awardspace.com/udb.html
The source code can be downloaded here:
http://attractivechaos.awardspace.com/download/udb-20080928.tar.bz2
In this new benchmark, I have removed glib and ghthash as they need installation while all the other libraries do not. If you are still interested in these two libraries, please check here:
http://www.freewebs.com/attractivechaos/hash_test.cc.html
for the source code. This file is a bit disorganized and not easy to compile.
@Andreas
I have tested QT4 and boost. QT4’s QHash is marginally faster than SGI’s hash_map but costs more memory. Boost’s unordered_map is slightly slower than hash_map with almost the same memory footprint.
Hi,
Thanks a lot for your comparisons, a very good resource.
I was wondering if you have any idea or info on how the Intel libraries compare to the ones you have covered.
http://www.threadingbuildingblocks.org/files/documentation/a00086.html
Thanks…
Congratulations on an excellent article.
It brings together a lot of information in a clear and readable fashion. It was a pleasure to read after all those hundreds of pages preceding yours over the last 3 years or so.
Hi, thanks for doing the benchmarking work.
I’m wondering — do you know of any C implementations that will work on Windows, using Visual Studio? I tried using your code with this environment but I get a lot of compile errors related to Unix-specific functions.
Thank you for your time… Very good article.
Thanks for this wonderful write-up. Just want to point out that Google dense and Google sparse now have read/write from disk support for basic datatypes.
add a comparison with boost?
Please check out this page: Another Look at my old Benchmark.
You can find an interesting article about an STL hash container benchmark here: http://www.intelliproject.net/articles/showArticle/index/unorderedbenchmark
Be aware that khash is not safe for C++ value types that are complex objects (e.g. std::string). The reason is that the library doesn’t call the default constructor when it reallocates memory, so objects are left in an uninitialized state.
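For example (a sketch of the pitfall):

```cpp
#include <string>
#include "khash.h"

// khash grows its arrays with realloc() and assigns into the raw memory,
// so C++ objects are never properly constructed or destructed. This may
// compile, but std::string values used this way are undefined behaviour:
KHASH_MAP_INIT_INT(bad, std::string) // don't do this

// safe: store PODs, raw pointers, or indices into a separately managed
// container instead
KHASH_MAP_INIT_INT(ok, const char*)
```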
Do you have a script to reproduce your main results?
From my observations, google’s dense_hash_map seems faster than khash when using const char* keys and integer values.
google::dense_hash_map class.
size = 999679
Elapsed 0.85(s)
google::dense_hash_map class — CityHash64.
size = 999679
Elapsed 0.90(s)
khash
size = 999644
Elapsed 1.38(s)
However, I have not found a convenient way to check memory consumption.
To my understanding, google dense_hash_map (version 1.1) seems the better solution in terms of speed?
Google dense is already faster than khash on a 64-bit linux machine. Google dense has also been updated since this benchmark, so it is quite possible that it has become faster. Still, the advantage of khash is its small memory footprint.
uthash has HASH_COUNT() to return the number of items in a hash table.
I believe it was added later. This post is very old.
Hi, as I can see from this post you have great experience with hash tables. I’m looking for a simple implementation in C because I need to store at most 100 structures in a hash table (the reason I use hash tables is that I’m writing a real-time app and it needs to be fast). The structure will have a key and value (int and char) and another int field. I need to store the structures in a hash table, retrieve them by key (id), and delete all the structures from the table. I’m reading about the uthash header; does anyone have better advice? Thx
Great article! Hi, just wanted to mention sparsepp, the updated version of Google’s sparse_hash_map/set, which is significantly faster. see https://github.com/greg7mdp/sparsepp.