Zilong Tan has kindly shared his observation that using a power-of-2 bucket count may dramatically improve the speed for integer keys. He implemented this idea in his ulib library. He is right. Here is a tiny benchmark done on my laptop (Intel Core 2 Duo; Mac OS X 10.6; gcc 4.2.1; program khash_test.c).
| Version | Int-set-5M (sec) | Str-set-5M (sec) | Int-set-4M (sec) | Int-set-6M (sec) |
|---|---|---|---|---|
| 0.2.5 | 0.898 | 2.174 | 0.593 | 1.327 |
| 0.2.6 | 0.476 | 1.814 | 0.285 | 0.664 |
| 0.2.6 linear | 0.333 | 1.744 | 0.250 | 0.455 |
where “0.2.6” is the new version and “0.2.6 linear” indicates that linear probing is in use (by default, khash uses double hashing for stability). We can see that 0.2.6 is nearly twice as fast as 0.2.5 for the integer hash set. Linear probing is even faster than the default double hashing due to better cache locality and slightly fewer CPU instructions, but double hashing tends to be more robust to certain non-random inputs.
The new version is available here.
Update: Note that due to rehashing, the performance of a hash table as a function of input size is not a smooth curve; rehashing takes time. I have therefore added two more data points to show that the speedup is not an artifact of this caveat.
I’ve written extensively about ways khash could be improved:
http://cbloomrants.blogspot.com/2008/10/10-21-08-2.html
http://cbloomrants.blogspot.com/2008/10/10-21-08-3.html
http://cbloomrants.blogspot.com/2010/11/11-29-10-useless-hash-test.html
The most glaring one is that the separate arrays for flags, keys and data are very bad for cache performance.
I think I will reconsider “mod prime” and using special keys for DELETED/EMPTY instead of a flag array. However, I should note that the separation of flags, keys and values is intentional and was discussed in my previous blog posts (post on the Google dense hash table). The key reason is that, in addition to time efficiency, khash also strives for memory efficiency. If we put keys and values in one C struct, we may waste memory due to alignment padding (khash was first used to keep an 8-byte key and a 1-byte value; putting them in the same struct wastes 7 bytes per bucket). In addition, one can always store “struct {int key,val;}” in the hash table and hash only “key” to avoid a cache miss.
A separate flags array also reduces memory during rehashing, thanks to the kick-out process implemented in khash. With the traditional method, we keep the old buckets and insert them into the new table; khash only keeps the old flags array instead of the old buckets.
I was also aware of the advantage of caching the hashed values (post on rde::hash_map), especially for string keys. Again, I did not implement this, in order to keep a smaller memory footprint. And frequently I do not need a very fast string hash table in my own applications.
So are you updating BWA with your new khash?
I can, but it does not really matter. Negligible CPU time goes to hash table operations. Samtools probably does not need a fast hash table, either.
OK, maybe there is something here I don’t really get.
What’s the point of this khash algorithm? I mean, hashing an input into a number is hardly new or difficult. So what makes khash special and worthwhile? Is it the size efficiency?
Khash is one of the most efficient generic hash table libraries in C/C++, in terms of both time and space. It is also among the shortest: everything is in one header file of ~250 lines of code (the core algorithm in 150 lines), with no dependency on any library other than libc. Yes, it is not hard to write a 250-line program; you can write your own.
Well, actually, I do.
I’ve been looking into your code, and although it is just a single file, I do not find it easy to read. There are many clever preprocessor tricks, which make it difficult to grasp at first glance.
I write hash-related code fairly often, so that’s why I don’t see what’s difficult about it. But reading the comments, I think there is something I don’t get. Maybe it is the word “generic”, which may hide some kind of complexity I’m not aware of.
I just compared khash on my side, and it is definitely one of the fastest hash libraries you can get. Implementing a hash library is not hard, but it is hard to get a good one.
The preprocessor tricks in khash.h are not trivial.
Great code. I have had success using khash for a small number of relatively large hashes here and there. At present I am trying to make a large array of hashes or, similarly, embedding a khash in a struct, and making a large array of structs. If someone could please post a simple example of how this could be (elegantly) achieved using khash, it would be greatly appreciated.
Many thanks for the prompt response… testing the approach out now. Question: For hashing a known dataset, how does khash compare to gperf, cmph in terms of speed and efficiency?
Gperf and cmph are for perfect hashing, while khash is a generic dynamic hash table. These are not the same thing.
How does khash handle collisions? Separate chaining or open addressing?
Open addressing.
First, thank you for taking the time to write this awesome hash and giving it freely.
I am noticing a weird behavior when using kh_resize(32, h, theSize) before any keys are entered.
Depending on the value of theSize (which is an int), every key put into h will appear as present [using z = kh_get(32, h, k); kh_exist(h, z)] even if nothing has been inserted before.
Sometimes it behaves as expected, with kh_exist() returning 0, and other times it incorrectly returns 1, again depending on the value of theSize.
I have a reproducible example if that will help things make more sense.
Thank you for any help you can provide. I look forward to using your tool properly.
Do you have an example? Thanks.
Sure. Do you want me to post it here or send through email?
Better a gist at github, or you may post it here. Thank you. The problem you were describing could happen.
Any luck with my example? I’m perplexed by it; then again, my skills at this aren’t as high as yours.
I’ve never done a gist, but I put the code on github: https://github.com/jaredlander/join/blob/master/src/mainhash.cpp
Thanks for checking it out.
I wrote a quick benchmark and tested it against AC_VERSION_KHASH_H “0.2.4” and AC_VERSION_KHASH_H “0.2.8” with the following results:
$ gcc -O3 khash-example.c && time ./a.out 32000000
— generating data… done!
[ht_khash_str_put] size: 32000000
[ht_timing] 11.340 sec or 2821869.5 per second
[ht_khash_str_get] size: 32000000
[ht_timing] 4.970 sec or 6438631.8 per second
$ gcc -O3 khash-example.c && time ./a.out 32000000
— generating data… done!
[ht_khash_str_put] size: 32000000
[ht_timing] 41.160 sec or 777453.8 per second
[ht_khash_str_get] size: 32000000
[ht_timing] 3.490 sec or 9169054.4 per second
Conclusion: Insert time has become horrible 😦 Is this to be expected for some reason?
Just for fun I removed inlining from the last test and repeated it:
$ gcc -O3 khash-example.c && time ./a.out 32000000
— generating data… done!
[ht_khash_str_put] size: 32000000
[ht_timing] 41.410 sec or 772760.2 per second
[ht_khash_str_get] size: 32000000
[ht_timing] 18.090 sec or 1768933.1 per second
Conclusion: On modern CPUs inlining makes almost no difference?
What keys are you inserting? What is the hash function? Are they random enough? I guess what you are seeing is caused by collisions. The latest version at github uses quadratic probing, which has better cache efficiency but may perform worse given lots of collisions. If your keys and hash function are good enough and you still get worse performance, I will probably revert to double hashing. Thanks.
Unique 16-char hex strings. I haven’t changed the hash function built into khash. Should I? Have you looked at MurmurHash3 for hashing strings? I also noticed that the newer version of khash caused the process to grow to 1.9 GB instead of 1.8 GB. Not a big deal, I guess.
Do you have the source code somewhere? I will look into this. If quadratic probing really hurts performance, I will switch back to double hashing. Thanks! As to memory, the older version doubles the table size when it is 77% full; the new version doubles at 75%.
Source code is here: https://gist.github.com/anonymous/5691733