Synopsis
Here is an simple example showing how to use khash.h library:
#include "khash.h" KHASH_MAP_INIT_INT(32, char) int main() { int ret, is_missing; khiter_t k; khash_t(32) *h = kh_init(32); k = kh_put(32, h, 5, &ret); if (!ret) kh_del(32, h, k); k = kh_get(32, h, 10); is_missing = (k == kh_end(h)); k = kh_get(32, h, 5); kh_del(32, h, k); for (k = kh_begin(h); k != kh_end(h); ++k) if (kh_exist(h, k)) kh_value(h, k) = 1; kh_destroy(32, h); return 0; }
The second line says we want to use a hash map with int as key and char as value. khash_t(int) is a type. kh_get() and kh_put() returns an iterator, or the position in the hash table. kh_del() erases the key-value in the bucket pointed by the iterator. kh_begin() and kh_end() return the begin and the end of iterator, respectively. And kh_exist() tests whether the bucket at the iterator is filled with a key-value. The APIs are not so concise in comparison to C++ template, but are very straightforward and flexible. How can this be done?
Achieving generic programming in C
The core part of khash.h is:
#define KH_INIT(name, key_t, val_t, is_map, _hashf, _hasheq) \ typedef struct { \ int n_buckets, size, n_occupied, upper_bound; \ unsigned *flags; \ key_t *keys; \ val_t *vals; \ } kh_##name##_t; \ static inline kh_##name##_t *init_##name() { \ return (kh_##name##_t*)calloc(1, sizeof(kh_##name##_t)); \ } \ static inline int get_##name(kh_##name##_t *h, key_t k) \ ... \ static inline void destroy_##name(kh_##name##_t *h) { \ if (h) { \ free(h->keys); free(h->flags); free(h->vals); free(h); \ } \ } #define _int_hf(key) (unsigned)(key) #define _int_heq(a, b) (a == b) #define khash_t(name) kh_##name##_t #define kh_init(name) init_##name() #define kh_get(name, h, k) get_##name(h, k) #define kh_destroy(name, h) destroy_##name(h) ... #define KHASH_MAP_INIT_INT(name, val_t) \ KH_INIT(name, unsigned, val_t, is_map, _int_hf, _int_heq)
In macro ‘KH_INIT’, name is a unique symbol that distinguishes hash tables of different types, key_t the type of key, val_t the type of value, is_map is 0 or 1 indicating whether to allocate memory for vals, _hashf is a hash function/macro and _hasheq the comparison function/macro. Macro ‘KHASH_MAP_INIT_INT’ is a convenient interface to hash with integer keys.
When ‘KHASH_MAP_INIT_INT(32, char)’ is used in a C source code file the following codes will be inserted:
typedef struct { int n_buckets, size, n_occupied, upper_bound; unsigned *flags; unsigned *keys; char *vals; } kh_int_t; static inline kh_int_t *init_int() { return (kh_int_t*)calloc(1, sizeof(kh_int_t)); } static inline int get_int(kh_int_t *h, unsigned k) ... static inline void destroy_int(kh_int_t *h) { if (h) { free(h->keys); free(h->flags); free(h->vals); free(h); } }
And when we call: ‘kh_get(int, h, 5)’, we are calling ‘get_int(h, 5)’ which is defined by calling KH_INIT(int) macro. In this way, we can effectively achieve generic programming with simple interfaces. As we use inline and macros throughout, the efficiency is not affected at all. In my hash table benchmark, it is as fast and light-weighted as the C++ implementation.
Other technical concerns
- Solving collisions. I have discussed this in my previous post. I more like to achieve smaller memory and therefore I choose open addressing.
- Grouping key-value pairs or not. In the current implementation, keys and values are kept in separated arrays. This strategy will cause additional cache misses when keys and values are retrieved twice. Grouping key-value in a struct is more cache efficient. However, the good side of separating keys and values is this avoids waste of memory when key type and value type cannot be aligned well (e.g. key is an integer while value is a character). I would rather trade speed a bit for smaller memory. In addition, it is not hard to use a struct has a key in the current framework.
- Space efficient rehashing. Traditional rehashing requires to allocate one addition hash and move elements in the old hash to the new one. For most hash implementations, this means we need 50% extra working space to enlarge a hash. This is not necessary. In khash.h, only a new flags array is allocated on rehashing. Array keys and values are enlarged with realloc which does not claim more memory than the new hash. Keys and values are move from old positions to new positions in the same memory space. This strategy also helps to clear all buckets marked as deleted without changing the size of a hash.
Hi, i didn’t find it clearly from the source code, so i’m asking you. Do this hash function support multi-threading. I’m particularly interested in the iterations between get, put and del. From what I have read in the source I believe it doesn’t support it, but maybe i miss something.
Thanks
Fabricio González
Sorry, it does not support multi-threading. Maybe you could have a look at SunriseDD library: http://www.sunrisetel.net/software/devtools/sunrise-data-dictionary.shtml. I have not tried, though.
Thanks for the fast response. Will look into Sunrise.
In your example…
Wouldn’t KHASH_MAP_INIT_INT(32, char) create a struct called kh_32_t?
Yes, exactly.
In your example it says it would create a struct called kh_int_t. That was confusing me a bit.
Corey is right. That seems to be a typo in the post.
[…] [upmod] [downmod] Implementing Generic Hash Library in C « Attractive Chaos (attractivechaos.wordpress.com) 1 points posted 7 months, 3 weeks ago by jeethu tags generic c […]
What would be the best way to handle failures of memory allocations? Currently the return values are not checked causing a segfault on failure.
What would be the best way to detect and handle colissions?
thanks,
Tobias
Here is my example code:
#include
#include
#include “khash.h”
typedef void (*HANDLER)(char *);
KHASH_MAP_INIT_STR(32, HANDLER)
void P1Handler (char *val)
{
printf(“P1Handler got value=%s\n”, val);
}
void P2Handler (char *val)
{
printf(“P2Handler got value=%s\n”, val);
}
void P3Handler (char *val)
{
printf(“P2Handler got value=%s\n”, val);
}
void Register(khash_t(32) *h1, char *paramName, HANDLER h)
{
int ret;
khiter_t k;
k = kh_put(32, h1, paramName, &ret);
kh_value(h1, k) = h;
}
void Process(khash_t(32) *h, char *paramName, char *val)
{
khiter_t k;
HANDLER handler;
k = kh_get(32, h, paramName);
handler = kh_value(h, k);
handler(val);
}
int main() {
int ret, is_missing;
int hash;
khash_t(32) *h = kh_init(32);
/* add parameterhandlers with namehash as key */
Register(h, “11”, P1Handler);
Register(h, “22”, P2Handler);
Register(h, “13”, P3Handler);
// process some parameters with value
Process(h, “11”, “hello”);
Process(h, “22”, “world”);
Process(h, “13”, “duplicate”);
kh_destroy(32, h);
return 0;
}
P3Handler should print P3Handler and not P2Handler. The program does not create a colission, howewer it would certainly be possible.
Sorry.
Hello,
I am adding kash to a project and am a little confused by this example code:
k = kh_put(32, h, 5, &ret);
if (!ret) kh_del(32, h, k);
kh_value(h, k) = 10;
It looks like the code will:
put the key=5 into the table,
then check the return value from kh_put
then delete the key at index k
then store a value at index k
I must be wrong. How could we store a value if the key has just been deleted?
What does kh_del do here if it is not deleting the element associated with key=5 from the table?
Thank you in advance
Yes, the example seems to be wrong (although it is ok to do that). Please check out the latest version from github:
https://github.com/attractivechaos/klib/blob/master/khash.h
It has the wrong example fixed.
Thank you for the quick reply. So, from what you say, I do not need to handle the zero return value from kh_put with a delete. I am a little confused about the return values meaning and how I need to handle them… or is there no need to worry about the secondary return value from kh_put?
Hello. I think that the khash structure and klib are very cool ideas. It seems to me (though I’ve only played with it for about 45 minutes) that it isn’t possible to create a map-of-maps using khash.h. Ie, a khash table with an int as a key and a khashtable as a value.
Am I correct, or are there some tricks that I should use?
You can do something like this (not tested though):
KHASH_SET_INIT_INT(int)
typedef khash_t(int) *inthash_p;
KHASH_INIT(int2, khint32_t, inthash_p, 1, kh_int_hash_func, kh_int_hash_equal)
Note that as C does not have constructor, you need to allocate inthash_p when you insert the first key. Something like this:
int absent;
khint_t k;
khash_t(int2) *h;
h = kh_init(int2);
k = kh_put(int2, h, 10, &absent);
if (absent) kh_val(h, k) = kh_init(int);
…
Not elegant, but doable.
Hello, and thanks for such a fast response. The idea you’ve described seems to work. Now I have maps-of-maps and maps-of-sets residing a struct, The code all compiles, but I’ve yet to finish the algorithms which are going to rely on the structure. I am optimistic that it will work well. Thanks for the help.
Hi HL3,
Given that the hash is fixed ( no more insert/delete), can it be retrieved (using get) by multiple threads concurrently? I think yes?
Yes.
Hi! Great library.
One question though. what is the reasoning behind the ‘name’ parameter in these macros? Couldn’t you just use the key-value pair types (or simply the type for vec(t), for example)? Is it because pointer types couldn’t be sanitized properly, or something else?
Because this way (it seems, perhaps I am mistaken) I always have to keep track of the triplet (name, type, variable)
With “name”, you can use two different hash functions for the same type in one .c file. Also, with token concatenation, you can’t put “char*” in a struct/function name.
But I thought that these macros *define* structs? Isn’t `khash_t(32) *h` a pointer to a specific *instance* of the struct defined above? From what I see, you are still passing `h` to all macros, so I am presuming that `khash_t(32) *h1, *h2` should work too?
Ok, now that I re-read it I got what you’re saying, different hash functions. I wonder if it makes sense for stuff like `klist` though? Never mind, thanks!
Hi,
I am planning to use the iterator to index an other table. is the iterator retuned by kh_put() always fits the size of the created hash table or should be checked?
In other words what will be the max value of an iterator returned by kh_put() ?
Thank you in advance.
Larbizzz
khash will resize the table when it is not large enough. However, when you don’t have sufficient memory to allocate the new table, the return value may equal kh_end(table) with the last variable set to -1.
Thank you for the quick reply, I am new to hashing and trying to understand khash! so one more question: I have a table used by different threads, Can I use khash to find for me just the correct index without storing all table content in khash table? if yes how can I keep the iterators returned by khash fits for the table that is seen by all threads?
Thanks 1s more!
Larbizzz
Hi,
Is it possible to retrieve the exact index of a key within the khash table ?
Br
Larbizzz
Consider this alternative hashmap implementation that provides a more natural type-safe generic API:
https://github.com/DavidLeeds/hashmap
can is use this to for negative key value(int64_t) ??