Synopsis
Here is an simple example showing how to use khash.h library:
#include "khash.h"
KHASH_MAP_INIT_INT(32, char)
int main() {
int ret, is_missing;
khiter_t k;
khash_t(32) *h = kh_init(32);
k = kh_put(32, h, 5, &ret);
if (!ret) kh_del(32, h, k);
k = kh_get(32, h, 10);
is_missing = (k == kh_end(h));
k = kh_get(32, h, 5);
kh_del(32, h, k);
for (k = kh_begin(h); k != kh_end(h); ++k)
if (kh_exist(h, k)) kh_value(h, k) = 1;
kh_destroy(32, h);
return 0;
}
The second line says we want to use a hash map with int as key and char as value. khash_t(int) is a type. kh_get() and kh_put() returns an iterator, or the position in the hash table. kh_del() erases the key-value in the bucket pointed by the iterator. kh_begin() and kh_end() return the begin and the end of iterator, respectively. And kh_exist() tests whether the bucket at the iterator is filled with a key-value. The APIs are not so concise in comparison to C++ template, but are very straightforward and flexible. How can this be done?
Achieving generic programming in C
The core part of khash.h is:
#define KH_INIT(name, key_t, val_t, is_map, _hashf, _hasheq) \
typedef struct { \
int n_buckets, size, n_occupied, upper_bound; \
unsigned *flags; \
key_t *keys; \
val_t *vals; \
} kh_##name##_t; \
static inline kh_##name##_t *init_##name() { \
return (kh_##name##_t*)calloc(1, sizeof(kh_##name##_t)); \
} \
static inline int get_##name(kh_##name##_t *h, key_t k) \
... \
static inline void destroy_##name(kh_##name##_t *h) { \
if (h) { \
free(h->keys); free(h->flags); free(h->vals); free(h); \
} \
}
#define _int_hf(key) (unsigned)(key)
#define _int_heq(a, b) (a == b)
#define khash_t(name) kh_##name##_t
#define kh_init(name) init_##name()
#define kh_get(name, h, k) get_##name(h, k)
#define kh_destroy(name, h) destroy_##name(h)
...
#define KHASH_MAP_INIT_INT(name, val_t) \
KH_INIT(name, unsigned, val_t, is_map, _int_hf, _int_heq)
In macro ‘KH_INIT’, name is a unique symbol that distinguishes hash tables of different types, key_t the type of key, val_t the type of value, is_map is 0 or 1 indicating whether to allocate memory for vals, _hashf is a hash function/macro and _hasheq the comparison function/macro. Macro ‘KHASH_MAP_INIT_INT’ is a convenient interface to hash with integer keys.
When ‘KHASH_MAP_INIT_INT(32, char)’ is used in a C source code file the following codes will be inserted:
typedef struct {
int n_buckets, size, n_occupied, upper_bound;
unsigned *flags;
unsigned *keys;
char *vals;
} kh_int_t;
static inline kh_int_t *init_int() {
return (kh_int_t*)calloc(1, sizeof(kh_int_t));
}
static inline int get_int(kh_int_t *h, unsigned k)
...
static inline void destroy_int(kh_int_t *h) {
if (h) {
free(h->keys); free(h->flags); free(h->vals); free(h);
}
}
And when we call: ‘kh_get(int, h, 5)’, we are calling ‘get_int(h, 5)’ which is defined by calling KH_INIT(int) macro. In this way, we can effectively achieve generic programming with simple interfaces. As we use inline and macros throughout, the efficiency is not affected at all. In my hash table benchmark, it is as fast and light-weighted as the C++ implementation.
Other technical concerns
- Solving collisions. I have discussed this in my previous post. I more like to achieve smaller memory and therefore I choose open addressing.
- Grouping key-value pairs or not. In the current implementation, keys and values are kept in separated arrays. This strategy will cause additional cache misses when keys and values are retrieved twice. Grouping key-value in a struct is more cache efficient. However, the good side of separating keys and values is this avoids waste of memory when key type and value type cannot be aligned well (e.g. key is an integer while value is a character). I would rather trade speed a bit for smaller memory. In addition, it is not hard to use a struct has a key in the current framework.
- Space efficient rehashing. Traditional rehashing requires to allocate one addition hash and move elements in the old hash to the new one. For most hash implementations, this means we need 50% extra working space to enlarge a hash. This is not necessary. In khash.h, only a new flags array is allocated on rehashing. Array keys and values are enlarged with realloc which does not claim more memory than the new hash. Keys and values are move from old positions to new positions in the same memory space. This strategy also helps to clear all buckets marked as deleted without changing the size of a hash.
Read Full Post »