Implementing Generic Hash Library in C

September 2, 2008 by attractivechaos

Synopsis

Here is an simple example showing how to use khash.h library:

#include "khash.h"
KHASH_MAP_INIT_INT(32, char)
int main() {
	int ret, is_missing;
	khiter_t k;
	khash_t(32) *h = kh_init(32);
	k = kh_put(32, h, 5, &ret);
	if (!ret) kh_del(32, h, k);
	k = kh_get(32, h, 10);
	is_missing = (k == kh_end(h));
	k = kh_get(32, h, 5);
	kh_del(32, h, k);
	for (k = kh_begin(h); k != kh_end(h); ++k)
		if (kh_exist(h, k)) kh_value(h, k) = 1;
	kh_destroy(32, h);
	return 0;
}

The second line says we want to use a hash map with int as key and char as value. khash_t(int) is a type. kh_get() and kh_put() returns an iterator, or the position in the hash table. kh_del() erases the key-value in the bucket pointed by the iterator. kh_begin() and kh_end() return the begin and the end of iterator, respectively. And kh_exist() tests whether the bucket at the iterator is filled with a key-value. The APIs are not so concise in comparison to C++ template, but are very straightforward and flexible. How can this be done?

Achieving generic programming in C

The core part of khash.h is:

#define KH_INIT(name, key_t, val_t, is_map, _hashf, _hasheq) \
  typedef struct { \
    int n_buckets, size, n_occupied, upper_bound; \
    unsigned *flags; \
    key_t *keys; \
    val_t *vals; \
  } kh_##name##_t; \
  static inline kh_##name##_t *init_##name() { \
    return (kh_##name##_t*)calloc(1, sizeof(kh_##name##_t)); \
  } \
  static inline int get_##name(kh_##name##_t *h, key_t k) \
  ... \
  static inline void destroy_##name(kh_##name##_t *h) { \
    if (h) { \
      free(h->keys); free(h->flags); free(h->vals); free(h); \
    } \
  }

#define _int_hf(key) (unsigned)(key)
#define _int_heq(a, b) (a == b)
#define khash_t(name) kh_##name##_t
#define kh_init(name) init_##name()
#define kh_get(name, h, k) get_##name(h, k)
#define kh_destroy(name, h) destroy_##name(h)
...
#define KHASH_MAP_INIT_INT(name, val_t) \
	KH_INIT(name, unsigned, val_t, is_map, _int_hf, _int_heq)

In macro ‘KH_INIT’, name is a unique symbol that distinguishes hash tables of different types, key_t the type of key, val_t the type of value, is_map is 0 or 1 indicating whether to allocate memory for vals, _hashf is a hash function/macro and _hasheq the comparison function/macro. Macro ‘KHASH_MAP_INIT_INT’ is a convenient interface to hash with integer keys.

When ‘KHASH_MAP_INIT_INT(32, char)’ is used in a C source code file the following codes will be inserted:

  typedef struct {
    int n_buckets, size, n_occupied, upper_bound;
    unsigned *flags;
    unsigned *keys;
    char *vals;
  } kh_int_t;
  static inline kh_int_t *init_int() {
    return (kh_int_t*)calloc(1, sizeof(kh_int_t));
  }
  static inline int get_int(kh_int_t *h, unsigned k)
  ...
  static inline void destroy_int(kh_int_t *h) {
    if (h) {
      free(h->keys); free(h->flags); free(h->vals); free(h);
    }
  }

And when we call: ‘kh_get(int, h, 5)’, we are calling ‘get_int(h, 5)’ which is defined by calling KH_INIT(int) macro. In this way, we can effectively achieve generic programming with simple interfaces. As we use inline and macros throughout, the efficiency is not affected at all. In my hash table benchmark, it is as fast and light-weighted as the C++ implementation.

Other technical concerns

Solving collisions. I have discussed this in my previous post. I more like to achieve smaller memory and therefore I choose open addressing.
Grouping key-value pairs or not. In the current implementation, keys and values are kept in separated arrays. This strategy will cause additional cache misses when keys and values are retrieved twice. Grouping key-value in a struct is more cache efficient. However, the good side of separating keys and values is this avoids waste of memory when key type and value type cannot be aligned well (e.g. key is an integer while value is a character). I would rather trade speed a bit for smaller memory. In addition, it is not hard to use a struct has a key in the current framework.
Space efficient rehashing. Traditional rehashing requires to allocate one addition hash and move elements in the old hash to the new one. For most hash implementations, this means we need 50% extra working space to enlarge a hash. This is not necessary. In khash.h, only a new flags array is allocated on rehashing. Array keys and values are enlarged with realloc which does not claim more memory than the new hash. Keys and values are move from old positions to new positions in the same memory space. This strategy also helps to clear all buckets marked as deleted without changing the size of a hash.

Posted in development | Tagged C, hash, myprog, programming | 29 Comments

29 Responses

on January 9, 2009 at 1:21 pm | Reply Fabricio

Hi, i didn’t find it clearly from the source code, so i’m asking you. Do this hash function support multi-threading. I’m particularly interested in the iterations between get, put and del. From what I have read in the source I believe it doesn’t support it, but maybe i miss something.

Thanks
Fabricio González
on January 9, 2009 at 1:27 pm | Reply attractivechaos

Sorry, it does not support multi-threading. Maybe you could have a look at SunriseDD library: http://www.sunrisetel.net/software/devtools/sunrise-data-dictionary.shtml. I have not tried, though.
on January 9, 2009 at 1:29 pm | Reply Fabricio

Thanks for the fast response. Will look into Sunrise.
on May 3, 2009 at 3:16 am | Reply Corey

In your example…
Wouldn’t KHASH_MAP_INIT_INT(32, char) create a struct called kh_32_t?
on May 3, 2009 at 11:07 am | Reply attractivechaos

Yes, exactly.
on May 3, 2009 at 10:52 pm | Reply Corey

In your example it says it would create a struct called kh_int_t. That was confusing me a bit.
- on September 4, 2009 at 3:02 pm | Reply Alex
  
  Corey is right. That seems to be a typo in the post.
on May 16, 2009 at 5:06 pm | Reply Tagz | "Implementing Generic Hash Library in C « Attractive Chaos" | Comments

[…] [upmod] [downmod] Implementing Generic Hash Library in C « Attractive Chaos (attractivechaos.wordpress.com) 1 points posted 7 months, 3 weeks ago by jeethu tags generic c […]
on July 17, 2009 at 12:30 pm | Reply MikeJ

What would be the best way to handle failures of memory allocations? Currently the return values are not checked causing a segfault on failure.
on April 15, 2010 at 12:19 pm | Reply Tobias

What would be the best way to detect and handle colissions?

thanks,
Tobias

Here is my example code:

#include
#include
#include “khash.h”

typedef void (*HANDLER)(char *);
KHASH_MAP_INIT_STR(32, HANDLER)

void P1Handler (char *val)
{
printf(“P1Handler got value=%s\n”, val);
}

void P2Handler (char *val)
{
printf(“P2Handler got value=%s\n”, val);
}

void P3Handler (char *val)
{
printf(“P2Handler got value=%s\n”, val);
}

void Register(khash_t(32) *h1, char *paramName, HANDLER h)
{
int ret;
khiter_t k;

k = kh_put(32, h1, paramName, &ret);
kh_value(h1, k) = h;
}

void Process(khash_t(32) *h, char *paramName, char *val)
{
khiter_t k;
HANDLER handler;

k = kh_get(32, h, paramName);
handler = kh_value(h, k);
handler(val);
}

int main() {
int ret, is_missing;
int hash;
khash_t(32) *h = kh_init(32);

/* add parameterhandlers with namehash as key */
Register(h, “11”, P1Handler);
Register(h, “22”, P2Handler);
Register(h, “13”, P3Handler);

// process some parameters with value
Process(h, “11”, “hello”);
Process(h, “22”, “world”);
Process(h, “13”, “duplicate”);

kh_destroy(32, h);
return 0;
}
- on April 15, 2010 at 12:21 pm | Reply Tobias
  
  P3Handler should print P3Handler and not P2Handler. The program does not create a colission, howewer it would certainly be possible.
  
  Sorry.
on April 20, 2012 at 8:32 pm | Reply Megan

Hello,

I am adding kash to a project and am a little confused by this example code:

k = kh_put(32, h, 5, &ret);
if (!ret) kh_del(32, h, k);
kh_value(h, k) = 10;

It looks like the code will:
put the key=5 into the table,
then check the return value from kh_put
then delete the key at index k
then store a value at index k

I must be wrong. How could we store a value if the key has just been deleted?
What does kh_del do here if it is not deleting the element associated with key=5 from the table?

Thank you in advance
- on April 20, 2012 at 8:38 pm | Reply attractivechaos
  
  Yes, the example seems to be wrong (although it is ok to do that). Please check out the latest version from github:
  
  https://github.com/attractivechaos/klib/blob/master/khash.h
  
  It has the wrong example fixed.
  - on April 21, 2012 at 8:16 pm Megan
    
    Thank you for the quick reply. So, from what you say, I do not need to handle the zero return value from kh_put with a delete. I am a little confused about the return values meaning and how I need to handle them… or is there no need to worry about the secondary return value from kh_put?
on August 1, 2013 at 9:29 pm | Reply Michael

Hello. I think that the khash structure and klib are very cool ideas. It seems to me (though I’ve only played with it for about 45 minutes) that it isn’t possible to create a map-of-maps using khash.h. Ie, a khash table with an int as a key and a khashtable as a value.

Am I correct, or are there some tricks that I should use?
- on August 1, 2013 at 9:39 pm | Reply attractivechaos
  
  You can do something like this (not tested though):
  
  KHASH_SET_INIT_INT(int)
  
  typedef khash_t(int) *inthash_p;
  KHASH_INIT(int2, khint32_t, inthash_p, 1, kh_int_hash_func, kh_int_hash_equal)
  
  Note that as C does not have constructor, you need to allocate inthash_p when you insert the first key. Something like this:
  
  int absent;
  khint_t k;
  khash_t(int2) *h;
  h = kh_init(int2);
  k = kh_put(int2, h, 10, &absent);
  if (absent) kh_val(h, k) = kh_init(int);
  …
  
  Not elegant, but doable.
  - on August 3, 2013 at 8:23 pm Michael
    
    Hello, and thanks for such a fast response. The idea you’ve described seems to work. Now I have maps-of-maps and maps-of-sets residing a struct, The code all compiles, but I’ve yet to finish the algorithms which are going to rely on the structure. I am optimistic that it will work well. Thanks for the help.
on July 5, 2015 at 2:08 pm | Reply conmaths

Hi HL3,
Given that the hash is fixed ( no more insert/delete), can it be retrieved (using get) by multiple threads concurrently? I think yes?
- on July 18, 2015 at 9:17 pm | Reply attractivechaos
  
  Yes.
on March 25, 2017 at 2:16 pm | Reply didilbert

Hi! Great library.

One question though. what is the reasoning behind the ‘name’ parameter in these macros? Couldn’t you just use the key-value pair types (or simply the type for vec(t), for example)? Is it because pointer types couldn’t be sanitized properly, or something else?

Because this way (it seems, perhaps I am mistaken) I always have to keep track of the triplet (name, type, variable)
- on March 25, 2017 at 2:21 pm | Reply attractivechaos
  
  With “name”, you can use two different hash functions for the same type in one .c file. Also, with token concatenation, you can’t put “char*” in a struct/function name.
  - on March 25, 2017 at 4:04 pm didilbert
    
    But I thought that these macros *define* structs? Isn’t `khash_t(32) *h` a pointer to a specific *instance* of the struct defined above? From what I see, you are still passing `h` to all macros, so I am presuming that `khash_t(32) *h1, *h2` should work too?
  - on March 25, 2017 at 4:06 pm didilbert
    
    Ok, now that I re-read it I got what you’re saying, different hash functions. I wonder if it makes sense for stuff like `klist` though? Never mind, thanks!
on January 9, 2018 at 11:39 am | Reply Larbiz

Hi,
I am planning to use the iterator to index an other table. is the iterator retuned by kh_put() always fits the size of the created hash table or should be checked?

In other words what will be the max value of an iterator returned by kh_put() ?

Thank you in advance.
Larbizzz
- on January 9, 2018 at 2:04 pm | Reply attractivechaos
  
  khash will resize the table when it is not large enough. However, when you don’t have sufficient memory to allocate the new table, the return value may equal kh_end(table) with the last variable set to -1.
on January 10, 2018 at 7:47 am | Reply Larbiz

Thank you for the quick reply, I am new to hashing and trying to understand khash! so one more question: I have a table used by different threads, Can I use khash to find for me just the correct index without storing all table content in khash table? if yes how can I keep the iterators returned by khash fits for the table that is seen by all threads?

Thanks 1s more!
Larbizzz
on January 19, 2018 at 7:20 am | Reply Larbiz

Hi,

Is it possible to retrieve the exact index of a key within the khash table ?

Br
Larbizzz
on June 17, 2020 at 5:29 pm | Reply David

Consider this alternative hashmap implementation that provides a more natural type-safe generic API:

https://github.com/DavidLeeds/hashmap
on September 17, 2023 at 3:31 am | Reply Sumit Kumar

can is use this to for negative key value(int64_t) ??

Comments RSS

	humbleoptimizer on Using void* in Generic C Progr…
	Accessed on January… on My programming language benchm…
	Sumit Kumar on Implementing Generic Hash Libr…
	Implementing a HashM… on khash.h
	Implementing a HashM… on khash.h

Attractive Chaos

Just another WordPress.com weblog