I have done a more thorough benchmark of various hash table and search tree implementations. I believe it is the best page on my blog so far. Unfortunately, according to the WordPress statistics, few people have looked at it. Perhaps putting too much information on one page without enough explanation has scared people away, or perhaps the page sits on another domain and people do not like to follow the link. Either way, it may be worth discussing further on my WordPress blog, so I will write a series of posts to extend that page. Please look here for the whole page and here for the source code.
Let’s start this series with generic programming in C.
We usually use void* to implement generic containers in C. Done this way, we pay an overhead for retrieving the data pointed to by a void* pointer. We often, though not always, need a function call for operations on void* data, such as copying, comparing or hashing, which adds additional overhead in speed. Furthermore, if an object is small, we would rather put the whole object in the container instead of wasting memory on an additional pointer: using void* may also bring memory overhead.
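For example, a typical void*-based container stores a pointer to the element and calls a user-supplied function through a function pointer for every comparison (a sketch; the names below are illustrative, not from any particular library):

#include <string.h>

/* the node only stores a pointer to the user's data: one extra pointer per element */
struct node {
    void *data;
    struct node *left, *right;
};

/* every comparison is an indirect call on void* data */
typedef int (*cmp_fn)(const void *a, const void *b);

int cmp_str(const void *a, const void *b)
{
    return strcmp((const char*)a, (const char*)b);
}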
To see how large this void* effect is, I ported libavl's AVL tree code from C to C++ (source code is available from the download link shown above). I honestly translated void* to C++ templates without applying any additional optimizations (well, Ben Pfaff does not leave much room for optimization, either). As you can see from the table, my C++ template version is both faster and more lightweight than the original C version. The conclusion is: using void* in this way may be inefficient.
There is another, different but similar, way to use void*. When implementing sorting, B-trees or open-addressing hash tables, we use arrays. It is possible to put the objects themselves in a void* array instead of putting their pointers into the array. We can perform most operations as long as we know the size of each object. For example, we can implement swap in this way:
#include <stdlib.h> /* malloc(), free() */
#include <string.h> /* memcpy() */

void swap(void *p, int obj_size, int i, int j)
{
    char *q = (char*)p;           /* arithmetic on void* is non-standard; use char* */
    void *tmp = malloc(obj_size); /* for efficiency, we can preallocate tmp */
    memcpy(tmp, q + obj_size * i, obj_size);
    memcpy(q + obj_size * i, q + obj_size * j, obj_size);
    memcpy(q + obj_size * j, tmp, obj_size);
    free(tmp);
}
and we can even implement a whole sorting algorithm with similar ideas. I believe libc's qsort() is implemented in this way. However, as we can see, we need to call memcpy() on each assignment. On large blocks of memory, memcpy() is probably efficient with the help of vectorization, but on small ones it is slower than direct assignment, because we have to pass obj_size around and memcpy() does not know obj_size in advance. Also, as the compiler does not know obj_size at compile time, it cannot optimize obj_size*50 into something like 50<<2 even when obj_size is in fact 4. All in all, using void* in this way has no memory overhead, but it does have a speed overhead. This is why libc's qsort() is always slower than STL's sort or my sorting implementations (see my Comparison of Internal Sorting Algorithms).
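To illustrate the function-call overhead, qsort() reaches every element through void* and invokes the user's comparator indirectly, so the comparison cannot be inlined (cmp_int below is just an illustrative helper):

#include <stdlib.h>

/* called indirectly for every comparison; the elements arrive as void* */
int cmp_int(const void *a, const void *b)
{
    int x = *(const int*)a, y = *(const int*)b;
    return (x > y) - (x < y);
}

/* usage: qsort(array, n, sizeof(int), cmp_int);
   a template or macro-generated sort can inline the comparison instead */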
Here is the lesson. If we care about efficiency, using void* to achieve generic programming in C is a bad idea. C++ templates do much better here and cause almost no overhead on instantiation. If we want to stick to C, I would suggest switching to macros, as used in my khash.h/kbtree.h/ksort.h (I learned this from FreeBSD's tree.h). C macros mimic C++ templates and cause no more overhead than C++ templates. However, C macros are hard to read and write, which is why I ported libavl's AVL implementation to a C++ template instead of to C macros: that would have been a pain.
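Below is a minimal sketch of the macro approach (a simplified illustration, not the actual interface of khash.h or kbtree.h): instantiating the macro generates type-specific code, so the element size and comparisons are known at compile time.

/* instantiate a type-specific swap: the compiler sees the concrete type,
   so the assignments need neither memcpy() nor a runtime obj_size */
#define DEFINE_SWAP(type) \
    static inline void swap_##type(type *a, type *b) { \
        type tmp = *a; *a = *b; *b = tmp; \
    }

DEFINE_SWAP(int)    /* generates swap_int() */
DEFINE_SWAP(double) /* generates swap_double() */

/* usage: swap_int(&x, &y); */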
Hi,
I am really interested in seeing how big the performance difference is, but that table is really hard to parse. It feels like I am missing some previous post or something. If you broke that information out into several graphs (it looks like there is enough info there for several posts!) I am sure it would be quite popular.
-K
How about some charts or at least a table? You give no metrics or even any data. How much faster? What data did you use to test? How much more lightweight?
Additionally, OF COURSE templates are going to be faster in this instance. The C compiler cannot make assumptions at compile time about the content of void pointers. The types are being bound at compile time in C++ templates so what you’re really doing is reducing the runtime flexibility of the code.
@Kevin
Sorry that I did not state it clearly. For the libavl example, we are looking at libavl_avl and libavl_avl_cpp in the table. The latter is implemented with a C++ template. We can see that libavl_avl_cpp is 56% (=(1/6.73-1/10.51)/(1/10.51)) faster than libavl_avl on MacIntel with integers as keys. On Linux-Intel the difference is smaller (31% faster on integer keys, or 15% on char* keys), but still significant. As for memory, libavl_avl_cpp does not save resident memory on Mac, probably because of memory alignment. On 64-bit Linux, it saves 39% (=(48.41-29.41)/48.41) of resident memory.
The comparison of libc’s qsort and STL’s sort is shown in another post. My implementation of introsort (essentially quicksort) is 110% (=(1/7.887-1/16.579)/(1/16.579)) faster than libc’s qsort on MacIntel, or 298% faster on Linux-AMD. The speedup comes from two sources: saving a function call on comparisons and avoiding memcpy(). On Mac, it seems that most of the speedup comes from the first; on Linux, most comes from the second.
@Facepalm
I have shown the link to the table, as well as the design of the experiment. The table is too large to fit here, so I posted it on another website. Sorry that the table is not very clear. Please see my reply to Kevin for more information.
Templates provide more flexibility. We can always put void* in a template when necessary, but we have to live with the inefficiency of some C libraries even when we do not want it. In many practical applications, we know the type of the data in a container. We do not need the runtime flexibility in common cases, I think.
I have also tried putting void* in SGI’s std::set. On integer keys (MacIntel), std::set finishes in 12.738 sec (vs. 5.99 in my benchmark), using 29.80 MB of memory (vs. 19.87). This is further evidence that using void* can be very inefficient.
The most likely reason why void* is slow (and why C++ templates improve otherwise identical code) is that a void* may point anywhere. This means it can alias anything, which defeats many compiler optimizations. As a result, register values must often be re-read after writing through a void*.
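A minimal sketch of that effect (illustrative only): a char* derived from a void* may alias any object, so the compiler cannot keep *n in a register across the stores.

#include <stddef.h>

/* because p may alias anything, including *n, the compiler must re-read *n
   after every store instead of caching it in a register */
void clear_bytes(void *buf, const size_t *n)
{
    char *p = (char *)buf;
    for (size_t i = 0; i < *n; i++)
        p[i] = 0;
}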
Mike Acton wrote a good article on strict aliasing and void* performance.
Unfortunately the domain name appears to have been hijacked, but you can still read the article in the Internet Archive:
http://web.archive.org/web/20071223232457/www.cellperformance.com/mike_acton/2006/06/understanding_strict_aliasing.html
One thing I noticed with your khash.h/kbtree.h API is that it requires the objects to be allocated with calloc(). Another key to high performance in C is to minimize the number of memory allocations. Right now your khash interface requires a specific calloc() call. The list types in Linux were designed to be constructed inside a parent structure, so that you do not need a separate allocation for the list itself. They also provide a macro called INIT_LIST_HEAD() that lets the compiler build a binary with the list pre-initialized.
http://lxr.linux.no/linux+v2.6.27.10/include/linux/list.h#L19
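A rough sketch of that pattern, simplified from the kernel's list.h (not the full API): the link fields live inside the parent structure, so adding an item needs no separate allocation for the list node, and a static head can be initialized at compile time.

struct list_head { struct list_head *next, *prev; };

#define LIST_HEAD_INIT(name) { &(name), &(name) }

struct item {
    int value;
    struct list_head node; /* link embedded in the parent structure */
};

/* the head goes into the binary already initialized; no runtime call needed */
static struct list_head queue = LIST_HEAD_INIT(queue);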
A bit late, but what compiler and settings are you using?
Any time you cast from void* to a type there can be some overhead, so there will always be a slight cost in those cases, and as pointed out above it can throw off the optimizer.
Swapping with memcpy() is going to be faster than assignment for anything but primitives like longs, even with a bad memcpy() implementation. For more complicated objects it can be hundreds of times faster. With a better memcpy() implementation you can virtually always beat the assignment operator.
I always wonder why C doesn’t support something like templates. At least something as powerful as C#’s generics would be good enough for most cases.
Well, I guess it is just too late.
Your klib C library looks good, but the function duplication for each type seems excessive. You could define a union for a set of sizes that gets inlined into your types, say up to 8 bytes.
Then terminate each struct with “char[];” and add a byte somewhere to track the size of the payload. Accessing the value is then simply a matter of casting that offset from the end of your structure to a value of the right size. The offset will be in the same cache line as the data being accessed, so it should not really hurt performance.
Just a rough idea that will need some refinement, but it avoids all the code duplication and the difficulty of maintaining functions defined as macros.
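If I understand the suggestion correctly, it would look roughly like this (a hypothetical sketch; the names are illustrative):

#include <string.h>

/* one generic node: a size byte plus the payload stored inline at the end */
struct node {
    unsigned char payload_size; /* size of the stored object in bytes */
    char payload[];             /* flexible array member holding the object */
};

/* reading the value back means copying/casting from that offset */
int get_int(const struct node *n)
{
    int v;
    memcpy(&v, n->payload, sizeof v); /* memcpy avoids unaligned access */
    return v;
}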
Which file has excessive function duplication? Instantiating a macro only takes two lines; it is not hand-written function duplication. Your solution would make each element at least one byte larger and would be harder to use.
I was going through your blog and noticed that some of the links are broken. Is the source code still available to be viewed somewhere along with the “whole page” you mentioned in your blog?
That aside, awesome blog! I learned a bit about this in uni but they never explained some of the nitty gritty parts you mentioned here.