
Again, this post is a follow-up to this page. The source code is available here.

AVL Tree vs. Red-Black Tree

If you google “avl vs. red-black”, the first page returned will largely answer your question. In summary, the height of an AVL tree is at most ~1.44log(N), lower than the maximum height of a red-black tree, ~2log(N). However, upon insertion an AVL tree may need O(log(N)) operations to rebalance, while a red-black tree requires only O(1) (amortized). This means an AVL tree is probably faster on searching but slower on insertion. In addition, a red-black tree is easier to implement as a persistent data structure.

Practical analysis is not as clear-cut as theoretical analysis. I think AVL and red-black trees may have quite similar overall performance. Firstly, although the maximum possible height of an AVL tree is better, both AVL and red-black trees are probably well balanced in practice and rarely approach their maximum heights. Secondly, although an AVL tree has an O(log(N)) loop upon insertion, it is a very tight loop with a small constant, while a red-black tree may have a short loop with a big constant (amortized O(1), though). I am not sure whether a red-black tree really wins on insertion.

In Daniel’s benchmark, the AVL tree and the red-black tree show quite similar speed (keeping in mind that Daniel implemented the AVL tree himself). In my benchmark, libavl’s red-black tree and AVL tree also deliver similar speed.

On memory, a standard AVL tree requires two pointers and one balance factor per node, while a red-black tree can be implemented with two or three pointers plus a color bit (either the balance factor or the color bit typically costs the size of a pointer due to memory alignment). A red-black tree with two pointers is usually slower than one with three pointers (libavl_rb_cpp vs. NP_rbtree), but the difference is marginal. In my benchmark, a red-black tree with two pointers performs much like an AVL tree (libavl_rb_cpp vs. libavl_avl_cpp).

All in all, my opinion is that AVL and red-black trees are very similar in many aspects. Which one to use does not matter much unless you are implementing a persistent data structure.

Splay Tree vs. AVL/RB Tree

A single search on a splay tree may take O(N) time if elements were inserted in order. This “feature” alone pushes me away. Although Wikipedia says it is possible to avoid the worst case, I am not aware of a practical implementation. On random input, a splay tree is also slower than AVL and red-black trees.

One advantage of a splay tree is that it requires only two pointers per node with no additional information. This helps when memory is critical, but I would rather use a modified red-black tree to achieve the same memory footprint. Another advantage is that a splay tree is self-adjusting: frequently accessed elements tend to move toward the root. I have not tested this in my benchmark.

Treap vs. AVL/RB Tree

I do not know much about treaps. It seems to me a treap does not guarantee O(log(N)) search, only expected O(log(N)). Fortunately, in practice the O(N) worst case should almost never happen due to a treap’s randomized behaviour. In my benchmark, the treap is slower than the splay tree on random input. It is probably not the best choice.

Implementations

As for red-black trees, SGI STL’s set/map is really a good one. You can find other C++ implementations (TN_rbtree, libavl_rb_cpp) in my benchmark. They are also efficient, but incomplete. Libavl_rb_cpp is adapted from libavl_rb, which uses only two pointers. NP_rbtree (FreeBSD’s tree.h) is the best choice for a C programmer.

As for AVL trees, I am not aware of good AVL implementations in either C or C++. Stlavlmap is supposed to be efficient, but I cannot compile it. WK_avl may be a good library, judging from its source code, but it is not easy to use. GM_avl uses recursion and so cannot be efficient. Libavl_avl_cpp is my ported version of libavl_avl. It is efficient but incomplete; you may finish it if you really want to try an AVL tree. BTW, libavl is in fact a very high-quality library, but its use of void* makes it inferior to the others. Someone should improve its API using C macros or C++ templates.

Concluding Remarks

For general purposes, a red-black tree might still be the best choice. It is among the fastest binary search trees and is more easily adapted to a persistent data structure. An AVL tree is equally good when persistence is not needed (well, none of my applications require persistence). Lastly, do not forget my previous post: an in-memory B-tree can be both faster and more lightweight than red-black/AVL trees.

Update

Sambal told me that on insertion a red-black tree only guarantees amortized O(1) rebalancing; the worst case can be O(log(N)) with a large constant. This makes the insertion advantage of the red-black tree less obvious.

Updated the claim about the memory usage of AVL and red-black trees.


I have done a more thorough benchmark of various hash table and search tree implementations. I believe it is the best page on my blog so far. Unfortunately, few people have looked at it according to the WordPress statistics. Possibly putting too much information on one page without enough explanation has scared people away; or perhaps the page is hosted on another domain and people do not like to follow the link. Anyway, it might be good to discuss it further on my WordPress blog, so I will write a series of posts to extend that page. Please look here for the whole page and here for the source code.

Let’s start this series with generic programming in C.

We usually use void* to implement generic containers in C. With this approach, we pay an overhead for retrieving the data pointed to by a void* pointer. We often, but not always, need a function call for operations on void* data, such as copying, comparing or hashing, which adds further overhead on speed. Furthermore, if an object is small, we would rather store the whole object in the container instead of wasting memory on an additional pointer: using void* may also bring memory overhead.

To see how large this void* effect is, I ported libavl’s AVL tree code from C to C++ (the source code is available from the download link shown above). I simply translated void* to a C++ template without applying any additional optimizations (well, Ben Pfaff does not leave much room for optimization, either). As you can see from the table, my C++ template version is both faster and more lightweight than the original C version. The conclusion: using void* in this way may be inefficient.

There is another, similar way to use void*. In implementing sorting, B-trees or open addressing hash tables, we use arrays. It is possible to put the objects themselves in a void* array instead of putting pointers to them in the array. We can perform most operations as long as we know the size of each object. For example, we can implement swap in this way:

void swap(void *p, int obj_size, int i, int j) {
  char *a = (char*)p; /* arithmetic on void* is a GCC extension; use char* */
  void *tmp = malloc(obj_size); /* for efficiency, we can preallocate tmp */
  memcpy(tmp, a + obj_size*i, obj_size);
  memcpy(a + obj_size*i, a + obj_size*j, obj_size);
  memcpy(a + obj_size*j, tmp, obj_size);
  free(tmp);
}

and we can even implement a whole sorting algorithm with similar ideas; I believe libc’s qsort is implemented in this way. However, as we can see, we need to call memcpy() on each assignment. On large memory blocks, memcpy() is probably efficient with the help of vectorization, but on small blocks it is slower than direct assignment, as we have to retrieve obj_size and memcpy() cannot know obj_size in advance. Also, because obj_size is not known at compile time, the compiler cannot optimize obj_size*50 into something like 50<<2 even when obj_size=4. In all, using void* in this way has no memory overhead, but it does have a speed overhead. This is why libc’s qsort is always slower than STL’s or my own sorting implementations (see my Comparison of Internal Sorting Algorithms).

Here is the lesson: if we care about efficiency, using void* to achieve generic programming in C is a bad idea. C++ templates do much better here and cause almost no overhead on instantiation. If we want to stick to C, I would suggest switching to macros as used in my khash.h/kbtree.h/ksort.h (I learned this from FreeBSD’s tree.h). C macros mimic C++ templates and cause no more overhead than C++ templates. However, C macros are hard to read and write, which is why I ported libavl’s AVL implementation to a C++ template instead of to C macros. It would have been a pain.


Currently I am using freewebs.com to host my programs and free-style HTML pages. It is down today, and I am really annoyed by its single-file uploader and lack of web statistics. I have finally decided to switch to awardspace.com, which seems to offer much more than freewebs.com. My next post will use awardspace.com.


It is always good to learn from good programmers, but where are they? Do they write blogs, join forums or discuss in mailing lists? It seems to me that most of them do not. The few who do only write posts or send emails very occasionally. I guess most good programmers are busy programming in computer languages instead of writing in natural languages, which in fact is why they became good programmers.

In searching for good programmers, I have checked quite a few blogs. Most of the bloggers (including me, unfortunately) act like good programmers. They present their thoughts on concepts or quarrel over what is better, but they do not present source code, or give only elementary pieces of code, from which no one can tell their ability. What is the point of hearing the thoughts of a bad programmer? If I cannot tell, I simply ignore their words.

I have also joined comp.lang.c for a few days. It also disappoints me, at least so far. Again, most people argue over details of concepts or show off knowledge that I do not think necessary in many cases. A lot of people there like to ignore the central issue and extend discussions of minor points to nowhere. Few present high-quality source code, and again I cannot tell who the really good programmers are and who are just good lecturers on C. Probably the good programmers on this list do not have time to show their abilities. I need to wait a bit longer.

Then where do we find good programmers? I would suggest searching for good code first and then for the authors behind that code. Here is an example of how I found a good programmer. I came across jemalloc, written by Jason Evans, and then noticed his implementation of a red-black tree. Further searching brought me to his blog. There are only a few posts there, but you can quickly tell he is a good programmer. I would say his thoughts on programming weigh much more than most of the emails I have seen on the newsgroup or the blogs I have read over the past several days. Always remember: before reading one’s thoughts, read his/her code first.

