Feeds:
Posts
Comments

Posts Tagged ‘benchmark’

Introduction

About two years ago I evaluated the performance of ~20 compilers and interpreters on sudoku solving, matrix multiplication, pattern matching and dictionary operations. Two years later, I decide update a small part of the benchmark on Sudoku solving. I choose this problem because it is practically and algorithmically interesting, and simple enough to be easily ported to multiple languages. Meanwhile, I am also adding two new programming languages: Mozilla’s Rust and Google’s Dart. They are probably the most promising languages announced in the past two years.

Results

In this small benchmark, I am implementing Sudoku solvers in multiple programming languages. The algorithm, adapted from Guenter Stertenbrink’s solver, was first implemented in C and then ported to other languages. The C source code briefly describes the method. For more information about Sudoku solving in general, please see my other post.

Before I show the results, there are a couple of caveats to note:

  • Solving Sudoku is NP-hard. The choice of the solving algorithm will dramatically affect the speed. For example, my Rust implementation is ~2500 times faster than the one in the Rust official repository. For a language benchmark, we must implement exactly the same algorithm.
  • I am mostly familiar with C but am pretty much a newbie in other programming languages. I am sure some implementations are not optimal. If you can improve the code, please send me a pull request. I am happy to replace with a better version.

The following table shows the CPU time for solving 20 hard Sudokus repeated 50 500 times (thus 1000 10000 Sudokus in total). The programs, which are freely available, are compiled and run on my Mac laptop with a 2.66GHz Core i7 CPU.

Compiler/VM

Version

Language

Option

CPU time (sec)
clang

425.0.27 (3.2svn)

C

-O2

8.92
llvm-gcc

4.2.1

C

-O2

9.23
dmd

2.062

D2

-O -release
-noboundscheck

11.54
11.47
rust

0.6

Rust

–opt-level 3

11.51
java

1.6.0_37

Java

-d64

11.57
go

1.1beta 20130406

Go

(default)
-gcflags -B

14.96
13.78
dart

0.4.4.4-r20810

Dart

21.42
v8

3.16.3

Javascript

28.19
luajit

2.0.1

Lua

30.66
pypy

2.0-beta-130405

Python

44.29

In this small benchmark, C still takes the crown of speed, Other statically typed languages are about twice as slow but Rust and D are very close to C. It is pretty amazing that Rust as a new language is that performant given the developers have not put too much efforts on speed so far.

Among dynamically typed languages, Dart, V8 and LuaJIT are similar in speed, about 3 times as slow as C. 3 times is arguably not much to many applications. I really hope some day I can use a handy dynamically typed language for most programming. Pypy is slower here, but it is more than twice as fast as the version two years ago.

Related resources

Update

  • I forgot to use `-release’ with dmd. The new result looks much better. Sorry for my mistake.
  • Mac ships gcc-4.2.1 only due to licensing issues. I have just tried both gcc 4.7.2 and gcc 4.8 from MacPorts. The executables compiled by them take 0.99 second to run, slower than gcc-4.2.1.
  • Updated to the latest Go compiled from the repository.
  • Updated the Python implementation (thanks to Rob Smallshire).
  • Updated the Dart implementation (thanks to jwendel).
  • Updated the Rust implementation (thanks to dotdash).
  • Made input 10 times larger to reduce the fraction of time spent on VM startup. Dart/V8/LuaJIT have short VM startup time, but Java is known to have a long startup.
  • Updated the Go implementation (thanks to Sébastien Paolacci).
  • Updated the Python implementation.

Read Full Post »

This is a follow-up of my previous post. Here I change the table to several charts. Hope it seems more friendly to readers. You can find the links to these libraries in that table. Their source codes, including my testing code, are available here. You may also want to see my previous posts in the last few days for my interpretation to the results.

On C string (char*) keys, I fail to use JE_rb_old and JE_rb_new to get the correct result on Mac and so they are not showed in the charts. I would really appreciate if someone may give me the correct implementation using these libraries. In addition, tr1_unordered_map uses a lot of memory according to my program. The memory for string keys are faked.

For conveniece, here are some brief descriptions of these libraries (with no order):

  • google_dense and google_sparse: google’s sparsehash library. Google_dense is fast but memory hungery while google_sparse is the opposite.
  • sgi_hash_map and sgi_map: SGI’s STL that comes with g++-4. The backend of sgi_map is a three-pointer red-black tree.
  • tr1::unordered_map: GCC’s TR1 library that comes with g++-4. It implements a hash table.
  • rdestl::hash_map: from RDESTL, another implementation of STL.
  • uthash: a hash library in C
  • JG_btree: John-Mark Gurney’s btree library.
  • JE_rb_new, JE_rb_old, JE_trp_hash and JE_trp_prng: Jason Evans’ binary search tree libraries. JE_rb_new implements a left-leaning red-black tree; JE_rb_old a three-pointer red-black tree; both JE_trp_hash and JE_trp_prng implement treaps but with different strategies on randomness.
  • libavl_rb, libavl_prb, libavl_avl and libavl_bst: from GNU libavl. They implment a two-pointer red-black tree, a three-pointer red-black tree, an AVL tree and a unbalanced binary search tree, respectively.
  • NP_rbtree and NP_splaytree: Niels Provos’ tree library for FreeBSD. A three-pointer red-black tree and a splay tree.
  • TN_rbtree: Thomas Niemann’s red-black tree. I ported it to C++.
  • sglib_rbtree: from SGLIB. It implements a two-pointer recursive red-black tree (all the other binary search trees are implemented without recursion).
  • libavl_avl_cpp, libavl_rb_cpp and libavl_rb_cpp2: incomplete C++ version of libavl (no iterator), ported by me. Libavl_rb_cpp2 further uses the same technique in JE_rb_new to save the color bit. Source codes available in the package.
  • khash and kbtree: my hash table and B-tree implementation. kbtree is based on JG_rbtree.

Read Full Post »

I was wondering whether retrieving an element in a struct will incur additional overhead. And so I did the following experiment. Here the same array is sorted in two ways: with or without data retrieving from a struct. Both ways yield identical results. The question is whether the compiler knows the two ways are the same and can achieve the same efficiency.

#include
#include
#include
#include “ksort.h”

typedef struct {
int a;
} myint_t;

#define myint_lt(_a, _b) ((_a).a < (_b).a) KSORT_INIT_GENERIC(int) KSORT_INIT(my, myint_t, myint_lt) int main() { int i, N = 10000000; myint_t *a; clock_t t; a = (myint_t*)malloc(sizeof(myint_t) * N); srand48(11); for (i = 0; i != N; ++i) a[i].a = lrand48(); t = clock(); ks_introsort(int, N, (int*)a); printf("%.3lf\n", (double)(clock() - t) / CLOCKS_PER_SEC); srand48(11); for (i = 0; i != N; ++i) a[i].a = lrand48(); t = clock(); ks_introsort(my, N, a); printf("%.3lf\n", (double)(clock() - t) / CLOCKS_PER_SEC); free(a); return 0; } [/sourcecode] Here is the speed with different compilers on different CPUs (first value for without data retrieving and second with):

  • Mac-Intel, gcc-4.0, -O2: 1.422 sec vs. 1.802 sec
  • Mac-Intel, gcc-4.2, -O2: 1.438 vs. 1.567
  • Mac-Intel, gcc-4.0, -O2 -fomit-frame-pointer: 1.425 vs. 1.675
  • Mac-Intel, gcc-4.2, -O2 -fomit-frame-pointer: 1.438 vs. 1.448
  • Linux-Intel, gcc-4.1, -O2: 1.600 vs. 1.520
  • Linux-Intel, gcc-4.1, -O2 -fomit-frame-pointer: 1.620 vs. 1.530
  • Linux-Intel, icc, -O2 -fomit-frame-pointer: 1.600 vs. 1.580

The conclusion is retrieving data from a struct may have marginal overhead in comprison to direct data access. However, a good compiler can avoid this and produce nearly optimal machine code. Using “-fomit-frame-pointer” may help for some machines, but not for others. In addition, it is a bit surprising to me that gcc-linux generates faster code for data retrieval in a struct. Swapping the two ways does not change the conclusion.

Read Full Post »

Over the weekend, I have done a more comprehensive benchmark of various libraries on search trees. Two AVL, seven red-black tree, one Splay tree, two treap implementations are involved, together with seven hash table libraries. As I need to present a big table, I have to write it in a free-style HTML page. You can find the complete benchmark here and all the source codes here. I only copy the “concluding remarks” in the benchmark page as follows:

  • Hash table is preferred over search trees if we do not require order.
  • In applications similar to my example, B-tree is better than most of binary search trees in terms of both speed and memory.
  • AVL tree and red-black tree are the best general-purposed BSTs. They are very close in efficiency.
  • For pure C libraries, using macros is usually more efficient than using void* to achieve generic programming.

You can find the result and much more discussions in that page. If you think the source codes or the design of benchmark can be improved, please leave comments here or send me E-mail. In addition, I failed to use several libraries and so you can see some blank in the table. I would also appreciate if someone could show me how to use those libraries correctly.

Read Full Post »

C Array vs. C++ Vector

Here is a piece of source codes that compare C arrays and C++ vectors. It tests six scenarios: a) preallocated C array; b) dynamically growing C array; c) dynamical C vector calling kv_a macro (in my kvec.h); d) dynamical C vector calling kv_push macro (in my kvec.h); e) preallocated C++ vector and f) dynamically growing C++ vector. You can find my kvec.h on my blog.

#include
#include
#include
#include
#include “kvec.h”

int main()
{
int M = 10, N = 20000000, i, j;
clock_t t;
t = clock();
for (i = 0; i < M; ++i) { int *array = (int*)malloc(N * sizeof(int)); for (j = 0; j < N; ++j) array[j] = j; free(array); } printf("C array, preallocated: %.3f sec\n", (float)(clock() - t) / CLOCKS_PER_SEC); t = clock(); for (i = 0; i < M; ++i) { int *array = 0, max = 0; for (j = 0; j < N; ++j) { if (j == max) { max = !max? 1 : max << 1; array = (int*)realloc(array, sizeof(int)*max); } array[j] = j; } free(array); } printf("C array, dynamic: %.3f sec\n", (float)(clock() - t) / CLOCKS_PER_SEC); t = clock(); for (i = 0; i < M; ++i) { kvec_t(int) array; kv_init(array); kv_resize(int, array, N); for (j = 0; j < N; ++j) kv_a(int, array, j) = j; kv_destroy(array); } printf("C vector, dynamic (kv_a): %.3f sec\n", (float)(clock() - t) / CLOCKS_PER_SEC); t = clock(); for (i = 0; i < M; ++i) { kvec_t(int) array; kv_init(array); for (j = 0; j < N; ++j) kv_push(int, array, j); kv_destroy(array); } printf("C vector, dynamic (kv_push): %.3f sec\n", (float)(clock() - t) / CLOCKS_PER_SEC); t = clock(); for (i = 0; i < M; ++i) { std::vector array;
array.reserve(N);
for (j = 0; j < N; ++j) array[j] = j; } printf("C++ vector, preallocated: %.3f sec\n", (float)(clock() - t) / CLOCKS_PER_SEC); t = clock(); for (i = 0; i < M; ++i) { std::vector array;
for (j = 0; j < N; ++j) array.push_back(j); } printf("C++ vector, dynamic: %.3f sec\n", (float)(clock() - t) / CLOCKS_PER_SEC); return 0; } [/sourcecode] Here is the result on two machines (compiled with g++ -O2 -fomit-frame-pointer -finline-functions):

type MacIntel LinuxIntel
C array, preallocated 1.589 1.180
C array, dynamic 2.064 1.340
C vector, dynamic (kv_a) 2.051 1.600
C vector, dynamic (kv_push) 1.932 1.250
C++ vector, preallocated 2.119 1.590
C++ vector, dynamic 5.095 3.770

Such result may vary with different machines/compilers, but not much.

Update: My example passed valgrind check on a Linux (Debian etch, g++-4.1), but Tom_ pointed out that it did not pass VC++’s debugger as vector::operator[] writes outside vector::size(). Anyway, it is good not to use operator[] in my way. You can replace reserve()+[] with resize()+[] or reserve()+push_back(). On my machine, the replacement gives a little bit slower speed, but it is safer/more portable in that way. Thanks for all the comments.

Read Full Post »

As a Perl programmer, I enjoy a lot using hash tables. I keep this habit in C/C++ programming. Then what C/C++ hash libraries are available? How are they compared to each other? In this post, I will give a brief review of hash libraries and present a small benchmark showing their practical performance.

Hash table libraries

In C++, the most widely used hash table implementation is hash_map/set in SGI STL, which is part of the GCC compiler. Note that hash_map/set is SGI’s extention to STL, but is not part of STL. TR1 (technical report 1) tries to standardize hash tables. It provides unordered_map/set with similar API to hash_map/set. Most of TR1 routines are available since gcc-4.0. Google sparse hash is another C++ hash table template library with similar API to hash_map/set. It provides two implementations, one is efficient in speed and the other is in memory.

In contrast, there are few good C libraries around. I have tried SunriseDD, uthash, glibc hash table, hashit, Christopher Clark’s hashtable, glib hash table and ghthash. SunriseDD sounds a great library that implements a lock-free hash table. However, I am not sure how to install it or use it, although the code itself is well documented. Uthash is a single header file. It is quite complex to use and incompatiable with C++. It also lacks basic APIs such as counting how many elements in the hash table. Glibc hash and hashit seem to only implement static hash tables. Glibc hash even does not have deletion operation. Only glib hash, CC’s hashtable and ghthash implement most of common operations. And they still have their weakness in comparison to C++ implementations (see below).

Design of the benchmark

The benchmark is comprised of two experiments. In the first experiment, a random integer array of 5 million elements is generated with about 1.25 million distinct keys. Each element is then tested whether it is present in the hash. If the element is in the hash, it will be removed; otherwise, it will be inserted. 625,792 distinct keys will be in the hash after this process. To test performance on string input, I convert integers to strings with sprintf().

The second experiment is designed by Craig Silverstein, the author of sparsehash. I am using his source codes. This experiment tests the performance of insertion from zero sized hash, insertion from preallocated hash, replacement, query, query of empty hash, and removal.

Results

The following table gives the results in the first experiment:

Library Mac-intCPU (sec) Mac-strCPU (sec) Mac PeakMem (MB) Linux-intCPU (sec) Linux-strCPU (sec) Linux PeakMem (MB)
glib 1.904 2.436 11.192 3.490 4.720 24.968
ghthash 2.593 2.869 29.0/39.0 3.260 3.460 61.232
CC’s hashtable 2.740 3.424 59.756 3.040 4.050 129.020
TR1 1.371 2.571 16.140 1.750 3.300 28.648
STL hash_set 1.631 2.698 14.592 2.070 3.430 25.764
google-sparse 2.957 6.098 4.800 2.560 6.930 5.42/8.54
google-dense 0.700 2.833 24.616 0.550 2.820 24.7/49.3
khash (C++) 1.089 2.372 6.772 1.100 2.900 6.88/13.1
khash (C) 0.987 2.294 6.780 1.140 2.940 6.91/13.1
STL set (RB) 5.898 12.978 19.868 7.840 18.620 29.388
kbtree (C) 3.080 13.413 3.268 4.260 17.620 4.86/9.59
NP’s splaytree 8.455 23.369 8.936 11.180 27.610 19.024

Notes:

  • Please be aware that changing the size of input data may change the ranking of speed and memory. The speed of a library may vary up to 10% in two different runs.
  • CPU time is measured in seconds. Memory denotes the peak memory, measured in MB.
  • For string hash, only the pointer to a string is inserted. Memory in the table does not count the space used by strings.
  • If two numbers are given for memory, the first is for integer keys and the second for string keys.
  • For all C++ libraries and khash.h, one operation is needed to achieve “insert if absent; delete otherwise”. Glib and ghthash require two operations, which does not favour these two libraries.
  • The speed may also be influenced by the efficiency of hash funtions. Khash and Glib use the same hash function. TR1/SGI-STL/google-hash use another hash function. Fortunately, to my experiment, the two string hash functions have quite similar performance and so the benchmark reflects the performance of the overall hash libraries instead of just hash functions.
  • For glib and ghthash, what is inserted is the pointer to the integer instead of the integer itself.
  • Ghthash supports dynamic hash table. However, the results do not seem correct when this is switched on. I am using fixed-size hash table. This favours ghthash.
  • CC’s hashtable will force to free a key, which is not implemented in all the other libraries. This behaviour will add overhead on both speed and memory in my benchmark (but probably not in other applications). The memory is measured for integer keys.
  • This simple benchmark does not test the strength and weakness of splay tree.

And here is the result of the second experiment:

Library grow pred/grow replace fetch fetchnull remove Memory
TR1 194.2 183.9 30.7 15.6 15.2 83.4 224.6
STL hash_map 149.0 110.5 35.6 11.5 14.0 87.2 204.2
STL map 289.9 289.9 141.3 134.3 7.0 288.6 236.8
google-sparse 417.2 237.6 89.5 84.0 12.1 100.4 85.4
google-dense 108.4 39.4 17.8 8.3 2.8 18.0 256.0
khash (C++) 111.2 99.2 26.1 11.5 3.0 17.4 198.0

Notes:

  • CPU time is measured in nanosecond for each operation. Memory is measured by TCmalloc. It is the memory difference before and after the allocation of the hash table, instead of the peak memory.
  • In this experiment, integers are inserted in order and there are no collisions in the hash table.
  • All these libraries provide similar API.

Discussions

  • Speed and memory. The larger the hash table, the fewer collisions may occur and the faster the speed. For the same hash library, increasing memory always increases speed. When we compare two libraries, both speed and memory should be considered.
  • C vs. C++. All C++ implementations have similar API. It is also very easy to use for any type of keys. Both C libraries, ghthash and glib, can only keep pointers to the keys, which complicates API and increases memory especially for 64-bit systems where a pointer takes 8 bytes. In general, C++ libraries is perferred over C ones. Surprisingly, on 32-bit Mac OS X, glib outperforms TR1 and STL for string input. This might indicate that the glib implementation itself is very efficient, but just the lack of functionality in C affects the performance.
  • Generic programming in C. Except my khash.h, all the other C hash libraries use (void*) to achieve generic typing. Using void* is okey for strings, but will cause overhead for integers. This is why all C libraries, except khash.h, is slower than C++ libraries on integer keys, but close to on string keys.
  • Open addressing vs. chaining hash. Khash and google hash implement open addressing hash while the remaining implement chaining hash. In open addressing hash, the size of each bucket equals the size of a key plus 0.25 byte. Google sparsehash further compresses unused bucket to 1 bit, achieving high memory efficiency. In chaining hash, the memory overhead of each bucket is at least 4 bytes on 32bit machines, or 8 bytes on 64bit machines. However, chaining hash is less affected when the hash table is nearly full. In practice, both open addressing and chaining hash occupy similar memory under similar speed. Khash takes less peak memory mainly due to its advanced technique in rehashing which reduces memory usage. So far as speed is concerned, chaining hash may have fewer comparison between keys. We can see this from the fact that the speed of chaining hash approaches that of open addressing hash on string keys but much slower on integer keys.
  • Memory usage of search trees. B-tree is the winner here. Each element in the B-tree only needs one additional pointer. When there are enough elements, a B-tree is at least halfly full; on average it should be around 75% full. And so on 64-bit systems, for a B-tree with N elements, we need additional N*8/0.75=10N bytes memory. Splay tree will need N*8*2=16N extra space. RB tree is the worst.
  • Other issues. a) Google hash becomes unbearably slow when I try to put a lot of strings in the hash table. All the other libraries do not have this problem. b) Google hash performs more comparisons than khash. This is obvious because google-dense is clearly faster on integer keys but comparable to khash on string keys.

Concluding remarks

  • C++ hash library is much easier to use than C libraries. This is definitely where C++ is preferred over C.
  • TR1 hash implementation is no faster than STL implementation. They may outperform one another under certain input or settings.
  • SGI hash_map is faster and takes less memory than STL map. Unless ordering is important, hash_map is a better container than map.
  • Google hash is a worthy choice when we understand why it is slow for many string keys.
  • My khash library, which is a single-file C++ template header, achieves good balance between speed and memory. All my source codes are available at the Programs page.

Update

  1. C interface can be elegant, too, if we implement it cleverly. See this post.
  2. I realize that we just need one lookup to achieve “insert if absent; delete otherwise”. This further improves the speed for all C++ libraries.
  3. I have analyzed google dense hash table in this post which explains why it is faster than khash on integer keys but close to or slower than on string keys.
  4. This thread directed me to gcc hashtable, and cocom hashtable. They are more or less independent of other source codes, but it would still take time to separate the source codes. So, I have not benchmarked them. Just keep a record.
  5. Python dictionary is in fact a hash table. The dictnotes.txt in that directory gives some quite interesting discussion about how to implement hash efficiently.
  6. hashlib library. A bit hard to use and I cannot get it running correctly. Possibly I have not provided a proper second hash function for rehashing.
  7. Added results for STL set (based on red-black tree) and John-Mark Gurney’s B-tree implementation (JG’s btree). Both libraries are considerably slower than hash tables. Of course search trees provide more functionality than hash tables, and every nice thing comes with a price. I have also tried Jason Evans’s and Niels Provos’ red-black tree implementations. On integer keys, JE’s takes 6.110 seconds on Mac-Intel using 18.884 MB memory and NP’s taks 6.611 seconds using the same amount of memory. This performance is close to that of STL set. They appear to be slower mainly due to the additional malloc/free calls I have to made under their APIs. Unlike hash table which have a variety of ways to implement it, red-black tree usually has one way (well, can be more. See also Jason’s blog.). And so I only show the performance of STL set as a representitive.
  8. Replaced JG’s B-tree with a modified version. The new version is both faster and more light-weighted.

Read Full Post »

Sorting algorithm

Given an array of size N, sorting can be done in O(N log(N)) in average. The most frequently used sorting algorithms that can achieve this time complexity are quicksort, heapsort and mergesort. They usually require O(log(N)), O(1) and O(N) working space, respectively (the space complexity of mergesort can be improved at the cost of speed). Most people believe quicksort is the fastest sorting algorithm. However, the fact is quicksort is only fast in terms of the number of swaps. When comparison is expensive, mergesort is faster than quicksort because mergesort uses less comparisons. GNU sort uses mergesort. Replacing it with a quicksort reduces the speed on typical text input. In addition, of the three algorithms, only mergesort is a stable sort. Stability is sometimes useful for a general tool like GNU sort.

The worst-case time complexity of quicksort is O(N^2). In practice, we combine quicksort and heapsort to avoid worst-case performance while retaining the fast average speed. The resulting algorithm is called introsort (introspective sort).

Implementation

The two most widely used implementations are glibc qsort and STL (unstable) introsort. Libc qsort calls a function for comparison. For simple comparison, a function call is expensive, which may greatly hurt the efficiency of qsort. STL sort does not have this problem. It is one of the fastest implementations I am aware of. My own implementation of introsort is similar but not as fast as STL introsort.

GNU sort implements a top-down recursive sort. On integer sorting, it is twice slower than introsort (see below). Iterative top-down mergesort is hard to implement. Iterative bottom-up mergesort is much easier. My implementation is a bottom-up one.

Paul Hsieh has also implemented quicksort, heapsort and mergesort. His implementation should be very efficient from what I can tell. To see whether my implementation is good enough, I copied and pasted his codes in my program, and applied “inline” where necessary.

Comparison

I designed a small benchmark on sorting 50 million random integers. As comparison is cheap in this case, the number of swaps dominate the performance. I compiled and run the program on three machines: MacIntel (Core2-2G/Mac/g++-4.2.1), LinuxIntel (Xeon-1.86G/Linux/g++-4.1.2) and LinuxAMD (Opteron-2G/Linux/g++-3.4.4). On all the three platforms, the program was compiled with “-O2 -fomit-frame-pointer”. The time (in seconds) spent on sorting is showed in the following table:

Algorithm MacIntel LinuxIntel LinuxAMD Linux_icc
STL sort 7.749 8.260 7.170 8.400
STL stable_sort 9.684 11.990 10.270 10.770
libc qsort 16.579 81.190 30.490 81.120
introsort 7.887 8.880 7.670 9.320
iterative mergesort 10.371 12.480 10.110 10.130
binary heapsort 36.651 45.710 42.460 40.820
combsort11 18.131 19.290 19.370 19.490
isort (func call) 16.760 17.380 13.390 16.740
isort (template func) 7.902 8.800 7.690 9.010
Paul’s heapsort 34.790 43.680 40.740 39.060
Paul’s quicksort 8.410 8.940 7.810 9.450
Paul’s mergesort 11.103 13.390 10.680 13.030

As for the algorithm itself, we can see that introsort is the fastest and heapsort is the slowest. Mergesort is also very fast. Combsort11 is claimed to approach quicksort, but I do not see this in sorting large integer arrays. As for the implementation of quicksort/introsort, STL is the best, with my implementation following very closely. Paul’s implmentation is also very efficient. Libc qsort is clearly slower, which cannot simply attribute to the use of function calls. My implementation with function calls, although slower than without function calls, outperforms libc qsort on both Linux machines. As for the implementation of mergesort, my version has similar performance to STL stable_sort. Note that stable_sort uses buffered recursive mergesort when a temporary array can be allocated. When memory is insufficient, it will use in-place mergesort which is not evaluated here.

Availability and alternative benchmarks

My implementation is available here as a single C++ template header file. The program for benchmark is also available. Programs in plain text can be acquired by chopping .html in the two links.

Paul Hsieh’s benchmark is here, including the original source codes. He also discussed how algorithms perform when the initial array is not completely random (I am one of “naive people” in his standard). Please note that in his benchmark, he was sorting an array of size 60,000 for 10000 times, while in my benchmark I more focus on very large arrays. Notably, heapsort approaches introsort on small arrays, but far slower on large arrays. Presumably this is because the bad cache performance of heapsort. Both quicksort and mergesort are very cache efficient.

In addition to Paul’s benchmark, you can also find alternative ones here and here. They seem to be more interested in the theoretical issues rather than efficient practical implementations.

If you search “benchmark sorting algorithms” in google, the first result is this page, which was implemented in D by Stewart Gordon. This benchmark aims to evaluate the performance on small arrays. It also tests the speed when the array is sorted or reverse sorted. However, the implementation is not optimized enough at least for quicksort. Using insertion sort when the array is nearly sorted is always preferred. You can also find this report from google search, but the implementation of quicksort is too naive to be efficient.

Concluding remarks

Although in the table introsort performs the best, we may want to use mergesort if we want to perform stable sorting, or the comparison is very expensive. Mergesort is also faster than introsort if the array is nearly sorted. STL sort seems to take particular care in this case, which makes it still fast when the array is sorted.

In common cases when comparison is cheap, introsort is the best choice. Of the various implementations, STL is the fastest. If you do not use STL or you just want to use C, you can use/adapt my implmentation which is very close to STL sort in speed. Do not use libc qsort, especially on Linux. It is not well implemented.

Update

  1. This website gives severl good implementations of sorting algorithms. I also believe the programmer behind is very capable. Highly recommended.

Read Full Post »

Popular multialignment programs include: clustalw, T-coffee, dialign [PMID: 18505568, 10222408], muscle [PMID: 15318951], MAFFT [PMID: 18372315], probcons [PMID: 15687296] and probalign [PMID: 16954142]. Which is the best in practice? You can find various benchmarks in the papers listed above. I only summarize based on my understanding. It is recommended to read their papers in case my summary is biased. Also, both accuracy and speed may vary a lot with different data sets. I can only give results based on the data used in their papers.

Firstly, these software can be divided into two groups: those performing global multiple alignment and those performing local multiple alignment. In global alignment, each residue of each sequence is required to be aligned. This category includes clustalw, T-coffee, muscle, probcons and probalign. In local alignment, some residues are allowed to be unaligned. Only segments that closely related are aligned together. This category includes dialign family and MAFFT. They can also perform global alignment.

Most benchmarks are designed for global aligners. The consensus seems to be:

  • accuracy: probalign>probcons>MAFFT>muscle>T-coffee>>clustalw~dialign
  • speed:  clustalw~muscle>MAFFT>dialign>>probcons>probalign~T-coffee

Only MAFFT and dialign performs local alignment. The dialign paper claims that dialign is much more accurate. However, I doubt this largely depends on the definition of high-scoring segments. We can always find tighter alignment by excluding more residues.

Then, what is the best? Muscle or MAFFT. Although probcons and probalign are more accurate, they run impractically slower than muscle and MAFFT. As for the winner between muscle and MAFFT, I cannot decide. I also need to evaluate the memory consumption, usability and stability. I have only used muscle. It is really nice software!

Read Full Post »

ZOOM [PMID:18684737] is still unavailable even when the manuscript goes online. For the time being, there is no way to confirm whether their benchmarks are unbiased. Fortunately, we can collect some information from what they have presented. In the ZOOM paper, the authors give the memory consumption of ZOOM. It is 2.9GB for 12 million short reads. So far as I know, Eland will only use 30~40% of this memory. If Anthony Cox had meant to use more memory, he would have achieved faster speed. In addition, apparently ZOOM only reports the unique best hit according to this post. Eland will count the occurrences of all the 0-, 1- and 2-mismatch hits anyway. In fact, if you read eland source codes, you will know that counting occurrences is non-trivial and it is much more expensive to implement counting in ZOOM than in Eland. I bet ZOOM would be slower than Eland if it generates the same output as Eland. Of course, it is not always necessary to generate the same output, but detecting unique best hits alone are usually not good enough in some applications, such as detecting structural variations. We need to know whether there are many suboptimal hits with scores close to the best one.

Although the benchmark is biased, the ZOOM paper itself indeed presents an elegant idea about using non-contiguous seeds. I also believe ZOOM is highly tuned for performance by a group of strong programmers. Furthermore, even if ZOOM is slower than Eland, I would still prefer ZOOM as it eliminates the hard limits on the 32bp read length and the maximum two mismatches. I do not mind if ZOOM is a bit slower. I just want to see a fair comparison.

Update: ZOOM has been released and I have tried it a bit. The claim in the paper is honest: ZOOM is indeed faster than Eland while allowing longer reads. Nonetheless, ZOOM only reports unique best hits and therefore my claim here is fair, too: Eland is giving more information, which is one of the reasons that it is slower. Although ZOOM can output N top hits, apparently we can only achieve this by dumping all the potential hits. On human alignment, this is impractical.

Note that I pay a lot of attention to ZOOM mainly because it is a great software. In terms of efficiency and usability, ZOOM is much better than many other similar software that I have not mentioned here.

Read Full Post »