Feeds:
Posts

## Calculating Median

Here is an example that google does not give me the result in the first page. I want to know how to calculate median efficiently, and so I search “c calculate median”. In the first result page, google brings me to several forums which only show very naive implementations. The 11th result, this page, is the truely invaluable one which should be favoured by most programmers. I do not want to replicate that website. I just want to show you a function that is adapted from quickselect.c on the website. This function calculates the k-smallest (0<=k<n) element in an array. Its time complexity is linear to the size of the array and in practice it runs much faster than sorting and then locating the k-smallest element.

```type_t ks_ksmall(size_t n, type_t arr[], size_t kk)
{
type_t *low, *high, *k, *ll, *hh, *middle;
low = arr; high = arr + n - 1; k = arr + kk;
for (;;) {
if (high <= low) return *k;
if (high == low + 1) {
if (cmp(*high, *low)) swap(type_t, *low, *high);
return *k;
}
middle = low + (high - low) / 2;
if (lt(*high, *middle)) swap(type_t, *middle, *high);
if (lt(*high, *low)) swap(type_t, *low, *high);
if (lt(*low, *middle)) swap(type_t, *middle, *low);
swap(type_t, *middle, *(low+1)) ;
ll = low + 1; hh = high;
for (;;) {
do ++ll; while (lt(*ll, *low));
do --hh; while (lt(*low, *hh));
if (hh < ll) break;
swap(type_t, *ll, *hh);
}
swap(type_t, *low, *hh);
if (hh <= k) low = ll;
if (hh >= k) high = hh - 1;
}
}
```

In this funcion, type_t is a type, swap() swaps two values, and lt() is a macro or a function that returns true if and only if the first value is smaller.

## Comparison of Internal Sorting Algorithms

Sorting algorithm

Given an array of size N, sorting can be done in O(N log(N)) in average. The most frequently used sorting algorithms that can achieve this time complexity are quicksort, heapsort and mergesort. They usually require O(log(N)), O(1) and O(N) working space, respectively (the space complexity of mergesort can be improved at the cost of speed). Most people believe quicksort is the fastest sorting algorithm. However, the fact is quicksort is only fast in terms of the number of swaps. When comparison is expensive, mergesort is faster than quicksort because mergesort uses less comparisons. GNU sort uses mergesort. Replacing it with a quicksort reduces the speed on typical text input. In addition, of the three algorithms, only mergesort is a stable sort. Stability is sometimes useful for a general tool like GNU sort.

The worst-case time complexity of quicksort is O(N^2). In practice, we combine quicksort and heapsort to avoid worst-case performance while retaining the fast average speed. The resulting algorithm is called introsort (introspective sort).

Implementation

The two most widely used implementations are glibc qsort and STL (unstable) introsort. Libc qsort calls a function for comparison. For simple comparison, a function call is expensive, which may greatly hurt the efficiency of qsort. STL sort does not have this problem. It is one of the fastest implementations I am aware of. My own implementation of introsort is similar but not as fast as STL introsort.

GNU sort implements a top-down recursive sort. On integer sorting, it is twice slower than introsort (see below). Iterative top-down mergesort is hard to implement. Iterative bottom-up mergesort is much easier. My implementation is a bottom-up one.

Paul Hsieh has also implemented quicksort, heapsort and mergesort. His implementation should be very efficient from what I can tell. To see whether my implementation is good enough, I copied and pasted his codes in my program, and applied “inline” where necessary.

Comparison

I designed a small benchmark on sorting 50 million random integers. As comparison is cheap in this case, the number of swaps dominate the performance. I compiled and run the program on three machines: MacIntel (Core2-2G/Mac/g++-4.2.1), LinuxIntel (Xeon-1.86G/Linux/g++-4.1.2) and LinuxAMD (Opteron-2G/Linux/g++-3.4.4). On all the three platforms, the program was compiled with “-O2 -fomit-frame-pointer”. The time (in seconds) spent on sorting is showed in the following table:

 Algorithm MacIntel LinuxIntel LinuxAMD Linux_icc STL sort 7.749 8.260 7.170 8.400 STL stable_sort 9.684 11.990 10.270 10.770 libc qsort 16.579 81.190 30.490 81.120 introsort 7.887 8.880 7.670 9.320 iterative mergesort 10.371 12.480 10.110 10.130 binary heapsort 36.651 45.710 42.460 40.820 combsort11 18.131 19.290 19.370 19.490 isort (func call) 16.760 17.380 13.390 16.740 isort (template func) 7.902 8.800 7.690 9.010 Paul’s heapsort 34.790 43.680 40.740 39.060 Paul’s quicksort 8.410 8.940 7.810 9.450 Paul’s mergesort 11.103 13.390 10.680 13.030

As for the algorithm itself, we can see that introsort is the fastest and heapsort is the slowest. Mergesort is also very fast. Combsort11 is claimed to approach quicksort, but I do not see this in sorting large integer arrays. As for the implementation of quicksort/introsort, STL is the best, with my implementation following very closely. Paul’s implmentation is also very efficient. Libc qsort is clearly slower, which cannot simply attribute to the use of function calls. My implementation with function calls, although slower than without function calls, outperforms libc qsort on both Linux machines. As for the implementation of mergesort, my version has similar performance to STL stable_sort. Note that stable_sort uses buffered recursive mergesort when a temporary array can be allocated. When memory is insufficient, it will use in-place mergesort which is not evaluated here.

Availability and alternative benchmarks

My implementation is available here as a single C++ template header file. The program for benchmark is also available. Programs in plain text can be acquired by chopping .html in the two links.

Paul Hsieh’s benchmark is here, including the original source codes. He also discussed how algorithms perform when the initial array is not completely random (I am one of “naive people” in his standard). Please note that in his benchmark, he was sorting an array of size 60,000 for 10000 times, while in my benchmark I more focus on very large arrays. Notably, heapsort approaches introsort on small arrays, but far slower on large arrays. Presumably this is because the bad cache performance of heapsort. Both quicksort and mergesort are very cache efficient.

In addition to Paul’s benchmark, you can also find alternative ones here and here. They seem to be more interested in the theoretical issues rather than efficient practical implementations.

If you search “benchmark sorting algorithms” in google, the first result is this page, which was implemented in D by Stewart Gordon. This benchmark aims to evaluate the performance on small arrays. It also tests the speed when the array is sorted or reverse sorted. However, the implementation is not optimized enough at least for quicksort. Using insertion sort when the array is nearly sorted is always preferred. You can also find this report from google search, but the implementation of quicksort is too naive to be efficient.

Concluding remarks

Although in the table introsort performs the best, we may want to use mergesort if we want to perform stable sorting, or the comparison is very expensive. Mergesort is also faster than introsort if the array is nearly sorted. STL sort seems to take particular care in this case, which makes it still fast when the array is sorted.

In common cases when comparison is cheap, introsort is the best choice. Of the various implementations, STL is the fastest. If you do not use STL or you just want to use C, you can use/adapt my implmentation which is very close to STL sort in speed. Do not use libc qsort, especially on Linux. It is not well implemented.

Update

1. This website gives severl good implementations of sorting algorithms. I also believe the programmer behind is very capable. Highly recommended.

## The Eland Short Read Aligner

Eland is a computer program that aligns short oligonucleotides against a reference genome. It is written by Anthony Cox from the Illumina company. The source codes are freely available to machine buyers. Eland is the first program for short read alignment and it profoundly influences most of its successors.

Algorithm

Eland guarantees full sensitivity for hits with up to two mismatches. To achieve this, Eland divides a read into four parts with approximately equal lengths. Six noncontiguous seed templates can be constructed with each template covering two parts. Eland applies each seed templates on reads and indexes the sequences that pass the template. After indexing it scans the reference genome sequence base by base and then looks up each K-mer in the index to find hits. As any two mismatches can only occur to at most two of the four parts, the seeding strategy used by Eland guarantees to find all 2-mismatch hits.

Eland might be the first widely used program that achieves full sensitivity given a threshold. Most of Eland successors learn from this point, and some software, such as Maq and SeqMap, even use the same seeding strategy. SOLiD read mapping software and ZOOM further extend the idea of using noncontiguous seed templates to achieve full sensitivity with fewer templates.

Efficiency

Eland is so fast that it effectively sets a goal for all its successors. Although several software are claimed to achieve so, they cannot retain the same useful information as Eland. For example, frequently a program only gives the unique best hit without couting the occurrences of a read. Counting greatly helps to reduce false alignment in some cases, but implementing this is non-trivial. Comparing Eland to a program without counting is unfair.

Limitations

Eland is not perfect, though. Natively, it does not support reads longer than 32bp, paired end mapping nor gapped alignment. Eland provides several scripts to overcome these limitations, but the power is reduced. In particular, even with the help of the additional scripts, Eland will miss a read alignment if the first 32bp of the read is nonunique or if the read has more than 10 identical hits in the reference genome. These limitations give the room for other short read aligners.

## Classification of Pairwise Sequence Search Algorithms

Definition of Pairwise Sequence Search (PSS)

Roughly speaking, given two sets of sequences A and B, the pairwise sequence search (PSS) problem is to find the high-scoring matches between each pair of sequences, one from each set. PSS can, but not necessaily, be done by performing pairwise alignment between each pair of sequences and then retaining high-scoring matches only. Better algorithms usually index one set of sequences to achieve better time complexity.

Online vs. Offline Search

Sequence search can be online or offline. In online sequence search, all the sequences in alignment are not preprocessed; in offline search, one set of sequences can be indexed first and stored in a on-disk file, called database, and the other set of sequences, called queries, are aligned against the index. For online search, the time complexity is at least linear to the sum of lengths of the larger set. As the indexing step is separated out from matching, the time complexity of offline search can be better especially when lengths of the two sets are highly unbalanced.

Sometimes there is no clear edge between online and offline search. Many online search algorithms are also involved with an indexing step where one or both sets of sequences are indexed. Separating indexing and searching will make an online algorithm become an offline one. In practice, we usually prefer separating the indexing step if one set of sequences is static, and if one set of sequences is much larger than the other or indexing is very expensive.

Algorithms for Indexing

Most of fast searching algorithms involve indexing and algorithms used for indexing largely determine algorithms used for searching.

There are three major categories of algorithms for indexing: sorting, filtering and suffix tree based indexing. There are not clear edges between them. For example, sorting usually comes with filtering, suffix tree can be used to compute filters, and sorting can be used to construct suffix array.

(I will explain the three categories in future posts.)