I have recently been working on an algorithm that, surprisingly, spends more than half of its time sorting huge, partially ordered arrays of 64-bit integer pairs (one integer for the key and the other for the value). Naturally, I wanted to optimize this sorting step. Initially, I tried my own implementation of introsort; the program took about 90 seconds on a sample data set. I then switched to the iterative mergesort in the same library, which took 55 seconds. I guess mergesort is faster here because the arrays are partially ordered. However, my mergesort implementation requires a temporary array of the same size as the input. As the arrays are huge, allocating this extra array is unacceptable for real data, and implementing an in-place mergesort is quite challenging. That led me to radix sort, which I had not implemented before.
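To make the trade-off concrete, here is a minimal sketch of a bottom-up (iterative) mergesort for key/value pairs. The `pair64_t` type and the function name are my own for illustration, not the library's; the point is the `malloc` of a second buffer of the same size, which is exactly the cost that rules this approach out for huge arrays.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative pair type: 64-bit key plus 64-bit value. */
typedef struct { uint64_t key, val; } pair64_t;

/* Bottom-up mergesort on the key. Note the temporary buffer of the
 * same size as the input: this is the memory cost discussed above. */
void mergesort_pair64(pair64_t *a, size_t n)
{
    pair64_t *buf = malloc(n * sizeof(*a));
    if (buf == NULL || n < 2) { free(buf); return; }
    pair64_t *src = a, *dst = buf;
    for (size_t w = 1; w < n; w <<= 1) {      /* double run width each pass */
        for (size_t lo = 0; lo < n; lo += 2 * w) {
            size_t mid = lo + w < n ? lo + w : n;
            size_t hi  = lo + 2 * w < n ? lo + 2 * w : n;
            size_t i = lo, j = mid, k = lo;
            while (i < mid && j < hi)          /* <= keeps the sort stable */
                dst[k++] = src[i].key <= src[j].key ? src[i++] : src[j++];
            while (i < mid) dst[k++] = src[i++];
            while (j < hi)  dst[k++] = src[j++];
        }
        pair64_t *t = src; src = dst; dst = t; /* ping-pong the buffers */
    }
    if (src != a) memcpy(a, src, n * sizeof(*a));
    free(buf);
}
```

Because merging always consumes whole runs from contiguous halves, partially ordered input produces long already-sorted runs that merge with mostly sequential, branch-friendly comparisons, which is one plausible reason mergesort beats introsort here.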
My radix sort implementation is here. It is not written as a library, but it should be easy to adapt to other data types. The C program is quite simple and not much different from existing implementations.
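For readers who have not written one, here is a sketch of the general technique in C: an in-place MSD radix sort (American flag sort) on the 64-bit key, 8 bits per pass, which needs only O(1) extra space beyond the recursion and counting tables. This is a generic textbook variant under my own names (`pair64_t`, `rs_sort`), not the author's exact code.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Illustrative pair type: 64-bit key plus 64-bit value. */
typedef struct { uint64_t key, val; } pair64_t;

/* In-place MSD radix sort on the key. `shift` selects the current byte;
 * call with shift = 56 to start from the most significant byte. */
void rs_sort(pair64_t *a, size_t n, int shift)
{
    size_t count[256] = {0}, start[256], next[256];
    if (n < 2) return;
    for (size_t i = 0; i < n; ++i)            /* histogram of the current byte */
        ++count[(a[i].key >> shift) & 0xff];
    start[0] = 0;                             /* prefix sums = bucket offsets */
    for (int b = 1; b < 256; ++b) start[b] = start[b - 1] + count[b - 1];
    memcpy(next, start, sizeof(next));
    for (int b = 0; b < 256; ++b) {           /* permute elements into buckets */
        size_t end = start[b] + count[b];
        while (next[b] < end) {
            pair64_t x = a[next[b]];
            int d = (x.key >> shift) & 0xff;
            if (d == b) ++next[b];            /* already in its bucket */
            else {                            /* swap into destination bucket */
                pair64_t y = a[next[d]];
                a[next[d]++] = x;
                a[next[b]] = y;
            }
        }
    }
    if (shift > 0)                            /* recurse on the next byte */
        for (int b = 0; b < 256; ++b)
            if (count[b] > 1) rs_sort(a + start[b], count[b], shift - 8);
}
```

A production version would typically fall back to insertion sort for small buckets and skip passes whose histogram has a single non-empty bucket; unlike LSD radix sort, this in-place variant is not stable.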
How about the performance? With radix sort, my program takes 35 seconds while using little extra working space: replacing introsort with the integer-only radix sort makes the whole program well over twice as fast (90s → 35s). To evaluate the performance of radix sort separately, I put the code in my old ksort_test.cc. Here are the CPU seconds spent on sorting 50 million random or sorted integers:
|Algorithm|Sorted?|Mac CPU (sec)|Linux CPU (sec)|
|---------|-------|-------------|---------------|