As sorting is the bottleneck of my application, I decided to optimize the code further with reference to this wonderful article. The optimized the version is about 40% faster than my original one. It is now about 2.5 times as fast as STL’s std::sort on random 32-bit integer arrays. The optimized version is slightly slower than the implementation in that article, but it is much faster on sorted arrays.
For sorting large integer arrays, radix sort is the king. It is much faster than other standard algorithms and is arguably simpler.
#define rstype_t uint64_t // type of the array #define rskey(x) (x) // specify how to get the integer from rstype_t #define RS_MIN_SIZE 64 // for an array smaller than this, use insertion sort typedef struct { rstype_t *b, *e; // begin and end of each bucket } rsbucket_t; void rs_insertsort(rstype_t *beg, rstype_t *end) // insertion sort { rstype_t *i; for (i = beg + 1; i < end; ++i) if (rskey(*i) < rskey(*(i - 1))) { rstype_t *j, tmp = *i; for (j = i; j > beg && rskey(tmp) < rskey(*(j-1)); --j) *j = *(j - 1); *j = tmp; } } // sort between [$beg, $end); take radix from ">>$s&((1<<$n_bits)-1)" void rs_sort(rstype_t *beg, rstype_t *end, int n_bits, int s) { rstype_t *i; int size = 1<<n_bits, m = size - 1; rsbucket_t *k, b[size], *be = b + size; // b[] keeps all the buckets for (k = b; k != be; ++k) k->b = k->e = beg; for (i = beg; i != end; ++i) ++b[rskey(*i)>>s&m].e; // count radix for (k = b + 1; k != be; ++k) // set start and end of each bucket k->e += (k-1)->e - beg, k->b = (k-1)->e; for (k = b; k != be;) { // in-place classification based on radix if (k->b != k->e) { // the bucket is not full rsbucket_t *l; if ((l = b + (rskey(*k->b)>>s&m)) != k) { // different bucket rstype_t tmp = *k->b, swap; do { // swap until we find an element in bucket $k swap = tmp; tmp = *l->b; *l->b++ = swap; l = b + (rskey(tmp)>>s&m); } while (l != k); *k->b++ = tmp; // push the found element to $k } else ++k->b; // move to the next element in the bucket } else ++k; // move to the next bucket } for (b->b = beg, k = b + 1; k != be; ++k) k->b = (k-1)->e; // reset k->b if (s) { // if $s is non-zero, we need to sort buckets s = s > n_bits? s - n_bits : 0; for (k = b; k != be; ++k) if (k->e - k->b > RS_MIN_SIZE) rs_sort(k->b, k->e, n_bits, s); else if (k->e - k->b > 1) rs_insertsort(k->b, k->e); } } void radix_sort(rstype_t *beg, rstype_t *end) { if (end - beg <= RS_MIN_SIZE) rs_insertsort(beg, end); else rs_sort(beg, end, 8, sizeof(rskey(*beg)) * 8 - 8); }
EDIT: Just found this implementation. It is as fast as mine and is simpler. Recommended.
Did you try Pierre Terdiman’s version? (very bottom, includes link to M. Herf’s article as well) http://codercorner.com/RadixSortRevisited.htm
Thanks a lot. I knew there is always something important I have missed. I have tried the implementation just now. I need to change a few lines to make it compiled with modern g++/clang++. The output are at least sorted. If my changes do not affect efficiency, the implementation is not that fast. It sorts the same array with 50 million random 32-bit integers in 21 seconds, while my version takes less than 2 seconds. In addition, this implementation does not sort the array in place, while I am mainly interested in an in-place but non-stable implementation as I am dealing with more than 1 billion numbers. I cannot afford to allocate other large arrays.
It is possible that Pierre’s implementation is optimized for CPUs 14 years ago, but modern CPUs are quite different. I have not read the source code, but from the API, it seems that the implementation is not quite cache efficient. This may also explain its performance.
Thanks anyway.
You might be interested in checking “A Cache-Aware Hybrid Sorter” from Manny Ko.
(Published in Game Engine Gems 2.)
Thanks. I will have a look.
Have you seen the: http://sortbenchmark.org/ ?
Overload of sorting methods, but I’m not sure if they are useful for your use case.
Hi,
Thanks for the shutout Ingenu. I wrote the Cache-Aware Hybrid Sorter. I tried to minimize the # of branches and to make the memory access pattern as efficient as possible.
I also have a more efficient way to tackle floats using bit operations and sign extension. Much more efficient than Terdiman original form.
Cheers.
Manny
I’m curious, how does your radix sort compare to the one on wikipedia? In my experience, it is very,very fast.
Firstly, it is not in-place. You need an extra array. The non-in-place version is much easier to implement. Secondly, for the demonstration purpose, it uses “x%10”, which is much slower than “&”. That implementation is not going to be fast.
Thanks, I’ll definitely give the gorset radix code a try. I have an app which sorts hundreds of millions of records with radix sort all the time, so it might shave a second or two. 🙂
[…] An update on radix sort (attractivechaos.wordpress.com) […]
Can you explain your code?I don’t understand ” for (i = beg; i != end; ++i) ++b[rskey(*i)>>s&m].e; // count radix “
[…] 原文地址。通过作者的修改,排序速度确实非常快。是一种建立在底层优化上的排序方法。主要的做法是将排序的数字变成二进制,并且安放了256个桶。 作者也将这种算法写到了他的开源作品klib(一个c语言的标准库)中。github地址。 […]