As sorting is the bottleneck of my application, I decided to optimize the code further with reference to this wonderful article. The optimized the version is about 40% faster than my original one. It is now about 2.5 times as fast as STL’s std::sort on random 32-bit integer arrays. The optimized version is slightly slower than the implementation in that article, but it is much faster on sorted arrays.

For sorting large integer arrays, radix sort is the king. It is much faster than other standard algorithms and is arguably simpler.

#define rstype_t uint64_t // type of the array #define rskey(x) (x) // specify how to get the integer from rstype_t #define RS_MIN_SIZE 64 // for an array smaller than this, use insertion sort typedef struct { rstype_t *b, *e; // begin and end of each bucket } rsbucket_t; void rs_insertsort(rstype_t *beg, rstype_t *end) // insertion sort { rstype_t *i; for (i = beg + 1; i < end; ++i) if (rskey(*i) < rskey(*(i - 1))) { rstype_t *j, tmp = *i; for (j = i; j > beg && rskey(tmp) < rskey(*(j-1)); --j) *j = *(j - 1); *j = tmp; } } // sort between [$beg, $end); take radix from ">>$s&((1<<$n_bits)-1)" void rs_sort(rstype_t *beg, rstype_t *end, int n_bits, int s) { rstype_t *i; int size = 1<<n_bits, m = size - 1; rsbucket_t *k, b[size], *be = b + size; // b[] keeps all the buckets for (k = b; k != be; ++k) k->b = k->e = beg; for (i = beg; i != end; ++i) ++b[rskey(*i)>>s&m].e; // count radix for (k = b + 1; k != be; ++k) // set start and end of each bucket k->e += (k-1)->e - beg, k->b = (k-1)->e; for (k = b; k != be;) { // in-place classification based on radix if (k->b != k->e) { // the bucket is not full rsbucket_t *l; if ((l = b + (rskey(*k->b)>>s&m)) != k) { // different bucket rstype_t tmp = *k->b, swap; do { // swap until we find an element in bucket $k swap = tmp; tmp = *l->b; *l->b++ = swap; l = b + (rskey(tmp)>>s&m); } while (l != k); *k->b++ = tmp; // push the found element to $k } else ++k->b; // move to the next element in the bucket } else ++k; // move to the next bucket } for (b->b = beg, k = b + 1; k != be; ++k) k->b = (k-1)->e; // reset k->b if (s) { // if $s is non-zero, we need to sort buckets s = s > n_bits? s - n_bits : 0; for (k = b; k != be; ++k) if (k->e - k->b > RS_MIN_SIZE) rs_sort(k->b, k->e, n_bits, s); else if (k->e - k->b > 1) rs_insertsort(k->b, k->e); } } void radix_sort(rstype_t *beg, rstype_t *end) { if (end - beg <= RS_MIN_SIZE) rs_insertsort(beg, end); else rs_sort(beg, end, 8, sizeof(rskey(*beg)) * 8 - 8); }

EDIT: Just found this implementation. It is as fast as mine and is simpler. Recommended.

on June 10, 2012 at 8:58 pm |MaciejSDid you try Pierre Terdiman’s version? (very bottom, includes link to M. Herf’s article as well) http://codercorner.com/RadixSortRevisited.htm

on June 10, 2012 at 9:43 pm |attractivechaosThanks a lot. I knew there is always something important I have missed. I have tried the implementation just now. I need to change a few lines to make it compiled with modern g++/clang++. The output are at least sorted. If my changes do not affect efficiency, the implementation is not that fast. It sorts the same array with 50 million random 32-bit integers in 21 seconds, while my version takes less than 2 seconds. In addition, this implementation does not sort the array in place, while I am mainly interested in an in-place but non-stable implementation as I am dealing with more than 1 billion numbers. I cannot afford to allocate other large arrays.

It is possible that Pierre’s implementation is optimized for CPUs 14 years ago, but modern CPUs are quite different. I have not read the source code, but from the API, it seems that the implementation is not quite cache efficient. This may also explain its performance.

Thanks anyway.

on June 12, 2012 at 11:24 am |IngenuYou might be interested in checking “A Cache-Aware Hybrid Sorter” from Manny Ko.

(Published in Game Engine Gems 2.)

on June 13, 2012 at 1:23 am |attractivechaosThanks. I will have a look.

on June 13, 2012 at 1:43 pm |ManabuHave you seen the: http://sortbenchmark.org/ ?

Overload of sorting methods, but I’m not sure if they are useful for your use case.

on June 23, 2012 at 12:17 am |Manny KoHi,

Thanks for the shutout Ingenu. I wrote the Cache-Aware Hybrid Sorter. I tried to minimize the # of branches and to make the memory access pattern as efficient as possible.

I also have a more efficient way to tackle floats using bit operations and sign extension. Much more efficient than Terdiman original form.

Cheers.

Manny

on September 23, 2012 at 6:34 am |Paul PI’m curious, how does your radix sort compare to the one on wikipedia? In my experience, it is very,very fast.

on September 24, 2012 at 11:15 pm |attractivechaosFirstly, it is not in-place. You need an extra array. The non-in-place version is much easier to implement. Secondly, for the demonstration purpose, it uses “x%10”, which is much slower than “&”. That implementation is not going to be fast.

on September 29, 2012 at 5:01 amppak98Thanks, I’ll definitely give the gorset radix code a try. I have an app which sorts hundreds of millions of records with radix sort all the time, so it might shave a second or two. 🙂

on April 26, 2013 at 12:37 pm |Linear Time Sorting, Counting Sort & Radix Sort | Khuram Ali[…] An update on radix sort (attractivechaos.wordpress.com) […]

on February 27, 2017 at 10:42 am |和平老三_peace3Can you explain your code？I don’t understand ” for (i = beg; i != end; ++i) ++b[rskey(*i)>>s&m].e; // count radix “