Feeds:
Posts

## An update on radix sort

As sorting is the bottleneck of my application, I decided to optimize the code further with reference to this wonderful article. The optimized the version is about 40% faster than my original one. It is now about 2.5 times as fast as STL’s std::sort on random 32-bit integer arrays. The optimized version is slightly slower than the implementation in that article, but it is much faster on sorted arrays.

For sorting large integer arrays, radix sort is the king. It is much faster than other standard algorithms and is arguably simpler.

```#define rstype_t uint64_t // type of the array
#define rskey(x) (x) // specify how to get the integer from rstype_t

#define RS_MIN_SIZE 64 // for an array smaller than this, use insertion sort

typedef struct {
rstype_t *b, *e; // begin and end of each bucket
} rsbucket_t;

void rs_insertsort(rstype_t *beg, rstype_t *end) // insertion sort
{
rstype_t *i;
for (i = beg + 1; i < end; ++i)
if (rskey(*i) < rskey(*(i - 1))) {
rstype_t *j, tmp = *i;
for (j = i; j > beg && rskey(tmp) < rskey(*(j-1)); --j)
*j = *(j - 1);
*j = tmp;
}
}
// sort between [\$beg, \$end); take radix from ">>\$s&((1<<\$n_bits)-1)"
void rs_sort(rstype_t *beg, rstype_t *end, int n_bits, int s)
{
rstype_t *i;
int size = 1<<n_bits, m = size - 1;
rsbucket_t *k, b[size], *be = b + size; // b[] keeps all the buckets

for (k = b; k != be; ++k) k->b = k->e = beg;
for (i = beg; i != end; ++i) ++b[rskey(*i)>>s&m].e; // count radix
for (k = b + 1; k != be; ++k) // set start and end of each bucket
k->e += (k-1)->e - beg, k->b = (k-1)->e;
for (k = b; k != be;) { // in-place classification based on radix
if (k->b != k->e) { // the bucket is not full
rsbucket_t *l;
if ((l = b + (rskey(*k->b)>>s&m)) != k) { // different bucket
rstype_t tmp = *k->b, swap;
do { // swap until we find an element in bucket \$k
swap = tmp; tmp = *l->b; *l->b++ = swap;
l = b + (rskey(tmp)>>s&m);
} while (l != k);
*k->b++ = tmp; // push the found element to \$k
} else ++k->b; // move to the next element in the bucket
} else ++k; // move to the next bucket
}
for (b->b = beg, k = b + 1; k != be; ++k) k->b = (k-1)->e; // reset k->b
if (s) { // if \$s is non-zero, we need to sort buckets
s = s > n_bits? s - n_bits : 0;
for (k = b; k != be; ++k)
if (k->e - k->b > RS_MIN_SIZE) rs_sort(k->b, k->e, n_bits, s);
else if (k->e - k->b > 1) rs_insertsort(k->b, k->e);
}
}

{
if (end - beg <= RS_MIN_SIZE) rs_insertsort(beg, end);
else rs_sort(beg, end, 8, sizeof(rskey(*beg)) * 8 - 8);
}
```

EDIT: Just found this implementation. It is as fast as mine and is simpler. Recommended.

### 10 Responses

1. Did you try Pierre Terdiman’s version? (very bottom, includes link to M. Herf’s article as well) http://codercorner.com/RadixSortRevisited.htm

• on June 10, 2012 at 9:43 pm | Reply attractivechaos

Thanks a lot. I knew there is always something important I have missed. I have tried the implementation just now. I need to change a few lines to make it compiled with modern g++/clang++. The output are at least sorted. If my changes do not affect efficiency, the implementation is not that fast. It sorts the same array with 50 million random 32-bit integers in 21 seconds, while my version takes less than 2 seconds. In addition, this implementation does not sort the array in place, while I am mainly interested in an in-place but non-stable implementation as I am dealing with more than 1 billion numbers. I cannot afford to allocate other large arrays.

It is possible that Pierre’s implementation is optimized for CPUs 14 years ago, but modern CPUs are quite different. I have not read the source code, but from the API, it seems that the implementation is not quite cache efficient. This may also explain its performance.

Thanks anyway.

2. You might be interested in checking “A Cache-Aware Hybrid Sorter” from Manny Ko.
(Published in Game Engine Gems 2.)

3. Have you seen the: http://sortbenchmark.org/ ?

Overload of sorting methods, but I’m not sure if they are useful for your use case.

4. Hi,

Thanks for the shutout Ingenu. I wrote the Cache-Aware Hybrid Sorter. I tried to minimize the # of branches and to make the memory access pattern as efficient as possible.

I also have a more efficient way to tackle floats using bit operations and sign extension. Much more efficient than Terdiman original form.

Cheers.

Manny

5. I’m curious, how does your radix sort compare to the one on wikipedia? In my experience, it is very,very fast.

• Firstly, it is not in-place. You need an extra array. The non-in-place version is much easier to implement. Secondly, for the demonstration purpose, it uses “x%10″, which is much slower than “&”. That implementation is not going to be fast.

• Thanks, I’ll definitely give the gorset radix code a try. I have an app which sorts hundreds of millions of records with radix sort all the time, so it might shave a second or two.

6. [...] An update on radix sort (attractivechaos.wordpress.com) [...]