GNU sort is one of my favorite programs. It is fast and highly flexible. However, sorting chromosome names with it is a pain. In bioinformatics, chromosomes are usually named chr1, chr2, …, chr10, chr11, …, chr20, …, chrX and chrY. As far as I can tell, there is no way to make sort put these names in the above order: plain string comparison, for example, places chr10 before chr2. In the end, I decided to modify GNU sort. I extracted the sort source code from textutils-1.22 because this version depends less on other packages.
The string comparison function is:
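    #include <ctype.h>   /* isdigit */
    #include <stdlib.h>  /* strtol */

    static int mixed_numcompare(const char *a, const char *b)
    {
        char *pa = (char*)a, *pb = (char*)b;
        while (*pa && *pb) {
            if (isdigit((unsigned char)*pa) && isdigit((unsigned char)*pb)) {
                /* both at a run of digits: compare the runs as numbers */
                long ai = strtol(pa, &pa, 10);
                long bi = strtol(pb, &pb, 10);
                if (ai != bi) return ai < bi ? -1 : 1;
            } else {
                /* otherwise compare character by character */
                if (*pa != *pb) break;
                ++pa; ++pb;
            }
        }
        /* both strings exhausted: the shorter one sorts first */
        if (*pa == *pb)
            return (pa-a) < (pb-b) ? -1 : (pa-a) > (pb-b) ? 1 : 0;
        return *pa < *pb ? -1 : 1;
    }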
It compares runs of digits numerically and all other characters as ordinary characters. With this comparison, chromosome names sort in the desired order. I added a new command-line option, -N (or -k1,1N), to trigger this mixed string-digit comparison.
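To see the comparison at work outside of sort, here is a minimal sketch that plugs it into the standard qsort. The comparison function is repeated so the snippet compiles on its own; cmp_names and sort_chromosomes are hypothetical helper names used only for this illustration and are not part of the modified sort.

```c
#include <ctype.h>
#include <stdlib.h>

/* The comparison function from the post, repeated here so that the
 * snippet is self-contained. */
static int mixed_numcompare(const char *a, const char *b)
{
    char *pa = (char*)a, *pb = (char*)b;
    while (*pa && *pb) {
        if (isdigit((unsigned char)*pa) && isdigit((unsigned char)*pb)) {
            long ai = strtol(pa, &pa, 10);
            long bi = strtol(pb, &pb, 10);
            if (ai != bi) return ai < bi ? -1 : 1;
        } else {
            if (*pa != *pb) break;
            ++pa; ++pb;
        }
    }
    if (*pa == *pb)
        return (pa-a) < (pb-b) ? -1 : (pa-a) > (pb-b) ? 1 : 0;
    return *pa < *pb ? -1 : 1;
}

/* Hypothetical qsort adapter: each array element is a pointer to a string. */
static int cmp_names(const void *x, const void *y)
{
    return mixed_numcompare(*(const char *const *)x,
                            *(const char *const *)y);
}

/* Hypothetical helper: sort an array of chromosome names in place. */
static void sort_chromosomes(const char **names, size_t n)
{
    qsort(names, n, sizeof *names, cmp_names);
}
```

Calling sort_chromosomes on the names chr10, chrX, chr2, chr1, chrY, chr11 should leave them as chr1, chr2, chr10, chr11, chrX, chrY. Numbered chromosomes come before chrX and chrY simply because digits precede uppercase letters in ASCII.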
In addition, I replaced the top-down recursive mergesort with a bottom-up iterative one, and used a heap to accelerate merging. The improved sort is a little faster than the original version.
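For readers unfamiliar with the bottom-up variant, the idea can be sketched as follows. This is not the code from the modified sort, only a minimal illustration that assumes an in-memory array of strings; str_cmp_fn, merge_runs and bottom_up_mergesort are hypothetical names, and the heap-based merging is not shown.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical comparator type: any strcmp-like function fits. */
typedef int (*str_cmp_fn)(const char *, const char *);

/* Merge the sorted runs src[lo,mid) and src[mid,hi) into dst[lo,hi). */
static void merge_runs(const char **src, const char **dst,
                       size_t lo, size_t mid, size_t hi, str_cmp_fn cmp)
{
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        /* "<=" takes from the left run on ties, which keeps the sort stable */
        dst[k++] = (cmp(src[i], src[j]) <= 0) ? src[i++] : src[j++];
    while (i < mid) dst[k++] = src[i++];
    while (j < hi)  dst[k++] = src[j++];
}

/* Sort a[0..n) without recursion: merge runs of width 1, 2, 4, ...
 * Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
static int bottom_up_mergesort(const char **a, size_t n, str_cmp_fn cmp)
{
    const char **src = a, **dst, **buf, **tmp;
    size_t width, lo;

    if (n < 2) return 0;
    buf = malloc(n * sizeof *buf);
    if (buf == NULL) return -1;
    dst = buf;

    for (width = 1; width < n; width *= 2) {
        for (lo = 0; lo < n; lo += 2 * width) {
            size_t mid = lo + width < n ? lo + width : n;
            size_t hi  = lo + 2 * width < n ? lo + 2 * width : n;
            merge_runs(src, dst, lo, mid, hi, cmp);
        }
        /* the merged data now lives in dst; swap roles for the next pass */
        tmp = src; src = dst; dst = tmp;
    }
    if (src != a) memcpy(a, src, n * sizeof *a);
    free(buf);
    return 0;
}
```

Each pass doubles the run width, so the loop does the same merges as the recursive version but without the function-call overhead. Any strcmp-like function can be passed as cmp, including the mixed comparison above.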
The improved sort can be downloaded here, distributed under the GPL.