
Just now I got an email from a mailing list saying that C++ greatly reduces coding time in comparison to C. I have heard this argument a lot. But is it true?

C++ can potentially accelerate development in two ways: firstly, OOP (Object-Oriented Programming) helps to organize large projects, and secondly, the STL (Standard Template Library) saves time on reimplementing frequently used subroutines. However, I do not find that C++ OOP helps me greatly. To me, it is not right to classify a programming language as procedure-oriented or object-oriented; it is only right to say a development methodology is procedure-oriented or object-oriented. We can effectively mimic the fundamental OOP ideas in C, a so-called procedure-oriented language, by packaging related data in a struct and passing a pointer to the struct to subroutines. I know C++ programmers would argue that this is far from OOP, but it captures the essence of OOP, and in practice this simple and natural idea is sufficient for organizing large projects. The large number of existing C projects, such as the Linux kernel, gcc and Emacs, prove this. With OOP ideas, we can use C to organize large projects without difficulty. C++ does not provide more power; it only introduces more complicated concepts.
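For illustration, here is a minimal sketch of this struct-plus-subroutines style. The type and function names (vecf_t and the vecf_* functions) are made up for this post, not taken from any real project:

#include <stdio.h>
#include <stdlib.h>

typedef struct {        /* the "class": related data packaged together */
  double *a;
  int n, max;
} vecf_t;

vecf_t *vecf_init(void) /* the "constructor" */
{
  return (vecf_t*)calloc(1, sizeof(vecf_t));
}
void vecf_push(vecf_t *v, double x) /* a "method": the object is the 1st argument */
{
  if (v->n == v->max) { /* grow the array when it is full */
    v->max = v->max? v->max<<1 : 4;
    v->a = (double*)realloc(v->a, v->max * sizeof(double));
  }
  v->a[v->n++] = x;
}
void vecf_destroy(vecf_t *v) /* the "destructor" */
{
  free(v->a); free(v);
}

int main(void)
{
  vecf_t *v = vecf_init();
  vecf_push(v, 3.14);
  printf("%g\n", v->a[0]);
  vecf_destroy(v);
  return 0;
}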

I do not use the STL most of the time. I have implemented most of the useful subroutines in C/C++ by myself, and I actually spend less time using my own library than using the STL because I am very familiar with my own code. Of course, implementing an efficient and yet generic library by myself took a lot of time, but I learned a lot in this invaluable process. I can hardly imagine how a programmer who lacks a firm grasp of data structures, which can only be achieved by implementing them oneself, can ever write good programs. To this end, I agree that using the STL reduces coding time for elementary programmers; but this is achieved at the cost of weakening the ability to write better programs. And for an advanced programmer, using the STL may help but probably does not save much time.

Note that I am not saying C++ is a bad language as a whole. In fact, I use C++ template functions a lot and C++ template classes at times. In this post, I just want to emphasize the importance of focusing on the art of programming instead of on artificial concepts or on the degree of laziness a language allows.

Eland is a computer program that aligns short oligonucleotides against a reference genome. It was written by Anthony Cox at Illumina. The source code is freely available to machine buyers. Eland was the first program for short read alignment and it has profoundly influenced most of its successors.

Algorithm

Eland guarantees full sensitivity for hits with up to two mismatches. To achieve this, Eland divides a read into four parts of approximately equal length. Six noncontiguous seed templates can be constructed, with each template covering two of the four parts. Eland applies each seed template to the reads and indexes the sequences that pass the template. After indexing, it scans the reference genome base by base and looks up each k-mer in the index to find hits. As two mismatches can fall in at most two of the four parts, at least two parts are always mismatch-free, and the template covering exactly those two parts yields an exact seed match; the seeding strategy is therefore guaranteed to find all 2-mismatch hits.
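This pigeonhole argument can even be checked by brute force. The following toy program is my own sketch, not Eland's code: it enumerates the six templates and confirms that every possible placement of two mismatches among the four parts leaves at least one template untouched.

#include <stdio.h>

int main(void)
{
  int tpl[6][2], k = 0, i, j, ok = 1;
  for (i = 0; i < 4; ++i)     /* enumerate the 6 = C(4,2) seed templates */
    for (j = i + 1; j < 4; ++j)
      tpl[k][0] = i, tpl[k][1] = j, ++k;
  for (i = 0; i < 4; ++i)     /* part hit by the 1st mismatch */
    for (j = i; j < 4; ++j) { /* part hit by the 2nd (possibly the same part) */
      int t, found = 0;
      for (t = 0; t < 6; ++t) /* some template must avoid both dirty parts */
        if (tpl[t][0] != i && tpl[t][0] != j && tpl[t][1] != i && tpl[t][1] != j)
          found = 1;
      if (!found) ok = 0;
    }
  printf(ok? "all 2-mismatch placements are covered\n" : "gap found\n");
  return 0;
}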

Eland might be the first widely used program that achieves full sensitivity under a given mismatch threshold. Most of Eland's successors learn from this point, and some software, such as Maq and SeqMap, even use the same seeding strategy. The SOLiD read mapping software and ZOOM further extend the idea of noncontiguous seed templates to achieve full sensitivity with fewer templates.

Efficiency

Eland is so fast that it effectively sets the goal for all its successors. Although several programs claim to achieve comparable speed, they do not retain the same useful information as Eland. For example, frequently a program only gives the unique best hit without counting the occurrences of a read. Counting greatly helps to reduce false alignments in some cases, but implementing it is non-trivial. Comparing Eland to a program without counting is unfair.

Limitations

Eland is not perfect, though. Natively, it does not support reads longer than 32bp, paired-end mapping or gapped alignment. Eland provides several scripts to overcome these limitations, but the power is reduced. In particular, even with the help of the additional scripts, Eland will miss a read alignment if the first 32bp of the read is non-unique or if the read has more than 10 identical hits in the reference genome. These limitations leave room for other short read aligners.

You can find good subroutines in GSL for multivariate nonlinear optimization, but the only method it provides for DFO (derivative-free optimization) is the Nelder-Mead simplex method, which, as Numerical Recipes claims, is “almost surely” slower than Powell's direction set method “in all likely applications”. You can find some DFO solvers here, but few of them are implemented in C/C++.

I have reimplemented a Hooke-Jeeves solver, adapted Powell's direction set method from Numerical Recipes in C, and ported the NEWUOA solver from Fortran to C++ with f2c. Of the three solvers, the Hooke-Jeeves method is the simplest. It simply probes the region around the current point and reduces the search radius in each iteration. It is purely heuristic and is hardly related to any “algorithm”. Powell's direction set method minimizes the objective function by repeatedly minimizing along a direction with Brent's method. It is believed to outperform the Nelder-Mead method. NEWUOA is a much more advanced method; frankly, I do not know how it works. A benchmark shows that NEWUOA outperforms NMSMAX, an implementation of the Nelder-Mead method, as well as APPSPACK. All three solvers work well on the problem I want to solve.
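To show how simple the Hooke-Jeeves idea is, here is a minimal sketch in the same template-header spirit. This is my own illustration, not the actual header: it implements only the exploratory moves and omits the pattern move of the full method.

#include <cstdio>

template<class Func>
double hj_minimize(Func f, double *x, int n, double step = 1.0,
                   double eps = 1e-7, int max_iter = 10000)
{
  double fx = f(x, n);
  for (int iter = 0; iter < max_iter && step > eps; ++iter) {
    bool improved = false;
    for (int i = 0; i < n; ++i) {   // probe +/- step along each coordinate
      for (int s = -1; s <= 1; s += 2) {
        double old = x[i];
        x[i] = old + s * step;
        double fy = f(x, n);
        if (fy < fx) { fx = fy; improved = true; break; }
        x[i] = old;                 // revert a failed probe
      }
    }
    if (!improved) step *= 0.5;     // nothing nearby is better: shrink the radius
  }
  return fx;
}

int main() // usage: minimize a shifted 2-D quadratic
{
  double x[2] = { 3.0, -2.0 };
  double fmin = hj_minimize([](const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += (x[i] - 1.0) * (x[i] - 1.0);
    return s;
  }, x, 2);
  std::printf("min=%g at (%g, %g)\n", fmin, x[0], x[1]);
  return 0;
}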

The three solvers are implemented as C++ template headers, one header file for each solver. The Hooke-Jeeves method is here and NEWUOA is here. I cannot release the source code for Powell's direction set method, as the Numerical Recipes license disallows this.

GNU sort is one of my favorite programs. It is fast and highly flexible. However, when I try to sort chromosome names, it becomes a pain. In bioinformatics, chromosomes are usually named chr1, chr2, …, chr10, chr11, …, chr20, …, chrX and chrY. It seems to me that there is no way to make GNU sort order these names as above. Finally, I decided to modify GNU sort. I extracted the sort source code from textutils-1.22 because this version is less dependent on other packages.

The string comparison function is:

#include <ctype.h>
#include <stdlib.h>

/* Compare runs of digits numerically and everything else byte by byte. */
static int mixed_numcompare(const char *a, const char *b)
{
  char *pa = (char*)a, *pb = (char*)b; /* strtol() wants non-const pointers */
  while (*pa && *pb) {
    if (isdigit((unsigned char)*pa) && isdigit((unsigned char)*pb)) {
      long ai, bi;
      ai = strtol(pa, &pa, 10); /* consume the whole digit run */
      bi = strtol(pb, &pb, 10);
      if (ai != bi) return ai < bi? -1 : 1;
    } else {
      if (*pa != *pb) break; /* first differing non-digit character */
      ++pa; ++pb;
    }
  }
  if (*pa == *pb) /* both exhausted: the shorter string (e.g. chr2 before chr02) wins */
    return (pa-a) < (pb-b)? -1 : (pa-a) > (pb-b)? 1 : 0;
  return *pa < *pb? -1 : 1;
}

It does numerical comparison for runs of digits and byte-wise comparison for other characters. With this comparison, chromosome names are sorted in the desired order. I added a new command line option -N (or -k1,1N) to trigger the string-digit mixed comparison.
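As a standalone demonstration (my own test driver, not part of the modified sort; it assumes mixed_numcompare() above is in the same file), the comparison can be exercised with qsort():

#include <stdio.h>
#include <stdlib.h>

static int cmp_wrapper(const void *x, const void *y) /* adapt to qsort's interface */
{
  return mixed_numcompare(*(const char* const*)x, *(const char* const*)y);
}

int main(void)
{
  const char *names[] = { "chr10", "chr2", "chrX", "chr1", "chr20", "chrY" };
  int i, n = sizeof(names) / sizeof(names[0]);
  qsort((void*)names, n, sizeof(names[0]), cmp_wrapper);
  for (i = 0; i < n; ++i)
    printf("%s\n", names[i]); /* chr1 chr2 chr10 chr20 chrX chrY */
  return 0;
}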

In addition, I also replaced the top-down recursive mergesort with a bottom-up iterative one, and used a heap to accelerate the merging. The improved sort is a little faster than the original version.

The improved sort can be downloaded here; it is distributed under the GPL.

Computers play an important role in simulating dynamical systems. For an ordinary system (as opposed to a chaotic system), the simulated trajectory always stays close to the error-free one, even though numerical methods are approximate.

A chaotic system is different. It is very sensitive to initial values. A tiny error in the initial values leads to behaviour completely different from that developed from the original initial values. Nonetheless, people still use computers to simulate such systems. If the simulated trajectory is far from the true, error-free one, what is the computer simulating? Should we trust such results?
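This sensitivity is easy to demonstrate. The following toy program is my own example: it iterates the logistic map x -> 4x(1-x) from two initial values that differ by only 1e-10, and within a few dozen steps the two trajectories have nothing to do with each other.

#include <stdio.h>

int main(void)
{
  double x = 0.3, y = 0.3 + 1e-10; /* two nearly identical initial values */
  int i;
  for (i = 1; i <= 60; ++i) {
    x = 4.0 * x * (1.0 - x);
    y = 4.0 * y * (1.0 - y);
    if (i % 10 == 0) /* the difference grows roughly exponentially */
      printf("step %d: x=%.6f y=%.6f diff=%+.2e\n", i, x, y, y - x);
  }
  return 0;
}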

We can, but most popular books on chaos theory do not explain the reason. I found the answer in a mathematics textbook, in what is known as the shadowing lemma. Under small errors, although the trajectory given by the computer can be far from the one developed without any error, there must exist a true trajectory that stays very close to the simulated one. In other words, the computer is not simulating the trajectory we meant to simulate, but another one that we never know in advance. The simulation is still valid if we want to analyze the properties of the whole dynamical system (as opposed to studying one particular trajectory).

Generating Chaos

Just to keep a record: I came across Chaoscope, which further directed me to Flam3 and then Oxidizer and finally Rampant Mac. What is this all about? They are fantastic programs/resources that generate rendered pictures of strange attractors. For example, you can easily generate the following picture in a couple of minutes once you know how to use the software.

Generated by Oxidizer with simil3.lua script

I am not an artist at all and am just a novice user of Oxidizer. You can find much better artwork from Chaoscope and Rampant Mac.

Popular multiple alignment programs include: clustalw, T-coffee, dialign [PMID: 18505568, 10222408], muscle [PMID: 15318951], MAFFT [PMID: 18372315], probcons [PMID: 15687296] and probalign [PMID: 16954142]. Which is the best in practice? You can find various benchmarks in the papers listed above; here I only summarize my own understanding, and it is recommended to read the papers in case my summary is biased. Also, both accuracy and speed may vary a lot with different data sets. I can only give results based on the data used in these papers.

Firstly, these programs can be divided into two groups: those performing global multiple alignment and those performing local multiple alignment. In global alignment, each residue of each sequence is required to be aligned. This category includes clustalw, T-coffee, muscle, probcons and probalign. In local alignment, some residues are allowed to be left unaligned; only segments that are closely related are aligned together. This category includes the dialign family and MAFFT, both of which can also perform global alignment.

Most benchmarks are designed for global aligners. The consensus seems to be:

  • accuracy: probalign>probcons>MAFFT>muscle>T-coffee>>clustalw~dialign
  • speed:  clustalw~muscle>MAFFT>dialign>>probcons>probalign~T-coffee

Only MAFFT and dialign perform local alignment. The dialign paper claims that dialign is much more accurate here. However, I suspect this largely depends on the definition of high-scoring segments: we can always produce a tighter alignment by excluding more residues.

Then, what is the best? Muscle or MAFFT. Although probcons and probalign are more accurate, they are impractically slow in comparison to muscle and MAFFT. As for the winner between muscle and MAFFT, I cannot decide; I would also need to evaluate memory consumption, usability and stability. I have only used muscle. It is really nice software!

I was ignorant. An hour ago, I thought it was impossible to implement a garbage collector (GC) for C, but this is certainly wrong.

For an interpreted language like Perl, it is cheap to keep track of memory that is no longer referenced, and therefore it is not so hard to identify and free unused memory in most cases, circular references being the exception. Java disallows raw pointers, including internal pointers, so objects that go out of scope can be easily identified and freed. C is very different. At first sight, it is impossible to directly tell where pointer variables point, so how can unused memory be identified? This page gives the answer: we can scan registers, the stack and static data regions to conservatively collect information on pointers. Knowing this information makes it possible to implement a GC for C. The most famous implementation is the Boehm-Demers-Weiser GC library. A third-party review shows that this GC may outperform manual memory management; it also thoroughly discusses the advantages and disadvantages of the library at the end. The Memory Management Reference is another website that provides insight into GC.
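Using the library is straightforward. Below is a minimal sketch of my own, assuming the libgc headers are installed (the header name and the -lgc link flag may vary with the installation):

#include <stdio.h>
#include <gc.h>

int main(void)
{
  int i;
  GC_INIT();            /* initialize the collector */
  for (i = 0; i < 100000; ++i) {
    char *p = (char*)GC_MALLOC(1024); /* no matching free() anywhere */
    p[0] = 'x';         /* blocks that become unreachable are reclaimed */
  }
  printf("heap size: %lu\n", (unsigned long)GC_get_heap_size());
  return 0;
}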

Probably I will not use a GC in C. Although GC can be faster, its behaviour is less predictable than manual memory management, which makes me uneasy as I am used to controlling the memory myself. More importantly, the BDW GC does not seem to do boundary checks. When such an error occurs, it will be very difficult to identify the problem, because the GC effectively cripples valgrind, which would otherwise pinpoint the error.

This topic sounds pretty elementary, but I did not know the difference a week ago, so let me explain it here as a record. You may also want to have a look at this page.

Memory can be allocated on the heap or on the stack. When a program calls a function in the malloc() family, the memory is allocated on the heap. When you define a local variable or an array, or call alloca(), the memory is allocated on the stack. The data on the stack are organized in frames: each time a function is called, a frame containing its local data is pushed onto the stack; when the function returns, the frame is popped off. To this end, memory allocated on the stack is transient: pointers to such memory become invalid once the enclosing scope ends.
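The practical consequence is shown below in a minimal example of my own: the stack array vanishes when its function returns, while the heap block survives until it is explicitly freed.

#include <stdio.h>
#include <stdlib.h>

int *bad(void)
{
  int a[4] = { 1, 2, 3, 4 };
  return a;  /* WRONG: a is on the stack and is gone after the return */
}

int *good(void)
{
  int *a = (int*)malloc(4 * sizeof(int));
  a[0] = 1;
  return a;  /* fine: heap memory outlives the call */
}

int main(void)
{
  int *p = good();
  printf("%d\n", p[0]);
  free(p);   /* heap memory must be freed manually */
  /* int *q = bad(); dereferencing q here would be undefined behaviour */
  return 0;
}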

Allocation on the stack is more convenient and cheaper, but the maximum stack size is limited. Allocating a large amount of memory on the stack may cause a stack overflow. In addition, allocating large arrays on the stack may fool valgrind (see here). Usually, large arrays should be allocated on the heap.

A colleague of mine just told me that C++ iostream is typically an order of magnitude slower than printf. His example shows that printing a line of the form “%s\t%d\tabc\t%s\t%s\n” with C++ iostream is 3 times slower than printf in Perl! This observation agrees with my experience, although I have never done any benchmarking myself. I abandoned iostream after I tried it for the first time in my programs.

Update: in another test, C++ iostream is only ~30% slower, which is not too bad. Anyway, be aware that C++ iostream can be very slow in some cases. This thread also provides helpful information.
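For reference, a rough benchmark can be written in a few lines. This is my own sketch along the lines of the test above, not my colleague's code; note that std::ios::sync_with_stdio(false) is a well-known way to make iostream faster when stdio compatibility is not needed.

#include <cstdio>
#include <ctime>
#include <iostream>

int main()
{
  const int N = 1000000;
  std::clock_t t = std::clock();
  for (int i = 0; i < N; ++i) // one tab-delimited line per iteration
    std::printf("%s\t%d\tabc\t%s\t%s\n", "foo", i, "bar", "baz");
  std::fprintf(stderr, "printf:   %.2f s\n", (double)(std::clock() - t) / CLOCKS_PER_SEC);
  t = std::clock();
  // std::ios::sync_with_stdio(false); // uncomment to see the difference
  for (int i = 0; i < N; ++i)
    std::cout << "foo" << '\t' << i << "\tabc\tbar\tbaz\n";
  std::fprintf(stderr, "iostream: %.2f s\n", (double)(std::clock() - t) / CLOCKS_PER_SEC);
  return 0;
}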