Posts Tagged ‘alignment’

Eland is a computer program that aligns short oligonucleotides against a reference genome. It was written by Anthony Cox at Illumina, and the source code is freely available to machine buyers. Eland was the first program for short-read alignment, and it has profoundly influenced most of its successors.


Eland guarantees full sensitivity for hits with up to two mismatches. To achieve this, Eland divides a read into four parts of approximately equal length. Six noncontiguous seed templates can be constructed, each covering two of the four parts. Eland applies each seed template to the reads and indexes the sequences that pass the template. After indexing, it scans the reference genome sequence base by base and looks up each K-mer in the index to find hits. Because any two mismatches can fall into at most two of the four parts, at least one template always covers two mismatch-free parts, so this seeding strategy is guaranteed to find all 2-mismatch hits.
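This pigeonhole argument can be checked mechanically. The sketch below is illustrative only (the part numbering and template encoding are my own, not Eland's actual implementation): it enumerates the C(4,2) = 6 templates and verifies that every placement of two mismatches leaves at least one template untouched.

```python
from itertools import combinations, combinations_with_replacement

# The read is split into four parts (numbered 0-3); each seed template
# covers exactly two parts, giving C(4,2) = 6 noncontiguous templates.
templates = list(combinations(range(4), 2))

def some_template_is_clean(mismatch_parts):
    """True if at least one template avoids every part containing a mismatch."""
    return any(set(t).isdisjoint(mismatch_parts) for t in templates)

# Two mismatches occupy at most two of the four parts, so some template
# always survives -- this is why Eland finds every 0-, 1- and 2-mismatch hit.
for parts in combinations_with_replacement(range(4), 2):
    assert some_template_is_clean(set(parts))
print("all 2-mismatch placements are covered by the 6 templates")
```

Note that three mismatches can touch three parts and defeat every template, which is exactly why the guarantee stops at two mismatches.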

Eland might be the first widely used program to achieve full sensitivity up to a given threshold. Most of Eland's successors learned from this point, and some, such as Maq and SeqMap, even use the same seeding strategy. The SOLiD read-mapping software and ZOOM further extend the idea of noncontiguous seed templates to achieve full sensitivity with fewer templates.


Eland is so fast that it effectively sets the bar for all its successors. Although several programs claim to match its speed, they do not retain the same useful information as Eland. For example, a program frequently reports only the unique best hit without counting the occurrences of a read. Counting greatly helps to reduce false alignments in some cases, but implementing it is non-trivial. Comparing Eland to a program without counting is unfair.
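To make concrete what such counting means, here is a deliberately naive sketch (a brute-force scan, purely illustrative; Eland's actual implementation uses the seeded index described above) that records how many 0-, 1- and 2-mismatch hits a read has in a reference:

```python
# Hypothetical sketch of Eland-style hit counting: report the number of
# 0-, 1- and 2-mismatch hits per read, not only the unique best hit.
def mismatches(a, b):
    """Hamming distance between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def count_hits(read, reference, max_mm=2):
    """counts[k] = number of reference positions with exactly k mismatches."""
    counts = [0] * (max_mm + 1)
    for i in range(len(reference) - len(read) + 1):
        mm = mismatches(read, reference[i:i + len(read)])
        if mm <= max_mm:
            counts[mm] += 1
    return counts

print(count_hits("ACGT", "ACGTACGAACGT"))  # → [2, 1, 0]
```

A read with counts like [2, 1, 0] has two perfect hits, so reporting a single "best" location would silently hide its repetitiveness.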


Eland is not perfect, though. Natively, it supports neither reads longer than 32bp, nor paired-end mapping, nor gapped alignment. Eland provides several scripts to work around these limitations, but at reduced power. In particular, even with the help of the additional scripts, Eland will miss a read's alignment if the first 32bp of the read are non-unique, or if the read has more than 10 identical hits in the reference genome. These limitations leave room for other short-read aligners.

Read Full Post »

Popular multiple alignment programs include: clustalw, T-coffee, dialign [PMID: 18505568, 10222408], muscle [PMID: 15318951], MAFFT [PMID: 18372315], probcons [PMID: 15687296] and probalign [PMID: 16954142]. Which is the best in practice? You can find various benchmarks in the papers listed above; here I only summarize my own understanding, so it is best to read the papers in case my summary is biased. Also, both accuracy and speed may vary a lot across data sets; I can only give results based on the data used in the papers.

Firstly, these programs can be divided into two groups: those performing global multiple alignment and those performing local multiple alignment. In global alignment, every residue of every sequence is required to be aligned; this category includes clustalw, T-coffee, muscle, probcons and probalign. In local alignment, some residues are allowed to remain unaligned, and only segments that are closely related are aligned together; this category includes the dialign family and MAFFT, both of which can also perform global alignment.

Most benchmarks are designed for global aligners. The consensus seems to be:

  • accuracy: probalign>probcons>MAFFT>muscle>T-coffee>>clustalw~dialign
  • speed: clustalw~muscle>MAFFT>dialign>>probcons>probalign~T-coffee

Only MAFFT and dialign perform local alignment. The dialign paper claims that dialign is much more accurate, but I suspect this largely depends on the definition of high-scoring segments: we can always produce a tighter alignment by excluding more residues.

Then, which is the best? Muscle or MAFFT. Although probcons and probalign are more accurate, they are impractically slower than muscle and MAFFT. As for the winner between muscle and MAFFT, I cannot decide; that would also require evaluating memory consumption, usability and stability. I have only used muscle, and it is really nice software!

Read Full Post »

ZOOM [PMID:18684737] is still unavailable even though the manuscript has gone online, so for the time being there is no way to confirm whether its benchmarks are unbiased. Fortunately, we can collect some information from what the authors have presented. In the ZOOM paper, they give the memory consumption of ZOOM: 2.9GB for 12 million short reads. As far as I know, Eland uses only 30~40% of this memory; if Anthony Cox had meant to use more memory, he would have achieved faster speed. In addition, according to this post, ZOOM apparently reports only the unique best hit, whereas Eland counts the occurrences of all the 0-, 1- and 2-mismatch hits anyway. In fact, if you read the Eland source code, you will see that counting occurrences is non-trivial, and it would be much more expensive to implement in ZOOM than in Eland. I bet ZOOM would be slower than Eland if it generated the same output. Of course, it is not always necessary to generate the same output, but reporting unique best hits alone is usually not good enough for some applications, such as detecting structural variations, where we need to know whether there are many suboptimal hits with scores close to the best one.

Although the benchmark is biased, the ZOOM paper itself presents an elegant idea about using non-contiguous seeds, and I believe ZOOM is highly tuned for performance by a group of strong programmers. Furthermore, even if ZOOM were slower than Eland, I would still prefer ZOOM, as it eliminates the hard limits of 32bp read length and at most two mismatches. I do not mind if ZOOM is a bit slower; I just want to see a fair comparison.

Update: ZOOM has been released and I have tried it a bit. The claim in the paper is honest: ZOOM is indeed faster than Eland while allowing longer reads. Nonetheless, ZOOM only reports unique best hits, and therefore my claim here is fair, too: Eland gives more information, which is one of the reasons it is slower. Although ZOOM can output the top N hits, apparently this can only be achieved by dumping all the potential hits, which is impractical for alignment against the human genome.

Note that I pay so much attention to ZOOM mainly because it is a great piece of software. In terms of efficiency and usability, ZOOM is much better than many similar programs that I have not mentioned here.

Read Full Post »

Definition of Pairwise Sequence Search (PSS)

Roughly speaking, given two sets of sequences A and B, the pairwise sequence search (PSS) problem is to find the high-scoring matches between each pair of sequences, one from each set. PSS can, but need not, be done by performing pairwise alignment between every pair of sequences and then retaining only the high-scoring matches. Better algorithms usually index one set of sequences to achieve better time complexity.
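As a minimal sketch of the indexing idea (hypothetical exact-match k-mer seeds, not any specific published algorithm), set A is indexed once and each query from set B is then looked up against the index instead of being aligned against every sequence in A:

```python
from collections import defaultdict

K = 4  # seed length (illustrative choice)

def build_index(sequences):
    """Index set A: map each k-mer to a list of (sequence id, offset)."""
    index = defaultdict(list)
    for sid, seq in enumerate(sequences):
        for i in range(len(seq) - K + 1):
            index[seq[i:i + K]].append((sid, i))
    return index

def search(index, query):
    """Find exact k-mer seed matches of one query from set B against the index."""
    hits = []
    for j in range(len(query) - K + 1):
        for sid, i in index.get(query[j:j + K], []):
            hits.append((sid, i, j))  # (A sequence id, A offset, query offset)
    return hits

idx = build_index(["ACGTACGT", "TTTTACGA"])  # set A, indexed once
print(search(idx, "GTAC"))  # → [(0, 2, 0)]
```

In a real aligner, such seed hits would then be extended into full high-scoring matches; the point here is only that lookups cost far less than all-against-all pairwise alignment.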

Online vs. Offline Search

Sequence search can be online or offline. In online sequence search, neither set of sequences is preprocessed; in offline search, one set of sequences is indexed first and stored in an on-disk file, called the database, and the other set of sequences, called the queries, is aligned against the index. For online search, the time complexity is at least linear in the total length of the larger set. Because the indexing step is separated out from matching, the time complexity of offline search can be better, especially when the total lengths of the two sets are highly unbalanced.

Sometimes there is no clear boundary between online and offline search. Many online search algorithms also involve an indexing step in which one or both sets of sequences are indexed; separating indexing from searching turns such an online algorithm into an offline one. In practice, we usually prefer a separate indexing step if one set of sequences is static, if one set is much larger than the other, or if indexing is very expensive.

Algorithms for Indexing

Most fast search algorithms involve indexing, and the algorithm used for indexing largely determines the algorithm used for searching.

There are three major categories of indexing algorithms: sorting, filtering and suffix-tree-based indexing. The boundaries between them are not clear-cut: for example, sorting usually comes with filtering, a suffix tree can be used to compute filters, and sorting can be used to construct a suffix array.
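As a tiny illustration of the last point, a suffix array can be built by nothing more than sorting suffix start positions (a naive sketch with O(n² log n) behavior; practical constructions are far more efficient):

```python
def suffix_array(s):
    """Return the start positions of all suffixes of s in sorted order."""
    return sorted(range(len(s)), key=lambda i: s[i:])

print(suffix_array("banana"))  # → [5, 3, 1, 0, 4, 2]
```

Once the suffixes are in sorted order, any query substring can be located by binary search over the array, which is one way a sorting-based index supports searching.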

(I will explain the three categories in future posts.)

Read Full Post »