Feeds:
Posts
Comments

Posts Tagged ‘NGS’

Eland is a computer program that aligns short oligonucleotides against a reference genome. It is written by Anthony Cox from the Illumina company. The source codes are freely available to machine buyers. Eland is the first program for short read alignment and it profoundly influences most of its successors.

Algorithm

Eland guarantees full sensitivity for hits with up to two mismatches. To achieve this, Eland divides a read into four parts with approximately equal lengths. Six noncontiguous seed templates can be constructed with each template covering two parts. Eland applies each seed templates on reads and indexes the sequences that pass the template. After indexing it scans the reference genome sequence base by base and then looks up each K-mer in the index to find hits. As any two mismatches can only occur to at most two of the four parts, the seeding strategy used by Eland guarantees to find all 2-mismatch hits.

Eland might be the first widely used program that achieves full sensitivity given a threshold. Most of Eland successors learn from this point, and some software, such as Maq and SeqMap, even use the same seeding strategy. SOLiD read mapping software and ZOOM further extend the idea of using noncontiguous seed templates to achieve full sensitivity with fewer templates.

Efficiency

Eland is so fast that it effectively sets a goal for all its successors. Although several software are claimed to achieve so, they cannot retain the same useful information as Eland. For example, frequently a program only gives the unique best hit without couting the occurrences of a read. Counting greatly helps to reduce false alignment in some cases, but implementing this is non-trivial. Comparing Eland to a program without counting is unfair.

Limitations

Eland is not perfect, though. Natively, it does not support reads longer than 32bp, paired end mapping nor gapped alignment. Eland provides several scripts to overcome these limitations, but the power is reduced. In particular, even with the help of the additional scripts, Eland will miss a read alignment if the first 32bp of the read is nonunique or if the read has more than 10 identical hits in the reference genome. These limitations give the room for other short read aligners.

Read Full Post »

ZOOM [PMID:18684737] is still unavailable even when the manuscript goes online. For the time being, there is no way to confirm whether their benchmarks are unbiased. Fortunately, we can collect some information from what they have presented. In the ZOOM paper, the authors give the memory consumption of ZOOM. It is 2.9GB for 12 million short reads. So far as I know, Eland will only use 30~40% of this memory. If Anthony Cox had meant to use more memory, he would have achieved faster speed. In addition, apparently ZOOM only reports the unique best hit according to this post. Eland will count the occurrences of all the 0-, 1- and 2-mismatch hits anyway. In fact, if you read eland source codes, you will know that counting occurrences is non-trivial and it is much more expensive to implement counting in ZOOM than in Eland. I bet ZOOM would be slower than Eland if it generates the same output as Eland. Of course, it is not always necessary to generate the same output, but detecting unique best hits alone are usually not good enough in some applications, such as detecting structural variations. We need to know whether there are many suboptimal hits with scores close to the best one.

Although the benchmark is biased, the ZOOM paper itself indeed presents an elegant idea about using non-contiguous seeds. I also believe ZOOM is highly tuned for performance by a group of strong programmers. Furthermore, even if ZOOM is slower than Eland, I would still prefer ZOOM as it eliminates the hard limits on the 32bp read length and the maximum two mismatches. I do not mind if ZOOM is a bit slower. I just want to see a fair comparison.

Update: ZOOM has been released and I have tried it a bit. The claim in the paper is honest: ZOOM is indeed faster than Eland while allowing longer reads. Nonetheless, ZOOM only reports unique best hits and therefore my claim here is fair, too: Eland is giving more information, which is one of the reasons that it is slower. Although ZOOM can output N top hits, apparently we can only achieve this by dumping all the potential hits. On human alignment, this is impractical.

Note that I pay a lot of attention to ZOOM mainly because it is a great software. In terms of efficiency and usability, ZOOM is much better than many other similar software that I have not mentioned here.

Read Full Post »