Feeds:
Posts
Comments

C pointer is the most powerful and nasty concept. Whether marstering C points or not separate intermediate C programers from the elementary ones. Want to know whether you have mastered C pointers? Have a look at this program. If the basic idea is clear to you, you are qualified to be an intermediate programmer. If you have difficulty, you should learn hard from other programmers. It is unwise to study this program as a beginner. The catch in the program is too complicated.

This program is adapted from an example in the C bible “The C Programming Language” written by Kernighan & Ritchie. I commented it and extended its function a bit. This allocator is usually not as efficient as malloc family that comes with the system, but it is good enough for a lot of practical applications. Also, this allocator is simple and clear. You can largely predict its behaviour. In comparison, it is not always easy to understand malloc is doing.

I came to know this allocator from Phil Green’s Phrap assembler. I then found the book and reimplemented in my way.

What is Moral?

When I was little, I was taught that moral is the guideline for the majority of people. When I was older, I became skeptical about this definition: would cheating be moral if the majority of people cheated each other? Would the minority be immoral if cheating became moral? According to the definition I learnt cheating could be moral, but this strongly counters my view of the world. Then what is moral?

Now I think I find my own definition. Moral is the guideline that adds the value to the whole human community. With this definition, cheating is immoral as it wastes the human resources and aggregates the expenses on discovering the truth. The behaviour held by the majority people can be immoral.

In the real life, I do see people regard cheating as a normal behaviour from time to time. The argument I heard many times is “if I had been honest, I would not have achieved”. Well, I can do nothing but escaping such people. It is regretful and dangerous when the immoral become the majority while I am not among the majority. I could behave like the majority, but then I would ruin myself.

If you use Emacs, you can convert everything you see on the screen to an HTML file with almost exactly the same look and style. The package you need is htmlize.el. Some Emacs distributions, such as Carbon Emacs, package this module by default. This page shows an example. Really fantastic! It goes far beyond any similar methods I am aware of so far.

ZOOM [PMID:18684737] is still unavailable even when the manuscript goes online. For the time being, there is no way to confirm whether their benchmarks are unbiased. Fortunately, we can collect some information from what they have presented. In the ZOOM paper, the authors give the memory consumption of ZOOM. It is 2.9GB for 12 million short reads. So far as I know, Eland will only use 30~40% of this memory. If Anthony Cox had meant to use more memory, he would have achieved faster speed. In addition, apparently ZOOM only reports the unique best hit according to this post. Eland will count the occurrences of all the 0-, 1- and 2-mismatch hits anyway. In fact, if you read eland source codes, you will know that counting occurrences is non-trivial and it is much more expensive to implement counting in ZOOM than in Eland. I bet ZOOM would be slower than Eland if it generates the same output as Eland. Of course, it is not always necessary to generate the same output, but detecting unique best hits alone are usually not good enough in some applications, such as detecting structural variations. We need to know whether there are many suboptimal hits with scores close to the best one.

Although the benchmark is biased, the ZOOM paper itself indeed presents an elegant idea about using non-contiguous seeds. I also believe ZOOM is highly tuned for performance by a group of strong programmers. Furthermore, even if ZOOM is slower than Eland, I would still prefer ZOOM as it eliminates the hard limits on the 32bp read length and the maximum two mismatches. I do not mind if ZOOM is a bit slower. I just want to see a fair comparison.

Update: ZOOM has been released and I have tried it a bit. The claim in the paper is honest: ZOOM is indeed faster than Eland while allowing longer reads. Nonetheless, ZOOM only reports unique best hits and therefore my claim here is fair, too: Eland is giving more information, which is one of the reasons that it is slower. Although ZOOM can output N top hits, apparently we can only achieve this by dumping all the potential hits. On human alignment, this is impractical.

Note that I pay a lot of attention to ZOOM mainly because it is a great software. In terms of efficiency and usability, ZOOM is much better than many other similar software that I have not mentioned here.

Definition of Pairwise Sequence Search (PSS)

Roughly speaking, given two sets of sequences A and B, the pairwise sequence search (PSS) problem is to find the high-scoring matches between each pair of sequences, one from each set. PSS can, but not necessaily, be done by performing pairwise alignment between each pair of sequences and then retaining high-scoring matches only. Better algorithms usually index one set of sequences to achieve better time complexity.

Online vs. Offline Search

Sequence search can be online or offline. In online sequence search, all the sequences in alignment are not preprocessed; in offline search, one set of sequences can be indexed first and stored in a on-disk file, called database, and the other set of sequences, called queries, are aligned against the index. For online search, the time complexity is at least linear to the sum of lengths of the larger set. As the indexing step is separated out from matching, the time complexity of offline search can be better especially when lengths of the two sets are highly unbalanced.

Sometimes there is no clear edge between online and offline search. Many online search algorithms are also involved with an indexing step where one or both sets of sequences are indexed. Separating indexing and searching will make an online algorithm become an offline one. In practice, we usually prefer separating the indexing step if one set of sequences is static, and if one set of sequences is much larger than the other or indexing is very expensive.

Algorithms for Indexing

Most of fast searching algorithms involve indexing and algorithms used for indexing largely determine algorithms used for searching.

There are three major categories of algorithms for indexing: sorting, filtering and suffix tree based indexing. There are not clear edges between them. For example, sorting usually comes with filtering, suffix tree can be used to compute filters, and sorting can be used to construct suffix array.

(I will explain the three categories in future posts.)

Language War

Language war denotes the debate over which is the best programming language, especially between languages with similar applications like C/C++, C++/Java and Perl/Python. A lot of programmers, including me, are absorbed in this debate. They write articles; they design benchmarks. However, they can never come to a unanimous conclusion. Language war is never ending.

The long-lasting debate itself implies that there is no best programming language. All the mainstream programming languages have their own advantages and disadvantages. The answer to “which is the best programming language” largely depends on how a programmer weights the advantages and disadvantages. It is subjective rather than objective: different programmers weight differently.

However, it is still beneficial to be involved in the language war. The intense debate makes us think over the strength and weakness of programming lanuages we may not be aware of otherwise. I will also post my opinions here in future.