
A quick post: I implemented matrix multiplication in Dart. Dart takes 12 seconds to multiply two 500×500 matrices; LuaJIT does the same job in under 0.2 seconds, and Perl takes 26 seconds. This suggests that Dart fails to JIT-compile the critical loop even though I am trying to use explicit typing. Dart is not yet able to match the performance of V8 and LuaJIT. The source code is appended; it looks almost the same as C++.

mat_transpose(a)
{
	int m = a.length, n = a[0].length; // m rows and n cols
	var b = new List(n);
	for (int j = 0; j < n; ++j) b[j] = new List<double>(m);
	for (int i = 0; i < m; ++i)
		for (int j = 0; j < n; ++j)
			b[j][i] = a[i][j];
	return b;
}

mat_mul(a, b)
{
	int m = a.length, n = a[0].length, s = b.length, t = b[0].length;
	if (n != s) return null;
	var x = new List(m), c = mat_transpose(b);
	for (int i = 0; i < m; ++i) {
		x[i] = new List<double>(t);
		for (int j = 0; j < t; ++j) {
			double sum = 0.0;
			var ai = a[i], cj = c[j];
			for (int k = 0; k < n; ++k) sum += ai[k] * cj[k];
			x[i][j] = sum;
		}
	}
	return x;
}

mat_gen(int n)
{
	var a = new List(n);
	double t = 1.0 / n / n;
	for (int i = 0; i < n; ++i) {
		a[i] = new List<double>(n);
		for (int j = 0; j < n; ++j)
			a[i][j] = t * (i - j) * (i + j);
	}
	return a;
}

main()
{
	int n = 500;
	var a = mat_gen(n);
	var b = mat_gen(n);
	var c = mat_mul(a, b);
	print(c[n~/2][n~/2]);
}

The first Dart SDK was released today. Since the initial announcement, most web developers have been strongly against Dart. The typical argument: JavaScript meets our needs, and even if it does not, there are plenty of other languages that compile to JavaScript. Why do we need a new one? So that Google can take control of it?

While these arguments have merit, I see Dart from the angle of a command-line tool developer. JavaScript, or a language translated to JavaScript such as CoffeeScript, provides no basic file I/O or system utilities, which makes it unsuitable for developing command-line tools at all. A few years ago when I investigated Node.js, it did not provide proper file I/O either (it seems much better now, but I have not tried it). Another problem with JavaScript is that it was not designed for JIT compilation from the beginning. Naively, a language designed with JIT in mind is likely to perform better.

From a quick look, Dart apparently overcomes the major weaknesses of JavaScript mentioned above. It has a clean C++-like syntax with modern language features, inherits the flexibility of JavaScript, supports at least basic I/O and system utilities (perhaps a replacement for Node.js?), and is designed for JIT compilation from the beginning. I have not evaluated its performance, but I would expect it to compete with or outperform V8 in the long run, though the release notes suggest that right now V8 is faster. I will evaluate its performance when I have time.

I have to admit that I am a little anti-Google in general (not much), but I applaud Google’s decision to develop the Dart programming language amidst massively axing other projects. From a quick tour, it seems to be the closest to the ideal programming language in my mind (EDIT: I should add that this is the ideal scripting language; no matter how much I like Dart, I will always use C for performance-critical tasks).

As sorting is the bottleneck of my application, I decided to optimize the code further with reference to this wonderful article. The optimized version is about 40% faster than my original one and is now about 2.5 times as fast as STL’s std::sort on random 32-bit integer arrays. The optimized version is slightly slower than the implementation in that article, but it is much faster on sorted arrays.

For sorting large integer arrays, radix sort is the king. It is much faster than other standard algorithms and is arguably simpler.

#define rstype_t uint64_t // type of the array
#define rskey(x) (x) // specify how to get the integer from rstype_t

#define RS_MIN_SIZE 64 // for an array smaller than this, use insertion sort

typedef struct {
  rstype_t *b, *e; // begin and end of each bucket
} rsbucket_t;

void rs_insertsort(rstype_t *beg, rstype_t *end) // insertion sort
{
  rstype_t *i;
  for (i = beg + 1; i < end; ++i)
    if (rskey(*i) < rskey(*(i - 1))) {
      rstype_t *j, tmp = *i;
      for (j = i; j > beg && rskey(tmp) < rskey(*(j-1)); --j)
        *j = *(j - 1);
      *j = tmp;
    }
}
// sort between [$beg, $end); take radix from ">>$s&((1<<$n_bits)-1)"
void rs_sort(rstype_t *beg, rstype_t *end, int n_bits, int s)
{
  rstype_t *i;
  int size = 1<<n_bits, m = size - 1;
  rsbucket_t *k, b[size], *be = b + size; // b[] keeps all the buckets

  for (k = b; k != be; ++k) k->b = k->e = beg;
  for (i = beg; i != end; ++i) ++b[rskey(*i)>>s&m].e; // count radix
  for (k = b + 1; k != be; ++k) // set start and end of each bucket
    k->e += (k-1)->e - beg, k->b = (k-1)->e;
  for (k = b; k != be;) { // in-place classification based on radix
    if (k->b != k->e) { // elements in this bucket remain to be placed
      rsbucket_t *l;
      if ((l = b + (rskey(*k->b)>>s&m)) != k) { // different bucket
        rstype_t tmp = *k->b, swap;
        do { // swap until we find an element in bucket $k
          swap = tmp; tmp = *l->b; *l->b++ = swap;
          l = b + (rskey(tmp)>>s&m);
        } while (l != k);
        *k->b++ = tmp; // push the found element to $k
      } else ++k->b; // move to the next element in the bucket
    } else ++k; // move to the next bucket
  }
  for (b->b = beg, k = b + 1; k != be; ++k) k->b = (k-1)->e; // reset k->b
  if (s) { // if $s is non-zero, we need to sort buckets
    s = s > n_bits? s - n_bits : 0;
    for (k = b; k != be; ++k)
      if (k->e - k->b > RS_MIN_SIZE) rs_sort(k->b, k->e, n_bits, s);
      else if (k->e - k->b > 1) rs_insertsort(k->b, k->e);
  }
}

void radix_sort(rstype_t *beg, rstype_t *end)
{
  if (end - beg <= RS_MIN_SIZE) rs_insertsort(beg, end);
  else rs_sort(beg, end, 8, sizeof(rskey(*beg)) * 8 - 8);
}

EDIT: I just found this implementation. It is as fast as mine and is simpler. Recommended.

I have recently been working on an algorithm that, surprisingly, spends more than half of its time sorting huge, partially ordered arrays of 64-bit integer pairs (one integer for the key, the other for the value). Naturally, I wanted to optimize sorting such arrays. Initially, I tried my implementation of introsort; the program took about 90 seconds on a sample data set. I then switched to the iterative mergesort in the same library, which took 55 seconds. I guess mergesort is faster here because the arrays are partially ordered. However, my implementation of mergesort requires a temporary array of the same size; as the arrays are huge, allocating this array is unacceptable for real data, and implementing an in-place mergesort is quite challenging. I then thought of radix sort, which I had not implemented before.

My radix sort implementation is here. It is not written as a library, but it should be easy to adapt to other data types. The C code is quite simple and not much different from existing implementations.

How about the performance? With radix sort, my program takes 35 seconds while using little extra working space: a 100% speedup from replacing introsort with integer-only radix sort. To evaluate the performance of radix sort separately, I put the code in my old ksort_test.cc. Here are the CPU seconds spent sorting 50 million random or sorted integers:

Algorithm        Sorted?   Mac CPU (sec)   Linux CPU (sec)
---------------  -------   -------------   ---------------
STL introsort    No        4.9             5.1
STL introsort    Yes       0.9             1.1
STL stablesort   No        6.7             6.1
STL stablesort   Yes       2.0             2.0
STL heapsort     No        54.1            32.2
STL heapsort     Yes       4.5             4.2
libc qsort       No        11.3            9.7
ac’s radix       No        1.9             2.0
ac’s radix       Yes       0.8             0.9
ac’s combsort    No        11.9            11.5
ac’s introsort   No        5.5             5.7
ac’s introsort   Yes       7.4             6.8
ac’s mergesort   No        6.1             6.6
ac’s mergesort   Yes       2.1             2.3
ac’s heapsort    No        54.4            31.8
ac’s heapsort    Yes       4.9             6.6
ph’s heapsort    No        54.6            30.8
ph’s quicksort   No        5.6             5.8
ph’s mergesort   No        7.0             6.9

Please see my old post for information on the other algorithms and implementations. You can also clone my klib repository and play with ksort_test.cc yourself.

Recently I saw Linus’ post on Gnome 3.4, where he complains that it is hard to make fonts smaller. This reminded me of my recent experience installing Mint/Ubuntu in a virtual machine: I had exactly the same problem. In Unity, I wanted the fonts to be smaller, but I found I had to install a third-party tweak tool that is not part of Ubuntu. I deleted the VM immediately.

Linus blamed Gnome for not being customizable enough. I blame Unity, too (in Xfce, it is easy to change font sizes), but I think the more fundamental problem is the default settings of Gnome3/Unity/Xfce. On Mac, I have always been happy with the default system fonts and font sizes (I changed the font settings in only a couple of applications). I have not heard a Windows user complain about the system defaults, either.

No matter how fancy a user interface looks from afar, if the fonts are ugly or the wrong size, it is a complete failure. Why haven’t UI designers realized this simple fact even today? It amazes me that they can live with such a big, ugly default font, and that they love it so much they even disallow end users from changing it!

Sorry for the rant.

The polls were conducted on Hacker News and are still ongoing. The following is a snapshot of the results (generated by this ugly Perl script). The links to the two polls: like and dislike.

What is interesting is that this is so far the only poll I have seen on “disliked” languages.

According to the TIOBE index, D quickly gained popularity around 2007, coincident with the releases of D1, Tango, GDC and LDC. Its popularity has receded since 2009, again coincident with the departure of several key D developers from the community.

Looking back, I think the biggest mistake was made by the Tango developers, who chose not to be compatible with the official standard library at the runtime level. I understand that the Tango runtime is probably more efficient, but asking developers to drop the official library is rare, unreasonable and a little impolite. In my view, the incompatibility also made it much harder to port Tango to D2, which further frustrated D developers. On the other hand, my feeling is that Tango did play a key role in attracting C++ and Java programmers; D was probably flourishing with Tango. When Tango stopped evolving, D’s TIOBE ranking fell back to pre-Tango levels.

In addition, the tension between Tango and Phobos/DMD did not come without reason. The closed nature of Phobos and DMD at the beginning, the slow adoption of 64-bit platforms and the transition from D1 to D2 all led to complaints among developers. While I believe the D1-to-D2 transition was, unfortunately, the right direction overall, the other interfering factors could and should have been avoided. Had they been resolved well, I am sure the D1-to-D2 transition would have gone more smoothly as well, in the belief that we are more likely to find solutions in a friendly atmosphere.

Looking forward, I see that D2 has got over some major obstacles but may also face new challenges. On the up side, D2 and Phobos are more open, D2 is in many ways a better language than D1, official 64-bit support is finished (at least on Linux), Phobos is much more powerful, and Tango has just been ported to D2 without the incompatibility with Phobos. On the down side, however, D2/Phobos is still fast-moving: I do not know whether a D program written today will still compile a year later. The Tango development team has shrunk considerably. D has probably frustrated quite a few early adopters, and it will not be easy to win them back.

As an outsider, I think the first priority for D is to stabilize the compiler and the standard library quickly, to convince everyone that programs written now will still compile years later. Without this guarantee, D2 will remain a niche/hobby language. As a rant, I also think the core development team should at least make D2+Phobos easier to compile and install. I am not bad at installing software, but it took me a couple of hours to figure out how the source trees relate to each other and how to install and configure without root permission. Many newcomers will simply leave without looking back.

I have always thought D the most promising modern programming language, successfully combining extreme efficiency with expressiveness. I continue to hope D can take flight; I have been hoping so for 5 years. Whether D flourishes again or gradually fades away, its history will be a worthy case study for the open source community.

I am a long-time fan of the D programming language, though I have never seriously learned it. One of the many reasons, big and small, that I do not use D is that the official compiler, dmd, is hard to install on the old Linux servers I routinely access. Although a binary has been available for some time, it requires a recent glibc. I tried to compile the source code, but the compiler is split across several source trees and the documentation is quite lousy. It is not as if I can download a tarball, unpack it and run ‘./configure; make; make install’.

The situation has not changed much nowadays (the binary is still not working), but I hope I can help those who have the same problem. I wrote a Makefile to compile the D compiler: it grabs the source code from GitHub and compiles it. If you have an old machine where the binary does not work, you can do the following to compile D (you need a C++ compiler and git, of course):

mkdir -p dmd2; cd dmd2
git clone git://gist.github.com/1848272.git
mv 1848272/compile-d.mak .
make -f compile-d.mak
export PATH="`pwd`/bin:$PATH"

The whole procedure only takes a few minutes, much faster than compiling gcc/g++. Hope it helps, just in case.

I have always dreamed of a stream-opening routine that can open ordinary files, URLs and Unix pipes through the same interface. I started writing such a routine just for fun.

Here are some examples:

#include <unistd.h> /* for close(); kopen() and kclose() come from kopen.c */

extern void *kopen(const char*, int*);
extern int kclose(void*);
void *ko[4];
int i, fd[4];
ko[0] = kopen("file.txt", &fd[0]);
ko[1] = kopen("http://website.com/index.html", &fd[1]);
ko[2] = kopen("ftp://ftpsite.edu/file.txt", &fd[2]);
ko[3] = kopen("<grep test file.txt", &fd[3]);
for (i = 0; i < 4; ++i) {
  close(fd[i]);
  kclose(ko[i]);
}

As shown, the library currently supports file, HTTP, FTP and pipe streams. It returns an object that keeps internal state, plus a file descriptor. You may use “fdopen()”, “gzdopen()” or my generic stream buffer for buffered I/O. For now, it does not work on Windows; I believe on Windows a socket descriptor is not equivalent to a file descriptor as it is on Unix.

This routine is certainly imperfect. Its network-related code is neither robust nor optimized, and there may be a memory leak when calling other executables without using a shell. But the idea is there.

If you are interested, you may download kopen.c, compile with “gcc -D_KO_MAIN -O2 kopen.c -o kopen” and try:

./kopen file.txt
./kopen https://attractivechaos.wordpress.com/
./kopen "<grep man /etc/man.conf"

As you can see, kopen is basically a more versatile “cat” (to some extent).

Mersenne Twister (MT) produces pseudorandom numbers with a much longer period (2^19937-1) than the most widely used one-dimensional linear congruential generators (LCG; typically at most 2^64-1), which underlie rand() and rand48() in libc. Of equal importance, MT approaches the speed of LCG: it is about 40% slower than lrand48() on my Mac laptop and faster than lrand48() on a Linux machine. The Boost developers have also benchmarked various pseudorandom number generators, and their conclusions are broadly similar to mine. All in all, MT is a far better pseudorandom generator.

Takuji Nishimura and Makoto Matsumoto, the fathers of MT, have implemented the algorithm in C for both 32-bit and 64-bit integers. Many others have adapted or improved it. I also wrote a wrapper in klib. To use it:

#include "krand.h"
krand_t *kr = kr_seed((uint64_t)time(0)<<32 ^ getpid());
uint64_t x = kr_rand(kr);
free(kr);

In a multi-threaded context, each thread should call kr_seed() with a different seed to get its own generator. Alternatively, you can call kr_seed() once and share the state, but then every kr_rand() call must be protected by a lock.