First of all, I know little about JITs and VMs. Some of what I say below may well be wrong, so read this blog post with a grain of salt.
My previous microbenchmark showed that Dart was unable to optimize the code to a speed comparable to LuaJIT's. Vyacheslav Egorov commented that the key reason is that I had not “warmed up” the code. John McCutchan further wrote an article about how to perform microbenchmarks, also emphasizing the importance of warm-up. After reading a thread on the Dart mailing list, I now understand the importance of warm-up much better.
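To make the warm-up idea concrete, here is a small sketch (timeMs(), sumTo() and the warm-up count of 10 are my own illustrative choices, not taken from the articles above): call the hot function several times before the timed run, so the VM has a chance to identify it as hot and JIT-compile it first.

```dart
// Sketch: time one call of f(), in milliseconds.
int timeMs(f) {
  var sw = new Stopwatch()..start();
  f();
  return sw.elapsedMilliseconds;
}

main() {
  sumTo(int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += i * 0.5;
    return s;
  }
  var work = () => sumTo(10000000);
  // Warm-up: call the function several times first, so the VM sees it
  // as hot and compiles it to machine code before the timed run.
  for (int i = 0; i < 10; ++i) work();
  print(timeMs(work)); // the timed run now executes optimized code
}
```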
If I am right, JIT compilers can be classified into the more traditional method JITs, whereby the VM compiles one method/function at a time to machine code, and tracing JITs, whereby the VM may optimize a single hot loop. There is a long discussion on Lambda the Ultimate about them. Typically, a method JIT needs to identify hot functions and can only compile a method AFTER a call to it has finished. What I did not know previously is On-Stack Replacement (OSR), with which the VM is able to compile a method (or part of the method?) to machine code while it is running. This in some way blurs the boundary between method JITs and tracing JITs.
Among popular JIT implementations, V8 and Java use method JITs with OSR, while PyPy and LuaJIT use tracing JITs. They are all able to perform well on matrix multiplication even if the hot method is called only once. In my previous post, Dart performed badly because it uses a method JIT without OSR: it is unable to optimize the hot function while that function is being executed. The Dart development team argued that they lack OSR because implementing it is complicated and because “experience with Javascript and Java programs has shown that it very rarely benefits real applications.”
I hold the opposite opinion, strongly. There is no clear distinction between benchmarks and real applications. It is true that in web development a program rarely spends more than a few seconds in a function called only once, but there are more real applications than the web. In my daily work, I may need to do Smith-Waterman alignment between two long sequences, or to compute the first few eigenvalues of a huge positive matrix. The core functions will be called only once. I have also written many one-off scripts having only a main function. Without OSR, I guess Dart won't perform better than Perl/Python in such cases, either. If the Dart development team wants Dart to be widely adopted beyond web development, OSR will be a key feature (well, a general-purpose language may not be the goal of Dart, which would be a pity!). I wholeheartedly hope they can implement OSR in the future.
Fortunately, until OSR gets implemented in Dart (if ever), there is a simpler and more practical solution than warm-up: hoisting the body of the hot loop into a separate function, so that Dart compiles that function to machine code after it has been called a few times (though to do this, you need to know which loop is hot).
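Schematically, the transformation looks like the toy sketch below (sumAll1, sumAll2 and sumRow are my own illustrative names, not part of the matrix code at the end of the post):

```dart
// Before: the nested hot loop lives in a function called only once,
// so without OSR the VM never gets a chance to optimize it.
sumAll1(rows) {
  double sum = 0.0;
  for (var r in rows)
    for (int i = 0; i < r.length; ++i) sum += r[i];
  return sum;
}

// After: the inner loop is hoisted into sumRow(), which is called once
// per row; after a few calls the VM marks it hot and compiles it.
sumRow(r) {
  double sum = 0.0;
  for (int i = 0; i < r.length; ++i) sum += r[i];
  return sum;
}

sumAll2(rows) {
  double sum = 0.0;
  for (var r in rows) sum += sumRow(r);
  return sum;
}

main() {
  var rows = [[1.0, 2.0], [3.0, 4.0]];
  print(sumAll1(rows)); // 10.0
  print(sumAll2(rows)); // 10.0
}
```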
At the end of the post is an updated implementation of matrix multiplication, where mat_mul1() and mat_mul2() have the same functionality but differ in their use of an inner function. The new implementation (mat_mul2) multiplies two 500×500 matrices in 1.0 second, as opposed to 14 seconds for the old one (mat_mul1). This is still much slower than LuaJIT (0.2 seconds) and V8 (0.3 seconds), but I would expect Dart to catch up in the future. Actually, Vyacheslav commented that a nightly build may have already achieved, or approached, that.
SUMMARY: As of now, Dart only compiles an entire method to machine code; it cannot compile a method while the method is running. Therefore, if the hot method is called only once, it will never be compiled and you will experience bad performance. An effective workaround is to hoist the body of the hot loop into a separate function, so that Dart can compile that function after it has been executed a few times.
mat_transpose(a) {
  int m = a.length, n = a[0].length; // m rows and n cols
  var b = new List(n);
  for (int j = 0; j < n; ++j) b[j] = new List<double>(m);
  for (int i = 0; i < m; ++i)
    for (int j = 0; j < n; ++j)
      b[j][i] = a[i][j];
  return b;
}

mat_mul1(a, b) {
  int m = a.length, n = a[0].length, s = b.length, t = b[0].length;
  if (n != s) return null;
  var x = new List(m), c = mat_transpose(b);
  for (int i = 0; i < m; ++i) {
    x[i] = new List<double>(t);
    for (int j = 0; j < t; ++j) {
      double sum = 0.0;
      for (int k = 0; k < n; ++k) sum += a[i][k] * c[j][k];
      x[i][j] = sum;
    }
  }
  return x;
}

mat_mul2(a, b) {
  inner_loop(t, n, ai, c) {
    var xi = new List<double>(t);
    for (int j = 0; j < t; ++j) {
      double sum = 0.0;
      for (int k = 0; k < n; ++k) sum += ai[k] * c[j][k];
      xi[j] = sum;
    }
    return xi;
  }
  int m = a.length, n = a[0].length, s = b.length, t = b[0].length;
  if (n != s) return null;
  var x = new List(m), c = mat_transpose(b);
  for (int i = 0; i < m; ++i) x[i] = inner_loop(t, n, a[i], c);
  return x;
}

mat_gen(int n) {
  var a = new List(n);
  double t = 1.0 / n / n;
  for (int i = 0; i < n; ++i) {
    a[i] = new List<double>(n);
    for (int j = 0; j < n; ++j) a[i][j] = t * (i - j) * (i + j);
  }
  return a;
}

main() {
  int n = 500;
  var a = mat_gen(n), b = mat_gen(n);
  var c = mat_mul2(a, b);
  print(c[n ~/ 2][n ~/ 2]);
}
One thing that I would still recommend is to use Float64List from dart:scalarlist instead of List (see #4 in my previous comment https://attractivechaos.wordpress.com/2012/10/18/initial-evaluation-of-the-dart-performance/#comment-1139 for an explanation of why this matters).
As for OSR, I *personally* believe that it is required for predictable performance, to cover cases where a hot loop has to be optimized while “in flight”; but I also believe that these cases are indeed extremely rare (if you don't count benchmarks) and are mostly related to heavy number-crunching code, which is not that common on the web (either server side or client side). However, my experience with V8 shows that real-world cases where OSR is crucial do exist.