
I happened to read a great article by Joel Webber in which he benchmarks the real-world performance of the Box2D library in C++, Java and JavaScript. Note that the benchmark only evaluates the physics computation, excluding the time spent on rendering the animation.

The take-home message is that C++ is about 20 times faster than Chrome's V8 JavaScript engine. I guess rendering is also much slower in browsers; no wonder Angry Birds, which uses Box2D, is sluggish in Chrome. The more I think about it, the more I feel Google's Dart is technically justified, though it may turn out to be vaporware in the end.

I read an interesting article about Google's new Dart. It made me think about what the ideal programming language would be, in my mind. I am not simply imagining the fastest, smallest and simplest thing, which is impossible; I am trying to be a little practical.

I will start by explaining why the existing languages are not ideal, to me.

  • C/C++. No garbage collector (hence segmentation faults). Too verbose.
  • Java. A memory hog, with too many restrictions. Another reason I do not like Java is that I always feel delays when launching the VM or when using GUIs. Java is fast enough, but I do not know where the delays come from.
  • C#. I do not use Windows. I have heard that C# GUIs are much more responsive than Java's on Windows, but we do not have a good cross-platform VM: Mono falls far behind Microsoft's VM in efficiency.
  • D and Go. They are closer to my expectation, but as compiled languages they are still too verbose.
  • JavaScript and Lua. They are closer to my expectation in another direction, but for real-world applications they are still too slow in comparison to C.
  • Python, Perl and Ruby. They are simply too sloooow most of the time.

Then what is in my mind?

  • Has C/C++-like syntax (mainly because I am a C programmer).
  • Does not require explicit compilation into native or byte code.
  • Is garbage collected.
  • Allows very simple/short programs (no "main(){}" boilerplate for simple things).
  • Allows dynamic typing.
  • Up to this point, many existing languages meet my requirements. Then comes the requirement of speed: the ideal language should be no more than three times as slow as C for real-world applications (not microbenchmarks). An immediate question is whether this is possible at all. I think it is. In my fairly naive view, LuaJIT and V8 are still much slower than C mainly because Lua and JavaScript were not designed for speed comparable to Java's and were not designed for JIT compilation from the beginning; it is hard to retrofit what a language was not designed for. With a careful design focused on speed and closely coupled with the modern theory of JIT compilation, it should be possible to go far ahead of LuaJIT and V8, two of the best JIT compilers so far.
  • Optional static typing. This may be necessary to guarantee high performance when we really need it.
  • Mutable strings or byte arrays. This request is more for my own work; I work with mutable strings a lot.

Why was I thinking about my ideal programming language when I read an article about Dart? Because if Dart turns out to be something real, it may be the right language for me. There are things I do not like (e.g. required semicolons, the required main(), the lack of a File class), but it seems closer than the others. To get an efficient implementation, we need a fresh start.

Mutex and spinlock

When multiple threads need to modify a shared resource such as a global variable, we typically lock the resource so that at most one thread can modify it at a time; otherwise threads may interfere with each other and lead to unexpected behavior. There are different types of locks. For Linux C programmers, probably the most frequently used lock is the mutex from NPTL, the Native POSIX Thread Library.

In NPTL, the mutex is implemented on top of futex, a kernel primitive for assisting the implementation of locking. Essentially, futex provides a way to ask the kernel to block a thread until the value at a given address changes, and to wake up threads waiting on an address. The implementation of the mutex is kernel-dependent and complex; I cannot think of a way to implement a mutex without closely interacting with the kernel.

Another, much simpler type of lock is the spinlock. The fundamental difference is that a spinlock keeps checking the lock (busy waiting), while a mutex puts threads waiting for the lock to sleep (blocked). A busy-waiting thread wastes CPU cycles; a blocked thread does not. Apart from this difference, mutex and spinlock behave very similarly.

According to POSIX, implementing the spinlock in a pthread library is optional. Mac OS X does not implement the spinlock in its pthread library; instead, it provides spinlocks through several OS-specific functions, which does cause portability issues.

Fortunately, a spinlock is much easier to implement than a mutex. The basic form can be written in a dozen or so instructions on an x86 CPU; indeed, this is what NPTL does, and the Linux kernel implements its spinlock in a similar way. It is also fairly easy to implement a spinlock yourself if you do not care too much about portability:

volatile int lock = 0;
void *worker(void *data)
{
  while (__sync_lock_test_and_set(&lock, 1)); // spin until the lock is acquired
  // critical section
  __sync_lock_release(&lock);
  return NULL;
}

where __sync_lock_test_and_set() and __sync_lock_release() are GCC atomic builtins that guarantee atomic memory access. In the code above, the volatile keyword indicates that the lock variable may be changed by other threads. __sync_lock_test_and_set(&lock, 1) atomically sets lock to 1 and returns its previous value. If the critical section has been locked (i.e. lock==1) by another thread, we keep looping until the thread holding the lock calls __sync_lock_release(), which sets lock back to 0.

A probably better implementation is

while (__sync_lock_test_and_set(&lock, 1)) while (lock);
// critical section
__sync_lock_release(&lock);

It is also straightforward to implement a spinlock using only the compare-and-swap (CAS) instruction, though this way is a little slower. There are more discussions on the implementation of spinlocks elsewhere.
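A CAS-only spinlock can be sketched as follows. This is my own hedged illustration using GCC's builtins, not NPTL's actual code; it combines CAS with the test-and-test-and-set idea from the previous snippet:

```c
/* CAS-based spinlock sketch using GCC atomic builtins (illustrative only). */
volatile int cas_lock = 0;

void cas_lock_acquire(volatile int *lock)
{
  /* Try to change 0 -> 1 atomically; the builtin returns the old value,
     so we own the lock only when it returns 0. */
  while (__sync_val_compare_and_swap(lock, 0, 1) != 0)
    while (*lock); /* spin on a plain read to reduce bus traffic */
}

void cas_lock_release(volatile int *lock)
{
  __sync_lock_release(lock); /* atomically writes 0 with release semantics */
}
```

The inner plain-read loop is why this is slightly cheaper than hammering the CAS itself: failed CAS attempts force cache-line ownership transfers, while plain reads mostly hit the local cache.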

Evaluating mutex, spinlock and semaphore

Apart from blocking vs. busy-waiting, mutex and spinlock have very similar effects and can usually (though not always) replace each other. Which lock is preferred, then? In addition, a semaphore can also be used as a mutex; is it faster or slower? More generally, how much overhead does locking add in comparison to a single-threaded program? The following benchmark aims to give a hint about these questions.

Design of the benchmark

The task is simple: we want to multithread the following code:

int n = 1000000, m = 100, i;
uint64_t *bits;
bits = (uint64_t*)calloc((n + 63) / 64, 8);
for (i = 0; i < n*m; ++i) {
  int x = (uint64_t)i * i * i % n;
  bits[x>>6] ^= 1LLU << (x & 0x3f);
}

The `bits` array will be shared across threads and needs to be protected. We are going to evaluate the following strategies:

  Method        | Description
  --------------+------------------------------------------------------------
  Single        | Single-threaded version
  Lock-free     | Using GCC's atomic builtins without locking
  Spin          | Spinlock (own implementation)
  Pthread spin  | Spinlock from pthread
  Pthread mutex | Mutex from pthread
  Semaphore     | Semaphore used as a lock
  Buffer+spin   | Buffering numbers for less frequent locking, with a spinlock
  Buffer+mutex  | Buffering numbers for less frequent locking, with a mutex

The C program is available in my GitHub repository as a single C file.
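For reference, the lock-free strategy replaces the bitmap's read-modify-write with GCC's atomic XOR builtin. The sketch below is my illustration of the idea; the actual benchmark program may differ in details:

```c
#include <stdint.h>

/* Flip bit x in a bitmap shared by many threads, without taking a lock.
   __sync_fetch_and_xor performs the read-modify-write atomically, so two
   threads flipping different (or even the same) bit cannot lose an update. */
void flip_bit_atomic(uint64_t *bits, int x)
{
  __sync_fetch_and_xor(&bits[x>>6], 1ULL << (x & 0x3f));
}
```

This is only possible here because the update is a single XOR; a multi-step update of the shared state would still need a lock or a more elaborate lock-free algorithm.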

Results and discussions

You can find the full results at the end of the source code file. Part of the results is shown in the following table, where the "OSX-1" column gives the real / CPU time in seconds on Mac OS X with a single thread, "Linux-2" on Linux with 2 threads, and so on. More detailed information about the platforms is given in the source code file.

  Method        | OSX-1     | OSX-2     | Linux-1   | Linux-2     | Linux-4
  --------------+-----------+-----------+-----------+-------------+-----------
  Single        | 1.4 / 1.4 | N/A       | 1.2 / 1.1 | N/A         | N/A
  Lock-free     | 3.8 / 3.8 | 2.2 / 4.2 | 3.6 / 3.6 | 3.7 / 7.3   | 2.5 / 9.9
  Spin          | 5.8 / 5.7 | 5.1 / 9.5 | 5.1 / 5.1 | 20.7 / 41.2 | 32.7 / 130
  Pthread spin  | N/A       | N/A       | 3.8 / 3.8 | 21.1 / 42.2 | 53.8 / 206
  Pthread mutex | 7.5 / 7.5 | 182 / 271 | 6.0 / 5.9 | 31.0 / 45.1 | 97.2 / 284
  Semaphore     | 85 / 85   | 51 / 96   | 5.5 / 5.4 | 46.0 / 68.9 | 133 / 418
  Buffer+spin   | 1.3 / 1.3 | 0.7 / 1.3 | 1.2 / 1.2 | 0.7 / 1.4   | 0.5 / 2.0
  Buffer+mutex  | 1.3 / 1.3 | 0.7 / 1.4 | 1.2 / 1.2 | 0.7 / 1.4   | 0.5 / 1.5

Here is what I have learned from this microbenchmark:

  • For multi-threaded programming, it is preferable to let threads interact with each other less, as the CPU pays a price for frequent cross-talk between threads. This is true even without any lock (compare lock-free vs. buffer+mutex); frequent locking/unlocking is much worse (compare mutex vs. buffer+mutex). A proper design of the algorithm matters more than the choice between spinlock, mutex and semaphore.
  • In this benchmark, the lock-free implementation is less efficient than buffer+mutex, but a lock-free algorithm may scale better to many CPU cores, because locking becomes more frequent with more concurrent threads. When frequent locking/unlocking can hardly be avoided, a lock-free algorithm, if one is possible in the first place, may also deliver better performance.
  • When locking is infrequent, the type of lock does not matter much – little time goes to locking anyway. When locking is frequent, it is worth exploring both spinlock and mutex. Although a mutex is usually safer and more portable, a spinlock may be faster (careless micro-benchmarks actually tend to favor spinlocks). Nonetheless, let me emphasize again: we should first try our best to reduce cross-talk between threads, which is more effective than playing with different types of locks.
  • The performance of locking is CPU-, OS- and library-dependent. It is worth testing our programs on multiple platforms.
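The buffering strategy that wins the benchmark can be sketched as follows. This is my hedged illustration of the idea, not the benchmark program itself: each thread accumulates indices in a private buffer and takes the lock once per batch instead of once per update.

```c
#include <pthread.h>
#include <stdint.h>

pthread_mutex_t bits_lock = PTHREAD_MUTEX_INITIALIZER;

/* Apply a batch of n bit flips to the shared bitmap under a single lock
   acquisition, instead of locking once per flip. */
void flush_buffer(uint64_t *bits, const int *buf, int n)
{
  pthread_mutex_lock(&bits_lock);
  for (int i = 0; i < n; ++i)
    bits[buf[i]>>6] ^= 1ULL << (buf[i] & 0x3f);
  pthread_mutex_unlock(&bits_lock);
}
```

Each worker fills a thread-local buffer of, say, 256 indices and calls flush_buffer() when it is full, cutting the locking frequency by a factor of 256. This is why buffer+mutex and buffer+spin perform almost identically in the table above: the lock is taken so rarely that its type no longer matters.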

Zilong Tan has kindly shared his observation that using a power-of-2 bucket count may dramatically improve the speed for integer keys; he has implemented this idea in his ulib library. He is right. Here is a tiny benchmark done on my laptop (Intel Core 2 Duo; Mac OS X 10.6; gcc 4.2.1; program khash_test.c):

  Version      | Int-set-5M (sec) | Str-set-5M | Int-set-4M | Int-set-6M
  -------------+------------------+------------+------------+-----------
  0.2.5        | 0.898            | 2.174      | 0.593      | 1.327
  0.2.6        | 0.476            | 1.814      | 0.285      | 0.664
  0.2.6 linear | 0.333            | 1.744      | 0.250      | 0.455

where "0.2.6" is the new version and "0.2.6 linear" indicates that linear probing is in use (by default, khash uses double hashing for stability). We can see that 0.2.6 is nearly twice as fast as 0.2.5 for an integer hash set. Linear probing is even faster than the default double hashing due to better cache performance and slightly fewer CPU instructions, but double hashing tends to be more robust to certain non-random input.
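Why does a power-of-2 bucket count help? Mapping a hash value to a bucket normally requires an integer modulo, i.e. a division, which is one of the slowest integer instructions; when the bucket count is a power of 2, the same mapping is a single AND. A generic sketch (not khash's actual code):

```c
#include <stdint.h>

/* Arbitrary bucket count: needs an integer division, which can cost
   tens of cycles on the CPUs of this era. */
uint32_t bucket_mod(uint32_t hash, uint32_t n_buckets)
{
  return hash % n_buckets;
}

/* Power-of-2 bucket count: the same mapping becomes a one-cycle mask.
   Requires n_buckets == 2^k; the low bits of the hash must then be
   well mixed, or clustering will hurt. */
uint32_t bucket_mask(uint32_t hash, uint32_t n_buckets)
{
  return hash & (n_buckets - 1);
}
```

The caveat in the comment is real: with masking, only the low bits of the hash select the bucket, so a weak integer hash function may need extra mixing.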

The new version is available here.

Update: Note that due to rehashing, the performance of a hash table as a function of input size is not a smooth curve, and rehashing takes time. I have therefore added two more data points to show that the speedup is not an artifact of this caveat.

I read a news story that Mozilla is proposing WebAPIs in the hope of replacing native apps with web apps. There has been a long-standing tension between web apps and native apps, and I have read quite a lot of technical articles and readers' comments on this topic over the past few years. What amazes me is that almost nobody mentions the difference in responsiveness. I have always disliked the slowness of web interfaces. Although the difference from a native app is only a split second, the delay greatly hurts my experience. Moreover, my feeling is that web/Flash-based games take more CPU resources than games written in "low-level" languages. For example, I heard the fan of my laptop roaring when I left Angry Birds on the screen (a few birds jumping every few seconds), and Entanglement took 100% of a CPU while doing essentially nothing (no movable elements at all). Running these games on a mobile phone would quickly drain the precious battery life. Perhaps I am alone, as almost no readers mention these issues?

Anyway, I do not think we have reached the point of replacing native apps. Although web apps ease cross-platform development, users' experience is far more important. Mozilla's effort is in vain.

I have added a web interface to my JavaScript Sudoku solver. It is available here. You can input puzzles in multiple ways, for example via direct input, a URL or text selection. It should be the fastest JavaScript solver to date.

With the completion of the Sudoku-solving benchmark (my last post), my programming language benchmark can also be regarded as complete (still with a few implementations missing). This post gives more context and analysis of the benchmark.

Design

This benchmark is comprised of four tasks:

  1. solving 1000 Sudoku puzzles
  2. multiplying two 1000×1000 matrices
  3. matching URI or URI|Email in a concatenated Linux HowTo file
  4. counting the occurrences of words using a dictionary

The first two tasks focus on how well an implementation translates source code into machine code; for them, most of the CPU time is spent in the benchmarking programs themselves. The last two tasks focus on the efficiency of the companion libraries; for them, most of the CPU time is spent in library routines. These tasks are relatively simple and cannot easily be hand-optimized for better performance.

Results and discussions

The complete results are available here. The following figure shows the CPU time for Sudoku solving and matrix multiplication, both evaluating the language implementation itself (click for a larger figure):

In the plots, a number in red indicates that the corresponding implementation requires explicit compilation; blue indicates that the implementation applies Just-In-Time (JIT) compilation; black indicates that the implementation interprets the program without JIT.

The overall message is the following. Languages compiled into machine code (C and D) are slightly faster than languages compiled into bytecode (Java and C#); compilers tend to be faster than JIT-based interpreters (LuaJIT, PyPy and V8); JIT-based interpreters are much faster than conventional interpreters (Perl, CPython and Ruby). Among compilers, C is still the winner, by a thin margin. Among interpreters, LuaJIT and V8 pull ahead. There are few surprises for most language implementations, except perhaps the few with very bad performance.

On the other hand, the comparison of the library performance yields a vastly different picture (again, click to enlarge):

This time, even conventional interpreters may approach or even surpass an optimized C implementation (Perl vs. C for simple regex matching). Some compiled languages at an early age may perform badly.

Conclusions

The quality of libraries is a critical part of a programming language. This benchmark is one of the few that clearly separate the performance of the language implementation itself from that of its companion libraries. While compiled languages are typically one to two orders of magnitude faster than interpreted languages, library performance may be very similar. For algorithms that rely heavily on library routines, the choice of programming language does not matter too much, and it is quite possible to come up with a benchmark in which some other language beats C/C++ in a certain application.

All the benchmarking programs are distributed under the MIT/X11 license. Please follow the links below for the source code and the complete results:

There is actually more to say about each specific language implementation, but perhaps I'd better leave the controversial part to readers.

Introduction

I conducted a programming language benchmark a couple of months ago. One of the major concerns was that it is a microbenchmark that rarely measures real-world performance. When I saw Google's recent publication (PDF) on a real-world comparison of C++, Java, Scala and Go, I wondered if I could adopt those problems in my benchmark. I quickly gave up, because of their complexity and because I am not very interested in the problem itself. Is there an interesting problem that is easy to implement yet informative for measuring the performance of programming languages? Sudoku.

Sudoku is probably the most popular puzzle nowadays, and many attempts have been made to solve it with programs. However, solving Sudoku (in its generalized n²×n² form) is known to be NP-complete, which means no efficient general algorithm is known; finding the solution largely relies on black magic. On the other hand, the lack of a definite answer makes Sudoku more interesting, and a variety of methods have arisen to solve it. Wikipedia gives a good brief overview of the existing algorithms; I will focus more on practical implementations.

A brief overview of Sudoku solving algorithms

Solving Sudoku puzzles can be reduced to a graph coloring problem, or more often to an exact cover problem, which can be solved with Knuth's elegant Dancing Links algorithm (DLX). There are DLX-based solvers in C, C++, Java, C#, Python, Ruby and JavaScript; a Google search will give you many more. These solvers are usually fast, but the few I have tried are not the fastest. To the best of my knowledge, the fastest implementations in each programming language are all based on generic backtracking without DLX. The basic idea behind these fast solvers is simple: we select the cell or the constraint with the fewest choices, fix it to one number and move forward; when a cell is left with no choice, we go back to the previous step and try the next allowed number. The key to fast solving is to quickly find cells that can be determined with simple logic, including at least the naked single and hidden single rules. Fast solvers may employ slightly more advanced logic (e.g. locked candidates) when it can be computed quickly. The rest is data structures and details, which vary a lot and also matter a lot. It is this great variability in algorithms and implementations that makes Sudoku fascinating.
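The "select the cell with the fewest choices" step above can be sketched with 9-bit candidate masks. This is my simplified illustration, not kudoku's actual data structure (which tracks constraints as well as cells):

```c
/* Each cell's candidates are a 9-bit mask: bit d set means digit d+1 is
   still allowed in that cell. */
int popcount9(int m)
{
  int c = 0;
  for (; m; m &= m - 1) ++c; /* clear the lowest set bit each round */
  return c;
}

/* Return the index of the unsolved cell with the fewest candidates,
   or -1 if every cell is solved. The backtracking solver branches on
   this cell, so fewer candidates means a smaller search tree. */
int best_cell(const int cand[81], const int solved[81])
{
  int best = -1, min = 10;
  for (int i = 0; i < 81; ++i) {
    if (solved[i]) continue;
    int c = popcount9(cand[i]);
    if (c < min) { min = c; best = i; }
    if (min == 1) break; /* a naked single: we cannot do better */
  }
  return best;
}
```

A cell with exactly one candidate is a naked single and is filled in without branching; a cell with zero candidates signals a dead end and triggers backtracking.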

Benchmarking several Sudoku solvers

To demonstrate how fast a Sudoku can be solved and how the speed is affected by the implementation, I briefly evaluated the performance of a few solvers using 20 very hard Sudoku puzzles repeated 50 times (1000 = 20×50 Sudokus in total). Some of these solvers are casual implementations and are not expected to perform well; I chose them to show the impact of algorithms and implementations. The following table shows the results of solving these 1000 very hard Sudokus.

  Solver          | Algorithm    | Test unique? | Language   | Time (sec)
  ----------------+--------------+--------------+------------+-----------
  JSolve          | backtracking | Yes          | C          | 0.25
  kudoku          | backtracking | Yes          | C          | 1.1
  fast_solv_9r2   | DLX          | Yes          | C          | 1.4
  kudoku          | backtracking | Yes          | Java       | 2.0
  kudoku          | backtracking | Yes          | JavaScript | 6.3
  kudoku          | backtracking | Yes          | Lua        | 7.5
  sudoku-bb       | backtracking | No           | Python     | 33.5
  sudoku-gh       | DLX          | Yes          | JavaScript | 41.2
  Peter Norvig's  | backtracking | No           | Python     | 147.4
  kudoku          | backtracking | Yes          | Python     | 190.5
  sudoku-pk       | backtracking | No           | JavaScript | 278.4
  sudoku-aa       | DLX          | Yes          | Python     | 514.0
  sudoku_db       | backtracking | No           | Python     | 1915
  Sudoku_bc       | backtracking | Yes          | Java       | 2180
  Sudoku_dl       | unknown      | No           | Java       | 2975
  Sudoku_6l       | brute-force  | No           | Java       | 25385

The third column indicates whether the solver only finds the first solution or confirms that the input Sudoku has a unique solution. Finding all solutions is typically 50-100% slower than stopping at the first one.

Discussions

So far as I know, JSolve and fsss are the fastest Sudoku solvers to date. JSolve is based on BB_sudoku, Brian's famous bit-based Sudoku solver. These three solvers are all based on generic backtracking instead of DLX, and all add advanced tactics on top of the basic naked/hidden single rules.

Kudoku is my own solver. The basic idea comes from Guenter Stertenbrink's suexk, with some improvements. Kudoku was first implemented in C and then ported to a few other programming languages. It is not the fastest solver, but it is much simpler than JSolve and fsss.

The speed difference between the four Java implementations is striking: my solver is 1000-10000 times faster than the other three. This shows how important the algorithm is to solving Sudoku puzzles; choosing the right algorithm can make a huge difference.

Peter Norvig has also implemented a nice solver in Python, which has been ported to Ruby and JavaScript among other languages; "sudoku-pk.js" is one of the JavaScript ports. Note that Peter's solver stops at the first solution (if I am right), while Kudoku finds all solutions. If I make Kudoku find one solution only, it takes 100 seconds, faster than the other Python implementations.

EDIT: After I published this post, Boris Borcic kindly sent me his Python solver (sudoku-bb in the table above), which is easily the fastest among the Python solvers I know of. This solver is extremely interesting for two reasons. Firstly, algorithmically Boris' solver is not much different from mine: both are based on the exact-cover matrix, without dancing links. The major difference is that sudoku-bb makes more use of Python's native functions, while my version, which is literally translated from C, relies heavily on deeply nested loops (because I do not know Python well). Although my version has a smaller time complexity (I believe) and is ~50% faster under PyPy, sudoku-bb is 3X as fast under CPython. This example shows how important it is to make the best use of native routines.

Secondly, and more importantly, Boris' solver relies on a very interesting property of Sudoku: if the puzzle has a unique solution, applying the "naked single" and "hidden single" rules guarantees that at each step there are at most two choices. Neither he nor I can prove this, but it holds for ~100,000 hard Sudoku puzzles. I do not know if this is a known result (we should ask mathematicians). Using this property can simplify the coding, but it disables unique-solution checking.

Conclusions

  • The fastest Sudoku solvers can solve even the hardest Sudoku in about 1 millisecond and most others in 0.1 millisecond.
  • An inefficient algorithm can be 3 to 4 orders of magnitude slower than an efficient one, so choose the algorithm carefully.
  • I have added Sudoku solving to my programming language benchmark as a performance measurement for a real-world application (incomplete right now). The JavaScript solver is also available online. You may get a sense of how fast Sudoku puzzles can be solved, unless you are using IE 6/7/8, which are known to have crappy JavaScript engines.

After the withdrawal of LuaJIT, PyPy and a few other great implementations from the Computer Language Benchmarks Game, we are virtually left with no benchmark that compares performance both between languages and between implementations of the same language. Without both types of comparison, we cannot tell which programming language performs best in practice, because languages and implementations are closely related.

In the end, I decided to benchmark programming languages myself. Although the benchmark is very primitive and limited right now, in my opinion it is not very misleading and thus better than nothing.

In the current benchmark, I evaluate the speed of multiplying two 1000×1000 matrices using the standard method (there are asymptotically faster algorithms). The benchmark covers 11 programming languages: C (icc, gcc and clang), C# (Mono), D (gdc), Go (6g), Java, JavaScript (V8, JaegerMonkey and Rhino), Lua (Lua and LuaJIT), Perl, Python (CPython 2/3, PyPy, Jython and IronPython on Mono), Ruby (Ruby, JRuby, Rubinius and IronRuby on Mono) and R.

Note that this is the first time I have written C#, Go and Ruby, and I am very unfamiliar with D, R, Python and Java. It is quite likely that I did something silly, even though matrix multiplication sounds easy. Please feel free to correct me. Thank you.
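For concreteness, the "standard method" is the textbook triple loop. The C sketch below is my illustration, not necessarily identical to every benchmarked program; it uses a transposed copy of B, a common trick that makes both operands of the inner loop sequential in memory:

```c
#include <stdlib.h>

/* Standard O(n^3) product of two n-by-n matrices stored row-major:
   c = a * b. B is copied transposed so the k-loop reads both a and bt
   with stride 1 (cache-friendly). */
void matmul(int n, const double *a, const double *b, double *c)
{
  double *bt = (double*)malloc(n * n * sizeof(double));
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
      bt[j*n + i] = b[i*n + j];
  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
      double s = 0.;
      for (int k = 0; k < n; ++k)
        s += a[i*n + k] * bt[j*n + k];
      c[i*n + j] = s;
    }
  }
  free(bt);
}
```

Whether an implementation can express this loop nest efficiently (native arrays vs. boxed lists, bounds checks, etc.) is exactly what the benchmark ends up measuring.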

Results: http://attractivechaos.github.com/plb/
Source code: https://github.com/attractivechaos/plb/

EDIT: Results are also in this post.

A few hours ago, I wanted to show my friends how speedy LuaJIT is. Naturally, I went to the Computer Language Benchmarks Game website, but to my great surprise, LuaJIT was gone. I then posted a brief question on this blog asking why, and Sindisil very kindly pointed me to this informative discussion. In short, LuaJIT, along with a few other great language implementations, is gone and will probably never be back.

I have been watching this wonderful benchmark suite for several years and have benefited greatly from it. I really appreciate all that Isaac, the owner of the website, has done; I know it must have taken a lot of his own time.

However, with the withdrawal of various language implementations, I have to say the benchmark is not of much use now. The great virtue of the old benchmark suite was that we could compare different languages AND different implementations of the same language. Take JavaScript as an example: there are multiple JavaScript engines, and various benchmarks compare them to each other, but we are also interested in knowing whether the best JavaScript engine is comparable to the best implementations of other scripting languages. For a long time, the best place to learn this was the Computer Language Benchmarks Game. Unfortunately, not any more. Now the site suggests that V8 is the fastest scripting-language implementation, 5X as fast as Lua, simply because LuaJIT is not there any more. The new benchmark is thus not only of little use but also misleading.

This reddit discussion and a post from a core PyPy developer also point out weaknesses of the benchmark, such as over-optimization for one implementation, the use of foreign libraries (especially C libraries), different coding quality for different languages and the use of unusual/unportable optimizations.

Perhaps it is time for someone to come up with a more proper, more open and less biased benchmark.