Introduction
About two years ago I evaluated the performance of ~20 compilers and interpreters on Sudoku solving, matrix multiplication, pattern matching and dictionary operations. Two years later, I have decided to update a small part of the benchmark: Sudoku solving. I chose this problem because it is interesting both practically and algorithmically, and simple enough to be easily ported to multiple languages. I am also adding two new programming languages: Mozilla's Rust and Google's Dart. They are probably the most promising languages announced in the past two years.
Results
In this small benchmark, I implemented Sudoku solvers in multiple programming languages. The algorithm, adapted from Guenter Stertenbrink's solver, was first implemented in C and then ported to the other languages. The C source code briefly describes the method. For more information about Sudoku solving in general, please see my other post.
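For readers who do not want to dig into the C source, below is a minimal sketch of the exact-cover encoding this family of solvers is built on; the function and array names here are illustrative, not the actual code. Each of the 729 candidates (digit n at row r, column c) satisfies exactly four of the 324 constraints:

```c
/* Illustrative sketch of the exact-cover encoding (names are not from the
   actual solver). Constraint columns: 0-80 "cell (r,c) is filled",
   81-161 "row r contains digit n", 162-242 "column c contains digit n",
   243-323 "box b contains digit n". */
void build_matrix(int R[729][4])
{
    for (int r = 0; r < 9; ++r)
        for (int c = 0; c < 9; ++c)
            for (int n = 0; n < 9; ++n) {
                int i = 9 * (9 * r + c) + n;                 /* candidate index */
                R[i][0] = 9 * r + c;                         /* cell (r,c) filled  */
                R[i][1] = 81 + 9 * r + n;                    /* row r has digit n  */
                R[i][2] = 162 + 9 * c + n;                   /* col c has digit n  */
                R[i][3] = 243 + 9 * (r / 3 * 3 + c / 3) + n; /* box has digit n    */
            }
}
```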
Before I show the results, there are a couple of caveats to note:
- Solving Sudoku is NP-hard, and the choice of solving algorithm dramatically affects the speed. For example, my Rust implementation is ~2500 times faster than the one in the Rust official repository. For a language benchmark, all implementations must use exactly the same algorithm.
- I am mostly familiar with C but am pretty much a newbie in the other programming languages, so I am sure some implementations are not optimal. If you can improve the code, please send me a pull request; I am happy to replace an implementation with a better version.
The following table shows the CPU time for solving 20 hard Sudokus repeated 500 times (thus 10,000 Sudokus in total). The programs, which are freely available, are compiled and run on my Mac laptop with a 2.66 GHz Core i7 CPU.
Compiler/VM | Version | Language | Option | CPU time (sec) |
---|---|---|---|---|
clang | 425.0.27 (3.2svn) | C | -O2 | 8.92 |
llvm-gcc | 4.2.1 | C | -O2 | 9.23 |
dmd | 2.062 | D2 | -O -release -noboundscheck | 11.47 |
rust | 0.6 | Rust | --opt-level 3 | 11.51 |
java | 1.6.0_37 | Java | -d64 | 11.57 |
go | 1.1beta 20130406 | Go | -gcflags -B | 13.78 |
dart | 0.4.4.4-r20810 | Dart | | 21.42 |
v8 | 3.16.3 | JavaScript | | 28.19 |
luajit | 2.0.1 | Lua | | 30.66 |
pypy | 2.0-beta-130405 | Python | | 44.29 |
In this small benchmark, C still takes the crown of speed. The other statically typed languages are slower, but Rust and D come very close to C. It is pretty amazing that Rust, as such a new language, is this performant, given that its developers have not put much effort into speed so far.
Among the dynamically typed languages, Dart, V8 and LuaJIT have similar speed, about 3 times as slow as C. A 3-fold slowdown arguably does not matter for many applications; I really hope some day I can use a handy dynamically typed language for most of my programming. PyPy is slower here, but it is more than twice as fast as the version from two years ago.
Related resources
- Old benchmark results.
- Sudoku solver source code based on the same algorithm.
- Online Sudoku solver using the same algorithm.
- Third-party sudoku solvers.
- Reddit discussions.
- Hacker News discussions.
Update
- I forgot to use `-release` with dmd. The new result looks much better. Sorry for my mistake.
- Mac ships gcc-4.2.1 only, due to licensing issues. I have just tried both gcc 4.7.2 and gcc 4.8 from MacPorts. The executables they produce take 0.99 seconds to run, slower than gcc-4.2.1's.
- Updated to the latest Go compiled from the repository.
- Updated the Python implementation (thanks to Rob Smallshire).
- Updated the Dart implementation (thanks to jwendel).
- Updated the Rust implementation (thanks to dotdash).
- Made the input 10 times larger to reduce the fraction of time spent on VM startup. Dart/V8/LuaJIT have a short VM startup time, but Java is known to have a long one.
- Updated the Go implementation (thanks to Sébastien Paolacci).
- Updated the Python implementation.
I would be inclined to forsake Rust and Dart in favor of Microsoft VB6. I know it is old, but it will outperform those other two languages by an extremely large margin, and it will also take less time to implement.
I have never written anything in VB, but being Windows-only is enough to discourage me from using it. Rust and Dart both run on the three major OSes.
What about testing in Julia, Dylan, Lisp and C++ compilers as well?
Well, if someone provides me with the programs, I am happy to include them.
Can you retry with Go 1.1? I am interested in its performance improvements.
The Go-1.1beta1 binary for Mac seems broken for now; I cannot install it.
On a (low-powered) MacBook Air:
```
~$ hg clone https://code.google.com/p/go/
(…)
3720 files updated, 0 files merged, 0 files removed, 0 files unresolved
~$ cd go/src/
~/go/src$ time ./make.bash
# Building compilers and Go bootstrap tool for host, darwin/amd64.
(…)
real    0m28.727s
user    0m53.440s
sys     0m9.524s
```
Updated Go. Thanks. I didn't know compiling Go was that fast. It took me an hour or so to compile Rust, and several hours to compile gcc-4.7.2 (which failed in the end).
You should throw in go 1.1 beta 1 and see if that affects the go results.
Too lazy to look at the benchmark source … In the case of jitted/interpreted languages, are JIT compilation time and interpreter startup factored out of the elapsed-time computation? If not, wouldn't it be better to run some more iterations, to make the JIT and interpreter startup time negligible with respect to the total run time? I would target 60 seconds of total run time …
If I double the amount of input, there is almost no change in per-Sudoku time. I think VM startup probably takes ~0.1 sec, which does not affect the general conclusion at all.
It turns out you are right: although some VMs have a short startup time, others do not. It would be good to make the programs run longer, and I have done that. Thank you.
Hello,
Thanks for those measurements; they are much more interesting/concrete than most other synthetic benchmarks.
FYI, out of curiosity, I ran your Go version against go 1.1-beta1 (without any modification, but using Go's benchmark lib) and it looks like a nice improvement is coming for free.
```
benchmark     old ns/op    new ns/op    delta
Benchmark1    43055580     34747065     -19.30%
```
Cheers,
Sebastien
What about Scala and Clojure? They have gained a bit of the limelight in recent times.
Could you provide a version based on the Java/C implementation? Thanks.
Why are you comparing clang-svn against gcc 4.2, which was released 5 years ago? The latest version of gcc is 4.8.
4.2.1 is what ships with Mac; I think there are licensing issues with later versions. Anyway, both 4.8 and 4.7.2 turn out to be ~8% slower than 4.2.1.
I don't think clang-svn is shipped with your Mac. You somehow managed to compare a drastically outdated version of gcc with a not-even-released version of clang in spite of this fact.
The licensing problem with gcc exists only inside Apple Inc.'s heads. It does not prevent users from downloading newer versions of gcc.
Clang is Apple's project and is kept updated. Installing the latest Xcode gives me the latest clang.
I was hoping to get gccgo, so I tried to compile gcc-4.7.2 from source on Mac. It turned out to be a huge pain: I had to install gmp etc. and fix an iconv problem, and after 2 hours the compilation still failed on another error. In the end, I grabbed the binaries from MacPorts. Anyway, both 4.7.2 and 4.8 from MacPorts are slower than llvm-gcc-4.2.1 (I forgot to say that it is based on LLVM).
As to the licensing issue, I believe it is caused by GPLv3 as I read somewhere. It is fair to say it is mostly Apple’s problem.
Since you still managed (and had the patience) to get gcc-4.7+ working on your Mac, could you take the opportunity to do a `gccgo -O3` run? On my Linux desktop, gccgo 4.7.2 reaches 9.54s where the "more standard" and up-to-date gc compiler maxes out at 17.43s …
The gcc-4.7 from MacPorts does not come with gccgo compiled. That is why I wanted to compile gcc from source, but I gave up in the end.
I really doubt that, but it might be related to your using -O2, for some odd reason, instead of -O3, which is the optimizer setting aimed at producing the fastest code.
If you are talking about gcc, -O3 enables more optimizations, but they do not necessarily lead to a faster binary. In my experience, -O3 generates slower executables than -O2 about half of the time. On this example, -O2 is faster.
-O3 outperforms -O2 in my tests (Linux Debian/squeeze). Here is a link[1] with the results: -O3 is around 5-6% faster than -O2, and -O2 is comparable to the improved Java version.
[1]: http://pastebin.com/kcYgs1Um
Performance drops in C might have to do with linking. Try `gcc -O3 -static` or `gcc -O3 -flto`.
Any reason you aren't using Java 7? It is available for OS X from Oracle.
Upgrade your GCC. 4.2 is probably what your grandpa used to learn C. The latest release is GCC 4.8!
Benchmarks are fun! Please accept the pull requests and run another.
The Dart code uses left shifts (<<) but doesn't truncate the result. This means the VM might need to allocate bignums (or at the very least it needs to check for them).
If you have a little bit of time it would be nice to see if this makes a difference.
Also (more of an FYI): Dart doesn't do on-stack replacement yet, which means `_solve` would not be optimized for the first run. It could be interesting to see the peak performance by doing a warm-up round first (see http://www.dartlang.org/articles/benchmarking/).
I would think "<<" is fine; the maximum integer in this program is around 655360. Someone once explained this OSR issue to me; Dart really should implement OSR, IMO. I have other posts about Dart. I think it is really promising as a replacement for most scripting languages such as Perl/Python, but in its current direction it is just another web-only language.
If you truncate the "<<" result appropriately, better code will be generated, as there is no need to check for overflow at each "<<".
Dart will implement OSR. Until then, proper warmup is needed to simulate real-app performance.
A 3x speed difference is huge for those of us working on high-traffic servers, simulations, hard realtime systems, and high-frequency trading systems.
As we say, C IS KING. Nothing can compare to it. It is the only tool to use for serious work.
GCC 4.2.1, seriously? Using GDC instead of DMD for the D code would put it on par with the C version. Making sure it implements the same logic would help too (it uses 32-bit ints instead of 16-bit ones and uses transposed matrices, so there are lots more cache misses).
Have you read the update? llvm-gcc 4.2.1 is the fastest gcc on Mac for this benchmark. On the matrices, it seems that "int[9][324] a" in D is equivalent to "int a[324][9]" in C, or am I missing something? Has GDC finished its support of D2?
>llvm-gcc 4.2.1 is the fastest gcc on mac for this benchmark
I’m guessing that’s a gcc front end to LLVM? In that case, maybe the D compiler that produces the fastest code is LDC (which works better than GDC on a mac anyway, at least it did the last time have I tried it).
> On the matrices, it seems that "int[9][324] a" in D is equivalent to "int a[324][9]" in C, or am I missing something?
You’re right.
> Has GDC finished its support of D2?
It has, over a year ago, and LDC has too.
Serious work? So the other 99% of the industry is just fooling around? I'm a little tired of the caveman argument, particularly when people want to join the HFT cool kids while ignoring its rapid migration to FPGAs.
Since your 2011 results had Ruby, would you do it again with Ruby 2.0 ?
Hi,
Two comments:
Your Dart benchmark is not following the benchmarking guidelines for Dart, making your results less useful. Please follow the directions found here: http://www.dartlang.org/articles/benchmarking/. In particular, add a warm-up phase to your benchmark program.
Finally, how are you timing things? In the benchmark code I don't see any timing code, so I'm guessing you are just running '$ time benchmark'. If so, this will inflate all your numbers, because 'time' includes the time the OS takes to load the application into memory and get it ready to run, as well as the time the OS takes to terminate the process. Some runtimes, such as C, can begin executing immediately, while others, like Java/Dart/dynamic languages, must parse the source or byte code and generate code before they can run. How do you account for this extra time, which has nothing to do with solving Sudoku?
John
I know VM startup is negligible for LuaJIT, and V8 turns out to be similar. Here are the timings for solving 1000/10000 Sudokus: with LuaJIT, 3.05s/30.57s; with V8, 2.83s/28.02s. If startup/warmup took a significant amount of time, solving 10000 Sudokus would be noticeably faster per Sudoku.
For Dart, the timing for solving 1000/10000 Sudokus is 2.77s/26.77s. Its startup/warmup is slower than LuaJIT's/V8's, but it is only ~0.1s, which does not affect the general conclusion.
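(Assuming the per-Sudoku time is constant, the fixed startup/warmup overhead can be backed out from the two runs as (10 × t_1000 − t_10000) / 9. Plugging in Dart's numbers gives (27.7 − 26.77) / 9 ≈ 0.10s, which is where the ~0.1s figure comes from.)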
I don't think it is fair to ask me to add warmup code for benchmarking Dart (and in this case, it is not so easy to warm up). Firstly, warmup has a pretty minor effect in this benchmark, and secondly, requiring warmup is really a flaw in Dart. I know you will make the same argument that OSR is not necessary for real applications. That is probably true for web development, but not in my domain.
BTW, it would be good if the Dart team could interact with the Go team about the design of features and libraries for command-line applications. I know it is not your priority.
How exactly do you test a program 50 times? Do you stick a for-loop in each to run the solver 50 times for each puzzle? Do you run the solver from a shell script 50 times?
```
seq 50 | xargs -i echo sudoku.txt > 1000.txt
time ./a.out < 1000.txt > /dev/null
```
I assume that should be “cat” instead of “echo”?
First of all, thanks for benchmarking Dart VM.
I took a cursory glance at the produced code and already see some minor inefficiencies and optimization opportunities.
One piece of optimization advice, however: consider making _R and _C final in your class and initializing them in the initializer list:

```
Sudoku() : _R = new List(324), _C = new List(729) { /* … */ }
```

This should improve the generated code.
I'm working on submitting pull requests right now to speed up Dart in his benchmark.
Great! Please check out the latest version in github. Someone has already made it faster.
Thanks. Jwendel has just sent a pull request that fixes the issue; Dart is ~30% faster now. I am waiting to see if someone wants to send V8 and LuaJIT programs to beat Dart. 😉
Since you disabled bounds checking in the D source, maybe you could pass `-gcflags -B` to disable bounds checking in Go as well? I get about a ~10% perf increase locally.
Done.
I’d be interested in compile times.
I am not sure why you would benchmark JIT-ed languages over just a few seconds. Startup time, compilation, etc. would be factored in each time.
Running java w/ -d64 doesn't make much sense, but it doesn't matter too much: the option is virtually useless even on a 64-bit system, and it would be picked automatically if there is 32GB+ memory. You want -server, though. Speaking of the Java code: the way it's written, it's virtually non-optimizable in the normal way, and only OSR (on-stack replacement) can be applied, which is suboptimal. Comparing against array.length instead of a constant would be beneficial most of the time (loop unrolling, bounds-check removal, etc.). Zeroing memory is needed in C only; the code looks like it was ported from C w/o much effort put in (0<<7|9 == 9, for instance). Overall, writing microbenchmarks is very hard unless you know the JIT very well. In the end, I might try to rewrite the code if I find some time.
For Dart/V8/LuaJIT, VM startup can largely be ignored. For Java, however, your criticism is valid; I have made the input 10 times larger, so Java startup time should no longer be an issue. Thank you. As to your other concerns, I guess they have at most a 1-2% effect: there are 10000 Sudokus, so OSR will kick in after solving the first one; zeroing memory takes almost no time; -d64 implies -server, and using either option makes no difference on my laptop; bounds checking in D/Go only adds a couple of percent of CPU time.
I did some modifications to the Java code and am getting execution comparable (±10%) to C (counting only real execution, with proper warm-up). Note: printing to the system console kinda sucks (it takes 400+ms while the calculations are around 750ms on an E31270@3.4GHz), so redirecting to a file would be better in that respect; in my case: Elapsed: 1244.44ms w/ console, Elapsed: 818.47ms w/ java … > out. The problem w/ OSR is that the generated code is poorer than the non-OSR one; I would have to see the generated assembly to tell if that has any effect. The C code is ~930ms (time ./a.out … > out). All tests w/ 50 iterations.
In the benchmark, all output is directed to /dev/null.
I've modified both C and Java[1] to print the elapsed time (gettimeofday and System.nanoTime, respectively). The results w/ 500 iterations: (C: gcc -O3 -msse4.2) — Elapsed: 7636; (java -server -XX:CompileThreshold=1500 -Dwarmup=4000) — Elapsed: 7865.70, Elapsed+Read: 7933.81.
Also, the Java code now uses the short type for R and C, does not print in the computation method (bailing to the kernel screws the L1 instruction cache), plus a few more optimizations. If everything fits in L1, short[] won't matter, but in the real world the short type would be a win: less memory pressure. Overall, Java won't match C w/ -O3, but it should be no more than 5-10% slower.
[1]: http://pastebin.com/syYQiLBa
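For reference, a minimal sketch of this style of in-program timing on the C side; the wall_time() helper is illustrative and not taken from the benchmarked code:

```c
/* Sketch: time only the solving work with gettimeofday(), so that process
   startup and teardown are excluded from the measurement. */
#include <stdio.h>
#include <sys/time.h>

static double wall_time(void)
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

int main(void)
{
    double t0 = wall_time();
    /* ... read and solve all Sudokus here ... */
    fprintf(stderr, "Elapsed: %.2f ms\n", (wall_time() - t0) * 1e3);
    return 0;
}
```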
I have no idea if it’s even possible to run from a command line to test this, but I’d love to see asm.js added to the list.
I’m learning Rust right now and I’m really encouraged to see how well it’s doing for a 0.* release.
Someone has compiled sudoku.c to asm.js, but I do not know how to use it…
Use OdinMonkey (Mozilla).
It used to be that llvm-gcc generated much slower code than plain old GCC: http://www.phoronix.com/scan.php?page=article&item=apple_llvm_gcc&num=1
I wonder if that’s still the case today.
Maybe in general, but on this benchmark, llvm-gcc is faster than the latest plain gcc.
You mention that Sudoku is NP-hard – wouldn’t it be more precise to say that it’s NP-complete instead?
It was shown to be NP-complete in http://www-imai.is.s.u-tokyo.ac.jp/~yato/data2/SIGAL87-2.pdf (obviously that means it’s also NP-hard so your statement isn’t incorrect, just less precise).
With gcc you should always try out all four combinations of -O2/-O3 and -funroll-loops/-fno-unroll-loops. Playing with these, I get between 7.9s and 9.5s on my Core(TM) i7-2670QM CPU @ 2.20GHz running Fedora 18.
A small note about D: currently the fastest D compiler is the LLVM-based LDC. Here are the execution speeds for solving lots of Sudokus on my machine:
`dmd -O -release -noboundscheck sudoku_v2.d`:

```
real    0m9.445s
user    0m9.427s
sys     0m0.008s
```

`ldc2 -O3 -release -disable-boundscheck -vectorize-loops sudoku_v2.d`:

```
real    0m7.065s
user    0m7.045s
sys     0m0.013s
```
Are the benchmarks updated against the beta version of Rust?