After the withdrawal of LuaJIT, PyPy and a few other great implementations from the Computer Language Benchmarks Game, we are left with virtually no benchmarks that compare performance both between languages and between implementations of the same language. Without both types of comparison, we cannot tell which programming language has the best performance in practice, because languages and implementations are closely related.
In the end, I decided to benchmark programming languages myself. Although the benchmark is still primitive and limited, in my opinion it is not very misleading and is therefore better than nothing.
In the current benchmark, I am evaluating the speed of multiplying two 1000×1000 matrices using the standard method (there are theoretically faster algorithms). The benchmark covers 11 programming languages: C (icc, gcc and clang), C# (mono), D (gdc), Go (6g), Java, JavaScript (V8, JägerMonkey and Rhino), Lua (Lua and LuaJIT), Perl, Python (CPython 2/3, PyPy, Jython and IronPython@mono), Ruby (Ruby, JRuby, Rubinius and IronRuby@mono) and R.
Note that this is the first time I have written C#, Go or Ruby, and I am very unfamiliar with D, R, Python and Java. It is quite likely that I did something silly even though matrix multiplication sounds easy. Please feel free to correct me. Thank you.
Results: http://attractivechaos.github.com/plb/
Source code: https://github.com/attractivechaos/plb/
EDIT: Results are also in this post.
Any chance you could have the Python version of the matmul benchmark use the array module for storage of the numbers in the matrix?
I am sure such things happen; I have only written ~200 lines of Python in total. Could you send a version to me? I would really appreciate it!
EDIT: with Brentp's help, I have added a new Python script which is faster than v1. The improvement is dramatic for PyPy.
For JavaScript: is .push() really that effective?
Some quick tests in the browser give a minor improvement (2-4% for n=400) when creating arrays with an initial size.
The bottleneck is this loop:
for (var k = 0; k < n; ++k) sum += ai[k] * cj[k];
Using push or not has little effect on efficiency.
Good work. Very useful.
Why do you say it is unfair to the other programming languages when they are compared to R?
I guess that instead of using R's matrix multiplication operator %*%, you should implement it as a nested loop, which is what is being benchmarked in the other languages. The difference should be pretty spectacular.
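For reference, the triple loop being benchmarked looks roughly like this in C (a sketch assuming square n-by-n matrices, not the exact plb source; like the plb versions, it transposes b first so the hot loop scans both operands sequentially):

#include <stdlib.h>

/* Multiply two square n-by-n matrices: transpose b, then take
 * row-by-row dot products. No error checking. */
double **matmul(int n, double **a, double **b)
{
    int i, j, k;
    double **bt = malloc(n * sizeof(double *)); /* transpose of b */
    double **c = malloc(n * sizeof(double *));  /* result */
    for (i = 0; i < n; ++i) {
        bt[i] = malloc(n * sizeof(double));
        c[i] = calloc(n, sizeof(double));
    }
    for (i = 0; i < n; ++i)
        for (j = 0; j < n; ++j)
            bt[i][j] = b[j][i];
    for (i = 0; i < n; ++i)
        for (j = 0; j < n; ++j) {
            double sum = 0.0;
            for (k = 0; k < n; ++k)
                sum += a[i][k] * bt[j][k];  /* the hot loop */
            c[i][j] = sum;
        }
    for (i = 0; i < n; ++i) free(bt[i]);
    free(bt);
    return c;
}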
Here's a version using array.array and a bit more idiomatic Python (though I'm not sure it's what Alex had in mind). It seems about 10-15% faster on Python 2.7 and 3.2.
import array

def matmul(a, b):  # FIXME: no error checking; assumes square matrices
    ra, rb = range(len(a)), range(len(b))
    c = [array.array('d', [b[j][i] for j in rb]) for i in rb]  # transpose of b
    d = [array.array('d', [0.0] * len(b[0])) for i in ra]      # result
    for i in ra:
        ai = a[i]
        for j in rb:
            d[i][j] = sum(ai[k] * c[j][k] for k in rb)
    return d
Do you have some "rules" for this? E.g., is NumPy allowed?
Thanks. Nonetheless, it seems that Python, and especially PyPy, likes "+=" better than sum(). I have committed a version based on your modification. Except for Jython, all the other Python implementations get an improvement.
If LuaJIT uses FFI, then JavaScript implementations should use a flat Float64Array (available e.g. in V8's shell).
A quick rewrite from normal JS arrays to a flat Float64Array (allocated with Float64Array(n*n), indexed as a[i*n + j]) shows a factor-of-3.5 speedup (14s -> 4s) on V8 r7678 (arch=ia32) running on my laptop.
The only problem is that shell (unlike d8, which you apparently use for benchmarking) does not expose command-line arguments, so you'll have to hardcode 1000 into the script source.
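A flat Float64Array(n*n) indexed as a[i*n + j] mirrors the contiguous row-major layout a C version could use; here is a minimal C sketch of the same indexing scheme (an illustration only, not the plb code):

#include <stdlib.h>

/* Square n-by-n matrices stored as single contiguous blocks of n*n
 * doubles, with element (i, j) at index i*n + j. The i-k-j loop order
 * scans all three matrices sequentially, so no transpose is needed.
 * The caller frees the result. */
double *matmul_flat(int n, const double *a, const double *b)
{
    double *c = calloc((size_t)n * n, sizeof(double));
    int i, j, k;
    for (i = 0; i < n; ++i)
        for (k = 0; k < n; ++k) {
            double aik = a[i * n + k];
            for (j = 0; j < n; ++j)
                c[i * n + j] += aik * b[k * n + j];
        }
    return c;
}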
I am not sure if I should forbid FFI, as my initial intention was to minimize implementation-specific optimization. Nonetheless, your argument is fair: if LuaJIT uses FFI, other programs should be allowed to use low-level arrays. BTW, without FFI, LuaJIT runs in 2.7 CPU seconds.
Could you put your version somewhere publicly accessible? Whatever the future decision on FFI, I will surely include your version in the GitHub repository. Many thanks.
Heh. Quite impressive though not unexpected. Having a NaN-boxed value representation is a huge win for this kind of benchmark.
V8 stores all doubles by "reference" (each double is stored as a small heap-allocated object, a HeapNumber), so it has to go through an additional level of indirection when accessing matrix elements, and it has to allocate memory when storing a number into the matrix.
That's why using WebGL arrays instead of normal JS arrays helps so much.
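For the curious, here is a minimal C sketch of the NaN-boxing idea (an illustration of the general technique only; LuaJIT's actual tagging scheme differs in detail, and this version assumes pointers fit in 32 bits):

#include <stdint.h>
#include <string.h>

/* Every value is 64 bits. Plain doubles are stored as-is; other values
 * are packed into the payload of a quiet NaN, a bit pattern ordinary
 * floating-point arithmetic never produces. Numbers therefore need no
 * heap allocation at all, unlike V8's HeapNumber. */
typedef uint64_t Value;

#define BOX_TAG 0xfff9000000000000ULL /* quiet NaN plus a mark of our own */

static Value box_double(double d) { Value v; memcpy(&v, &d, 8); return v; }
static double unbox_double(Value v) { double d; memcpy(&d, &v, 8); return d; }
static Value box_ptr(void *p) { return BOX_TAG | (uint32_t)(uintptr_t)p; }
static int is_double(Value v) { return (v & BOX_TAG) != BOX_TAG; }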
Regarding the FFI: I think it’s reasonable to allow everything except hacks 🙂
The LuaJIT2 FFI is definitely not a hack; it is a very nice and fast way of interfacing with native code without actually writing native code.
The same is true for Float64Array (and all the other typed WebGL arrays): it is part of the upcoming WebGL standard, modern browsers like Chrome and Firefox already support it, and people are using it for image manipulation, rendering, binary data marshaling, etc.
If you don't forbid pointers in C, I see no reason why you should forbid low-level features in other languages 🙂
Regarding my version: unfortunately it does not perfectly fit your code pattern, because V8's shell does not provide access to command-line arguments, so I had to hardcode 1000 to perform the benchmark.
Do you have the program somewhere? d8 actually accepts command-line options, with "d8 yourscript.js -- arguments".
d8 does support arguments but does _not_ support Float64Array.
shell (which is built with scons sample=shell and is used to run all the tests etc.) supports Float64Array (and all the other WebGL arrays) but does not support arguments.
I see. I did not know this. Many thanks for the clarification. But perhaps I may still want to include this version as an interesting alternative, if you are happy with that, of course.
The Go benchmark looks suspect to me. Here are the numbers I get: the first with the C version (gcc 4.2.1), the second with Go (6g weekly.2011-04-13 8085+), and the third with a slightly improved version that uses big pre-allocated slices to move the calls to make() out of the loops (this didn't help as much as I might have thought). These results put Go within 6% of C instead of 2x slower.
$ time ./matmul_v1c 1000
-95.583583
real 0m9.424s
user 0m9.204s
sys 0m0.087s
$ time ./matmul_v1 1000
-95.583583
real 0m9.990s
user 0m9.760s
sys 0m0.109s
$ time ./matmul_v2 1000
-95.583583
real 0m9.929s
user 0m9.691s
sys 0m0.098s
matmul_v1.go is the very first Go program I have ever written, so it is not surprising that I did something silly. Could you put your version somewhere? I would greatly appreciate it!
BTW, what compiler options are you using? Could that matter?
Just sent you a pull request on GitHub. I don't think what you did was silly: the code you wrote is clearer, and the improvement I made turns out to be marginal.
Aha. I left off the optimisation flags when compiling the C version. Now I get results similar to yours.
It would be nice to have an alternative to the Language Benchmarks Game. I’ve sent you a pull request for an updated Go benchmark that makes a significant difference.
A few more general suggestions:
1. Please document your timing method.
2. Don’t name the files matmul_v1, matmul_v2, etc. You have the files in version control, so the old versions will be kept anyway. Avoid the clutter and keep only the best version for each language/implementation.
3. Be more specific about the rules. You say “no libraries in foreign languages” but that seems unrealistic. Many interpreters are written in C and thus have parts of the runtime and standard library that are written in C also. The C library has parts that are written in assembly. I’d suggest a rule more like “no non-standard libraries written in another language”. You might also consider saying that no language extensions should be used.
1. Yes. I will do that some time later.
2. I think it would be good to keep a couple of interesting alternative versions. In addition, although I would not encourage implementation-specific optimizations, sometimes they cannot be fully avoided.
3. A good suggestion.
Many thanks.
Your benchmarking article is good, but I have one problem with it: you are not showing how much of each program is actually implemented in the benchmarked language itself.
The best example of this problem is the "patmch:1t" benchmark, which is won by Perl. Clearly, if most of the executed code were actually implemented in Perl, then Perl would have no chance of winning against C, Java, Go or similar languages.
So what I miss in the benchmarks (not just yours; the same issue can be found in many other benchmarks on the Internet) is another column in your tables saying how many of the executed CPU instructions actually belong to code in the language being benchmarked. I know it is hard to define where the precise boundary lies, but in my opinion it is even worse not to define any boundary at all.
The "matmul" benchmark evaluates the speed of the language itself. The regex and dictionary benchmarks are designed such that most of the CPU time is NOT spent in the language itself; this is evident from the comparison of LuaJIT with vs. without JIT. For regex and dict, we do not care whether the language itself is fast; we only care whether our program as a whole is fast.
I agree about “matmul”.
What you wrote about "regex" and "dictionary" confuses me. It is impossible to apply the rule "these two benchmarks are designed such that most of the CPU time is NOT spent in the language itself" to compiled languages like C, Go or Java.
In the case of interpreted programming languages, these two benchmarks are about the libraries used, not about the programming languages themselves.
Maybe it would be better for the article to have the title "Benchmarks of programming language compilers and interpreters, and of their libraries, which may be implemented in another programming language", because that is what your measurements are about.
Question: would it be too problematic to add a column "patmch:1et" right next to the existing "patmch:1t" column? The "patmch:1et" column would give the effective (e = effective) language in which the program spends most of its execution time. So if Perl/Python/Lua/etc. calls a C function and spends 95% of the time there, the effective language is C.
I think many people would consider such information to be useful.
This is a sensible suggestion. For Iron* and J*, I now give the virtual machine the programs actually run on. Presumably Iron* and J* are reusing .NET and Java functions behind the scenes.
As to whether we should evaluate libraries: I think the core library is an indispensable part of a programming language. Depending on the application, the quality of the libraries can be more important than the language itself. If a language is fast but comes with a crappy library, it is a bad language. (PS: in practice, however, a fast language tends to come with a good library, because good programmers write efficient code almost everywhere.)
I'd be interested in seeing the 8g Go compiler (32-bit x86 vs. 64-bit) added to the tests in addition to the 6g compiler. Of course, this uses a different instruction set, so it mixes the implementation difference with the different performance of your CPU on different instructions. But it would still be an interesting comparison, since compiling 32-bit code is often a very reasonable option even on 64-bit platforms, and it would be interesting to see what doing so costs, at least on one CPU with a few simple example programs.
It would also be interesting to see a comparison with a simple blocked matrix multiply. It is so much more efficient than the one you use, and it is not so hard to implement (or to ensure that the algorithm is the same in each language version); see the sketch below.
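For the curious, a simple blocked (tiled) multiply looks roughly like this in C (a sketch only, not part of plb; the tile size BS would need tuning to the cache):

/* Cache-blocked multiply of flat row-major n-by-n matrices.
 * c must be zero-initialized before the call. */
#define BS 64
void matmul_blocked(int n, const double *a, const double *b, double *c)
{
    int ii, jj, kk, i, j, k;
    for (ii = 0; ii < n; ii += BS)
        for (kk = 0; kk < n; kk += BS)
            for (jj = 0; jj < n; jj += BS) {
                int ie = ii + BS < n ? ii + BS : n;
                int ke = kk + BS < n ? kk + BS : n;
                int je = jj + BS < n ? jj + BS : n;
                for (i = ii; i < ie; ++i)
                    for (k = kk; k < ke; ++k) {
                        double aik = a[i * n + k];
                        for (j = jj; j < je; ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
}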
I cannot access a 32-bit machine, and on 64-bit machines "8g" does not seem to be compiled.
I do not see how allocating the matrix in one memory block can greatly improve efficiency, at least for C. With most modern allocators, the addresses of adjacent rows are already close to each other, so allocating one memory block (sketched below) makes little improvement; in my tests the practical difference was indistinguishable. If you have an implementation that is faster than mine, I will certainly use yours in the benchmark. Thank you.
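For concreteness, the one-big-block layout under discussion looks like this (a hypothetical helper, not part of plb):

#include <stdlib.h>

/* Allocate all n*n elements in a single contiguous block, plus a
 * row-pointer array so that m[i][j] still works.
 * Free with: free(m[0]); free(m); */
double **mat_alloc(int n)
{
    double *block = calloc((size_t)n * n, sizeof(double));
    double **m = malloc(n * sizeof(double *));
    int i;
    for (i = 0; i < n; ++i)
        m[i] = block + (size_t)i * n;
    return m;
}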
> 1) C programs are compiled with “gcc/clang -O3 -fomit-frame-pointer” or “icc -O3 -fomit-frame-pointer -xSSE4.1”
Do you know that Clang has -O4?
This is not the fastest D2 version, but it's an improvement over your code:
http://codepad.org/aifQYhjI
Great! Your version is definitely cleaner and slightly faster than mine, so I have replaced my version with yours. Nonetheless, I have removed a few D 2.0-specific keywords, as for now I only have GDC on Linux, which does not support D 2.0. The D website has an Ubuntu x86_64 package, but I cannot install it on the machine where I run the benchmarks.
Thank you!
For a fair comparison with C code you have to use LDC D1, a more optimizing D1 compiler:
http://www.dsource.org/projects/ldc
Code for LDC D1:
http://codepad.org/080zUrsA
Sorry for the tone of my last answer. I'd also like to see the ASM generated by the ICC compiler for the matrix multiplication.
One fast D2 version:
http://ideone.com/LieSi
I really appreciate your contribution, but after an hour I still could not get LDC running. LDC itself compiles, but Phobos causes trouble. Sorry.
Sorry for your wasted time. For LDC1 I use the Tango standard library, not Phobos. The Computer Language Benchmarks Game site needs a lot of work to be kept updated 🙂
Thanks for the info. I can compile your program with LDC1+Tango now. LDC1 is indeed faster than GDC, at least for matmul.
Somewhat faster and nicer Go dict code: http://pastie.org/1843735
The problem there is that every line read results in a new string, and as long as we use the built-in map, that is mostly unavoidable. It would be a lot faster with a hand-rolled hash table, but that seems a bit unnecessary here.
Yes, it is cleaner and slightly faster. I have replaced my version with yours. The benchmark page has been updated. Thanks a lot.
Hello,
Does this software work with Windows 98? I am wondering how to install it; could you please also put the installation instructions here?
I don't think it is a good thing to talk about software without providing it on your site.
The GitHub link gives the source code. The programs can be compiled and executed on Linux.
What options are you using for ldc? Do you add the '-release' flag?
Thanks a lot. I did not know this. Now for matmul, LDC-D is nearly as fast as GCC-C.
Some D1/D2 versions for the dict benchmark:
http://codepad.org/9NDDLOZ2
I'd very much like to see the ASM generated by the ICC compiler for the matrix multiplication benchmark, to see whether ICC is using prefetching or other tricks.
It would be helpful if you could put a note about what happened with the LDC-LLVM implementation for the dict* test.
Also, is the Perl regex matching time of 0.5 for `patmch:1t` really correct? If I didn't know better, I would have pegged it as an outlier.
Thanks
LDC uses Tango, which I am not familiar with. I am going to port the implementations by leonardo, but have not had time to do that.
Perl's regex engine really is that fast. The PCRE, onig and RE2 libraries can achieve a similar speed. For text processing, RE2 is probably the fastest library.
EDIT: leonardo does not provide a D1-LDC implementation for now.
I wonder if you could add (1) Fortran, (2) Python with Psyco, and (3) Python with Shed Skin?
If you would like to provide benchmark programs for Fortran, I will add them. Shed Skin has been added. I tried Psyco before but failed to run it successfully; also, according to its website, PyPy is considered its successor.
Sudoku, D2-DMD2 http://codepad.org/RejDPkGh and D1-LDC1: http://codepad.org/9KksgQFL (Do you have an email address?)
You are quick :). Thank you very much. My email address is “attractor (at) live.co.uk”.
I have put the D1-LDC1 result on the benchmark page. Do you have time to implement patmch and dict in D1-LDC1 too? These should be very short programs. I have not evaluated DMD because I am running the programs on a Linux server I have no control over; its libc is quite old and incompatible with the binary release of DMD.
By the way, I like D a lot more than C++.
Email sent (days ago).
Hi, I noticed your C# matrix multiplier uses 2-dimensional arrays. A skilled C# developer probably wouldn't do that, because rumors abound that 2D arrays are slow. It's probably better to use jagged arrays or a custom 2D array data type. I'm planning to test both of those possibilities tomorrow and add the results to my C#-vs-C++ benchmark at CodeProject: http://www.codeproject.com/KB/cross-platform/BenchmarkCppVsDotNet.aspx
That would be great, David. Mono, after all, does not represent the best .NET VM, but I do not have a Windows machine, so I cannot evaluate the MS version.
C# itself is great, but the lack of a high-performance VM on Linux/Mac is the major obstacle to its popularity. Mono has not reached the quality of the Java VM or the MS .NET VM.
I just noticed a problem in the C# sudoku algorithm. It contains the following line, with a loop variable mismatch:
for (j = 0; j < 81; ++j) y[i] = (char)(o[i] + '1');
The C version doesn't have this line. May I ask where you found the algorithm? It's a little hard to understand.
You may have a look at my post about solving Sudoku puzzles. The algorithm is modified from suexco. The C version has a brief description of the algorithm.
As to that specific line: the C version directly modifies out[], while the C# version copies o[] to y[] first. I was experimenting with C# features and unfortunately left some intermediate results there. This should not change the benchmark results, though.
I believe the matrix multiplication algorithm in C# is incorrect. It works for the purposes of the benchmark (in which the matrices are square) but would not work for arbitrary matrices. Just look at the first loop, which accesses b[i,j]: i goes up to a.GetLength(1) and j goes up to b.GetLength(0), both wrong. I'll try to fix that in my version of the benchmark at CodeProject (which I'll probably post today or tomorrow).
Yes, that is very true. The same problem may occur in a few other language implementations. Thanks.
[…] benchmarks were originally written to use 2-dimensional arrays in C#. Now, writing about the matrix test, the author stated: “this is the first time I write C#”. More seasoned C# developers […]
Any chance to update these results?
I’d love to see how pypy compares after one year of continuous progress…
I ran across this post just now and realize it is more than a year old. I am wondering whether you ran the C#/Mono version with --llvm. There are also some other options one should use. I find that I am generally within 5-10% of C with care in how one codes in C#/Mono.
Yes, I need to update the benchmark sometime. Things are moving fast and a lot has changed. Thanks.
Since this is a benchmark of programming languages, including a variety of what are normally used as scripting languages, could you include TinyCC in the benchmarks as an implementation of C? TinyCC is one of the few C implementations that can be embedded into other C applications, and knowing how it performs relative to the alternatives would be useful.
Hey, just awesome: finally someone has shown us an alternative to the broken-minded Language Shootout Game. Update, please! Update, please! Update, please! Cool, man!