After the withdrawal of LuaJIT, PyPy and a few other great implementations from the Computer Language Benchmarks Game, we are left with virtually no benchmarks that compare performance both between languages and between implementations of the same language. Without both types of comparison, we cannot tell which programming language has the best performance in practice, because languages and implementations are closely related.
In the end, I decided to benchmark programming languages myself. Although the benchmark is still primitive and limited, in my opinion it is not very misleading and is therefore better than nothing.
In the current benchmark, I am evaluating the speed of multiplying two 1000×1000 matrices using the standard method (there are theoretically faster algorithms). The benchmark covers 11 programming languages: C (icc, gcc and clang), C# (mono), D (gdc), Go (6g), Java, JavaScript (V8, JägerMonkey and Rhino), Lua (Lua and LuaJIT), Perl, Python (CPython 2/3, PyPy, Jython and IronPython@mono), Ruby (Ruby, JRuby, Rubinius and IronRuby@mono) and R.
Note that this is the first time I have written C#, Go or Ruby, and I am very unfamiliar with D, R, Python and Java. It is quite likely that I did something silly even though matrix multiplication sounds easy. Please feel free to correct me. Thank you.
Results: http://attractivechaos.github.com/plb/
Source code: https://github.com/attractivechaos/plb/
EDIT: Results are also in this post.
Any chance you could have the Python version of the matmul benchmark use the array module for storage of the numbers in the matrix?
I am sure such things happen; I have only written ~200 lines of Python in total. Could you send a version to me? I would really appreciate it!
EDIT: with Brentp's help, I have added a new Python script which is faster than v1. The improvement is dramatic for PyPy.
For JavaScript: is .push() really that effective?
Some quick tests in the browser give a minor improvement (2-4% for n=400) when creating arrays with an initial size.
The bottleneck is this loop:
for (var k = 0; k < n; ++k) sum += ai[k] * cj[k];
Using push or not has little effect on efficiency.
Good work. Very useful.
Why do you say it is unfair to the other programming languages when they are compared to R?
I guess that instead of using R's matrix multiplication operator %*%, you should implement it as a nested loop, which is what is being benchmarked in the other languages. The difference should be pretty spectacular.
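For reference, the triple loop being benchmarked looks roughly like this in C (a sketch assuming square n-by-n matrices, not the exact plb source; like the plb versions, it transposes b first so the hot loop scans both operands sequentially):

#include <stdlib.h>

/* Multiply two square n-by-n matrices: transpose b, then take
 * row-by-row dot products. No error checking. */
double **matmul(int n, double **a, double **b)
{
    int i, j, k;
    double **bt = malloc(n * sizeof(double *)); /* transpose of b */
    double **c = malloc(n * sizeof(double *));  /* result */
    for (i = 0; i < n; ++i) {
        bt[i] = malloc(n * sizeof(double));
        c[i] = calloc(n, sizeof(double));
    }
    for (i = 0; i < n; ++i)
        for (j = 0; j < n; ++j)
            bt[i][j] = b[j][i];
    for (i = 0; i < n; ++i)
        for (j = 0; j < n; ++j) {
            double sum = 0.0;
            for (k = 0; k < n; ++k)
                sum += a[i][k] * bt[j][k];  /* the hot loop */
            c[i][j] = sum;
        }
    for (i = 0; i < n; ++i) free(bt[i]);
    free(bt);
    return c;
}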
Here's a version using array.array and a bit more idiomatic Python (though I'm not sure it's what Alex had in mind). It seems about 10-15% faster on Python 2.7 and 3.2.
import array

def matmul(a, b):  # FIXME: no error checking; assumes square matrices
    ra, rb = range(len(a)), range(len(b))
    c = [array.array('d', [b[j][i] for j in rb]) for i in rb]  # transpose of b
    d = [array.array('d', [0.0] * len(b[0])) for i in ra]      # result
    for i in ra:
        ai = a[i]
        for j in rb:
            d[i][j] = sum(ai[k] * c[j][k] for k in rb)
    return d
Do you have some "rules" for this? E.g., is NumPy allowed?
Thanks. Nonetheless, it seems that Python, and especially PyPy, likes "+=" better than sum(). I have committed a version based on your modification. Except for Jython, all the other Python implementations get an improvement.
If LuaJIT uses FFI, then JavaScript implementations should use a flat Float64Array (available e.g. in V8's shell).
A quick rewrite from normal JS arrays to a flat Float64Array (allocated with Float64Array(n*n), indexed as a[i*n + j]) shows a factor-of-3.5 speedup (14s -> 4s) on V8 r7678 (arch=ia32) running on my laptop.
The only problem is that shell (unlike d8, which you apparently use for benchmarking) does not expose command-line arguments, so you'll have to hardcode 1000 into the script source.
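A flat Float64Array(n*n) indexed as a[i*n + j] mirrors the contiguous row-major layout a C version could use; here is a minimal C sketch of the same indexing scheme (an illustration only, not the plb code):

#include <stdlib.h>

/* Square n-by-n matrices stored as single contiguous blocks of n*n
 * doubles, with element (i, j) at index i*n + j. The i-k-j loop order
 * scans all three matrices sequentially, so no transpose is needed.
 * The caller frees the result. */
double *matmul_flat(int n, const double *a, const double *b)
{
    double *c = calloc((size_t)n * n, sizeof(double));
    int i, j, k;
    for (i = 0; i < n; ++i)
        for (k = 0; k < n; ++k) {
            double aik = a[i * n + k];
            for (j = 0; j < n; ++j)
                c[i * n + j] += aik * b[k * n + j];
        }
    return c;
}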
I am not sure if I should forbid FFI, as my initial intention was to minimize implementation-specific optimization. Nonetheless, your argument is fair: if LuaJIT uses FFI, other programs should be allowed to use low-level arrays. BTW, without FFI, LuaJIT runs in 2.7 CPU seconds.
Could you put your version somewhere publicly accessible? Whatever the future decision on FFI, I will surely include your version in the GitHub repository. Many thanks.
Heh. Quite impressive though not unexpected. Having a NaN-boxed value representation is a huge win for this kind of benchmark.
V8 stores all doubles by "reference" (each double is stored as a small heap-allocated object, a HeapNumber), so it has to go through an additional level of indirection when accessing matrix elements, and it has to allocate memory when storing a number into the matrix.
That's why using WebGL arrays instead of normal JS arrays helps so much.
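For the curious, here is a minimal C sketch of the NaN-boxing idea (an illustration of the general technique only; LuaJIT's actual tagging scheme differs in detail, and this version assumes pointers fit in 32 bits):

#include <stdint.h>
#include <string.h>

/* Every value is 64 bits. Plain doubles are stored as-is; other values
 * are packed into the payload of a quiet NaN, a bit pattern ordinary
 * floating-point arithmetic never produces. Numbers therefore need no
 * heap allocation at all, unlike V8's HeapNumber. */
typedef uint64_t Value;

#define BOX_TAG 0xfff9000000000000ULL /* quiet NaN plus a mark of our own */

static Value box_double(double d) { Value v; memcpy(&v, &d, 8); return v; }
static double unbox_double(Value v) { double d; memcpy(&d, &v, 8); return d; }
static Value box_ptr(void *p) { return BOX_TAG | (uint32_t)(uintptr_t)p; }
static int is_double(Value v) { return (v & BOX_TAG) != BOX_TAG; }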
Regarding the FFI: I think it’s reasonable to allow everything except hacks 🙂
The LuaJIT2 FFI is definitely not a hack; it is a very nice and fast way of interfacing with native code without actually writing native code.
The same is true for Float64Array (and all the other typed WebGL arrays): it is part of the upcoming WebGL standard, modern browsers like Chrome and Firefox already support it, and people are using it for image manipulation, rendering, binary data marshaling, etc.
If you don't forbid pointers in C, I see no reason why you should forbid low-level features in other languages 🙂
Regarding my version: unfortunately it does not perfectly fit your code pattern, because V8's shell does not provide access to command-line arguments, so I had to hardcode 1000 to perform the benchmark.
Do you have the program somewhere? d8 actually accepts command-line options, with "d8 yourscript.js -- arguments".
d8 does support arguments but does _not_ support Float64Array.
shell (which is built with scons sample=shell and is used to run all the tests etc.) supports Float64Array (and all the other WebGL arrays) but does not support arguments.
I see. I did not know this. Many thanks for the clarification. But perhaps I may still want to include this version as an interesting alternative, if you are happy with that, of course.
The Go benchmark looks suspect to me. Here are the numbers I get: the first with the C version (gcc 4.2.1), the second with Go (6g weekly.2011-04-13 8085+), and the third with a slightly improved version that uses big pre-allocated slices to move the calls to make() out of the loops (this didn't help as much as I might have thought). These results put Go within 6% of C instead of 2x slower.
$ time ./matmul_v1c 1000
-95.583583
real 0m9.424s
user 0m9.204s
sys 0m0.087s
$ time ./matmul_v1 1000
-95.583583
real 0m9.990s
user 0m9.760s
sys 0m0.109s
$ time ./matmul_v2 1000
-95.583583
real 0m9.929s
user 0m9.691s
sys 0m0.098s
matmul_v1.go is the very first Go program I have ever written, so it is not surprising that I did something silly. Could you put your version somewhere? I would greatly appreciate it!
BTW, what compiler options are you using? Could that matter?
Just sent you a pull request on GitHub. I don't think what you did was silly: the code you wrote is clearer, and the improvement I made turns out to be marginal.
Aha. I left off the optimisation flags when compiling the C version. Now I get results similar to yours.
It would be nice to have an alternative to the Language Benchmarks Game. I’ve sent you a pull request for an updated Go benchmark that makes a significant difference.
A few more general suggestions:
1. Please document your timing method.
2. Don’t name the files matmul_v1, matmul_v2, etc. You have the files in version control, so the old versions will be kept anyway. Avoid the clutter and keep only the best version for each language/implementation.
3. Be more specific about the rules. You say “no libraries in foreign languages” but that seems unrealistic. Many interpreters are written in C and thus have parts of the runtime and standard library that are written in C also. The C library has parts that are written in assembly. I’d suggest a rule more like “no non-standard libraries written in another language”. You might also consider saying that no language extensions should be used.
1. Yes. I will do that some time later.
2. I think it would be good to keep a couple of interesting alternative versions. In addition, although I would not encourage implementation-specific optimizations, sometimes they cannot be fully avoided.
3. A good suggestion.
Many thanks.
Your benchmarking article is good, but I have one problem with it: you are not showing how much of each program is actually implemented in the benchmarked language itself.
The best example of this problem is the "patmch:1t" benchmark, which is won by Perl. Clearly, if most of the executed code were actually implemented in Perl, then Perl would have no chance of winning against C, Java, Go or similar languages.
So what I miss in the benchmarks (not just yours; the same issue can be found in many other benchmarks on the Internet) is another column in your tables saying how many of the executed CPU instructions actually belong to code in the language being benchmarked. I know it is hard to define where the precise boundary lies, but in my opinion it is even worse not to define any boundary at all.
The "matmul" benchmark evaluates the speed of the language itself. The regex and dictionary benchmarks are designed such that most of the CPU time is NOT spent in the language itself; this is evident from the comparison of LuaJIT with vs. without JIT. For regex and dict, we do not care whether the language itself is fast; we only care whether our program as a whole is fast.
I agree about “matmul”.
What you wrote about "regex" and "dictionary" confuses me. It is impossible to apply the rule "these two benchmarks are designed such that most of the CPU time is NOT spent in the language itself" to compiled languages like C, Go or Java.
In the case of interpreted programming languages, these two benchmarks are about the libraries used, not about the programming languages themselves.
Maybe it would be better for the article to have the title "Benchmarks of programming language compilers and interpreters, and of their libraries, which may be implemented in another programming language", because that is what your measurements are about.
Question: would it be too problematic to add a column "patmch:1et" right next to the existing "patmch:1t" column? The "patmch:1et" column would give the effective (e = effective) language in which the program spends most of its execution time. So if Perl/Python/Lua/etc. calls a C function and spends 95% of the time there, the effective language is C.
I think many people would consider such information to be useful.
This is a sensible suggestion. For Iron* and J*, I now give the virtual machine the programs actually run on. Presumably Iron* and J* are reusing .NET and Java functions behind the scenes.
As to whether we should evaluate libraries: I think the core library is an indispensable part of a programming language. Depending on the application, the quality of the libraries can be more important than the language itself. If a language is fast but comes with a crappy library, it is a bad language. (PS: in practice, however, a fast language tends to come with a good library, because good programmers write efficient code almost everywhere.)
I'd be interested in seeing the 8g Go compiler (32-bit x86 vs. 64-bit) added to the tests in addition to the 6g compiler. Of course, this uses a different instruction set, so it mixes the implementation difference with the different performance of your CPU on different instructions. But it would still be an interesting comparison, since compiling 32-bit code is often a very reasonable option even on 64-bit platforms, and it would be interesting to see what doing so costs, at least on one CPU with a few simple example programs.
It would also be interesting to see a comparison with a simple blocked matrix multiply. It is so much more efficient than the one you use, and it is not so hard to implement (or to ensure that the algorithm is the same in each language version); see the sketch below.
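For the curious, a simple blocked (tiled) multiply looks roughly like this in C (a sketch only, not part of plb; the tile size BS would need tuning to the cache):

/* Cache-blocked multiply of flat row-major n-by-n matrices.
 * c must be zero-initialized before the call. */
#define BS 64
void matmul_blocked(int n, const double *a, const double *b, double *c)
{
    int ii, jj, kk, i, j, k;
    for (ii = 0; ii < n; ii += BS)
        for (kk = 0; kk < n; kk += BS)
            for (jj = 0; jj < n; jj += BS) {
                int ie = ii + BS < n ? ii + BS : n;
                int ke = kk + BS < n ? kk + BS : n;
                int je = jj + BS < n ? jj + BS : n;
                for (i = ii; i < ie; ++i)
                    for (k = kk; k < ke; ++k) {
                        double aik = a[i * n + k];
                        for (j = jj; j < je; ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
}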
I cannot access a 32-bit machine, and on 64-bit machines "8g" does not seem to be compiled.
I do not see how allocating the matrix in one memory block can greatly improve efficiency, at least for C. With most modern allocators, the addresses of adjacent rows are already close to each other, so allocating one memory block (sketched below) makes little improvement; in my tests the practical difference was indistinguishable. If you have an implementation that is faster than mine, I will certainly use yours in the benchmark. Thank you.
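For concreteness, the one-big-block layout under discussion looks like this (a hypothetical helper, not part of plb):

#include <stdlib.h>

/* Allocate all n*n elements in a single contiguous block, plus a
 * row-pointer array so that m[i][j] still works.
 * Free with: free(m[0]); free(m); */
double **mat_alloc(int n)
{
    double *block = calloc((size_t)n * n, sizeof(double));
    double **m = malloc(n * sizeof(double *));
    int i;
    for (i = 0; i < n; ++i)
        m[i] = block + (size_t)i * n;
    return m;
}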
> 1) C programs are compiled with “gcc/clang -O3 -fomit-frame-pointer” or “icc -O3 -fomit-frame-pointer -xSSE4.1”
Do you know that Clang has -O4?
This is not the fastest D2 version, but it's an improvement over your code:
http://codepad.org/aifQYhjI
Great! Your version is definitely cleaner and slightly faster than mine, so I have replaced my version with yours. Nonetheless, I have removed a few D 2.0-specific keywords, as for now I only have GDC on Linux, which does not support D 2.0. The D website has an Ubuntu x86_64 package, but I cannot install it on the machine where I run the benchmarks.
Thank you!
For a fair comparison with C code you have to use LDC D1, a more optimizing D1 compiler:
http://www.dsource.org/projects/ldc
Code for LDC D1:
http://codepad.org/080zUrsA
Sorry for the tone of my last answer. I'd also like to see the ASM generated by the ICC compiler for the matrix multiplication.
One fast D2 version:
http://ideone.com/LieSi
I really appreciate your contribution, but after an hour I still could not get LDC running. LDC itself compiles, but Phobos causes trouble. Sorry.
Sorry for your wasted time. For LDC1 I use the Tango standard library, not Phobos. The Computer Language Benchmarks Game site needs a lot of work to be kept updated 🙂
Thanks for the info. I can compile your program with LDC1+Tango now. LDC1 is indeed faster than GDC, at least for matmul.
Somewhat faster and nicer Go dict code: http://pastie.org/1843735
The problem there is that every line read results in a new string, and as long as we use the built-in map, that is mostly unavoidable. It would be a lot faster with a hand-rolled hash table, but that seems a bit unnecessary here.
Yes, it is cleaner and slightly faster. I have replaced my version with yours. The benchmark page has been updated. Thanks a lot.
Hello,
Does this software work with Windows 98? I am wondering how to install it; could you please also put the installation instructions here?
I don't think it is a good thing to talk about software without providing it on your site.
The GitHub link gives the source code. The programs can be compiled and executed on Linux.
What options are you using for ldc? Do you add the '-release' flag?
Thanks a lot. I did not know this. Now for matmul, LDC-D is nearly as fast as GCC-C.
Some D1/D2 versions for the dict benchmark:
http://codepad.org/9NDDLOZ2
I'd very much like to see the ASM generated by the ICC compiler for the matrix multiplication benchmark, to see whether ICC is using prefetching or other tricks.
It would be helpful if you could put a note about what happened with the LDC-LLVM implementation for the dict* test.
Also, is the Perl regex matching time of 0.5 for `patmch:1t` really correct? If I didn't know better, I would have pegged it as an outlier.
Thanks
LDC uses Tango, which I am not familiar with. I am going to port the implementations by leonardo, but have not had time to do that.
Perl's regex engine really is that fast. The PCRE, onig and RE2 libraries can achieve a similar speed. For text processing, RE2 is probably the fastest library.
EDIT: leonardo does not provide a D1-LDC implementation for now.
I wonder if you could add (1) Fortran, (2) Python with Psyco, and (3) Python with Shed Skin?
If you would like to provide benchmark programs for Fortran, I will add them. Shed Skin has been added. I tried Psyco before but failed to run it successfully; also, according to its website, PyPy is considered its successor.
Sudoku, D2-DMD2 http://codepad.org/RejDPkGh and D1-LDC1: http://codepad.org/9KksgQFL (Do you have an email address?)
You are quick :). Thank you very much. My email address is “attractor (at) live.co.uk”.
I have put the D1-LDC1 result on the benchmark page. Do you have time to implement patmch and dict in D1-LDC1 too? These should be very short programs. I have not evaluated DMD because I am running the programs on a Linux server I have no control over; its libc is quite old and incompatible with the binary release of DMD.
By the way, I like D a lot more than C++.
Email sent (days ago).
Hi, I noticed your C# matrix multiplier uses 2-dimensional arrays. A skilled C# developer probably wouldn't do that, because rumors abound that 2D arrays are slow. It's probably better to use jagged arrays or a custom 2D array data type. I'm planning to test both of those possibilities tomorrow and add the results to my C#-vs-C++ benchmark at CodeProject: http://www.codeproject.com/KB/cross-platform/BenchmarkCppVsDotNet.aspx
That would be great, David. Mono, after all, does not represent the best .NET VM, but I do not have a Windows machine, so I cannot evaluate the MS version.
C# itself is great, but the lack of a high-performance VM on Linux/Mac is the major obstacle to its popularity. Mono has not reached the quality of the Java VM or the MS .NET VM.
I just noticed a problem in the C# sudoku algorithm. It contains the following line, with a loop variable mismatch:
for (j = 0; j < 81; ++j) y[i] = (char)(o[i] + '1');
The C version doesn't have this line. May I ask where you found the algorithm? It's a little hard to understand.
You may have a look at my post about solving Sudoku puzzles. The algorithm is modified from suexco. The C version has a brief description of the algorithm.
As to that specific line: the C version directly modifies out[], while the C# version copies o[] to y[] first. I was experimenting with C# features and unfortunately left some intermediate results there. This should not change the benchmark results, though.
I believe the matrix multiplication algorithm in C# is incorrect. It works for the purposes of the benchmark (in which the matrices are square) but would not work for arbitrary matrices. Just look at the first loop, which accesses b[i,j]: i goes up to a.GetLength(1) and j goes up to b.GetLength(0), both wrong. I'll try to fix that in my version of the benchmark at CodeProject (which I'll probably post today or tomorrow).
Yes, that is very true. The same problem may occur in a few other language implementations. Thanks.
[…] benchmarks were originally written to use 2-dimensional arrays in C#. Now, writing about the matrix test, the author stated: “this is the first time I write C#”. More seasoned C# developers […]
Any chance to update these results?
I’d love to see how pypy compares after one year of continuous progress…
I ran across this post just now and realize it is more than a year old. I am wondering whether you ran the C#/Mono version with --llvm. There are also some other options one should use. I find that I am generally within 5-10% of C with care in how one codes in C#/Mono.
Yes, I need to update the benchmark sometime. Things are moving fast and a lot has changed. Thanks.
Since this is a benchmark of programming languages, including a variety of what are normally used as scripting languages, could you include TinyCC in the benchmarks as an implementation of C? TinyCC is one of the few C implementations that can be embedded into other C applications, and knowing how it performs relative to the alternatives would be useful.
Hey, just awesome: finally someone has shown us an alternative to the broken-minded Language Shootout Game. Update, please! Update, please! Update, please! Cool, man!