Optimizing Matrix Multiplication

Feeds:: Posts; Comments

Optimizing Matrix Multiplication

August 28, 2016 by attractivechaos

Vector and matrix arithmetic (e.g. vector dot and matrix multiplication) are the basic to linear algebra and are also widely used in other fields such as deep learning. It is easy to implement vector/matrix arithmetic, but when performance is needed, we often resort to a highly optimized BLAS implementation, such as ATLAS and OpenBLAS. Are these libraries much faster than our own implementations? Is it worth introducing a dependency to BLAS if you only need basic vector/matrix arithmetic? The following post may give you some hints.

Results

In this github repository, I implemented matrix multiplication in seven different ways, including a naive implementation, several optimized implementations with cache miss reduction, SSE and loop blocking, and two implementations on top of OpenBLAS. The following table shows the timing of multiplying two 2000×2000 or 4000×4000 random matrices on my personal Mac laptop and a remote linux server (please see the source code repo for details):

Implementation	-a	Linux,-n2000	Linux,-n4000	Mac,-n2000
Naive	0	7.53 sec	188.85 sec	77.45 sec
Transposed	1	6.66 sec	55.48 sec	9.73 sec
sdot w/o hints	4	6.66 sec	55.04 sec	9.70 sec
sdot with hints	3	2.41 sec	29.47 sec	2.92 sec
SSE sdot	2	1.36 sec	21.79 sec	2.92 sec
SSE+tiling sdot	7	1.11 sec	10.84 sec	1.90 sec
OpenBLAS sdot	5	2.69 sec	28.87 sec	5.61 sec
OpenBLAS sgemm	6	0.63 sec	4.91 sec	0.86 sec
uBLAS		7.43 sec	165.74 sec
Eigen		0.61 sec	4.76 sec

You can see that a naive implementation of matrix multiplication is quite slow. Simply transposing the second matrix may greatly improve the performance when the second matrix does not fit to the CPU cache (the linux server has a 35MB cache, which can hold a 2000×2000 float matrix in cache, but not a 4000×4000 matrix). Transpose also enables vectorization of the inner loop. This leads to significant performance boost (SSE sdot vs Transposed). Loop blocking further reduces cache misses and timing for large matrices. However, OpenBLAS’ matrix multiplication (sgemm) is still the king of performance, twice as fast as my best hand-written implementation and tens of times faster than a naive implementation. OpenBLAS is fast mostly due to its advanced techniques to minimize cache misses.

As side notes, “sdot with hints” partially unrolls the inner loop. It gives a hint to the compiler that the loop may be vectorized. Clang on Mac can fully vectorize this loop, achieving the same speed of explicit vectorization. Gcc-4.4 seems not as good. The Intel compiler vectorizes the loop even without this hint (see the full table in README). Interestingly, the OpenBLAS sdot implementation is slower than my explicit vectorization on both Linux and Mac. I haven’t figured out the reason. I speculate that it may be related to cache optimization.

As to C++ libraries, Eigen has similar performance to OpenBLAS. The native uBLAS implementation in Boost is quite primitive, nearly as slow as the most naive implementation. Boost should ditch uBLAS. Even in the old days, it was badly implemented.

Conclusions

For multiplying two large matrices, sophisticated BLAS libraries, such as OpenBLAS, are tens of times faster than the most naive implementation.
With transposing, SSE (x86 only) and loop blocking, we can achieve half of the speed of OpenBLAS’ sgemm while still maintaining relatively simple code. If you want to avoid a BLAS dependency, this is the way to go.
For BLAS level-1 routines (vector arithmetic), an implementation with SSE vectorization may match or sometimes exceeds the performance of OpenBLAS.
If you prefer a C++ interface and are serious about performance, don’t use uBLAS; use Eigen instead.

Posted in C, development | 8 Comments

8 Responses

on August 28, 2016 at 7:56 pm | Reply Øystein Schønning-Johansen

A great post as usual! Thanks. Do you have any theories why OpenBLAS and Eigen is so much better? Are they using threading or maybe AVX?
- on August 28, 2016 at 8:17 pm | Reply attractivechaos
  
  Have a look at the “matrix multiplication algorithm” wiki page and you will get some hints. I guess they are faster mostly because they are better at minimizing cache misses by splitting and reordering the computation block by block. As to other possible explanations – I have intentionally disabled multithreading; although the linux server supports AVX, gcc doesn’t and I explicitly tells OpenBLAS not to use AVX.
- on August 29, 2016 at 2:19 pm | Reply BLAS Pascal
  
  For GotoBLAS, from which OpenBLAS was forked, you might want to read the paper https://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf
on August 28, 2016 at 9:49 pm | Reply zhanxwzhanxw

I like Eigen a lot. It can use Intel MKL backend, and that has good performances besides matrix multiplication.
on August 28, 2016 at 10:44 pm | Reply rurban

What about the new gather-scatter vector intrinsics, esp. the 512 bit ones? Is this included in SSE? I guess not, as it needs AVX or Knights Landing/Phi.
on August 29, 2016 at 8:56 am | Reply 2#

This is one very interesting ecample where you could use some multy threading in order to acchieve faster multiplication.
on October 14, 2016 at 1:45 pm | Reply Paweł

I do recomend including also implementation with instruction reordering(eliminates cache-misisng) and openmp directives for parallization
on February 19, 2017 at 4:39 pm | Reply Siddhesh Urkude

Optimizing matrix multiplication explained very easily. Its easy to understand through your article. I was searching for one such article. Thanx for sharing.

Siddhesh-C programming

Comments RSS

	humbleoptimizer on Using void* in Generic C Progr…
	Accessed on January… on My programming language benchm…
	Sumit Kumar on Implementing Generic Hash Libr…
	Implementing a HashM… on khash.h
	Implementing a HashM… on khash.h

Attractive Chaos

Just another WordPress.com weblog