Mutex and spinlock
When multiple threads need to modify a shared resource such as a global variable, we typically lock the resource to guarantee that at most one thread modifies it at a time, so that threads cannot interfere with one another and cause unexpected behavior. There are different types of locks. For Linux C programmers, probably the most frequently used lock is the mutex from NPTL, the Native POSIX Thread Library.
In NPTL, mutex is implemented on top of the futex, a kernel primitive for assisting the implementation of locking. Essentially, futex provides a way to ask the kernel to block a thread until the value at a given address is changed, and to wake up threads waiting on an address. The implementation of mutex is kernel-dependent and complex. For now, I could not think of a means to implement mutex without closely interacting with the kernel.
Another much simpler type of lock is the spinlock. The fundamental difference between spinlock and mutex is that a spinlock keeps checking the lock (busy waiting), while a mutex puts threads waiting for the lock to sleep (blocked). A busy-waiting thread wastes CPU cycles, while a blocked thread does not. Despite this difference, mutex and spinlock behave very similarly.
According to POSIX, implementing spinlock in the pthread library is optional. Mac OS X does not implement spinlock in its pthread library; instead, it provides spinlocks through several OS-specific functions, which indeed causes portability issues.
Fortunately, spinlock is much easier to implement than mutex. The basic form of spinlock can be implemented with a dozen or so instructions on an x86 CPU. Indeed, NPTL does this. The Linux kernel implements spinlock in a similar way. It is also fairly easy to implement a spinlock by yourself if you do not care too much about portability:
```c
volatile int lock = 0;

void *worker(void *data)
{
	while (__sync_lock_test_and_set(&lock, 1)); // spin until the lock is acquired
	// critical section
	__sync_lock_release(&lock);
	return NULL;
}
```
where __sync_lock_test_and_set() and __sync_lock_release() are GCC atomic builtins, which guarantee atomic memory access. In the code above, the volatile keyword suggests that the lock variable may be changed by other threads. __sync_lock_test_and_set(&lock, 1) atomically sets lock to 1 and returns its previous value. If the critical section has been locked (i.e. lock=1) by another thread, we keep looping until the thread holding the lock calls __sync_lock_release(), which sets lock back to 0.
A probably better implementation is:

```c
while (__sync_lock_test_and_set(&lock, 1))
	while (lock); // spin on plain reads while the lock is held
// critical section
__sync_lock_release(&lock);
```
It is also straightforward to implement a spinlock using only the compare-and-swap (CAS) instruction, though this way is a little slower. There is more discussion of spinlock implementations elsewhere.
Evaluating mutex, spinlock and semaphore
Aside from blocking vs. busy-waiting, mutex and spinlock have very similar effects and can usually (though not always) replace each other. Which lock is preferred, then? In addition to mutex and spinlock, a semaphore can also be used as a mutex. Is it faster or slower? More generally, how much overhead does locking cause in comparison to a single-threaded program? The following benchmark aims to give a hint about these questions.
Design of the benchmark
The task is simple: we want to multithread the following code:
```c
int n = 1000000, m = 100, i;
uint64_t *bits;
bits = (uint64_t*)calloc((n + 63) / 64, 8);
for (i = 0; i < n * m; ++i) {
	int x = (uint64_t)i * i * i % n;
	bits[x>>6] ^= 1LLU << (x & 0x3f);
}
```
The `bits` array will be shared across threads and needs to be protected. We are going to evaluate the following strategies:
Method | Description
---|---
Single | Single-threaded version
Lock-free | Using GCC's atomic builtins without locking
Spin | Hand-written spinlock
Pthread spin | Spinlock from pthread
Pthread mutex | Mutex from pthread
Semaphore | Semaphore used as a lock
Buffer+spin | Buffering numbers for less frequent locking, using a spinlock
Buffer+mutex | Buffering numbers for less frequent locking, using a mutex
The C program is available in my github repository as a single C file.
Results and discussions
You can find the full results at the end of the source code file. I give part of the results in the following table, where the “OSX-1” column gives the “real / CPU” time in seconds on Mac OS X with a single thread, “Linux-2” on Linux with 2 threads, and so on. More detailed information about the platforms is given in the source code file.
Method | OSX-1 | OSX-2 | Linux-1 | Linux-2 | Linux-4 |
---|---|---|---|---|---|
Single | 1.4 / 1.4 | N/A | 1.2 / 1.1 | N/A | N/A |
Lock-free | 3.8 / 3.8 | 2.2 / 4.2 | 3.6 / 3.6 | 3.7 / 7.3 | 2.5 / 9.9 |
Spin | 5.8 / 5.7 | 5.1 / 9.5 | 5.1 / 5.1 | 20.7 / 41.2 | 32.7 / 130 |
Pthread spin | N/A | N/A | 3.8 / 3.8 | 21.1 / 42.2 | 53.8 / 206 |
Pthread mutex | 7.5 / 7.5 | 182 / 271 | 6.0 / 5.9 | 31.0 / 45.1 | 97.2 / 284 |
Semaphore | 85 / 85 | 51 / 96 | 5.5 / 5.4 | 46.0 / 68.9 | 133 / 418 |
Buffer+spin | 1.3 / 1.3 | 0.7 / 1.3 | 1.2 / 1.2 | 0.7 / 1.4 | 0.5 / 2.0 |
Buffer+mutex | 1.3 / 1.3 | 0.7 / 1.4 | 1.2 / 1.2 | 0.7 / 1.4 | 0.5 / 1.5 |
Here is what I have learned from this microbenchmark:
- For multi-threaded programming, it is preferable to let threads interact with each other as little as possible, as the CPU pays a price for frequent cross-talk between threads. This holds even without any lock (see lock-free vs. buffer+mutex); frequent locking/unlocking is much worse (see mutex vs. buffer+mutex). A proper design of the algorithm is more important than choosing between spinlock, mutex and semaphore.
- In this benchmark, the lock-free implementation is less efficient than buffer+mutex, but a lock-free algorithm may scale better to many CPU cores because contention becomes more frequent as more threads run concurrently. When frequent locking/unlocking can hardly be avoided, a lock-free algorithm, if one is possible in the first place, may also deliver better performance.
- When locking is infrequent, the type of lock matters little – little time goes to locking anyway. When locking is frequent, it is worthwhile to explore both spinlock and mutex. Although mutex is usually safer and more portable, spinlock may be faster (careless microbenchmarks tend to favor spinlock). Nonetheless, let me emphasize again that we should first try our best to reduce cross-talk between threads, which is more effective than playing with different types of locks.
- The performance of locking depends on the CPU, the OS and the library. It is worth testing our programs on multiple platforms.
How do I interpret the results? Smaller the number – better the performance? Or the other way? Also what does x / y mean?
Sorry. The first number is real time in seconds, and the second is the CPU time in seconds. So, the smaller the better.
Thanks for your reply. Do real time and CPU time correspond to time spent in user space and in kernel space respectively?
Real time is the wall clock time. CPU time equals the sum of user time plus the system time (time used by the kernel).
I like you. I will be watching this blog.
“It is also straightforward to only use the compare-and-swap (CAS) instruction to implement a spinlock, though this way is a little slower”.
Can you provide some background as to why spin-lock using CAS is a little slower than __sync_lock_test_and_set?
Thanks.
Hi,
Thanks for this information!
One question, can you break this further down into read / write performance. Ie pthread read write locks. So maybe trial three situations (i) you have a lot of reads and few writes (ii) reads are approximately the same as writes, and (iii) the task is dominated by writes.
I am trying to find out some information to see if read contention is done efficiently in the OSX pthread implementation. (From what I have read futex on Linux does this efficiently…)
Thanks,
Jason
Thank you very much. I found this very useful for my profiling code.
[…] Efficiency of locking – by Attractive Chaos (anonymous) […]
I have to say, thanks for the numbers, which are valuable, but the volatile keyword is misused here. Volatile is not meant for multi-threading purposes, and this may mislead other people who read the thread. There are many articles on the web about the misuse of the volatile keyword.
[…] I have seen this example of this on https://attractivechaos.wordpress.com/2011/10/06/multi-threaded-programming-efficiency-of-locking/. […]
https://www.realworldtech.com/forum/?threadid=189711&curpostid=189723
[…] this English blog post also covers similar content […]
[…] are mentioning it can be used to implement a spinlock instead of a mutex lock. The article Multi-threaded programming: efficiency of locking also reads: I could not think of a means to implement mutex without closely interacting with the […]