Mutex and spinlock
When multiple threads need to modify a shared resource such as a global variable, we typically lock the resource to guarantee that at most one thread modifies it at a time, so that threads cannot interfere with one another and cause unexpected behavior. There are different types of locks. For Linux C programmers, probably the most frequently used lock is the mutex from NPTL, the Native POSIX Thread Library.
In NPTL, mutex is implemented on top of the futex, a kernel primitive for assisting the implementation of locking. Essentially, futex provides a way to ask the kernel to block a thread until the value at a given address is changed, and to wake up threads waiting on an address. The implementation of mutex is kernel-dependent and complex. For now, I could not think of a means to implement mutex without closely interacting with the kernel.
Another much simpler type of lock is the spinlock. The fundamental difference between spinlock and mutex is that a spinlock keeps checking the lock (busy waiting), while a mutex puts threads waiting for the lock to sleep (blocked). A busy-waiting thread wastes CPU cycles, while a blocked thread does not. Despite this difference, mutex and spinlock behave very similarly.
According to POSIX, implementing spinlock in the pthread library is optional. Mac OS X does not implement spinlock in its pthread library; instead, it provides spinlocks through several OS-specific functions, which indeed causes portability issues.
Fortunately, spinlock is much easier to implement than mutex. The basic form of spinlock can be implemented with a dozen or so instructions on an x86 CPU. Indeed, NPTL does this. The Linux kernel implements spinlock in a similar way. It is also fairly easy to implement a spinlock by yourself if you do not care too much about portability:
```c
volatile int lock = 0;

void *worker(void *data)
{
	while (__sync_lock_test_and_set(&lock, 1)); // spin until the lock is acquired
	// critical section
	__sync_lock_release(&lock);
	return NULL;
}
```
where __sync_lock_test_and_set() and __sync_lock_release() are GCC atomic builtins, which guarantee atomic memory access. In the code above, the volatile keyword suggests that the lock variable may be changed by other threads. __sync_lock_test_and_set(&lock, 1) atomically sets lock to 1 and returns its previous value. If the critical section has been locked (i.e. lock=1) by another thread, we keep looping until the thread holding the lock calls __sync_lock_release(), which sets lock back to 0.
A probably better implementation is:

```c
while (__sync_lock_test_and_set(&lock, 1))
	while (lock); // spin on plain reads while the lock is held
// critical section
__sync_lock_release(&lock);
```
It is also straightforward to implement a spinlock using only the compare-and-swap (CAS) instruction, though this way is a little slower. There is more discussion of spinlock implementations elsewhere.
Evaluating mutex, spinlock and semaphore
Aside from blocking vs. busy-waiting, mutex and spinlock have very similar effects and can usually (though not always) replace each other. Which lock is preferred, then? In addition to mutex and spinlock, a semaphore can also be used as a mutex. Is it faster or slower? More generally, how much overhead does locking cause in comparison to a single-threaded program? The following benchmark aims to give a hint about these questions.
Design of the benchmark
The task is simple: we want to multithread the following code:
```c
int n = 1000000, m = 100, i;
uint64_t *bits;
bits = (uint64_t*)calloc((n + 63) / 64, 8);
for (i = 0; i < n * m; ++i) {
	int x = (uint64_t)i * i * i % n;
	bits[x>>6] ^= 1LLU << (x & 0x3f);
}
```
The `bits` array will be shared across threads and needs to be protected. We are going to evaluate the following strategies:
Method | Description
---|---
Single | Single-threaded version
Lock-free | Using GCC's atomic builtins without locking
Spin | Hand-written spinlock
Pthread spin | Spinlock from pthread
Pthread mutex | Mutex from pthread
Semaphore | Semaphore used as a lock
Buffer+spin | Buffering numbers for less frequent locking, using a spinlock
Buffer+mutex | Buffering numbers for less frequent locking, using a mutex
The C program is available in my github repository as a single C file.
Results and discussions
You can find the full results at the end of the source code file. I give part of the results in the following table, where the “OSX-1” column gives the “real / CPU” time in seconds on Mac OS X with a single thread, “Linux-2” on Linux with 2 threads, and so on. More detailed information about the platforms is given in the source code file.
Method | OSX-1 | OSX-2 | Linux-1 | Linux-2 | Linux-4 |
---|---|---|---|---|---|
Single | 1.4 / 1.4 | N/A | 1.2 / 1.1 | N/A | N/A |
Lock-free | 3.8 / 3.8 | 2.2 / 4.2 | 3.6 / 3.6 | 3.7 / 7.3 | 2.5 / 9.9 |
Spin | 5.8 / 5.7 | 5.1 / 9.5 | 5.1 / 5.1 | 20.7 / 41.2 | 32.7 / 130 |
Pthread spin | N/A | N/A | 3.8 / 3.8 | 21.1 / 42.2 | 53.8 / 206 |
Pthread mutex | 7.5 / 7.5 | 182 / 271 | 6.0 / 5.9 | 31.0 / 45.1 | 97.2 / 284 |
Semaphore | 85 / 85 | 51 / 96 | 5.5 / 5.4 | 46.0 / 68.9 | 133 / 418 |
Buffer+spin | 1.3 / 1.3 | 0.7 / 1.3 | 1.2 / 1.2 | 0.7 / 1.4 | 0.5 / 2.0 |
Buffer+mutex | 1.3 / 1.3 | 0.7 / 1.4 | 1.2 / 1.2 | 0.7 / 1.4 | 0.5 / 1.5 |
Here is what I have learned from this microbenchmark:
- For multi-threaded programming, it is preferable to let threads interact with each other as little as possible, as the CPU pays a price for frequent cross-talk between threads. This holds even without any lock (see lock-free vs. buffer+mutex); frequent locking/unlocking is much worse (see mutex vs. buffer+mutex). A proper design of the algorithm is more important than choosing between spinlock, mutex and semaphore.
- In this benchmark, the lock-free implementation is less efficient than buffer+mutex, but a lock-free algorithm may scale better to many CPU cores because contention becomes more frequent as more threads run concurrently. When frequent locking/unlocking can hardly be avoided, a lock-free algorithm, if one is possible in the first place, may also deliver better performance.
- When locking is infrequent, the type of lock matters little – little time goes to locking anyway. When locking is frequent, it is worthwhile to explore both spinlock and mutex. Although mutex is usually safer and more portable, spinlock may be faster (careless microbenchmarks tend to favor spinlock). Nonetheless, let me emphasize again that we should first try our best to reduce cross-talk between threads, which is more effective than playing with different types of locks.
- The performance of locking depends on the CPU, the OS and the library. It is worth testing our programs on multiple platforms.
How do I interpret the results? Smaller the number – better the performance? Or the other way? Also what does x / y mean?
Sorry. The first number is real time in seconds, and the second is the CPU time in seconds. So, the smaller the better.
Thanks for your reply. Do real time and CPU time correspond to time spent in user space and in kernel space respectively?
Real time is the wall clock time. CPU time equals the sum of user time plus the system time (time used by the kernel).
I like you. I will be watching this blog.
“It is also straightforward to only use the compare-and-swap (CAS) instruction to implement a spinlock, though this way is a little slower”.
Can you provide some background as to why spin-lock using CAS is a little slower than __sync_lock_test_and_set?
Thanks.
Hi,
Thanks for this information!
One question, can you break this further down into read / write performance. Ie pthread read write locks. So maybe trial three situations (i) you have a lot of reads and few writes (ii) reads are approximately the same as writes, and (iii) the task is dominated by writes.
I am trying to find out some information to see if read contention is done efficiently in the OSX pthread implementation. (From what I have read futex on Linux does this efficiently…)
Thanks,
Jason
Thank you very much. I found this very useful for my profiling code.
[…] Efficiency of locking – by Attractive Chaos (anonymous) […]
I have to say, thanks for the numbers, which are valuable, but the volatile keyword is misused here. Volatile is not meant for multi-threading purposes, and this may mislead other people who read the thread. There are many articles on the web about the misuse of the volatile keyword.
[…] I have seen this example of this on https://attractivechaos.wordpress.com/2011/10/06/multi-threaded-programming-efficiency-of-locking/. […]
https://www.realworldtech.com/forum/?threadid=189711&curpostid=189723
[…] this English blog post also covers similar content […]
[…] are mentioning it can be used to implement a spinlock instead of a mutex lock. The article Multi-threaded programming: efficiency of locking also reads: I could not think of a means to implement mutex without closely interacting with the […]