Today I’d like to talk about one particular day’s work that I found interesting.
Our team was recently in the middle of investigating read performance for small but numerous requests, a workload that had proven to be CPU-bound. About half of our CPU time was spent contending on locks, and we had some success working down the list of top contenders in an effort to eke out more IOPS. Having already reduced lock contention in several areas, I began to wonder how much of an impact the locks themselves had on performance.
I set aside some time to look at our most commonly used lock implementation, with the idea of seeing whether we could make all of our contended locks a little bit faster. I discovered two implementations: one “spinlock” that was a wrapper for pthread_mutex, and another, less commonly used one that was a wrapper for a futex. A pthread_mutex is a massive 40-byte structure provided by the pthreads library, while a futex is just 4 bytes. I decided to get some data about how they performed, but already I was starting to like the idea of optimizing the futex-based implementation and making it the standard.
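If you want to verify the size difference yourself, a quick check does it. The 40-byte figure is what glibc uses on x86-64 Linux; other platforms vary:

```
#include <pthread.h>
#include <cstdio>

int main() {
    // On x86-64 Linux with glibc this prints 40.
    // A futex, by contrast, needs only a single 4-byte integer.
    printf("sizeof(pthread_mutex_t) = %zu\n", sizeof(pthread_mutex_t));
    return 0;
}
```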
Before running off and making larger changes, I decided to benchmark both locks. In my experience, the locks causing performance problems tended to be heavily contended, so I wrote a microbenchmark that spins up four threads, each of which repeatedly acquires a lock, increments an integer, and releases the lock. I also recognized that any overhead in uncontended acquisition speed would probably be difficult to detect in the product, so I wrote a single-threaded variant as well. I let each variant run for 30 seconds before checking the counter. Here are the initial results, followed by a sketch of the benchmark loop:
|  | Uncontended | Contended (4 Threads) |
| --- | --- | --- |
| Pthread Mutex | ~47 ma/s* | ~13 ma/s* |
| Futex | ~53 ma/s* | ~8.4 ma/s* |

* million acquisitions/second
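For concreteness, here is roughly what the contended variant of the benchmark looked like. This is a minimal sketch, not our actual harness; std::mutex stands in for whichever lock implementation is under test:

```
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <mutex>   // stand-in: swap in the lock implementation under test
#include <thread>
#include <vector>

int main() {
    std::mutex lock;             // the lock under test
    uint64_t counter = 0;        // protected by `lock`
    std::atomic<bool> stop{false};

    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) {
        threads.emplace_back([&] {
            while (!stop.load(std::memory_order_relaxed)) {
                lock.lock();
                ++counter;       // the critical section is deliberately tiny
                lock.unlock();
            }
        });
    }

    std::this_thread::sleep_for(std::chrono::seconds(30));
    stop.store(true);
    for (auto& t : threads) t.join();

    printf("%.1f million acquisitions/second\n", counter / 30.0 / 1e6);
    return 0;
}
```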
I validated these results with a quick test on the actual product, where I swapped out the Pthreads implementation for the Futex implementation. Sure enough, the Futex implementation regressed performance. This was a frustrating result, since I happen to know that pthread_mutex is implemented on top of futex on this platform. I took a closer look at the Futex implementation. Here are the important parts in pseudo-code; I’m intentionally simplifying out the details of how futexes work because they’re largely irrelevant to this story.
```
bool trylock(futex_lock *self) {
    return !atomic_bool_exchange(&self->is_locked, true);
}

void lock(futex_lock *self) {
    for (int i = 0; i < SOME_NUMBER; ++i) {
        if (trylock(self))
            return;
        pause();  // hint to the CPU that we're in a spinlock loop
    }
    while (true) {
        futex_sleep(&self->is_locked);
        if (trylock(self))
            return;
    }
}

void unlock(futex_lock *self) {
    atomic_bool_set(&self->is_locked, false);
    futex_wake_all(&self->is_locked);
}
```
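The pseudo-code treats futex_sleep and futex_wake_all as black boxes. For the curious, on Linux they would map roughly onto the futex syscall as below. This is my sketch, not the product’s actual wrappers; note that a real futex word is 32 bits (not a bool), and that a real wait takes an expected value, which the pseudo-code above elides:

```
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <atomic>
#include <climits>
#include <cstdint>

// Sleep only if *addr still holds `expected`. The kernel rechecks the
// value atomically, so a waiter can't miss a wakeup that races with
// its decision to sleep.
static void futex_sleep(std::atomic<uint32_t>* addr, uint32_t expected) {
    syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected,
            nullptr, nullptr, 0);
}

// Wake every thread currently sleeping on this address.
static void futex_wake_all(std::atomic<uint32_t>* addr) {
    syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, INT_MAX,
            nullptr, nullptr, 0);
}
```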
The most obvious thing to change here was SOME_NUMBER, followed maybe by pause. After a few iterations of trying different numbers and some different ideas for how to pause, I discovered that the most efficient thing to do was to not spin at all. The new lock looked somewhat like this:
```
void lock(futex_lock *self) {
    while (true) {
        if (trylock(self))
            return;
        futex_sleep(&self->is_locked);
    }
}
```
|  | 1 Thread | 4 Threads |
| --- | --- | --- |
| Pthread Mutex | ~47 ma/s* | ~13 ma/s* |
| Futex | ~53 ma/s* | ~8.4 ma/s* |
| Futex (without spinning) | ~53 ma/s* | ~17 ma/s* |

* million acquisitions/second
This may be surprising given that sleeping on a futex is a syscall, and therefore pretty expensive. Something else expensive must be happening while we spin. On Intel CPUs, cache lines move between three states (in the simplified MSI model of the coherence protocol). “M” (modified) means that this core has exclusive access (safe to write). “S” (shared) means that the cache line is clean, but may be in another core’s cache as well (safe to read). “I” (invalid) means that another core may have exclusive access to the cache line (unsafe without asking). When multiple cores perform atomic accesses to data in the same cache line, they trade exclusive access back and forth, and the line ping-pongs between the M and I states. This is an expensive operation that can involve broadcasting messages between CPU sockets.
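You can see this ping-ponging cost directly with a toy experiment (mine, not from our codebase): time two threads doing atomic increments on counters that likely share a cache line versus counters padded onto separate lines.

```
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// Two adjacent counters very likely share a cache line; alignas(64)
// pads them onto separate lines (64 bytes is the common line size).
struct SameLine  { std::atomic<long> a{0}, b{0}; };
struct SplitLine { alignas(64) std::atomic<long> a{0};
                   alignas(64) std::atomic<long> b{0}; };

template <typename T>
static double run() {
    T counters;
    auto worker = [](std::atomic<long>* c) {
        for (long i = 0; i < 50'000'000; ++i)
            c->fetch_add(1, std::memory_order_relaxed);
    };
    auto start = std::chrono::steady_clock::now();
    std::thread t1(worker, &counters.a), t2(worker, &counters.b);
    t1.join(); t2.join();
    return std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
}

int main() {
    // The shared-line version is typically several times slower, purely
    // because the cores keep trading exclusive ownership of the line.
    printf("same line:  %.2fs\n", run<SameLine>());
    printf("split line: %.2fs\n", run<SplitLine>());
    return 0;
}
```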
With this background, our performance improvement makes a lot of sense, and it opens up some more avenues for optimization. futex_wake_all looks like a particularly important problem: when there is more than one waiter, only one of them will succeed in taking the lock, but all of them will fight for exclusive access to the lock’s cache line. After some fiddling, I came up with something like the following.
```
bool trylock(futex_lock *self) {
    auto old_state = atomic_uint_compare_exchange(&self->state,
                                                  LOCK_FREE, LOCK_HELD);
    return old_state == LOCK_FREE;
}

void lock(futex_lock *self) {
    auto old_state = atomic_uint_compare_exchange(&self->state,
                                                  LOCK_FREE, LOCK_HELD);
    if (old_state == LOCK_FREE)
        return;
    // Someone holds the lock; advertise that we're waiting so that
    // unlock() knows it has to wake somebody.
    if (old_state == LOCK_HELD)
        old_state = atomic_uint_exchange(&self->state, LOCK_CONTENDED);
    while (old_state != LOCK_FREE) {
        futex_wait_if_value_equals(&self->state, LOCK_CONTENDED);
        old_state = atomic_uint_exchange(&self->state, LOCK_CONTENDED);
    }
}

void unlock(futex_lock *self) {
    auto old_state = atomic_uint_exchange(&self->state, LOCK_FREE);
    if (old_state == LOCK_CONTENDED)
        futex_wake_one(&self->state);
}
```
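For readers who want something compilable, here is how the same 3-state design might look with C++ atomics and raw futex syscalls on Linux. The names and memory orderings are my own reconstruction, not the production code:

```
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <atomic>
#include <cstdint>

enum : uint32_t { LOCK_FREE = 0, LOCK_HELD = 1, LOCK_CONTENDED = 2 };

struct futex_lock {
    std::atomic<uint32_t> state{LOCK_FREE};
};

static long futex(std::atomic<uint32_t>* addr, int op, uint32_t val) {
    return syscall(SYS_futex, addr, op, val, nullptr, nullptr, 0);
}

bool trylock(futex_lock* self) {
    uint32_t expected = LOCK_FREE;
    return self->state.compare_exchange_strong(expected, LOCK_HELD,
                                               std::memory_order_acquire);
}

void lock(futex_lock* self) {
    uint32_t expected = LOCK_FREE;
    if (self->state.compare_exchange_strong(expected, LOCK_HELD,
                                            std::memory_order_acquire))
        return;  // fast path: the lock was free
    // Slow path: mark the lock contended so unlock() knows to wake someone.
    // Whichever thread sees LOCK_FREE come back from the exchange owns
    // the lock; everyone else sleeps until woken.
    while (self->state.exchange(LOCK_CONTENDED,
                                std::memory_order_acquire) != LOCK_FREE)
        futex(&self->state, FUTEX_WAIT_PRIVATE, LOCK_CONTENDED);
}

void unlock(futex_lock* self) {
    // Only pay for the wake syscall if a waiter recorded contention.
    if (self->state.exchange(LOCK_FREE,
                             std::memory_order_release) == LOCK_CONTENDED)
        futex(&self->state, FUTEX_WAKE_PRIVATE, 1);
}
```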
This new design has some pretty big advantages for contended performance. By adding a contended state, we make it safe to wake only one waiter at a time. While it’s still possible to call wake unnecessarily, we’ve removed the extra cache line contention caused by waking every waiter at once. This led to some additional increases in microbenchmark performance.
|  | 1 Thread | 4 Threads |
| --- | --- | --- |
| Pthread Mutex | ~47 ma/s* | ~13 ma/s* |
| Futex | ~53 ma/s* | ~8.4 ma/s* |
| Futex (without spinning) | ~53 ma/s* | ~17 ma/s* |
| Futex (3-state) | ~53 ma/s* | ~20 ma/s* |

* million acquisitions/second
While there still seem to be some minor areas for improvement, this is far enough ahead of the Pthread Mutex that we should see gains in the real product if we swap it in. When running our random read IOPS benchmark, the new lock gave us a ~4 percent boost. That may not sound like much, but the change had a very broad impact across many locking-bound workloads for the product. The 40-byte to 4-byte reduction in size also reduced our total RAM consumption in some scenarios by up to a gigabyte.