Wednesday, July 23, 2008

High Performance Multithreaded Code

Current CPUs have several cores per CPU. If a program wants to speed up with new hardware, the program has to exploit those extra cores. Using multiple threads is, therefore, becoming extremely popular.

Of course, people who talk a lot about multithreaded programming don't ever mention that most programs don't need to be any faster. While I feel obligated to point that out, this article is written for people who do want their applications to run faster.

In fact, I'm going to go even farther than that. This is for people who want to squeeze every last bit of performance out of their multithreaded code. This isn't for everyone.

Since I don't have a lot of experience with this, I'm going to talk about two books that I've read. They both talk about specific topics that are, I suspect, absolutely essential to some extremely high performance multithreaded code.

Interestingly, neither book talks about what I'm talking about directly. The books have absolutely no overlap, either. They both talk about different ends of the same problem without looking at the whole problem.

I enjoyed both of them.

The first, The Art of Multiprocessor Programming, was written by a couple of academics. It's a highly theoretical look at lock free and wait free data structures. It never talks about real hardware. It's also fascinating.

The second, Code Optimization: Effective Memory Usage, is an extremely practical guide to how modern hardware deals with a critical resource, memory. It talks in detail about what the hardware is doing. It doesn't touch algorithms that avoid the many problems that it talks about. It's a bit out of date as well, but it's still worth spending time with.

The Art of Multiprocessor Programming

You wouldn't be able to tell from the cover or the publisher's description of it, but this book is about lock free and wait free algorithms.

A component is lock free when many threads can access the routines and at least one thread will always make progress. If a thread holds a mutex, this is not possible. The thread with the mutex could page fault and be forced to wait. During that wait, no thread will make progress.

Wait free is an even stronger constraint. A routine is wait free if it will complete in a finite number of steps. That is, all threads will simultaneously make progress.

It's possible for a data structure to have some routines that are wait free and others that are merely lock free. The authors frequently try to make the most critical routines wait free and the less important ones lock free.

Lock free programming is a topic that's always fascinated me. It seems incredibly difficult. Researchers like the authors must agree, because there aren't that many lock free algorithms in the literature, yet. There are a few data structures out there, and a lot of work has been done on critical algorithms like heap management routines. There isn't much else, though.

The book, however, walks you through the techniques that are needed to build these algorithms. They describe and analyze the algorithms in ways that I don't normally bother with. Mathematical proofs appear to be critical to their process. Don't worry too much, though. None of the proofs outlined in the book were difficult to follow. Without the proofs, I would have had a difficult time understanding what they were doing, too.

Here's an example of an interesting theorem in the book. Modern processors have a variety of atomic instructions that are designed to help avoid locks. These instructions are critical to lock free programming. Examples include atomic increment and compare and swap.

Lock free algorithms replace locking a mutex with a number of these atomic instructions. Someone created a theorem that essentially states that most of these instructions are pathetic. (I'm only paraphrasing slightly.) Compare and swap is proven to be useful, however.

Lock free articles talk a lot about compare and swap. It's nice to understand why!

Incidentally, despite the title of my post, lock free algorithms are not necessarily faster than a conventional algorithm wrapped in a lock.

The Art of Multiprocessor Programming doesn't talk about it, but these atomic operations are expensive. They require a memory barrier. This requires communication with all of the other cores in the computer, and it's slow.

Arch, an Intel developer, puts it nicely here:
My opinion is that non-blocking algorithms are important in situations where lock preemption (arising from oversubscription) is an issue. But when oversubscription is not an issue, classic locked algorithms, combined with designing to avoid lock contention, will generally out perform non-blocking algorithms on current Intel hardware. I don't know about other hardware, but observe that:
  1. Non-blocking algorithms generally use more atomic operations than it takes to acquire a lock.
  2. Atomic operations in non-blocking algorithms generally need a memory fence.
  3. Memory fences incur an inherent penalty on out-of-order processors or processors with caches.
Do keep that in mind when you read the book! If your algorithm uses too many of these atomic operations, there's no point in doing it. Locking a mutex doesn't require many of these operations.

The authors act like typical academics and ignore this problem completely. :)

Code Optimization: Effective Memory Usage

This book is dated 2003. It's several processor generations out of date. Don't panic, though. It turns out that a lot of what Kris Kaspersky says has been true for far longer than that.

There's a good chance that some of his discussion of ways to exploit specific CPU generations isn't useful anymore. However, interleaving memory bank access, N-way associative cache behavior, and many other interesting properties of memory are unlikely to change immediately.

You'd think that memory technology would change enough that the same quirky code optimizations wouldn't work for a whole decade. Apparently, you'd be wrong.

This is, as you might imagine, the exact opposite of the previous book. This is about how the memory systems in a (somewhat) modern PC work. This is about the details of machine architecture and how to use those details to speed up your code.

In fact, translating the information in this book to highly parallel computing will require some thought. It was written without much thought to the behavior of multicore processors.

That's not the point, though. Multithreaded programming is all about memory access. If you poke the memory in the wrong order, your program will slow way, way down. Compilers are not yet smart enough to do all the work for you.

It's worth talking a bit about how important memory access is. Here's a slide shamelessly stolen from Robert Harkness at the San Diego Supercomputer Center:

Performance in on a log scale. Memory bandwidth has a dramatically lower slope than CPU speed.

You could say that, eventually, performance will be entirely dominated by the usage of memory. However, we're almost there already. High performance data-parallel programming requires the knowledge in this book.

I do wish he'd write a second edition, though. Some of the chip-specific discussions are interesting, but they aren't necessarily relevant anymore.

Lower Performance Multithreaded Code

I should emphasize that most of the stuff in both of these books is useful for pushing performance a bit faster than you'd thought possible.

If you haven't already gotten to the point where you think your code can't be sped up, then you're likely to have more serious problems that will erase the improvements available from these books.

Actually, there's one significant exception to that. If you're using a low associativity cache poorly, then you could get almost no cache utilization. In some cases, you can make minor changes to your memory usage and go from no cache utilization to good cache utilization. That's probably a more important change than using a good algorithm.

Generally, however, these books are not what you need to make your web page load faster. None of the code that I've written in the last few years needs this. I'll keep looking for applications, though, because optimizing performance in exotic ways is a fun!

I enjoyed both books a lot even though they didn't seem directly applicable. I hope you will, too.

No comments: