
Multi-Core Processors are a Headache for Multithreaded Code

The multi-core revolution that spread throughout the computing industry over the last decade has dramatically increased the number of developers who face the challenge of building parallel software.

Multithreading is one of the most common foundations upon which developers build parallel software, but writing robust applications with multithreading is extremely hard. Paul's earlier post about concurrency problems addressed some of the nasty multithreading pitfalls that developers can fall into. I'm going to focus here on the related issue of transitioning a project that is already multithreaded from single processor platforms to multi-processor platforms (including multi-core).

It might seem counter-intuitive that there are problems specific to transitioning multithreaded code to a multi-core world. Shouldn't you be able to just move the application and get the performance benefit of running threads in parallel on different processors? It's not quite that simple. It is fairly widely understood that multithreading bugs are more likely to manifest on multi-processor platforms; this post explores a couple of the reasons for this fact.

First an important observation: multithreading is really a dual-purpose technology. Threads are used to build parallel software for multi-processor systems, and they are also used to asynchronously handle interactions with other software and the real world. The latter use case is relevant even if the software is running on a single processor, which is why lots of multithreaded code existed before multi-processor systems were common.

Homebrew Synchronization Sometimes Works on a Single Processor

The first example we will consider is the lazy initialization pattern:

    special_object *get_special_object( void ) {
        static special_object *obj = NULL;
        if( obj == NULL ) {
            obj = malloc( sizeof( *obj ) ); /* allocation */
            init_special_obj( obj ); /* initialization */
        }
        return obj;
    }

This code allocates and initializes one global instance of some special object, but it doesn't do it until there is an actual need for the object (that's why it's called "lazy"). This code works perfectly in a single-threaded context, but can fail badly in a multithreaded context.

Consider the situation where one thread enters the conditional block because "obj" is NULL and then is interrupted by the scheduler. A second thread calls "get_special_object" and does the allocation and initialization itself, because "obj" is still NULL. When the first thread resumes it will do its own initialization and now you have two distinct copies of the special object floating around. At best this is a resource leak; more likely, there is some program logic that depends on the invariant that there is at most one instance of the special object.

The only simple and portable way to fix this pattern for multithreading is to wrap the whole procedure in a critical section (using a mutex or whatever). Sometimes programmers dislike the overhead of synchronizing on every call to "get_special_object", so they try to work around it with the "double-checked locking" pattern. If you're unfamiliar with the pattern and why it's so evil, I recommend a quick Google search.
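As a sketch, here is the straightforwardly synchronized version using POSIX threads. The type "special_object" and the helper "init_special_obj" are stand-ins for whatever your real code uses:

```c
#include <pthread.h>
#include <stdlib.h>

/* Stand-ins for the type and helper used in the post. */
typedef struct { int ready; } special_object;
static void init_special_obj( special_object *o ) { o->ready = 1; }

special_object *get_special_object( void ) {
    static special_object *obj = NULL;
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_lock( &m );   /* every caller pays for the lock... */
    if( obj == NULL ) {         /* ...but the check-and-create is atomic */
        special_object *tmp = malloc( sizeof( *tmp ) );
        init_special_obj( tmp );
        obj = tmp;
    }
    pthread_mutex_unlock( &m );
    return obj;
}
```

Every caller acquires the mutex, even after the object exists; that per-call overhead is exactly what tempts people toward double-checked locking.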

Once you're familiar with the terribleness of the double-checked locking pattern, it may surprise you that there is a version that actually works, provided that you somehow convince your compiler not to perform any interesting optimizations (or you write the code directly in machine code) and you only ever run on a single-processor system: [Never, ever use code like this in a production system.]

    special_object *get_special_object( void ) {
        static special_object *obj = NULL;
        static mutex m;
        if( obj == NULL ) {
            acquire( &m );
            if( obj == NULL ) {
                special_object *tmp =
                    malloc( sizeof( *tmp ) ); /* allocation */
                init_special_obj( tmp ); /* initialization */
                obj = tmp;
            }
            release( &m );
        }
        return obj;
    }

The reason that this code can behave differently on single- and multi-processor systems has to do with caches. All popular multi-processor architectures have complex memory models that do not guarantee that memory operations performed by one processor appear in the same order from the perspective of another processor. The motivation for the complexity of these memory models is that individual processors need flexibility in how they manage their caches.

In this example, it is possible that a thread running on another processor will "see" the updated "obj" pointer before "seeing" the initialized memory that it points to. As a consequence, that thread could follow the pointer and read uninitialized junk. This failure mode is impossible on a single-processor system, because even if the initializing thread is interrupted at an inopportune time, all threads use the same cache hierarchy and therefore cannot see the kind of inconsistent memory contents that are possible on multi-processor systems.

Let me be clear: I am not recommending reliance on highly platform-dependent and complex behavior. The only portable way to avoid these kinds of problems is to ensure that your code is properly synchronized (that is, that your code obeys the multithreaded memory model of your source language). Many multithreading APIs provide a "once" primitive that you should use for all your lazy initialization needs.
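With POSIX threads, for example, that primitive is pthread_once. A minimal sketch ("special_object" is again a stand-in for your real type):

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct { int ready; } special_object; /* stand-in type */

static special_object *obj = NULL;
static pthread_once_t obj_once = PTHREAD_ONCE_INIT;

static void create_obj( void ) {
    obj = malloc( sizeof( *obj ) );
    obj->ready = 1;
}

special_object *get_special_object( void ) {
    /* pthread_once guarantees create_obj runs exactly once, no matter
       how many threads race here, and that every caller sees the fully
       initialized object afterwards; the memory-ordering headaches of
       double-checked locking are the library's problem, not yours. */
    pthread_once( &obj_once, create_obj );
    return obj;
}
```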

The important conclusion here is that extensive, even exhaustive, validation on a single-processor system does not guarantee that your multithreaded code is free of bugs that will manifest on a multi-processor system.

Multi-cores Drive Programs into Odd Corners of the Scheduling Universe

Avoiding exotic synchronization patterns, such as double-checked locking, ameliorates but does not eliminate the difference between single- and multi-processor systems. Consider this procedure, which is used as a thread entry point:

    void thread_entry( void ) {
        do_some_initialization( self() );
        /* ... do other stuff ... */
    }

Assume that inside do_some_initialization various bits of global state are updated to register the existence of the new thread. Assume further that those state updates are individually synchronized (so there are no data races). Finally, assume that it's bad for another thread to observe partially updated state. This is a classic kind of concurrency bug called an atomicity violation. The programmer implicitly assumed that a collection of state updates (the initialization) will happen atomically, even though it is technically possible for another thread to observe partially updated state.

This example is particularly interesting because when a new thread is created on a single-processor system it usually gets a relatively long time (in software terms) to execute before it is interrupted to let someone else go. So it is entirely possible that an atomicity violation right at the beginning of a thread's existence would never cause a problem on a single processor. On the other hand, when this code is run on a multi-processor, the parent thread can continue executing while the child initializes itself. In that situation it is quite easy for the parent to observe the bad, partially updated, state. By going from a single processor to a multi-processor, we have taken the probability of this atomicity violation actually being observed from nearly zero to a level that should make any responsible software engineer uncomfortable.
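To make the pattern concrete, here is a hypothetical sketch of what do_some_initialization might look like, assuming a global registry consisting of a counter and an id table. Each update is individually protected by a lock, so there are no data races, yet a thread that reads the registry between the two updates sees the count run ahead of the table:

```c
#include <pthread.h>

#define MAX_THREADS 64

static pthread_mutex_t reg_lock = PTHREAD_MUTEX_INITIALIZER;
static int thread_count = 0;              /* number of registered threads */
static pthread_t thread_ids[MAX_THREADS]; /* their ids */

void do_some_initialization( pthread_t self ) {
    pthread_mutex_lock( &reg_lock );
    thread_count++;                       /* update #1: bump the count */
    pthread_mutex_unlock( &reg_lock );
    /* A thread reading the registry at this point sees thread_count
       already incremented but thread_ids not yet updated: each update
       is synchronized, but the pair is not atomic. */
    pthread_mutex_lock( &reg_lock );
    thread_ids[thread_count - 1] = self;  /* update #2: record the id */
    pthread_mutex_unlock( &reg_lock );
}
```

The fix, of course, is to hold the lock across both updates so that observers see either none of the initialization or all of it.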

In general, many concurrency bugs have fairly small windows of vulnerability. On a single-processor system, the manifestation of the bug depends on a thread being interrupted by the scheduler in the middle of such a window. In normal operation, thread schedulers tend to have somewhat predictable and consistent behavior (e.g. letting a thread run for a certain time slice, or choosing which thread to run next in some consistent pattern). Consequently, testing multithreaded code on a single-processor tends to leave huge portions of the universe of possible thread interactions unexplored. On multi-processor systems, even minor perturbations like cache misses can have a big impact on how events (like memory reads and writes) from different threads align relative to each other. These changes can in turn trigger concurrency bugs that would almost never occur on a single-processor system.

What to Do About It

The main message here is that transitioning a multithreaded program that has been thoroughly validated on single-processor systems to multi-processor systems carries a surprisingly high risk of exposing previously latent concurrency bugs.

Fixing up these concurrency defects in established code bases can be an extremely expensive endeavor. A small but vocal group of concurrency experts believe that most application developers should simply not use threads at all. For reactive/interactive programming, event handling loops are generally far less bug-prone than threads. For parallel processing, isolated processes are generally safer (though less convenient) than threads.

However, for many projects, threads are still the default choice for interactive and/or parallel programming. How can you make such programs safer and more robust? One important lesson to remember is that conventional testing techniques, especially on single-processor systems, are basically useless for exposing subtle concurrency bugs.

One approach that provides a partial solution is a tool like CHESS from Microsoft Research. CHESS systematically explores a carefully constructed subset of the possible schedules of a multithreaded program. The major insight behind the CHESS project is that during testing the thread scheduler can deliberately make pathologically bad choices, dramatically improving the probability of exposing concurrency bugs relative to random scheduling.

In addition to testing tools, static analysis tools can help identify potentially unsafe concurrency patterns. Static analysis examines code symbolically, so it can identify potential problems without needing particular input data or thread schedules that would make those problems manifest. Because of this, the problems identified by static analysis are usually complementary to those identified by testing.

Static analysis tools that find concurrency bugs include ThreadSafe from Contemplate and GrammaTech's own CodeSonar. The most recent version of CodeSonar includes checks for unsafe concurrent code and can help locate serious concurrency errors.