Warning! Threading in a multiprocessor world

Find out why many tricks to avoid synchronization overhead just don't work

I had intended this month to carry on with presenting a caching class loader -- that was the whole point of the zip-file access classes I presented in the last couple of columns. I've decided to digress for this month, however, on a threading-related issue that's important enough that I wanted to bring it to your attention sooner rather than later.

Double-checked locking can be hazardous to your code!

This week JavaWorld focuses on the dangers of the double-checked locking idiom. Read more about how this seemingly harmless shortcut can wreak havoc on your code:

This article discusses why a whole category of techniques that programmers misguidedly use to avoid synchronization overhead simply don't work. Probably the most common of those is double-checked locking, which I'll discuss in a moment. In a recent posting to a Java Memory Model mailing list, Bill Pugh at the University of Maryland (one of the folks who's spearheading the rewriting of the JLS (Java Language Specification) to compensate for some of the problems discussed this month) said, "...this double-check idiom is like a roach infestation. It doesn't matter how many of them you crush under your shoe, more of them crawl out from under the refrigerator." Well, I hope this article will help stamp out the swarm.

So what's the problem?

In the beginning, a CPU could write directly to memory simply by wiggling the voltages on a few wires that connected the CPU chip to the memory chip. All was well with the world. Even with multithreaded systems, only one path to memory existed, and reads and writes to memory always occurred whenever the CPU executed the associated machine instruction. The introduction of memory caches didn't fundamentally change that model (once they got the cache-coherency bugs worked out). Indeed, the cache is transparent to the program if it's implemented correctly. That simple memory model -- the CPU issues an instruction that modifies memory with an immediate effect -- remains in most programmers' minds.

Then somebody had the bright idea that two processors could run in the same box at the same time, sharing a common memory store (Suddenly, the world became much more complicated). In that situation, a given CPU can no longer access memory directly because another CPU might be using the memory at the same time. To solve the problem, along came a traffic-cop chip, called a memory unit. Each CPU was paired with its own memory unit, and the various memory units coordinated with each other to safely access the shared memory. Under that model, a CPU doesn't write directly to memory but requests a read or write operation from its paired memory unit, which updates the main memory store when it can get access, as seen in Figure 1. Those early memory units effectively managed simple read or write request queues.

Figure 1. The memory unit

Relaxed memory

All was well with the world until some bright engineer (I believe at DEC -- at least the Alpha, to my knowledge, was the earliest machine that worked like that) noticed a way to optimize memory operations with the hardware. Consider this code:

int a[] = new int[n];
int b[] = new int[n];
for( int i = 0; i < a.length; ++i )
    b[i] = a[i];

The resulting read/write-request queue is shown in Figure 2. Our intrepid engineer said, "Ah ha! If I allow the memory units to shuffle around the order of pending read/write requests while they're waiting to get access to the main memory store (giving us Figure 3), then I can turn all these atomic moves into one big block move." That is, the memory unit can transform the requests in Figure 2 into the situation shown in Figure 4. That reordering strategy became known as a relaxed memory model.

Figure 2. Copying an array: the initial state
Figure 3. Copying an array: operations rearranged
Figure 4. Optimizing the array copy

"Oooooooh!" said our engineer, "now things are a lot faster."

"But, but, but," said the programmer, "now the order in which I write things to memory may not be the order in which the writes actually occur!" (More correctly, the writes might become visible in nonsource code order.)

"Well," said the engineer, "I hate to slow things down unnecessarily, but what if I give you a way to prevent requests from moving beyond certain well-defined boundaries, called memory barriers? Requests issued between two memory barriers can be reordered, but no request can move past a memory barrier." (I'm simplifying a bit, since sometimes movement is possible, but that's the central concept.)

Figure 5 shows a couple of memory barriers. Requests are not permitted to migrate across the barrier. That way, at least we can guarantee that a write can occur before a subsequent read request is issued.

Figure 5. Memory barriers

So why do I care?

So why, you might ask, do I care about all that? The answer: programmers are too clever for their own good. They read how synchronized works: it goes out to the OS and acquires an exclusion semaphore (or mutex); an expensive operation, so too-clever programmers try to work around it. A classic too-clever example tries to leverage the fact that modification of a boolean is atomic. For example:

boolean is_okay;
long    value;
void f()
    if( is_okay )
        do_something( value );
void g( long new_value )
    value = new_value;
    is_okay = true;

In the example above, the f() method is called from one thread, and the g() method is called from another. The problem is that, when the two threads run on separate processors, the modifications of value and is_okay are found within the same memory barriers and are therefore subject to reordering. The is_okay field can effectively be set true before the new value of value is available. (A note for the expert-level precision impaired: don't quibble with me about visibility versus actual modification -- the effect is the same.) Note that the reordering would not matter in a single-threaded situation, where g() would complete in its entirety before either value could be used.


So what's a programmer to do? Synchronize! Figure 6 shows the situation when a synchronization occurs. When you dig deep enough into the bowels of the operating system, you'll find that releasing a mutex is effectively nothing but setting a flag, and acquiring a mutex is a loop (called a spin lock) that does an atomic-bit-test-and-set operation. (In one memory cycle, the instruction tests a bit in memory and sets it true if the result was false. The instruction simultaneously sets a machine register to indicate whether or not it modified the bit.) Most importantly, those operations are effectively surrounded by memory barriers. Looking at Figure 6, all the memory operations that occur on CPU1 (to the right of the memory barrier that surrounds the mutex release) must be visible in main memory before any read operations on CPU2 (to the left of the memory barrier that surrounds the mutex-acquire operation) can be executed.

Figure 6. Memory barriers surround synchronization

To summarize, synchronization implies a memory barrier. In that case, two exist: one barrier associated with entering a synchronized block or method and another associated with leaving it:

synchronized( xxx )
{   // code here is bracketed by memory barriers.

Consequently, though the code within the synchronized block is subject to reordering, all modifications made within that block will be visible in main memory (for other threads to view) when you exit the synchronized block.

The following code works just fine and is easier to understand than the earlier example to boot:

long    value;
synchronized void f()
{   do_something( value );
synchronized void g( long new_value )
{   value = new_value;

With all that in mind, live by this rule:

The visibility of a memory modification is guaranteed only when the modifying thread releases a lock that is subsequently acquired by the examining thread.


No exceptions.


I mean it.

Double-checked locking doesn't work

That brings us to double-checked locking, an idiom for Singleton creation many authors, including myself, have talked about. With double-checked locking, you avoid unnecessary overhead in creating a one-of-a-kind Singleton object (accessed by a static accessor method that creates the object the first time it's called and subsequently returns the object created in the initial call). The double-check idiom tries to reduce synchronization overhead by locking only when the object is created rather than in every call to the accessor method. Here's the idiom:

class Disfunctional_singleton
{   private static Disfunctional_singleton the_instance;
    private Disfunctional_singleton()
    {   // Bunches of reads and writes that initialize
        // the object occur here.
    public static Disfunctional_singleton get_instance()
    {   if( the_instance == null )
            synchronized( Disfunctional_singleton.class )
            {   if( the_instance == null )
                    the_instance = new Disfunctional_singleton();
        return the_instance;

Two tests are required because a thread might be preempted after the first test but before the synchronized statement executes. The interrupting thread would then create the object, and the initial (suspended) thread, when it's released, would create a second object.

That seems clever, but unfortunately the code does not work! The problem: the assignments that occur within the constructor occur within the same memory barrier as the assignment to the_instance. As a consequence, it's possible for the_instance to be non-null, even though the assignments that occurred in the constructor are yet not visible. If one thread gets as far as the statement the_instance=new Disfunctional_singleton(), and the assignment to the_instance is visible but the assignments within the constructor are not yet visible, then a second thread that calls get_instance() method can return a pointer to an uninitialized object.

That is a problem in single-CPU boxes as well. The Symantec JIT VM, for example, behaves awkwardly if it inlines the constructor. In particular, the generated code will assign a value to the_instance before the code that comprises the constructor executes. As a consequence, if the initializing thread is interrupted just after the assignment to the_instance but before the constructor code runs, a second thread that calls get_instance() can return a reference to an uninitialized object.

There are only two solutions, neither of which are particularly clever:

  1. You can synchronize the accessor:

    class Singleton // this one works
    {   private static Singleton the_instance;
        private synchronized static Singleton get_instance()
        {   if( the_instance == null )
                the_instance = new Singleton();
            return the_instance;
  2. You can create the object in a static initializer:

    class Singleton // this one works
    {   private static Singleton the_instance = new Singleton();
        public static Singleton get_instance()
        {   return the_instance;

The second solution has the obvious advantage of eliminating the lock on each access. It works because the VM guarantees that a class object can't be accessed until the class fully loads. However, the second solution isn't viable unless all the information used to create the Singleton is available at load time, which can't always be guaranteed. Remember, the VM loads classes whenever it feels like it -- there's no guarantee that the load is deferred until the class is first used.

Conclusion: rules to live by

So, I'll sum up with a set of rules of thumb that you can use to avoid that problem in multiprocessor environments.

Things that work

  1. Shared memory is visible to two threads only if the writing thread gives up a lock that is subsequently acquired by the reading thread.
  2. Modifications made by a thread before it issues a notify() will be visible to the thread that's released from the associated wait() (because a lock is required).
  3. Memory initialized in a static initializer is safely accessible by all threads, including the one that caused the class-file load.


  1. You cannot reliably use a boolean flag to test if modified memory is visible to another thread.
  2. Modifications made while sleeping may not be visible after sleep() returns.
  3. Operations are not necessarily executed in source-code order (not relevant if code is synchronized).
  4. Memory modifications made after a thread is created, but before it is started, may not be visible to the new thread.
  5. Modifications made by a thread before it shuts down may not be visible unless you join() the thread that's shut down. You must call join(), don't use a boolean flag.
  6. In x=new something(), there's no guarantee that the constructor runs before the assignment to x is made; even if the assignment does occur first, there's no guarantee that memory modifications made within the constructor will be visible before the assignment to x becomes visible.

Happy threading!

1 2 Page 1
Page 1 of 2
How to choose a low-code development platform