Points to consider while working on multithreaded code

In general, there are four main points to consider when dealing with multithreaded code:
mutual exclusion
caching
reordering
happens-before

Let's examine each of them in detail :

Mutual Exclusion

Mutual exclusion essentially means that only one thread can execute the critical section at any point in time. Mutual exclusion is often needed to avoid race conditions.

Let's look at the following piece of code, which attempts to insert a new element at the end of a linked list:

class Node<V> {    
    V value;
    Node<V> next;

    public Node(V value) {
        this.value = value;
    }
}
class LinkedList<V> {
    Node<V> head;
    Node<V> tail;

    public void addNewNode(V value) {
        Node<V> newNode = new Node<>(value);
        tail.next = newNode;
        tail = newNode;
    }
}

This is what the addNewNode method does:
Create a new Node object
Make next reference of current tail point to this new object
Change the tail reference to the new object (as it is the new tail)

While this works fine in a single-threaded context, there will be problems if two threads execute addNewNode simultaneously. Let's say the tail reference currently points to node N0, and two threads T1 and T2 are trying to add two new values (say 1 and 2) to the linked list. Here's what might happen:
Thread T1 creates a new Node object N1 with value 1
Thread T2 creates a new Node object N2 with value 2
Thread T1 changes the next reference of N0 to N1, so the list is now N0 -> N1
Thread T2 changes the next reference of N0 to N2, overwriting T1's link, so the list is now N0 -> N2
Thread T1 changes the tail reference to N1
Thread T2 changes the tail reference to N2
The net effect is that node N1 has been lost altogether.

That's why we sometimes need mutual exclusion to protect a critical section of code from producing spurious results. With mutual exclusion, T2 can't execute the method while T1 is executing it.

Note that mutual exclusion and atomicity go hand-in-hand. If the critical section involves multi-step operations, we need to ensure that all the steps are atomic as well as mutually exclusive.
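
One simple way to achieve this here is to make addNewNode a synchronized method (the synchronized keyword is discussed in detail later). The sketch below is a minimal illustration; the null check for an initially empty list is an assumption, since the original snippet doesn't show how head and tail are initialized:

class LinkedList<V> {
    private Node<V> head;
    private Node<V> tail;

    // synchronized ensures only one thread executes this critical section at a time
    public synchronized void addNewNode(V value) {
        Node<V> newNode = new Node<>(value);
        if (tail == null) {          // assumption: the list may start out empty
            head = newNode;
        } else {
            tail.next = newNode;
        }
        tail = newNode;
    }
}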

Caching

To understand caching and why it's needed, let's take a look at the memory hierarchy first:

As we go down the memory hierarchy, reading from and writing to memory becomes more and more expensive. Cache memory is really fast (SRAM) and is physically located very close to the processor (L1 is closest to the core, L2 is a little farther away, and L3, if present, is farther still and usually shared between cores). Hence, for performance reasons, each thread is allowed to copy the state of variables from main memory (DRAM) into cache and work on the cached copy. This can cause a problem when two copies of the same variable exist in the caches used by two different threads and writes to the variable by one thread aren't visible to the other. The JVM provides no guarantee about when a cached write is flushed to main memory (or whether it is flushed at all) unless there's a code construct that explicitly instructs it to do so.
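
A small illustration of this problem (a sketch; the class and field names are made up, and the exact behaviour depends on the JVM and hardware): the reader thread below may spin forever because nothing forces the write to ready to become visible to it. We'll see later that declaring ready as volatile fixes this.

class VisibilityProblem {
    static boolean ready = false;   // plain field: no visibility guarantee across threads

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!ready) {
                // may loop forever: this thread is free to keep using a stale cached value of 'ready'
            }
            System.out.println("Saw the update");
        });
        reader.start();

        Thread.sleep(1000);
        ready = true;               // this write may never become visible to the reader thread
    }
}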

On a multi-core system, there is an added layer of complexity: L1 and L2 caches are typically private to each core, whereas L3 and L4 caches may or may not be shared depending on the processor architecture. This means that when a thread loses its CPU quantum and is later picked up by the scheduler again, it might be scheduled on a different core, requiring the cached data to be re-fetched into that core's cache, which makes context switching an expensive affair. To address this, we have cache-coherent computer architectures (and most modern processors are designed to be cache coherent). In a shared-memory multiprocessor with a separate cache for each processor, it is possible to have many copies of shared data: one copy in main memory and one in the local cache of each processor that requested it. When one of the copies is changed, the other copies must reflect that change. Cache coherence is the discipline that ensures changes to the values of shared operands (data) are propagated throughout the system in a timely fashion. Even though we might be deploying code on a machine with cache-coherent processors, as Java developers writing a service, web app, mobile app, or desktop app (in other words, anything other than code for an embedded system), we should not write code that depends on the underlying processor being cache coherent.

Data is transferred between memory and cache in blocks of fixed size, called cache lines. When a cache line is copied from memory into the cache, a cache entry is created. The cache is partitioned into lines (also called blocks), and on almost all modern CPUs a cache line is 64 bytes. During data transfer between the CPU and RAM, a whole line is read or written. When there's a request to copy the contents of a memory location from main memory into the cache, the prefetcher automatically pulls in adjacent data (as much as fits into the cache line). The idea is that if you read some data, there is a good chance you will read the neighbouring data soon; this principle is called 'locality of reference'. This is also one of the reasons why iterating over an array is much faster than iterating over a linked list: array elements are stored in adjacent memory locations, so iteration has a predictable sequential access pattern, whereas the nodes of a linked list can be scattered anywhere on the heap.
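
A rough (and deliberately unscientific) way to see this effect is to time a simple sum over an int[] versus a java.util.LinkedList<Integer>; the sizes below are arbitrary, and part of the gap also comes from boxing, but the scattered node layout is a major factor:

import java.util.LinkedList;

public class LocalityDemo {

    public static void main(String[] args) {
        int n = 2_000_000;
        int[] array = new int[n];
        LinkedList<Integer> list = new LinkedList<>();
        for (int i = 0; i < n; i++) {
            array[i] = i;
            list.add(i);
        }

        long start = System.currentTimeMillis();
        long sum1 = 0;
        for (int i = 0; i < n; i++) {
            sum1 += array[i];          // sequential access: adjacent elements share cache lines
        }
        System.out.println("array time (ms): " + (System.currentTimeMillis() - start) + ", sum = " + sum1);

        start = System.currentTimeMillis();
        long sum2 = 0;
        for (int value : list) {
            sum2 += value;             // nodes are scattered on the heap: no predictable access pattern
        }
        System.out.println("list time (ms): " + (System.currentTimeMillis() - start) + ", sum = " + sum2);
    }
}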

In the following code, we iterate over a byte array (many times), first with a stride of 64 and then with a stride of 8. Considering that in both cases we iterate the same number of times, they should take similar time. However, the first pass uses a stride of 64 (so nearly every read needs a new cache line to be loaded), whereas the second pass uses a stride of 8 (so several consecutive reads can be served from the same cache line).

public class Test {

static byte[] data = new byte[128 * 1024]; //128 KB

public static void main(String[] args) {

    for(int i=0; i< data.length; i++) {
        data[i]=(byte)i;
    }

    long start = System.currentTimeMillis();            
    int total = 0;

    // doing it a lot of times just to get accurate benchmark
    int j = 0;
    while(j<1_000_000) {
        // cache line size is typically 64 bytes, so reading with a gap of 64
        int count = 0;
        for (int k = 0; count < 2000; k = k + 64) {
          //System.out.println(count + " " + k);
          total = total + data[k];
          count ++;
        }
        j++;
    }

    long step1 = System.currentTimeMillis();
    System.out.println("Time taken : " + (step1 - start));

    j = 0;
    while(j<1_000_000) {
        int count = 0;
        for (int k = 0; count < 2000; k = k + 8) {
          total = total + data[k];
          count ++;
        }
        j++;
    }

    long step2 = System.currentTimeMillis();
    System.out.println("Time taken : " + (step2 - step1));
}
}

Output:

Time taken : 1747
Time taken : 877

Clearly the second run is faster because many of its reads hit the same cache line (which costs almost nothing extra).

Fun exercise: can we use the same technique to estimate the L1 cache size?

Take a look at the code snippet below. We have a 128 KB byte array and we iterate over it, growing the amount of data accessed by 8 KB on every round (starting at 8 KB and stopping at 128 KB). Since there's typically a significant difference between L1 and L2 cache performance, the expectation is that at some point (once the working set exceeds the L1 cache size) we will see a sharp drop in performance.
We run the whole thing 10 times and use the 10th run to draw conclusions (so that the JIT has warmed up).

static byte[] data = new byte[128 * 1024]; //128 KB

public static void main(String[] args) {
    for (int i=0; i<10; i++){
        test(data, i);
    }
}

static void test(byte[] data, int count) {
    // steps of 8 KB each
    for (int len = 8192; len <= data.length; len += 8192) {

        long start = System.nanoTime();
        for (int i = 0; i < 10000; i++) {
            for (int k = 0; k < len; k += 64) {
                data[k] = 1;
            }
        }

        long end = System.nanoTime();
        if (count == 9) {
            System.out.println("size (KB): " + len/1024 + ", total time (ns): " + (end - start) + ", time per step (ns): " + (end - start) / len);
        }
    }
}

Output:

size (KB): 8, total time (ns): 892051, time per step (ns): 108
size (KB): 16, total time (ns): 1844792, time per step (ns): 112
size (KB): 24, total time (ns): 2980942, time per step (ns): 121
size (KB): 32, total time (ns): 8488988, time per step (ns): 259
size (KB): 40, total time (ns): 17236802, time per step (ns): 420
size (KB): 48, total time (ns): 19508207, time per step (ns): 396
size (KB): 56, total time (ns): 22589108, time per step (ns): 393
size (KB): 64, total time (ns): 24905139, time per step (ns): 380
size (KB): 72, total time (ns): 26524576, time per step (ns): 359
size (KB): 80, total time (ns): 30263247, time per step (ns): 369
size (KB): 88, total time (ns): 39579133, time per step (ns): 439
size (KB): 96, total time (ns): 37954343, time per step (ns): 386
size (KB): 104, total time (ns): 43428475, time per step (ns): 407
size (KB): 112, total time (ns): 42282955, time per step (ns): 368
size (KB): 120, total time (ns): 45680246, time per step (ns): 371
size (KB): 128, total time (ns): 49052101, time per step (ns): 374

From the output we can conclude that the L1 (data) cache on this machine is 32 KB: the time per step holds steady up to 24 KB and drops sharply by the time the working set reaches 32 KB.

Reordering

Note that program order isn't necessarily the same as execution order. Instructions of a Java program can be reordered by the compiler, by the JVM at runtime, or even by the processor, unless we explicitly use a programming construct that prevents reordering. So we should never assume that a program will execute its instructions in exactly the same order as specified in the source code unless we take measures to prevent reordering. The JMM (Java Memory Model) specification states: "If the reordering produces results consistent with a legal execution, it is not illegal". In other words, if the result of a set of actions conforms to a valid chain of execution as per the JMM, then the result is allowed regardless of whether the author intended the code to produce it or not.

Let's try to demonstrate it using code.

First, let us define a class State. It has three fields of type int, all initialized to 0 by default:

class State {
    int a = 0;
    int b = 0;
    int c = 0;
}

Now, let's start a thread which writes to the fields in class State

        State state = new State();            
        Runnable r1 = new Runnable() {            
            @Override
            public void run() {
                state.a = 1; // statement 1
                state.b = 1; // statement 2
                state.c = state.a + 1; // statement 3

            }
        };
        new Thread(r1).start();

There is no mutual exclusion here, so interleaving is possible.
Looking at the code, the valid states of the system are:

a = 0, b = 0, c = 0 (nothing executed yet)

a = 1, b = 0, c = 0

a = 1, b = 1, c = 0

a = 1, b = 1, c = 2

Now let's create a thread which checks for anomalies due to reordering and prints them out.
The possible anomalies are:

b = 1, a = 0 (b set but not a)

c = 2, b = 0 (c set but not b)

c = 2, a = 0 (c computed from a, but a appears unset)

        Runnable r2 = new Runnable() {
            @Override
            public void run() {
                int tmpC = state.c;
                int tmpB = state.b;
                int tmpA = state.a;
                if (tmpB == 1 && tmpA == 0) {
                    System.out.println("Alert!! b == 1 && a == 0");
                }
                if (tmpC == 2 && tmpB == 0) {
                    System.out.println("Alert!! c == 2 && b == 0");
                }
                if (tmpC == 2 && tmpA == 0) {
                    System.out.println("Alert!! c == 2 && a == 0");
                }                                        
            }            
        };
        new Thread(r2).start();

Of course, we will need to run this many times to have a chance of catching it, so we run it a million times:

public class Test {

    public static void main(String[] args) {
        for (int i = 0; i < 1_000_000; i++) {
            State state = new State();            
            Runnable r1 = new Runnable() {            
                @Override
                public void run() {
                    state.a = 1;
                    state.b = 1;
                    state.c = state.a + 1;
                }
            };
            new Thread(r1).start();

            Runnable r2 = new Runnable() {
                 @Override
                 public void run() {
                    int tmpC = state.c;
                    int tmpB = state.b;
                    int tmpA = state.a;
                    if (tmpB == 1 && tmpA == 0) {
                        System.out.println("Alert!! b == 1 && a == 0");
                    }
                    if (tmpC == 2 && tmpB == 0) {
                        System.out.println("Alert!! c == 2 && b == 0");
                    }
                    if (tmpC == 2 && tmpA == 0) {
                        System.out.println("Alert!! c == 2 && a == 0");
                    }                                        
                }            
            };
            new Thread(r2).start();
        }                
    }
}
class State {
    int a = 0;
    int b = 0;
    int c = 0;
}

Output:

Alert!! c == 2 && b == 0
Alert!! c == 2 && b == 0
Alert!! c == 2 && b == 0

Clearly, instruction reordering is not a myth or an urban legend: there were occasions where the effects of the writes were observed out of program order.

Why do we need reordering at all?

First, let's look at reordering by the CPU:

Here's an architecture diagram of the Intel Sandy Bridge processor:

There are 6 execution ports, and every instruction is dispatched to an appropriate port depending on the instruction type. Note that not every port supports every kind of instruction: while Port 0, Port 1 and Port 5 all support ALU instructions, the JMP instruction is supported only by Port 5. This means that even if instructions are fed to the CPU in exactly the same order as in the program, execution can happen out of order because a particular port might be busy. If the CPU had to guarantee in-order execution, ports would regularly sit idle, leading to sub-optimal utilization and lower throughput.

Similarly, the compiler as well as the JVM perform a lot of optimizations (like dead code elimination, removal of useless synchronization, loop unrolling, branch prediction, etc.) which can lead to instruction reordering.

Imagine the following code:

a = 1;
b = 1;
a = a + 1;   // Not present in the register
b = b + 1;   // Not present in the register
a = a + 1;   // Not present in the register
b = b + 1;   // Not present in the register
// Here both a and b have the value 3

A possible reordered and optimized version of the code can be :

a = 1;
a = a + 1;   // Already in the register
a = a + 1;   // Already in the register
b = 1;
b = b + 1;   // Already in the register
b = b + 1;   // Already in the register
// Here both a and b have the value 3

The final result is the same, but the reordered code performs better because the values are fetched from a register (< 1 ns per access) as opposed to the L1 cache (typically 1-3 ns).

Happens-before

If two actions are ordered by a happens-before relationship, the first is visible to and ordered before the second. If we have two actions x and y, and x happens-before y [commonly written hb(x, y)], it implies that:

  • if x and y are actions on the same thread, then x comes before y in program order

  • if hb(x,y) and hb(y,z) then we can say hb(x,z)

  • if x and y are actions in different threads, then all changes made as a consequence of executing x must be visible when y executes

Note that happens-before is a causal, not a temporal, relationship. In other words, x happens-before y doesn't literally mean that the CPU actually executed x before y.

We have already seen that, due to compiler and hardware optimizations, writes by one thread aren't always visible to reads by another thread. The happens-before relationship is a way to solve that problem. A write W that is proven to happen-before a read R is guaranteed to be visible to that read, assuming there's no other interfering write (i.e. one with no happens-before relation to the read, or one ordered between them by that relation). In short, the happens-before relationship is a guarantee that memory writes by one specific statement are visible to another specific statement. There are several actions that create happens-before relationships, the most popular of them being synchronization.

Here are a few of the happens-before rules in Java:

  • Each action in a single thread happens-before every action in that thread that comes later in the program order.

  • An unlock on a monitor lock (exiting a synchronized method/block) happens-before every subsequent acquisition of the same monitor lock.

  • A write to a volatile field happens-before every subsequent read of that same field. Writes and reads of volatile fields have similar memory consistency effects as entering and exiting monitors (a synchronized block around the reads and writes), but without actually acquiring monitors/locks.

  • A call to Thread.start() happens-before every action in the started thread. Say thread A spawns a new thread B by calling threadB.start(). All actions performed in thread B's run method are guaranteed to see the actions thread A performed before (and including) the call to threadB.start().

  • All actions in a thread happen-before any other thread successfully returns from a join() on that thread. Say thread A spawns a new thread B by calling threadB.start() and then calls threadB.join(). Thread A waits at the join() call until thread B's run method finishes; once join() returns, all subsequent actions in thread A are guaranteed to see all the actions performed in thread B's run method (a small sketch of these last two rules follows this list).
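
Here is a minimal sketch of the start()/join() rules; the static field data and the printed value are purely for illustration:

class StartJoinDemo {
    static int data = 0;   // plain field: no volatile or synchronization needed here

    public static void main(String[] args) throws InterruptedException {
        data = 42;                       // happens-before threadB.start()
        Thread threadB = new Thread(() -> {
            data = data + 1;             // guaranteed to see data == 42 because of the start() rule
        });
        threadB.start();
        threadB.join();                  // everything thread B did happens-before join() returning
        System.out.println(data);        // guaranteed to print 43
    }
}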

Now let's have a look at the common concurrency idioms in light of the factors discussed above.

synchronized keyword:

mutual exclusion: Synchronization implies mutual exclusion. If a thread is executing a critical section guarded by a synchronized block or method, no other thread can execute that critical section simultaneously (provided the threads synchronize on the same lock).

caching: Synchronization provides a guarantee against stale caches. When a thread enters a synchronized block or method, it crosses a read memory barrier: locally cached copies of variables that may have been updated in main memory are invalidated and re-read from main memory. When a thread exits a synchronized block, it crosses a write memory barrier and all dirty (locally modified) values are written back to main memory.
Note that even though a thread writes all its local changes to main memory when exiting a synchronized block or method, another thread reading outside a synchronized context isn't forced to reload the shared state from main memory and may therefore see stale data. Hence, if mutable state is shared across multiple threads, it's important to use synchronization for reads as well as writes.
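
As a minimal sketch of that last point (SharedCounter and its field are hypothetical names), both the writer and the reader below synchronize on the same lock, so a reader is guaranteed to see the most recent write:

class SharedCounter {
    private int count = 0;

    // writes made inside the synchronized block are flushed to main memory on monitor exit
    public synchronized void increment() {
        count++;
    }

    // reads must also be synchronized; an unsynchronized read could observe a stale cached value
    public synchronized int get() {
        return count;
    }
}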

reordering: Synchronization provides some guarantee against reordering.
Statements before or after the synchronized block may be moved inside the synchronized block (provided they are not in a different synchronized block, of course); this is sometimes called the "roach motel" principle: instructions can move into a synchronized block, but not out of it. So normal accesses can be moved from after a monitor exit to before it, and from before a monitor enter to after it.
Also, statements inside a synchronized block may be reordered among themselves as long as the result is sequentially consistent (i.e. the effect is not visible to other code that shares a happens-before relationship with the block).

Let's look at the following block of code:

int a = 1; //statement 1
int b = 2; //statement 2
synchronized(lock) {
    int c = 3; //statement 3
    int d = 4; //statement 4
    int e = 5; //statement 5
}
int f = 6; //statement 6
int g = 7; //statement 7

So,
statements (3, 4, 5) can be mutually reordered
statements (1, 2) can be mutually reordered
statements (6, 7) can be mutually reordered
statements (1, 2) can be reordered with (6, 7)
statements (3, 4, 5) can't be reordered with (6, 7)
statements (3, 4, 5) can't be reordered with (1, 2)

happens-before: Synchronization provides a happens-before guarantee: an unlock on a monitor lock (exiting a synchronized method/block) always happens-before every subsequent acquisition of the same monitor lock.

volatile keyword:

mutual exclusion: volatile doesn't help with mutual exclusion at all. If you need mutual exclusion, use synchronized.

caching: The JVM guarantees that if a variable is declared volatile, every thread reading the variable always reads the most recently written value from main memory. Just like synchronization, volatile creates a memory barrier: when a thread reads a volatile variable, its locally cached values are refreshed from main memory (including the values of non-volatile variables), and when a thread writes to a volatile variable, the changes in its local cache are flushed to main memory.
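
A classic illustration of this visibility guarantee is a stop flag, which is exactly the fix for the visibility problem sketched in the caching section (a minimal sketch; the Worker class and its field are made up for this example). Without volatile, the loop below might never observe the write made by stop(); with volatile it is guaranteed to:

class Worker implements Runnable {
    // volatile guarantees that the spinning loop always sees the latest value of the flag
    private volatile boolean running = true;

    @Override
    public void run() {
        while (running) {
            // do some work
        }
        System.out.println("Worker stopped");
    }

    public void stop() {
        running = false;   // becomes visible to the thread spinning in run()
    }
}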

reordering: There's a notable difference between the old (pre-1.5) and the new (1.5 onwards) memory model when it comes to reordering around volatile variables. The old memory model stated that a volatile access could not be reordered with another volatile access, but could be reordered with accesses to non-volatile variables. Under the new memory model, volatile accesses still cannot be reordered with each other, and in addition ordinary reads and writes can no longer be freely reordered around them.

happens-before: The volatile keyword provides a happens-before guarantee just like synchronization.
A write to a volatile field happens-before every subsequent read of that same field. Writes and reads of volatile fields have similar memory consistency effects as entering and exiting monitors (a synchronized block around the reads and writes), but without actually acquiring monitors/locks.

If you change a volatile variable after changing a non-volatile variable, then as long as another thread reads the volatile variable first, it will also see the most recent value of the non-volatile variable.

Let's create a class State with two int fields, a and b, both initialized to 0:

class State {
    int a = 0;
    int b = 0;
}

Now we run two different threads: one which sets b followed by a, and another which checks for the anomaly (a is set but b is not). We run it a million times and, as expected, in the absence of any happens-before guarantee there are occasions where the writes are observed out of order:

public static void main(String[] args) {

    for (int i=0; i<1_000_000; i++) {
        State state = new State();

        Runnable r1 = new Runnable() {            
            @Override
            public void run() {
                state.b = 1;
                state.a = 1;
            }
        };

        Runnable r2 = new Runnable() {            
            @Override
            public void run() {
                if (state.a == 1 && state.b != 1) {
                    System.out.println("Inconsistency Detected");
                    System.exit(1);
                }

            }
        };

        new Thread(r1).start();
        new Thread(r2).start();            
    }

}

Output:

Inconsistency Detected

Now let's make field a in class State volatile:

class State {
    volatile int a = 0;
    int b = 0;
}

We run the same code again, and this time there isn't a single case of reordering detected. Also note that even though we declared only a as volatile (and not b), we still get the most recent value of b, because we write to a after b and read a before b.

