False Sharing


Most modern computer architectures have a tiered memory structure. Cache memory (SRAM) is very fast and sits physically close to the processor: L1 is closest to the core, and L2 and L3 (and L4, when present) are progressively larger and slower, though still far faster than main memory. Accessing main memory, by contrast, involves a data transfer over the memory bus and is significantly slower.

[Diagram: memory hierarchy and relative access times]

Clearly, as we go down the memory hierarchy, reading and writing become more and more expensive. Hence, for performance reasons, each thread is allowed to copy the state of variables from main memory (DRAM) into its cache and work on the cached copy. The cache is guaranteed to synchronize with main memory only when a memory barrier enforces a happens-before guarantee (e.g. synchronized, volatile).
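To make that last point concrete, here is a minimal sketch (class and field names are made up for illustration) in which a volatile write/read pair establishes the happens-before edge that makes an ordinary write visible to another thread:

class VisibilityExample {
    private int data = 0;                    // plain field: no visibility guarantee by itself
    private volatile boolean ready = false;  // volatile field: acts as the memory barrier

    void writer() {                          // executed by thread T1
        data = 42;
        ready = true;                        // volatile write publishes everything written before it
    }

    void reader() {                          // executed by thread T2
        if (ready) {                         // volatile read synchronizes with the write above
            System.out.println(data);        // guaranteed to print 42
        }
    }
}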

Data is transferred between main memory and cache in blocks of fixed size, called cache lines. A cache line is a power-of-two number of contiguous bytes, typically 32-256 bytes in size; on almost all modern CPUs it is 64 bytes, which is room for eight 8-byte longs. During a data transfer between CPU and RAM, a whole line is read or written. When a thread writes to a cache line, the whole line is marked as dirty. When another thread then tries to read from the same cache line (assuming there's a memory barrier), a refresh is triggered for the whole line. False sharing is a potential side effect of the way CPU caches work: it is the situation where threads unintentionally hurt each other's performance while modifying independent variables that share the same cache line.

Let's consider two volatile long variables a and b that are part of the same object's state and occupy contiguous memory locations on the same cache line.
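For concreteness, a minimal sketch of such an object (the class name SharedState is made up for illustration); with 8-byte longs and a 64-byte cache line, a and b will almost certainly end up on the same line:

class SharedState {
    volatile long a; // only ever updated by thread T1
    volatile long b; // only ever updated by thread T2
}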

Let's say Thread T1 is executing the method m1(), which is written as:
public void m1() {
    a++;
}

Thread T2 is executing the method m2(), which is written as:
public void m2() {
    b++;
}

When T1 writes to a, the whole cache line is marked dirty. When T2 then reads b, it triggers a cache refresh because the line is already marked dirty, causing a cache miss even though T2 never touches a. False sharing can deteriorate performance significantly because of these frequent cache misses (note that accessing main memory can be 100x slower than accessing the L1 cache).

One way to solve the problem is to use artificial padding. We know that cache lines are typically 64 bytes in size, so if we introduce extra padding by adding redundant fields, we can push the hot fields onto different cache lines and avoid false sharing. For example, we can change this class:
class Unpadded {
    volatile long value;
}

to this:
class Padded {
    long p1, p2, p3, p4, p5, p6, p7 = 0L;
    volatile long value;
}

Although conventional wisdom suggests that wasting an extra 56 bytes per instance is an awful idea, in practice it is often significantly faster.

This trick used to work in earlier versions of Java, but it no longer works from Java 7 onwards, partly because of dead code elimination by the JVM (the dummy fields are simply optimised away) and partly because the JVM changed the way it orders fields between Java 6 and 7.
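One workaround that is still seen in practice (the LMAX Disruptor uses a variant of it) is to spread the padding across a class hierarchy, because HotSpot lays out superclass fields ahead of subclass fields and does not reorder them across the class boundary. A rough sketch, not taken from this article's benchmark:

class LeftPadding {
    protected long p1, p2, p3, p4, p5, p6, p7;        // padding before the hot field
}

class HotValue extends LeftPadding {
    protected volatile long value;                    // the contended field
}

class PaddedValue extends HotValue {
    protected long p9, p10, p11, p12, p13, p14, p15;  // padding after the hot field
}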

Java 8 introduced a much better way to minimize false sharing: the @Contended annotation. @Contended can be applied to fields in a class and is a hint to the JVM that the field should be padded in order to avoid false sharing. One benefit is that, since the JVM understands the architecture of the CPU it is running on, it can automatically figure out the required padding size - this also covers future situations in which AMD or Intel ships a processor with a different cache line size. By default, the JVM ignores this annotation except within classes in the JDK. To enable application code to use the annotation, include the -XX:-RestrictContended flag in the VM args. We can also control the size of the padding with the -XX:ContendedPaddingWidth VM arg (by default the padding is 128 bytes on most computer architectures).
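For example, assuming the benchmark below is compiled into a class named FalseSharingBenchmark (a made-up name), the two kinds of run used later in this post look like:

# default behaviour: @Contended is ignored outside the JDK
java FalseSharingBenchmark

# allow application code to use @Contended
java -XX:-RestrictContended FalseSharingBenchmark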

Food for thought: if the cache line size is typically 64 bytes, why is the default padding size 128 bytes and not 64 bytes?

This is because of the cache prefetcher. Cache prefetching is a technique used to boost performance by fetching data from main memory into the cache before it is actually needed. When a memory location is copied from main memory into the cache, the prefetcher also pulls the contents of adjacent memory locations into the cache, on the theory that if you read some data you will probably read its neighbours soon. This principle is called 'Locality of Reference'. Because prefetching can occupy two adjacent cache lines at once, the artificial padding must be doubled: 2 x 64 bytes = 128 bytes.

Let's try benchmarking performance with and without @Contended.

We'll define two classes, each having a volatile long field - let's call them VolatileLongPadded and VolatileLongUnPadded. We'll add the @Contended annotation to the volatile field in one class and leave the other as it is.
// @Contended lives in sun.misc on JDK 8 (jdk.internal.vm.annotation from JDK 9 onwards)
public final static class VolatileLongPadded {
    @Contended
    public volatile long value = 0L;
}

public final static class VolatileLongUnPadded {
    public volatile long value = 0L;
}

We'll maintain two arrays containing instances of these classes:
private static VolatileLongPadded[] paddedLongs;
private static VolatileLongUnPadded[] unPaddedLongs;

On application startup, the two arrays will be populated:
static {
    paddedLongs = new VolatileLongPadded[NUM_THREADS_MAX];
    for (int i = 0; i < paddedLongs.length; i++) {
        paddedLongs[i] = new VolatileLongPadded();
    }

    unPaddedLongs = new VolatileLongUnPadded[NUM_THREADS_MAX];
    for (int i = 0; i < unPaddedLongs.length; i++) {
        unPaddedLongs[i] = new VolatileLongUnPadded();
    }
}

Now we'll define two tasks, one for the padded class and one for the unpadded class. Each task loops and repeatedly writes to the volatile long field named value:
private static Runnable createUnpaddedRunnable(final int x) {
    return () -> {
        long i = ITERATIONS + 1;
        while (0 != --i) {
            unPaddedLongs[x].value = i;
        }
    };
}

private static Runnable createPaddedRunnable(final int x) {
    Runnable paddedTouch = () -> {
        long i = ITERATIONS + 1;
        while (0 != --i) {
            paddedLongs[x].value = i;
        }
    };
    return paddedTouch;
}

Now let's write the main method. To benchmark this, each runnable will perform 50,000,000 writes per test. We will also keep increasing the number of threads and compare the timings.

public static int NUM_THREADS_MAX = 4;
public final static long ITERATIONS = 50_000_000L;

The main method:
public static void main(final String[] args) throws Exception {
    long begin, end;

    for (int n = 1; n <= NUM_THREADS_MAX; n++) {

        Thread[] threads = new Thread[n];

        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(createPaddedRunnable(i));
        }

        begin = System.currentTimeMillis();
        for (Thread t : threads) { t.start(); }
        for (Thread t : threads) { t.join(); }
        end = System.currentTimeMillis();
        System.out.printf(" Padded # threads %d - T = %dms\n", n, end - begin);

        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(createUnpaddedRunnable(i));
        }

        begin = System.currentTimeMillis();
        for (Thread t : threads) { t.start(); }
        for (Thread t : threads) { t.join(); }
        end = System.currentTimeMillis();
        System.out.printf(" UnPadded # threads %d - T = %dms\n\n", n, end - begin);
    }
}

Let's run this without the -XX:-RestrictContended flag first. There should not be any significant difference between the two classes, as the @Contended annotation will be ignored.

Output:
Padded # threads 1 - T = 406ms
UnPadded # threads 1 - T = 440ms

Padded # threads 2 - T = 1017ms
UnPadded # threads 2 - T = 996ms

Padded # threads 3 - T = 2392ms
UnPadded # threads 3 - T = 2606ms

Padded # threads 4 - T = 3239ms
UnPadded # threads 4 - T = 3477ms

As expected, there isn't any significant difference between the performance of the Padded and UnPadded classes, as the @Contended annotation has been ignored.

Now let's run the same code with the -XX:-RestrictContended arg.

Output:
Padded # threads 1 - T = 396ms
UnPadded # threads 1 - T = 553ms

Padded # threads 2 - T = 545ms
UnPadded # threads 2 - T = 1094ms

Padded # threads 3 - T = 714ms
UnPadded # threads 3 - T = 2671ms

Padded # threads 4 - T = 905ms
UnPadded # threads 4 - T = 3490ms

Clearly padding made a significant difference. As we increase the number of threads, the difference becomes more significant.

What if we override the default padding size and make it smaller than the cache line size?

Let's try running the same code with 16-byte padding (-XX:ContendedPaddingWidth=16).
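Using the made-up class name from earlier, the invocation would look roughly like:

java -XX:-RestrictContended -XX:ContendedPaddingWidth=16 FalseSharingBenchmark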

Output:
Padded # threads 1 - T = 402ms
UnPadded # threads 1 - T = 395ms

Padded # threads 2 - T = 469ms
UnPadded # threads 2 - T = 1903ms

Padded # threads 3 - T = 1204ms
UnPadded # threads 3 - T = 2346ms

Padded # threads 4 - T = 1819ms
UnPadded # threads 4 - T = 2910ms

Performance of the padded class has degraded to some extent: 16 bytes of padding is smaller than a cache line, so independent fields can once again end up sharing a line and cache misses return - this becomes more evident as we run a higher number of threads.


