Hardware Based Locking

CAS (Compare And Swap) :

CAS (Compare And Swap) is an atomic instruction used to implement optimistic locking. It compares the contents of a memory location with a given expected value and, only if they are the same, modifies the contents of that memory location to a new given value. This is done as a single atomic operation. The atomicity guarantees that the new value is calculated based on up-to-date information; if the value had been updated by another thread in the meantime, the write would fail. The result of the operation must indicate whether it performed the substitution; this can be done either with a simple boolean response (this variant is often called compare-and-set), or by returning the value read from the memory location (not the value written to it). CAS essentially allows a read-modify-write sequence to be performed atomically without fear of interleaving by another thread. If two threads try to execute CAS on the same memory location at the same time, the hardware serializes them: conceptually, the first one takes a write lock at the hardware level. The second one's read can still proceed, but its write is blocked, so its CAS fails. Note that CAS employs only this hardware-level lock and no software-level lock. A thread whose CAS fails typically retries in a loop until it succeeds.
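
The read-modify-write retry loop described above can be sketched with AtomicInteger's compareAndSet; the doubleIt helper below is illustrative, not part of any library:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasLoopDemo {
    // Atomically double the value using a CAS retry loop: read the current
    // value, compute the new value, and attempt the swap. If another thread
    // changed the value in between, compareAndSet fails and we retry with
    // the freshly read value.
    static int doubleIt(AtomicInteger value) {
        while (true) {
            int current = value.get();
            int next = current * 2;
            if (value.compareAndSet(current, next)) {
                return next;
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger value = new AtomicInteger(21);
        System.out.println(doubleIt(value)); // prints 42
    }
}
```

The loop is the software side of optimistic locking: no lock object is held; a failed attempt just re-reads and recomputes.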

Almost every major processor architecture supports instructions implementing CAS, e.g. the x86 architecture supports the cmpxchg (Compare And Exchange) instruction, which updates a memory location with a new value as long as the existing value at that memory location matches the expected value.

sun.misc.Unsafe provides a bunch of APIs for low-level programming in Java, including CAS-based locking. Unsafe is undocumented and isn't intended for use beyond core Java classes. Unsafe calls into native code which knows exactly which processor instruction to execute (e.g. cmpxchg on x86, LL/SC on PowerPC) when trying to perform CAS.

The java.util.concurrent.atomic package defines classes (commonly or collectively called Atomic classes) that support atomic operations on single variables. Atomic classes internally use Unsafe to provide CAS functionality.

Let's look at a few Atomic classes :

AtomicInteger

AtomicInteger represents an int value that may be updated atomically. AtomicIntegers are most commonly used to implement atomically incremented counters in a multithreaded environment.

To create an AtomicInteger instance we can use one of the following constructors :

 AtomicInteger(int initialValue)
 AtomicInteger()

Invoking the no-arg constructor is the same as invoking the other constructor with an initial value of 0.

To access the current state of the underlying int, use :

int get()

Similarly to set a value, use :

void set(int newValue)
void lazySet(int newValue)

With lazySet, the write is guaranteed to be not reordered with any other write but may be reordered with other operations.

AtomicInteger also provides the following CAS based API :

 int getAndSet(int newValue)

Atomically sets to the given value and returns the old value.

 boolean compareAndSet(int expect, int update)
 boolean weakCompareAndSet(int expect, int update)

The compareAndSet method atomically sets the value to the given updated value if the current value == the expected value. The weakCompareAndSet method does the same but doesn't provide any ordering guarantees, so it's rarely appropriate to use it.

AtomicInteger is commonly used as a counter and provides the following methods to atomically add a value to it :

 int addAndGet(int delta)
 int getAndAdd(int delta)

Both of them add delta to the existing value. As the names suggest, addAndGet returns the updated value whereas getAndAdd returns the old value (before the update).
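
A quick sketch of the difference between the two (the values are arbitrary):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AddDemo {
    public static void main(String[] args) {
        AtomicInteger counter = new AtomicInteger(10);
        // addAndGet returns the value after the addition...
        System.out.println(counter.addAndGet(5)); // prints 15
        // ...while getAndAdd returns the value before it.
        System.out.println(counter.getAndAdd(5)); // prints 15
        System.out.println(counter.get());        // prints 20
    }
}
```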

Since most counters are typically incremented/decremented by 1 at a time, the following convenience methods are available :

int getAndIncrement() // equivalent to i++
int incrementAndGet() // equivalent to ++i
int getAndDecrement() // equivalent to i--
int decrementAndGet() // equivalent to --i
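
A minimal sketch of the classic counter use case, assuming a counter shared by a few worker threads (class name and thread counts are arbitrary):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicCounterDemo {
    // Each of nThreads threads increments the shared counter perThread
    // times; getAndIncrement is atomic, so no update is lost even though
    // the threads race on the same variable.
    static int countWithThreads(int nThreads, int perThread) throws InterruptedException {
        AtomicInteger counter = new AtomicInteger();
        Thread[] threads = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.getAndIncrement();
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        return counter.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(countWithThreads(4, 1000)); // prints 4000
    }
}
```

With a plain int and count++ instead, some increments would be lost under contention, because count++ is a non-atomic read-modify-write.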

* In a multithreaded context, can Atomic classes be used as substitutes for wrapper classes ? *

One key difference between Atomic classes and wrapper classes is that Atomic classes do not override equals/hashCode/compareTo. So it's not safe to use them as substitutes for wrapper classes (even though they extend Number just like the numeric wrapper classes).

AtomicLong

Similar to AtomicInteger except that it represents a long value that may be updated atomically. Consequently, all the method signatures have long datatypes instead of int.

AtomicReference

Represents an object reference that may be updated atomically. Supports generics to indicate the type of the underlying object.

A very good use case for AtomicReference is implementing thread-safe operations on data structures backed by linked lists without having to use a pessimistic lock.
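
As an illustration, a minimal lock-free stack (a sketch of a Treiber stack; the class and method names here are my own, not from the original text) can be built on a single AtomicReference holding the head of the list:

```java
import java.util.concurrent.atomic.AtomicReference;

// A minimal lock-free stack: push swings the head pointer to a new node
// via CAS, retrying if another thread moved the head in the meantime.
// No pessimistic lock is ever taken.
public class LockFreeStack<E> {
    private static class Node<E> {
        final E value;
        final Node<E> next;
        Node(E value, Node<E> next) { this.value = value; this.next = next; }
    }

    private final AtomicReference<Node<E>> head = new AtomicReference<>();

    public void push(E value) {
        Node<E> oldHead;
        Node<E> newHead;
        do {
            oldHead = head.get();
            newHead = new Node<>(value, oldHead);
        } while (!head.compareAndSet(oldHead, newHead));
    }

    // Returns null if the stack is empty.
    public E pop() {
        Node<E> oldHead;
        do {
            oldHead = head.get();
            if (oldHead == null) return null;
        } while (!head.compareAndSet(oldHead, oldHead.next));
        return oldHead.value;
    }

    public static void main(String[] args) {
        LockFreeStack<Integer> stack = new LockFreeStack<>();
        stack.push(1);
        stack.push(2);
        System.out.println(stack.pop()); // prints 2
        System.out.println(stack.pop()); // prints 1
    }
}
```

Each push builds a new node pointing at the current head and publishes it with a single CAS; a failed CAS simply rebuilds against the fresh head and retries.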

AtomicBoolean

Similar to AtomicInteger/AtomicLong except that the underlying value is of type boolean

AtomicIntegerArray/AtomicLongArray/AtomicReferenceArray

Sometimes we may need to update an array element atomically. AtomicReference helps only if we want to reassign the array reference to a different object. But what if we don't want to reassign the reference but just update one element of the array ? The AtomicXXXArray classes help us with that by extending atomic support to arrays of the corresponding type.

We'll explore the APIs for AtomicReferenceArray here. Very similar methods exist in AtomicIntegerArray and AtomicLongArray as well

To create an AtomicReferenceArray instance, use either of the following constructors :

 AtomicReferenceArray(E[] array)
 AtomicReferenceArray(int length)

The second constructor accepts the length of the array as argument and internally creates an array of that size with all elements being initially null.

The get/set methods work similar to AtomicReference except that they need an additional argument indicating the index of the element.

 E get(int i)
 void set(int i, E newValue)
 void lazySet(int i, E newValue)

For compare-and-set operations, it has methods similar to AtomicReference's, except that they need an additional argument indicating the index of the element.

 boolean compareAndSet(int i, E expect, E update)
 boolean weakCompareAndSet(int i, E expect, E update)
 E getAndSet(int i, E newValue)
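
A short usage sketch (array size and values are arbitrary):

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

public class AtomicArrayDemo {
    public static void main(String[] args) {
        // Array of 3 slots, all initially null (length constructor).
        AtomicReferenceArray<String> slots = new AtomicReferenceArray<>(3);

        slots.set(0, "a");
        // CAS on a single element: succeeds only if slot 1 is still null.
        boolean claimed = slots.compareAndSet(1, null, "b");
        System.out.println(claimed);      // prints true
        System.out.println(slots.get(1)); // prints b

        // A second CAS with a stale expected value fails; the slot keeps "b".
        System.out.println(slots.compareAndSet(1, null, "c")); // prints false
    }
}
```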

AtomicReferenceFieldUpdater

AtomicReferenceFieldUpdater is a reflection-based utility that enables atomic updates to designated volatile reference fields of designated classes. This class is designed for use in atomic data structures in which several reference fields of the same node are independently subject to atomic updates, e.g. elements of a BST where the left and right children of a node might undergo modification at the same time. In the generic parameters, T represents the type of the object holding the field whereas V represents the type of the field.

Note that AtomicReferenceFieldUpdater is an abstract class, with its primary implementation being the internal AtomicReferenceFieldUpdaterImpl.
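
A minimal usage sketch, assuming an illustrative Node class (not from the original text); note the updated field must be declared volatile and be accessible to the caller of newUpdater:

```java
import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;

public class FieldUpdaterDemo {
    // An illustrative tree node. AtomicReferenceFieldUpdater requires the
    // target fields to be volatile references.
    static class Node {
        volatile Node left;
        volatile Node right;
    }

    // One shared updater per field: it is stateless, so a single static
    // instance serves every Node, instead of one AtomicReference per node.
    static final AtomicReferenceFieldUpdater<Node, Node> LEFT =
            AtomicReferenceFieldUpdater.newUpdater(Node.class, Node.class, "left");

    public static void main(String[] args) {
        Node root = new Node();
        Node child = new Node();
        // Atomically attach a left child only if none exists yet.
        boolean attached = LEFT.compareAndSet(root, null, child);
        System.out.println(attached);           // prints true
        System.out.println(root.left == child); // prints true
    }
}
```

The space saving over per-node AtomicReferences is the usual motivation: the CAS capability lives in one shared updater object rather than in every node.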

AtomicIntegerFieldUpdater/AtomicLongFieldUpdater

Similar to AtomicReferenceFieldUpdater except that the datatype of the volatile field in question is int/long

xadd

xadd (also called fetch-and-add) is a CPU instruction to atomically add a given value to the contents of a memory location - functionally, it's very similar to CAS (Compare And Swap). Unlike CAS, xadd doesn't require looping/spinning with retries and guarantees success in a definite number of steps. XADD is supported by most common processor architectures, just like CAS.

In the x86 architecture, the LOCK XADD instruction has been available since Intel 80486.

Java 7 vs Java 8 changes

Because CAS requires retries, under high contention the CAS loop fails often and the branch predictor starts to predict the execution path will stay in the loop, causing lots of pipeline flushing when the CAS eventually succeeds. As a result, in Java 8, the AtomicInteger/AtomicLong implementation has been changed to use XADD instead of CAS.

If you look at the source code of getAndIncrement() in AtomicInteger, it changed from this in Java 7 :

public final int getAndIncrement() {
    for (;;) {
        int current = get();
        int next = current + 1;
        if (compareAndSet(current, next))
            return current;
    }
}

To this in Java 8 :

public final int getAndIncrement() {
    return unsafe.getAndAddInt(this, valueOffset, 1);
}

LongAdder

LongAdder is usually the fastest way to maintain a counter in JVMs on multicore systems. Under low update contention, the two classes (LongAdder and AtomicLong) have similar performance characteristics. But under high contention, the expected throughput of LongAdder is significantly higher, at the expense of higher space consumption.

LongAdder provides methods add() and increment() just like the atomic number classes and is also thread-safe.

 void add(long x)
 void increment() // equivalent to add(1)
 void decrement() // equivalent to add(-1)

To reset the LongAdder to its initial state, use the following method

 void reset()

reset() is only effective and safe if there are no concurrent updates. Because this method is intrinsically racy, it should only be used when it is known that no threads are concurrently updating.

Instead of updating a single variable, this class internally maintains a set of variables to reduce contention across threads. The actual result can be retrieved by calling sum() or sumThenReset().

 long sum()
 long sumThenReset() // equivalent to calling sum() followed by reset()
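
A sketch of using LongAdder as a shared counter (class name and thread counts are arbitrary):

```java
import java.util.concurrent.atomic.LongAdder;

public class LongAdderDemo {
    // nThreads threads each record perThread events. LongAdder spreads
    // the updates across internal cells to reduce contention, and sum()
    // folds those cells into a single total.
    static long countRequests(int nThreads, int perThread) throws InterruptedException {
        LongAdder requests = new LongAdder();
        Thread[] threads = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    requests.increment();
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        return requests.sum();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(countRequests(8, 1000)); // prints 8000
    }
}
```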

This class is usually preferable over atomic numbers when updates from multiple threads are more common than reads. This is often the case when capturing statistical data, e.g. you want to count the number of requests served on a web server. The drawback of LongAdder is higher memory consumption because a set of variables is held in-memory.

Note that LongAdder extends Number but doesn't implement equals/hashCode/compareTo. So, while LongAdder is a useful utility for maintaining a running counter (or similar requirements), it cannot be used as a substitute for wrapper classes.

DoubleAdder

DoubleAdder has almost identical specs to LongAdder except that it works on double values.

LongAccumulator/DoubleAccumulator

LongAccumulator can be thought of as a more generic version of LongAdder in the sense that LongAccumulator accepts a function indicating the operation and is hence not limited to addition of long values. LongAccumulator has similar performance characteristics to LongAdder.

An instance of LongAccumulator can be created by supplying the LongBinaryOperator and the initial value to its constructor. The call new LongAdder() is equivalent to new LongAccumulator((x, y) -> x + y, 0L).

 public LongAccumulator(LongBinaryOperator accumulatorFunction, long identity)

Example:

 LongAccumulator accumulator = new LongAccumulator(Long::sum, 0L);

The order of accumulation within or across threads is not guaranteed and cannot be depended upon, so this class is only applicable to functions for which the order of accumulation does not matter. The supplied accumulator function should be side-effect-free, since it may be re-applied when attempted updates fail due to contention among threads. The function is applied with the current value as its first argument, and the given update as the second argument.

To perform operations (using the LongBinaryOperator) use :

void accumulate(long x)

To reset the LongAccumulator to its initial state, use :

void reset()

Just like LongAdder's reset(), reset can lead to a race condition if there are other threads concurrently invoking accumulate.

To get the accumulated result, use

long get()
long getThenReset() // this is equivalent to calling get() followed by reset()
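
Since the operation is pluggable, LongAccumulator can do more than addition; a small sketch tracking a running maximum (the values are arbitrary):

```java
import java.util.concurrent.atomic.LongAccumulator;

public class AccumulatorDemo {
    public static void main(String[] args) {
        // The identity (Long.MIN_VALUE) is the initial state; Long::max is
        // the accumulator function, so accumulate(x) keeps the largest
        // value seen so far. Order of accumulation doesn't matter for max.
        LongAccumulator max = new LongAccumulator(Long::max, Long.MIN_VALUE);
        max.accumulate(10);
        max.accumulate(42);
        max.accumulate(7);
        System.out.println(max.get()); // prints 42
    }
}
```

Max is a good fit because it is commutative and associative, which is exactly the class of functions LongAccumulator supports.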

Performance Benchmarks

We'll try the following exercise to compare performance. We'll add a bunch of numbers in different ways and compare the time taken. We'll try the following cases :

single threaded, no memory barrier

multiple threads, critical section protected by synchronized

multiple threads, critical section protected by ReentrantLock

Atomic classes

Adders

Accumulators

Addition in a single thread, no memory barriers

We have a very basic Counter class which maintains a running count

 class Counter {
     long count;

     public Counter(long initial) {
         this.count = initial;
     }

     void increment() { count++; }
     long getCount() { return count; }
 }

We'll invoke the increment method a million times :

public static void main(String[] args) throws Exception {
    Counter counter = new Counter(0);
    long start = System.nanoTime();
    for (long index = 0; index<1_000_000; index++) {
        counter.increment();
    }
    long end = System.nanoTime();
    System.out.println("Time taken(ns) : " + (end - start));
    System.out.println("Result: " + counter.getCount());
}

Output:

 Time taken(ns) : 7309528
 Result: 1000000

So, time taken : 7 ms (approx)

Multiple threads, critical section guarded by synchronized

So our Counter class changes to :

 class Counter {
     long count;

     public Counter(long initial) {
         this.count = initial;
     }

     synchronized void increment() { count++; }
     long getCount() { return count; }
 }

The main method now starts multiple threads - the same number of threads as the number of available processors.

public static void main(String[] args) throws Exception {
    int parallelismLevel = Runtime.getRuntime().availableProcessors();
    ExecutorService executor = Executors.newFixedThreadPool(parallelismLevel);        
    Counter counter = new Counter(0);

    Runnable task = () -> counter.increment();

    long start = System.nanoTime();
    for (long index = 0; index<1_000_000; index++) {
        executor.submit(task);
    }

    executor.shutdown();
    while (!executor.isTerminated()) {
        executor.awaitTermination(5, TimeUnit.SECONDS);
    }

    long end = System.nanoTime();
    System.out.println("Time taken(ns) : " + (end - start));
    System.out.println("Result: " + counter.getCount());
}

Output:

 Time taken(ns) : 392368550
 Result: 1000000

So, total time taken = 392 ms (roughly)

Look at the massive cost of synchronization and memory barriers - even though we are running with 4 threads (i7 processor, 4 cores), the time taken jumped more than fiftyfold, from 7 ms to 392 ms.

Multiple threads, critical section guarded by ReentrantLock

Let's change the Counter class to use ReentrantLock instead :

 class Counter {
     static Lock lock = new ReentrantLock();
     long count;

     public Counter(long initial) {
         this.count = initial;
     }

     void increment() {
         lock.lock();
         try {
             count++;
         }
         finally {
             lock.unlock();
         }
     }
     long getCount() { return count; }
 }

The main method remains the same - 4 threads and 1 million calls to increment

Output:

 Time taken(ns) : 407598071
 Result: 1000000

So, total time taken = 408 ms (roughly)

Multiple threads using AtomicLong

So, let's change the code to use AtomicLong :

 class Counter {

     AtomicLong count;

     public Counter(long initial) {
         this.count = new AtomicLong(initial);
     }

     void increment() { 
         count.incrementAndGet();     
     }
     long getCount() { return count.get(); }
 }

The main method remains unchanged.

Output:

 Time taken(ns) : 327886292
 Result: 1000000

What if we increase the concurrency level and increase contention ? Let's run it with 8 and 16 threads as well.

Output (8 threads)

 Time taken(ns) : 407435637
 Result: 1000000

Output (16 threads)

 Time taken(ns) : 478794116
 Result: 1000000
Multiple threads using LongAdder

Let's change the Counter class to use LongAdder :

 class Counter {

     LongAdder count;

     public Counter(long initial) {
         this.count = new LongAdder();
         this.count.add(initial);
     }

     void increment() {
         count.increment();
     }
     long getCount() { return count.sum(); }
 }

Output:

 Time taken(ns) : 357386349
 Result: 1000000

Let's run it with 8 and 16 threads as well (just as we did with AtomicLong).

Output (with 8 threads)

Time taken(ns) : 381943772
Result: 1000000

Output (with 16 threads)

Time taken(ns) : 389665186
Result: 1000000

Clearly, with high levels of contention, LongAdder outperforms AtomicLong.

Multiple threads and LongAccumulator

Let's rewrite the Counter class with LongAccumulator and run the main (unchanged) again

 class Counter {

     LongAccumulator count;

     public Counter(long initial) {
         this.count = new LongAccumulator(Long::sum, initial);
     }

     void increment() { 
         count.accumulate(1);
     }
     long getCount() { return count.get(); }
 }

Output (with 4 threads)

 Time taken(ns) : 361372625
 Result: 1000000

Output (with 8 threads)

 Time taken(ns) : 394639948
 Result: 1000000

Output (with 16 threads)

 Time taken(ns) : 415286015
 Result: 1000000
