Squeeze out the last 5%

Norman Maurer, Principal Software Engineer @ Red Hat Inc

  • Leading Netty effort at Red Hat
  • Vert.x / NIO / Performance
  • Author of Netty in Action
  • @normanmaurer
  • github.com/normanmaurer

Agenda

JIT - Just in Time Compiler

  • loop-unrolling
  • rearranging code
  • removing synchronized
  • optimistic locking
  • inlining hot methods

Loop unrolling

  • JIT may unroll loops
  • eliminate branching
unroll
https://www.flickr.com/photos/st3f4n/3805014022/

Loop unrolling

  • speed up loops
May be hard for the JIT to unroll
int i = 0;
for (;;) {
  if (array.length == i) {
    break;
  }
  doSomething(array[i++]);
}

Loop unrolling - Help the JIT

Assume array.length == 4

Easy to unroll
for (int i = 0; i < array.length; i++) {
  doSomething(array[i]);
}
unrolled loop
doSomething(array[0]);
doSomething(array[1]);
doSomething(array[2]);
doSomething(array[3]);

Inlining - General

  • Inline hot methods
  • Eliminate call overhead
  • Only works when small enough
The limit for inlining hot methods is 325 bytecodes by default…
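If needed, the thresholds can be tuned via JVM flags; the values below are the HotSpot defaults at the time of writing and may differ between JVM versions.
java \
-XX:MaxInlineSize=35 \ (1)
-XX:FreqInlineSize=325 \ (2)
....
(1) Bytecode-size limit for inlining methods that are not hot
(2) Bytecode-size limit for inlining hot (frequently called) methods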

Inlining - How it works

Before inlining
public void methodA() {
  ... // Do some work A (1)
  methodB();
}

private void methodB() {
  ... // Do some more work B (2)
}
methodB is inlined into methodA
public void methodA() {
  ... // Do some work A (3)
  ... // Do some more work B (4)
}

Inlining - How you detect

Record inlining decisions
java \
-XX:+PrintCompilation \ (1)
-XX:+UnlockDiagnosticVMOptions \ (2)
-XX:+PrintInlining \ (3)
.... > inline.log
(1) Prints out when JIT compilation happens
(2) Is needed to use flags like -XX:+PrintInlining
(3) Prints which methods were inlined

Inlining - How you detect

inline.log content example
@ 42   io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe::read (191 bytes)   inline (hot) (1)
@ 42   io.netty.channel.nio.AbstractNioMessageChannel$NioMessageUnsafe::read (327 bytes)   hot method too big (2)
  @ 4   io.netty.channel.socket.nio.NioSocketChannel::config (5 bytes)   inline (hot)
    @ 1   io.netty.channel.socket.nio.NioSocketChannel::config (5 bytes)   inline (hot)
  @ 12   io.netty.channel.AbstractChannel::pipeline (5 bytes)   inline (hot)
(1) Method is hot and was inlined
(2) Method is hot and was too big for inlining :(
You want your hot methods to be inlined for maximum performance whenever possible!

Inlining - Before code was optimized

Original code which was too big for inlining
private final class NioMessageUnsafe extends AbstractNioUnsafe {
  public void read() {
    final SelectionKey key = selectionKey();
    if (!config().isAutoRead()) {
      int interestOps = key.interestOps();
      if ((interestOps & readInterestOp) != 0) {
        // only remove readInterestOp if needed
        key.interestOps(interestOps & ~readInterestOp);
      }
    }
    ... // rest of the method
  }
  ...
}

Inlining - After code was optimized

Factor the less likely branch out into a new method
private final class NioMessageUnsafe extends AbstractNioUnsafe {
  private void removeReadOp() {
    SelectionKey key = selectionKey();
    int interestOps = key.interestOps();
    if ((interestOps & readInterestOp) != 0) {
      // only remove readInterestOp if needed
      key.interestOps(interestOps & ~readInterestOp);
    }
  }

  public void read() {
    if (!config().isAutoRead()) {
      removeReadOp();
    }
    ... // rest of the method
  }
  ...
}

Inlining - Result after code change

inline.log content after change
@ 42   io.netty.channel.nio.AbstractNioMessageChannel$NioMessageUnsafe::read (288 bytes)   inline (hot) (1)
(1) Method is now small enough for inlining :)

Inline != Inline … huh ?

  • public vs. protected vs. private
  • final vs. non-final
  • static vs. non-static
type
https://www.flickr.com/photos/brears/2685223351

Inline - Make JIT’s life easier

  • flat-class hierarchy
  • special handling for up to two impls. (see the sketch below)
http://www.cliffc.org/blog/2007/11/06/clone-or-not-clone/
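A minimal sketch of what this means (the interface and classes are made up for illustration): with at most two loaded implementations, a virtual call like codec.encode(x) can usually still be inlined bimorphically; a third implementation typically makes the call site megamorphic and blocks inlining.
public interface Codec {
  int encode(int value);
}

// With only these two implementations loaded, the JIT can usually
// inline calls through the Codec interface bimorphically.
final class FastCodec implements Codec {
  public int encode(int value) { return value << 1; }
}

final class SafeCodec implements Codec {
  public int encode(int value) { return Math.addExact(value, value); }
}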

Inline by hand - if nothing helps

You can inline by hand
nightmare
https://www.flickr.com/photos/tracheotomy_bob/5430166369
Only do this if you really need the last percent of performance. It can result in a maintenance nightmare!

GC - Garbage Collector

trash
https://www.flickr.com/photos/stavos52093/13807616194/

Finalizer - No thanks

Ouch :(
public class YourClass {
  @Override
  public void finalize() {
    // cleanup resources and destroy them
  }
}
Better :)
public class YourClass {
  public void destroy() {
    // cleanup resources and destroy them
  }
}
Overriding finalize() is 99% of the time a very bad idea!
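Usage then looks like this (a sketch based on the YourClass example above): the caller releases resources deterministically instead of waiting for the GC.
YourClass resource = new YourClass();
try {
  // use the resource
} finally {
  resource.destroy(); // deterministic cleanup, no GC involved
}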

Finalizer - Welcome to hell

  • more work for the GC
  • never know when it runs
  • ties up GC thread
  • may resurrect objects

WeakReference / SoftReference

weak
https://www.flickr.com/photos/counteragent/2190576349
  • more work on GC
  • logic depends on GC
  • GC pressure
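A minimal sketch of why the logic depends on the GC (class and field names are made up): the referent may be cleared at any collection, so every access has to handle null.
import java.lang.ref.WeakReference;

public class WeakCacheExample {
  // the GC may clear this reference once the byte[] is only weakly reachable
  private WeakReference<byte[]> cached = new WeakReference<>(new byte[1024]);

  public byte[] data() {
    byte[] data = cached.get();
    if (data == null) {
      // the referent was already collected; recreate it
      data = new byte[1024];
      cached = new WeakReference<>(data);
    }
    return data;
  }
}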

GC - Pressure

  • Objects are not free
  • Data structures factor into GC overhead
gc pressure
https://www.flickr.com/photos/tracheotomy_bob/5430166369

Datastructures and GC effect

  • array ⇒ GC friendly
  • linked-list ⇒ GC unfriendly
Prefetching also works out quite well for array-based data structures, as does cache-line locality, etc.

linked-list made GC-friendly

Roll your own linked-list datastructure
public class Entry {
  Entry next; (1)
  Entry prev;
  public void methodA() {}
}

public class EntryList {
  Entry head; (2)
  Entry tail;
  public void add(Entry entry) {}
}
(1) The Entry itself can be used as the node
(2) head and tail of the list
Each Entry can only be added to one EntryList at a time.
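A sketch of how add(…) could be implemented for such an intrusive list (the implementation is an assumption, not from the original slides): no wrapper node is allocated, the Entry itself is linked in.
public void add(Entry entry) {
  if (head == null) {
    head = tail = entry; // first element
  } else {
    tail.next = entry;   // link the Entry itself, no extra node object
    entry.prev = tail;
    tail = entry;
  }
}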

False-sharing

  • contention on same cache-line
  • cache line ~ 64 bytes
sharing
https://www.flickr.com/photos/ryanr/142455033/

False Sharing - Padding to the rescue

  • pad object = own cache-line
  • best done via adding longs
There is @Contended in Java 8, but it needs to be enabled via a JVM switch and does not allow packing multiple objects into one cache line. This is part of JEP 142.
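For completeness, a sketch of the annotation mentioned above (assuming Java 8 and running with -XX:-RestrictContended outside the JDK itself):
public class ContendedCounter {
  @sun.misc.Contended // the JVM pads this field onto its own cache line
  volatile long value;
}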

False Sharing - A case-study

The following code snippets and diagram are based on Martin Thompson's blog posts False Sharing and False Sharing && Java 7.
false sharing
https://twitter.com/normanmaurer/status/465821479012401152
Credit where credit is due.

False Sharing - Padding to the rescue

False-sharing Test
public final class FalseSharing implements Runnable {
  private static AtomicLong[] longs = new AtomicLong[NUM_THREADS];
  ....
  public static void runTest() throws InterruptedException {
    Thread[] threads = new Thread[NUM_THREADS];
    for (int i = 0; i < threads.length; i++)
      threads[i] = new Thread(new FalseSharing(i));
    for (Thread t : threads) { t.start(); }
    for (Thread t : threads) { t.join(); }
  }

  public void run() {
    long i = ITERATIONS + 1;
    while (0 != --i) { longs[arrayIndex].set(i); }
  }
}

False Sharing - No padding

AtomicLong without padding
public final class FalseSharing implements Runnable {
  private static AtomicLong[] longs = new AtomicLong[NUM_THREADS];
  static { 
    for (int i = 0; i < longs.length; i++) {
      longs[i] = new AtomicLong();
    }
  }
  ...
}

False Sharing - Padding for AtomicLong

AtomicLong with padding
public final class FalseSharing implements Runnable {
  private static PaddedAtomicLong[] longs = new PaddedAtomicLong[NUM_THREADS];
  static { 
    for (int i = 0; i < longs.length; i++) {
      longs[i] = new PaddedAtomicLong();
    }
  }
  ...
  public static long sumPaddingToPreventOptimisation(final int index) {
    PaddedAtomicLong v = longs[index];
    return v.p1 + v.p2 + v.p3 + v.p4 + v.p5 + v.p6;
  }

  public static class PaddedAtomicLong extends AtomicLong {
    // 6 padding longs (48 bytes) plus the inherited value field keep two
    // PaddedAtomicLongs from ever sharing one 64-byte cache line
    public volatile long p1, p2, p3, p4, p5, p6 = 7L;
  }
}

False Sharing - results

duration
http://mechanical-sympathy.blogspot.de/2011/07/false-sharing.html

Unsafe - Welcome to the dark side

// Only code loaded by the bootstrap class loader (classLoader == null)
// is allowed to call Unsafe.getUnsafe() directly.
public static Unsafe getUnsafe() {
  Class cc = sun.reflect.Reflection.getCallerClass(2);
  if (cc.getClassLoader() != null)
    throw new SecurityException("Unsafe");
  return theUnsafe;
}
unsafe private
https://www.flickr.com/photos/oldpatterns/6100020538
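Because of that check, application code usually grabs the instance via reflection instead (a well-known hack, not an official API):
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public final class UnsafeAccess {
  static final Unsafe UNSAFE;
  static {
    try {
      // read the private static "theUnsafe" field reflectively
      Field field = Unsafe.class.getDeclaredField("theUnsafe");
      field.setAccessible(true);
      UNSAFE = (Unsafe) field.get(null);
    } catch (Exception e) {
      throw new Error(e);
    }
  }
  private UnsafeAccess() { }
}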

Unsafe - Swiss-army-knife of Java

  • Allows acting directly on memory addresses
  • Array operations
  • volatile / atomic operations
Can do a lot of things faster than the JDK abstractions because it eliminates range checks and allows direct access via memory addresses, etc.
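A small sketch of the kind of direct access meant here (reusing the hypothetical UnsafeAccess helper from the previous slide):
// write and read a long directly in native memory, no object, no range check
long address = UnsafeAccess.UNSAFE.allocateMemory(8);
UnsafeAccess.UNSAFE.putLong(address, 42L);
long value = UnsafeAccess.UNSAFE.getLong(address);
UnsafeAccess.UNSAFE.freeMemory(address);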

Unsafe == unsafe

  • Easy to segfault JVM
  • unofficial API
unsafe
Caution: @lagergren may hunt you down ;)

Unsafe

unsafe danger
http://zeroturnaround.com/rebellabs/dangerous-code-how-to-be-unsafe-with-java-classes-objects-in-memory/

Queues - Use the right one for the right job

SPSC ⇒ Single Producer Single Consumer
SPMC ⇒ Single Producer Multi Consumer
MPSC ⇒ Multi Producer Single Consumer
MPMC ⇒ Multi Producer Multi Consumer
The usage pattern is important!
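For example, an MPSC setup looks like this (a sketch using a JDK queue; a queue specialized for the single-consumer case can avoid coordination a general MPMC queue has to pay for):
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class MpscExample {
  // ConcurrentLinkedQueue is MPMC; a dedicated MPSC queue could be cheaper here
  private final Queue<Runnable> tasks = new ConcurrentLinkedQueue<Runnable>();

  // called from many producer threads
  public void submit(Runnable task) {
    tasks.offer(task);
  }

  // called from exactly one consumer thread
  public void drain() {
    Runnable task;
    while ((task = tasks.poll()) != null) {
      task.run();
    }
  }
}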

Queues

All concurrent queues in the JDK are MPMC
sad
https://www.flickr.com/photos/kirisryche/4350193154
JAQ-InABox provides some nice alternatives. Say thanks to @nitsanw.

Atomic*.lazySet(…) for the win

Cheap volatile writes, we all love cheap when it comes to performance!
lazy
https://www.flickr.com/photos/vmabney/9431857312
Only works for single writers, a.k.a. single producers. More details in the "Atomic*.lazySet is a performance win for single writers" post by @nitsanw.
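A sketch of a single-writer counter that publishes via lazySet (names are made up): the writer skips the full volatile-write fence while readers still see the updates shortly afterwards.
import java.util.concurrent.atomic.AtomicLong;

public class SingleWriterCounter {
  private final AtomicLong counter = new AtomicLong();

  // must only ever be called from the single writer thread
  public void increment() {
    counter.lazySet(counter.get() + 1); // cheaper than set(): no StoreLoad fence
  }

  // may be called from any reader thread
  public long current() {
    return counter.get();
  }
}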

Bitwise operations are fast

Slow nextIdx(…) method based on modulo operation
private final Object[] array = new Object[128];

private int nextIdx(int index) {
  return (index + 1) % array.length;
}
Fast nextIdx(…) method based on bitwise operation
private final Object[] array = new Object[128];
private final int mask = array.length - 1;

private int nextIdx(int index) {
  return (index + 1) & mask;
}
Bitwise operations are most of the time faster than arithmetic ones. Note that the & mask trick only works when the array length is a power of two.

More information on these topics

Mechanical Sympathy blog and Mechanical Sympathy mailing list
Concurrency Interest mailing list
Nitsan Wakart's blog
Aleksey Shipilev's blog

So squeeze out the last bit of performance…

squeeze
https://www.flickr.com/photos/dolmansaxlil/5347064183

References

Slides generated with Asciidoctor and DZSlides backend
Original slide template - Dan Allen & Sarah White
All pictures licensed with Creative Commons Attribution or Creative Commons Attribution-Share Alike

Norman Maurer