Why reverse loops are not faster
TL;DR
Reverse loops are not faster, and using the bytecode as an indication of performance is a really bad idea. Benchmark!
Intro
On the 13th of June, @TravCav published an article on Medium arguing that reverse loops are faster than regular loops, i.e. that
for (int i = 0; i < 10; i++) {
    System.out.println(i);
}
was slower than
for (int i = 10; i >= 0; i--) {
    System.out.println(i);
}
The code is from @TravCav's post; as pointed out on Reddit, the reverse loop should start at 9, not 10.
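For reference, a corrected version of the reverse loop that visits the same 10 values:

for (int i = 9; i >= 0; i--) {
    System.out.println(i);
}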
His analysis is based on looking at the generated bytecode, but unfortunately he does not provide any benchmark to confirm his hypothesis.
It is not the first time I have heard that idea; it generally comes from the belief that a comparison with 0 is faster. That might be true, or might have been true, who knows.
Is it true now? Is it even relevant?
Most CPUs can execute multiple instructions in parallel, and the main cost of a jump comes when the branch prediction is wrong. Even if a cmp with 0 is faster, the author assumes that the bytecode reflects what the CPU will actually execute. Big news: it does not. The JIT does a lot of transformations, and drawing conclusions about what gets executed from the bytecode is ludicrous. For example, the article claims:
After looping back to line 17, the array object has to get loaded back onto the stack at line 18, and the length has to be retrieved again on line 19. And it does this for every loop. Over and over and over…
No, it does not. The JIT is smart enough to figure out that the length does not vary, and it will load it only once. It will also remove the bounds check, for that matter.
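To illustrate, the effect of that optimisation is roughly as if the loop had been rewritten like this - a sketch with illustrative method names, not what the JIT literally produces:

// what you write
static long sumAsWritten(int[] array) {
    long sum = 0;
    for (int i = 0; i < array.length; i++) {
        sum += array[i];
    }
    return sum;
}

// roughly what the JIT ends up running: the length is read once,
// and the per-element bounds check is gone (sketch, not actual JIT output)
static long sumAfterHoisting(int[] array) {
    long sum = 0;
    int len = array.length;
    for (int i = 0; i < len; i++) {
        sum += array[i];
    }
    return sum;
}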
The JMH benchmark
Anyway, let's get some benchmarks in there. The JMH code is as follows:
import java.util.Random;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.CompilerControl;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class LoopBenchmark {

    private int[] numbers;

    @Param(value = {"1", "10", "1000", "1000000"})
    private int size;

    @Setup
    public void setUp() {
        // fill the array with random ints
        numbers = new int[size];
        Random r = new Random();
        for (int i = 0; i < size; i++) {
            numbers[i] = r.nextInt();
        }
    }

    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE) // keep the method from being inlined into the JMH harness
    public long testForwardLoop() {
        long l = 0;
        for (int i = 0; i < numbers.length; i++) {
            l += numbers[i];
        }
        return l;
    }

    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public long testBackwardLoop() {
        long l = 0;
        for (int i = numbers.length - 1; i >= 0; i--) {
            l += numbers[i];
        }
        return l;
    }
}
We initialise an array of ints with random numbers. Each benchmark sums the array, one with a regular - forward - loop, the other with a reverse - backward - loop.
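If you want to reproduce it, the project can be bootstrapped from the standard JMH Maven archetype - the groupId and artifactId below are only an example:

mvn archetype:generate \
    -DinteractiveMode=false \
    -DarchetypeGroupId=org.openjdk.jmh \
    -DarchetypeArtifactId=jmh-java-benchmark-archetype \
    -DgroupId=io.github.arnaudroger \
    -DartifactId=loop-benchmark \
    -Dversion=1.0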
Compile, run
mvn clean install
java -jar target/benchmarks.jar
The results
And here are the results - the higher the better:

| Size        | 1         | 10        | 1000    | 1000000 |
|-------------|-----------|-----------|---------|---------|
| Backward    | 240665052 | 146681879 | 3686316 | 3761    |
| Forward     | 249774234 | 154017527 | 3722753 | 3763    |
| (F - B) / F | 3.65%     | 4.76%     | 0.98%   | 0.06%   |
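The last row is the relative difference (F - B) / F; for size 10, for example, (154017527 - 146681879) / 154017527 ≈ 4.76%.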
The reverse loop is 3 to 4% slower on small arrays; the difference shrinks as the size of the array increases.
Contrary to the original post's claim, the regular loop is faster.
The asm
Let's have a deeper look at what happens with an array of size 10 by looking at the asm generated by the JIT for the forward loop:
java -jar target/benchmarks.jar -f 1 -psize=10 -jvmArgs "-XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:+PrintAssembly" testForwardLoop
You will need the hsdis library. Load the log through JITWatch.
Forward asm
We can see the core loop, from line 63 to line 69:
L0002: movslq 0x10(%r11,%r8,4),%r10
0x00007f7cad1df4d9: add %r10,%rax ;*ladd
; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@21 (line 64)
0x00007f7cad1df4dc: inc %r8d ;*iinc
; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@23 (line 63)
0x00007f7cad1df4df: cmp %ebx,%r8d
0x00007f7cad1df4e2: jl L0002 ;*if_icmpge
; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@10 (line 63)
and the exit of the method, from line 71 to line 75:
L0003: add $0x10,%rsp
0x00007f7cad1df4e8: pop %rbp
0x00007f7cad1df4e9: test %eax,0x17b81b11(%rip) # 0x00007f7cc4d61000
; {poll_return} *** SAFEPOINT POLL ***
0x00007f7cad1df4ef: retq
But what is happening from line 44 to line 58?
L0000: movslq 0x10(%r11,%r8,4),%r9
0x00007f7cad1df4a5: add %rax,%r9
0x00007f7cad1df4a8: movslq %r8d,%rcx
0x00007f7cad1df4ab: movslq 0x1c(%r11,%rcx,4),%rax
0x00007f7cad1df4b0: movslq 0x14(%r11,%rcx,4),%rdi
0x00007f7cad1df4b5: movslq 0x18(%r11,%rcx,4),%rcx
0x00007f7cad1df4ba: add %r9,%rdi
0x00007f7cad1df4bd: add %rdi,%rcx
0x00007f7cad1df4c0: add %rcx,%rax ;*ladd
; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@21 (line 64)
0x00007f7cad1df4c3: add $0x4,%r8d ;*iinc
; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@23 (line 63)
0x00007f7cad1df4c7: cmp %r10d,%r8d
0x00007f7cad1df4ca: jl L0000 ;*if_icmpge
; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@10 (line 63)
It's looping over 4 elements at a time, adding them together in one pass. The loop has been unrolled; once there are fewer than 4 elements left, it falls back on the one-by-one loop on line 63.
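In Java terms, the unrolled part is roughly equivalent to something like this - a sketch only, not what the JIT literally emits:

// rough Java equivalent of the 4-way unrolled sum (illustration only)
static long sumUnrolled(int[] numbers) {
    long l = 0;
    int i = 0;
    // unrolled part: 4 additions per iteration while at least 4 elements remain
    for (; i < numbers.length - 3; i += 4) {
        l += (long) numbers[i] + numbers[i + 1] + numbers[i + 2] + numbers[i + 3];
    }
    // fallback: the remaining elements, one by one
    for (; i < numbers.length; i++) {
        l += numbers[i];
    }
    return l;
}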
Backward asm
java -jar target/benchmarks.jar -f 1 -psize=10 -jvmArgs "-XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:+PrintAssembly" testBackwardLoop
and the backward asm.
We can see the core loop from line 68 to line 74:
L0004: movslq 0x10(%r10,%rcx,4),%r11
0x00007fa34c559b15: add %r11,%rax ;*ladd
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@22 (line 74)
0x00007fa34c559b18: dec %ecx ;*iinc
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@24 (line 73)
0x00007fa34c559b1a: cmp $0xffffffff,%ecx
0x00007fa34c559b1d: jg L0004 ;*iflt
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@11 (line 73)
Very similar to the forward version, except for the cmp against -1 and the dec instead of inc.
The exit of the method, from line 66 to line 80:
L0005: add $0x20,%rsp
0x00007fa34c559b23: pop %rbp
0x00007fa34c559b24: test %eax,0x15e7a4d6(%rip) # 0x00007fa3623d4000
; {poll_return} *** SAFEPOINT POLL ***
0x00007fa34c559b2a: retq
L0001: mov %rax,%r11 ;*lload_1
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@14 (line 74)
L0002: movslq 0x10(%r10,%rcx,4),%r8
0x00007fa34c559ae8: movslq 0xc(%r10,%rcx,4),%r9
0x00007fa34c559aed: movslq 0x8(%r10,%rcx,4),%rbx
0x00007fa34c559af2: movslq 0x4(%r10,%rcx,4),%rax
0x00007fa34c559af7: add %r11,%r8
0x00007fa34c559afa: add %r8,%r9
0x00007fa34c559afd: add %r9,%rbx
0x00007fa34c559b00: add %rbx,%rax ;*ladd
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@22 (line 74)
0x00007fa34c559b03: add $0xfffffffc,%ecx ;*iinc
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@24 (line 73)
0x00007fa34c559b06: cmp $0x2,%ecx
0x00007fa34c559b09: jg L0001 ;*iflt
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@11 (line 73)
Same as for forward, adding 4 elements at a time.
But the logic to get there is slightly more complicated in the backward loop.
For forward, line 22 to line 43, after getting the array length:
0x00007f7cad1df458: test %ebx,%ebx
0x00007f7cad1df45a: jle L0004 ;*if_icmpge
; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@10 (line 63)
0x00007f7cad1df460: test %ebx,%ebx
0x00007f7cad1df462: jbe L0005
0x00007f7cad1df468: mov %ebx,%r9d
0x00007f7cad1df46b: dec %r9d
0x00007f7cad1df46e: cmp %ebx,%r9d
0x00007f7cad1df471: jae L0005 ;*lload_1
; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@13 (line 64)
0x00007f7cad1df477: movslq 0x10(%r12,%r11,8),%rax ;*i2l
; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@20 (line 64)
0x00007f7cad1df47c: mov %ebx,%r10d
0x00007f7cad1df47f: add $0xfffffffd,%r10d
0x00007f7cad1df483: shl $0x3,%r11
0x00007f7cad1df487: mov $0x80000000,%r8d
0x00007f7cad1df48d: cmp %r10d,%r9d
0x00007f7cad1df490: cmovl %r8d,%r10d
0x00007f7cad1df494: mov $0x1,%r8d
0x00007f7cad1df49a: cmp $0x1,%r10d
0x00007f7cad1df49e: jle L0001 ;*lload_1
; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@13 (line 64)
For backward, line 22 to line 49:
0x00007fa34c559a98: mov %r8d,%ecx
0x00007fa34c559a9b: dec %ecx ;*isub
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@8 (line 73)
0x00007fa34c559a9d: xor %r11d,%r11d
0x00007fa34c559aa0: test %ecx,%ecx
0x00007fa34c559aa2: jl L0006 ;*iflt
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@11 (line 73)
0x00007fa34c559aa8: test %r8d,%r8d
0x00007fa34c559aab: jbe L0007
0x00007fa34c559ab1: cmp %r8d,%ecx
0x00007fa34c559ab4: jae L0007
0x00007fa34c559ab6: add $0xfffffffe,%r8d
0x00007fa34c559aba: shl $0x3,%r10 ;*lload_1
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@14 (line 74)
L0000: movslq 0x10(%r10,%rcx,4),%r9
0x00007fa34c559ac3: add %r9,%r11 ;*ladd
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@22 (line 74)
0x00007fa34c559ac6: dec %ecx ;*iinc
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@24 (line 73)
0x00007fa34c559ac8: cmp %r8d,%ecx
0x00007fa34c559acb: jg L0000 ;*iflt
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@11 (line 73)
0x00007fa34c559acd: cmp $0x2,%ecx
0x00007fa34c559ad0: jle L0008
0x00007fa34c559ad2: jmp L0002
0x00007fa34c559ad4: nopl 0x0(%rax,%rax,1)
0x00007fa34c559adc: data16 data16 xchg %ax,%ax
L0001: mov %rax,%r11 ;*lload_1
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@14 (line 74)
Perf analysis
Could that be the source of the performance difference?
java -jar target/benchmarks.jar -f 1 -psize=10 -prof=perf testForwardLoop
java -jar target/benchmarks.jar -f 1 -psize=10 -prof=perf testBackwardLoop
Looking at the result of perfasm for forward and backward, it does not look like it.
There is, though, some difference in the percentages for the unrolled part of the loop.
3.71% 3.16% ││││↗ 0x00007fa8d0bcd920: movslq 0x10(%r11,%r8,4),%r9
1.68% 1.46% │││││ 0x00007fa8d0bcd925: add %rax,%r9
0.43% 0.38% │││││ 0x00007fa8d0bcd928: movslq %r8d,%rcx
2.12% 2.11% │││││ 0x00007fa8d0bcd92b: movslq 0x1c(%r11,%rcx,4),%rax
3.52% 3.02% │││││ 0x00007fa8d0bcd930: movslq 0x14(%r11,%rcx,4),%rdi
1.74% 1.14% │││││ 0x00007fa8d0bcd935: movslq 0x18(%r11,%rcx,4),%rcx
0.34% 0.52% │││││ 0x00007fa8d0bcd93a: add %r9,%rdi
2.26% 1.86% │││││ 0x00007fa8d0bcd93d: add %rdi,%rcx
3.85% 3.53% │││││ 0x00007fa8d0bcd940: add %rcx,%rax ;*ladd
│││││ ; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@21 (line 64)
1.99% 1.70% │││││ 0x00007fa8d0bcd943: add $0x4,%r8d ;*iinc
│││││ ; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@23 (line 63)
0.21% 0.32% │││││ 0x00007fa8d0bcd947: cmp %r10d,%r8d
││││╰ 0x00007fa8d0bcd94a: jl 0x00007fa8d0bcd920 ;*if_icmpge
For backward:
0.11% │││ │↗ 0x00007fbab51f4be0: mov %rax,%r11 ;*lload_1
│││ ││ ; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@14 (line 74)
3.41% 3.05% │││ ↘│ 0x00007fbab51f4be3: movslq 0x10(%r10,%rcx,4),%r8
4.46% 4.11% │││ │ 0x00007fbab51f4be8: movslq 0xc(%r10,%rcx,4),%r9
0.38% 0.43% │││ │ 0x00007fbab51f4bed: movslq 0x8(%r10,%rcx,4),%rbx
0.90% 0.96% │││ │ 0x00007fbab51f4bf2: movslq 0x4(%r10,%rcx,4),%rax
2.13% 2.54% │││ │ 0x00007fbab51f4bf7: add %r11,%r8
5.53% 5.61% │││ │ 0x00007fbab51f4bfa: add %r8,%r9
3.49% 4.52% │││ │ 0x00007fbab51f4bfd: add %r9,%rbx
7.58% 7.83% │││ │ 0x00007fbab51f4c00: add %rbx,%rax ;*ladd
│││ │ ; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@22 (line 74)
7.76% 8.00% │││ │ 0x00007fbab51f4c03: add $0xfffffffc,%ecx ;*iinc
│││ │ ; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@24 (line 73)
0.03% 0.01% │││ │ 0x00007fbab51f4c06: cmp $0x2,%ecx
Look at those 7% in the backward results.
Why?
It's unlikely that the adds cost that much; they have either been stalled or invalidated following a branch misprediction.
It is hard to tell by looking at the perf numbers for forward and backward; forward has slightly fewer branch prediction misses. Is it enough to explain the difference? To be honest, I do not really know.
Time to give up on that one. Is the branching more complicated? Is going in reverse not playing nicely with the CPU? Is there something I might have missed?
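If you want to dig into the branch counters further yourself, JMH's perfnorm profiler - assuming perf and the profiler are available on your machine - reports the hardware events, including branch-misses, normalised per operation:

java -jar target/benchmarks.jar -f 1 -psize=10 -prof=perfnorm testForwardLoop testBackwardLoop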
Conclusion
- Never believe a performance assertion without a good rationale
- Never believe a rationale without a reproducible experiment that supports it
- Never believe a benchmark without looking in depth at what is actually benchmarked
- Do not deduce performance implications from looking at the bytecode
And no, reverse loops are not faster than regular loops - caveat: the result only applies to a loop that adds numbers from an array; any generalisation of the results is at your own risk :).
If you have an idea as to what explains the difference, please leave a comment. Thanks!