Why reverse loops are not faster
TL;DR
Reverse loops are not faster, and using the bytecode as an indication of performance is a really bad idea. Benchmark!
Intro
On the 13th of June, @TravCav published an article on Medium arguing that reverse loops are faster than regular loops, i.e. that
for (int i = 0; i < 10; i++) {
    System.out.println(i);
}
was slower than
for (int i = 10; i >= 0; i--) {
    System.out.println(i);
}
The code is from @TravCav's post; as pointed out on Reddit, the reverse loop should start at 9, not 10.
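For reference, a corrected version of the reverse loop that visits the same 10 values:

for (int i = 9; i >= 0; i--) {
    System.out.println(i);
}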
His analysis is based on looking at the generated bytecode, but unfortunately he does not provide any benchmark to confirm his hypothesis.
It is not the first time I have heard that idea; it generally comes from the belief that a comparison with 0 is faster. That might be true, or might have been true, who knows.
Is it true now? Is it even relevant?
Most CPUs can execute multiple instructions in parallel, and the main cost of a jump comes when the branch prediction is wrong. Even if a cmp with 0 is faster, the author assumes that the bytecode reflects what the CPU will actually execute. Big news: it does not. The JIT does a lot of transformations, and drawing conclusions about what gets executed from the bytecode is ludicrous. For example, the article claims:
After looping back to line 17, the array object has to get loaded back onto the stack at line 18, and the length has to be retrieved again on line 19. And it does this for every loop. Over and over and over…
No, it does not. The JIT is smart enough to figure out that the length does not vary, and it will load it only once. It will also remove the bounds check, for that matter.
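To illustrate, the effect of that optimisation is roughly as if the loop had been rewritten like this - a sketch with illustrative method names, not what the JIT literally produces:

// what you write
static long sumAsWritten(int[] array) {
    long sum = 0;
    for (int i = 0; i < array.length; i++) {
        sum += array[i];
    }
    return sum;
}

// roughly what the JIT ends up running: the length is read once,
// and the per-element bounds check is gone (sketch, not actual JIT output)
static long sumAfterHoisting(int[] array) {
    long sum = 0;
    int len = array.length;
    for (int i = 0; i < len; i++) {
        sum += array[i];
    }
    return sum;
}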
The JMH benchmark
Anyway, let's get some benchmarks in there. The JMH code is as follows:
import java.util.Random;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.CompilerControl;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class LoopBenchmark {

    private int[] numbers;

    @Param(value = {"1", "10", "1000", "1000000"})
    private int size;

    @Setup
    public void setUp() {
        // fill the array with random ints
        numbers = new int[size];
        Random r = new Random();
        for (int i = 0; i < size; i++) {
            numbers[i] = r.nextInt();
        }
    }

    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE) // keep the method from being inlined into the JMH harness
    public long testForwardLoop() {
        long l = 0;
        for (int i = 0; i < numbers.length; i++) {
            l += numbers[i];
        }
        return l;
    }

    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public long testBackwardLoop() {
        long l = 0;
        for (int i = numbers.length - 1; i >= 0; i--) {
            l += numbers[i];
        }
        return l;
    }
}
We initialise an array of ints with random numbers. Each benchmark sums the array, one with a regular - forward - loop, the other with a reverse - backward - loop.
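If you want to reproduce it, the project can be bootstrapped from the standard JMH Maven archetype - the groupId and artifactId below are only an example:

mvn archetype:generate \
    -DinteractiveMode=false \
    -DarchetypeGroupId=org.openjdk.jmh \
    -DarchetypeArtifactId=jmh-java-benchmark-archetype \
    -DgroupId=io.github.arnaudroger \
    -DartifactId=loop-benchmark \
    -Dversion=1.0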
Compile, run
mvn clean install
java -jar target/benchmarks.jar
The results
And here are the results - the higher the better:

| Size        | 1         | 10        | 1000    | 1000000 |
|-------------|-----------|-----------|---------|---------|
| Backward    | 240665052 | 146681879 | 3686316 | 3761    |
| Forward     | 249774234 | 154017527 | 3722753 | 3763    |
| (F - B) / F | 3.65%     | 4.76%     | 0.98%   | 0.06%   |
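The last row is the relative difference (F - B) / F; for size 10, for example, (154017527 - 146681879) / 154017527 ≈ 4.76%.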
The reverse loop is 3 to 4% slower on small arrays; the difference shrinks as the size of the array increases.
Contrary to the original post's claim, the regular loop is faster.
The asm
Let's have a deeper look at what happens with an array of size 10 by looking at the asm generated by the JIT for the forward loop:
java -jar target/benchmarks.jar -f 1 -psize=10 -jvmArgs "-XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:+PrintAssembly" testForwardLoop
You will need the hsdis library. Load the log through JITWatch.
Forward asm
We can see the core loop, from line 63 to line 69:
L0002: movslq 0x10(%r11,%r8,4),%r10
0x00007f7cad1df4d9: add %r10,%rax ;*ladd
; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@21 (line 64)
0x00007f7cad1df4dc: inc %r8d ;*iinc
; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@23 (line 63)
0x00007f7cad1df4df: cmp %ebx,%r8d
0x00007f7cad1df4e2: jl L0002 ;*if_icmpge
; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@10 (line 63)
and the exit of the method, from line 71 to line 75:
L0003: add $0x10,%rsp
0x00007f7cad1df4e8: pop %rbp
0x00007f7cad1df4e9: test %eax,0x17b81b11(%rip) # 0x00007f7cc4d61000
; {poll_return} *** SAFEPOINT POLL ***
0x00007f7cad1df4ef: retq
But what is happening from line 44 to line 58?
L0000: movslq 0x10(%r11,%r8,4),%r9
0x00007f7cad1df4a5: add %rax,%r9
0x00007f7cad1df4a8: movslq %r8d,%rcx
0x00007f7cad1df4ab: movslq 0x1c(%r11,%rcx,4),%rax
0x00007f7cad1df4b0: movslq 0x14(%r11,%rcx,4),%rdi
0x00007f7cad1df4b5: movslq 0x18(%r11,%rcx,4),%rcx
0x00007f7cad1df4ba: add %r9,%rdi
0x00007f7cad1df4bd: add %rdi,%rcx
0x00007f7cad1df4c0: add %rcx,%rax ;*ladd
; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@21 (line 64)
0x00007f7cad1df4c3: add $0x4,%r8d ;*iinc
; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@23 (line 63)
0x00007f7cad1df4c7: cmp %r10d,%r8d
0x00007f7cad1df4ca: jl L0000 ;*if_icmpge
; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@10 (line 63)
It's looping over 4 elements at a time, adding them together in one pass. The loop has been unrolled; once there are fewer than 4 elements left, it falls back on the one-by-one loop on line 63.
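In Java terms, the unrolled part is roughly equivalent to something like this - a sketch only, not what the JIT literally emits:

// rough Java equivalent of the 4-way unrolled sum (illustration only)
static long sumUnrolled(int[] numbers) {
    long l = 0;
    int i = 0;
    // unrolled part: 4 additions per iteration while at least 4 elements remain
    for (; i < numbers.length - 3; i += 4) {
        l += (long) numbers[i] + numbers[i + 1] + numbers[i + 2] + numbers[i + 3];
    }
    // fallback: the remaining elements, one by one
    for (; i < numbers.length; i++) {
        l += numbers[i];
    }
    return l;
}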
Backward asm
java -jar target/benchmarks.jar -f 1 -psize=10 -jvmArgs "-XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:+PrintAssembly" testBackwardLoop
and the backward asm.
We can see the core loop from line 68 to line 74:
L0004: movslq 0x10(%r10,%rcx,4),%r11
0x00007fa34c559b15: add %r11,%rax ;*ladd
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@22 (line 74)
0x00007fa34c559b18: dec %ecx ;*iinc
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@24 (line 73)
0x00007fa34c559b1a: cmp $0xffffffff,%ecx
0x00007fa34c559b1d: jg L0004 ;*iflt
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@11 (line 73)
Very similar to the forward version, except for the cmp against -1 and the dec instead of inc.
The exit of the method, from line 66 to line 80:
L0005: add $0x20,%rsp
0x00007fa34c559b23: pop %rbp
0x00007fa34c559b24: test %eax,0x15e7a4d6(%rip) # 0x00007fa3623d4000
; {poll_return} *** SAFEPOINT POLL ***
0x00007fa34c559b2a: retq
L0001: mov %rax,%r11 ;*lload_1
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@14 (line 74)
L0002: movslq 0x10(%r10,%rcx,4),%r8
0x00007fa34c559ae8: movslq 0xc(%r10,%rcx,4),%r9
0x00007fa34c559aed: movslq 0x8(%r10,%rcx,4),%rbx
0x00007fa34c559af2: movslq 0x4(%r10,%rcx,4),%rax
0x00007fa34c559af7: add %r11,%r8
0x00007fa34c559afa: add %r8,%r9
0x00007fa34c559afd: add %r9,%rbx
0x00007fa34c559b00: add %rbx,%rax ;*ladd
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@22 (line 74)
0x00007fa34c559b03: add $0xfffffffc,%ecx ;*iinc
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@24 (line 73)
0x00007fa34c559b06: cmp $0x2,%ecx
0x00007fa34c559b09: jg L0001 ;*iflt
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@11 (line 73)
Same as for forward, adding 4 elements at a time.
But the logic to get there is slightly more complicated in the backward loop.
For forward, line 22 to line 43, after getting the array length:
0x00007f7cad1df458: test %ebx,%ebx
0x00007f7cad1df45a: jle L0004 ;*if_icmpge
; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@10 (line 63)
0x00007f7cad1df460: test %ebx,%ebx
0x00007f7cad1df462: jbe L0005
0x00007f7cad1df468: mov %ebx,%r9d
0x00007f7cad1df46b: dec %r9d
0x00007f7cad1df46e: cmp %ebx,%r9d
0x00007f7cad1df471: jae L0005 ;*lload_1
; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@13 (line 64)
0x00007f7cad1df477: movslq 0x10(%r12,%r11,8),%rax ;*i2l
; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@20 (line 64)
0x00007f7cad1df47c: mov %ebx,%r10d
0x00007f7cad1df47f: add $0xfffffffd,%r10d
0x00007f7cad1df483: shl $0x3,%r11
0x00007f7cad1df487: mov $0x80000000,%r8d
0x00007f7cad1df48d: cmp %r10d,%r9d
0x00007f7cad1df490: cmovl %r8d,%r10d
0x00007f7cad1df494: mov $0x1,%r8d
0x00007f7cad1df49a: cmp $0x1,%r10d
0x00007f7cad1df49e: jle L0001 ;*lload_1
; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@13 (line 64)
For backward, line 22 to line 49:
0x00007fa34c559a98: mov %r8d,%ecx
0x00007fa34c559a9b: dec %ecx ;*isub
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@8 (line 73)
0x00007fa34c559a9d: xor %r11d,%r11d
0x00007fa34c559aa0: test %ecx,%ecx
0x00007fa34c559aa2: jl L0006 ;*iflt
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@11 (line 73)
0x00007fa34c559aa8: test %r8d,%r8d
0x00007fa34c559aab: jbe L0007
0x00007fa34c559ab1: cmp %r8d,%ecx
0x00007fa34c559ab4: jae L0007
0x00007fa34c559ab6: add $0xfffffffe,%r8d
0x00007fa34c559aba: shl $0x3,%r10 ;*lload_1
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@14 (line 74)
L0000: movslq 0x10(%r10,%rcx,4),%r9
0x00007fa34c559ac3: add %r9,%r11 ;*ladd
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@22 (line 74)
0x00007fa34c559ac6: dec %ecx ;*iinc
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@24 (line 73)
0x00007fa34c559ac8: cmp %r8d,%ecx
0x00007fa34c559acb: jg L0000 ;*iflt
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@11 (line 73)
0x00007fa34c559acd: cmp $0x2,%ecx
0x00007fa34c559ad0: jle L0008
0x00007fa34c559ad2: jmp L0002
0x00007fa34c559ad4: nopl 0x0(%rax,%rax,1)
0x00007fa34c559adc: data16 data16 xchg %ax,%ax
L0001: mov %rax,%r11 ;*lload_1
; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@14 (line 74)
Perf analysis
Could that be the source of the performance difference?
java -jar target/benchmarks.jar -f 1 -psize=10 -prof=perf testForwardLoop
java -jar target/benchmarks.jar -f 1 -psize=10 -prof=perf testBackwardLoop
Looking at the result of perfasm for forward and backward, it does not look like it.
There is, though, some difference in the percentages for the unrolled part of the loop.
3.71% 3.16% ││││↗ 0x00007fa8d0bcd920: movslq 0x10(%r11,%r8,4),%r9
1.68% 1.46% │││││ 0x00007fa8d0bcd925: add %rax,%r9
0.43% 0.38% │││││ 0x00007fa8d0bcd928: movslq %r8d,%rcx
2.12% 2.11% │││││ 0x00007fa8d0bcd92b: movslq 0x1c(%r11,%rcx,4),%rax
3.52% 3.02% │││││ 0x00007fa8d0bcd930: movslq 0x14(%r11,%rcx,4),%rdi
1.74% 1.14% │││││ 0x00007fa8d0bcd935: movslq 0x18(%r11,%rcx,4),%rcx
0.34% 0.52% │││││ 0x00007fa8d0bcd93a: add %r9,%rdi
2.26% 1.86% │││││ 0x00007fa8d0bcd93d: add %rdi,%rcx
3.85% 3.53% │││││ 0x00007fa8d0bcd940: add %rcx,%rax ;*ladd
│││││ ; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@21 (line 64)
1.99% 1.70% │││││ 0x00007fa8d0bcd943: add $0x4,%r8d ;*iinc
│││││ ; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@23 (line 63)
0.21% 0.32% │││││ 0x00007fa8d0bcd947: cmp %r10d,%r8d
││││╰ 0x00007fa8d0bcd94a: jl 0x00007fa8d0bcd920 ;*if_icmpge
For backward:
0.11% │││ │↗ 0x00007fbab51f4be0: mov %rax,%r11 ;*lload_1
│││ ││ ; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@14 (line 74)
3.41% 3.05% │││ ↘│ 0x00007fbab51f4be3: movslq 0x10(%r10,%rcx,4),%r8
4.46% 4.11% │││ │ 0x00007fbab51f4be8: movslq 0xc(%r10,%rcx,4),%r9
0.38% 0.43% │││ │ 0x00007fbab51f4bed: movslq 0x8(%r10,%rcx,4),%rbx
0.90% 0.96% │││ │ 0x00007fbab51f4bf2: movslq 0x4(%r10,%rcx,4),%rax
2.13% 2.54% │││ │ 0x00007fbab51f4bf7: add %r11,%r8
5.53% 5.61% │││ │ 0x00007fbab51f4bfa: add %r8,%r9
3.49% 4.52% │││ │ 0x00007fbab51f4bfd: add %r9,%rbx
7.58% 7.83% │││ │ 0x00007fbab51f4c00: add %rbx,%rax ;*ladd
│││ │ ; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@22 (line 74)
7.76% 8.00% │││ │ 0x00007fbab51f4c03: add $0xfffffffc,%ecx ;*iinc
│││ │ ; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@24 (line 73)
0.03% 0.01% │││ │ 0x00007fbab51f4c06: cmp $0x2,%ecx
Look at those 7% in the backward results.
Why?
It's unlikely that the adds cost that much; they have either been stalled or invalidated following a branch misprediction.
It is hard to tell by looking at the perf numbers for forward and backward; forward has slightly fewer branch prediction misses. Is it enough to explain the difference? To be honest, I do not really know.
Time to give up on that one. Is the branching more complicated? Is going in reverse not playing nicely with the CPU? Is there something I might have missed?
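If you want to dig into the branch counters further yourself, JMH's perfnorm profiler - assuming perf and the profiler are available on your machine - reports the hardware events, including branch-misses, normalised per operation:

java -jar target/benchmarks.jar -f 1 -psize=10 -prof=perfnorm testForwardLoop testBackwardLoop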
Conclusion
- Never believe a performance assertion without a good rationale
- Never believe a rationale without a reproducible experiment that supports it
- Never believe a benchmark without looking in depth at what is actually benchmarked
- Do not deduce performance implications from looking at the bytecode
And no, reverse loops are not faster than regular loops - caveat: the result only applies to a loop that adds numbers from an array; any generalisation of the results is at your own risk :).
If you have an idea as to what explains the difference, please leave a comment. Thanks!