The csv parser that I wrote in SimpleFlatMapper uses a simple loop + state to parse a csv. The code is a lot simpler than other implementations I’ve seen and, at the time, according to my tests, it was faster. I was testing on java 7, java 8 still being very new.
When I ran it on java 8 the results were quite confusing: sometimes fast, sometimes slow. It looked like deactivating TieredCompilation removed the regression. I played around and, after enough shuffling, the problem disappeared. I did not know why, and did not have the time or skills to really dig into it; it was good enough at the time. Problem solved.
Last week I started to address some code duplication in the parser and faced the same problem again. Eventually I reduced the issue to a few line changes. With a reliable way to reproduce the different scenarios, it became possible to investigate.
The puzzle
I created a GitHub repo to isolate the behaviour and gather some data. It uses a slightly simplified version of sfm-csv and the benchmarking code from mapping-benchmark. The benchmark parses a csv file, stores the cells in a String[] and passes that to the blackhole.
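In shape, the benchmark looks roughly like this - a minimal sketch, assuming sfm-csv’s CsvParser.forEach entry point; the inline csv string stands in for the real benchmark file:

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;
import org.simpleflatmapper.csv.CsvParser;

import java.io.IOException;
import java.io.StringReader;

@State(Scope.Benchmark)
public class CsvParserBenchmark {

    // stands in for the content of the benchmark csv file
    String csv = "id,name,email\n1,name 1,email1@example.com\n";

    @Benchmark
    public void parseCsv(Blackhole blackhole) throws IOException {
        // every row arrives as a String[] and goes straight to the blackhole
        CsvParser.forEach(new StringReader(csv),
                (String[] row) -> blackhole.consume(row));
    }
}
```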
There are two versions of the parser: orig, which gives a consistently fast result, and alt, which is sometimes fast and sometimes slow.
The difference between slow and fast/orig is slightly more than 1.55x - as in fast time * 1.55 = slow time. That is quite an impact, especially when you consider the difference between alt and orig: we moved the mark field from the CharBuffer to the CharConsumer. That should have no impact, and certainly should not degrade performance.
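To be concrete, the change between the two versions is roughly this - the class and field names are from the post, the surrounding members are just illustration, not the actual sfm-csv source:

```java
// orig: mark lives next to the chars it points into
class CharBuffer {
    char[] buffer;
    int mark;       // start index of the current cell
}

// alt: mark moved to the object driving the parsing loop
class CharConsumer {
    CharBuffer csvBuffer;
    int mark;       // same data, now one object away from the chars
}
```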
The data
FlightRecorder sampling
It indicates that a slow run spends a lot more time in the pushCell/newCell/String init/copy-of-range code.
perfasm
That is confirmed by the perfasm run, which identifies the area containing the arraycopy code as the hottest region for a slow run; for a fast run it is not.
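For reference, this kind of run comes from JMH’s built-in perfasm profiler - it needs perf and the hsdis disassembler library installed, and I’m assuming the usual JMH uber-jar location here:

```
java -jar target/benchmarks.jar -prof perfasm
```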
jitwatch
Looking at the Chain graph in jitwatch we can see some differences in inlining between slow and fast. The endOfRow method called in consumeAllBuffer is not inlined in the fast run, but in the orig ref run that method is inlined. So although not inlining it makes perf better for the alt version, inlining it does not cause an issue in the orig version.
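The data jitwatch consumes is the HotSpot compilation log; these are the standard flags to produce it (PrintAssembly again needs hsdis):

```
java -XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading \
     -XX:+LogCompilation -XX:+PrintAssembly ...
```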
That is quite a confusing picture, and not enough to understand what’s up. The only thing left was to look at the generated asm. I printed two times 22 pages of asm in font size 6 and, armed with a highlighter and a pen, eventually figured out why alt slow is slow.
The expected asm
If we look at the asm generated by the orig code we can see the code that checks whether the char is a ‘,’, ‘\r’ or ‘\n’, with the cell-copy code placed after those checks. In the slow alt asm, part of that copy code is moved before the check: that means the asm between line 90 and 168 is executed for every character instead of only when the char matches a ‘,’ or ‘\n’. That is the hot region seen in perfasm, accounting for more than 50% of the cpu cycles.
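To make that concrete, here is a hedged reconstruction of the hot loop’s shape - the names follow the post, the bodies are simplified stand-ins, not the actual sfm-csv code:

```java
class CharConsumerSketch {
    private final java.util.List<String> cells = new java.util.ArrayList<>();
    private final java.util.List<String[]> rows = new java.util.ArrayList<>();
    private int mark;

    void consumeAllBuffer(char[] chars, int from, int to) {
        for (int i = from; i < to; i++) {
            char c = chars[i];
            if (c == ',') {             // the isSeparator path
                newCell(chars, i);      // String init + arraycopy are in here
                mark = i + 1;
            } else if (c == '\n') {
                endOfRow(chars, i);     // ends the current cell and the row
                mark = i + 1;
            }
            // fast asm: the newCell work stays behind these two branches;
            // slow asm: part of it is scheduled before the char test,
            // so it runs for every single character
        }
    }

    private void newCell(char[] chars, int end) {
        // the copy-of-range seen in the FlightRecorder samples
        cells.add(new String(chars, mark, end - mark));
    }

    private void endOfRow(char[] chars, int end) {
        newCell(chars, end);            // second newCell call site
        rows.add(cells.toArray(new String[cells.size()]));
        cells.clear();
    }
}
```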
Why?
Why the orig fast case?
I think it is because mark is in another object. If you look at the compilation logs and compare them to the alt slow case, you can see the only difference is the bc/dependency declaration, which is not present in the slow run.
Why the alt fast case?
endOfRow is not inlined in the alt fast case, so newCell is called only once, in the isSeparator code path, removing the “need” to place some of it before the test.
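My reading of that - a guess, not something I verified in the inlining trace - is that once endOfRow is inlined, the loop body contains newCell’s code on both branches, which gives the optimizer an apparent opportunity to common up the shared part above the test; with endOfRow left as a real call there is a single inlined newCell body and nothing to hoist:

```java
// alt fast: endOfRow is an out-of-line call, so only the ',' branch
// carries an inlined newCell body, and it stays behind the test
if (c == ',') {
    /* inlined newCell code */
} else if (c == '\n') {
    endOfRow(chars, i);   // real call; its newCell stays out of the loop body
}
// alt slow: endOfRow inlined, newCell code now on both branches,
// and part of the shared work gets moved before the char test
```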
Why the alt slow case?
That is the big question, and it would probably need more work to figure out, but here are a few points:
- it happens only with TieredCompilation for that code, but a more simplified version I quickly played with always runs in slow mode, even with C2 only (see the flag toggle after this list).
- it does not happen on the java 9 build 129 that I tested with, so it’s either a known issue that is being fixed, or one that has been fixed as a side effect.
- some of the heuristics might wrongly think there is a benefit to placing that code before the test.
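The flag toggle mentioned in the first point, for reference (again assuming the JMH uber-jar location):

```
java -XX:-TieredCompilation -jar target/benchmarks.jar   # C2 only
java -jar target/benchmarks.jar                          # tiered, the java 8 default
```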
So what?
It’s hard to extract something you can learn from this. The hack to force the fast path is not much of a problem here, but it’s weird to have to do it. If you have unstable performance you might have that kind of issue, and can then spend some time investigating the asm. But if you don’t, that does not mean the problem is not present…
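As an aside, if you suspect a specific inlining decision you can test it from the command line instead of reshuffling code; the method pattern below is a placeholder, not the actual sfm-csv class name:

```
java -XX:CompileCommand=dontinline,org/example/CharConsumer.endOfRow -jar target/benchmarks.jar
```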
What pushed me to fix it was the conviction that the code should be faster than it was, compared to jackson csv or univocity. Having a good benchmark baseline to measure against, or having decent expectations, can help identify those issues.