Add chunked reader fast path #177

Merged
jsoizo merged 3 commits into version_2_0_0 from perf-reader-alloc on May 22, 2026

@jsoizo jsoizo commented May 22, 2026

Summary

  • keep the public lazy Sequence<Char> reader API intact
  • add a chunked reader fast path that walks an 8 KB CharArray buffer directly with double buffering for next-char lookahead
  • route JVM read(InputStream) / readFromFile(File) and kotlinx-io read(Source) through the chunked path
  • add ParseStateMachine.reset() so the state machine, StringBuilder, and fields ArrayList are reused across rows
  • extract CsvReader.applyPipeline so I/O wrappers can attach skipEmptyLine + field-count policy without rerouting chars through a Sequence<Char>
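
To make the difference between the two reading styles concrete, here is a minimal sketch, not the PR's actual code. `charsViaSequence`, `forEachCharChunked`, and the two comma counters are hypothetical names invented for illustration; only `Reader.read()`, `Reader.read(CharArray)`, and the `sequence { }` builder are real APIs.

```kotlin
import java.io.Reader
import java.io.StringReader

// Per-char shape: every char is handed out through the coroutine-based
// sequence builder, one yield per character.
fun Reader.charsViaSequence(): Sequence<Char> = sequence {
    while (true) {
        val c = read()
        if (c < 0) break
        yield(c.toChar())
    }
}

// Chunked shape (hypothetical helper): fill a CharArray once per chunk and
// index into it directly, with no per-char suspension point.
inline fun Reader.forEachCharChunked(bufferSize: Int = 8 * 1024, action: (Char) -> Unit) {
    val buffer = CharArray(bufferSize)
    while (true) {
        val n = read(buffer)
        if (n < 0) break
        for (i in 0 until n) action(buffer[i])
    }
}

fun countCommasPerChar(text: String): Int =
    StringReader(text).charsViaSequence().count { it == ',' }

fun countCommasChunked(text: String, bufferSize: Int = 8 * 1024): Int {
    var count = 0
    StringReader(text).forEachCharChunked(bufferSize) { if (it == ',') count++ }
    return count
}

fun main() {
    val csv = "a,b,c\n1,2,3\n"
    // Both shapes see the same characters; only the delivery mechanism differs.
    println(countCommasPerChar(csv) == countCommasChunked(csv, bufferSize = 4))
}
```

Both functions visit the same characters, so any behavioural difference has to come from how the chars are delivered, which is the part this PR changes.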

Verification

  • ./gradlew check
  • benchmark/parity row-by-row HARD equivalence test passes

Benchmark results

Measured on the same machine as issue #172 (Apple M4 / 32 GB / JDK 21.0.6 / JMH 1.36, seed=42), primary profile (warmup=5 / iter=5 / fork=2 / 10s).

  • Before = v2.0.0 primary numbers from issue #172 primary comment (commit 7fe79b4). Reader sources between 7fe79b4 and this branch are touched only by this PR (ParseStateMachine.kt, SequenceParser.kt, CsvReader.kt, ReaderIo.kt, ReaderIoJvm.kt), so the issue numbers are a valid Before for the reader workloads.
  • After = this PR (cdf3284), measured 2026-05-22.

Throughput (ops/ms, higher is better)

I/O paths (the workloads that drove the issue's long-tail divergence):

| Workload | Dataset | Before | After | Delta | vs v1.10.0 |
|---|---|---|---|---|---|
| readAll(InputStream) | SMALL | 0.7669 | 1.5104 ± 0.0121 | +97% (1.97×) | 1.82× faster |
| readAll(InputStream) | MEDIUM | 0.00407 | 0.00772 ± 0.00005 | +90% (1.90×) | 1.82× faster |
| readAll(InputStream) | HARD | 0.0538 | 0.0979 ± 0.0016 | +82% (1.82×) | 1.97× faster |
| readAll(File) | SMALL | 0.7372 | 1.3543 ± 0.0280 | +84% (1.84×) | 1.68× faster |
| readAll(File) | MEDIUM | 0.00377 | 0.00740 ± 0.00010 | +96% (1.96×) | 1.85× faster |
| readAll(File) | HARD | 0.0554 | 0.0922 ± 0.0009 | +66% (1.66×) | 1.91× faster |
| open(file){ asSequence().count() } | SMALL | 0.7427 | 1.4280 ± 0.0134 | +92% (1.92×) | 1.76× faster |
| open(file){ asSequence().count() } | MEDIUM | 0.00375 | 0.00756 ± 0.00011 | +102% (2.02×) | 1.84× faster |
| open(file){ asSequence().count() } | HARD | 0.0544 | 0.0872 ± 0.0058 | +60% (1.60×) | 1.78× faster |

String paths (already v2-favoured; included for completeness):

| Workload | Dataset | Before | After | Delta | vs v1.10.0 |
|---|---|---|---|---|---|
| readAll(String) | SMALL | 1.6116 | 1.7019 ± 0.0125 | +6% | 2.04× faster |
| readAll(String) | MEDIUM | 0.00819 | 0.00855 ± 0.00021 | +4% | 2.10× faster |
| readAll(String) | HARD | 0.1115 | 0.1154 ± 0.0011 | +3% | 2.45× faster |
| readAllWithHeader(String) | SMALL | 1.4490 | 1.4944 ± 0.0665 | +3% | 1.94× faster |
| readAllWithHeader(String) | MEDIUM | 0.00654 | 0.00708 ± 0.00012 | +8% | 1.85× faster |
| readAllWithHeader(String) | HARD | 0.0969 | 0.1019 ± 0.0054 | +5% | 2.28× faster |

Average time (ms/op, lower is better)

I/O paths (Before suffered the long-tail outliers — thrpt × avgt ≈ 3):

| Workload | Dataset | Before | After | Delta |
|---|---|---|---|---|
| readAll(InputStream) | SMALL | 5.12 | 0.676 ± 0.017 | 7.6× faster |
| readAll(InputStream) | MEDIUM | 947.30 | 129.59 ± 2.69 | 7.3× faster |
| readAll(InputStream) | HARD | 57.56 | 10.23 ± 0.25 | 5.6× faster |
| readAll(File) | SMALL | 4.53 | 0.718 ± 0.020 | 6.3× faster |
| readAll(File) | MEDIUM | 876.29 | 134.31 ± 2.64 | 6.5× faster |
| readAll(File) | HARD | 71.47 | 11.09 ± 0.30 | 6.4× faster |
| open(file){ asSequence().count() } | SMALL | 3.56 | 0.736 ± 0.048 | 4.8× faster |
| open(file){ asSequence().count() } | MEDIUM | 557.28 | 143.61 ± 8.90 | 3.9× faster |
| open(file){ asSequence().count() } | HARD | 27.93 | 11.25 ± 0.36 | 2.5× faster |

Notes

  • The avgt × thrpt divergence flagged in the issue's primary comment for reader I/O workloads (thrpt × avgt ≈ 3 on Before) is resolved on After. Examples: readAll(InputStream) MEDIUM 0.00772 × 129.59 ≈ 1.00, readAll(File) MEDIUM 0.00740 × 134.31 ≈ 0.99. Long-tail Continuation allocations from the per-char sequence { yield(c) } are gone.
  • Score errors shrank substantially (e.g. readAll(File) MEDIUM: Before 876.29 ± 158.70 ms/op ≈ 18% noise → After 134.31 ± 2.64 ≈ 2%), consistent with the parser no longer producing GC-driven outliers.
  • String paths were already v2-favoured (no per-char yield: Kotlin's String.asSequence() iterator does not allocate Continuations). The chunked path is a small additional win (+3–8%), but there was no long-tail there to begin with.
  • gc.alloc.rate.norm (B/op) still sits at v1×1.30–1.36 on I/O paths after this PR — this is the structural per-row allocation (e.g. ArrayList.toList() copies, InputStreamReader internal char buffer) that does not produce long-tails. Tracked as a follow-up for v2.0.x; not a blocker for the long-tail goal of this PR.
  • kotlinx-io vs java.io reader paths sit at ~7–13% CPU overhead with alloc/op parity after this PR. Tracked as a follow-up; not addressed here.
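
The thrpt × avgt check used in the first note is plain arithmetic: throughput is in ops/ms and average time is in ms/op, so when the two JMH modes agree their product is close to 1. The sketch below recomputes that product from the MEDIUM rows of the tables above; the `Measurement` class is a hypothetical helper, not part of the benchmark module.

```kotlin
// Hypothetical helper pairing the two JMH modes for one workload.
data class Measurement(val workload: String, val thrptOpsPerMs: Double, val avgtMsPerOp: Double) {
    // ops/ms * ms/op is dimensionless; values near 1 mean the two modes agree.
    val product: Double get() = thrptOpsPerMs * avgtMsPerOp
}

fun main() {
    // Numbers copied from the throughput and average-time tables in this PR.
    val rows = listOf(
        Measurement("readAll(InputStream) MEDIUM, Before", 0.00407, 947.30),
        Measurement("readAll(InputStream) MEDIUM, After", 0.00772, 129.59),
        Measurement("readAll(File) MEDIUM, Before", 0.00377, 876.29),
        Measurement("readAll(File) MEDIUM, After", 0.00740, 134.31),
    )
    for (r in rows) {
        println("%-40s thrpt x avgt = %.2f".format(r.workload, r.product))
    }
}
```

On these rows the Before products land well above 1 (roughly 3.3 to 3.9), while the After products sit within about 1% of 1.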

Reproduction

```bash
./gradlew :benchmark:v2:jmh -Pbench.profile=primary -Pjmh.include='.ReadBenchmarksV2.'
```

jsoizo added 2 commits May 22, 2026 08:00
Apply jmh.warmupIterations/iterations/fork/timeOnIteration/warmup property
overrides after the bench.profile when block so short-form CLI flags can
override profile defaults during ad-hoc gcprof/stackprof runs.

The v2 reader I/O paths routed every char through a coroutine
sequence builder (`BufferedReader.toCharSequence` and
`Source.toCharSequence`), which allocated a Continuation per character
and produced the avgt × thrpt ≈ 3 long-tail divergence flagged in
issue #172's primary profile.

Add an eager chunked parser that fills a CharArray buffer and walks it
directly, using double buffering to carry the next-char lookahead
across chunk boundaries. The public lazy `Sequence<List<String>>` API
is unchanged; only the I/O wrappers and an internal pipeline helper
are rewired.
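
A rough sketch of how double buffering can carry a one-char lookahead across chunk boundaries. This is an illustration under assumptions, not the code in `SequenceParser.kt`: `forEachCharWithNext`, `fillNonEmpty`, and `collapseDoubledQuotes` are hypothetical names, and the only thing borrowed from the PR is the `(CharArray) -> Int` source shape.

```kotlin
import java.io.StringReader

// Reads until at least one char arrives; returns 0 at end of input.
private fun fillNonEmpty(fill: (CharArray) -> Int, buffer: CharArray): Int {
    while (true) {
        val n = fill(buffer)
        if (n < 0) return 0
        if (n > 0) return n
    }
}

// Hypothetical helper: walks every char together with the char that follows
// it, even when the two live in different chunks.
fun forEachCharWithNext(
    fill: (CharArray) -> Int,
    bufferSize: Int = 8 * 1024,
    action: (current: Char, next: Char?) -> Unit,
) {
    require(bufferSize >= 1) // this sketch's own guard
    var current = CharArray(bufferSize)
    var next = CharArray(bufferSize)
    var currentLen = fillNonEmpty(fill, current)
    while (currentLen > 0) {
        // Prefetch the following chunk so next[0] can serve as lookahead.
        val nextLen = fillNonEmpty(fill, next)
        for (i in 0 until currentLen) {
            val lookahead: Char? = when {
                i + 1 < currentLen -> current[i + 1]
                nextLen > 0 -> next[0]
                else -> null
            }
            action(current[i], lookahead)
        }
        // Swap the two buffers instead of allocating a new one.
        val tmp = current; current = next; next = tmp
        currentLen = nextLen
    }
}

// Example consumer: collapse a doubled quote "" into a single quote.
fun collapseDoubledQuotes(text: String, bufferSize: Int): String {
    val reader = StringReader(text)
    val out = StringBuilder()
    var skip = false
    forEachCharWithNext({ reader.read(it) }, bufferSize) { c, next ->
        when {
            skip -> skip = false // second quote of a pair, already emitted
            c == '"' && next == '"' -> { out.append('"'); skip = true }
            else -> out.append(c)
        }
    }
    return out.toString()
}

fun main() {
    // With bufferSize = 2 the pair "" straddles the chunk boundary.
    println(collapseDoubledQuotes("a\"\"b", bufferSize = 2))
}
```

Prefetching the second buffer before walking the first means the lookahead never needs a special "refill in the middle of a char" path.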

- `ParseStateMachine.reset()` lets `SequenceParser` reuse one machine
  instance across rows (no per-row alloc of machine / StringBuilder /
  fields ArrayList).
- `parseRowsFromChunks((CharArray) -> Int, dialect, stripBom)` is the
  new internal entry point; per-char `sequence { yield(c) }` is gone.
- `CsvReader.applyPipeline` exposes the skipEmptyLine + field-count
  policy stages for I/O wrappers to drive directly.
- JVM `read(InputStream)` / `readFromFile(File)` wrap
  `BufferedReader.read(CharArray)`; kotlinx-io `read(Source)` writes
  decoded code points straight into the chunk buffer.
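
The reuse idea behind `ParseStateMachine.reset()` can be sketched as follows. `RowAccumulator` and `parseUnquoted` are hypothetical stand-ins written for this illustration; they handle only unquoted fields and do not reflect the real state machine's API.

```kotlin
// Hypothetical stand-in for a per-row parser state that is reused, not reallocated.
class RowAccumulator {
    private val field = StringBuilder()
    private val fields = ArrayList<String>()

    fun append(c: Char) { field.append(c) }

    fun endField() {
        fields.add(field.toString())
        field.setLength(0) // keep the StringBuilder's backing array
    }

    fun endRow(): List<String> {
        endField()
        val row = fields.toList() // the only per-row copy left in this sketch
        reset()
        return row
    }

    // Clears state in place so the same instance serves the next row.
    fun reset() {
        field.setLength(0)
        fields.clear()
    }
}

// Simplified parser: comma-separated, newline-terminated, no quoting.
// A final row without a trailing newline is ignored for brevity.
fun parseUnquoted(text: String): List<List<String>> {
    val acc = RowAccumulator() // one instance for the whole input
    val rows = ArrayList<List<String>>()
    for (c in text) {
        when (c) {
            ',' -> acc.endField()
            '\n' -> rows.add(acc.endRow())
            else -> acc.append(c)
        }
    }
    return rows
}

fun main() {
    println(parseUnquoted("a,b\n1,2\n"))
}
```
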
codecov Bot commented May 22, 2026

Codecov Report

❌ Patch coverage is 98.59155% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 94.87%. Comparing base (3decc39) to head (40d41d6).
⚠️ Report is 1 commit behind head on version_2_0_0.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...jsoizo/kotlincsv/reader/internal/SequenceParser.kt | 98.00% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@                Coverage Diff                @@
##           version_2_0_0     #177      +/-   ##
=================================================
+ Coverage          93.92%   94.87%   +0.94%     
=================================================
  Files                 22       22              
  Lines                461      507      +46     
  Branches             107      116       +9     
=================================================
+ Hits                 433      481      +48     
+ Misses                16       13       -3     
- Partials              12       13       +1     

Add coverage for the new chunked reader fast path so the PR diff is no
longer below the codecov threshold:

- SequenceParserTest: drive `parseRowsFromChunks` directly with a
  custom `(CharArray) -> Int` source, including small-buffer chunk
  boundary swaps, CR/LF on a chunk boundary, the `require(bufferSize
  >= 2)` guard, BOM strip on/off (default), and an unterminated quote
  whose tail-flush takes the null-result branch.
- CsvReaderJvmIoTest: exercise the I/O pipeline with skipEmptyLine
  and with an input larger than the default 8 KB chunk so the double
  buffer swap runs end-to-end.
- CsvReaderPathSmokeTest: call the kotlinx-io Path overloads of
  readFromFile/readAllFromFile with default options, and read a
  multi-chunk file so the kotlinx-io chunk reader exits via its
  `index >= limit` branch.
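
One way to drive a `(CharArray) -> Int` consumer through every possible chunk boundary is a source that hands out a string in pieces of a chosen size. The sketch below assumes nothing about the real tests: `chunkSourceOf` and `countLineTerminators` are hypothetical names, and the consumer is a deliberately small terminator counter rather than the CSV parser.

```kotlin
// Hypothetical test source: serves `text` in pieces of at most `maxChunk` chars,
// returning -1 once the text is exhausted (same contract as Reader.read).
fun chunkSourceOf(text: String, maxChunk: Int): (CharArray) -> Int {
    require(maxChunk >= 1)
    var pos = 0
    return { buffer ->
        if (pos >= text.length) {
            -1
        } else {
            val n = minOf(maxChunk, buffer.size, text.length - pos)
            for (i in 0 until n) buffer[i] = text[pos + i]
            pos += n
            n
        }
    }
}

// Small consumer: counts line terminators, treating CRLF as one and a lone CR
// or lone LF as one each. The CR flag is carried across chunk boundaries.
fun countLineTerminators(fill: (CharArray) -> Int, bufferSize: Int): Int {
    val buffer = CharArray(bufferSize)
    var count = 0
    var previousWasCr = false
    while (true) {
        val n = fill(buffer)
        if (n < 0) break
        for (i in 0 until n) {
            val c = buffer[i]
            when {
                c == '\r' -> { count++; previousWasCr = true }
                c == '\n' -> { if (!previousWasCr) count++; previousWasCr = false }
                else -> previousWasCr = false
            }
        }
    }
    return count
}

fun main() {
    val text = "a\r\nb\rc\n" // CRLF, lone CR, lone LF
    // Try every chunk size so each boundary position is exercised at least once.
    for (chunk in 1..text.length) {
        val got = countLineTerminators(chunkSourceOf(text, chunk), bufferSize = 16)
        check(got == 3) { "chunk=$chunk got=$got" }
    }
    println("ok")
}
```

Sweeping the chunk size from 1 up to the text length places the CR/LF split on a boundary without hand-picking offsets.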
@jsoizo jsoizo merged commit 3a5dc3e into version_2_0_0 May 22, 2026
5 checks passed
@jsoizo jsoizo deleted the perf-reader-alloc branch May 22, 2026 13:29
jsoizo added a commit that referenced this pull request May 22, 2026
Add follow-up coverage for the chunked reader fast path introduced in
PR #177:

- doubled quote `""` straddling a chunk boundary, so the cross-buffer
  next-char lookahead has to find the second `"` at nextBuffer[0] for
  skipCount=1 to do the right thing
- explicit-escape `\\<target>` straddling a chunk boundary, same
  cross-buffer next-char path with a non-self escape char
- lone CR at a chunk end followed by a non-LF char in the next chunk,
  so the CR terminator must not consume the next field char as part
  of CRLF
- supplementary code point (U+1F600) at the parseRowsFromChunks layer
  where it just passes through as ordinary chars, and at the
  kotlinx-io Source layer with the 😀 high surrogate at index
  `buffer.size - 2` so the low surrogate must land on the reserved
  last slot — a regression in `limit = buffer.size - 1` would overflow

Also rewrite the existing chunked-path test comments to lead with
why-the-test-exists instead of restating the parser branch.
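
The reserved-last-slot rule from the supplementary code point case can be illustrated with a small fill function. This is a sketch under assumptions: `fillFromCodePoints` is a hypothetical name, and a plain `Iterator<Int>` of code points stands in for decoding from a kotlinx-io `Source`. `Character.toChars(int, char[], int)` is the real JDK call and returns the number of chars written.

```kotlin
// Hypothetical chunk filler: writes code points into `buffer` as UTF-16 chars.
// Returns the number of chars written, or -1 when no code points remain.
fun fillFromCodePoints(codePoints: Iterator<Int>, buffer: CharArray): Int {
    require(buffer.size >= 2)
    // Stop starting new code points one slot early, so a surrogate pair that
    // begins at index buffer.size - 2 still has room for its low surrogate.
    val limit = buffer.size - 1
    var index = 0
    while (index < limit && codePoints.hasNext()) {
        index += Character.toChars(codePoints.next(), buffer, index)
    }
    return if (index == 0) -1 else index
}

fun main() {
    // "ab" + U+1F600 + "c": the high surrogate lands at index 2 of a 4-char buffer.
    val codePoints = "ab\uD83D\uDE00c".codePoints().iterator()
    val buffer = CharArray(4)
    var n = fillFromCodePoints(codePoints, buffer)
    while (n >= 0) {
        println("read $n chars")
        n = fillFromCodePoints(codePoints, buffer)
    }
}
```

With `limit = buffer.size` instead, a pair starting at the last index would write past the end of the array, which is the overflow the follow-up test guards against.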