
Version 2.0 🎉 #180

Merged: jsoizo merged 120 commits into main from version_2_0_0 on May 23, 2026
Conversation

@jsoizo (Owner) commented May 23, 2026

Closes #149

Summary

  • Introduces the 2.0 API under com.jsoizo.kotlincsv: CsvDialect, immutable reader/writer configs, sequence-first read/write, header helpers, and v2 exception types.
  • Adds kotlinx-io based common/JVM/JS/Native I/O, JS file I/O, Native targets, and wasmWasi main compilation support.
  • Removes the legacy 1.x API surface and refreshes README, the migration guide, CI, benchmarks, parity tests, and property tests for 2.0.0.

Verification

  • ./gradlew checkKotlinAbi jvmTest jsNodeTest macosArm64Test iosSimulatorArm64Test compileKotlinIosArm64 compileTestKotlinIosArm64 compileKotlinLinuxArm64 compileTestKotlinLinuxArm64 compileKotlinWasmWasi koverXmlReport

jsoizo added 30 commits August 12, 2024 22:39
Change namespace to com.jsoizo
Add com.jsoizo.kotlincsv.CsvDialect as a data class capturing the
four CSV format fields (delimiter / quoteChar / escapeChar /
lineTerminator) shared by reader and writer, plus RFC4180 and TSV
presets. Construction is validated via require(): delimiter must
differ from quoteChar and escapeChar, and lineTerminator must not
be empty. Existing reader / writer code paths are unchanged --
wiring CsvDialect into them is the next phase.
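
The validation rules in this commit message can be sketched as a small data class. This is a stand-in rather than the library's `CsvDialect`: the name `DialectSketch`, the field types, and the default values (in particular the `"\r\n"` line terminator) are assumptions made for illustration. Only the three `require()` rules and the existence of RFC4180 and TSV presets come from the message itself.

```kotlin
// Hypothetical stand-in for CsvDialect; field types and defaults are assumptions.
data class DialectSketch(
    val delimiter: Char = ',',
    val quoteChar: Char = '"',
    val escapeChar: Char = '"',          // equal to quoteChar, as in the RFC 4180 default
    val lineTerminator: String = "\r\n", // assumed default, not confirmed by the PR
) {
    init {
        // The three construction rules listed in the commit message.
        require(delimiter != quoteChar) { "delimiter must differ from quoteChar" }
        require(delimiter != escapeChar) { "delimiter must differ from escapeChar" }
        require(lineTerminator.isNotEmpty()) { "lineTerminator must not be empty" }
    }

    companion object {
        // Presets in the spirit of the RFC4180 and TSV presets named above.
        val RFC4180 = DialectSketch()
        val TSV = DialectSketch(delimiter = '\t')
    }
}
```

With this sketch, `DialectSketch(delimiter = '"')` fails with an `IllegalArgumentException` at construction time, so an invalid dialect never reaches a reader or writer.
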
Add kotest-assertions-core (multiplatform) to commonTest dependencies
and rewrite CsvDialectTest using kotest assertion style (shouldThrow,
shouldNotBeNull, shouldContain). The kotest runner is intentionally
not added; tests still run via kotlin.test's @Test annotation.
Align the existing exception tests with the new commonTest convention
(kotlin.test runner + kotest-assertions-core assertions). Behaviour
unchanged; only assertion call sites are migrated to shouldBe /
shouldContain / shouldBeInstanceOf / shouldNotBeNull.
Implements Phase 3 of the v2 migration: a Sequence<Char> -> Sequence<List<String>>
core reader living under com.jsoizo.kotlincsv.reader, with the immutable
CsvReaderConfig (data class) + CsvReaderConfigBuilder (DSL receiver) split,
top-level csvReader { } / csvReader(config) entry points, and the
Sequence<List<String>>.withHeader() extension that yields LinkedHashMap rows.

The legacy ParseStateMachine is reused unchanged except for an internal
isLineComplete() observer added so the new SequenceParser can detect row
boundaries while driving the state machine character-by-character. Legacy
util.CSVParseFormatException raised by ParseStateMachine is converted into
the new exceptions.CsvParseFormatException at the parser boundary.

The old client / dsl / util packages are left untouched and continue to
work alongside the new API; they will be removed in a later phase.
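
The `withHeader()` idea from this commit, where the first row supplies the keys for every later row, can be sketched with the standard library alone. The function name `withHeaderSketch` is hypothetical, and the handling of ragged rows (silently truncated by `zip`) is an assumption of this sketch, not something the PR specifies.

```kotlin
// Hypothetical sketch of a header helper; not the library's implementation.
fun Sequence<List<String>>.withHeaderSketch(): Sequence<Map<String, String>> = sequence {
    val rows = this@withHeaderSketch.iterator()
    if (!rows.hasNext()) return@sequence // empty input yields no records
    val header = rows.next()
    for (row in rows) {
        // LinkedHashMap keeps the columns in header order.
        val record = LinkedHashMap<String, String>(header.size)
        // Assumption: zip stops at the shorter list, so ragged rows are truncated here.
        for ((name, value) in header.zip(row)) record[name] = value
        yield(record)
    }
}
```

Because the result is still a `Sequence`, rows are only pulled from the underlying source when the caller iterates, which matches the sequence-first design described in the summary.
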
jsoizo added 27 commits May 13, 2026 15:16
- Declare wasmWasi(nodejs) in build.gradle.kts so commonMain (kotlinx-io
  based reader/writer) compiles for WASI runtimes.
- Disable compileTestKotlinWasmWasi / wasmWasiTest / wasmWasiNodeTest
  until Kotest publishes wasmWasi artifacts; commonTest cannot be
  compiled for this target otherwise.
- Add compileKotlinWasmWasi to the Linux CI job to guard against
  regressions.
…ialects

For dialects where `escapeChar != quoteChar` (e.g. `CsvDialect(escapeChar = '\\')`),
the reader previously treated the escape character literally in the START and
DELIMITER states, and accepted only `escapeChar` (not `quoteChar`) as the
escaped character in the FIELD state. This rejected CSV produced by other
libraries (Python `csv`, Apache Commons CSV, OpenCSV) that emit unquoted
escape sequences such as `a\\b` or `a\"b`.

Introduce a `handleUnquotedEscape` helper and route START / DELIMITER / FIELD
through it so all three states share the same `{escapeChar, quoteChar}`
acceptance set. When `escapeChar == quoteChar` (RFC 4180 default) the helper
degenerates to a single-element accept set, preserving the existing strict
behaviour (e.g. `a"b` under the default dialect still throws).

Closes #168
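
A minimal sketch of the shared acceptance set described above, written against a single unquoted field rather than the real state machine. The function name and signature are hypothetical, and passing every non-escape character through unchanged is a simplification: a real parser would also apply its own quote and delimiter rules.

```kotlin
// Hypothetical helper; the name and signature are illustrative only.
fun unescapeUnquoted(field: String, escapeChar: Char, quoteChar: Char): String {
    // Collapses to a single element when escapeChar == quoteChar.
    val accepted = setOf(escapeChar, quoteChar)
    val out = StringBuilder(field.length)
    var i = 0
    while (i < field.length) {
        val c = field[i]
        if (c == escapeChar) {
            val next = field.getOrNull(i + 1)
            if (next == null || next !in accepted) {
                throw IllegalArgumentException("invalid escape sequence at index $i")
            }
            out.append(next) // keep the escaped character, drop the escape itself
            i += 2
        } else {
            out.append(c) // simplification: all other characters pass through
            i++
        }
    }
    return out.toString()
}
```

With a backslash escape, both `a\\b` and `a\"b` are accepted. With `escapeChar == quoteChar`, the input `a"b` is rejected because `b` is not in the one-element set, which mirrors the strict default behaviour the commit preserves.
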
Add a property-based test that generates arbitrary unquoted-safe fields under
`CsvDialect(escapeChar = '\\')`, serialises them with the minimal Python-csv
style backslash escaping (`\` -> `\\`, `"` -> `\"`), and asserts the reader
round-trips them. 500 iterations, seed 0L, follows the project PBT conventions
(checkAll + runTest, generators reused from PbtArbs).
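
The round-trip property can be approximated without kotest. The sketch below swaps in `kotlin.random.Random` with a fixed seed for the project's `checkAll` generators, and all three function names, the alphabet, and the field length bound are assumptions. It only illustrates the escape-then-read shape of the property, not the project's actual test.

```kotlin
import kotlin.random.Random

// Minimal Python-csv style escaping: '\' becomes '\\' and '"' becomes '\"'.
fun escapeBackslashStyle(field: String): String = buildString {
    for (c in field) {
        if (c == '\\' || c == '"') append('\\')
        append(c)
    }
}

// Inverse used by the check below: a backslash makes the next char literal.
fun unescapeBackslashStyle(text: String): String = buildString {
    var i = 0
    while (i < text.length) {
        if (text[i] == '\\' && i + 1 < text.length) i++ // skip the escape char
        append(text[i])
        i++
    }
}

// Stand-in for the property test: 500 iterations with seed 0, as in the commit.
fun roundTripHolds(iterations: Int = 500, seed: Long = 0L): Boolean {
    val random = Random(seed)
    // Assumed "unquoted-safe" alphabet: no delimiter, CR, or LF.
    val alphabet = "ab1 \\\"".toList()
    repeat(iterations) {
        val length = random.nextInt(0, 12)
        val field = CharArray(length) { alphabet.random(random) }.concatToString()
        if (unescapeBackslashStyle(escapeBackslashStyle(field)) != field) return false
    }
    return true
}
```

A fixed seed keeps failures reproducible, which is the same reason the commit pins its seed.
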
…cape

feat: parse escape sequences in unquoted fields (closes #168)
Scaffolds four JVM-only subprojects under benchmark/ to visualize the
performance characteristics of v1.10.0 (Maven artifact, com.github.doyaaaaaken.*)
against v2.0.0 (this branch, com.jsoizo.*) on identical workloads.

Subprojects:
- benchmark/shared: deterministic data generation (CsvDataGen / DatasetSpec /
  DataStats) and environment probe. Depends only on kotlin-stdlib so it stays
  out of every benchmark classpath as a library.
- benchmark/v1: JMH source set whose only kotlin-csv on classpath is
  com.jsoizo:kotlin-csv-jvm:1.10.0. Covers readAll(String/InputStream/File),
  iterative Sequence over File, readAllWithHeader, writeAll(OutputStream/File).
- benchmark/v2: JMH source set whose only kotlin-csv on classpath is the
  current project. Mirrors the v1 workloads on the v2 API and adds
  V2BackendBenchmarks comparing java.io vs kotlinx-io paths.
- benchmark/parity: JUnit subproject that intentionally puts both v1 and v2
  on the test classpath (FQCNs do not collide) and asserts row-by-row
  equality on the HARD dataset for readAll/readAllWithHeader/writeAll.

Classpath isolation is achieved by separating v1 and v2 into different
Gradle resolution scopes; this stops Gradle from collapsing kotlinx-coroutines
(and other transitive deps) to a single version across the two artifacts.
Resolved jmhRuntimeClasspath was verified to contain v1 only on the v1 side
and the v2 project only on the v2 side.

JMH defaults: warmup=5, iterations=5, fork=2, modes=throughput+avgt,
jvmArgs=[-Xms2g,-Xmx2g], JDK 21 toolchain. -Pbench.profile=large|gcprof|
stackprof overrides the defaults for the long-running LARGE dataset and
profiler runs. -Pjmh.include / -Pjmh.warmupIterations / -Pjmh.iterations /
-Pjmh.fork allow per-invocation overrides for smoke runs.
Sets warmup=3, iter=3, fork=1, time=5s and restricts dataset @Param to
SMALL and HARD via JMH '-p dataset' equivalent. Intended for the first
issue #172 comment so readers see numbers before the full primary run
finishes.
The MapProperty<String, ListProperty<String>> setter does not accept a
plain List<String>. Wrap the value in objects.listProperty(...).set(...).
Restricts dataset @Param to SMALL/MEDIUM/HARD; the LARGE dataset is
covered by the separate 'large' profile per the methodology in #172.
Apply jmh.warmupIterations/iterations/fork/timeOnIteration/warmup property
overrides after the bench.profile when block so short-form CLI flags can
override profile defaults during ad-hoc gcprof/stackprof runs.
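
The ordering described here (defaults, then profile, then explicit properties) can be sketched in plain Kotlin, outside of Gradle. The defaults and the `jmh.*` property names come from the commit messages; the `JmhSettings` type, the `resolveSettings` function, and the profile values are placeholders invented for this sketch.

```kotlin
// Hypothetical model of the resolved JMH settings.
data class JmhSettings(val warmupIterations: Int, val iterations: Int, val fork: Int)

fun resolveSettings(profile: String?, props: Map<String, String>): JmhSettings {
    // Defaults quoted in the benchmark commit: warmup=5, iterations=5, fork=2.
    val defaults = JmhSettings(warmupIterations = 5, iterations = 5, fork = 2)

    // Placeholder profile values; the project's real numbers are not shown in the PR.
    val profiled = when (profile) {
        "gcprof", "stackprof" -> defaults.copy(warmupIterations = 2, iterations = 3, fork = 1)
        else -> defaults
    }

    // Overrides are applied after the profile block, so explicit flags always win.
    return profiled.copy(
        warmupIterations = props["jmh.warmupIterations"]?.toInt() ?: profiled.warmupIterations,
        iterations = props["jmh.iterations"]?.toInt() ?: profiled.iterations,
        fork = props["jmh.fork"]?.toInt() ?: profiled.fork,
    )
}
```

If the override step ran before the `when` block instead, a profile would silently overwrite whatever was passed on the command line, which is the situation this commit avoids.
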
The v2 reader I/O paths routed every char through a coroutine
sequence builder (`BufferedReader.toCharSequence` and
`Source.toCharSequence`), which allocated a Continuation per character
and produced the avgt × thrpt ≈ 3 long-tail divergence flagged in
issue #172's primary profile.

Add an eager chunked parser that fills a CharArray buffer and walks it
directly, using double buffering to carry the next-char lookahead
across chunk boundaries. The public lazy `Sequence<List<String>>` API
is unchanged; only the I/O wrappers and an internal pipeline helper
are rewired.

- `ParseStateMachine.reset()` lets `SequenceParser` reuse one machine
  instance across rows (no per-row alloc of machine / StringBuilder /
  fields ArrayList).
- `parseRowsFromChunks((CharArray) -> Int, dialect, stripBom)` is the
  new internal entry point; per-char `sequence { yield(c) }` is gone.
- `CsvReader.applyPipeline` exposes the skipEmptyLine + field-count
  policy stages for I/O wrappers to drive directly.
- JVM `read(InputStream)` / `readFromFile(File)` wrap
  `BufferedReader.read(CharArray)`; kotlinx-io `read(Source)` writes
  decoded code points straight into the chunk buffer.
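
To illustrate the double-buffering idea, here is a deliberately reduced sketch: a line splitter rather than a CSV parser. It shows how a one-character lookahead can cross a chunk boundary by peeking at the first slot of the spare buffer. The names `linesFromChunks` and `chunkSource` are hypothetical, and the `fill` callback is assumed to follow the `Reader.read(CharArray)` convention (number of chars written, or -1 at end of input).

```kotlin
// Hypothetical sketch: a line splitter, not the library's CSV parser.
fun linesFromChunks(bufferSize: Int = 8192, fill: (CharArray) -> Int): Sequence<String> {
    require(bufferSize > 0) { "bufferSize must be positive" }
    return sequence {
        var current = CharArray(bufferSize)
        var next = CharArray(bufferSize)
        var currentLen = maxOf(fill(current), 0)
        var nextLen = if (currentLen > 0) maxOf(fill(next), 0) else 0
        val line = StringBuilder() // reused across rows instead of reallocated
        var skipNext = false       // true when a CR has already claimed the following LF
        while (currentLen > 0) {
            for (i in 0 until currentLen) {
                if (skipNext) { skipNext = false; continue }
                val c = current[i]
                // One-char lookahead; at the end of `current` it peeks into `next`.
                val peek: Char? = when {
                    i + 1 < currentLen -> current[i + 1]
                    nextLen > 0 -> next[0]
                    else -> null
                }
                when (c) {
                    '\r' -> { skipNext = peek == '\n'; yield(line.toString()); line.clear() }
                    '\n' -> { yield(line.toString()); line.clear() }
                    else -> line.append(c)
                }
            }
            // Swap the buffers and refill the spare one.
            val spare = current
            current = next
            next = spare
            currentLen = nextLen
            nextLen = if (currentLen > 0) maxOf(fill(next), 0) else 0
        }
        if (line.isNotEmpty()) yield(line.toString())
    }
}

// Test helper: serves a string in chunks no larger than the caller's buffer.
fun chunkSource(text: String): (CharArray) -> Int {
    var pos = 0
    return { buffer ->
        if (pos >= text.length) {
            -1
        } else {
            val n = minOf(buffer.size, text.length - pos)
            for (k in 0 until n) buffer[k] = text[pos + k]
            pos += n
            n
        }
    }
}
```

Calling `linesFromChunks(bufferSize = 3, fill = chunkSource("ab\r\ncd\re\n")).toList()` splits a CRLF across the first chunk boundary and places a lone CR at the start of the last chunk. The caller still sees a lazy `Sequence`, while the inner loop walks a plain `CharArray`.
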
Add coverage for the new chunked reader fast path so the PR diff is no
longer below the codecov threshold:

- SequenceParserTest: drive `parseRowsFromChunks` directly with a
  custom `(CharArray) -> Int` source, including small-buffer chunk
  boundary swaps, CR/LF on a chunk boundary, the `require(bufferSize
  >= 2)` guard, BOM strip on/off (default), and an unterminated quote
  whose tail-flush takes the null-result branch.
- CsvReaderJvmIoTest: exercise the I/O pipeline with skipEmptyLine
  and with an input larger than the default 8 KB chunk so the double
  buffer swap runs end-to-end.
- CsvReaderPathSmokeTest: call the kotlinx-io Path overloads of
  readFromFile/readAllFromFile with default options, and read a
  multi-chunk file so the kotlinx-io chunk reader exits via its
  `index >= limit` branch.
Add follow-up coverage for the chunked reader fast path introduced in
PR #177:

- doubled quote `""` straddling a chunk boundary, so the cross-buffer
  next-char lookahead has to find the second `"` at nextBuffer[0] for
  skipCount=1 to do the right thing
- explicit-escape `\\<target>` straddling a chunk boundary, same
  cross-buffer next-char path with a non-self escape char
- lone CR at a chunk end followed by a non-LF char in the next chunk,
  so the CR terminator must not consume the next field char as part
  of CRLF
- supplementary code point (U+1F600) at the parseRowsFromChunks layer
  where it just passes through as ordinary chars, and at the
  kotlinx-io Source layer with the 😀 high surrogate at index
  `buffer.size - 2` so the low surrogate must land on the reserved
  last slot — a regression in `limit = buffer.size - 1` would overflow

Also rewrite the existing chunked-path test comments to lead with
why-the-test-exists instead of restating the parser branch.
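
The reserved-last-slot invariant behind the surrogate test can be sketched as follows. This is an illustration under assumptions, not the kotlinx-io code path: `fillFromCodePoints` is a hypothetical function that writes code points into a `CharArray` and stops accepting new code points at `limit = buffer.size - 1`, so a supplementary code point that starts at index `buffer.size - 2` still has room for its low surrogate.

```kotlin
// Hypothetical sketch of filling a char chunk from decoded code points.
// Returns the number of chars written, or -1 when the input is exhausted.
fun fillFromCodePoints(codePoints: Iterator<Int>, buffer: CharArray): Int {
    require(buffer.size >= 2) { "buffer must hold at least one surrogate pair" }
    // Reserve the last slot: a pair that starts at limit - 1 ends at buffer.size - 1.
    val limit = buffer.size - 1
    var index = 0
    while (index < limit && codePoints.hasNext()) {
        val cp = codePoints.next()
        if (cp < 0x10000) {
            buffer[index++] = cp.toChar()
        } else {
            // Standard UTF-16 encoding of a supplementary code point.
            val offset = cp - 0x10000
            buffer[index++] = (0xD800 + (offset shr 10)).toChar()
            buffer[index++] = (0xDC00 + (offset and 0x3FF)).toChar()
        }
    }
    return if (index == 0) -1 else index
}
```

If `limit` were `buffer.size` instead, a high surrogate written at the final index would push the low surrogate past the end of the array, which is the overflow the regression test guards against.
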
Add cross-chunk boundary regression tests for chunked reader
codecov Bot commented May 23, 2026

Codecov Report

❌ Patch coverage is 92.94118% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.37%. Comparing base (19c2883) to head (ecaf608).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ain/kotlin/com/jsoizo/kotlincsv/writer/WriterIo.kt | 83.33% | 4 Missing and 2 partials ⚠️ |
| ...soizo/kotlincsv/writer/internal/SequenceEncoder.kt | 92.94% | 2 Missing and 4 partials ⚠️ |
| ...mmonMain/kotlin/com/jsoizo/kotlincsv/CsvDialect.kt | 82.75% | 3 Missing and 2 partials ⚠️ |
| ...jsoizo/kotlincsv/reader/internal/SequenceParser.kt | 94.73% | 0 Missing and 4 partials ⚠️ |
| .../jsoizo/kotlincsv/reader/CsvReaderConfigBuilder.kt | 83.33% | 2 Missing ⚠️ |
| ...ain/kotlin/com/jsoizo/kotlincsv/reader/ReaderIo.kt | 92.00% | 2 Missing ⚠️ |
| .../kotlin/com/jsoizo/kotlincsv/writer/WriterIoJvm.kt | 88.23% | 2 Missing ⚠️ |
| ...in/kotlin/com/jsoizo/kotlincsv/reader/CsvReader.kt | 96.00% | 0 Missing and 1 partial ⚠️ |
| ...izo/kotlincsv/reader/internal/ParseStateMachine.kt | 95.83% | 1 Missing ⚠️ |
| .../kotlin/com/jsoizo/kotlincsv/reader/ReaderIoJvm.kt | 94.44% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##               main     #180      +/-   ##
============================================
+ Coverage     86.66%   93.37%   +6.70%     
============================================
  Files            21       22       +1     
  Lines          1282      528     -754     
  Branches        192      122      -70     
============================================
- Hits           1111      493     -618     
+ Misses           32       17      -15     
+ Partials        139       18     -121     


@jsoizo jsoizo merged commit d1c9c81 into main May 23, 2026
5 checks passed
@jsoizo jsoizo deleted the version_2_0_0 branch May 23, 2026 12:19
Development

Successfully merging this pull request may close these issues.

Version 2.0 Release Plan
