Version 2.0 🎉 #180
Merged
Change namespace to com.jsoizo
Bump KMP to 2.1
Bump Gradle and libraries
Change publishing plugin
Use version catalog file
…Exception, CsvFieldNumDifferentException)
Add com.jsoizo.kotlincsv.CsvDialect as a data class capturing the four CSV format fields (delimiter / quoteChar / escapeChar / lineTerminator) shared by reader and writer, plus RFC4180 and TSV presets. Construction is validated via require(): delimiter must differ from quoteChar and escapeChar, and lineTerminator must not be empty. Existing reader / writer code paths are unchanged -- wiring CsvDialect into them is the next phase.
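A minimal sketch of what such a validated data class could look like. The four field names, the `require()` rules, and the two preset names come from the description above; the default values and the exact shape of the presets are assumptions, not the library's actual source.

```kotlin
// Hypothetical sketch; the real com.jsoizo.kotlincsv.CsvDialect may differ.
data class CsvDialect(
    val delimiter: Char = ',',          // assumed default
    val quoteChar: Char = '"',          // assumed default
    val escapeChar: Char = '"',         // assumed default (escapeChar == quoteChar)
    val lineTerminator: String = "\r\n" // assumed default
) {
    init {
        // Validation rules as stated in the commit message.
        require(delimiter != quoteChar) { "delimiter must differ from quoteChar" }
        require(delimiter != escapeChar) { "delimiter must differ from escapeChar" }
        require(lineTerminator.isNotEmpty()) { "lineTerminator must not be empty" }
    }

    companion object {
        // Preset names from the commit message; their field values are assumed.
        val RFC4180 = CsvDialect()
        val TSV = CsvDialect(delimiter = '\t')
    }
}
```

Because validation lives in `init`, it also runs for instances created through `copy()`, so an invalid dialect cannot be produced by tweaking a preset.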
Add kotest-assertions-core (multiplatform) to commonTest dependencies and rewrite CsvDialectTest using kotest assertion style (shouldThrow, shouldNotBeNull, shouldContain). The kotest runner is intentionally not added; tests still run via kotlin.test's @Test annotation.
Align the existing exception tests with the new commonTest convention (kotlin.test runner + kotest-assertions-core assertions). Behaviour unchanged; only assertion call sites are migrated to shouldBe / shouldContain / shouldBeInstanceOf / shouldNotBeNull.
Implements Phase 3 of the v2 migration: a Sequence<Char> -> Sequence<List<String>>
core reader living under com.jsoizo.kotlincsv.reader, with the immutable
CsvReaderConfig (data class) + CsvReaderConfigBuilder (DSL receiver) split,
top-level csvReader { } / csvReader(config) entry points, and the
Sequence<List<String>>.withHeader() extension that yields LinkedHashMap rows.
The legacy ParseStateMachine is reused unchanged except for an internal
isLineComplete() observer added so the new SequenceParser can detect row
boundaries while driving the state machine character-by-character. Legacy
util.CSVParseFormatException raised by ParseStateMachine is converted into
the new exceptions.CsvParseFormatException at the parser boundary.
The old client / dsl / util packages are left untouched and continue to
work alongside the new API; they will be removed in a later phase.
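As an illustration of the header helper described above, here is a rough sketch of a `withHeader()` extension over `Sequence<List<String>>`. Only the name and the LinkedHashMap row type come from the commit message; the return type, the empty-input behaviour, and the size check are assumptions made for this example.

```kotlin
// Hypothetical sketch; the library's actual withHeader() may differ in
// return type and in how it reports rows whose size does not match the header.
fun Sequence<List<String>>.withHeader(): Sequence<Map<String, String>> = sequence {
    val rows = this@withHeader.iterator()
    if (!rows.hasNext()) return@sequence      // assumed: empty input yields no rows
    val header = rows.next()                  // first row supplies the keys
    while (rows.hasNext()) {
        val row = rows.next()
        // Assumed policy for this sketch: reject rows of a different width.
        require(row.size == header.size) {
            "expected ${header.size} fields but got ${row.size}"
        }
        val map = LinkedHashMap<String, String>(header.size)
        for (i in header.indices) map[header[i]] = row[i]
        yield(map)                            // LinkedHashMap keeps header order
    }
}
```

The extension stays lazy: the header row is consumed only when the first element of the resulting sequence is requested.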
Add Kotlin/Native targets
- Declare wasmWasi(nodejs) in build.gradle.kts so commonMain (kotlinx-io based reader/writer) compiles for WASI runtimes.
- Disable compileTestKotlinWasmWasi / wasmWasiTest / wasmWasiNodeTest until Kotest publishes wasmWasi artifacts; commonTest cannot be compiled for this target otherwise.
- Add compileKotlinWasmWasi to the Linux CI job to guard against regressions.
Add wasmWasi target
…ialects
For dialects where `escapeChar != quoteChar` (e.g. `CsvDialect(escapeChar = '\\')`),
the reader previously treated the escape character literally in the START and
DELIMITER states, and accepted only `escapeChar` (not `quoteChar`) as the
escaped character in the FIELD state. This rejected CSV produced by other
libraries (Python `csv`, Apache Commons CSV, OpenCSV) that emit unquoted
escape sequences such as `a\\b` or `a\"b`.
Introduce a `handleUnquotedEscape` helper and route START / DELIMITER / FIELD
through it so all three states share the same `{escapeChar, quoteChar}`
acceptance set. When `escapeChar == quoteChar` (RFC 4180 default) the helper
degenerates to a single-element accept set, preserving the existing strict
behaviour (e.g. `a"b` under the default dialect still throws).
Closes #168
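The shared acceptance set can be illustrated outside the state machine. The function below is a standalone sketch, not the actual `handleUnquotedEscape` helper: its name, signature, and the rejection of a bare quote character are assumptions for this example. It shows how `{escapeChar, quoteChar}` collapses to a single element when the two characters are equal.

```kotlin
// Hypothetical standalone illustration of the acceptance-set idea.
// It decodes one unquoted field; the real reader works per state and per char.
fun unescapeUnquoted(field: String, escapeChar: Char, quoteChar: Char): String {
    // One element when escapeChar == quoteChar (the default described above).
    val accepted = setOf(escapeChar, quoteChar)
    val out = StringBuilder()
    var i = 0
    while (i < field.length) {
        val c = field[i]
        when {
            c == escapeChar -> {
                val next = field.getOrNull(i + 1)
                    ?: throw IllegalArgumentException("dangling escape at index $i")
                if (next !in accepted) {
                    throw IllegalArgumentException("invalid escape target '$next' at index ${i + 1}")
                }
                out.append(next)
                i += 2
            }
            // Assumption of this sketch: a bare quote inside an unquoted field is an error.
            c == quoteChar -> throw IllegalArgumentException("bare quote at index $i")
            else -> {
                out.append(c)
                i++
            }
        }
    }
    return out.toString()
}
```

With a backslash escape, both `a\\b` and `a\"b` decode; with `escapeChar == quoteChar`, the input `a"b` is rejected because `b` is not in the one-element accept set.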
Add a property-based test that generates arbitrary unquoted-safe fields under `CsvDialect(escapeChar = '\\')`, serialises them with the minimal Python-csv style backslash escaping (`\` -> `\\`, `"` -> `\"`), and asserts the reader round-trips them. 500 iterations, seed 0L, follows the project PBT conventions (checkAll + runTest, generators reused from PbtArbs).
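The round-trip property can be sketched without the project's test infrastructure. The example below substitutes a seeded `kotlin.random.Random` loop for kotest's `checkAll` and a tiny inverse function for the real reader; all function names and the character alphabet are invented for illustration.

```kotlin
import kotlin.random.Random

// Minimal backslash escaping as described above: \ -> \\ and " -> \"
fun escapeMinimal(field: String): String = buildString {
    for (c in field) {
        if (c == '\\' || c == '"') append('\\')
        append(c)
    }
}

// Stand-in for the reader: inverse of escapeMinimal.
// Assumes well-formed input (every backslash is followed by another char).
fun unescapeMinimal(escaped: String): String = buildString {
    var i = 0
    while (i < escaped.length) {
        if (escaped[i] == '\\') i++ // skip the escape char, keep its target
        append(escaped[i])
        i++
    }
}

// Seeded loop standing in for a property-based test (500 iterations, seed 0L).
fun roundTripsAll(iterations: Int = 500, seed: Long = 0L): Boolean {
    val rng = Random(seed)
    // Unquoted-safe alphabet for this sketch: no delimiter, CR, or LF.
    val alphabet = "ab\\\" 1"
    repeat(iterations) {
        val length = rng.nextInt(0, 16)
        val field = CharArray(length) { alphabet[rng.nextInt(alphabet.length)] }.concatToString()
        if (unescapeMinimal(escapeMinimal(field)) != field) return false
    }
    return true
}
```

A fixed seed keeps failures reproducible, which is the same reason the commit pins the seed for its generator-based test.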
…cape feat: parse escape sequences in unquoted fields (closes #168)
Scaffolds four JVM-only subprojects under benchmark/ to visualize the performance characteristics of v1.10.0 (Maven artifact, com.github.doyaaaaaken.*) against v2.0.0 (this branch, com.jsoizo.*) on identical workloads.

Subprojects:
- benchmark/shared: deterministic data generation (CsvDataGen / DatasetSpec / DataStats) and environment probe. Depends only on kotlin-stdlib so it stays out of every benchmark classpath as a library.
- benchmark/v1: JMH source set whose only kotlin-csv on classpath is com.jsoizo:kotlin-csv-jvm:1.10.0. Covers readAll(String/InputStream/File), iterative Sequence over File, readAllWithHeader, writeAll(OutputStream/File).
- benchmark/v2: JMH source set whose only kotlin-csv on classpath is the current project. Mirrors the v1 workloads on the v2 API and adds V2BackendBenchmarks comparing java.io vs kotlinx-io paths.
- benchmark/parity: JUnit subproject that intentionally puts both v1 and v2 on the test classpath (FQCNs do not collide) and asserts row-by-row equality on the HARD dataset for readAll/readAllWithHeader/writeAll.

Classpath isolation is achieved by separating v1 and v2 into different Gradle resolution scopes; this stops Gradle from collapsing kotlinx-coroutines (and other transitive deps) to a single version across the two artifacts. Resolved jmhRuntimeClasspath was verified to contain v1 only on the v1 side and the v2 project only on the v2 side.

JMH defaults: warmup=5, iterations=5, fork=2, modes=throughput+avgt, jvmArgs=[-Xms2g,-Xmx2g], JDK 21 toolchain. -Pbench.profile=large|gcprof|stackprof overrides the defaults for the long-running LARGE dataset and profiler runs. -Pjmh.include / -Pjmh.warmupIterations / -Pjmh.iterations / -Pjmh.fork allow per-invocation overrides for smoke runs.
Sets warmup=3, iter=3, fork=1, time=5s and restricts dataset @Param to SMALL and HARD via JMH '-p dataset' equivalent. Intended for the first issue #172 comment so readers see numbers before the full primary run finishes.
The MapProperty<String, ListProperty<String>> setter does not accept a plain List<String>. Wrap the value in objects.listProperty(...).set(...).
Restricts dataset @Param to SMALL/MEDIUM/HARD; the LARGE dataset is covered by the separate 'large' profile per the methodology in #172.
add v2 benchmark
Add direct writer fast path
Apply jmh.warmupIterations/iterations/fork/timeOnIteration/warmup property overrides after the bench.profile when block so short-form CLI flags can override profile defaults during ad-hoc gcprof/stackprof runs.
The v2 reader I/O paths routed every char through a coroutine sequence builder (`BufferedReader.toCharSequence` and `Source.toCharSequence`), which allocated a Continuation per character and produced the avgt × thrpt ≈ 3 long-tail divergence flagged in issue #172's primary profile.

Add an eager chunked parser that fills a CharArray buffer and walks it directly, using double buffering to carry the next-char lookahead across chunk boundaries. The public lazy `Sequence<List<String>>` API is unchanged; only the I/O wrappers and an internal pipeline helper are rewired.

- `ParseStateMachine.reset()` lets `SequenceParser` reuse one machine instance across rows (no per-row alloc of machine / StringBuilder / fields ArrayList).
- `parseRowsFromChunks((CharArray) -> Int, dialect, stripBom)` is the new internal entry point; per-char `sequence { yield(c) }` is gone.
- `CsvReader.applyPipeline` exposes the skipEmptyLine + field-count policy stages for I/O wrappers to drive directly.
- JVM `read(InputStream)` / `readFromFile(File)` wrap `BufferedReader.read(CharArray)`; kotlinx-io `read(Source)` writes decoded code points straight into the chunk buffer.
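The double-buffering idea can be sketched in isolation. The function below is not `parseRowsFromChunks`; it is a hypothetical helper that only shows how a second buffer supplies the next-char lookahead when the current char is the last one in its chunk. The `(CharArray) -> Int` source shape follows the commit message, while the function names, the `-1` end-of-input convention, and the callback signature are assumptions.

```kotlin
// Hypothetical sketch of chunked iteration with one-char lookahead.
// readChunk fills the given buffer and returns the number of chars written,
// or -1 at end of input (assumed convention, mirroring java.io.Reader.read).
fun forEachCharWithLookahead(
    readChunk: (CharArray) -> Int,
    bufferSize: Int = 8192,
    action: (current: Char, next: Char?) -> Unit,
) {
    require(bufferSize >= 1) { "bufferSize must be positive" }
    var current = CharArray(bufferSize)
    var next = CharArray(bufferSize)
    var currentLen = fillChunk(readChunk, current)
    while (currentLen > 0) {
        // Fill the second buffer first so the last char of `current`
        // can peek at the first char of the following chunk.
        val nextLen = fillChunk(readChunk, next)
        for (i in 0 until currentLen) {
            val lookahead: Char? = when {
                i + 1 < currentLen -> current[i + 1]
                nextLen > 0 -> next[0]
                else -> null // end of input
            }
            action(current[i], lookahead)
        }
        // Swap buffers instead of allocating a new one per chunk.
        val tmp = current
        current = next
        next = tmp
        currentLen = nextLen
    }
}

// Retries zero-length reads; returns a positive count or -1 at end of input.
fun fillChunk(readChunk: (CharArray) -> Int, buffer: CharArray): Int {
    while (true) {
        val n = readChunk(buffer)
        if (n != 0) return n
    }
}
```

On the JVM a `java.io.Reader` could be adapted with `{ buf -> reader.read(buf) }`. No per-character object is allocated here, which is the property the commit is after when it drops the per-char `sequence { yield(c) }` pipeline.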
Add coverage for the new chunked reader fast path so the PR diff is no longer below the codecov threshold:
- SequenceParserTest: drive `parseRowsFromChunks` directly with a custom `(CharArray) -> Int` source, including small-buffer chunk boundary swaps, CR/LF on a chunk boundary, the `require(bufferSize >= 2)` guard, BOM strip on/off (default), and an unterminated quote whose tail-flush takes the null-result branch.
- CsvReaderJvmIoTest: exercise the I/O pipeline with skipEmptyLine and with an input larger than the default 8 KB chunk so the double buffer swap runs end-to-end.
- CsvReaderPathSmokeTest: call the kotlinx-io Path overloads of readFromFile/readAllFromFile with default options, and read a multi-chunk file so the kotlinx-io chunk reader exits via its `index >= limit` branch.
Add chunked reader fast path
Add follow-up coverage for the chunked reader fast path introduced in PR #177:
- doubled quote `""` straddling a chunk boundary, so the cross-buffer next-char lookahead has to find the second `"` at nextBuffer[0] for skipCount=1 to do the right thing
- explicit-escape `\\<target>` straddling a chunk boundary, same cross-buffer next-char path with a non-self escape char
- lone CR at a chunk end followed by a non-LF char in the next chunk, so the CR terminator must not consume the next field char as part of CRLF
- supplementary code point (U+1F600) at the parseRowsFromChunks layer where it just passes through as ordinary chars, and at the kotlinx-io Source layer with the 😀 high surrogate at index `buffer.size - 2` so the low surrogate must land on the reserved last slot; a regression in `limit = buffer.size - 1` would overflow

Also rewrite the existing chunked-path test comments to lead with why-the-test-exists instead of restating the parser branch.
Add cross-chunk boundary regression tests for chunked reader
Fix some logic and docs
Closes #149
Summary
com.jsoizo.kotlincsv: CsvDialect, immutable reader/writer configs, sequence-first read/write, header helpers, and v2 exception types.
Verification
./gradlew checkKotlinAbi jvmTest jsNodeTest macosArm64Test iosSimulatorArm64Test compileKotlinIosArm64 compileTestKotlinIosArm64 compileKotlinLinuxArm64 compileTestKotlinLinuxArm64 compileKotlinWasmWasi koverXmlReport