Renovate library interface #165
Merged
…Exception, CsvFieldNumDifferentException)
Add com.jsoizo.kotlincsv.CsvDialect as a data class capturing the four CSV format fields (delimiter / quoteChar / escapeChar / lineTerminator) shared by reader and writer, plus RFC4180 and TSV presets. Construction is validated via require(): delimiter must differ from quoteChar and escapeChar, and lineTerminator must not be empty. Existing reader / writer code paths are unchanged -- wiring CsvDialect into them is the next phase.
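The validation described above can be sketched as a small data class. This is an illustration only, not the library's source: the name `DialectSketch` and all default values are assumptions, and only the four field names and the three `require()` rules come from the commit message.

```kotlin
// Hypothetical stand-in for CsvDialect; defaults are assumed, not taken from the library.
data class DialectSketch(
    val delimiter: Char = ',',
    val quoteChar: Char = '"',
    val escapeChar: Char = '"',
    val lineTerminator: String = "\r\n",
) {
    init {
        // The three invariants named in the commit message.
        require(delimiter != quoteChar) { "delimiter must differ from quoteChar" }
        require(delimiter != escapeChar) { "delimiter must differ from escapeChar" }
        require(lineTerminator.isNotEmpty()) { "lineTerminator must not be empty" }
    }

    companion object {
        // Presets; the exact values used by the real RFC4180 / TSV presets are assumed here.
        val RFC4180 = DialectSketch()
        val TSV = DialectSketch(delimiter = '\t')
    }
}
```

Because the checks live in `init`, they also run on `copy(...)`, so a derived dialect cannot bypass them.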
Add kotest-assertions-core (multiplatform) to commonTest dependencies and rewrite CsvDialectTest using kotest assertion style (shouldThrow, shouldNotBeNull, shouldContain). The kotest runner is intentionally not added; tests still run via kotlin.test's @Test annotation.
Align the existing exception tests with the new commonTest convention (kotlin.test runner + kotest-assertions-core assertions). Behaviour unchanged; only assertion call sites are migrated to shouldBe / shouldContain / shouldBeInstanceOf / shouldNotBeNull.
Implements Phase 3 of the v2 migration: a Sequence<Char> -> Sequence<List<String>>
core reader living under com.jsoizo.kotlincsv.reader, with the immutable
CsvReaderConfig (data class) + CsvReaderConfigBuilder (DSL receiver) split,
top-level csvReader { } / csvReader(config) entry points, and the
Sequence<List<String>>.withHeader() extension that yields LinkedHashMap rows.
The legacy ParseStateMachine is reused unchanged except for an internal
isLineComplete() observer added so the new SequenceParser can detect row
boundaries while driving the state machine character-by-character. Legacy
util.CSVParseFormatException raised by ParseStateMachine is converted into
the new exceptions.CsvParseFormatException at the parser boundary.
The old client / dsl / util packages are left untouched and continue to
work alongside the new API; they will be removed in a later phase.
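A minimal sketch of the header-pairing idea behind withHeader(), assuming the first row is the header and every later row is zipped against it. The name `withHeaderSketch` is hypothetical, and the real extension's handling of short or long rows is not reproduced here; `zip` simply truncates to the shorter side.

```kotlin
// Hypothetical sketch: treat the first row as the header and pair it with each later row.
fun Sequence<List<String>>.withHeaderSketch(): Sequence<Map<String, String>> = sequence {
    val rows = iterator()
    if (!rows.hasNext()) return@sequence          // empty input: no header, no records
    val header = rows.next()
    for (row in rows) {
        // LinkedHashMap keeps the columns in header order.
        val record = LinkedHashMap<String, String>()
        header.zip(row).forEach { (name, value) -> record[name] = value }
        yield(record)
    }
}
```

The sequence builder keeps the operation lazy: the header is read only when the first record is requested.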
Mirrors the Phase 3 reader pattern for the writer side:
- writer/WriteQuoteMode.kt re-declares the enum under writer/ to keep the
v2 API self-contained. The legacy enum at dsl.context.WriteQuoteMode
is still consumed by v1 CsvFileWriter and will be removed in Phase 9.
- writer/CsvWriterConfig.kt is an immutable data class with no init
block — all format-level invariants are already enforced by CsvDialect.
- writer/CsvWriterConfigBuilder.kt is a public class with an internal
constructor so the builder is only reachable via the upcoming
csvWriter { ... } DSL.
No tests for these — copy()/equals/default-value behaviour is guaranteed
by the data class contract (matching how CsvReaderConfig/Builder shipped
in Phase 3 without dedicated unit tests).
Implements the Sequence<List<String>> -> Sequence<Char> encoder under com.jsoizo.kotlincsv.writer, with the eager writeAll(List<List<String>>) wrapper that joins the output into a String. Mirrors the Phase 3 reader split where the lazy core lives in reader/internal/SequenceParser.kt.
Quoting follows v1 semantics for CANONICAL/ALL/NON_NUMERIC. Escape output now branches on the dialect:
- escapeChar == quoteChar (RFC 4180 doubling, v1-compatible)
- escapeChar != quoteChar (explicit escape, a v2-only extension)
The line-terminator policy uses an iter.hasNext() lookahead inside the sequence builder so the trailing terminator decision (controlled by outputLastLineTerminator) is taken after the final row without materialising the whole input.
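The two escape styles and the hasNext() lookahead could look roughly like the following. Both function names are hypothetical, every field is quoted for simplicity (similar in spirit to the ALL mode), and escaping the escape character itself in the explicit style is an assumption rather than confirmed writer behaviour.

```kotlin
// Hypothetical sketch of the escape branch; always wraps the field in quotes.
fun quoteFieldSketch(field: String, quoteChar: Char, escapeChar: Char): String = buildString {
    append(quoteChar)
    for (c in field) {
        if (escapeChar == quoteChar) {
            // RFC 4180 style: a quote inside the field is doubled.
            if (c == quoteChar) append(quoteChar)
        } else {
            // Explicit-escape style: prefix quote and escape characters with escapeChar.
            if (c == quoteChar || c == escapeChar) append(escapeChar)
        }
        append(c)
    }
    append(quoteChar)
}

// Hypothetical sketch of the lazy encoder with the trailing-terminator lookahead.
fun encodeRowsSketch(
    rows: Sequence<List<String>>,
    delimiter: Char = ',',
    quoteChar: Char = '"',
    escapeChar: Char = '"',
    lineTerminator: String = "\r\n",
    outputLastLineTerminator: Boolean = true,
): Sequence<Char> = sequence {
    val iter = rows.iterator()
    while (iter.hasNext()) {
        val line = iter.next().joinToString(delimiter.toString()) {
            quoteFieldSketch(it, quoteChar, escapeChar)
        }
        yieldAll(line.asSequence())
        // Only the last row consults the flag; earlier rows always get a terminator.
        if (iter.hasNext() || outputLastLineTerminator) yieldAll(lineTerminator.asSequence())
    }
}
```

Since the lookahead pulls at most one row ahead, the encoder also works on unbounded input as long as the consumer stops early.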
Adds the top-level csvWriter { ... } / csvWriter(config) helpers under
com.jsoizo.kotlincsv, mirroring the Phase 3 csvReader DSL. Includes
common round-trip tests that verify the writer's output is parseable
back into the original rows by the new reader for RFC 4180, TSV, and
explicit-escape (escapeChar = '\\') dialects.
Adds kotlinx-io 0.7.0 as a commonMain dependency for the I/O layer. Note: 0.7.0 is selected (not 0.9.0) because 0.8+ is built with Kotlin 2.2+, which is metadata-incompatible with this project's Kotlin 2.1.0 consumer.
Adds the reader I/O layer over kotlinx-io. CsvReader.read(source, ...) and read(path, ...) decode UTF-8 in one pass and feed a Sequence<List<String>> to the caller's lambda. The Path overload owns the underlying Source and closes it on lambda exit (normal return, take(n) early stop, or exception). The CsvReadIoOptions.stripBom flag drops a leading U+FEFF before parsing.
Adds the writer I/O layer over kotlinx-io. CsvWriter.write(rows, sink, ...) buffers encoded characters into 8 KiB chunks and writes them via writeString, preserving the lazy Sequence<Char> output of the core encoder. The Path overload opens the file in truncate mode (kotlinx-io default) and closes the sink on normal return or exception. List<List<String>> overloads delegate to the Sequence variant. CsvWriteIoOptions.prependBom emits a leading U+FEFF before the body when enabled.
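As a rough illustration of the chunked buffering, the sketch below drains a lazy Sequence<Char> into fixed-size String chunks. `writeChunkedSketch` is a hypothetical name, and the `sink` callback stands in for the kotlinx-io writeString call mentioned above; the 8192 default mirrors the 8 KiB figure from the commit message.

```kotlin
// Hypothetical sketch: buffer a lazy char stream and hand it to the sink in chunks.
fun writeChunkedSketch(chars: Sequence<Char>, chunkSize: Int = 8192, sink: (String) -> Unit) {
    val buffer = StringBuilder(chunkSize)
    for (c in chars) {
        buffer.append(c)
        if (buffer.length >= chunkSize) {
            sink(buffer.toString())   // flush a full chunk
            buffer.clear()
        }
    }
    // Flush whatever is left after the last full chunk.
    if (buffer.isNotEmpty()) sink(buffer.toString())
}
```

Only one chunk is held in memory at a time, so the laziness of the upstream sequence is preserved.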
Adds 1 reader and 2 writer overloads that take a path as String, delegating to the kotlinx.io.files.Path overload via Path(path). Lets callers use the path APIs without having to import kotlinx-io.
Adds three test files that exercise the I/O layer end-to-end:
- IoRoundTripTest verifies write -> read across the in-memory Sink/Source pair, with and without BOM prepend/strip, recovering the original rows.
- CsvReaderPathSmokeTest and CsvWriterPathSmokeTest exercise the String -> Path -> Source/Sink delegation against a real temp file.
Removes the special-case U+FEFF skip from ParseStateMachine.START so the parser no longer treats BOM as a control character. CsvReadIoOptions.stripBom becomes the single source of BOM handling: BOM is a byte-boundary concept owned by the I/O layer, not the format-level parser.
Test updates:
- ReaderIoTest.read_source_stripBomFalse_keepsLeadingBomInFirstField now asserts the U+FEFF survives parsing when stripBom is off.
- SequenceParserTest.bomCharacter_passedThroughByParser replaces the prior bomCharacter_skippedAtStart test; the parser sees BOM as a data char now.
- The legacy client's BOM-prefixed file tests in CsvReaderTest are disabled; the legacy code path no longer strips a leading BOM at the I/O layer either, so reading a BOM-prefixed CSV via the legacy DSL is broken.
Pins the contract that CsvWriter.write(rows, sink, ...) calls flush() on the underlying RawSink once at the end, so downstream consumers see all bytes before close. Prevents accidental regression if the chunked-flush path is later reorganised.
The String overload of read/write takes a path on disk, but the bare String type leaves the meaning ambiguous (any string could be a URL, key, etc.). Naming the parameter `filePath` clarifies intent without giving up the kotlinx-io-style `path` for the typed Path overload, where the type already carries the meaning.
Replaces source.readString() (which materialised the whole input as a String before parsing) with an incremental Sequence<Char> built from Source.readCodePointValue(). The parser can now short-circuit on take(n) / first() and stop pulling bytes from the underlying source, so the lazy Sequence<List<String>> contract becomes meaningful end-to-end on JVM.
Supplementary-plane code points are emitted as UTF-16 surrogate pairs so the char stream stays compatible with the existing parser. BOM strip moves to a per-code-point check on the very first read, replacing the prior String-prefix substring(1).
Notes:
- JS (Node.js) keeps its full-file load characteristic because kotlinx-io 0.7.0's FileSource calls fs.readFileSync on first read; this change does not alter that, only the JVM streaming path.
- ReaderIoTest gains coverage for surrogate pair round-trip and short-circuit-after-take.
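The surrogate-pair and first-code-point BOM handling described in this commit can be sketched over a plain Sequence<Int> of code points. `toUtf16CharsSketch` is a hypothetical name; in the real reader the code points come from Source.readCodePointValue(), which is replaced here by an ordinary sequence so the example is self-contained.

```kotlin
// Hypothetical sketch: turn Unicode code points into a lazy stream of UTF-16 chars.
fun Sequence<Int>.toUtf16CharsSketch(stripBom: Boolean = false): Sequence<Char> = sequence {
    var first = true
    for (cp in this@toUtf16CharsSketch) {
        val isLeadingBom = first && stripBom && cp == 0xFEFF
        first = false
        if (isLeadingBom) continue                 // drop U+FEFF only at the very start
        if (cp < 0x10000) {
            yield(cp.toChar())                     // BMP code point: one char
        } else {
            val offset = cp - 0x10000              // supplementary plane: split into a pair
            yield((0xD800 + (offset shr 10)).toChar())
            yield((0xDC00 + (offset and 0x3FF)).toChar())
        }
    }
}
```

Nothing is read from the upstream sequence until a char is requested, which is what lets take(n) stop the decode early.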
- Drop the redundant `if (exhausted()) return@sequence` early-return; the while loop's exhausted() check already covers an empty source without any spurious work.
- Clarify the Source overload KDoc to note that streaming decode is JVM-only; on JS (Node.js) the underlying kotlinx-io FileSource loads the entire file on first read, so the Sequence yields from an in-memory buffer there.
Code comments should not reference process vocabulary like "Phase N". Removes the phase mentions from RoundTripTest and CsvReaderPathSmokeTest so the comments express only the durable intent of the tests.
Adds the JVM reader I/O layer that accepts a charset by IANA name. CsvReader.read(file, charset, ...) opens a FileInputStream owned by the function and closes it on lambda exit; CsvReader.read(stream, charset, ...) leaves the InputStream owned by the caller and closes neither the stream nor the internally created InputStreamReader/BufferedReader. Java charset aliases (e.g. "SJIS" for Shift_JIS) are resolved through Charset.forName. The Sequence<Char> is built by pulling one char at a time from BufferedReader.read() so take(n) / first() short-circuits stop fetching bytes from the underlying stream. BOM is matched after charset decoding, so stripBom = true drops a leading U+FEFF regardless of the charset. Note: the Source.readCodePointValue() path used by the commonMain overload is UTF-8 only and cannot be reused here, so this overload builds a separate decode path through InputStreamReader.
Adds the JVM writer I/O layer that accepts a charset by IANA name. CsvWriter.write(rows, file, charset, ...) opens a FileOutputStream in truncate mode owned by the function and closes it after the write; CsvWriter.write(rows, stream, charset, ...) leaves the OutputStream owned by the caller, only flushing the OutputStreamWriter at the end so written bytes are visible without waiting for close. Encoded characters are buffered into 8 KiB chunks before being handed to OutputStreamWriter.write, mirroring the commonMain Sink overload. List<List<String>> overloads delegate to the Sequence variant. CsvWriteIoOptions.prependBom emits a leading U+FEFF; the actual bytes depend on the charset encoder.
Document the exception surface for CsvReader.read/readAll, withHeader, and the I/O-layer reader extensions so callers know which failures to catch and where they surface (terminal operation for Sequence-returning APIs). Spell out the kotlinx-io fs.readFileSync constraint on Node.js inline so JS users encounter the memory caveat at the call site.
Document IOException surfaces for CsvWriter sink/path/file/stream extensions, and clarify that the core CsvWriter.write/writeAll do not throw by themselves (encoding-only paths). Charset-aware JVM overloads list UnsupportedCharsetException / IllegalCharsetNameException so callers know when an invalid charset name fails fast.
Centralize the long-form notes that don't fit cleanly into per-symbol KDoc — Sequence laziness semantics, resource ownership rules, the exception hierarchy, and the JS in-memory-load caveat — into a single Module.md following the Dokka Gradle example pattern. Wire it into dokkaHtml via dokkaSourceSets.includes so the published API docs render the new sections.
A single-file reference for users coming from v1, covering dependency updates, package moves, reader/writer/IO API changes, removed features and their replacements, new capabilities, behavioural shifts (lazy Sequence exception timing, JS in-memory load), and a cookbook of v1 to v2 rewrites mirroring the v1 README examples.
Rebuild the README around the v2 surface (csvReader / csvWriter DSL, CsvDialect, lambda-style I/O extensions) and trim the v1 line-by-line and openAsync cookbook into a one-line link to V2_MIGRATION_GUIDE.md. The badges, contributing notes, and acknowledgments carry over; only the code samples change.
The parser treats U+2028, U+2029, and U+0085 as row terminators in addition to LF/CR, but the writer's canonical-quote check only quoted fields containing LF/CR or the dialect's configured line terminator. A field whose value was one of those Unicode separators was therefore emitted unquoted and parsed back as a row break, corrupting the Writer -> Reader round-trip. Quote whenever a field contains any character the parser recognises as a row terminator. Add a round-trip regression test in RoundTripTest.
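A minimal sketch of the widened quoting check, assuming a fixed set of row terminators matching the five characters listed above. The names `ROW_TERMINATORS_SKETCH` and `needsQuoteSketch` are hypothetical and do not correspond to the writer's internal functions.

```kotlin
// Characters the commit message says the parser treats as row terminators.
val ROW_TERMINATORS_SKETCH: Set<Char> = setOf('\n', '\r', '\u2028', '\u2029', '\u0085')

// Hypothetical sketch: quote when the field holds a delimiter, a quote, or any row terminator.
fun needsQuoteSketch(field: String, delimiter: Char = ',', quoteChar: Char = '"'): Boolean =
    field.any { it == delimiter || it == quoteChar || it in ROW_TERMINATORS_SKETCH }
```

Tying the check to the parser's own terminator set, rather than to the dialect's configured lineTerminator alone, is what keeps the writer and reader in agreement.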
The block-shaped I/O extensions (read(source/path/file/stream) {}) cannot
forbid callers from returning the Sequence at the type level, so the
"do not leak the Sequence past the block" rule has to be a contract.
The README/Module phrasing previously implied it was enforced.
- Add readAll overloads alongside each read overload (commonMain:
Source/Path/String; jvmMain: File/InputStream). Each is a thin
read(...) { it.toList() } wrapper, giving callers a direct path to
an eager List<List<String>> without writing the lambda by hand.
- Restate the resource-management contract in Module.md and on each
read(...) KDoc: callers must consume the Sequence inside the block,
and readAll is the right tool when an eager list is wanted.
- Cover readAll(source) and the readAll(filePath) overload signature in
commonTest, plus readAll(file) and readAll(stream) (with caller-owned
stream-not-closed assertion) in jvmTest.
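The contract above can be illustrated with a fake resource. Everything in this sketch is hypothetical (`FakeSourceSketch`, `readSketch`, `readAllSketch`, and the naive comma split standing in for the parser); it only shows why a Sequence returned out of the block fails once the resource is closed, and why the eager wrapper is safe.

```kotlin
// Hypothetical resource that refuses to produce data after close().
class FakeSourceSketch(private val data: List<String>) {
    var closed = false
        private set

    fun rows(): Sequence<String> = sequence {
        for (line in data) {
            check(!closed) { "source already closed" }
            yield(line)
        }
    }

    fun close() { closed = true }
}

// Block-shaped read: the source is closed when the block exits, normally or not.
fun <T> readSketch(source: FakeSourceSketch, block: (Sequence<List<String>>) -> T): T =
    try {
        block(source.rows().map { it.split(',') })   // naive split, not a real CSV parser
    } finally {
        source.close()
    }

// Eager wrapper: consumes the Sequence inside the block, so nothing lazy escapes.
fun readAllSketch(source: FakeSourceSketch): List<List<String>> =
    readSketch(source) { it.toList() }
```

Calling `readSketch(source) { it }` compiles, but iterating the returned Sequence afterwards throws, which is the situation the documentation change warns about.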
The v2 design treats CsvReaderConfig / CsvWriterConfig as immutable value objects, with CsvReader / CsvWriter holding them in a public property so callers can inspect or copy(...) them. The implementation had them as private val, which forced callers to keep a parallel reference if they wanted base.config.copy(skipEmptyLine = false). Drop the private modifier on both. The configs are data classes with no mutable state, so widening visibility introduces no aliasing risk.
Adds SequenceParserTest cases for branches that the existing test surface left as Codecov misses or partials:
- CRLF in START state (empty leading row)
- field-state escape mismatch in default dialect
- DELIMITER state followed by U+2028 / U+2029 / U+0085 / CR / CRLF
- QUOTE_END state followed by U+2028 / U+2029 / U+0085 / CR / CRLF
- escapeChar != quoteChar with an invalid escape character
- escapeChar != quoteChar with an unterminated escape at EOF
- unterminated quote at EOF dropping the in-flight row
The String overload `CsvReader.readAll(filePath: String)` collided with the core API `CsvReader.readAll(text: String)` and was silently shadowed by the member function. Rename file-targeted I/O extensions on Reader and Writer to make intent explicit and avoid the ambiguity:
- read/readAll(path|filePath|file) -> readFromFile/readAllFromFile
- write(rows, path|filePath|file) -> writeToFile
Source / Sink / InputStream / OutputStream overloads keep their original names since they are stream-oriented and never collided.
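The shadowing follows from Kotlin's resolution rule that a member function always wins over an extension with the same name and parameter types. The sketch below reproduces it with a hypothetical `ReaderSketch` class whose functions just return marker strings; the compiler accepts the shadowed extension and only emits a warning.

```kotlin
// Hypothetical class standing in for the reader; the member parses text.
class ReaderSketch {
    fun readAll(text: String): String = "parsed as CSV text"
}

// Same name and parameter type as the member: this extension can never be selected.
fun ReaderSketch.readAll(filePath: String): String = "read from file"

// A distinct name removes the ambiguity.
fun ReaderSketch.readAllFromFile(filePath: String): String = "read from file"
```

With this setup, `ReaderSketch().readAll("data.csv")` treats the argument as CSV text rather than a path, which matches the failure mode described in the commit message.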
Trim KDoc on the 22 files added or rewritten in the v2 migration so IDE
hover shows the function's behaviour in 1-3 lines instead of paragraphs.
The lazy-sequence semantics, charset / BOM behaviour, JS streaming
caveat, field-count policies and exception timing are already covered
in Module.md; the per-overload duplication is removed.
Resource-ownership reminders ("sequence must be consumed inside the
block", "stream is caller-owned and not closed") stay on each I/O
overload because Module.md is not surfaced in IDE hover.
The parser package only held ParseStateMachine, which is internal and used solely by SequenceParser. Co-locating it removes the single-file package and improves cohesion.
…e style
In an explicit-escape dialect (escapeChar != quoteChar) the parser only honours escape sequences inside quoted fields. The writer used to emit escape-doubled characters in unquoted fields too, breaking round-trip when a field contained the escape character but no other special char. Include escapeChar in needsCanonicalQuote so such fields are forced into a quoted region. Aligns with Apache Commons CSV / OpenCSV writer behaviour.
Generate randomised rows over the cartesian product of dialect, quoteMode, and trailing-terminator settings. Field characters are weighted ~7:3 toward dialect/control/Unicode-line-separator characters vs full BMP. 500 iterations. Skip a single degenerate case where a final row of one empty field with no trailing terminator yields zero output and cannot round-trip.
Verify that csvReader().readAll(text) over arbitrary input either succeeds or throws only MalformedCsvException-derived exceptions, complementing the existing round-trip property which only feeds the parser writer-emitted text. Extract shared kotest-property generators (fieldChar, dialectArb, etc.) into PbtArbs.kt so both RoundTripPropertyTest and the new crash-freedom test reuse the same input distribution.
The crash-freedom property is meant to cover any `String` accepted by the public API, but the existing generator filtered out the surrogate range so unpaired surrogates were never exercised. Add `surrogateChar` and a 9:1 mix `anyChar` and use it for the crash-freedom text generator. Round-trip inputs keep using `fieldChar` since writing assumes well-formed Unicode.
Switch ParseStateMachineCrashFreedomTest from a manual Arb.take loop to kotest-property's checkAll inside runTest. When the property fails kotest now shrinks the offending text to a minimal reproduction and prints the shrink trace, which a fixed-seed manual loop cannot do. Add kotlinx-coroutines-test to commonTest dependencies for runTest, which is needed because checkAll is suspend.
Adds a property that asserts the CANONICAL writer's quoting decision
is sound: any single field round-trips through writer (CANONICAL,
outputLastLineTerminator=true) and reader. Single-field scope lets
kotest shrink predicate-induced failures down to a minimal field
(verified: an injected `field.contains(',')` failure shrinks to ",").
Issue
#164
Summary
This PR implements the v2 public API redesign on top of the `version_2_0_0` branch.
The change is intentionally breaking within the v2 line: it removes the legacy `client` / `dsl.context` API shape and replaces it with a smaller, immutable, lazy Sequence-based API under `com.jsoizo.kotlincsv`.
Design goals:
Public API Shape
Entry points
- `csvReader { ... }` -> `CsvReader`
- `csvReader(config: CsvReaderConfig)` -> `CsvReader`
- `csvWriter { ... }` -> `CsvWriter`
- `csvWriter(config: CsvWriterConfig)` -> `CsvWriter`
Shared format model
`CsvDialect` represents the CSV format itself and is shared by reader and writer. Reader- and writer-specific behavior lives in their own config types.
Reader model
Reader core has two entry points with the same parsing behavior:
- `read(...)`: lazy API returning `Sequence<List<String>>`
- `readAll(...)`: eager API materializing all rows into `List<List<String>>`
Exceptions are raised during terminal operations such as `toList()` or `forEach()` when using the lazy API.
Writer model
Writer core has the same lazy/eager split:
- `write(...)`: lazy API returning `Sequence<Char>`
- `writeAll(...)`: eager API materializing the encoded CSV as `String`
I/O API Shape
Core APIs do not own files or streams. I/O is provided as a thin outer layer.
For I/O-owned resources, `read(path) { rows -> ... }` keeps lazy sequence consumption inside the resource lifetime:
    reader.read(path) { rows: Sequence<List<String>> -> rows.take(10).toList() }
Eager helpers are also provided:
The difference between `read` and `readAll` is lazy vs eager evaluation, not a different parsing model.
JVM-specific overloads provide `File`, `InputStream`, and `OutputStream` support. Common / JS APIs use `kotlinx-io` `Path`, `Source`, and `Sink`.
Breaking Changes Compared With `origin/version_2_0_0`
Removed
- `client` package API
- `CsvFileReader`
- `CsvReader`
- `CsvWriter`
- `BufferedLineReader`
- `Reader`
- `ICsvFileWriter`
- `dsl.CsvReaderDsl`
- `dsl.CsvWriterDsl`
- `dsl.context.CsvReaderContext`
- `dsl.context.CsvWriterContext`
- `dsl.context.CsvWriteQuoteContext`
- `CsvParser`
- `CSVException`
- `Const`
- `CsvDslMarker`
- `Logger`
- `LoggerNop`
Added
- `CsvDialect`
- `csvReader { ... }`
- `csvReader(config: CsvReaderConfig)`
- `csvWriter { ... }`
- `csvWriter(config: CsvWriterConfig)`
- `CsvReader`
- `CsvReaderConfig`
- `CsvReaderConfigBuilder`
- `InsufficientFieldsRowBehaviour`
- `ExcessFieldsRowBehaviour`
- `CsvHeader`
- `CsvWriter`
- `CsvWriterConfig`
- `CsvWriterConfigBuilder`
- `WriteQuoteMode`
- `README.md` rewrite
- `V2_MIGRATION_GUIDE.md`
- `Module.md`
Migration Sketch
Reader
Before:
After:
For lazy consumption:
    reader.read(path) { rows ->
        rows.forEach { row ->
            // consume inside the block
        }
    }
Writer
Before:
    csvWriter { quote { mode = WriteQuoteMode.ALL } }.writeAll(rows, file)
After:
Public Functions
Top-level DSL entry points (`commonMain`)

| Function | Signature |
| --- | --- |
| `csvReader` | `csvReader(init: CsvReaderConfigBuilder.() -> Unit = {}): CsvReader` |
| `csvReader` | `csvReader(config: CsvReaderConfig): CsvReader` |
| `csvWriter` | `csvWriter(init: CsvWriterConfigBuilder.() -> Unit = {}): CsvWriter` |
| `csvWriter` | `csvWriter(config: CsvWriterConfig): CsvWriter` |

Core (`commonMain`)

| Receiver | Function | Signature |
| --- | --- | --- |
| `CsvReader` | `read` | `read(chars: Sequence<Char>): Sequence<List<String>>` |
| `CsvReader` | `readAll` | `readAll(text: String): List<List<String>>` |
| `CsvWriter` | `write` | `write(rows: Sequence<List<String>>): Sequence<Char>` |
| `CsvWriter` | `writeAll` | `writeAll(rows: List<List<String>>): String` |
| `Sequence<List<String>>` | `withHeader` | `withHeader(...): Sequence<CsvHeader>` |

Reader I/O extensions (`commonMain`)

| Function | Signature |
| --- | --- |
| `CsvReader.read` | `read(source: Source, options, block): T` |
| `CsvReader.readAll` | `readAll(source: Source, options): List<List<String>>` |
| `CsvReader.readFromFile` | `readFromFile(path: Path, options, block): T` |
| `CsvReader.readFromFile` | `readFromFile(filePath: String, options, block): T` |
| `CsvReader.readAllFromFile` | `readAllFromFile(path: Path, options): List<List<String>>` |
| `CsvReader.readAllFromFile` | `readAllFromFile(filePath: String, options): List<List<String>>` |

Writer I/O extensions (`commonMain`)

| Function | Signature |
| --- | --- |
| `CsvWriter.write` | `write(rows: Sequence<List<String>>, sink: Sink, options)` |
| `CsvWriter.write` | `write(rows: List<List<String>>, sink: Sink, options)` |
| `CsvWriter.writeToFile` | `writeToFile(rows: Sequence<List<String>>, path: Path, options)` |
| `CsvWriter.writeToFile` | `writeToFile(rows: List<List<String>>, path: Path, options)` |
| `CsvWriter.writeToFile` | `writeToFile(rows: Sequence<List<String>>, filePath: String, options)` |
| `CsvWriter.writeToFile` | `writeToFile(rows: List<List<String>>, filePath: String, options)` |

JVM-only I/O extensions (`jvmMain`)

| Function | Signature |
| --- | --- |
| `CsvReader.readFromFile` | `readFromFile(file: File, charset, options, block): T` |
| `CsvReader.readAllFromFile` | `readAllFromFile(file: File, charset, options): List<List<String>>` |
| `CsvReader.read` | `read(stream: InputStream, charset, options, block): T` |
| `CsvReader.readAll` | `readAll(stream: InputStream, charset, options): List<List<String>>` |
| `CsvWriter.writeToFile` | `writeToFile(rows: Sequence<List<String>>, file: File, charset, options)` |
| `CsvWriter.writeToFile` | `writeToFile(rows: List<List<String>>, file: File, charset, options)` |
| `CsvWriter.write` | `write(rows: Sequence<List<String>>, stream: OutputStream, charset, options)` |
| `CsvWriter.write` | `write(rows: List<List<String>>, stream: OutputStream, charset, options)` |

Notes:
- `options` is `CsvReadIoOptions` on the reader side and `CsvWriteIoOptions` on the writer side.
- `charset` defaults to `"UTF-8"` in every overload.
- The `block` signature is `(Sequence<List<String>>) -> T`.