feat: parse escape sequences in unquoted fields (closes #168) #174
Merged
…ialects
For dialects where `escapeChar != quoteChar` (e.g. `CsvDialect(escapeChar = '\\')`),
the reader previously treated the escape character literally in the START and
DELIMITER states, and accepted only `escapeChar` (not `quoteChar`) as the
escaped character in the FIELD state. This rejected CSV produced by other
libraries (Python `csv`, Apache Commons CSV, OpenCSV) that emit unquoted
escape sequences such as `a\\b` or `a\"b`.
Introduce a `handleUnquotedEscape` helper and route START / DELIMITER / FIELD
through it so all three states share the same `{escapeChar, quoteChar}`
acceptance set. When `escapeChar == quoteChar` (RFC 4180 default) the helper
degenerates to a single-element accept set, preserving the existing strict
behaviour (e.g. `a"b` under the default dialect still throws).
Closes #168
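The shared accept-set check described in this commit message could be sketched roughly as below. This is a standalone illustration, not the library's code: the `Dialect` class, the helper's signature, and the use of `IllegalArgumentException` are assumptions made for the example.

```kotlin
// Hypothetical stand-in for the library's dialect type (assumption).
data class Dialect(val quoteChar: Char = '"', val escapeChar: Char = '"')

// Sketch of a shared helper. `next` is the character that follows the escape
// character in an unquoted field, or null at end of input. Returns the
// literal character to append to the current field.
fun handleUnquotedEscape(dialect: Dialect, next: Char?): Char {
    // When escapeChar == quoteChar this set collapses to a single element,
    // which keeps the default dialect strict.
    val accepted = setOf(dialect.escapeChar, dialect.quoteChar)
    if (next == null || next !in accepted) {
        throw IllegalArgumentException("invalid escape sequence in unquoted field")
    }
    return next
}
```

Under this sketch, `handleUnquotedEscape(Dialect(escapeChar = '\\'), '"')` returns `"`, while `handleUnquotedEscape(Dialect(), 'b')` throws, mirroring the `a"b` case under the default dialect.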
Add a property-based test that generates arbitrary unquoted-safe fields under `CsvDialect(escapeChar = '\\')`, serialises them with the minimal Python-csv style backslash escaping (`\` -> `\\`, `"` -> `\"`), and asserts the reader round-trips them. 500 iterations, seed 0L, follows the project PBT conventions (checkAll + runTest, generators reused from PbtArbs).
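The minimal backslash escaping used by the property test (`\` -> `\\`, `"` -> `\"`) might look like the sketch below. Both function names are hypothetical, and `unescapeUnquoted` is only a toy inverse for a single field; it stands in for the real reader and does not handle delimiters or line breaks.

```kotlin
// Minimal Python-csv-style escaping for an unquoted field:
// prefix every backslash and double quote with a backslash.
fun escapeUnquoted(field: String): String = buildString {
    for (c in field) {
        if (c == '\\' || c == '"') append('\\')
        append(c)
    }
}

// Toy inverse for illustration only (assumption: single field, no
// delimiters or newlines). Rejects any escape not followed by `\` or `"`.
fun unescapeUnquoted(text: String): String = buildString {
    var i = 0
    while (i < text.length) {
        val c = text[i]
        if (c == '\\') {
            val next = text.getOrNull(i + 1)
                ?: throw IllegalArgumentException("escape at end of input")
            require(next == '\\' || next == '"') { "invalid escape at index $i" }
            append(next)
            i += 2
        } else {
            append(c)
            i += 1
        }
    }
}
```

The round-trip property is then `unescapeUnquoted(escapeUnquoted(s)) == s` for any generated field `s`; in the actual test the second step is the library's reader rather than this toy function.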
b8b3e6c to 3ab4b2a
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## version_2_0_0 #174 +/- ##
=================================================
+ Coverage 94.49% 94.77% +0.27%
=================================================
Files 22 22
Lines 418 421 +3
Branches 93 95 +2
=================================================
+ Hits 395 399 +4
Misses 14 14
+ Partials 9 8 -1
Summary
- … `\\`, `\"`) in unquoted fields under explicit-escape dialects (`escapeChar != quoteChar`, e.g. `CsvDialect(escapeChar = '\\')`), bringing kotlin-csv in line with Python `csv`, Apache Commons CSV, and OpenCSV.
- … `a934753`; this PR completes the reader side and the "writer is strict / reader is lenient" round-trip story.
- … `escapeChar == quoteChar == '"'`) behaviour is unchanged: the helper's `{escapeChar, quoteChar}` accept set degenerates to a single element, so e.g. `a"b` still throws (covered by the existing `SequenceParserTest.escapeChar_inFieldState_invalidEscape_throws`).

Design notes
- Routed all three unquoted states (`START`, `DELIMITER`, `FIELD`) through a single `handleUnquotedEscape` helper so the accept set and error message are uniform. In `START` / `DELIMITER` the `quoteChar` arm is placed before the `escapeChar` arm, so the `escapeChar == quoteChar` case is unreachable there (it is handled by the `QUOTE_START` transition). The change was reviewed in plan form by both an internal planning agent and Codex; both independently flagged an earlier draft that special-cased `escapeChar == quoteChar` inside `FIELD` as breaking the existing default-dialect throw test. This was fixed before implementation.
- Strict semantics retained: escape followed by EOF / `\r` / `\n` / any non-`escapeChar`-or-`quoteChar` character throws `CsvParseFormatException`.

Closes #168.
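The arm ordering mentioned in the design notes can be illustrated with a small sketch. The `Next` enum and `classifyAtStart` function are hypothetical names for this example only, and the delimiter and line-break arms of the real state machine are omitted.

```kotlin
// Hypothetical outcome of reading one character in the START state.
enum class Next { QUOTE_START, ESCAPE, FIELD }

// `when` evaluates its branches top to bottom, so checking quoteChar first
// means the escape arm can only match when escapeChar != quoteChar.
fun classifyAtStart(c: Char, quoteChar: Char, escapeChar: Char): Next = when (c) {
    quoteChar -> Next.QUOTE_START // opening quote of a quoted field
    escapeChar -> Next.ESCAPE     // unquoted escape, explicit-escape dialects only
    else -> Next.FIELD            // ordinary character starts an unquoted field
}
```

With the default dialect, `classifyAtStart('"', '"', '"')` yields `QUOTE_START`, so the escape arm never sees the quote character; with `escapeChar = '\\'`, a leading backslash yields `ESCAPE`.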
Test plan
- `./gradlew check` green on all multiplatform targets (JVM / JS / Linux / macOS / MinGW / wasmWasi).
- `CsvReaderTest`: `x,a\\b` → `[["x","a\\b"]]`, `\\b` → `[["\\b"]]`, escape + `"` at START / DELIMITER / FIELD, escape at EOF (throw), escape + ordinary char (throw), escape + CR / LF (throw), multi-row.
- `UnquotedEscapePropertyTest` (500 iterations, seed 0L) generates unquoted-safe fields under `CsvDialect(escapeChar = '\\')`, serialises with Python-csv-style escaping, and round-trips through the reader.
- … `handleUnquotedEscape` (extra `field.append(escapeChar)`); both the manual test and PBT went red; reverted.
- `SequenceParserTest` (`escapeChar_inFieldState_invalidEscape_throws`, `escapeCharDifferent_invalidEscape_throws`, `escapeCharDifferent_escapeAtEof_throws`, etc.) still pass.