feat: parse escape sequences in unquoted fields (closes #168)#174

Merged
jsoizo merged 2 commits into version_2_0_0 from feat/issue-168-reader-explicit-escape on May 16, 2026
@jsoizo (Owner) commented May 16, 2026

Summary

  • Make the reader interpret escape sequences (\\, \") in unquoted fields under explicit-escape dialects (escapeChar != quoteChar, e.g. CsvDialect(escapeChar = '\\')), bringing kotlin-csv in line with Python csv, Apache Commons CSV, and OpenCSV.
  • The writer-side counterpart (force-quoting fields that contain the escape character) already shipped in a934753; this PR completes the reader side and the "writer is strict / reader is lenient" round-trip story.
  • RFC 4180 default (escapeChar == quoteChar == '"') behaviour is unchanged — the helper's {escapeChar, quoteChar} accept set degenerates to a single element so e.g. a"b still throws (covered by existing SequenceParserTest.escapeChar_inFieldState_invalidEscape_throws).

Design notes

Routed all three unquoted states (START, DELIMITER, FIELD) through a single handleUnquotedEscape helper so the accept set and error message are uniform. In START / DELIMITER the quoteChar arm is placed before the escapeChar arm, so when escapeChar == quoteChar the escapeChar arm is never reached and the character is handled by the QUOTE_START transition instead. The change was reviewed in plan form by both an internal planning agent and Codex; both independently flagged an earlier draft that special-cased escapeChar == quoteChar inside FIELD, which would have broken the existing default-dialect throw test. That was fixed before implementation.

Strict semantics retained: escape followed by EOF / \r / \n / any non-escapeChar-or-quoteChar character throws CsvParseFormatException.
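
The accept-set rule and the strict failure cases described above can be sketched for a single unquoted field as follows. This is a minimal illustration, not the code from this PR: `decodeUnquotedField` is a hypothetical name, the exception class is a local stand-in for the library's `CsvParseFormatException`, and the real parser applies the rule inside its START / DELIMITER / FIELD state machine rather than over a pre-split string.

```kotlin
// Local stand-in for the library's exception type (assumption for this sketch).
class CsvParseFormatException(message: String) : RuntimeException(message)

// Hypothetical helper: decodes one field that is already known to be unquoted.
// Only the escape handling is modelled; every other character is copied as-is.
fun decodeUnquotedField(
    raw: String,
    escapeChar: Char = '\\',
    quoteChar: Char = '"',
): String {
    val field = StringBuilder()
    var i = 0
    while (i < raw.length) {
        val c = raw[i]
        if (c == escapeChar) {
            // Character after the escape, or null at end of input.
            val next = raw.getOrNull(i + 1)
            // Accept set {escapeChar, quoteChar}; it collapses to a single
            // element when escapeChar == quoteChar (RFC 4180 default).
            if (next != null && (next == escapeChar || next == quoteChar)) {
                field.append(next) // keep the escaped character, drop the escape
                i += 2
            } else {
                // EOF, CR, LF, or any other character after the escape is an error.
                throw CsvParseFormatException("Invalid escape at index $i")
            }
        } else {
            field.append(c)
            i += 1
        }
    }
    return field.toString()
}
```

Under these assumptions, the raw text `a\\b` decodes to `a\b` and `a\"b` decodes to `a"b`, while a trailing escape or an escape followed by an ordinary character throws. Passing `escapeChar = '"'` reproduces the default-dialect case where `a"b` is rejected.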

Closes #168.

Test plan

  • ./gradlew check green on all multiplatform targets (JVM / JS / Linux / macOS / MinGW / wasmWasi).
  • New manual cases in CsvReaderTest: x,a\\b → [["x","a\\b"]], \\b → [["\\b"]], escape + " at START / DELIMITER / FIELD, escape at EOF (throw), escape + ordinary char (throw), escape + CR / LF (throw), multi-row.
  • New PBT UnquotedEscapePropertyTest (500 iterations, seed 0L) generates unquoted-safe fields under CsvDialect(escapeChar = '\\'), serialises with Python-csv-style escaping, and round-trips through the reader.
  • Verification ritual: intentionally broke handleUnquotedEscape (extra field.append(escapeChar)); both the manual test and PBT went red; reverted.
  • Existing escape regression tests in SequenceParserTest (escapeChar_inFieldState_invalidEscape_throws, escapeCharDifferent_invalidEscape_throws, escapeCharDifferent_escapeAtEof_throws, etc.) still pass.

jsoizo added 2 commits May 17, 2026 01:20
…ialects

For dialects where `escapeChar != quoteChar` (e.g. `CsvDialect(escapeChar = '\\')`),
the reader previously treated the escape character literally in the START and
DELIMITER states, and accepted only `escapeChar` (not `quoteChar`) as the
escaped character in the FIELD state. This rejected CSV produced by other
libraries (Python `csv`, Apache Commons CSV, OpenCSV) that emit unquoted
escape sequences such as `a\\b` or `a\"b`.

Introduce a `handleUnquotedEscape` helper and route START / DELIMITER / FIELD
through it so all three states share the same `{escapeChar, quoteChar}`
acceptance set. When `escapeChar == quoteChar` (RFC 4180 default) the helper
degenerates to a single-element accept set, preserving the existing strict
behaviour (e.g. `a"b` under the default dialect still throws).

Closes #168

Add a property-based test that generates arbitrary unquoted-safe fields under
`CsvDialect(escapeChar = '\\')`, serialises them with the minimal Python-csv
style backslash escaping (`\` -> `\\`, `"` -> `\"`), and asserts the reader
round-trips them. 500 iterations, seed 0L, follows the project PBT conventions
(checkAll + runTest, generators reused from PbtArbs).
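
The round-trip property described in this commit message can be approximated without the project's test infrastructure. The sketch below is an assumption-laden stand-in: it uses plain `kotlin.random.Random` instead of the `checkAll` + `runTest` setup and the `PbtArbs` generators, and `escapeUnquoted`, `unescapeUnquoted`, `randomUnquotedSafeField`, and `roundTripHolds` are hypothetical names. The unescape function stands in for the library's reader.

```kotlin
import kotlin.random.Random

// Writer side of the sketch: minimal Python-csv-style escaping,
// `\` -> `\\` and `"` -> `\"` with the default arguments.
fun escapeUnquoted(field: String, escapeChar: Char = '\\', quoteChar: Char = '"'): String =
    buildString {
        for (c in field) {
            if (c == escapeChar || c == quoteChar) append(escapeChar)
            append(c)
        }
    }

// Reader-side stand-in (not the library's parser): drops each escape
// character and keeps the escapeChar or quoteChar that follows it.
fun unescapeUnquoted(raw: String, escapeChar: Char = '\\', quoteChar: Char = '"'): String =
    buildString {
        var i = 0
        while (i < raw.length) {
            var c = raw[i++]
            if (c == escapeChar) {
                require(i < raw.length) { "escape at end of input" }
                c = raw[i++]
                require(c == escapeChar || c == quoteChar) { "invalid escape" }
            }
            append(c)
        }
    }

// Generates a short field with no delimiter or line break, so it can stay unquoted.
fun randomUnquotedSafeField(rng: Random, maxLength: Int = 8): String {
    val alphabet = "abcXYZ019 \\\""
    val length = rng.nextInt(maxLength + 1)
    return buildString {
        repeat(length) { append(alphabet[rng.nextInt(alphabet.length)]) }
    }
}

// Returns true when every generated field survives escape followed by unescape.
fun roundTripHolds(iterations: Int = 500, seed: Long = 0L): Boolean {
    val rng = Random(seed)
    repeat(iterations) {
        val field = randomUnquotedSafeField(rng)
        if (unescapeUnquoted(escapeUnquoted(field)) != field) return false
    }
    return true
}
```

A fixed seed keeps any failure reproducible, which is the same reason the commit pins seed 0L. The alphabet deliberately over-represents the backslash and the quote so that most generated fields exercise at least one escape.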
@jsoizo force-pushed the feat/issue-168-reader-explicit-escape branch from b8b3e6c to 3ab4b2a on May 16, 2026 16:20
codecov Bot commented May 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.77%. Comparing base (5d062c9) to head (3ab4b2a).
⚠️ Report is 1 commit behind head on version_2_0_0.

Additional details and impacted files
@@                Coverage Diff                @@
##           version_2_0_0     #174      +/-   ##
=================================================
+ Coverage          94.49%   94.77%   +0.27%     
=================================================
  Files                 22       22              
  Lines                418      421       +3     
  Branches              93       95       +2     
=================================================
+ Hits                 395      399       +4     
  Misses                14       14              
+ Partials               9        8       -1     

@jsoizo jsoizo merged commit 39cf6c6 into version_2_0_0 May 16, 2026
5 checks passed
@jsoizo jsoizo mentioned this pull request May 22, 2026
6 tasks