Skip to content

fix(parser): strip CFWS from obs-local-part and obs-domain spans#32

Merged
polaz merged 7 commits intomainfrom
feat/#13-fixparser-strip-cfws-from-obs-local-part-and-obs-d
Apr 4, 2026
Merged

fix(parser): strip CFWS from obs-local-part and obs-domain spans#32
polaz merged 7 commits intomainfrom
feat/#13-fixparser-strip-cfws-from-obs-local-part-and-obs-d

Conversation

@polaz
Copy link
Copy Markdown
Member

@polaz polaz commented Apr 3, 2026

Summary

  • In Lax mode, skip_cfws() between atoms advanced parser.pos, making spans include CFWS/comment characters in semantic local-part and domain values
  • Now builds a clean String incrementally when CFWS is detected — zero allocation in the common no-CFWS path (Option<String> stays None)
  • Adds local_part_clean / domain_clean optional fields to Parsed with local_part_str() / domain_str() accessors
  • Updates normalize.rs and validate.rs to use CFWS-stripped content

Technical Details

Before: user (comment) . name@example.com → local_part span = user (comment) . name
After: user (comment) . name@example.comlocal_part_str() returns user.name

Key design decisions:

  • Incremental String builder: when CFWS is first detected, the contiguous clean prefix is copied from the raw span, then subsequent atoms are appended directly. No bounds arrays, no capacity limits
  • Prefix-length capacity: String::with_capacity(last_clean_end - outer_start) — sizes to the known prefix, not the remaining input
  • Deferred allocation: had_cfws flag only set after a dot is successfully consumed; trailing CFWS before @/>/EOF is backtracked without allocation
  • Comment span backtracking: parser.comments truncated alongside parser.pos on backtrack
  • No leading CFWS: bare addr-spec rejects leading RFC 5322 comments before the first word

Test Plan

  • 14 regression tests: CFWS between atoms (comments + whitespace), CFWS-after-dot, multiple segments, obs-domain CFWS, zero-copy verification, trailing CFWS preserves zero-copy with comment count assertion, leading comment rejected, end-to-end normalization (case folding + IDNA)
  • All 80 tests pass (66 existing + 14 new)
  • 5 doc-tests pass
  • Clippy clean (zero warnings, all features)

Closes #13

Summary by CodeRabbit

  • Bug Fixes
    • Comments and whitespace within email addresses are now correctly stripped during validation and normalization
    • Length validation constraints are now calculated on semantic content, not raw input with comments
    • Domain literal detection improved for addresses containing whitespace

Copilot AI review requested due to automatic review settings April 3, 2026 17:50
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 3, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9436ce78-4491-4027-bdf5-834866d534ec

📥 Commits

Reviewing files that changed from the base of the PR and between 9106ab6 and 6f9abed.

📒 Files selected for processing (2)
  • src/normalize.rs
  • src/parser.rs

📝 Walkthrough

Walkthrough

Parsed now stores optional CFWS-stripped "clean" strings for local-part and domain; parser populates these during obs-mode parsing. normalize.rs and validate.rs use Parsed::local_part_str() / domain_str() (CFWS-stripped) instead of raw span slices. Tests updated accordingly.

Changes

Cohort / File(s) Summary
Parser core
src/parser.rs
Extend Parsed with local_part_clean: Option<String> and domain_clean: Option<String>; add local_part_str()/domain_str() accessors; parse_local_part/parse_domain and dot-atom helpers return (Span, Option<String>) and allocate clean strings only when CFWS appears between dot segments; adjust parsing/backtracking to avoid duplicate comments; tests updated.
Normalization
src/normalize.rs
Switch normalization and quoted-local handling to use parsed.local_part_str() / parsed.domain_str() instead of raw span slices; update tests/fixtures to reflect local_part_clean: None / domain_clean: None where applicable.
Validation
src/validate.rs
Use parsed.local_part_str() / parsed.domain_str() for RFC 5321 length checks, total address length, and domain-literal detection (CFWS-stripped semantic content) instead of raw span slices; core validation logic otherwise preserved.
Tests
tests/..., src/..._tests
Add regression tests asserting CFWS/comment stripping before normalization/IDNA/case-folding and verifying allocation-elision (*_clean.is_none() when no CFWS); update existing tests to match expanded Parsed shape.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Parser as Parser (src/parser.rs)
    participant Parsed as Parsed (struct)
    participant Normalizer as normalize.rs
    participant Validator as validate.rs

    Client->>Parser: parse(input)
    Parser-->>Parsed: Parsed{ local_part_span, domain_span, local_part_clean?, domain_clean? }
    Client->>Normalizer: normalize(parsed)
    Normalizer->>Parsed: local_part_str(), domain_str()
    Normalizer-->>Client: normalized address
    Client->>Validator: validate(parsed or normalized)
    Validator->>Parsed: local_part_str(), domain_str()
    Validator-->>Client: validation result
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐰
I munched the comments, slewed the noise away,
Rejoined the atoms, let the bytes all play,
Clean hops for local, tidy bounds for domain,
I twitched my whiskers — parsing's neat again!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the primary change: fixing CFWS stripping from obs-local-part and obs-domain spans, which is the main focus of the PR.
Linked Issues check ✅ Passed The PR successfully addresses all requirements from issue #13: fixes the CFWS inclusion bug, preserves zero-copy behavior with deferred allocation, extends Parsed with clean fields and accessors, and handles backtracking/trailing CFWS correctly.
Out of Scope Changes check ✅ Passed All changes are tightly scoped to fixing the CFWS parsing bug: parser.rs implements CFWS stripping, normalize.rs and validate.rs are updated to use cleaned content, and test additions directly verify the fix.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/#13-fixparser-strip-cfws-from-obs-local-part-and-obs-d

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 3, 2026

Codecov Report

❌ Patch coverage is 99.50980% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/parser.rs 99.41% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a Lax-mode parsing bug where CFWS/comment text between atoms/labels could be included in the semantic local-part/domain content, which then impacted normalization and validation.

Changes:

  • Extend Parsed to optionally carry CFWS-stripped local_part_clean / domain_clean, plus local_part_str() / domain_str() helpers.
  • Update obs-local-part and obs-domain parsing to collect per-atom/label spans and reconstruct clean content when CFWS is present.
  • Switch normalization and validation to operate on CFWS-stripped content and add regression tests for the affected cases.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
src/parser.rs Adds optional clean fields + helpers; reconstructs CFWS-stripped local-part/domain in obs-mode; adds regression tests.
src/normalize.rs Uses local_part_str() / domain_str() so normalization operates on CFWS-stripped semantics.
src/validate.rs Uses local_part_str() / domain_str() for length checks and domain handling.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/validate.rs`:
- Around line 26-29: The variables raw_local/raw_domain and the comment about
"RAW addr-spec" are misleading because parsed.local_part_str() and
parsed.domain_str() return CFWS-stripped semantic content; update the comment to
say length limits apply to the CFWS-stripped addr-spec content and rename
raw_local/raw_domain to local/domain (and update all references of
raw_local/raw_domain at the places noted) so the names reflect that they contain
cleaned semantic local and domain strings (refer to local_part_str(),
domain_str(), and the variable uses on the lines indicated).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: bf216156-deb6-4d1f-808a-b877d8e3b42d

📥 Commits

Reviewing files that changed from the base of the PR and between 146dbdf and 2d78302.

📒 Files selected for processing (3)
  • src/normalize.rs
  • src/parser.rs
  • src/validate.rs

polaz added 2 commits April 3, 2026 21:04
In Lax mode, skip_cfws() between atoms advanced parser.pos, making
the resulting Span include CFWS/comment characters in semantic values.
For example, `user (comment) . name@example.com` produced local_part
span `user (comment) . name` instead of `user.name`.

- Collect atom/label spans during obs-mode parsing and reconstruct
  clean content by joining with dots
- Add `local_part_clean` / `domain_clean` optional fields to Parsed
  (None when no CFWS present — preserves zero-copy path)
- Add `local_part_str()` / `domain_str()` helpers that return the
  clean version when available, raw span otherwise
- Update normalize.rs and validate.rs to use helpers
- Add 8 regression tests covering comments, whitespace, multiple
  segments, leading CFWS, and zero-copy verification

Closes #13
Trailing CFWS before @/>/EOF was incorrectly marking had_cfws=true
even though backtracking excluded it from the span. This caused
unnecessary allocation and loss of the zero-copy path.

- Only set had_cfws after '.' is successfully consumed
- Rename raw_local/raw_domain to local/domain_str in validate.rs
  to reflect CFWS-stripped semantic content
- Add 2 tests verifying trailing CFWS preserves zero-copy path

Closes #13
@polaz polaz force-pushed the feat/#13-fixparser-strip-cfws-from-obs-local-part-and-obs-d branch from 2d78302 to ee2b888 Compare April 3, 2026 18:07
@polaz polaz requested a review from Copilot April 3, 2026 18:14
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/parser.rs`:
- Around line 353-361: skip_cfws() may append tentative comment spans into
parser.comments but the code only restores parser.pos on backtrack
(parser.restore(save)), leaving duplicate entries; update the backtrack logic in
parse_dot_atom_domain() so you save the current parser.comments.len() (or
equivalent) before calling skip_cfws(), and when you restore after a failed '.'
check (and in the analogous block around lines 521-527) truncate parser.comments
back to that saved length in addition to restoring parser.pos via
parser.restore(save); use the existing parser.save()/parser.restore(save)
pattern and the parser.comments collection to locate and implement the
truncation.
- Around line 339-349: The code currently calls skip_cfws(parser, 0) before
parsing the first local-part word, which incorrectly allows leading CFWS in bare
addr-spec; remove that initial skip_cfws call and only allow CFWS between words.
Concretely, in the local-part parsing logic around eat_atext_run and
try_quoted_string (where atoms is pushed and ErrorKind::EmptyLocalPart is
returned on failure), do not call skip_cfws(parser, 0) before the first
atom_start; instead only invoke skip_cfws when atoms is non-empty (i.e., before
consuming a separator or the next word) so leading CFWS/comments are rejected
for bare addr-spec while still permitting CFWS between word segments.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f226571e-74de-4869-9acb-6e4eb41b8b4f

📥 Commits

Reviewing files that changed from the base of the PR and between 2d78302 and ee2b888.

📒 Files selected for processing (3)
  • src/normalize.rs
  • src/parser.rs
  • src/validate.rs

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

…cktrack state

- Remove skip_cfws before first obs-local-part atom: bare addr-spec
  does not permit CFWS before the local-part, only between segments
- Replace Vec<&str> with stack-allocated [(usize, usize)] arrays in
  both parse_dot_atom_local and parse_dot_atom_domain — zero heap
  allocation when no CFWS is present (the common Lax-mode path)
- Save and truncate parser.comments on backtrack when trailing CFWS
  before @/>/EOF has no following dot — prevents duplicate spans
  from tentative skip_cfws calls

Closes #13
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/parser.rs`:
- Around line 338-396: The parser currently uses a fixed-size inline buffer
(MAX_ATOMS, bounds, count) and silently ignores atoms once count reaches
MAX_ATOMS, which causes reconstructed cleaned values to drop trailing labels
when had_cfws is true; change the logic in the local-part/domain parsing
function (the code around MAX_ATOMS, bounds, count, had_cfws, eat_atext_run,
try_quoted_string and the reconstruction that builds clean) to either (a) on
first overflow allocate a Vec<(usize,usize)> on the heap, copy existing bounds
into it and continue pushing new atom spans into that Vec, or (b) return an
explicit parse error instead of continuing; ensure downstream calls to
local_part_str()/domain_str() use the full heap-backed spans when used and
remove any reconstruction that uses only bounds[..count] when overflow occurred.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: fa309430-dec6-493b-81d8-ff7d8527dbf5

📥 Commits

Reviewing files that changed from the base of the PR and between ee2b888 and 0dc3db4.

📒 Files selected for processing (1)
  • src/parser.rs

… builder

Removes the bounded [(usize, usize)] arrays that silently dropped
atoms/labels beyond MAX_ATOMS/MAX_LABELS. Now builds the clean
String incrementally: zero allocation when no CFWS is present
(Option<String> stays None), and when CFWS is first detected the
contiguous prefix is copied from the raw span then subsequent atoms
are appended directly. No capacity limits, no silent truncation.

Closes #13
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/parser.rs`:
- Around line 373-385: The current logic seeds the clean buffer with
parser.input[outer_start..last_clean_end + 1], which includes the '.' separator
and then later unconditionally appends another '.', producing double dots when
CFWS appears after the dot; update the seeding so that when initializing clean
(in the block that checks clean.is_none() && parser.pos > last_clean_end + 1)
you copy only up to last_clean_end (exclude the trailing dot) so that the later
s.push('.') in the atom-parsing branch yields a single separator; locate this
behavior around the variables/identifiers clean, outer_start, last_clean_end,
parser.pos, atom_start and the calls
eat_atext_run/try_quoted_string/ErrorKind::EmptyLocalPart and adjust the slice
bounds accordingly (the same fix should be applied to the analogous block at
lines 533-543).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3bb641f3-18a8-4b06-8681-413d3b4d2c5c

📥 Commits

Reviewing files that changed from the base of the PR and between 0dc3db4 and 9106ab6.

📒 Files selected for processing (1)
  • src/parser.rs

When CFWS appeared only after a dot (e.g. "user. name"), the clean
buffer was seeded with the prefix including the dot, then push('.')
appended another — producing "user..name". Seed now copies only up
to last_clean_end (before the dot), so push('.') yields a single
separator. Applies to both obs-local-part and obs-domain.

Closes #13
@polaz
Copy link
Copy Markdown
Member Author

polaz commented Apr 3, 2026

@coderabbitai re-review

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 3, 2026

@polaz Sure, I'll re-review the PR now!


🧠 Learnings used
Learnt from: polaz
Repo: structured-world/structured-email-address PR: 25
File: src/parser.rs:961-980
Timestamp: 2026-04-03T15:26:03.810Z
Learning: In `src/parser.rs` (structured-email-address crate, Rust), leading CFWS/comments before the local-part in bare `addr-spec` form (e.g. `(comment)userexample.com`) are NOT supported in any strictness mode. RFC 5322 §3.4.1 addr-spec grammar only allows CFWS around `@` and after the domain. Leading CFWS is only valid inside the angle-bracket `name-addr` form. Do not suggest tests or coverage for leading CFWS acceptance in bare addr-spec Standard mode.

Learnt from: polaz
Repo: structured-world/structured-email-address PR: 11
File: src/normalize.rs:31-38
Timestamp: 2026-03-25T12:34:33.881Z
Learning: In `src/normalize.rs` (structured-email-address crate, Rust), the `is_quoted` heuristic `raw_local.starts_with('"') && raw_local.ends_with('"')` is intentionally used instead of an explicit flag on `Parsed`. It is sound for the current grammar: Standard quoted-string occupies the entire local-part span (both ends carry `"`), and Lax obs-local-part forms like `"a".b` don't end with `"`. Adding an `is_quoted_local` flag to `Parsed` was deliberately rejected as unnecessary complexity for a speculative future concern.

Learnt from: polaz
Repo: structured-world/structured-email-address PR: 11
File: src/parser.rs:216-246
Timestamp: 2026-03-25T12:34:34.819Z
Learning: In `src/parser.rs` (structured-email-address crate, Rust), the unquoted display name parsing loop in `try_parse_display_name` intentionally accepts characters beyond RFC 5322's `phrase` production (e.g., `"`, `\`, `(`, `)`, etc.). This is a deliberate design decision for real-world tolerance, matching how major email clients handle display names. Do not flag this as a violation of RFC 5322 `phrase` grammar.
✅ Actions performed

Full review triggered.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/normalize.rs (1)

32-45: 🧹 Nitpick | 🔵 Trivial

Add one end-to-end normalization test for obs-form CFWS.

raw_local and raw_domain now come from the cleaned accessors, but the tests in this file still only cover already-clean addr-specs. A parse_and_normalize() regression with CFWS between local atoms and domain labels would lock in that quote detection, NFC, and provider rules run on the stripped value.

Based on learnings: All normalization paths must have tests for edge cases (Unicode NFC, IDNA, confusables).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/normalize.rs` around lines 32 - 45, Tests currently only cover
already-clean addr-specs and miss obs-form CFWS cases; add an end-to-end test
that exercises parse_and_normalize() with CFWS (comments/folding) between local
atoms and between domain labels to ensure raw_local/raw_domain obtained via
parsed.local_part_str()/parsed.domain_str() are the cleaned values used for
quote detection (is_quoted), unescape_quoted_string, NFC normalization
(nfc_local/nfc_domain), and provider rules. The test should pass an addr-spec
containing CFWS around atoms/labels and assert the normalized output equals the
expected cleaned, NFC-normalized, IDNA/filtered result so regressions where
quote detection or NFC runs on unstripped input are caught.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/parser.rs`:
- Around line 1187-1206: The tests
obs_trailing_cfws_before_at_preserves_zero_copy and
obs_trailing_cfws_after_domain_preserves_zero_copy only assert zero-copy but not
that comments weren't duplicated; update each test to also assert the
Parsed.comments collection contains the expected single entry (e.g.,
assert_eq!(p.comments.len(), 1) or assert the exact span/value) to ensure
skip_cfws() didn't record the same comment twice — locate these assertions near
the existing p.local_part_clean/p.domain_clean checks and reference
Parsed.comments and the tests named
obs_trailing_cfws_before_at_preserves_zero_copy and
obs_trailing_cfws_after_domain_preserves_zero_copy.
- Around line 341-343: The parser still accepts leading WSP because the code
trims the input earlier (see the input.trim() call) so the bare addr-spec branch
(the guard using eat_atext_run(parser) and try_quoted_string(parser)) never sees
leading whitespace; to fix, stop trimming the left side (remove or change the
earlier input.trim() to only trim trailing WSP) or explicitly reject leading
WSP/CFWS before entering the bare addr-spec logic: ensure the pre-parse
normalization does not remove leading whitespace and/or add a check before the
eat_atext_run(parser) / try_quoted_string(parser) branch that fails on any
leading WSP or leading CFWS/comment, and apply the same change at the other
affected region (the block around lines 1157-1169) so bare addr-specs never
accept leading CFWS/comments.

---

Outside diff comments:
In `@src/normalize.rs`:
- Around line 32-45: Tests currently only cover already-clean addr-specs and
miss obs-form CFWS cases; add an end-to-end test that exercises
parse_and_normalize() with CFWS (comments/folding) between local atoms and
between domain labels to ensure raw_local/raw_domain obtained via
parsed.local_part_str()/parsed.domain_str() are the cleaned values used for
quote detection (is_quoted), unescape_quoted_string, NFC normalization
(nfc_local/nfc_domain), and provider rules. The test should pass an addr-spec
containing CFWS around atoms/labels and assert the normalized output equals the
expected cleaned, NFC-normalized, IDNA/filtered result so regressions where
quote detection or NFC runs on unstripped input are caught.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 07e5181f-1c05-4554-95af-9a3e6624c6f4

📥 Commits

Reviewing files that changed from the base of the PR and between 146dbdf and 1c0d641.

📒 Files selected for processing (3)
  • src/normalize.rs
  • src/parser.rs
  • src/validate.rs

- Use span-based String::with_capacity instead of fixed 64 for
  clean buffer seeding in both obs-local-part and obs-domain
- Rename raw_local/raw_domain to local/domain_str in normalize.rs
  to reflect CFWS-stripped semantic content
- Clarify leading-comment test wording: trim() handles WSP,
  parser rejects comments
- Assert comments.len() in backtrack tests to catch duplicates
- Add obs-form end-to-end normalization test (CFWS + lowercase)

Closes #13
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/parser.rs (1)

1186-1207: 🧹 Nitpick | 🔵 Trivial

Consider asserting comments.len() to strengthen backtrack tests.

These tests verify zero-copy behavior via *_clean.is_none(), but they don't assert that the comment deduplication (via comments.truncate() on backtrack) is working. Adding assert_eq!(p.comments.len(), 1) to one local-part and one domain test would catch regressions where tentative skip_cfws() re-records the same comment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/parser.rs` around lines 1186 - 1207, The tests
obs_trailing_cfws_before_at_preserves_zero_copy and
obs_trailing_cfws_after_domain_preserves_zero_copy check zero-copy via
local_part_clean/domain_clean but don’t verify comment deduplication; update
each test to also assert that p.comments.len() == 1 to ensure skip_cfws()
backtracking and comments.truncate() correctly avoid re-recording the same
comment during tentative CFWS parsing (i.e., add an assert_eq!(p.comments.len(),
1) in each test after obtaining p).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/parser.rs`:
- Around line 1186-1207: The tests
obs_trailing_cfws_before_at_preserves_zero_copy and
obs_trailing_cfws_after_domain_preserves_zero_copy check zero-copy via
local_part_clean/domain_clean but don’t verify comment deduplication; update
each test to also assert that p.comments.len() == 1 to ensure skip_cfws()
backtracking and comments.truncate() correctly avoid re-recording the same
comment during tentative CFWS parsing (i.e., add an assert_eq!(p.comments.len(),
1) in each test after obtaining p).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 62f14757-1bfb-4cd4-bd69-a4e6343d4eba

📥 Commits

Reviewing files that changed from the base of the PR and between 9106ab6 and 1c0d641.

📒 Files selected for processing (1)
  • src/parser.rs

@polaz
Copy link
Copy Markdown
Member Author

polaz commented Apr 3, 2026

@coderabbitai Re: obs-form normalization test — added obs_cfws_stripped_before_normalization in normalize.rs that verifies CFWS-stripped content flows through NFC + case folding + IDNA correctly.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 3, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

Let me take a look at the new test to verify it covers all the required edge cases.


🏁 Script executed:

#!/bin/bash
# Find and display the obs_cfws_stripped_before_normalization test
rg -n "obs_cfws_stripped_before_normalization" --type rust -A 60

Length of output: 0


Let me try fetching the file directly from the PR branch.


🏁 Script executed:

#!/bin/bash
# Get the PR branch name and fetch normalize.rs from it
gh pr view 32 --repo structured-world/structured-email-address --json headRefName -q '.headRefName'

Length of output: 59


🏁 Script executed:

#!/bin/bash
# Fetch the normalize.rs from the PR branch
gh api repos/structured-world/structured-email-address/contents/src/normalize.rs \
  --jq '.content' \
  -f ref='feat/#13-fixparser-strip-cfws-from-obs-local-part-and-obs-d' \
  | base64 -d

Length of output: 46


🏁 Script executed:

#!/bin/bash
# Get the latest commit SHA on the PR head
SHA=$(gh api repos/structured-world/structured-email-address/pulls/32 --jq '.head.sha')
echo "SHA: $SHA"

# Fetch normalize.rs using the commit SHA
gh api "repos/structured-world/structured-email-address/contents/src/normalize.rs?ref=$SHA" \
  --jq '.content' | base64 -d

Length of output: 13430


@polaz Thanks for adding the test — CFWS stripping and case folding are clearly covered. However, looking at the actual test data, there are two gaps between what the test comment claims and what the data exercises:

  1. NFC not exercised — the input is pure ASCII (User (comment) . Name@...), so the .nfc().collect() call in normalize() is a no-op. To actually exercise the NFC path through the local_part_str() branch, the local-part would need a decomposed Unicode sequence that NFC would modify (e.g., a letter + combining diacritic that normalizes to a single precomposed character).

  2. IDNA not exercised — the domain is plain Example (c) . COM, which normalizes to example.com. The idna::domain_to_ascii_strict call is the fast path for ASCII-only domains. To exercise the IDNA path you'd need something like Üser (comment) . name@München (c) . DE, which would trigger punycode encoding.

The learning stored for this repo explicitly requires:

All normalization paths must have tests for edge cases (Unicode NFC, IDNA, confusables)

Suggested extension to make the test fully match its description:

#[test]
fn obs_cfws_stripped_before_normalization() {
    let config = Config::builder()
        .strictness(crate::Strictness::Lax)
        .lowercase_all()
        .build();

    // ASCII path: CFWS stripped + case folded.
    let n = parse_and_normalize("User (comment) . Name@Example (c) . COM", &config);
    assert_eq!(n.local_part, "user.name", "CFWS stripped + lowercased");
    assert_eq!(n.domain, "example.com", "domain CFWS stripped + lowercased");

    // IDNA path: obs-domain with CFWS around an IDN label.
    let n = parse_and_normalize("user@M\u{00FC}nchen (c) . DE", &config);
    assert_eq!(n.domain, "xn--mnchen-3ya.de", "IDN domain CFWS stripped + punycode encoded");
    assert_eq!(n.domain_unicode.as_deref(), Some("münchen.de"));

    // NFC path: obs-local with a decomposed Unicode character (e + combining acute → é).
    // 'e' + U+0301 (combining acute) should NFC-normalize to U+00E9 'é'.
    let n = parse_and_normalize("use\u{0301}r (c) . name@example.com", &config);
    assert_eq!(n.local_part, "usér.name", "obs-local NFC-normalized after CFWS strip");
}

The ASCII sub-case can stay as-is — just add the IDN and NFC sub-cases to make the test actually validate what its comment claims.


🧠 Learnings used
Learnt from: polaz
Repo: structured-world/structured-email-address PR: 25
File: src/parser.rs:961-980
Timestamp: 2026-04-03T15:26:03.810Z
Learning: In `src/parser.rs` (structured-email-address crate, Rust), leading CFWS/comments before the local-part in bare `addr-spec` form (e.g. `(comment)userexample.com`) are NOT supported in any strictness mode. RFC 5322 §3.4.1 addr-spec grammar only allows CFWS around `@` and after the domain. Leading CFWS is only valid inside the angle-bracket `name-addr` form. Do not suggest tests or coverage for leading CFWS acceptance in bare addr-spec Standard mode.

Learnt from: polaz
Repo: structured-world/structured-email-address PR: 11
File: src/parser.rs:216-246
Timestamp: 2026-03-25T12:34:34.819Z
Learning: In `src/parser.rs` (structured-email-address crate, Rust), the unquoted display name parsing loop in `try_parse_display_name` intentionally accepts characters beyond RFC 5322's `phrase` production (e.g., `"`, `\`, `(`, `)`, etc.). This is a deliberate design decision for real-world tolerance, matching how major email clients handle display names. Do not flag this as a violation of RFC 5322 `phrase` grammar.

Learnt from: polaz
Repo: structured-world/structured-email-address PR: 11
File: src/normalize.rs:31-38
Timestamp: 2026-03-25T12:34:33.881Z
Learning: In `src/normalize.rs` (structured-email-address crate, Rust), the `is_quoted` heuristic `raw_local.starts_with('"') && raw_local.ends_with('"')` is intentionally used instead of an explicit flag on `Parsed`. It is sound for the current grammar: Standard quoted-string occupies the entire local-part span (both ends carry `"`), and Lax obs-local-part forms like `"a".b` don't end with `"`. Adding an `is_quoted_local` flag to `Parsed` was deliberately rejected as unnecessary complexity for a speculative future concern.

Learnt from: CR
Repo: structured-world/structured-email-address PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-03-25T15:33:47.930Z
Learning: Applies to **/*test*.rs : All normalization paths must have tests for edge cases (Unicode NFC, IDNA, confusables)

Learnt from: polaz
Repo: structured-world/structured-email-address PR: 11
File: src/normalize.rs:42-46
Timestamp: 2026-03-25T09:43:07.858Z
Learning: In `src/normalize.rs` (structured-email-address crate, Rust), `CasePolicy::Preserve` preserves original case only for the **local part**. The domain is **always** lowercased unconditionally, which is correct per RFC 5321 §2.4 (domain is case-insensitive). This is intentional behavior, not a bug. The `CasePolicy` variants (`All`, `Domain`, `Preserve`) only differ in how they handle the local-part case.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/normalize.rs`:
- Around line 360-371: The test obs_cfws_stripped_before_normalization currently
claims to verify NFC, case folding, and IDNA encoding but uses ASCII-only input
so NFC and IDNA paths aren't exercised; update the test either by changing the
comment to accurately reflect that it only checks CFWS stripping and
lowercasing, or extend it to include additional sub-cases that exercise NFC and
IDNA (e.g., call parse_and_normalize with a local part containing a decomposed é
like "cafe\u0301" to force .nfc() behavior and a domain with non-ASCII chars
like "münchen.de" to exercise domain_to_ascii_strict/punycode conversion),
keeping references to the existing test function
obs_cfws_stripped_before_normalization and helper parse_and_normalize so
reviewers can find and modify the right test.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 74c39307-f973-4751-9209-d3351a39e88a

📥 Commits

Reviewing files that changed from the base of the PR and between 9106ab6 and d24ad5a.

📒 Files selected for processing (2)
  • src/normalize.rs
  • src/parser.rs

- Size String::with_capacity to known clean prefix length instead of
  remaining input length — avoids over-allocation for long inputs
- Fix test wording: obs_cfws_stripped_before_normalization only tests
  case folding (ASCII input), not NFC/IDNA
- Add obs_cfws_stripped_with_idna test exercising CFWS stripping
  through IDNA punycode encoding (münchen → xn--mnchen-3ya)

Closes #13
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

@polaz polaz merged commit c2cb447 into main Apr 4, 2026
15 checks passed
@polaz polaz deleted the feat/#13-fixparser-strip-cfws-from-obs-local-part-and-obs-d branch April 4, 2026 06:14
@sw-release-bot sw-release-bot bot mentioned this pull request Apr 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(parser): strip CFWS from obs-local-part and obs-domain spans

2 participants