Cache label identifier data to eliminate redundant parsing by jordanpadams · Pull Request #1607 · NASA-PDS/validate

jordanpadams · 2026-05-19T06:59:19Z

Summary

Resolves #1568

Add LabelCacheEntry POJO to hold pre-extracted identifier data (logical IDs, lid/lidvid refs, context area refs) from a parsed label
After each label is parsed in LabelValidationRule, cache identifiers (with \n detection enabled) into ReferentialIntegrityUtil's labelIdentifierCache
additionalReferentialIntegrityChecks() now uses cached logicalIdentifiers and lidOrLidVidReferences when available — no disk re-parse for the common case; fallback parse retained for labels not in the initial validation pass
CrossLabelFileAreaReferenceChecker.add() uses cached logicalIdentifiers when available — no disk re-parse for file area reference checks
collectAllContextReferences() uses cached context area refs to skip three Saxon XPath evaluations per label — fallback to fresh parse when no cache entry exists
\n detection (INVALID_FIELD_VALUE) happens once during cacheIdentifiers() (with reportCarriageReturns=true), so the referential integrity phase can safely use cached values without risking double-reporting or missed errors
Fix CrossLabelFileAreaReferenceChecker.reset() to clear the isObservational map alongside knownRefs, preventing static state from leaking across validation runs
Call CrossLabelFileAreaReferenceChecker.reset() from ValidateLauncher alongside ReferentialIntegrityUtil.reset()
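
The reset() fix in the last two items can be illustrated with a minimal sketch. It is not the project's code: the class name `FileAreaCheckerStateSketch`, the `add` signature, and the map key and value types are assumptions made for illustration. Only the map names `knownRefs` and `isObservational` come from the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a checker that keeps static state between calls.
final class FileAreaCheckerStateSketch {
  // Map names follow the PR text; the key and value types are assumptions.
  private static final Map<String, String> knownRefs = new HashMap<>();
  private static final Map<String, Boolean> isObservational = new HashMap<>();

  static void add(String fileRef, String lid, boolean observational) {
    knownRefs.put(fileRef, lid);
    isObservational.put(lid, observational);
  }

  static int trackedEntries() {
    return knownRefs.size() + isObservational.size();
  }

  // Clearing only knownRefs would leave isObservational populated for the next run.
  static void reset() {
    knownRefs.clear();
    isObservational.clear();
  }

  public static void main(String[] args) {
    add("data/image.img", "urn:nasa:pds:example:data:image", true);
    reset();
    System.out.println(trackedEntries());
  }
}
```

A launcher that validates several targets in one JVM would call `reset()` before each run, which is the role the description gives to ValidateLauncher.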

Test plan

NASA-PDS/validate#15-2 passes (\n in logical_identifier — 1 INVALID_FIELD_VALUE reported, no double-reporting)
NASA-PDS/validate#401-1 passes (\n in lid_reference — 3 INVALID_FIELD_VALUE detected from cache, no re-parse needed)
Full Cucumber suite: 297/297 scenarios pass

🤖 Generated with Claude Code

After each label is parsed by pds4-jparser, extract and cache the logical identifiers, lid/lidvid references, and context area references into a LabelCacheEntry.

In additionalReferentialIntegrityChecks(), use cached context area refs to skip three expensive Saxon XPath evaluations per label instead of re-running them against a freshly-reparsed DOM.

Main identifiers (logicalIdentifiers, lidOrLidVidReferences) still re-parse from disk in additionalReferentialIntegrityChecks() to correctly detect and report INVALID_FIELD_VALUE for identifier values containing newlines: pds4-jparser normalizes newlines away, so the cached values cannot be used for that check.

Also fixes CrossLabelFileAreaReferenceChecker.reset() to clear the isObservational map alongside knownRefs, preventing static state from leaking across validation runs.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

…ntegrity phase

- LabelValidationRule.cacheIdentifiers() now reports \n errors (reportCarriageReturns=true) so the referential integrity phase can safely use cached identifiers without re-parsing
- CrossLabelFileAreaReferenceChecker.add() uses cached logicalIdentifiers when available, falling back to disk parse only for labels not in the initial validation pass
- ReferentialIntegrityUtil.additionalReferentialIntegrityChecks() uses cached logicalIdentifiers and lidOrLidVidReferences when available, eliminating all disk re-parsing for the common case; fallback parse retained for uncached labels

All 297 tests pass. Resolves the full acceptance criteria for #1568.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

jordanpadams · 2026-05-19T10:59:15Z

Acceptance Criteria Verification

From issue #1568:

Given a bundle with many product labels
When I perform validation including referential integrity checks
Then I expect each label file is read and parsed from disk only once

How each label is now parsed exactly once

Phase 1 — Initial label validation (LabelValidationRule.validateLabel(), line 319):

validator.parseAndValidate(processor, target)   ← one disk read + one DOM parse per label
  └─ cacheIdentifiers(document, targetUrl)       ← extracts LIDs, LIDVIDs, context refs
       └─ ReferentialIntegrityUtil.cacheLabelIdentifiers(url, entry)

The DOM is built once from disk. cacheIdentifiers() (line 345) runs against the already-in-memory Document, extracting:

logicalIdentifiers (LIDs/LIDVIDs registered by the product)
lidOrLidVidReferences (all lid_reference/lidvid_reference values)
contextAreaRefs (Investigation Area, Observation System Component, Target Identification refs)
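
A minimal sketch of how the three extracted collections could be held and looked up follows. It is an illustration under stated assumptions, not the PR's implementation: the `...Sketch` class names, the `List<String>` field types, and the use of a `String` key are all hypothetical. The accessor and method names mirror those mentioned in this comment.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical immutable holder for identifier data extracted from one parsed label.
final class LabelCacheEntrySketch {
  private final List<String> logicalIdentifiers;
  private final List<String> lidOrLidVidReferences;
  private final List<String> contextAreaRefs;

  LabelCacheEntrySketch(List<String> logicalIdentifiers,
      List<String> lidOrLidVidReferences, List<String> contextAreaRefs) {
    // Defensive copies so later changes to the caller's lists cannot alter the cache.
    this.logicalIdentifiers = List.copyOf(logicalIdentifiers);
    this.lidOrLidVidReferences = List.copyOf(lidOrLidVidReferences);
    this.contextAreaRefs = List.copyOf(contextAreaRefs);
  }

  List<String> getLogicalIdentifiers() { return logicalIdentifiers; }
  List<String> getLidOrLidVidReferences() { return lidOrLidVidReferences; }
  List<String> getContextAreaRefs() { return contextAreaRefs; }
}

// Hypothetical cache keyed by the label URL in string form (an assumption).
final class LabelIdentifierCacheSketch {
  private static final Map<String, LabelCacheEntrySketch> CACHE = new ConcurrentHashMap<>();

  static void cacheLabelIdentifiers(String labelUrl, LabelCacheEntrySketch entry) {
    CACHE.put(labelUrl, entry);
  }

  // Returns null on a miss so the caller can decide whether to fall back to a parse.
  static LabelCacheEntrySketch getCachedLabelIdentifiers(String labelUrl) {
    return CACHE.get(labelUrl);
  }

  static void reset() {
    CACHE.clear();
  }
}
```

The sketch keys the map by string because `java.net.URL.equals` can trigger host name resolution; whether the real cache uses `URL` keys is not stated here.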

\n detection (INVALID_FIELD_VALUE) is also performed here (once) with reportCarriageReturns=true.
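
The idea of detecting line breaks once, at extraction time, can be sketched with the JDK's DOM API. This is an assumption-laden illustration: the class `IdentifierExtractionSketch`, its fields, and the message format are hypothetical, and the real code works on the Document produced by pds4-jparser rather than a plain `DocumentBuilder`.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical one-pass extraction: collect values and flag line breaks at the same time.
final class IdentifierExtractionSketch {
  final List<String> values = new ArrayList<>();
  final List<String> problems = new ArrayList<>();

  static IdentifierExtractionSketch extract(Document doc, String tagName,
      boolean reportLineBreaks) {
    IdentifierExtractionSketch result = new IdentifierExtractionSketch();
    NodeList nodes = doc.getElementsByTagName(tagName);
    for (int i = 0; i < nodes.getLength(); i++) {
      String raw = nodes.item(i).getTextContent();
      // Inspect the raw text before trimming; afterwards the line break is gone.
      if (reportLineBreaks && (raw.indexOf('\n') >= 0 || raw.indexOf('\r') >= 0)) {
        result.problems.add("INVALID_FIELD_VALUE: " + tagName + " contains a line break");
      }
      result.values.add(raw.trim());
    }
    return result;
  }

  // Helper for the sketch only: parse an XML string into an in-memory Document.
  static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      throw new IllegalStateException("could not parse sketch input", e);
    }
  }
}
```

Because the problem is recorded while the values are cached, a later phase that reads the cached, trimmed values has nothing left to report, which is how a single report per occurrence could be guaranteed.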

Phase 2 — Referential integrity checks: three former re-parse sites, all now cache-first:

| Former re-parse site | Now |
| --- | --- |
| `ReferentialIntegrityUtil.additionalReferentialIntegrityChecks()` (line 823): `db.parse(url.openStream())` for every label | `getCachedLabelIdentifiers(url)` hit → uses `cached.getLogicalIdentifiers()` / `getLidOrLidVidReferences()` directly; no disk read, no parse |
| `CrossLabelFileAreaReferenceChecker.add()` (line 41): `DocumentBuilderFactory…newDocumentBuilder().parse(target.getUrl().openStream())` | `getCachedLabelIdentifiers(target.getUrl())` hit → uses `cached.getLogicalIdentifiers()` directly; no disk read, no parse |
| `collectAllContextReferences()`: three Saxon XPath evaluations over a re-parsed DOM | `getCachedLabelIdentifiers(url)` hit → uses `cached.getContextAreaRefs()` directly; no Saxon XPath, no re-parse |

A fallback disk-parse is retained in all three sites for labels that were not in the initial validation pass (e.g. labels that failed parsing and were never cached). This keeps correctness for edge cases while achieving the optimization for the common case.
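
The cache-first lookup with a retained fallback can be sketched as follows. Everything here is hypothetical: `CacheFirstLookupSketch`, the injected `fallbackParser` function, and the parse counter are illustration devices, not names from the PR. Whether the real fallback stores its result back into the cache is not stated, so this sketch does not.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical cache-first lookup: use the cached identifiers, parse only on a miss.
final class CacheFirstLookupSketch {
  private final Map<String, List<String>> cache = new ConcurrentHashMap<>();
  private final Function<String, List<String>> fallbackParser;
  private int fallbackParses = 0;

  CacheFirstLookupSketch(Function<String, List<String>> fallbackParser) {
    this.fallbackParser = fallbackParser;
  }

  // Called during the initial validation pass for labels that parsed successfully.
  void cache(String labelUrl, List<String> logicalIdentifiers) {
    cache.put(labelUrl, List.copyOf(logicalIdentifiers));
  }

  List<String> logicalIdentifiersFor(String labelUrl) {
    List<String> cached = cache.get(labelUrl);
    if (cached != null) {
      return cached; // common case: no disk read, no parse
    }
    fallbackParses++; // edge case: label was never cached, so parse it now
    return fallbackParser.apply(labelUrl);
  }

  int fallbackParses() {
    return fallbackParses;
  }
}
```

Counting fallback parses in a test is one way to check the acceptance criterion: for a bundle whose labels all passed through the initial validation pass, the counter should stay at zero.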

Test evidence

NASA-PDS/validate#15-2 — label with \n in logical_identifier: INVALID_FIELD_VALUE reported exactly once (from cacheIdentifiers()), not duplicated by the referential integrity phase ✅
NASA-PDS/validate#401-1 — bundle with \n in lid_reference (3 occurrences): all 3 INVALID_FIELD_VALUE errors detected from cache; referential integrity phase uses cached values with no re-parse ✅
Full Cucumber suite: 297/297 scenarios pass ✅

🤖 Generated with Claude Code

jordanpadams requested a review from a team as a code owner May 19, 2026 06:59

jordanpadams assigned al-niessner May 19, 2026

jordanpadams mentioned this pull request May 19, 2026

As a PDS data engineer, I want parsed DOM trees cached and reused during referential integrity checks so that labels are not re-parsed from disk a second time #1568

Open

jordanpadams wants to merge 2 commits into main from issues/1568-cache-label-identifiers
