Cache label identifier data to eliminate redundant parsing #1607
jordanpadams wants to merge 2 commits into
After each label is parsed by pds4-jparser, extract and cache the logical identifiers, lid/lidvid references, and context area references into a LabelCacheEntry.

In additionalReferentialIntegrityChecks(), use the cached context area refs to skip three expensive Saxon XPath evaluations per label instead of re-running them against a freshly re-parsed DOM.

Main identifiers (logicalIdentifiers, lidOrLidVidReferences) are still re-parsed from disk in additionalReferentialIntegrityChecks() so that INVALID_FIELD_VALUE is correctly detected and reported for identifier values containing newlines: pds4-jparser normalizes newlines away, so the cached values cannot be used for that check.

Also fixes CrossLabelFileAreaReferenceChecker.reset() to clear the isObservational map alongside knownRefs, preventing static state from leaking across validation runs.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
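A minimal sketch of what such a cache entry and its cache could look like, assuming an immutable holder keyed by label location. The names `LabelCacheEntrySketch` and `LabelIdentifierCacheSketch`, the field layout, and the use of a `String` key are all hypothetical; the actual `LabelCacheEntry` in this PR may be shaped differently.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for LabelCacheEntry: an immutable snapshot of the
// identifier data extracted from one parsed label.
final class LabelCacheEntrySketch {
  private final List<String> logicalIdentifiers;
  private final List<String> lidOrLidVidReferences;
  private final List<String> contextAreaReferences;

  LabelCacheEntrySketch(List<String> logicalIdentifiers,
                        List<String> lidOrLidVidReferences,
                        List<String> contextAreaReferences) {
    // Defensive, unmodifiable copies so later phases cannot mutate the cache.
    this.logicalIdentifiers = List.copyOf(logicalIdentifiers);
    this.lidOrLidVidReferences = List.copyOf(lidOrLidVidReferences);
    this.contextAreaReferences = List.copyOf(contextAreaReferences);
  }

  List<String> getLogicalIdentifiers() { return logicalIdentifiers; }
  List<String> getLidOrLidVidReferences() { return lidOrLidVidReferences; }
  List<String> getContextAreaReferences() { return contextAreaReferences; }
}

// Hypothetical cache keyed by label location (assumed here to be a URL string).
final class LabelIdentifierCacheSketch {
  private static final Map<String, LabelCacheEntrySketch> CACHE = new ConcurrentHashMap<>();

  static void put(String labelUrl, LabelCacheEntrySketch entry) { CACHE.put(labelUrl, entry); }

  // Returns null when the label was never cached (e.g. it failed to parse).
  static LabelCacheEntrySketch get(String labelUrl) { return CACHE.get(labelUrl); }

  // Clears static state between validation runs.
  static void reset() { CACHE.clear(); }
}
```

Copying the lists in the constructor keeps each entry safe to share between the validation and referential integrity phases.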
…ntegrity phase

- LabelValidationRule.cacheIdentifiers() now reports \n errors (reportCarriageReturns=true) so the referential integrity phase can safely use cached identifiers without re-parsing
- CrossLabelFileAreaReferenceChecker.add() uses cached logicalIdentifiers when available, falling back to disk parse only for labels not in the initial validation pass
- ReferentialIntegrityUtil.additionalReferentialIntegrityChecks() uses cached logicalIdentifiers and lidOrLidVidReferences when available, eliminating all disk re-parsing for the common case; fallback parse retained for uncached labels

All 297 tests pass. Resolves the full acceptance criteria for #1568.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
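The idea of detecting line breaks once, on the raw value, before a normalized copy is used downstream can be sketched as follows. This is only an illustration: `NewlineCheckSketch`, both method names, and the whitespace-collapsing `normalize` are assumptions, and the real normalization performed by pds4-jparser and the real `cacheIdentifiers()` logic may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: report line breaks on raw identifier values once,
// at caching time, so later phases can rely on the cached values.
final class NewlineCheckSketch {

  // Returns one problem message per raw value that contains \n or \r.
  static List<String> findNewlineProblems(List<String> rawValues) {
    List<String> problems = new ArrayList<>();
    for (String value : rawValues) {
      if (value.indexOf('\n') >= 0 || value.indexOf('\r') >= 0) {
        // Escape the line breaks so the message stays on one line.
        String shown = value.replace("\r", "\\r").replace("\n", "\\n");
        problems.add("INVALID_FIELD_VALUE: identifier contains a line break: " + shown);
      }
    }
    return problems;
  }

  // Assumed stand-in for parser normalization: trims and collapses whitespace.
  // After this step the line break is no longer detectable.
  static String normalize(String raw) {
    return raw.trim().replaceAll("\\s+", " ");
  }
}
```

Running the check before normalization is what makes a single report possible: once the value has been normalized, the information needed for the error is gone.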
Acceptance Criteria Verification

From issue #1568:

How each label is now parsed exactly once

Phase 1 (initial label validation): the DOM is built once from disk.
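Phase 2 (referential integrity checks): the three former re-parse sites, additionalReferentialIntegrityChecks(), CrossLabelFileAreaReferenceChecker.add(), and collectAllContextReferences(), are all now cache-first.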
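A cache-first lookup with a disk-parse fallback can be sketched roughly as below. `CacheFirstLookupSketch`, its method names, and the `Function`-based parser hook are hypothetical and only illustrate the pattern; they are not the code in this PR.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical illustration of "cache first, parse from disk only on a miss".
final class CacheFirstLookupSketch {
  private final Map<String, List<String>> cache = new ConcurrentHashMap<>();
  private final Function<String, List<String>> diskParser; // assumed fallback parser
  private int fallbackParses = 0;

  CacheFirstLookupSketch(Function<String, List<String>> diskParser) {
    this.diskParser = diskParser;
  }

  // Called during initial validation, once the label has been parsed.
  void cache(String labelUrl, List<String> logicalIdentifiers) {
    cache.put(labelUrl, List.copyOf(logicalIdentifiers));
  }

  // Called during referential integrity checks.
  List<String> logicalIdentifiers(String labelUrl) {
    List<String> cached = cache.get(labelUrl);
    if (cached != null) {
      return cached; // common case: no disk access
    }
    fallbackParses++; // uncached label: re-parse from disk
    return diskParser.apply(labelUrl);
  }

  int fallbackParses() { return fallbackParses; }
}
```

In this sketch the fallback result is deliberately not stored, so a label that was never cached during initial validation is re-parsed each time it is requested.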
A fallback disk parse is retained at all three sites for labels that were not in the initial validation pass (e.g. labels that failed parsing and were never cached). This preserves correctness for edge cases while delivering the optimization in the common case.

Test evidence
🤖 Generated with Claude Code
Summary
Resolves #1568
- Add `LabelCacheEntry` POJO to hold pre-extracted identifier data (logical IDs, lid/lidvid refs, context area refs) from a parsed label
- In `LabelValidationRule`, cache identifiers (with `\n` detection enabled) into `ReferentialIntegrityUtil`'s `labelIdentifierCache`
- `additionalReferentialIntegrityChecks()` now uses cached `logicalIdentifiers` and `lidOrLidVidReferences` when available: no disk re-parse for the common case; fallback parse retained for labels not in the initial validation pass
- `CrossLabelFileAreaReferenceChecker.add()` uses cached `logicalIdentifiers` when available: no disk re-parse for file area reference checks
- `collectAllContextReferences()` uses cached context area refs to skip three Saxon XPath evaluations per label, falling back to a fresh parse when no cache entry exists
- `\n` detection (`INVALID_FIELD_VALUE`) happens once during `cacheIdentifiers()` (with `reportCarriageReturns=true`), so the referential integrity phase can safely use cached values without risking double-reporting or missed errors
- Fix `CrossLabelFileAreaReferenceChecker.reset()` to clear the `isObservational` map alongside `knownRefs`, preventing static state from leaking across validation runs
- Call `CrossLabelFileAreaReferenceChecker.reset()` from `ValidateLauncher` alongside `ReferentialIntegrityUtil.reset()`

Test plan
- `NASA-PDS/validate#15-2` passes (`\n` in `logical_identifier`: 1 `INVALID_FIELD_VALUE` reported, no double-reporting)
- `NASA-PDS/validate#401-1` passes (`\n` in `lid_reference`: 3 `INVALID_FIELD_VALUE` detected from cache, no re-parse needed)
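The static-state leak addressed by the `reset()` fix can be illustrated with a small sketch. `StaticCheckerSketch` and the collection types chosen for its two fields are assumptions for illustration only; they mirror the names `knownRefs` and `isObservational` from the description but not necessarily their real types.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical checker holding static state that must be cleared between runs.
final class StaticCheckerSketch {
  private static final Set<String> KNOWN_REFS = new HashSet<>();
  private static final Map<String, Boolean> IS_OBSERVATIONAL = new HashMap<>();

  static void add(String lid, boolean observational) {
    KNOWN_REFS.add(lid);
    IS_OBSERVATIONAL.put(lid, observational);
  }

  static boolean isKnown(String lid) { return KNOWN_REFS.contains(lid); }

  // Returns null when nothing has been recorded for this identifier.
  static Boolean isObservational(String lid) { return IS_OBSERVATIONAL.get(lid); }

  // Clearing only KNOWN_REFS would leave stale IS_OBSERVATIONAL entries
  // visible to the next validation run; both collections are cleared here.
  static void reset() {
    KNOWN_REFS.clear();
    IS_OBSERVATIONAL.clear();
  }
}
```

Any launcher that runs several validations in one JVM would need to call such a `reset()` between runs for the clearing to take effect.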