As a PDS data engineer, I want parsed DOM trees cached and reused during referential integrity checks so that labels are not re-parsed from disk a second time #1568

@jordanpadams

Description

Checked for duplicates

Yes - I've already checked

πŸ§‘β€πŸ”¬ User Persona(s)

Data Engineer, Node Operator

💪 Motivation

...so that I can validate large bundles faster by eliminating redundant I/O and XML parsing that re-reads every label file from disk during the referential integrity phase.

📖 Additional Details

Impact: Medium

Relevant files:

  • src/main/java/gov/nasa/pds/tools/util/ReferentialIntegrityUtil.java (lines 711-784)
  • src/main/java/gov/nasa/pds/tools/validate/CrossLabelFileAreaReferenceChecker.java (lines 37-57)

Problem:
ReferentialIntegrityUtil.additionalReferentialIntegrityChecks() creates a new DocumentBuilder and re-parses every label file from disk for every label in the bundle, even though these labels were already fully parsed during the label validation pass. The DOM tree is discarded after initial validation and rebuilt from scratch. Similarly, CrossLabelFileAreaReferenceChecker.add() re-parses label XML from scratch.

Recommendation:
Cache the parsed DOM or extracted identifiers (LIDs, LIDVIDs, file references) from the initial label validation pass and reuse them during referential integrity checking.
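
One possible shape for the recommendation above is a small cache that parses each label at most once and hands the same DOM back to later phases. This is only a sketch: `LabelDomCache`, `get`, `parseCount`, and `clear` are hypothetical names that do not exist in validate, and keying by `java.nio.file.Path` is an assumption (the real code may identify labels by URL).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

/** Hypothetical helper, not part of the validate codebase. */
final class LabelDomCache {
  // Normalized absolute label path -> DOM parsed during the first pass.
  private final Map<Path, Document> cache = new ConcurrentHashMap<>();
  // Counts real parses so a test can check the "only once" criterion.
  private final AtomicInteger parseCount = new AtomicInteger();

  /** Returns the cached DOM, parsing the label only on first request. */
  Document get(Path label) {
    Path key = label.toAbsolutePath().normalize();
    return cache.computeIfAbsent(key, this::parse);
  }

  /** Number of times a label was actually read and parsed from disk. */
  int parseCount() {
    return parseCount.get();
  }

  /** Drops all cached trees, e.g. once a bundle has been fully validated. */
  void clear() {
    cache.clear();
  }

  private Document parse(Path label) {
    parseCount.incrementAndGet();
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      DocumentBuilder builder = factory.newDocumentBuilder();
      return builder.parse(label.toFile());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (ParserConfigurationException | SAXException e) {
      throw new IllegalStateException("Cannot parse label " + label, e);
    }
  }
}
```

With something like this, the label validation pass and the referential integrity pass would both call `get(...)`, and only the first call per label would touch the disk. Holding a full DOM per label may be too memory-hungry for very large bundles, which is where caching only the extracted identifiers (LIDs, LIDVIDs, file references) would be the lighter alternative. Note also that DOM `Document` objects are not guaranteed to be safe for concurrent access, so shared trees would need to be read under appropriate synchronization.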

For Internal Dev Team To Complete

Acceptance Criteria

Given a bundle with many product labels
When I perform validation including referential integrity checks
Then I expect each label file to be read and parsed from disk only once

βš™οΈ Engineering Details

🎉 I&T

Metadata

Assignees

No fields configured for Feature.

Projects

Status

ToDo

Status

Review/QA

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
