Refactor archive ingestion parsing and surface HPC provenance #142

Merged
tomvothecoder merged 10 commits into E3SM-Project:main from tomvothecoder:codex/remove-case-status-ingestion on Mar 18, 2026

@tomvothecoder (Collaborator) commented Mar 14, 2026

Summary

This PR refactors archive ingestion parsing to use the correct source files for execution metadata, restores CaseStatus-based run status parsing, centralizes canonical config snapshot comparison, and surfaces HPC provenance more clearly in API responses and the frontend.

What changed

Backend ingestion and parsing

  • Refactored the parser to return typed ParsedSimulation records instead of loose metadata dicts.
  • Made env_run.xml a required CaseDocs input and used it as the source of:
    • initialization_type
    • simulation_start_date
    • simulation_end_date
  • Expanded env_case.xml parsing to extract:
    • case_name
    • case_group
    • machine
    • REALUSER as HPC username
    • compset_alias
    • derived campaign / experiment_type
  • Updated env_build.xml parsing to also extract grid_resolution.
  • Simplified e3sm_timing parsing so it now focuses on:
    • timing-file LID as execution_id
    • derived run_start_date
    • derived run_end_date
  • Restored CaseStatus parsing as an optional artifact and used it to:
    • determine completed / failed / running / unknown
    • override timing-derived run timestamps based on the latest case.run attempt
  • Tightened execution ID handling:
    • require a timing-file LID
    • skip incomplete runs when it is missing
    • fall back to the execution directory basename when the timing LID does not match the directory name, so distinct runs are still preserved
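
The execution ID rules above can be sketched roughly as follows. This is an illustration only: `resolve_execution_id` is a hypothetical name, and the parser in this PR may structure the check differently. Returning `None` stands in for "skip this incomplete run".

```python
from __future__ import annotations

from pathlib import PurePosixPath


def resolve_execution_id(exec_dir: str, timing_lid: str | None) -> str | None:
    """Pick the execution ID for one execution directory (hypothetical helper).

    Returns None when the run should be skipped as incomplete.
    """
    # A timing-file LID is required; without it the run is treated as incomplete.
    if not timing_lid:
        return None

    # PurePosixPath ignores a trailing slash, so ".../1234.5/" still yields "1234.5".
    dir_name = PurePosixPath(exec_dir).name

    # If the LID disagrees with the directory name, prefer the directory
    # basename so that distinct run directories remain distinct records.
    if timing_lid != dir_name:
        return dir_name
    return timing_lid
```

For example, a directory named `1081156.251218-200923` with a matching LID keeps that LID, while a directory whose timing file reports a different LID is recorded under the directory name.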

Backend ingest flow and canonical deltas

  • Refactored ingest mapping to use typed drafts before validating into SimulationCreate.
  • Introduced SimulationConfigSnapshot to centralize canonical delta comparison.
  • Normalized git URLs before delta comparison so SSH and HTTPS variants do not create false config changes.
  • Removed the old model-level config delta field constant in favor of the new snapshot abstraction.
  • Updated schema descriptions so execution_id is documented as coming from the timing-file LID.
  • Added hpc_username to SimulationOut.
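
To illustrate why normalizing git URLs avoids false config deltas, here is a minimal sketch. `normalize_git_url`, `ConfigSnapshot`, and `diff_snapshots` are hypothetical stand-ins; they are not the `SimulationConfigSnapshot` implementation from this PR, and the field names are assumptions.

```python
from __future__ import annotations

import re
from dataclasses import asdict, dataclass
from urllib.parse import urlsplit

# Matches scp-like remotes such as "git@github.com:org/repo.git".
# The (?!//) lookahead keeps "https://..." from matching.
_SCP_LIKE = re.compile(r"^(?:[\w.-]+@)?(?P<host>[\w.-]+):(?P<path>(?!//).+)$")


def normalize_git_url(url: str) -> str:
    """Reduce SSH and HTTPS remotes to a comparable "host/path" form."""
    url = url.strip()
    match = _SCP_LIKE.match(url)
    if match:
        host, path = match.group("host"), match.group("path")
    else:
        parts = urlsplit(url)
        host, path = parts.hostname or "", parts.path
    path = path.strip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    return f"{host.lower()}/{path}"


@dataclass(frozen=True)
class ConfigSnapshot:
    """Hypothetical subset of canonical config fields."""

    compset_alias: str
    grid_resolution: str
    git_repository_url: str

    def normalized(self) -> dict[str, str]:
        values = asdict(self)
        values["git_repository_url"] = normalize_git_url(self.git_repository_url)
        return values


def diff_snapshots(old: ConfigSnapshot, new: ConfigSnapshot) -> dict[str, tuple[str, str]]:
    """Return {field: (old, new)} for fields that differ after normalization."""
    before, after = old.normalized(), new.normalized()
    return {key: (before[key], after[key]) for key in before if before[key] != after[key]}
```

With this shape, two snapshots that differ only in `git@github.com:E3SM-Project/E3SM.git` versus `https://github.com/E3SM-Project/E3SM.git` produce an empty delta.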

Frontend

  • Added typed createdByUser / lastUpdatedByUser support to frontend simulation types.
  • Added HPC username to frontend simulation types and display.
  • Added browse filtering by HPC username.
  • Updated creator filter labels to use createdByUser.email when available.
  • Displayed HPC username in the simulation details view.

Tests

  • Expanded parser coverage for env_case, env_build, env_run, CaseStatus, e3sm_timing, parser merge behavior, incomplete-run skipping, and timing LID mismatch handling.
  • Expanded ingest coverage for typed drafts, canonical config snapshots, git URL normalization, duplicate/idempotent ingestion, and persisted canonical comparisons.
  • Added broad test coverage for the NERSC archive ingestor around discovery, retries, config parsing, state handling, and logging.
  • Removed the obsolete model test tied to the deleted config delta field constant.

User-facing impact

  • Archive ingests should produce more accurate run dates and statuses.
  • HPC provenance is now returned in simulation responses and visible/filterable in the UI.
  • Canonical config delta calculation is more stable, especially for equivalent git URLs.

Testing

  • uv run pytest tests/features/ingestion/parsers/test_case_docs.py tests/features/ingestion/parsers/test_case_status.py tests/features/ingestion/parsers/test_e3sm_timing.py tests/features/ingestion/parsers/test_parser.py tests/features/ingestion/test_ingest.py tests/features/ingestion/test_nersc_archive_ingestor.py
  • Result: 173 passed

Copilot AI left a comment

Pull request overview

Refactors ingestion metadata parsing to derive simulation timeline fields from
CaseDocs/env_run.xml.* plus timing-derived run metadata, removes the
CaseStatus parser, and significantly expands tests for the NERSC archive
ingestor and parser utilities.

Changes:

  • Replace CaseStatus-based ingestion metadata with env_run.xml.* +
    e3sm_timing.* parsing, and add a run-artifact status helper.
  • Tighten ingestion required files (env_run.xml.*, e3sm_timing.*, etc.) and
    make parser metadata assembly more tolerant to malformed parser output.
  • Add/expand tests covering NERSC ingestor behavior, parser file requirements,
    and parser utility helpers.
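
As a rough illustration of reading timeline fields from env_run.xml, the sketch below pulls `id`/`value` pairs from `<entry>` elements. It assumes the usual CIME layout of `<entry id="..." value="...">` and assumes that `RUN_TYPE` and `RUN_STARTDATE` are the entries behind `initialization_type` and `simulation_start_date`; the PR does not spell out the mapping, and how `simulation_end_date` is derived is left out here. `parse_env_run` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from CIME entry ids to the field names used in this PR.
FIELD_BY_ENTRY_ID = {
    "RUN_TYPE": "initialization_type",
    "RUN_STARTDATE": "simulation_start_date",
}


def parse_env_run(xml_text: str) -> dict[str, str]:
    """Extract the mapped entries from env_run.xml content (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    parsed: dict[str, str] = {}
    for entry in root.iter("entry"):
        field = FIELD_BY_ENTRY_ID.get(entry.get("id", ""))
        value = entry.get("value")
        # Ignore entries we do not map and entries without a value attribute.
        if field and value is not None:
            parsed[field] = value
    return parsed


SAMPLE_ENV_RUN = """
<file id="env_run.xml" version="2.0">
  <group id="run_begin_stop_restart">
    <entry id="RUN_TYPE" value="hybrid"/>
    <entry id="RUN_STARTDATE" value="0001-01-01"/>
    <entry id="STOP_N" value="5"/>
  </group>
</file>
"""

print(parse_env_run(SAMPLE_ENV_RUN))
```

Because the real inputs match `env_run.xml.*` and may be gzip-compressed, a production parser would open the file first and then hand the text to a function like this.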

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `backend/app/features/ingestion/parsers/parser.py` | Updates FILE_SPECS to require env_run + timing, removes CaseStatus wiring, and merges run-artifact status into parsed metadata. |
| `backend/app/features/ingestion/parsers/case_docs.py` | Expands env_case/env_build parsing and adds env_run parsing + artifact-derived status helper. |
| `backend/app/features/ingestion/parsers/e3sm_timing.py` | Narrows timing parsing to execution/run timing metadata and computes run start/end timestamps. |
| `backend/app/features/ingestion/parsers/case_status.py` | Removes CaseStatus parser implementation. |
| `backend/tests/features/ingestion/test_nersc_archive_ingestor.py` | Adds broad unit coverage for config parsing, state handling, request failures, logging, and `__main__`. |
| `backend/tests/features/ingestion/parsers/test_parser.py` | Updates parser integration tests to reflect required env_run/timing and removed CaseStatus. |
| `backend/tests/features/ingestion/parsers/test_case_docs.py` | Adds tests for new env_run parsing, campaign/experiment derivation, and run-artifact status helper. |
| `backend/tests/features/ingestion/parsers/test_e3sm_timing.py` | Reworks timing parser tests around execution_id and run start/end derivation behavior. |
| `backend/tests/features/ingestion/parsers/test_case_status.py` | Removes CaseStatus parser tests. |
| `backend/tests/features/ingestion/parsers/test_utils.py` | Adds tests for _open_text / _get_open_func gzip/plain handling. |

Comments suppressed due to low confidence (1)

backend/app/features/ingestion/parsers/parser.py:413

  • _parse_all_files signature now takes exec_dir, but the docstring parameter list still only documents files. Update the docstring to include/describe exec_dir.
def _parse_all_files(exec_dir: str, files: dict[str, str | None]) -> SimulationMetadata:
    """Pass discovered files to their respective parser functions.

    Parameters
    ----------
    files : dict[str, str | None]
        Dictionary of file paths for each file type.

    Returns
    -------
    SimulationMetadata
        Dictionary with parsed results from each file type.
    """

@tomvothecoder marked this pull request as ready for review March 18, 2026 21:49
@tomvothecoder requested a review from Copilot March 18, 2026 21:49
Copilot AI left a comment

Pull request overview

Refactors archive-ingestion metadata parsing to remove CaseStatus usage, derive
timeline metadata from env_run.xml.* + e3sm_timing.*, and expands test
coverage (notably for the NERSC archive ingestor).

Changes:

  • Replace CaseStatus parsing with env_run.xml.*-derived simulation dates and
    e3sm_timing.*-derived run metadata (start/end).
  • Introduce typed parser output (ParsedSimulation) and normalize canonical
    config delta comparison via SimulationConfigSnapshot.
  • Expand ingestion-related test coverage and remove case_status tests.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `frontend/src/types/simulation.ts` | Adds hpcUsername + created/updated user preview typing. |
| `frontend/src/features/simulations/components/SimulationDetailsView.tsx` | Displays hpcUsername on the simulation details page. |
| `frontend/src/features/browse/components/BrowseFiltersSidePanel.tsx` | Adds creator label options + HPC username filter UI. |
| `frontend/src/features/browse/BrowsePage.tsx` | Adds hpcUsername filter state and creator display labels. |
| `backend/app/features/ingestion/parsers/case_docs.py` | Adds env_run.xml parsing + run-artifact status helper. |
| `backend/app/features/ingestion/parsers/e3sm_timing.py` | Narrows timing parsing to execution/run timing fields only. |
| `backend/app/features/ingestion/parsers/parser.py` | Removes case_status wiring; requires env_run + returns ParsedSimulation. |
| `backend/app/features/ingestion/parsers/types.py` | New typed parser output dataclass. |
| `backend/app/features/ingestion/ingest.py` | Consumes typed parser output; refactors canonical-delta calculation. |
| `backend/app/features/simulation/config_delta.py` | New normalized config snapshot + diff helper. |
| `backend/app/features/simulation/models.py` | Uses SimulationConfigSnapshot as the source of delta field names. |
| `backend/app/features/simulation/schemas.py` | Documentation updates; exposes hpc_username on SimulationOut. |
| `backend/tests/features/ingestion/parsers/test_case_docs.py` | Adds env_run + run-artifact helper coverage. |
| `backend/tests/features/ingestion/parsers/test_e3sm_timing.py` | Updates tests to new timing-derived run metadata behavior. |
| `backend/tests/features/ingestion/parsers/test_parser.py` | Updates integration tests for new required files + typed output. |
| `backend/tests/features/ingestion/parsers/test_utils.py` | Adds tests for text/gzip open helpers. |
| `backend/tests/features/ingestion/parsers/test_case_status.py` | Removes CaseStatus parser tests. |
| `backend/app/features/ingestion/parsers/case_status.py` | Removes CaseStatus parser implementation. |
| `backend/tests/features/ingestion/test_ingest.py` | Updates ingestion tests for typed parser output + snapshot/delta semantics. |
| `backend/tests/features/ingestion/test_nersc_archive_ingestor.py` | Expands unit coverage across config/state/retry/logging/main guard. |

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fb245caec2

@tomvothecoder changed the title from "Refactor ingestion metadata parsing and ingestor coverage" to "Refactor archive ingestion parsing and surface HPC provenance" on Mar 18, 2026
@tomvothecoder merged commit 5ae814e into E3SM-Project:main on Mar 18, 2026
1 check passed

2 participants