
Support OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) #772

@jordanpadams

Summary

As a data aggregator or metadata harvester, I want the PDS Registry API to expose an OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting, v2.0) endpoint, so that standard harvesting tools (OAIster, BASE, OpenDOAR, OCLC, etc.) can automatically discover and ingest PDS planetary science metadata without requiring custom API integration.

Motivation

OAI-PMH is the de facto standard protocol for metadata harvesting in the scholarly and scientific data repository community. Many discovery services and catalog aggregators require it to index repositories. Without OAI-PMH, PDS metadata is invisible to these harvesters, limiting data discoverability. The protocol is lightweight, well-understood, and a natural fit for the PDS registry's existing data model (LIDVIDs → identifiers, discipline nodes → sets, PDS4 labels → metadata records).

OAI-PMH Protocol Overview (v2.0)

The protocol defines six verbs, all served from a single base URL (/oai):

| Verb | Description |
| --- | --- |
| Identify | Repository identity, granularity, deletedRecord policy |
| ListMetadataFormats | Available metadata schemas (e.g., oai_dc, oai_pds4) |
| ListSets | Hierarchical groupings for selective harvesting |
| ListIdentifiers | Paged list of record headers (identifier + datestamp) |
| ListRecords | Paged list of full metadata records |
| GetRecord | Single record by identifier |

Selective harvesting is via set (collection grouping) and from/until (datestamp range) parameters. Large result sets use a resumptionToken for cursor-based pagination.
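As a rough illustration of the selective-harvesting parameters described above, the sketch below builds the query strings a harvester might send. The class and method names are hypothetical, and the base URL used in the usage note is the /oai/2.0/ prefix proposed later in this issue, not an existing endpoint. It assumes the OAI-PMH rule that resumptionToken is an exclusive argument, so a follow-up request carries only the verb and the token.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical harvester-side helper; not part of the registry API. */
final class OaiRequestSketch {

    /** Builds the first request of a selective harvest; null filters are omitted. */
    static String listRecords(String baseUrl, String metadataPrefix,
                              String set, String from, String until) {
        Map<String, String> args = new LinkedHashMap<>();
        args.put("verb", "ListRecords");
        args.put("metadataPrefix", metadataPrefix);
        if (set != null) args.put("set", set);
        if (from != null) args.put("from", from);
        if (until != null) args.put("until", until);
        return baseUrl + "?" + encode(args);
    }

    /** Builds a follow-up request: only the verb and the token are sent. */
    static String resume(String baseUrl, String verb, String resumptionToken) {
        Map<String, String> args = new LinkedHashMap<>();
        args.put("verb", verb);
        args.put("resumptionToken", resumptionToken);
        return baseUrl + "?" + encode(args);
    }

    /** Joins arguments as key=value pairs, URL-encoding each value. */
    private static String encode(Map<String, String> args) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : args.entrySet()) {
            query.add(e.getKey() + "="
                    + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return query.toString();
    }
}
```

For example, `OaiRequestSketch.listRecords("https://pds.nasa.gov/api/oai/2.0/", "oai_dc", "geo", "2024-01-01", null)` would request Dublin Core records in the geo set harvested on or after that date.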

API URL Namespace Design Rationale

Interaction models vs. query syntaxes

OAI-PMH, SQL, and the existing registry query language (?q=) are not interchangeable filter syntaxes over the same resources — they are distinct interaction models with fundamentally different contracts:

| Aspect | Registry API (/products) | OAI-PMH | Hypothetical SQL |
| --- | --- | --- | --- |
| Request style | Resource-oriented GET | Verb-dispatch GET/POST | Query-body POST |
| Pagination | ?start / ?limit | resumptionToken cursor | Cursor / offset |
| Response shape | JSON or PDS4 XML | OAI-PMH XML envelope | Tabular |
| Error model | HTTP 4xx + JSON body | OAI-PMH XML <error> element | Varies |
| Protocol version | API semver | OAI-PMH 2.0 | SQL dialect |

Because the interaction models differ this deeply, they cannot cleanly share a URL namespace. A query parameter like ?syntax=oai would be a flag that silently changes the entire behavior of an endpoint — worse for clients than an explicit URL distinction.

Recommended URL structure: separate path prefixes, same deployment

The cleanest approach is separate versioned path prefixes per interaction model, all served by the same application:

https://pds.nasa.gov/api/registry/1/   ← current resource API (/products, /classes, ...)
https://pds.nasa.gov/api/oai/2.0/      ← OAI-PMH (version tracks OAI-PMH protocol version)
https://pds.nasa.gov/api/sql/1/        ← hypothetical SQL interface (future)

This pattern is established precedent: OpenSearch itself exposes /_search, /_cat, and /_cluster as separate API surfaces with different conventions, all served by one process. Each prefix can carry its own version number, its own OpenAPI/spec document, and its own error semantics — without requiring separate deployments or repositories.

Why not /_oai? In CouchDB and OpenSearch, the leading underscore marks reserved system paths and distinguishes them from user-named databases and indices; it does not signal an alternate external protocol. A clean prefix like /oai/ is more accurate and client-friendly.

Why not separate services/repos? Operational overhead is not justified when the only difference is URL namespace and response serialization. The underlying data retrieval (OpenSearch, SearchRequestFactory) is shared. Separate deployments make sense only if the protocols have genuinely divergent scaling, SLA, or team ownership requirements.

This requirement scopes the OAI-PMH work to the /oai/2.0/ prefix and should be treated as the first instance of this multi-protocol pattern, establishing the convention for any future interfaces (SQL, SPARQL, etc.).


Proposed Implementation Options

Option A — Dedicated /oai Controller (Recommended)

A new OaiPmhController handles GET and POST to /oai, dispatching on the ?verb= query parameter. This follows the standard OAI-PMH base URL pattern expected by all harvesters.

Key design points:

  • OaiPmhController in gov.nasa.pds.api.registry.controllers — routes on ?verb= to handler methods; returns badVerb error for unknown verbs
  • New OaiPmhXmlSerializer in the view layer — produces the OAI-PMH XML response envelope (namespace http://www.openarchives.org/OAI/2.0/)
  • Reuses SearchRequestFactory and existing OpenSearch search infrastructure for data retrieval
  • PDS data model mappings:
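    • LIDVIDs → OAI-PMH record identifiers (a LIDVID is already a URN of the form urn:nasa:pds:<bundle>[:<collection>[:<product>]]::<version>, so it can be used as the identifier as-is)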
    • Discipline nodes → OAI-PMH Sets (setSpec = node prefix, e.g., geo, atm)
    • Product classes → sub-sets (e.g., geo:Bundle, geo:Collection)
    • ops:Harvest_Info/ops:harvest_date_time → OAI-PMH datestamp
  • Metadata formats to support:
    • oai_dc — Dublin Core (required by OAI-PMH spec); map PDS4 fields → 15 DC elements
    • oai_pds4 — Native PDS4 XML label as the metadata record (custom schema)
  • ResumptionToken: encode OpenSearch search_after cursor as base64 JSON; include completeListSize and cursor attributes per spec
  • Error responses: all eight OAI-PMH 2.0 error codes (badVerb, badArgument, badResumptionToken, cannotDisseminateFormat, idDoesNotExist, noRecordsMatch, noMetadataFormats, noSetHierarchy), returned as OAI-PMH XML error envelopes rather than HTTP 4xx responses
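The resumptionToken design point above could look roughly like the following. This is a minimal sketch under stated assumptions: the class name, the JSON field names, and the choice of sort keys (harvest time plus LIDVID) are all hypothetical, and the JSON is assembled by hand only to keep the example free of dependencies (a real implementation would more likely use the JSON library already on the classpath). Because resumptionToken is an exclusive argument, the sketch also stores the original metadataPrefix and set in the token so the server can rebuild the query.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical sketch of an opaque resumptionToken; not existing registry code. */
final class ResumptionTokenSketch {

    /** Serializes the cursor state as JSON and wraps it in URL-safe base64. */
    static String encode(long harvestEpochMillis, String lidvid,
                         String metadataPrefix, String set) {
        String json = "{\"searchAfter\":[" + harvestEpochMillis + "," + quote(lidvid) + "],"
                + "\"metadataPrefix\":" + quote(metadataPrefix) + ","
                + "\"set\":" + (set == null ? "null" : quote(set)) + "}";
        return Base64.getUrlEncoder().withoutPadding()
                .encodeToString(json.getBytes(StandardCharsets.UTF_8));
    }

    /**
     * Returns the JSON payload of a token. Malformed base64 raises
     * IllegalArgumentException, which a controller could map to badResumptionToken.
     */
    static String decodeToJson(String token) {
        return new String(Base64.getUrlDecoder().decode(token), StandardCharsets.UTF_8);
    }

    /** Minimal JSON string quoting: escapes backslashes and double quotes only. */
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```

The URL-safe alphabet without padding keeps the token free of +, /, and = characters, so harvesters can echo it back in a query string without further escaping.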

Pros: Single base URL with verb dispatch, which is the model every OAI-PMH harvester is built around; isolated controller keeps OAI-PMH logic separate; minimal new code; fully reuses the existing search stack; consistent with the codebase's single-responsibility structure.

Cons: Verb dispatch is query-parameter-driven rather than path-based, which is slightly non-idiomatic for Spring MVC (but correct per OAI-PMH spec and easily implemented with @RequestParam).
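To make the verb-dispatch idea concrete, here is a framework-free sketch of the routing decision. It is not the proposed OaiPmhController: the class name and the returned handler names are placeholders, a Spring version would receive the parameters via @RequestParam, and the argument checks cover only a few cases (a complete implementation would validate every verb's allowed arguments). Verb names are matched exactly in this sketch.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical routing sketch; returns a handler name or an OAI-PMH error code. */
final class VerbDispatchSketch {

    /** Constant names mirror the six protocol verbs exactly. */
    enum Verb { Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, GetRecord }

    /** Resolves the verb parameter; an empty result corresponds to badVerb. */
    static Optional<Verb> parseVerb(String raw) {
        if (raw == null) return Optional.empty();
        for (Verb v : Verb.values()) {
            if (v.name().equals(raw)) return Optional.of(v);
        }
        return Optional.empty();
    }

    /** Decides which handler would serve the request parameters. */
    static String dispatch(Map<String, String> params) {
        Optional<Verb> parsed = parseVerb(params.get("verb"));
        if (parsed.isEmpty()) return "error:badVerb";
        Verb verb = parsed.get();
        switch (verb) {
            case GetRecord:
                // Both identifier and metadataPrefix are required.
                if (!params.containsKey("identifier") || !params.containsKey("metadataPrefix")) {
                    return "error:badArgument";
                }
                return "handleGetRecord";
            case ListIdentifiers:
            case ListRecords:
                // resumptionToken is exclusive: only verb + token may be present.
                if (params.containsKey("resumptionToken")) {
                    return params.size() == 2 ? "handle" + verb : "error:badArgument";
                }
                return params.containsKey("metadataPrefix") ? "handle" + verb : "error:badArgument";
            default:
                return "handle" + verb;
        }
    }
}
```

Returning an error code instead of throwing keeps the sketch close to the protocol's model, where problems such as badVerb are reported inside a normal OAI-PMH response.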

Suggested Dublin Core field mapping:

| DC Element | PDS4 source field(s) |
| --- | --- |
| dc:identifier | LIDVID |
| dc:title | title |
| dc:description | abstract_desc |
| dc:date | ops:Harvest_Info/ops:harvest_date_time |
| dc:type | product class (e.g., Product_Bundle) |
| dc:publisher | discipline node label / "NASA Planetary Data System" |
| dc:creator | Investigation_Area/name or Observing_System_Component/name |
| dc:subject | Primary_Result_Summary/Science_Facets/discipline_name |
| dc:format | File/file_name extension or Encoding_Type |
| dc:source | parent LIDVID (from Internal_Reference) |
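The mapping above might be serialized along these lines. The sketch is illustrative only: the class and method names are hypothetical, it covers just a handful of the elements, and it omits the xsi:schemaLocation attribute that a complete oai_dc record would normally carry. The two namespace URIs are the standard oai_dc and Dublin Core element namespaces.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical oai_dc serializer sketch; not the proposed OaiPmhXmlSerializer. */
final class DublinCoreSketch {

    /** Maps a few PDS4 values to Dublin Core element names, in output order. */
    static Map<String, String> fromPds4(String lidvid, String title,
                                        String productClass, String harvestDateTime) {
        Map<String, String> dc = new LinkedHashMap<>();
        dc.put("identifier", lidvid);
        dc.put("title", title);
        dc.put("date", harvestDateTime);
        dc.put("type", productClass);
        dc.put("publisher", "NASA Planetary Data System");
        return dc;
    }

    /** Renders the elements inside an oai_dc:dc container, skipping empty values. */
    static String toOaiDc(Map<String, String> dc) {
        StringBuilder xml = new StringBuilder();
        xml.append("<oai_dc:dc xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\"")
           .append(" xmlns:dc=\"http://purl.org/dc/elements/1.1/\">");
        for (Map.Entry<String, String> e : dc.entrySet()) {
            if (e.getValue() == null || e.getValue().isEmpty()) continue;
            xml.append("<dc:").append(e.getKey()).append('>')
               .append(escape(e.getValue()))
               .append("</dc:").append(e.getKey()).append('>');
        }
        return xml.append("</oai_dc:dc>").toString();
    }

    /** Escapes the characters that are unsafe in XML element content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Skipping null or empty values reflects that every Dublin Core element is optional in oai_dc, so products lacking a given PDS4 field can still be disseminated.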

Option B — Separate OAI-PMH Microservice

A standalone Spring Boot application (new repository) that calls existing registry-api REST endpoints and reformats responses into OAI-PMH XML.

Pros: Complete decoupling; independently versioned and deployed.

Cons: Extra network hop per request; must be kept in sync with registry-api changes; adds a second repo/deployment to maintain; duplicates search logic.


Option C — Content-Type Negotiation on Existing Endpoints

Add application/oai+xml as an Accept header option on /products endpoints, adding an OAI-PMH serializer alongside existing ones.

Pros: No new URL namespace; reuses existing API versioning.

Cons: OAI-PMH is verb-driven, not resource-driven — Identify, ListSets, and ListMetadataFormats have no natural mapping to /products; resumptionToken pagination semantics conflict with the existing ?start/?limit model; complicates existing serializers significantly.


Recommendation

Option A, served at /oai/2.0/ per the URL namespace rationale above. OAI-PMH has well-defined protocol semantics that are fundamentally verb-driven (not REST resource-driven). A dedicated OaiPmhController keeps those semantics cleanly isolated, fully reuses the existing OpenSearch/SearchRequestFactory infrastructure, and presents the single base URL that standard harvesting tools are configured with. It is the minimal, correct, and maintainable approach, and it establishes the path-prefix-per-interaction-model pattern for any future interfaces.

Acceptance Criteria

  • GET /oai/2.0/?verb=Identify returns repository identity XML per OAI-PMH 2.0 spec
  • GET /oai/2.0/?verb=ListMetadataFormats returns at minimum oai_dc; optionally oai_pds4
  • GET /oai/2.0/?verb=ListSets returns discipline nodes (and optionally product class sub-sets)
  • GET /oai/2.0/?verb=ListIdentifiers supports metadataPrefix, set, from, until, and resumptionToken (the latter as an exclusive argument, per spec)
  • GET /oai/2.0/?verb=ListRecords supports same parameters; returns full records
  • GET /oai/2.0/?verb=GetRecord&identifier=<id>&metadataPrefix=oai_dc returns a single record
  • All eight OAI-PMH error conditions (including badResumptionToken and noMetadataFormats) returned as OAI-PMH XML (not HTTP error codes)
  • resumptionToken enables full cursor-based pagination through large result sets
  • oai_dc metadata format passes OAI-PMH validator (e.g., OVAL)
  • POST /oai/2.0/ behaves identically to GET /oai/2.0/ per spec requirement
  • Integration test added to Postman collection in the registry repository per Integration Testing Guide
  • Repository validates against an OAI-PMH compliance test suite
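For the error-handling criterion above, an error response might be assembled as in the following sketch. The class and method names are hypothetical, and the output is trimmed: it leaves out the schema location attributes that a complete response would carry on the root element. The envelope layout (responseDate, request, error with a code attribute) follows the OAI-PMH 2.0 response format, and such a response would be sent with a normal HTTP success status.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

/** Hypothetical sketch of an OAI-PMH error envelope; not existing registry code. */
final class OaiErrorSketch {

    /**
     * Builds an error response. The timestamp is passed in so the output is
     * deterministic; it is truncated to whole seconds and rendered in UTC.
     */
    static String errorResponse(Instant now, String baseUrl, String code, String message) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<OAI-PMH xmlns=\"http://www.openarchives.org/OAI/2.0/\">\n"
                + "  <responseDate>" + now.truncatedTo(ChronoUnit.SECONDS) + "</responseDate>\n"
                + "  <request>" + escape(baseUrl) + "</request>\n"
                + "  <error code=\"" + code + "\">" + escape(message) + "</error>\n"
                + "</OAI-PMH>\n";
    }

    /** Escapes the characters that are unsafe in XML element content. */
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

A test for the badVerb criterion could then assert that the body contains `<error code="badVerb">` while the HTTP layer still reports success.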

Additional Context


For Internal Dev Team To Complete

⚙️ Engineering Details

🎉 Integration & Test
