Summary
As a data aggregator or metadata harvester, I want the PDS Registry API to expose an OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting, v2.0) endpoint, so that standard harvesting tools (OAIster, BASE, OpenDOAR, OCLC, etc.) can automatically discover and ingest PDS planetary science metadata without requiring custom API integration.
Motivation
OAI-PMH is the de facto standard protocol for metadata harvesting in the scholarly and scientific data repository community. Many discovery services and catalog aggregators require it to index repositories. Without OAI-PMH, PDS metadata is invisible to these harvesters, limiting data discoverability. The protocol is lightweight, well-understood, and a natural fit for the PDS registry's existing data model (LIDVIDs → identifiers, discipline nodes → sets, PDS4 labels → metadata records).
OAI-PMH Protocol Overview (v2.0)
The protocol defines six verbs, all served from a single base URL (/oai):
| Verb | Description |
| --- | --- |
| Identify | Repository identity, granularity, deletedRecord policy |
| ListMetadataFormats | Available metadata schemas (e.g., oai_dc, oai_pds4) |
| ListSets | Hierarchical groupings for selective harvesting |
| ListIdentifiers | Paged list of record headers (identifier + datestamp) |
| ListRecords | Paged list of full metadata records |
| GetRecord | Single record by identifier |
Selective harvesting is via set (collection grouping) and from/until (datestamp range) parameters. Large result sets use a resumptionToken for cursor-based pagination.
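To make the harvesting parameters concrete, here is a minimal sketch of how a client might assemble selective-harvest requests. The class name `OaiRequestUrls`, its method names, and the `example.org` base URL used in the usage note are illustrative assumptions, not part of any existing PDS or harvester code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: builds OAI-PMH request URLs for selective harvesting.
final class OaiRequestUrls {

    /** First ListRecords request; null set/from/until arguments are simply omitted. */
    static String listRecords(String baseUrl, String metadataPrefix,
                              String set, String from, String until) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("verb", "ListRecords");
        params.put("metadataPrefix", metadataPrefix);
        if (set != null) params.put("set", set);
        if (from != null) params.put("from", from);
        if (until != null) params.put("until", until);
        return baseUrl + "?" + encode(params);
    }

    /** Follow-up request: resumptionToken is an exclusive argument, so only verb and token are sent. */
    static String resume(String baseUrl, String verb, String token) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("verb", verb);
        params.put("resumptionToken", token);
        return baseUrl + "?" + encode(params);
    }

    // Joins key=value pairs with '&', percent-encoding each value.
    private static String encode(Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(e.getKey() + "=" + urlEncode(e.getValue()));
        }
        return query.toString();
    }

    private static String urlEncode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex); // UTF-8 is always available
        }
    }
}
```

For example, `OaiRequestUrls.listRecords("https://example.org/oai", "oai_dc", "geo", "2024-01-01", null)` would request Dublin Core records in the `geo` set changed on or after that date; each later page is fetched with `resume(...)` and the token from the previous response.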
API URL Namespace Design Rationale
Interaction models vs. query syntaxes
OAI-PMH, SQL, and the existing registry query language (?q=) are not interchangeable filter syntaxes over the same resources — they are distinct interaction models with fundamentally different contracts:
| Aspect | Registry API (/products) | OAI-PMH | Hypothetical SQL |
| --- | --- | --- | --- |
| Request style | Resource-oriented GET | Verb-dispatch GET/POST | Query-body POST |
| Pagination | ?start / ?limit | resumptionToken cursor | Cursor / offset |
| Response shape | JSON or PDS4 XML | OAI-PMH XML envelope | Tabular |
| Error model | HTTP 4xx + JSON body | OAI-PMH XML <error> element | Varies |
| Protocol version | API semver | OAI-PMH 2.0 | SQL dialect |
Because the interaction models differ this deeply, they cannot cleanly share a URL namespace. A query parameter like ?syntax=oai would be a flag that silently changes the entire behavior of an endpoint — worse for clients than an explicit URL distinction.
Recommended URL structure: separate path prefixes, same deployment
The cleanest approach is separate versioned path prefixes per interaction model, all served by the same application:
https://pds.nasa.gov/api/registry/1/ ← current resource API (/products, /classes, ...)
https://pds.nasa.gov/api/oai/2.0/ ← OAI-PMH (version tracks OAI-PMH protocol version)
https://pds.nasa.gov/api/sql/1/ ← hypothetical SQL interface (future)
This pattern is established precedent: OpenSearch itself exposes /_search, /_cat, and /_cluster as separate API surfaces with different conventions, all served by one process. Each prefix can carry its own version number, its own OpenAPI/spec document, and its own error semantics — without requiring separate deployments or repositories.
Why not /_oai? The leading underscore convention (from CouchDB/OpenSearch) signals internal or administrative endpoints, not alternate external protocols. A clean prefix like /oai/ is more accurate and client-friendly.
Why not separate services/repos? Operational overhead is not justified when the only difference is URL namespace and response serialization. The underlying data retrieval (OpenSearch, SearchRequestFactory) is shared. Separate deployments make sense only if the protocols have genuinely divergent scaling, SLA, or team ownership requirements.
This requirement scopes the OAI-PMH work to the /oai/2.0/ prefix and should be treated as the first instance of this multi-protocol pattern, establishing the convention for any future interfaces (SQL, SPARQL, etc.).
Proposed Implementation Options
Option A — Dedicated /oai Controller (Recommended)
A new OaiPmhController handles GET and POST to /oai, dispatching on the ?verb= query parameter. This follows the standard OAI-PMH base URL pattern expected by all harvesters.
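The verb dispatch described here can be sketched in plain Java, independent of Spring. This is only an illustration of the routing idea: `OaiVerbDispatcher`, the handler labels it returns, and the choice to match verb names exactly as spelled in the protocol are all assumptions, and a real controller would call handler methods rather than return strings.

```java
// Hypothetical sketch of ?verb= routing; not the proposed controller itself.
final class OaiVerbDispatcher {

    /** The six OAI-PMH 2.0 verbs, spelled as they appear on the wire. */
    enum Verb { Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, GetRecord }

    /** Returns the matching verb, or null when the value is missing or not one of the six. */
    static Verb parse(String raw) {
        if (raw == null) {
            return null;
        }
        for (Verb v : Verb.values()) {
            if (v.name().equals(raw)) { // exact match: an assumption of this sketch
                return v;
            }
        }
        return null;
    }

    /** Returns a label for the handler that would run, or the OAI-PMH error code badVerb. */
    static String dispatch(String verbParam) {
        Verb verb = parse(verbParam);
        if (verb == null) {
            return "badVerb";
        }
        switch (verb) {
            case Identify:            return "handleIdentify";
            case ListMetadataFormats: return "handleListMetadataFormats";
            case ListSets:            return "handleListSets";
            case ListIdentifiers:     return "handleListIdentifiers";
            case ListRecords:         return "handleListRecords";
            case GetRecord:           return "handleGetRecord";
            default: throw new AssertionError(verb); // unreachable for the six verbs
        }
    }
}
```

In a Spring MVC controller the `verbParam` value would presumably arrive through `@RequestParam`, with the same lookup deciding which handler method runs.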
Key design points:
- OaiPmhController in gov.nasa.pds.api.registry.controllers: routes on ?verb= to handler methods; returns badVerb error for unknown verbs
- New OaiPmhXmlSerializer in the view layer: produces the OAI-PMH XML response envelope (namespace http://www.openarchives.org/OAI/2.0/)
- Reuses SearchRequestFactory and existing OpenSearch search infrastructure for data retrieval
- PDS data model mappings:
  - LIDVIDs → OAI-PMH record identifiers (URN form: urn:nasa:pds:<lidvid>)
  - Discipline nodes → OAI-PMH Sets (setSpec = node prefix, e.g., geo, atm)
  - Product classes → sub-sets (e.g., geo:Bundle, geo:Collection)
  - ops:Harvest_Info/ops:harvest_date_time → OAI-PMH datestamp
- Metadata formats to support:
  - oai_dc: Dublin Core (required by OAI-PMH spec); map PDS4 fields → 15 DC elements
  - oai_pds4: Native PDS4 XML label as the metadata record (custom schema)
- ResumptionToken: encode OpenSearch search_after cursor as base64 JSON; include completeListSize and cursor attributes per spec
- Error responses: badVerb, badArgument, cannotDisseminateFormat, idDoesNotExist, noRecordsMatch, noSetHierarchy, all as OAI-PMH XML error envelopes (not HTTP 4xx)
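The resumptionToken design point (a search_after cursor serialized as base64 JSON) might look roughly like the sketch below. The JSON field names, the two-element cursor (harvest timestamp plus LIDVID as a tiebreaker), and the class name `ResumptionTokens` are assumptions made for illustration; the real sort keys depend on how the OpenSearch query is ordered. A token that cannot be decoded would map to the protocol's badResumptionToken error code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical sketch: opaque resumption tokens wrapping a search_after cursor.
final class ResumptionTokens {

    /** Serializes the harvest state as JSON and wraps it in URL-safe base64. */
    static String encode(String metadataPrefix, long harvestEpochMillis, String lidvid) {
        String json = "{\"metadataPrefix\":\"" + escape(metadataPrefix) + "\","
                + "\"search_after\":[" + harvestEpochMillis + ",\"" + escape(lidvid) + "\"]}";
        return Base64.getUrlEncoder().withoutPadding()
                .encodeToString(json.getBytes(StandardCharsets.UTF_8));
    }

    /**
     * Recovers the JSON text from a token. Parsing the JSON itself is left to
     * whatever JSON library the application already uses.
     */
    static String decodeToJson(String token) {
        try {
            byte[] raw = Base64.getUrlDecoder().decode(token);
            return new String(raw, StandardCharsets.UTF_8);
        } catch (IllegalArgumentException e) {
            // Malformed tokens should surface as the OAI-PMH badResumptionToken error.
            throw new IllegalArgumentException("badResumptionToken", e);
        }
    }

    // Minimal JSON string escaping for backslashes and double quotes.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

URL-safe base64 without padding keeps the token free of `+`, `/`, and `=`, so it survives being passed back as a query parameter without extra encoding.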
Pros: Standard /oai URL pattern recognized by all harvesters; isolated controller keeps OAI-PMH logic separate; minimal new code; fully reuses existing search stack; matches codebase single-responsibility principle.
Cons: Verb dispatch is query-parameter-driven rather than path-based, which is slightly non-idiomatic for Spring MVC (but correct per OAI-PMH spec and easily implemented with @RequestParam).
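Because OAI-PMH errors travel inside the XML envelope rather than as HTTP 4xx status codes, the serializer needs an error path of its own. The following is a minimal sketch of such an envelope built with string concatenation; `OaiErrorEnvelope` and its `render` signature are hypothetical, and a production serializer would more likely use an XML library and include the schemaLocation attributes.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical sketch: renders an OAI-PMH error response body.
final class OaiErrorEnvelope {

    /**
     * Builds the envelope for a single error. The caller passes the clock value
     * so the output is deterministic; responseDate uses seconds granularity in UTC.
     */
    static String render(String baseUrl, String code, String message, Instant now) {
        String responseDate = now.truncatedTo(ChronoUnit.SECONDS).toString();
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<OAI-PMH xmlns=\"http://www.openarchives.org/OAI/2.0/\">\n"
                + "  <responseDate>" + responseDate + "</responseDate>\n"
                + "  <request>" + escape(baseUrl) + "</request>\n"
                + "  <error code=\"" + escape(code) + "\">" + escape(message) + "</error>\n"
                + "</OAI-PMH>\n";
    }

    // Escapes the characters that are unsafe in XML text and attribute values.
    private static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }
}
```

The controller would return this body with an ordinary success status, leaving it to the harvester to read the `code` attribute.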
Suggested Dublin Core field mapping:
| DC Element | PDS4 source field(s) |
| --- | --- |
| dc:identifier | LIDVID |
| dc:title | title |
| dc:description | abstract_desc |
| dc:date | ops:Harvest_Info/ops:harvest_date_time |
| dc:type | product class (e.g., Product_Bundle) |
| dc:publisher | discipline node label / "NASA Planetary Data System" |
| dc:creator | Investigation_Area/name or Observing_System_Component/name |
| dc:subject | Primary_Result_Summary/Science_Facets/discipline_name |
| dc:format | File/file_name extension or Encoding_Type |
| dc:source | parent LIDVID (from Internal_Reference) |
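As a rough illustration of the mapping above, the sketch below turns a flattened product map into an oai_dc metadata block. The input keys (`lidvid`, `title`, `abstract_desc`, `harvest_date_time`, `product_class`) are hypothetical stand-ins for however the registry exposes those PDS4 fields, only a subset of the table is covered, and `DublinCoreMapper` is an invented name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: maps a flattened product to an oai_dc XML fragment.
final class DublinCoreMapper {

    // DC element name -> key in the (assumed) flattened product map, in output order.
    private static final Map<String, String> FIELD_FOR_ELEMENT = new LinkedHashMap<>();
    static {
        FIELD_FOR_ELEMENT.put("dc:identifier", "lidvid");
        FIELD_FOR_ELEMENT.put("dc:title", "title");
        FIELD_FOR_ELEMENT.put("dc:description", "abstract_desc");
        FIELD_FOR_ELEMENT.put("dc:date", "harvest_date_time");
        FIELD_FOR_ELEMENT.put("dc:type", "product_class");
    }

    /** Emits one DC element per populated field; absent or empty fields are skipped. */
    static String toOaiDc(Map<String, String> product) {
        StringBuilder xml = new StringBuilder();
        xml.append("<oai_dc:dc xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\"")
           .append(" xmlns:dc=\"http://purl.org/dc/elements/1.1/\">\n");
        for (Map.Entry<String, String> e : FIELD_FOR_ELEMENT.entrySet()) {
            String value = product.get(e.getValue());
            if (value == null || value.isEmpty()) {
                continue;
            }
            xml.append("  <").append(e.getKey()).append('>')
               .append(escape(value))
               .append("</").append(e.getKey()).append(">\n");
        }
        xml.append("</oai_dc:dc>\n");
        return xml.toString();
    }

    // Escapes the characters that are unsafe in XML text content.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Skipping unpopulated fields keeps the record valid for products that lack, say, an abstract, since every Dublin Core element is optional in oai_dc.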
Option B — Separate OAI-PMH Microservice
A standalone Spring Boot application (new repository) that calls existing registry-api REST endpoints and reformats responses into OAI-PMH XML.
Pros: Complete decoupling; independently versioned and deployed.
Cons: Extra network hop per request; must be kept in sync with registry-api changes; adds a second repo/deployment to maintain; duplicates search logic.
Option C — Content-Type Negotiation on Existing Endpoints
Add application/oai+xml as an Accept header option on /products endpoints, adding an OAI-PMH serializer alongside existing ones.
Pros: No new URL namespace; reuses existing API versioning.
Cons: OAI-PMH is verb-driven, not resource-driven — Identify, ListSets, and ListMetadataFormats have no natural mapping to /products; resumptionToken pagination semantics conflict with the existing ?start/?limit model; complicates existing serializers significantly.
Recommendation
Option A, served at /oai/2.0/ per the URL namespace rationale above. OAI-PMH has well-defined protocol semantics that are fundamentally verb-driven (not REST resource-driven). A dedicated OaiPmhController keeps those semantics cleanly isolated, fully reuses the existing OpenSearch/SearchRequestFactory infrastructure, and presents the /oai base URL that all standard harvesting tools expect. It is the minimal, correct, and maintainable approach — and establishes the path-prefix-per-interaction-model pattern for any future interfaces.
Acceptance Criteria
- GET /oai/2.0/?verb=Identify returns repository identity XML per OAI-PMH 2.0 spec
- GET /oai/2.0/?verb=ListMetadataFormats returns at minimum oai_dc; optionally oai_pds4
- GET /oai/2.0/?verb=ListSets returns discipline nodes (and optionally product class sub-sets)
- GET /oai/2.0/?verb=ListIdentifiers supports metadataPrefix, set, from, until, resumptionToken
- GET /oai/2.0/?verb=ListRecords supports same parameters; returns full records
- GET /oai/2.0/?verb=GetRecord&identifier=<id>&metadataPrefix=oai_dc returns a single record
- resumptionToken enables full cursor-based pagination through large result sets
- oai_dc metadata format passes OAI-PMH validator (e.g., OVAL)
- POST /oai/2.0/ behaves identically to GET /oai/2.0/ per spec requirement
- ... registry repository per Integration Testing Guide
Additional Context
- deletedRecord policy should be declared in Identify; recommend persistent if the registry tracks deleted LIDVIDs, otherwise no
- ... YYYY-MM-DDThh:mm:ssZ (seconds granularity preferred)
For Internal Dev Team To Complete
⚙️ Engineering Details
🎉 Integration & Test