Batch-extract terminology from ISO standards published on the ISO Online Browsing Platform (OBP) into a Glossarist dataset.
Part of the Glossarist project.
Fetches STS XML from the OBP, extracts terminology via the Glossarist STS importer, and produces grouped YAML — one file per concept, all languages.
Works for any ISO technical committee, any set of languages, or individual documents.
Three-stage pipeline. Each stage is idempotent — safe to re-run.
ISO Open Data ──► catalog/ (metadata JSONL, ~50k deliverables)
│
▼ filter by committee / references
│
cache/ISO-19101-1-2014/en.xml ◄── persistent checkpoint
cache/ISO-19101-1-2014/fr.xml ◄── skip if exists
...
│
▼ always rebuilds from cache
│
dataset/*.yaml (grouped YAML: 1 file, all languages)

| Stage | Class | Input | Output | Network |
|---|---|---|---|---|
| Catalog | | ISO Open Data | | ~1 min |
| Fetch | | | | varies |
| Build | | | | none |
Catalog downloads the ISO Open Data deliverables metadata — a single JSONL
file (~50k entries) cached at catalog/iso_deliverables_metadata.jsonl.
Selection by committee or reference list happens in memory after download;
switching committees does not require re-downloading. See Selection modes.
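The in-memory filtering described above can be sketched as follows. This is a minimal illustration, not the project's Catalog code: the `ownerCommittee` field name comes from this README, while the `reference` field name, the helper names, and the sample records (including the `EXAMPLE-A` placeholder) are assumptions made for the example.

```ruby
require "json"

# Parse JSONL text (one JSON object per line) into an array of hashes.
def parse_jsonl(text)
  text.each_line.map(&:strip).reject(&:empty?).map { |line| JSON.parse(line) }
end

# Keep entries owned by the given committee; nil means "no filter".
def filter_by_committee(entries, committee)
  return entries if committee.nil?

  entries.select { |entry| entry["ownerCommittee"] == committee }
end

# Illustrative records only; the real file holds ~50k entries.
SAMPLE_JSONL = <<~JSONL
  {"reference": "ISO 19115-1:2014", "ownerCommittee": "ISO/TC 211"}
  {"reference": "EXAMPLE-A", "ownerCommittee": "ISO/TC 204"}
  {"reference": "ISO 6709:2022", "ownerCommittee": "ISO/TC 211"}
JSONL

entries = parse_jsonl(SAMPLE_JSONL)
tc211 = filter_by_committee(entries, "ISO/TC 211")
puts tc211.map { |entry| entry["reference"] }.inspect
```

Because the parsed entries stay in memory, calling `filter_by_committee` again with a different committee needs no second download.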
Fetch downloads STS XML from the ISO Online Browsing Platform for each
selected deliverable and language. Files land at
cache/<reference>/<lang>.xml. The OBP assigns per-deliverable version
segments (v1, v2, …) that are not knowable from metadata; the
Fetcher auto-discovers the correct version by probing v1 through v5.
Already-cached files are skipped — see Resumability.
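A rough sketch of the version probing described above, assuming the download is injected as a callable so the example runs without network access. The method name `discover_version` and the stub are hypothetical; the real Fetcher may probe differently.

```ruby
# Try v1..v5 in order and return [version, xml] for the first hit, or nil.
# `fetch` is any callable taking (reference, lang, version) and returning
# the XML string, or nil when that version segment does not exist.
def discover_version(reference, lang, fetch, versions: 1..5)
  versions.each do |n|
    version = "v#{n}"
    xml = fetch.call(reference, lang, version)
    return [version, xml] if xml
  end
  nil
end

# Stub standing in for the OBP: pretends only v2 exists.
stub = lambda do |_reference, _lang, version|
  version == "v2" ? "<standard/>" : nil
end

version, xml = discover_version("ISO 19115-1:2014", "en", stub)
puts version
```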
Build reads cached STS XML, extracts terminology using the Glossarist
STS importer (Glossarist::Sts::TermExtractor + Glossarist::Sts::TermMapper),
groups concepts by ID across languages, prefixes each ID with the standard
key (see Concept ID scheme), and writes one grouped YAML file per
concept to dataset/.
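The grouping step can be pictured with a small sketch. It does not use the Glossarist STS importer; the `Term` struct and `group_concepts` helper are hypothetical stand-ins that only show how per-language records sharing a term label collapse into one prefixed concept.

```ruby
# Hypothetical flat record, one per extracted term and language.
Term = Struct.new(:label, :lang, :designation, keyword_init: true)

# Group records by term label, then prefix each label with the standard key.
# Returns { concept_id => { lang => designation } }.
def group_concepts(standard_key, terms)
  terms.group_by(&:label).to_h do |label, localized|
    ["#{standard_key}:#{label}", localized.to_h { |t| [t.lang, t.designation] }]
  end
end

terms = [
  Term.new(label: "3.1", lang: "eng", designation: "geographic information"),
  Term.new(label: "3.1", lang: "fra", designation: "information géographique")
]

grouped = group_concepts("19101-1:2014", terms)
puts grouped.keys.inspect
```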
IsoTerms::Pipeline is a thin composition root over the three stage
classes above. Domain rules live in value objects:
| Class | Responsibility |
|---|---|
| IsoTerms::Pipeline | Composition root. Owns the directory layout. |
| IsoTerms::Selection | Value object. Encapsulates the deliverable-selection priority rule (references > committee > default). |
| IsoTerms::StandardKey | Value object. Parses an ISO reference string. |
| IsoTerms::ConceptId | Value object. Pairs a standard key with a term label. |
All commands accept --tc, --languages, and positional REFERENCES.
# Full pipeline — default: ISO/TC 211, English + French
iso-terms all
# Any technical committee
iso-terms all --tc "ISO/TC 204"
iso-terms status --tc "ISO/TC 46"
# All ISO deliverables (no committee filter)
iso-terms status --tc all
# Single document
iso-terms fetch "ISO 19115-1:2014"
# Batch documents
iso-terms fetch "ISO 19115-1:2014" "ISO 19103:2015" "ISO 6709:2022"
# References from a file (one per line)
iso-terms fetch --from-file documents.txt
# Language selection
iso-terms all --tc "ISO/TC 211" --languages en,fr,ru
# Individual stages
iso-terms catalog # download ISO metadata
iso-terms fetch # fetch STS XML from OBP
iso-terms build # build Glossarist dataset
iso-terms status # show pipeline status
# Help (Thor convention: help before command name)
iso-terms help all

rake # catalog → fetch → build
rake catalog # download ISO metadata
rake fetch # fetch STS XML
rake build # build dataset
rake status # show pipeline status
rake clean # delete dataset (keeps cache)
rake reset # delete cache + dataset
rake clobber # delete everything

# Default: ISO/TC 211, English + French
pipeline = IsoTerms::Pipeline.new(root_dir: Dir.pwd)
pipeline.run # all stages
pipeline.refresh(force: true) # re-download catalog
pipeline.fetch # fetch missing XMLs
pipeline.build # rebuild dataset from cache
pipeline.status # print status
# Custom committee
pipeline = IsoTerms::Pipeline.new(root_dir: Dir.pwd, committee: "ISO/TC 204")
# Custom languages
pipeline = IsoTerms::Pipeline.new(root_dir: Dir.pwd, languages: %w[en ru])
# Ad-hoc document references
pipeline = IsoTerms::Pipeline.new(
  root_dir: Dir.pwd,
  references: ["ISO 19115-1:2014", "ISO 19103:2015"]
)

The catalog selects deliverables in priority order. This rule is
encapsulated in IsoTerms::Selection; the rest of the codebase accepts a
Selection instance rather than re-encoding the rule.
- References (references: / positional args): exact match against ISO reference strings (e.g., "ISO 19115-1:2014"). Case-insensitive. Committee is ignored.
- Committee (committee: / --tc): filter by ownerCommittee (e.g., "ISO/TC 211", "ISO/TC 204"). Use nil or --tc all for no filter.
- Default: ISO/TC 211 when neither is specified.
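The priority rule can be sketched as a small value object. This is an illustration only and not the actual IsoTerms::Selection: the class name `SelectionSketch`, the `UNSET` sentinel used to tell "not given" apart from an explicit nil, the `reference` field name, and the sample entries are all assumptions.

```ruby
# Hypothetical stand-in for the rule: references > committee > default.
class SelectionSketch
  DEFAULT_COMMITTEE = "ISO/TC 211"
  UNSET = Object.new.freeze # distinguishes "not given" from an explicit nil

  def initialize(references: [], committee: UNSET)
    @references = references.map(&:downcase)
    @committee = committee.equal?(UNSET) ? DEFAULT_COMMITTEE : committee
  end

  # Apply the rule to an array of metadata hashes.
  def call(entries)
    unless @references.empty?
      # References win; the committee is ignored. Matching is case-insensitive.
      return entries.select { |e| @references.include?(e["reference"].downcase) }
    end
    return entries if @committee.nil? # explicit nil: no filter

    entries.select { |e| e["ownerCommittee"] == @committee }
  end
end

# Illustrative entries; EXAMPLE-A is a placeholder, not a real deliverable.
ENTRIES = [
  { "reference" => "ISO 19115-1:2014", "ownerCommittee" => "ISO/TC 211" },
  { "reference" => "EXAMPLE-A", "ownerCommittee" => "ISO/TC 204" }
].freeze

by_reference = SelectionSketch.new(references: ["iso 19115-1:2014"], committee: "ISO/TC 204").call(ENTRIES)
by_committee = SelectionSketch.new(committee: "ISO/TC 204").call(ENTRIES)
by_default = SelectionSketch.new.call(ENTRIES)
unfiltered = SelectionSketch.new(committee: nil).call(ENTRIES)
puts by_reference.map { |e| e["reference"] }.inspect
```

Keeping the rule in one object means callers pass the selection around instead of repeating the three-way branch.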
The cache directory is the pipeline’s checkpoint. Once an STS XML file lands on disk, it is never fetched again.
- Skip a deliverable: it’s already cached
- Re-fetch one: delete its directory under cache/
- Re-fetch all: rake reset then rake fetch
- Rebuild dataset: rake build (fast, reads from cache)
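The checkpoint behaviour amounts to a file-existence check before each download. Below is a minimal sketch under that assumption; `fetch_once` is a hypothetical helper, the download is replaced by a block, and a temporary directory stands in for cache/.

```ruby
require "fileutils"
require "tmpdir"

# Write cache_dir/<dir_name>/<lang>.xml unless it already exists.
# The block performs the (expensive) download and is skipped on a cache hit.
# Returns :cached or :fetched.
def fetch_once(cache_dir, dir_name, lang)
  path = File.join(cache_dir, dir_name, "#{lang}.xml")
  return :cached if File.exist?(path)

  FileUtils.mkdir_p(File.dirname(path))
  File.write(path, yield)
  :fetched
end

results = Dir.mktmpdir do |cache|
  first = fetch_once(cache, "ISO-19101-1-2014", "en") { "<standard/>" }
  # Second call must not invoke the block: the file is already on disk.
  second = fetch_once(cache, "ISO-19101-1-2014", "en") { raise "should not run" }
  [first, second]
end
puts results.inspect
```

Deleting the deliverable's directory makes the next call return `:fetched` again, which matches the "Re-fetch one" step above.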
The dataset uses Glossarist grouped YAML format — one .yaml file per
concept containing all language localizations:
# dataset/19101-1_2014_3.1.yaml
---
id: 19101-1:2014:3.1
data:
  identifier: 19101-1:2014:3.1
  localized_concepts:
    eng: <uuid>
    fra: <uuid>
status: valid
---
id: 19101-1:2014:3.1
data:
  language_code: eng
  terms:
    - designation: geographic information
      type: expression
      normative_status: preferred
  definition:
    - content: reference model for geographic information
---
id: 19101-1:2014:3.1
data:
  language_code: fra
  terms:
    - designation: information géographique
      type: expression
      normative_status: preferred
  definition:
    - content: "modèle de référence pour l'information géographique"

Concept IDs are prefixed with the standard number for global uniqueness.
Implemented by IsoTerms::StandardKey (parses the ISO reference) and
IsoTerms::ConceptId (combines the standard key with the term label):
- ISO 19101-1:2014 term 3.1 → 19101-1:2014:3.1
- ISO 6709:2008 term 3.1.1 → 6709:2008:3.1.1
- ISO/TS 19130-2:2014 term 3.1 → TS-19130-2:2014:3.1
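The mapping in the examples above can be reproduced with a short sketch. The regular expression and helper names are hypothetical and only cover the reference shapes listed here; the real IsoTerms::StandardKey and IsoTerms::ConceptId may handle more cases. The filename helper is inferred from the single dataset filename shown in this README.

```ruby
# Matches "ISO 19101-1:2014" and "ISO/TS 19130-2:2014" style references.
REFERENCE_PATTERN = %r{\AISO(?:/(?<type>[A-Z]+))?\s+(?<number>[\d-]+):(?<year>\d{4})\z}

# "ISO/TS 19130-2:2014" -> "TS-19130-2:2014"; "ISO 6709:2008" -> "6709:2008"
def standard_key(reference)
  m = REFERENCE_PATTERN.match(reference)
  raise ArgumentError, "unrecognized reference: #{reference}" unless m

  [m[:type], "#{m[:number]}:#{m[:year]}"].compact.join("-")
end

# Combine the standard key with the term label.
def concept_id(reference, label)
  "#{standard_key(reference)}:#{label}"
end

# Assumed from the example filename: colons become underscores.
def dataset_filename(id)
  "#{id.tr(':', '_')}.yaml"
end

id = concept_id("ISO 19101-1:2014", "3.1")
puts id
puts dataset_filename(id)
```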
Of ~148 TC 211 deliverables in the cache, 18 yield zero terms. Three categories
(documented in catalog/gaps.yaml):
These standards have no terms available. The OBP preview is truncated to front matter only, and no newer edition exists:
| Deliverable | Title (from scope) |
|---|---|
| ISO 19108:2002 | Geographic information — Temporal schema |
| ISO 19137:2007 | Geographic information — Core profile of the spatial schema |
| ISO 19141:2008 | Geographic information — Schema for moving features |
| ISO 19144-1:2009 | Geographic information — Classification systems — Part 1 |
| ISO/TR 19121:2000 | Geographic information — Imagery, gridded and coverage data |
Terms from these older editions are available via superseding editions already in the dataset.
- glossarist: Glossarist concept model and STS import
- obp-access: ISO OBP content fetching