glossarist/iso-terms

iso-terms

Batch-extract terminology from ISO standards published on the ISO Online Browsing Platform (OBP) into a Glossarist dataset.

Part of the Glossarist project.

Fetches STS XML from the OBP, extracts terminology via the Glossarist STS importer, and produces grouped YAML — one file per concept, all languages.

Works for any ISO technical committee, any set of languages, or individual documents.

Architecture

Three-stage pipeline. Each stage is idempotent — safe to re-run.

ISO Open Data ──► catalog/  (metadata JSONL, ~50k deliverables)
                       │
                       ▼  filter by committee / references
                       │
                  cache/ISO-19101-1-2014/en.xml   ◄── persistent checkpoint
                  cache/ISO-19101-1-2014/fr.xml   ◄── skip if exists
                  ...
                       │
                       ▼  always rebuilds from cache
                       │
                  dataset/*.yaml  (grouped YAML: 1 file, all languages)
Stage    Class              Input          Output    Network
Catalog  IsoTerms::Catalog  ISO Open Data  catalog/  ~1 min
Fetch    IsoTerms::Fetcher  catalog/       cache/    varies
Build    IsoTerms::Builder  cache/         dataset/  none

Stages

Catalog downloads the ISO Open Data deliverables metadata — a single JSONL file (~50k entries) cached at catalog/iso_deliverables_metadata.jsonl. Selection by committee or reference list happens in memory after download; switching committees does not require re-downloading. See Selection modes.
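The in-memory selection described above can be pictured with a minimal sketch. It is not the IsoTerms::Catalog implementation: the helper names and the "reference" field are assumptions, the sample records are made up for illustration, and only ownerCommittee is a field name taken from this README.

```ruby
require "json"

# Parse JSONL text into an array of hashes, skipping blank lines.
def parse_catalog(jsonl)
  jsonl.each_line.map(&:strip).reject(&:empty?).map { |line| JSON.parse(line) }
end

# Filter in memory by committee; nil means "no filter".
def by_committee(entries, committee)
  return entries if committee.nil?

  entries.select { |entry| entry["ownerCommittee"] == committee }
end

# Illustrative records only; the real file has many more fields.
sample = <<~JSONL
  {"reference": "ISO 19115-1:2014", "ownerCommittee": "ISO/TC 211"}
  {"reference": "ISO 99999:2099", "ownerCommittee": "ISO/TC 204"}
JSONL

entries = parse_catalog(sample)
tc211 = by_committee(entries, "ISO/TC 211")
everything = by_committee(entries, nil)
```

Because the filter runs over already-parsed entries, changing the committee only changes the argument, which is why no second download is needed.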

Fetch downloads STS XML from the ISO Online Browsing Platform for each selected deliverable and language. Files land at cache/<reference>/<lang>.xml. The OBP assigns per-deliverable version segments (v1, v2, ...) that are not knowable from metadata; the Fetcher auto-discovers the correct version by probing v1 through v5. Already-cached files are skipped; see Resumability.
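A rough sketch of the version probing, under stated assumptions: the real Fetcher's interface and the OBP URL layout are not shown in this README, so the network call is replaced by a hypothetical callable that returns the XML body or nil.

```ruby
# Probe version segments v1..v5 and return the first one that resolves.
# `probe` is any callable taking a version string and returning the XML
# body, or nil when that version does not exist (hypothetical interface).
def discover_version(probe, max_version: 5)
  (1..max_version).each do |n|
    version = "v#{n}"
    body = probe.call(version)
    return [version, body] if body
  end
  nil
end

# Stand-in for the network call: only "v3" exists in this example.
fake_obp = ->(version) { version == "v3" ? "<standard/>" : nil }

found = discover_version(fake_obp)
missing = discover_version(->(_version) { nil })
```

Returning nil when no version resolves lets the caller decide whether a missing deliverable is an error or just a gap to record.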

Build reads cached STS XML, extracts terminology using the Glossarist STS importer (Glossarist::Sts::TermExtractor + Glossarist::Sts::TermMapper), groups concepts by ID across languages, prefixes each ID with the standard key (see Concept ID scheme), and writes one grouped YAML file per concept to dataset/.
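The grouping step can be sketched as follows. This does not use the Glossarist STS importer; extracted concepts are modelled as plain hashes with assumed keys (:label, :lang, :designation), and the "3.2" entry is a placeholder.

```ruby
# Group per-language concepts under a prefixed concept ID.
# Returns { "<standard key>:<label>" => { lang => designation } }.
def group_concepts(extracted, standard_key)
  extracted
    .group_by { |concept| "#{standard_key}:#{concept[:label]}" }
    .transform_values do |concepts|
      concepts.to_h { |concept| [concept[:lang], concept[:designation]] }
    end
end

# Assumed shape of what extraction yields, one hash per language.
extracted = [
  { label: "3.1", lang: "eng", designation: "geographic information" },
  { label: "3.1", lang: "fra", designation: "information géographique" },
  { label: "3.2", lang: "eng", designation: "example term" }
]

grouped = group_concepts(extracted, "19101-1:2014")
```

Each key of the result corresponds to one output file, and each value holds every language found for that concept.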

Internal design

IsoTerms::Pipeline is a thin composition root over the three stage classes above. Domain rules live in value objects:

Class Responsibility

IsoTerms::Pipeline

Composition root. Owns the directory layout (catalog/, cache/, dataset/), constructs the three stage classes, and exposes run / refresh / fetch / build / status.

IsoTerms::Selection

Value object. Encapsulates the deliverable-selection priority rule (references > committee > default).

IsoTerms::StandardKey

Value object. Parses an ISO reference string ("ISO 19101-1:2014") into the canonical key form ("19101-1:2014").

IsoTerms::ConceptId

Value object. Pairs a StandardKey with a term label ("3.1") to form the full concept identifier ("19101-1:2014:3.1").

Installation

bundle install

Usage

CLI

All commands accept --tc, --languages, and positional REFERENCES.

# Full pipeline — default: ISO/TC 211, English + French
iso-terms all

# Any technical committee
iso-terms all --tc "ISO/TC 204"
iso-terms status --tc "ISO/TC 46"

# All ISO deliverables (no committee filter)
iso-terms status --tc all

# Single document
iso-terms fetch "ISO 19115-1:2014"

# Batch documents
iso-terms fetch "ISO 19115-1:2014" "ISO 19103:2015" "ISO 6709:2022"

# References from a file (one per line)
iso-terms fetch --from-file documents.txt

# Language selection
iso-terms all --tc "ISO/TC 211" --languages en,fr,ru

# Individual stages
iso-terms catalog          # download ISO metadata
iso-terms fetch            # fetch STS XML from OBP
iso-terms build            # build Glossarist dataset
iso-terms status           # show pipeline status

# Help (Thor convention: help before command name)
iso-terms help all

Rake

rake              # catalog → fetch → build
rake catalog      # download ISO metadata
rake fetch        # fetch STS XML
rake build        # build dataset
rake status       # show pipeline status
rake clean        # delete dataset (keeps cache)
rake reset        # delete cache + dataset
rake clobber      # delete everything

Ruby API

# Default: ISO/TC 211, English + French
pipeline = IsoTerms::Pipeline.new(root_dir: Dir.pwd)
pipeline.run                     # all stages
pipeline.refresh(force: true)    # re-download catalog
pipeline.fetch                   # fetch missing XMLs
pipeline.build                   # rebuild dataset from cache
pipeline.status                  # print status

# Custom committee
pipeline = IsoTerms::Pipeline.new(root_dir: Dir.pwd, committee: "ISO/TC 204")

# Custom languages
pipeline = IsoTerms::Pipeline.new(root_dir: Dir.pwd, languages: %w[en ru])

# Ad-hoc document references
pipeline = IsoTerms::Pipeline.new(
  root_dir: Dir.pwd,
  references: ["ISO 19115-1:2014", "ISO 19103:2015"]
)

Selection modes

The catalog selects deliverables in priority order. This rule is encapsulated in IsoTerms::Selection; the rest of the codebase accepts a Selection instance rather than re-encoding the rule.

  1. References (references: / positional args) — exact match against ISO reference strings (e.g., "ISO 19115-1:2014"). Case-insensitive. Committee is ignored.

  2. Committee (committee: / --tc) — filter by ownerCommittee (e.g., "ISO/TC 211", "ISO/TC 204"). Use nil or --tc all for no filter.

  3. Default: ISO/TC 211 when neither is specified.
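The priority rule above can be sketched as a single predicate. This is not the IsoTerms::Selection class; the method name and the "reference" field are assumptions chosen for illustration.

```ruby
# Priority: references > committee > default ("ISO/TC 211").
# Passing committee: nil disables the committee filter.
def selected?(entry, references: [], committee: "ISO/TC 211")
  unless references.empty?
    wanted = references.map(&:downcase)
    # Case-insensitive exact match; committee is ignored in this mode.
    return wanted.include?(entry["reference"].downcase)
  end
  return true if committee.nil?

  entry["ownerCommittee"] == committee
end

entry = { "reference" => "ISO 19115-1:2014", "ownerCommittee" => "ISO/TC 211" }

by_default   = selected?(entry)
by_other_tc  = selected?(entry, committee: "ISO/TC 204")
by_reference = selected?(entry, references: ["iso 19115-1:2014"], committee: "ISO/TC 204")
unfiltered   = selected?(entry, committee: nil)
```

Keeping the rule in one place is the point of the value object: callers pass a selection around instead of repeating the branching.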

Resumability

The cache directory is the pipeline’s checkpoint. Once an STS XML file lands on disk, it is never fetched again.

  • Skip a deliverable: no action needed; anything already cached is skipped

  • Re-fetch one: delete its directory under cache/

  • Re-fetch all: rake reset then rake fetch

  • Rebuild dataset: rake build (fast, reads from cache)
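A minimal sketch of the checkpoint behaviour, using a temporary directory in place of cache/. The helper names are hypothetical, and the directory naming rule (spaces, colons, and slashes become hyphens) is inferred from the cache/ISO-19101-1-2014/en.xml example in the diagram; handling of slashes in particular is an assumption.

```ruby
require "fileutils"
require "tmpdir"

# "ISO 19101-1:2014" + "en" -> "<cache_dir>/ISO-19101-1-2014/en.xml"
def cache_path(cache_dir, reference, lang)
  File.join(cache_dir, reference.gsub(%r{[ :/]}, "-"), "#{lang}.xml")
end

# Run the block (the download) only when the file is not cached yet.
def fetch_once(cache_dir, reference, lang)
  path = cache_path(cache_dir, reference, lang)
  return :cached if File.exist?(path)

  FileUtils.mkdir_p(File.dirname(path))
  File.write(path, yield)
  :fetched
end

results = Dir.mktmpdir do |dir|
  first  = fetch_once(dir, "ISO 19101-1:2014", "en") { "<standard/>" }
  # The second call must not download again, so its block never runs.
  second = fetch_once(dir, "ISO 19101-1:2014", "en") { raise "should not run" }
  [first, second]
end
```

Deleting a deliverable's directory makes File.exist? return false again, which is how a single re-fetch is triggered.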

Output Format

The dataset uses Glossarist grouped YAML format — one .yaml file per concept containing all language localizations:

# dataset/19101-1_2014_3.1.yaml
---
id: 19101-1:2014:3.1
data:
  identifier: 19101-1:2014:3.1
  localized_concepts:
    eng: <uuid>
    fra: <uuid>
status: valid
---
id: 19101-1:2014:3.1
data:
  language_code: eng
  terms:
    - designation: geographic information
      type: expression
      normative_status: preferred
  definition:
    - content: reference model for geographic information
---
id: 19101-1:2014:3.1
data:
  language_code: fra
  terms:
    - designation: information géographique
      type: expression
      normative_status: preferred
  definition:
    - content: "modèle de référence pour l'information géographique"
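A grouped file like the one above is a multi-document YAML stream, so it can be read with the standard library alone. The sketch below assumes the first document is the concept and the rest are localizations, as in the example; it uses an inline string with a trimmed copy of that example rather than a real dataset file.

```ruby
require "yaml"

grouped_yaml = <<~YAML
  ---
  id: 19101-1:2014:3.1
  data:
    identifier: 19101-1:2014:3.1
  status: valid
  ---
  id: 19101-1:2014:3.1
  data:
    language_code: eng
    terms:
      - designation: geographic information
  ---
  id: 19101-1:2014:3.1
  data:
    language_code: fra
    terms:
      - designation: information géographique
YAML

# load_stream returns one Ruby object per YAML document.
concept, *localizations = YAML.load_stream(grouped_yaml)

# Map each language code to its first designation.
designations = localizations.to_h do |doc|
  [doc["data"]["language_code"], doc["data"]["terms"].first["designation"]]
end
```

For a file on disk, the same call works on File.read(path).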

Concept ID scheme

Concept IDs are prefixed with the standard number for global uniqueness. Implemented by IsoTerms::StandardKey (parses the ISO reference) and IsoTerms::ConceptId (combines the standard key with the term label):

  • ISO 19101-1:2014 term 3.1 → 19101-1:2014:3.1

  • ISO 6709:2008 term 3.1.1 → 6709:2008:3.1.1

  • ISO/TS 19130-2:2014 term 3.1 → TS-19130-2:2014:3.1
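The three mappings above are consistent with a simple rule: drop the leading "ISO", keep any deliverable type such as TS joined by a hyphen, then append the term label. The sketch below encodes that inferred rule with hypothetical helper names; it is not the IsoTerms::StandardKey or IsoTerms::ConceptId code, and other reference shapes may need more cases.

```ruby
# "ISO 19101-1:2014"    -> "19101-1:2014"
# "ISO/TS 19130-2:2014" -> "TS-19130-2:2014"
def standard_key(reference)
  reference.strip.sub(%r{\AISO[ /]}, "").tr(" ", "-")
end

# Standard key plus term label, joined by a colon.
def concept_id(reference, label)
  "#{standard_key(reference)}:#{label}"
end

# File name form seen in the Output Format example: colons become underscores.
def dataset_filename(id)
  "#{id.tr(':', '_')}.yaml"
end

id = concept_id("ISO 19101-1:2014", "3.1")
filename = dataset_filename(id)
```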

Known Gaps (ISO/TC 211)

Of ~148 TC 211 deliverables in the cache, 18 yield zero terms. Three categories (documented in catalog/gaps.yaml):

Permanent gaps — no data from any source (5 deliverables)

These standards have no terms available. The OBP preview is truncated to front matter only, and no newer edition exists:

Deliverable        Title (from scope)
ISO 19108:2002     Geographic information — Temporal schema
ISO 19137:2007     Geographic information — Core profile of the spatial schema
ISO 19141:2008     Geographic information — Schema for moving features
ISO 19144-1:2009   Geographic information — Classification systems — Part 1
ISO/TR 19121:2000  Geographic information — Imagery, gridded and coverage data

Covered by newer editions (10 deliverables)

Terms from these older editions are available via superseding editions already in the dataset.

Genuinely no terms defined (3 deliverables)

These standards correctly yield zero terms — they delegate definitions to other standards or explicitly state none are defined.

Dependencies
