Phase 2c follow-up: multi-selector InfoSpec extension #6

@gregoryfoster

Migrated from CannObserv/watcher#141 during the Archiver service extraction (CannObserv/watcher#149) on 2026-05-07. Paths in the body have been updated to reflect the new archiver repo layout.

Context

The current InfoSpec v1 JSON Schema (Archiver service) supports a single `extraction.selector` string. Real targets often need multiple selectors composed together (e.g. capture two distinct sections of a page, or a primary selector with a fallback).

Listed under "Open follow-ups (deferred)" in the wrap-up of docs/plans/2026-05-04-watcher-phase2c-cutover-plan.md.

What to do

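  • Extend the InfoSpec v1 JSON Schema (`src/core/specs/v1.json`, or its actual location if different) so `extraction.selector` accepts `string | array` (an array of selector strings).
  • Decide and document the semantics: union (concatenate all matches in DOM order), priority (use the first selector that yields a non-empty match), or intersection (keep only elements matched by every selector). Recommended: union.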
  • Update the InfoSpec validator + tests on the Archiver service.
  • Update Watcher's `_extraction_config_from_spec` (src/workers/pipeline.py) so the array flows through to the HTML extractor, which already accepts `selectors: list[str]`.
  • Migration: existing single-selector specs remain valid because a plain string is still allowed; no DB migration is needed.
  • Tests: round-trip a multi-selector spec and verify primary resolution and extraction.
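
To make the options above concrete, here is a minimal sketch of how `string | array` could be normalized and how union differs from priority. Everything in it is an assumption for illustration: `SELECTOR_SCHEMA`, `normalize_selectors`, `combine_union`, `combine_priority`, and the toy `fake_match` lookup are hypothetical names, not the existing Archiver or Watcher code, and the schema fragment is only one possible shape for `extraction.selector`.

```python
from typing import Callable, Union

Selector = Union[str, list[str]]
# A match is (position in DOM order, extracted text); this shape is an assumption.
Match = tuple[int, str]

# Hypothetical schema fragment for `extraction.selector` (not the real v1.json).
SELECTOR_SCHEMA = {
    "oneOf": [
        {"type": "string", "minLength": 1},
        {"type": "array", "items": {"type": "string", "minLength": 1}, "minItems": 1},
    ]
}


def normalize_selectors(selector: Selector) -> list[str]:
    """Turn the string-or-array spec value into a non-empty list of strings."""
    if isinstance(selector, str) and selector:
        return [selector]
    if (
        isinstance(selector, list)
        and selector
        and all(isinstance(s, str) and s for s in selector)
    ):
        return list(selector)
    raise ValueError("selector must be a non-empty string or list of non-empty strings")


def combine_union(selectors: list[str], match: Callable[[str], list[Match]]) -> list[str]:
    """Union: every match from every selector, de-duplicated, in DOM order."""
    seen: set[int] = set()
    hits: list[Match] = []
    for sel in selectors:
        for pos, text in match(sel):
            if pos not in seen:  # an element matched twice is kept once
                seen.add(pos)
                hits.append((pos, text))
    return [text for _, text in sorted(hits)]


def combine_priority(selectors: list[str], match: Callable[[str], list[Match]]) -> list[str]:
    """Priority: matches of the first selector that yields anything."""
    for sel in selectors:
        found = match(sel)
        if found:
            return [text for _, text in sorted(found)]
    return []


# Toy stand-in for a real DOM query, keyed by selector string.
FAKE_PAGE = {
    "#summary": [(2, "Summary")],
    ".notice": [(0, "Notice A"), (5, "Notice B")],
    "#missing": [],
}


def fake_match(selector: str) -> list[Match]:
    return FAKE_PAGE.get(selector, [])


if __name__ == "__main__":
    print(combine_union(normalize_selectors(["#summary", ".notice"]), fake_match))
    print(combine_priority(normalize_selectors(["#missing", "#summary"]), fake_match))
```

Under union, the order of selectors in the array does not affect the output order, since results are sorted by DOM position. Under priority, the array order is the fallback order. That difference is probably the main thing the documented semantics need to pin down.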
