Migrated from CannObserv/watcher#141 during the Archiver service extraction (CannObserv/watcher#149) on 2026-05-07. Paths in the body have been updated to reflect the new archiver repo layout.
Context
The current InfoSpec v1 JSON Schema (Archiver service) supports a single `extraction.selector` string. Real targets often need multiple selectors composed together (e.g. capture two distinct sections of a page, or a primary selector with a fallback).
Listed under "Open follow-ups (deferred)" in the wrap-up of docs/plans/2026-05-04-watcher-phase2c-cutover-plan.md.
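For concreteness, the two spec shapes under discussion might look like the sketch below. Everything here other than the `extraction.selector` field is an assumption for illustration: `SELECTOR_SCHEMA` is a guess at what the schema fragment could look like, `normalize_selectors` is a hypothetical helper (not existing Archiver or Watcher code), and the selector strings are made up. The check is written by hand so the sketch has no third-party dependencies.

```python
from typing import Any

# Hypothetical schema fragment for `extraction.selector`; the real v1 schema
# may be structured differently.
SELECTOR_SCHEMA: dict[str, Any] = {
    "oneOf": [
        {"type": "string", "minLength": 1},
        {
            "type": "array",
            "items": {"type": "string", "minLength": 1},
            "minItems": 1,
        },
    ]
}


def normalize_selectors(selector: Any) -> list[str]:
    """Return the selector value as a non-empty list of non-empty strings.

    Mirrors SELECTOR_SCHEMA by hand: a bare string becomes a one-item list,
    a list is checked item by item, and anything else is rejected.
    """
    if isinstance(selector, str):
        candidates = [selector]
    elif isinstance(selector, list):
        candidates = selector
    else:
        raise TypeError("extraction.selector must be a string or a list of strings")
    if not candidates:
        raise ValueError("extraction.selector must contain at least one selector")
    for item in candidates:
        if not isinstance(item, str) or not item:
            raise ValueError("each selector must be a non-empty string")
    return list(candidates)


# Shape accepted today: a single selector string (illustrative value).
legacy_spec = {"extraction": {"selector": "main article"}}

# Proposed shape: several selectors in one spec (illustrative values).
multi_spec = {"extraction": {"selector": ["#summary", "#changelog"]}}

legacy_selectors = normalize_selectors(legacy_spec["extraction"]["selector"])
multi_selectors = normalize_selectors(multi_spec["extraction"]["selector"])
```

Normalizing to a list at the boundary would let downstream code handle only one shape, which lines up with the HTML extractor already taking `selectors: list[str]`.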
What to do
- Extend InfoSpec v1 JSON Schema (`src/core/specs/v1.json` or wherever it lives) so `extraction.selector` accepts either a string or an array of strings.
- Decide and document semantics: union (concatenate all matches in DOM order), priority (use the first selector that yields a non-empty match), or intersection (keep only nodes matched by every selector). Recommended: union.
- Update the InfoSpec validator + tests on the Archiver service.
- Update Watcher's `_extraction_config_from_spec` (src/workers/pipeline.py) and HTML extractor (already accepts `selectors: list[str]`) to flow the array through.
- Migration: existing single-selector specs are still valid (string is allowed); no DB migration needed.
- Tests: round-trip a multi-selector spec, then verify primary resolution and extraction.
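The three candidate semantics can be compared with a small sketch. It assumes each selector's matches are already available as a list of node positions in DOM order; that representation, the `combine_matches` name, and the `mode` parameter are all hypothetical and only serve to make the trade-off visible. This is not the extractor's actual implementation.

```python
from typing import Literal

Mode = Literal["union", "priority", "intersection"]


def combine_matches(
    matches_per_selector: list[list[int]], mode: Mode = "union"
) -> list[int]:
    """Combine per-selector matches into one list of node positions.

    Each inner list holds the DOM-order positions matched by one selector,
    in the order the selectors appear in the spec.
    """
    if mode == "union":
        # Every node matched by any selector, deduplicated, in DOM order.
        return sorted({pos for matches in matches_per_selector for pos in matches})
    if mode == "priority":
        # The first selector with at least one match wins; later ones are ignored.
        for matches in matches_per_selector:
            if matches:
                return sorted(set(matches))
        return []
    if mode == "intersection":
        # Only nodes matched by every selector.
        if not matches_per_selector:
            return []
        common = set(matches_per_selector[0])
        for matches in matches_per_selector[1:]:
            common &= set(matches)
        return sorted(common)
    raise ValueError(f"unknown mode: {mode!r}")


# Two selectors: the first matches nodes 4 and 7, the second matches 2 and 7.
both_match = [[4, 7], [2, 7]]
union_result = combine_matches(both_match, "union")                # [2, 4, 7]
priority_result = combine_matches(both_match, "priority")          # [4, 7]
intersection_result = combine_matches(both_match, "intersection")  # [7]

# Fallback case: the primary selector matches nothing.
fallback_result = combine_matches([[], [2, 7]], "priority")        # [2, 7]
```

Under union, a node matched by several selectors appears once, and the output order follows the page rather than the order of selectors in the spec. Priority is the only mode that covers the "primary selector with a fallback" case directly, which may be worth weighing when the semantics are documented.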
References