Migrated from CannObserv/watcher#148 during the Archiver service extraction (CannObserv/watcher#149) on 2026-05-07. Paths in the body have been updated to reflect the new archiver repo layout.
Context
Phase 3a's `propose_selectors` (src/core/tools/propose_selectors.py) returns CSS selectors only. The InfoSpec v1 schema accepts `extraction.algorithm: enum [css, xpath, jsonpath, regex, full_page]` — for non-CSS algorithms, an LLM agent following the authoring loop (`fetch_and_render` → `propose_selectors` → `preview_extraction` → `create_info_spec`) currently has no proposer for the selector-shaped field.
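To make the gap concrete, here is a minimal sketch that lists the schema's algorithm enum and reports which values lack a proposer. The names `ExtractionAlgorithm`, `PROPOSER_AVAILABLE`, and `missing_proposers` are hypothetical and are not taken from the archiver codebase; only the enum values come from the schema quoted above.

```python
from enum import Enum


class ExtractionAlgorithm(str, Enum):
    """Mirror of the InfoSpec v1 `extraction.algorithm` enum (hypothetical class name)."""
    CSS = "css"
    XPATH = "xpath"
    JSONPATH = "jsonpath"
    REGEX = "regex"
    FULL_PAGE = "full_page"


# Assumption for illustration: after Phase 3a only CSS has a proposer.
PROPOSER_AVAILABLE = {ExtractionAlgorithm.CSS}


def missing_proposers() -> list[str]:
    """Return the algorithm values an agent cannot get proposals for."""
    return [a.value for a in ExtractionAlgorithm if a not in PROPOSER_AVAILABLE]


print(missing_proposers())
```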
What to build
Per-algorithm proposers, gated on a request `algorithm` parameter (default `css`):
- xpath: same DOM walk and same scoring as the CSS proposer, but emit XPath expressions instead of CSS. Volatility heuristics carry over (e.g. hash-looking attribute values).
- jsonpath: applies when the target is JSON (sniffed from the `Content-Type: application/json` header returned by `fetch_and_render`). Walk the JSON tree and propose paths whose `str(value)` contains the description.
- regex: heuristic regex synthesis from the description plus one or two surrounding tokens. Lower confidence than the DOM-based proposers; flag with a lower base `stability_score`.
- full_page: trivial. Returns a single candidate with an empty selector and a score of 1.0, since full_page extracts everything.
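The two simplest proposers in the list above could be sketched roughly as follows. This is an illustration under assumptions, not the planned implementation: the `Candidate` shape, the function names, the `propose` dispatcher, and the 0.8 placeholder score are all hypothetical, and the xpath and regex proposers are deliberately left out.

```python
from dataclasses import dataclass
from typing import Any, Iterator


@dataclass
class Candidate:
    """Hypothetical candidate shape; the real response model may differ."""
    selector: str
    stability_score: float
    sample: str = ""


def _walk(node: Any, path: str = "$") -> Iterator[tuple[str, Any]]:
    """Yield (jsonpath, value) for every scalar leaf in a parsed JSON document."""
    if isinstance(node, dict):
        for key, value in node.items():
            # Simplification: keys containing dots or quotes would need bracket notation.
            yield from _walk(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from _walk(value, f"{path}[{index}]")
    else:
        yield path, node


def propose_jsonpath(document: Any, description: str) -> list[Candidate]:
    """Propose paths whose str(value) contains the description."""
    return [
        Candidate(selector=path, stability_score=0.8, sample=str(value))  # 0.8 is a placeholder
        for path, value in _walk(document)
        if description in str(value)
    ]


def propose_full_page() -> list[Candidate]:
    """full_page extracts everything: one candidate, empty selector, score 1.0."""
    return [Candidate(selector="", stability_score=1.0)]


def propose(algorithm: str = "css", *, document: Any = None, description: str = "") -> list[Candidate]:
    """Dispatch on the request `algorithm` parameter (default `css`)."""
    if algorithm == "jsonpath":
        return propose_jsonpath(document, description)
    if algorithm == "full_page":
        return propose_full_page()
    # css, xpath and regex are outside the scope of this sketch.
    raise NotImplementedError(f"no proposer sketched for {algorithm!r}")


doc = {"items": [{"title": "Annual report", "price": 12.5}]}
for candidate in propose("jsonpath", document=doc, description="Annual"):
    print(candidate.selector, candidate.sample)
```

Matching on `str(value)` keeps numeric and boolean leaves searchable with the same substring check, at the cost of occasional false positives on short descriptions.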
Why deferred
CSS covers >90% of currently-modelled targets in the wild. The other algorithms are real but each adds its own heuristic stack and test fixtures; bundling them into Phase 3a would have ballooned the slice.
Decision criteria
Land in priority order:
- xpath when an operator hits a target where CSS specificity is insufficient (deeply-nested or attribute-driven content).
- jsonpath when an InfoItem on a JSON API is needed (likely Phase 3b — Archive may consume JSON sources).
- regex / full_page as opportunistic adds.