Phase 3a follow-up: propose_selectors for non-CSS extraction algorithms #2

@gregoryfoster

Migrated from CannObserv/watcher#148 during the Archiver service extraction (CannObserv/watcher#149) on 2026-05-07. Paths in the body have been updated to reflect the new archiver repo layout.

Context

Phase 3a's `propose_selectors` (src/core/tools/propose_selectors.py) returns CSS selectors only, but the InfoSpec v1 schema accepts `extraction.algorithm: enum [css, xpath, jsonpath, regex, full_page]`. For the non-CSS algorithms, an LLM agent following the authoring loop (`fetch_and_render` → `propose_selectors` → `preview_extraction` → `create_info_spec`) currently has no proposer for the selector-shaped field.

What to build

Per-algorithm proposers, gated on a request `algorithm` parameter (default `css`):

  • xpath: same DOM walk + same scoring; emit XPath expressions instead of CSS. Volatility heuristics carry over (hash-looking attribute values).
  • jsonpath: applies when target is JSON (sniff via `Content-Type: application/json` from `fetch_and_render`). Walk the JSON tree, propose paths whose `str(value)` contains the description.
  • regex: heuristic regex synthesis from the description + 1-2 surrounding tokens. Lower confidence than DOM-based proposers; flag with a lower base `stability_score`.
  • full_page: trivial — returns one candidate with empty selector + score 1.0 (full_page extracts everything).
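
As a rough illustration of the gating and of the two simplest proposers above (jsonpath and full_page), here is a minimal sketch. Everything named in it is an assumption made for the example: the `Candidate` shape, the `propose` entry point, the `PROPOSERS` registry, and the placeholder score of 0.8 are hypothetical and do not reflect the existing `propose_selectors` code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterator


@dataclass
class Candidate:
    # Hypothetical result shape; the real propose_selectors output may differ.
    selector: str
    stability_score: float


def _walk(node: Any, path: str = "$") -> Iterator[tuple[str, Any]]:
    """Yield (path, value) for every scalar leaf of a parsed JSON document."""
    if isinstance(node, dict):
        for key, child in node.items():
            # Simplification: keys containing dots or spaces would need
            # bracket notation, which this sketch does not handle.
            yield from _walk(child, f"{path}.{key}")
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from _walk(child, f"{path}[{index}]")
    else:
        yield path, node


def propose_jsonpath(document: Any, description: str) -> list[Candidate]:
    """Propose paths whose str(value) contains the description."""
    return [
        Candidate(selector=path, stability_score=0.8)  # placeholder score
        for path, value in _walk(document)
        if description in str(value)
    ]


def propose_full_page(document: Any, description: str) -> list[Candidate]:
    """full_page extracts everything: one candidate, empty selector, score 1.0."""
    return [Candidate(selector="", stability_score=1.0)]


# Hypothetical registry; css, xpath and regex are omitted from this sketch.
PROPOSERS: dict[str, Callable[[Any, str], list[Candidate]]] = {
    "jsonpath": propose_jsonpath,
    "full_page": propose_full_page,
}


def propose(document: Any, description: str, algorithm: str = "css") -> list[Candidate]:
    """Dispatch on the request's algorithm parameter (default css)."""
    try:
        proposer = PROPOSERS[algorithm]
    except KeyError:
        raise NotImplementedError(f"no proposer for algorithm {algorithm!r}") from None
    return proposer(document, description)


doc = {"items": [{"title": "Annual report 2025", "id": 7}, {"title": "Other"}]}
print(propose(doc, "Annual report", algorithm="jsonpath"))
```

A registry keyed by algorithm name would let each proposer land independently, in the priority order given under "Decision criteria", without touching the dispatch code.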

Why deferred

CSS covers >90% of currently-modelled targets in the wild. The other algorithms are real but each adds its own heuristic stack and test fixtures; bundling them into Phase 3a would have ballooned the slice.

Decision criteria

Land in priority order:

  1. xpath when an operator hits a target where CSS specificity is insufficient (deeply-nested or attribute-driven content).
  2. jsonpath when an InfoItem on a JSON API is needed (likely Phase 3b — Archive may consume JSON sources).
  3. regex / full_page as opportunistic adds.
