Phase 3a follow-up: propose_selectors for non-CSS extraction algorithms

> _Migrated from https://github.com/CannObserv/watcher/issues/148 during the Archiver service extraction (CannObserv/watcher#149) on 2026-05-07. Paths in the body have been updated to reflect the new archiver repo layout._

## Context

Phase 3a's `propose_selectors` ([src/core/tools/propose_selectors.py](src/core/tools/propose_selectors.py)) returns CSS selectors only. The InfoSpec v1 schema accepts `extraction.algorithm: enum [css, xpath, jsonpath, regex, full_page]`, so for non-CSS algorithms an LLM agent following the authoring loop (`fetch_and_render` → `propose_selectors` → `preview_extraction` → `create_info_spec`) currently has no proposer for the selector-shaped field.

## What to build

Per-algorithm proposers, gated on a request `algorithm` parameter (default `css`):

- **xpath**: same DOM walk + same scoring; emit XPath expressions instead of CSS. Volatility heuristics carry over (hash-looking attribute values).
- **jsonpath**: applies when the target is JSON (sniffed via `Content-Type: application/json` from `fetch_and_render`). Walk the JSON tree and propose paths whose `str(value)` contains the description.
- **regex**: heuristic regex synthesis from the description plus 1-2 surrounding tokens. Lower confidence than the DOM-based proposers; flag with a lower base `stability_score`.
- **full_page**: trivial — returns one candidate with empty selector + score 1.0 (full_page extracts everything).
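
As a rough illustration of the two simplest proposers above, here is a minimal sketch of the `jsonpath` tree walk and the `full_page` candidate, dispatched on the `algorithm` parameter. The `Candidate` shape, the function names, and the length-ratio score are all assumptions made for illustration; the real `propose_selectors` return type and scoring are not described in this issue and may differ.

```python
import json
from dataclasses import dataclass
from typing import Any, Iterator


@dataclass
class Candidate:
    # Hypothetical result shape; not the actual propose_selectors schema.
    selector: str
    stability_score: float
    sample: str = ""


def _walk(node: Any, path: str = "$") -> Iterator[tuple[str, Any]]:
    """Yield (jsonpath, value) for every scalar leaf of a parsed JSON tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            # Escape backslashes and single quotes so bracket notation stays valid.
            safe = key.replace("\\", "\\\\").replace("'", "\\'")
            yield from _walk(value, f"{path}['{safe}']")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from _walk(value, f"{path}[{index}]")
    else:
        yield path, node


def propose_jsonpath(body: str, description: str) -> list[Candidate]:
    """Propose paths whose str(value) contains the description."""
    candidates = []
    for path, value in _walk(json.loads(body)):
        text = str(value)
        if description in text:
            # Assumed score: the fraction of the leaf covered by the description,
            # so tighter matches rank higher. Not the project's real scoring.
            score = round(len(description) / len(text), 2) if text else 0.0
            candidates.append(Candidate(path, score, text))
    return sorted(candidates, key=lambda c: c.stability_score, reverse=True)


def propose_full_page() -> list[Candidate]:
    """full_page extracts everything: one candidate, empty selector, score 1.0."""
    return [Candidate(selector="", stability_score=1.0)]


def propose(body: str, description: str, algorithm: str = "css") -> list[Candidate]:
    """Dispatch on the request's algorithm parameter (default css)."""
    if algorithm == "jsonpath":
        return propose_jsonpath(body, description)
    if algorithm == "full_page":
        return propose_full_page()
    # css exists in Phase 3a; xpath and regex are out of scope for this sketch.
    raise NotImplementedError(f"no proposer sketched for {algorithm!r}")


if __name__ == "__main__":
    doc = '{"store": {"items": [{"price": "12.50 USD"}, {"price": "8 USD"}]}}'
    for candidate in propose(doc, "12.50", algorithm="jsonpath"):
        print(candidate.selector, candidate.stability_score)
```

The sketch emits bracket-notation paths (`$['store']['items'][0]['price']`) so that keys containing dots or spaces remain unambiguous; whether the extraction side expects bracket or dot notation is something to confirm against the JSONPath library the archiver actually uses.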

## Why deferred

CSS covers >90% of currently-modelled targets in the wild. The other algorithms are real but each adds its own heuristic stack and test fixtures; bundling them into Phase 3a would have ballooned the slice.

## Decision criteria

Land in priority order:
1. **xpath** when an operator hits a target where CSS specificity is insufficient (deeply-nested or attribute-driven content).
2. **jsonpath** when an InfoItem on a JSON API is needed (likely Phase 3b — Archive may consume JSON sources).
3. **regex** / **full_page** as opportunistic adds.

## References

- [docs/plans/2026-05-05-information-service-phase3a-plan.md](docs/plans/2026-05-05-information-service-phase3a-plan.md) Task 7
- Parent: CannObserv/watcher#145
