
Research: replace heuristic propose_selectors ranker with a learned model #4

@gregoryfoster

Description


Migrated from CannObserv/watcher#146 during the Archiver service extraction (CannObserv/watcher#149) on 2026-05-07. Paths in the body have been updated to reflect the new archiver repo layout.

Context

Phase 3a's `propose_selectors` tool (docs/plans/2026-05-05-information-service-phase3a-plan.md, Task 7) ranks candidate CSS selectors with a hand-tuned heuristic: a substring match on the description, a specificity bonus, a text-length bonus, and a volatility penalty for selectors containing hash-y or random-looking class names.

This is intentionally crude — the tool is meant to narrow an LLM agent's search space, not finalize an extraction. Operators always validate via `preview_extraction` before committing.
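For reference, a minimal sketch of the kind of scoring the current heuristic does. The function, weights, and volatility check below are illustrative assumptions for discussion, not the implementation in the archiver repo:

```python
import re

def looks_volatile(cls: str) -> bool:
    # Hash-y / random-looking class names, e.g. "css-1q2w3e": long and digit-bearing.
    return len(cls) >= 8 and any(ch.isdigit() for ch in cls)

def score_candidate(selector: str, element_text: str, description: str) -> float:
    # Illustrative weights only; the real heuristic is hand-tuned separately.
    score = 0.0
    # Substring match on the operator's description.
    if any(tok in element_text.lower() for tok in description.lower().split()):
        score += 2.0
    # Specificity bonus: deeper selector paths are more specific.
    score += 0.1 * selector.count(">")
    # Text-length bonus, capped so large text blobs don't dominate.
    score += min(len(element_text), 200) / 200.0
    # Volatility penalty for hash-y class names anywhere in the selector.
    if any(looks_volatile(cls) for cls in re.findall(r"\.([\w-]+)", selector)):
        score -= 1.5
    return score
```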

What to investigate

Once we have a corpus of authored InfoSpecs (~50+ across diverse target sites), we'll have ground-truth (URL, description, chosen-selector) triples. Use those to evaluate whether a learned ranker outperforms the heuristic.

Concrete questions:

  1. Feature engineering — what signals beat the current heuristic? Candidates: pretrained DOM-element embeddings, render-then-replay diff stability, ARIA landmark match, parent-child selector path frequency in HTML5 best practices, sibling-text disambiguation.
  2. Model family — pointwise / pairwise / listwise ranker (`lightgbm`, `xgboost`, or a small neural ranker)? Cost-of-inference matters since this runs inline during InfoSpec authoring (see the ranker sketch after this list).
  3. Evaluation metric — top-1 accuracy vs. top-5 recall vs. operator-edit-distance from suggested to chosen. Probably top-5 recall: an LLM agent picks from the ranked list.
  4. Training data labelling — when an operator overrides a suggestion via `preview_extraction`, does that count as a negative for the suggestion and a positive for the override? How can we log this passively, without a UI change? (See the labelling sketch after this list.)
  5. Cold-start — for InfoItems on novel domains where we have zero history, does the learned ranker degrade gracefully back to the heuristic?
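On question 2, a minimal sketch of training a pairwise/listwise ranker on the InfoSpec corpus with `lightgbm`. The feature files, group layout, and hyperparameters are assumptions, not an existing pipeline:

```python
import lightgbm as lgb
import numpy as np

# Assumed layout: X has one row of engineered features per (query, candidate-selector)
# pair; y is 1 for the operator-chosen selector and 0 otherwise; groups gives the
# number of candidates per query so LightGBM knows where each ranking list ends.
X = np.load("features.npy")          # hypothetical output of the feature-extraction step
y = np.load("labels.npy")
groups = np.load("group_sizes.npy")  # e.g. [12, 8, 30, ...] candidates per InfoSpec

ranker = lgb.LGBMRanker(
    objective="lambdarank",  # pairwise/listwise objective
    n_estimators=200,
    num_leaves=31,
)
ranker.fit(X, y, group=groups)

# Inference runs inline during InfoSpec authoring, so measure per-query latency too.
first_query = X[: int(groups[0])]
scores = ranker.predict(first_query)
```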
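On question 4, one passive option is to log the ranker's top suggestion next to whatever selector the operator finally commits via `preview_extraction`, and derive labels offline. A sketch under that assumption; the event fields and class are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class SelectorLabel:
    url: str
    description: str
    selector: str
    label: int  # 1 = chosen by the operator, 0 = suggested but overridden

def labels_from_preview_event(url: str, description: str,
                              suggested: str, committed: str) -> list[SelectorLabel]:
    # `suggested` is the ranker's top candidate; `committed` is what the operator saved.
    if suggested == committed:
        return [SelectorLabel(url, description, committed, 1)]
    # Override case: negative for the suggestion, positive for the override.
    return [
        SelectorLabel(url, description, suggested, 0),
        SelectorLabel(url, description, committed, 1),
    ]
```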

Out of scope for the research

  • Productionizing the ranker (separate issue once research lands).
  • A self-improving feedback loop (ranker retrains on operator overrides).
  • Cross-tenant model sharing, if the Archiver service ever goes multi-tenant.

Decision criteria

If a candidate model beats the heuristic on top-5 recall by ≥10% on a 100-spec held-out test set, it's worth productionizing. If not, ship operator UX improvements (better preview, faster iteration) instead.
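As a concrete sketch of that check, reading the ≥10% as absolute percentage points of top-5 recall (the ranked lists and spec IDs are placeholders):

```python
def top5_recall(ranked: dict[str, list[str]], chosen: dict[str, str]) -> float:
    # Fraction of held-out specs whose chosen selector appears in the top 5 candidates.
    hits = sum(1 for spec_id, sel in chosen.items() if sel in ranked[spec_id][:5])
    return hits / len(chosen)

def should_productionize(learned_recall: float, heuristic_recall: float) -> bool:
    # Ship the learned ranker only if it clears the heuristic by at least 10 points.
    return learned_recall - heuristic_recall >= 0.10
```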
