
Research: replace heuristic propose_selectors ranker with a learned model #4

@gregoryfoster

Description


Migrated from CannObserv/watcher#146 during the Archiver service extraction (CannObserv/watcher#149) on 2026-05-07. Paths in the body have been updated to reflect the new archiver repo layout.

Context

Phase 3a's `propose_selectors` tool (docs/plans/2026-05-05-information-service-phase3a-plan.md, Task 7) ranks candidate CSS selectors with a hand-tuned heuristic: a substring match on the description, a specificity bonus, a text-length bonus, and a volatility penalty for selectors containing hash-y or random-looking class names.

This is intentionally crude — the tool is meant to narrow an LLM agent's search space, not finalize an extraction. Operators always validate via `preview_extraction` before committing.
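For reference, a minimal sketch of the kind of scoring the current heuristic does. The function, weights, and volatility check below are illustrative assumptions for discussion, not the implementation in the archiver repo:

```python
import re

def looks_volatile(cls: str) -> bool:
    # Hash-y / random-looking class names, e.g. "css-1q2w3e": long and digit-bearing.
    return len(cls) >= 8 and any(ch.isdigit() for ch in cls)

def score_candidate(selector: str, element_text: str, description: str) -> float:
    # Illustrative weights only; the real heuristic is hand-tuned separately.
    score = 0.0
    # Substring match on the operator's description.
    if any(tok in element_text.lower() for tok in description.lower().split()):
        score += 2.0
    # Specificity bonus: deeper selector paths are more specific.
    score += 0.1 * selector.count(">")
    # Text-length bonus, capped so large text blobs don't dominate.
    score += min(len(element_text), 200) / 200.0
    # Volatility penalty for hash-y class names anywhere in the selector.
    if any(looks_volatile(cls) for cls in re.findall(r"\.([\w-]+)", selector)):
        score -= 1.5
    return score
```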

What to investigate

Once we have a corpus of authored InfoSpecs (~50+ across diverse target sites), we'll have ground-truth (URL, description, chosen-selector) triples. Use those to evaluate whether a learned ranker outperforms the heuristic.

Concrete questions:

  1. Feature engineering — what signals beat the current heuristic? Candidates: pretrained DOM-element embeddings, render-then-replay diff stability, ARIA landmark match, parent-child selector path frequency in HTML5 best practices, sibling-text disambiguation.
  2. Model family — pointwise / pairwise / listwise ranker (`lightgbm`, `xgboost`, or a small neural ranker)? Cost-of-inference matters since this runs inline during InfoSpec authoring (see the ranker sketch after this list).
  3. Evaluation metric — top-1 accuracy vs. top-5 recall vs. operator-edit-distance from suggested to chosen. Probably top-5 recall: an LLM agent picks from the ranked list.
  4. Training data labelling — when an operator overrides a suggestion via `preview_extraction`, does that count as a negative for the suggestion and a positive for the override? How can we log this passively, without a UI change? (See the labelling sketch after this list.)
  5. Cold-start — for InfoItems on novel domains where we have zero history, does the learned ranker degrade gracefully back to the heuristic?
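On question 2, a minimal sketch of training a pairwise/listwise ranker on the InfoSpec corpus with `lightgbm`. The feature files, group layout, and hyperparameters are assumptions, not an existing pipeline:

```python
import lightgbm as lgb
import numpy as np

# Assumed layout: X has one row of engineered features per (query, candidate-selector)
# pair; y is 1 for the operator-chosen selector and 0 otherwise; groups gives the
# number of candidates per query so LightGBM knows where each ranking list ends.
X = np.load("features.npy")          # hypothetical output of the feature-extraction step
y = np.load("labels.npy")
groups = np.load("group_sizes.npy")  # e.g. [12, 8, 30, ...] candidates per InfoSpec

ranker = lgb.LGBMRanker(
    objective="lambdarank",  # pairwise/listwise objective
    n_estimators=200,
    num_leaves=31,
)
ranker.fit(X, y, group=groups)

# Inference runs inline during InfoSpec authoring, so measure per-query latency too.
first_query = X[: int(groups[0])]
scores = ranker.predict(first_query)
```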
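On question 4, one passive option is to log the ranker's top suggestion next to whatever selector the operator finally commits via `preview_extraction`, and derive labels offline. A sketch under that assumption; the event fields and class are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class SelectorLabel:
    url: str
    description: str
    selector: str
    label: int  # 1 = chosen by the operator, 0 = suggested but overridden

def labels_from_preview_event(url: str, description: str,
                              suggested: str, committed: str) -> list[SelectorLabel]:
    # `suggested` is the ranker's top candidate; `committed` is what the operator saved.
    if suggested == committed:
        return [SelectorLabel(url, description, committed, 1)]
    # Override case: negative for the suggestion, positive for the override.
    return [
        SelectorLabel(url, description, suggested, 0),
        SelectorLabel(url, description, committed, 1),
    ]
```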

Out of scope for the research

  • Productionizing the ranker (separate issue once research lands).
  • A self-improving feedback loop (ranker retrains on operator overrides).
  • Cross-tenant model sharing, if the Archiver service ever goes multi-tenant.

Decision criteria

If a candidate model beats the heuristic on top-5 recall by ≥10% on a 100-spec held-out test set, it's worth productionizing. If not, ship operator UX improvements (better preview, faster iteration) instead.
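As a concrete sketch of that check, reading the ≥10% as absolute percentage points of top-5 recall (the ranked lists and spec IDs are placeholders):

```python
def top5_recall(ranked: dict[str, list[str]], chosen: dict[str, str]) -> float:
    # Fraction of held-out specs whose chosen selector appears in the top 5 candidates.
    hits = sum(1 for spec_id, sel in chosen.items() if sel in ranked[spec_id][:5])
    return hits / len(chosen)

def should_productionize(learned_recall: float, heuristic_recall: float) -> bool:
    # Ship the learned ranker only if it clears the heuristic by at least 10 points.
    return learned_recall - heuristic_recall >= 0.10
```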
