Migrated from CannObserv/watcher#146 during the Archiver service extraction (CannObserv/watcher#149) on 2026-05-07. Paths in the body have been updated to reflect the new archiver repo layout.
## Context
Phase 3a's `propose_selectors` tool (docs/plans/2026-05-05-information-service-phase3a-plan.md, Task 7) ranks candidate CSS selectors with a hand-tuned heuristic: a substring match against the description, a specificity bonus, a text-length bonus, and a volatility penalty for selectors containing hash-like or random-looking class names.
This is intentionally crude — the tool is meant to narrow an LLM agent's search space, not finalize an extraction. Operators always validate via `preview_extraction` before committing.
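For orientation, here is a minimal sketch of the kind of scoring described above; the weights, helper names, and regex are illustrative assumptions, not the actual `propose_selectors` implementation.

```python
import re

# Long hex-looking runs in a class name usually mean a build-generated
# (CSS-module / hashed-bundle) class that will change on the next deploy.
_VOLATILE = re.compile(r"[0-9a-f]{6,}", re.IGNORECASE)


def score_selector(selector: str, extracted_text: str, description: str) -> float:
    """Hand-tuned score: higher means a more promising candidate."""
    score = 0.0

    # Substring match: how many meaningful description words show up in the extracted text.
    desc_words = {w.lower() for w in description.split() if len(w) > 3}
    text = extracted_text.lower()
    score += 2.0 * sum(w in text for w in desc_words)

    # Specificity bonus: ids and classes narrow the match more than bare tags.
    score += 1.5 * selector.count("#") + 0.5 * selector.count(".")

    # Text-length bonus: prefer selectors that capture a meaningful chunk of text.
    score += min(len(extracted_text), 200) / 200.0

    # Volatility penalty: hash-like class names are brittle across deploys.
    if _VOLATILE.search(selector):
        score -= 3.0

    return score
```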
## What to investigate
Once we have a corpus of authored InfoSpecs (~50+ across diverse target sites), we'll have ground-truth (URL, description, chosen-selector) triples. Use that corpus to evaluate whether a learned ranker outperforms the heuristic.
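The experiments would consume those triples in roughly this shape; a minimal sketch, with field names that are assumptions rather than the current InfoSpec schema:

```python
from dataclasses import dataclass, field


@dataclass
class SelectorExample:
    """One ground-truth example harvested from an authored InfoSpec."""

    url: str                  # page the spec targets
    description: str          # operator's natural-language description of the item
    chosen_selector: str      # selector the operator ultimately committed
    candidate_selectors: list[str] = field(default_factory=list)  # what propose_selectors offered
    extracted_text: dict[str, str] = field(default_factory=dict)  # candidate -> text it extracted
```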
Concrete questions:
- Feature engineering — what signals beat the current heuristic? Candidates: pretrained DOM-element embeddings, render-then-replay diff stability, ARIA landmark match, parent-child selector path frequency in HTML5 best practices, sibling-text disambiguation.
- Model family — pointwise / pairwise / listwise ranker (`lightgbm`, `xgboost`, or a small neural ranker)? Cost of inference matters since this runs inline during InfoSpec authoring. A minimal listwise sketch follows this list.
- Evaluation metric — top-1 accuracy vs. top-5 recall vs. operator-edit-distance from suggested to chosen. Probably top-5 recall: an LLM agent picks from the ranked list.
- Training data labelling — when an operator overrides a suggestion via `preview_extraction`, does that count as a negative for the suggestion and a positive for the override? How to log this passively without a UI change.
- Cold start — for InfoItems on novel domains where we have zero history, does the learned ranker fall back gracefully to the heuristic?
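To ground the model-family and feature questions, a minimal listwise sketch using `lightgbm.LGBMRanker` on top of the illustrative `SelectorExample` records above; the hand features mirror the current heuristic and are placeholders for the richer signals listed under feature engineering.

```python
import re

import numpy as np
import lightgbm as lgb

_HASHY = re.compile(r"[0-9a-f]{6,}", re.IGNORECASE)


def featurize(candidate: str, extracted_text: str, description: str) -> list[float]:
    """Cheap baseline features; DOM embeddings, ARIA matches, etc. would be appended here."""
    desc_words = {w.lower() for w in description.split() if len(w) > 3}
    text = extracted_text.lower()
    return [
        float(sum(w in text for w in desc_words)),  # description overlap
        float(candidate.count("#")),                # id specificity
        float(candidate.count(".")),                # class specificity
        min(len(extracted_text), 500) / 500.0,      # text-length signal
        1.0 if _HASHY.search(candidate) else 0.0,   # volatility flag
    ]


def train_ranker(examples: list["SelectorExample"]) -> lgb.LGBMRanker:
    """Each InfoSpec is one query group; the operator-chosen selector gets
    relevance 1 and every other candidate (including overridden suggestions,
    one answer to the labelling question above) gets 0."""
    rows, labels, group_sizes = [], [], []
    for ex in examples:
        for cand in ex.candidate_selectors:
            rows.append(featurize(cand, ex.extracted_text.get(cand, ""), ex.description))
            labels.append(1 if cand == ex.chosen_selector else 0)
        group_sizes.append(len(ex.candidate_selectors))
    model = lgb.LGBMRanker(objective="lambdarank", n_estimators=200, learning_rate=0.05)
    model.fit(np.asarray(rows), np.asarray(labels), group=group_sizes)
    return model
```

If inference cost turns out to dominate, the same feature vector feeds a pointwise logistic model almost unchanged, so the feature work is not wasted either way.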
## Out of scope for this research
- Productionizing the ranker (separate issue once research lands).
- A self-improving feedback loop (ranker retrains on operator overrides).
- Cross-tenant model sharing, if the Archiver service ever goes multi-tenant.
## Decision criteria
If a candidate model beats the heuristic on top-5 recall by ≥10% on a 100-spec held-out test set, it's worth productionizing. If not, ship operator UX improvements (better preview, faster iteration) instead.
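For the decision gate, top-5 recall is simply the fraction of held-out specs whose operator-chosen selector appears in the top five of the ranked list. A minimal sketch, reusing the illustrative `featurize` and `SelectorExample` from above; the heuristic baseline would be wrapped behind the same predict interface for a fair comparison.

```python
import numpy as np


def top5_recall(model, examples) -> float:
    """Fraction of held-out specs whose chosen selector lands in the model's top 5."""
    hits = 0
    for ex in examples:
        feats = [featurize(c, ex.extracted_text.get(c, ""), ex.description)
                 for c in ex.candidate_selectors]
        scores = model.predict(np.asarray(feats))
        ranked = [c for _, c in sorted(zip(scores, ex.candidate_selectors), reverse=True)]
        hits += ex.chosen_selector in ranked[:5]
    return hits / len(examples)


# learned  = top5_recall(ranker, held_out_specs)
# baseline = top5_recall(heuristic_ranker, held_out_specs)
# Productionize only if `learned` clears `baseline` by the >=10% margin above.
```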