Add adaptive page-aware retrieval pipeline with graded evaluation #115

Draft
Shrey1306 wants to merge 4 commits into georgia-tech-db:main from Shrey1306:feature/sgupta736-retrieval-quality

Shrey1306 commented Apr 15, 2026

Replace static chunk retrieval with adaptive query routing, hierarchical section-to-chunk retrieval, page-aware reranking, and a graded benchmark suite. Measured on 21 judged textbook questions: chunk nDCG@10 = 0.359, chunk Recall@10 = 0.698, page-hit@10 = 0.857.
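For reviewers checking the numbers above, here is a minimal sketch of how graded ranking metrics of this kind are commonly computed. It assumes linear gains with a log2 rank discount and treats any grade above 0 as relevant for recall; the PR does not state which formulation the benchmark runner uses, and the function names below are hypothetical rather than taken from the repository.

```python
import math
from typing import Dict, List, Set


def dcg_at_k(grades: List[int], k: int) -> float:
    """Discounted cumulative gain over the first k grades (linear gain, log2 discount)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(grades[:k], start=1))


def ndcg_at_k(ranked_ids: List[str], gold: Dict[str, int], k: int = 10) -> float:
    """nDCG@k: DCG of the ranking divided by the DCG of the ideal ordering of the gold grades."""
    gains = [gold.get(chunk_id, 0) for chunk_id in ranked_ids]
    ideal_dcg = dcg_at_k(sorted(gold.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranked_ids: List[str], gold: Dict[str, int], k: int = 10) -> float:
    """Fraction of gold items with grade > 0 that appear in the top k."""
    relevant = {chunk_id for chunk_id, grade in gold.items() if grade > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


def page_hit_at_k(ranked_pages: List[int], gold_pages: Set[int], k: int = 10) -> float:
    """1.0 if any of the top-k retrieved pages is a gold page, else 0.0."""
    return 1.0 if any(page in gold_pages for page in ranked_pages[:k]) else 0.0


if __name__ == "__main__":
    # Hypothetical gold labels for one question: chunk id -> relevance grade.
    gold = {"c1": 2, "c2": 1}
    print(ndcg_at_k(["c3", "c1"], gold), recall_at_k(["c3", "c1"], gold))
```

Averaging each per-question value over the judged questions gives a single benchmark figure. The sketch assumes the ranked list contains no duplicate ids.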

Key changes:

  • Manifest-backed artifact bundle with SHA256 validation (src/artifacts.py)
  • Adaptive query planner with 5-type routing (src/retrieval_pipeline.py)
  • Hierarchical section-to-chunk retrieval with section-prior boosting
  • Page-aware reranking (base score + lexical overlap + page specificity)
  • Multi-part query decomposition with coverage-aware result merging
  • Hardened GGUF embedder: single-input default, no zero-vector fallback
  • Hardened query enhancement: try/except fallbacks on all LLM calls
  • build_index() decomposed into extract, embed, persist stages
  • get_answer() split into adaptive and legacy retrieval helpers
  • 21 graded benchmarks with 3-level chunk/page relevance labels
  • Ranked IR metrics: nDCG, Recall, MRR, MAP, page-hit variants
  • Benchmark runner with --mode baseline for fair comparison
  • Ruff lint, docstrings, 36 unit tests passing
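
The manifest-backed SHA256 validation in the first bullet could look roughly like the sketch below. The manifest layout (a top-level "files" mapping of relative file name to hex digest) and the function names are assumptions for illustration; the actual schema in src/artifacts.py is not shown in this PR description.

```python
import hashlib
import json
from pathlib import Path
from typing import List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size blocks so large index artifacts are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def validate_bundle(manifest_path: Path) -> List[str]:
    """Return a list of problems; an empty list means every listed artifact matched.

    Assumed manifest shape: {"files": {"<relative name>": "<sha256 hex>"}}.
    """
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    problems: List[str] = []
    for rel_name, expected in manifest["files"].items():
        artifact = manifest_path.parent / rel_name
        if not artifact.is_file():
            problems.append(f"missing: {rel_name}")
        elif sha256_of(artifact) != expected:
            problems.append(f"checksum mismatch: {rel_name}")
    return problems
```

Returning the full list of problems, rather than raising on the first one, lets a loader report every stale or missing artifact in a single pass before refusing to load the bundle.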

Update 04/25/2026 (FINAL REPORT)

Consolidates the adaptive routing policy, tightens the retrieval and artifact stack, and adds a reproducible local eval path, plus benchmarks and tests aligned to the current textbook index.

What changed

  • src/planning/rules.py (new)
    Single place for query-type classification, follow-up reference patterns, heuristic decomposition, and shared QUERY_TYPE_* constants. HeuristicQueryPlanner, query_enhancement, and retrieval_pipeline all route through the same policy.

  • src/retrieval_pipeline.py & src/query_enhancement.py
    Refactored to use the shared rules: named score weights, multi-part merge and anchor rerank behavior, confidence widening traced on RetrievalTrace, and deterministic, consistent follow-up handling.

  • src/retriever.py & src/config.py
    Stricter bundle loading (e.g. manifest and artifact version expectations), plus type and config cleanups in support of hierarchical and page-aware behavior.

  • scripts/run_evals.sh (new)
    One script to run preflight, optional index build, artifact validation, retrieval benchmark, pytest, and make lint, with timestamped logs under eval_runs/.

  • Index metadata checked in
    index/sections/textbook_index_manifest.json plus updated textbook_index_page_to_chunk_map.json for the current textbook_index build (large diff is the page map).

  • tests/benchmarks.yaml
    Graded retrieval_gold (chunks/pages + grades) and query-type fields updated for the current artifact set.

  • Tests
    tests/test_planning_rules.py (new), extended test_retrieval_pipeline, test_artifacts, and other suites for the above behavior.

  • Tooling
    Makefile Ruff list now includes rules.py and test_planning_rules.py. .gitignore adds local conda dirs, eval_runs/, generated index/sections/*.npy and *_info.json, and * 2.{py,yaml,yml} to avoid duplicate-file accidents.

  • config/config.yaml
    Default generator model path adjusted (e.g. smaller/faster local instruct model); teams should still align paths with their own models/.
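
To make the shared-policy idea behind src/planning/rules.py concrete, here is a minimal sketch of rule-based classification and heuristic decomposition. Everything in it is hypothetical: the PR says QUERY_TYPE_* constants, follow-up reference patterns, and a 5-type router exist, but it does not list the types or the patterns, so only three illustrative types and simplified regexes are shown.

```python
import re
from typing import List

# Hypothetical constant names and values (the real QUERY_TYPE_* set is not listed in this PR).
QUERY_TYPE_FOLLOW_UP = "follow_up"
QUERY_TYPE_MULTI_PART = "multi_part"
QUERY_TYPE_GENERAL = "general"

# Assumed follow-up cue: the query opens with a phrase that leans on an earlier turn.
_FOLLOW_UP_PATTERN = re.compile(r"^\s*(?:and|what about|how about)\b", re.IGNORECASE)

# Assumed split points: a question mark followed by more text, a semicolon,
# or "and" directly before another question word.
_SPLIT_PATTERN = re.compile(
    r"\?\s+|;\s*|\s+and\s+(?=(?:what|how|why|when|where|which|who)\b)",
    re.IGNORECASE,
)


def decompose(query: str) -> List[str]:
    """Split a query into sub-questions using the heuristic split points above."""
    parts = [part.strip(" ?") for part in _SPLIT_PATTERN.split(query)]
    return [part for part in parts if part]


def classify(query: str, has_history: bool = False) -> str:
    """Return one query type; follow-up detection only applies when there is chat history."""
    if has_history and _FOLLOW_UP_PATTERN.search(query):
        return QUERY_TYPE_FOLLOW_UP
    if len(decompose(query)) > 1:
        return QUERY_TYPE_MULTI_PART
    return QUERY_TYPE_GENERAL
```

Keeping the constants and patterns in one module means the planner, query enhancement, and retrieval pipeline cannot drift apart on what counts as a follow-up or a multi-part query, which is the motivation stated for the new file.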

How to verify

make lint
pytest tests/ -q
# Optional full local run (needs models + data per config):
# bash scripts/run_evals.sh

Resolve conflicts keeping improved retrieval pipeline while incorporating
main's new fields (embedding_model_context_window, enable_topic_extraction,
uuid4, flash_attn, LlamaRAMCache). Fix duplicate numpy import and unused
deepcopy import introduced during merge.
Shrey1306 marked this pull request as draft April 15, 2026 07:35
…rieval-quality

# Conflicts:
#	Makefile
#	config/config.yaml
#	src/api_server.py
#	src/config.py
#	src/index_builder.py
#	src/main.py
#	tests/conftest.py