2. Canonicalization, Section Tree and KG Retriever #98
Open
santo0 wants to merge 15 commits into
…DME with new commands
…ics calculations in benchmark retrieval script
I'm trying to simplify this PR; I will notify you when I'm done.
This was referenced Apr 15, 2026
Knowledge Graph Enhancements: Canonicalization, Section Tree, Query Difficulty and KG Retriever
Builds on the kg-query-difficulty pipeline with four major additions: LLM-based keyword canonicalization, a section-tree retrieval layer, a formal query difficulty model, and a RAG-compatible KGRetriever.
PRs structure
Each PR in this series depends on the ones before it.
1. Keyword Canonicalization (canonicalizer.py)
Reduces synonym noise in the KG by collapsing semantically equivalent keywords into a single canonical form before graph construction.
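As a rough illustration of what collapsing keywords into a canonical form can look like, here is a minimal sketch. The helper names build_synonym_table and canonicalize and the sample groups are hypothetical and are not taken from canonicalizer.py; the sketch only assumes that each group of equivalent keywords has already been assigned one canonical form.

```python
from typing import Dict, Iterable, List


def build_synonym_table(groups: Dict[str, Iterable[str]]) -> Dict[str, str]:
    """Map every variant (and the canonical form itself) to its canonical form."""
    table: Dict[str, str] = {}
    for canonical, variants in groups.items():
        table[canonical] = canonical
        for variant in variants:
            table[variant] = canonical
    return table


def canonicalize(keywords: Iterable[str], table: Dict[str, str]) -> List[str]:
    """Rewrite keywords to canonical forms, dropping duplicates but keeping order.

    A keyword that is missing from the table is kept unchanged.
    """
    seen = set()
    result: List[str] = []
    for kw in keywords:
        canon = table.get(kw, kw)  # unresolved keywords fall back to themselves
        if canon not in seen:
            seen.add(canon)
            result.append(canon)
    return result


# Hypothetical synonym groups: canonical form -> equivalent variants.
groups = {"neural network": ["neural net", "NN"], "gradient descent": ["GD"]}
table = build_synonym_table(groups)
print(canonicalize(["neural net", "NN", "GD", "backpropagation"], table))
# -> ['neural network', 'gradient descent', 'backpropagation']
```

A plain dictionary keeps each lookup constant-time, and the fallback in table.get means an unknown keyword still produces a usable graph node instead of being dropped.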
The process runs in four stages:
- … (all-MiniLM-L6-v2).
- … max_group_size limit.
- ExtractionResult objects are rewritten to use canonical forms; any keyword not resolved by the LLM falls back to itself.
At query time, a CanonicalLookup object resolves new keywords: first via an O(1) synonym table lookup, then via embedding nearest-neighbor search gated by a configurable fallback_threshold.
2. Section Tree (section_tree.py)
Provides a structural retrieval signal by mirroring the textbook's heading hierarchy as a tree of SectionNode objects, each carrying aggregated KG keyword sets.
Construction (build_section_tree):
- … (section field).
- … (13.1 → parent 13).
- … chunk_ids node attributes.
Scoring (get_chunk_scores): hybrid signal per section node blending two components: …
A top-down inheritance pass then propagates effective scores to children: …
This ensures that relevance detected at a section level flows through to its subsections. Final scores are normalized to [0, 1].
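The inheritance rule itself is not reproduced in this description, so the following is only a sketch under stated assumptions: a child keeps the larger of its own score and a decayed share of its parent's effective score, and the parent is found by stripping the last dotted component of the section id (13.1 → 13). The function names and the decay parameter are hypothetical and may differ from what section_tree.py does.

```python
from typing import Dict, Optional


def parent_id(section_id: str) -> Optional[str]:
    """Return the parent of a dotted section id ("13.1" -> "13"), or None for a root."""
    head, sep, _ = section_id.rpartition(".")
    return head if sep else None


def propagate_and_normalize(own: Dict[str, float], decay: float = 0.5) -> Dict[str, float]:
    """Top-down pass followed by scaling to [0, 1].

    ASSUMPTION: effective(child) = max(own(child), decay * effective(parent)).
    This rule is illustrative only and is not taken from the PR.
    """
    effective: Dict[str, float] = {}
    # Fewer dots means a shallower section, so parents are processed before children.
    for sid in sorted(own, key=lambda s: s.count(".")):
        parent = parent_id(sid)
        inherited = decay * effective.get(parent, 0.0) if parent is not None else 0.0
        effective[sid] = max(own[sid], inherited)
    top = max(effective.values(), default=0.0)
    if top <= 0.0:
        # Nothing matched: avoid dividing by zero and report all-zero scores.
        return {sid: 0.0 for sid in effective}
    return {sid: score / top for sid, score in effective.items()}


scores = propagate_and_normalize({"13": 0.8, "13.1": 0.1, "13.2": 0.6, "14": 0.0})
print(scores)
```

In this example, section 13.1 has a weak score of its own but is lifted by its parent, while 13.2 keeps its own stronger score. Dividing by the maximum places the best section at exactly 1.0.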
The tree serializes to/from section_tree.json in each run directory.
3. Query Difficulty Analysis (analysis.py, models.py)
Estimates how hard a query is to answer based on graph-structural properties of the subgraph induced by the query's matched nodes.
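To make "graph-structural properties of the induced subgraph" concrete, here is a small dependency-free sketch. It assumes an undirected graph stored as a dictionary of neighbor sets. The function names are hypothetical, the dictionary keys borrow feature names used in this PR, and the actual analysis.py may compute them differently.

```python
from collections import deque
from typing import Dict, Iterable, Set

Adjacency = Dict[str, Set[str]]  # node -> set of neighbors (undirected)


def induced_subgraph(adj: Adjacency, nodes: Iterable[str]) -> Adjacency:
    """Keep only the matched nodes and the edges running between them."""
    keep = {n for n in nodes if n in adj}
    return {n: adj[n] & keep for n in keep}


def component_count(sub: Adjacency) -> int:
    """Count connected components with a breadth-first search."""
    seen: Set[str] = set()
    count = 0
    for start in sub:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in sub[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
    return count


def structural_features(adj: Adjacency, matched: Iterable[str]) -> Dict[str, float]:
    """Compute a few structural features of the subgraph induced by matched nodes."""
    sub = induced_subgraph(adj, matched)
    degrees = [len(neighbors) for neighbors in sub.values()]
    return {
        "subgraph_node_count": len(sub),
        "subgraph_edge_count": sum(degrees) // 2,  # each undirected edge is seen twice
        "component_count": component_count(sub),
        "avg_degree": sum(degrees) / len(degrees) if degrees else 0.0,
        "max_degree": max(degrees, default=0),
    }


adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": {"e"}, "e": {"d"}, "f": set()}
features = structural_features(adj, ["a", "b", "d", "f"])
print(features)
```

Here the matched nodes a and b are connected, while d and f are isolated inside the induced subgraph, giving three components. A high component count relative to the node count suggests the evidence for a query is scattered across the graph.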
Features (QueryFeatures):
- query_node_count
- max_path_length
- avg_path_length
- component_count
- avg_degree / max_degree
- subgraph_node_count / subgraph_edge_count
- doc_count
Scoring (DifficultyScore): five orthogonal dimensions each contribute 0–2 points:
- multihop
- fragmentation
- subgraph_size
- branching
- dispersion
Total score (0–10) maps to EASY/MEDIUM/HARD. The analyze_query script outputs results as JSON.
4. KG Retriever (query.py)
A KGRetriever class implementing the duck-typed Retriever interface (name attribute + get_scores method), making it a drop-in component for the RAG EnsembleRanker.
Node-match scoring (retrieve_from_kg):
- … num_hops propagates neighbor contributions with geometric decay: neighbor_weight^hop × (edge_weight / max_edge_weight).
Section blending: when a SectionTree is provided, the final score is a weighted combination: …
Set beta = 0.0 to fall back to pure node-match scoring.
CanonicalLookup is threaded through extract_query_nodes so synonym-aware matching is available at retrieval time.
Supporting Changes
- io.py: clean I/O utilities, namely load_graph, load_run_chunks, resolve_run_dir (handles both specific run dirs and latest symlinks), load_graph_and_chunks, load_graph_chunks_and_tree, load_canonicalization_data.
- utils/normalizer.py, utils/ngrams.py: shared text normalization and n-gram extraction used across canonicalization, section tree, and query matching.
- scripts/: scripts reorganized into a subdirectory; analyze_query.py and inspect_run.py added. inspect_run.py prints graph statistics, degree distribution, and section tree summaries for a given run directory.
- persisters/networkx_json_persister.py: extended to optionally persist canonicalization artifacts (synonym_table.json, canonical_keywords.json, canonical_embeddings.npy) and the section tree.
- pipeline.py: canonicalization step wired in as an optional stage between extraction and linking.
- tests/test_knowledge_graph.py: unit tests covering the new modules.
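Returning to the KG Retriever scoring in section 4, the geometric decay term neighbor_weight^hop × (edge_weight / max_edge_weight) and the beta blending can be sketched as follows. This is an illustration, not the code in query.py. The functions propagate and blend are hypothetical, seeds are assumed to score 1.0, contributions reaching a node at the same hop are assumed to add up, and the blend is assumed to be (1 - beta) × node score + beta × section score, which is consistent with beta = 0.0 giving pure node-match scoring.

```python
from typing import Dict, Iterable

Graph = Dict[str, Dict[str, float]]  # node -> {neighbor: edge_weight}


def propagate(graph: Graph, seeds: Iterable[str], num_hops: int = 2,
              neighbor_weight: float = 0.5) -> Dict[str, float]:
    """Score nodes reachable from the seeds within num_hops.

    ASSUMPTIONS: seeds score 1.0; a node first reached at hop h over an edge
    of weight w gains neighbor_weight**h * (w / max_edge_weight).
    """
    max_w = max((w for nbrs in graph.values() for w in nbrs.values()), default=1.0)
    scores = {s: 1.0 for s in seeds if s in graph}
    visited = set(scores)
    frontier = set(scores)
    for hop in range(1, num_hops + 1):
        gained: Dict[str, float] = {}
        for node in frontier:
            for nbr, w in graph.get(node, {}).items():
                if nbr not in visited:
                    gained[nbr] = gained.get(nbr, 0.0) + neighbor_weight ** hop * (w / max_w)
        scores.update(gained)
        visited |= set(gained)
        frontier = set(gained)
    return scores


def blend(node_scores: Dict[str, float], section_scores: Dict[str, float],
          beta: float = 0.3) -> Dict[str, float]:
    """Weighted combination of two score maps keyed by the same ids (assumed form)."""
    keys = set(node_scores) | set(section_scores)
    return {k: (1.0 - beta) * node_scores.get(k, 0.0) + beta * section_scores.get(k, 0.0)
            for k in keys}


graph = {"a": {"b": 2.0}, "b": {"a": 2.0, "c": 1.0}, "c": {"b": 1.0}}
node_scores = propagate(graph, ["a"], num_hops=2, neighbor_weight=0.5)
print(node_scores)  # -> {'a': 1.0, 'b': 0.5, 'c': 0.125}
print(blend({"chunk-1": 0.8}, {"chunk-1": 0.2}, beta=0.0))  # -> {'chunk-1': 0.8}
```

Dividing by the maximum edge weight keeps every neighbor contribution at or below neighbor_weight^hop, so nodes found by expansion stay below direct matches whenever neighbor_weight is below 1.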