2.Canonicalization, Section Tree and KG Retriever by santo0 · Pull Request #98 · georgia-tech-db/TokenSmith

santo0 · 2026-03-30T18:17:04Z

Knowledge Graph Enhancements: Canonicalization, Section Tree, Query Difficulty and KG Retriever

Builds on the kg-query-difficulty pipeline with four major additions: LLM-based keyword canonicalization, a section-tree retrieval layer, a formal query difficulty model, and a RAG-compatible KGRetriever.

PR structure

This PR is one of a series; each PR depends on the ones before it.

1. Keyword Canonicalization (`canonicalizer.py`)

Reduces synonym noise in the KG by collapsing semantically equivalent keywords into a single canonical form before graph construction.

The process runs in four stages:

Embed — all unique keywords are encoded with a sentence-transformer (all-MiniLM-L6-v2).
Cluster — complete-linkage hierarchical clustering with a cosine similarity threshold groups likely synonyms. Oversized clusters are force-split to respect the max_group_size limit.
LLM verification — groups are sent to an OpenRouter LLM in batches. The model decides which keywords are true synonyms and elects a canonical form per group. Small groups (≤5 keywords) are batched together to reduce API calls.
Apply — ExtractionResult objects are rewritten to use canonical forms; any keyword not resolved by the LLM falls back to itself.
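
The Cluster stage can be illustrated with a small pure-Python sketch. It is not the code in `canonicalizer.py`: the 2-D vectors stand in for all-MiniLM-L6-v2 embeddings, the function name `complete_linkage_groups` is hypothetical, and the force-split shown here (naive chunking) is an assumption, since the PR does not say how oversized clusters are divided.

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def complete_linkage_groups(vectors, threshold, max_group_size):
    """Merge clusters while every cross-pair similarity is >= threshold."""
    clusters = [[k] for k in vectors]

    def linkage(a, b):
        # Complete linkage: the weakest pairwise similarity between clusters.
        return min(cosine(vectors[x], vectors[y]) for x in a for y in b)

    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            sim = linkage(clusters[i], clusters[j])
            if sim >= threshold and (best is None or sim > best[0]):
                best = (sim, i, j)
        if best is None:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    # Assumed force-split: cut oversized clusters into fixed-size chunks.
    groups = []
    for c in clusters:
        for start in range(0, len(c), max_group_size):
            groups.append(c[start:start + max_group_size])
    return groups

# Toy vectors standing in for sentence-transformer embeddings.
toy = {
    "b-tree": (1.0, 0.05),
    "b tree": (0.98, 0.1),
    "btree index": (0.95, 0.2),
    "hash join": (0.05, 1.0),
}
groups = complete_linkage_groups(toy, threshold=0.95, max_group_size=2)
unsplit = complete_linkage_groups(toy, threshold=0.95, max_group_size=5)
```

With `max_group_size=2` the three B-tree variants cluster together and are then split; with a limit of 5 they stay in one group that would be sent to the LLM for verification.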

At query time, a CanonicalLookup object resolves new keywords: first via an O(1) synonym table lookup, then via embedding nearest-neighbor search gated by a configurable fallback_threshold.
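
A minimal sketch of that two-step resolution, under stated assumptions: the class name `CanonicalLookupSketch`, its constructor arguments, and the choice to return `None` below the threshold are all hypothetical, and the `embed` callable is a toy stand-in for the sentence-transformer.

```python
import math

class CanonicalLookupSketch:
    """Hypothetical stand-in for CanonicalLookup; not the PR's class."""

    def __init__(self, synonym_table, canonical_vectors, embed, fallback_threshold):
        self.synonym_table = synonym_table          # synonym -> canonical form
        self.canonical_vectors = canonical_vectors  # canonical form -> vector
        self.embed = embed                          # callable: str -> vector
        self.fallback_threshold = fallback_threshold

    @staticmethod
    def _cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    def resolve(self, keyword):
        # Step 1: exact hit in the synonym table (a dict lookup).
        hit = self.synonym_table.get(keyword)
        if hit is not None:
            return hit
        # Step 2: nearest canonical keyword by cosine similarity.
        query_vec = self.embed(keyword)
        best, best_sim = None, -1.0
        for canonical, vec in self.canonical_vectors.items():
            sim = self._cosine(query_vec, vec)
            if sim > best_sim:
                best, best_sim = canonical, sim
        # Accept the neighbor only if it clears the threshold (assumed behavior).
        return best if best_sim >= self.fallback_threshold else None

toy_embeddings = {"balanced tree": (0.9, 0.1), "query optimizer": (0.6, 0.6)}
lookup = CanonicalLookupSketch(
    synonym_table={"btree": "b-tree", "b tree": "b-tree"},
    canonical_vectors={"b-tree": (1.0, 0.0), "hash join": (0.0, 1.0)},
    embed=lambda kw: toy_embeddings[kw],
    fallback_threshold=0.8,
)
```

Here `lookup.resolve("btree")` is answered by the table, `"balanced tree"` falls through to the nearest-neighbor search, and `"query optimizer"` is rejected because its best similarity is below the threshold.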

2. Section Tree (`section_tree.py`)

Provides a structural retrieval signal by mirroring the textbook's heading hierarchy as a tree of SectionNode objects, each carrying aggregated KG keyword sets.

Construction (build_section_tree):

Unique sections are extracted from chunk metadata (section field).
Parent–child relationships are inferred from section numbers (e.g., 13.1 → parent 13).
Chunk IDs are assigned to their leaf section nodes.
Leaf keyword sets are populated from the KG graph's chunk_ids node attributes.
Keyword sets are propagated bottom-up so every ancestor contains the union of all descendant keywords.
Heading keywords (normalized n-grams) are extracted for each section to enable heading-level matching independent of KG vocabulary.
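
The parent inference and bottom-up propagation steps can be sketched as follows. This is an illustration only: plain dicts replace SectionNode, and the helper names `parent_of`, `build_tree`, and `propagate_up` are hypothetical rather than taken from `section_tree.py`.

```python
def parent_of(section_number):
    """'13.1.2' -> '13.1'; top-level sections like '13' have no parent."""
    head, sep, _ = section_number.rpartition(".")
    return head if sep else None

def build_tree(section_keywords):
    """section_keywords: section number -> keywords found directly in it."""
    nodes = {}
    for number in section_keywords:
        # Create the node and any ancestors that have no chunks of their own.
        while number is not None and number not in nodes:
            nodes[number] = {
                "children": [],
                "keywords": set(section_keywords.get(number, ())),
            }
            number = parent_of(number)
    for number in nodes:
        parent = parent_of(number)
        if parent is not None:
            nodes[parent]["children"].append(number)
    return nodes

def propagate_up(nodes):
    # Visit deepest sections first so each parent sees finished children.
    for number in sorted(nodes, key=lambda n: n.count("."), reverse=True):
        parent = parent_of(number)
        if parent is not None:
            nodes[parent]["keywords"] |= nodes[number]["keywords"]
    return nodes

tree = propagate_up(build_tree({
    "13.1": {"b-tree"},
    "13.2": {"hash index"},
    "13.1.1": {"node split"},
}))
```

After propagation, section 13 (which had no chunks of its own) carries the union of every descendant's keywords.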

Scoring (get_chunk_scores): hybrid signal per section node blending two components:

KG keyword overlap (coverage × α + specificity × (1 − α)) using the node's aggregated keyword set.
Heading keyword match: the same formula applied to the query's own tokens (tokenized independently of the KG vocabulary) against the heading keyword set. This captures queries phrased differently from the KG vocabulary.
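
The PR gives the blend formula but not the definitions of its two terms, so the sketch below assumes the common set-overlap reading: coverage is the share of query terms found in the node's set, and specificity is the share of the node's set that the query hits. The function name `overlap_score` is hypothetical.

```python
def overlap_score(query_terms, node_terms, alpha):
    """coverage * alpha + specificity * (1 - alpha).

    Assumed definitions (not stated in the PR):
      coverage    = |query ∩ node| / |query|
      specificity = |query ∩ node| / |node|
    """
    query_terms, node_terms = set(query_terms), set(node_terms)
    if not query_terms or not node_terms:
        return 0.0
    shared = len(query_terms & node_terms)
    coverage = shared / len(query_terms)
    specificity = shared / len(node_terms)
    return alpha * coverage + (1 - alpha) * specificity

score = overlap_score(
    {"b-tree", "split"},
    {"b-tree", "split", "merge", "leaf"},
    alpha=0.5,
)
```

Under these assumptions a higher α favors sections that cover the whole query, while a lower α favors narrow sections whose keyword set is mostly about the query.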

A top-down inheritance pass then propagates effective scores to children:

effective(node) = own_score(node) + inheritance_decay × effective(parent)

This ensures that relevance detected at a section level flows through to its subsections. Final scores are normalized to [0, 1].
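
A short sketch of the inheritance pass, assuming a parent map keyed by section number. The function name `effective_scores` is hypothetical, and dividing by the maximum is an assumption: the PR only says scores are normalized to [0, 1].

```python
def effective_scores(own_score, parent, inheritance_decay):
    """own_score: node -> raw score; parent: node -> parent node or None."""
    effective = {}

    def resolve(node):
        # effective(node) = own_score(node) + decay * effective(parent)
        if node not in effective:
            up = parent.get(node)
            inherited = inheritance_decay * resolve(up) if up is not None else 0.0
            effective[node] = own_score.get(node, 0.0) + inherited
        return effective[node]

    for node in own_score:
        resolve(node)
    # Assumed normalization: divide by the largest effective score.
    top = max(effective.values(), default=0.0)
    return {n: (s / top if top > 0 else 0.0) for n, s in effective.items()}

scores = effective_scores(
    own_score={"13": 0.8, "13.1": 0.2, "13.1.1": 0.0},
    parent={"13": None, "13.1": "13", "13.1.1": "13.1"},
    inheritance_decay=0.5,
)
```

In this toy run section 13.1.1 has no match of its own, yet it ends up with a nonzero score inherited through 13.1 from 13.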

The tree serializes to/from section_tree.json in each run directory.

3. Query Difficulty Analysis (`analysis.py`, `models.py`)

Estimates how hard a query is to answer based on graph-structural properties of the subgraph induced by the query's matched nodes.

Features (QueryFeatures):

| Feature | Description |
| --- | --- |
| `query_node_count` | Number of matched graph nodes |
| `max_path_length` | Longest shortest path between any two query nodes |
| `avg_path_length` | Average shortest path length |
| `component_count` | Connected components in the induced subgraph |
| `avg_degree` / `max_degree` | Fan-out of the subgraph |
| `subgraph_node_count` / `subgraph_edge_count` | Size of the concept subgraph |
| `doc_count` | Number of distinct source chunks touched |
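
To make the structural features concrete, here is a pure-Python sketch that computes most of them on a toy undirected graph. It is not the code in `analysis.py`: the adjacency-dict input, the helper names, and the decision to skip unreachable pairs when measuring path lengths are assumptions. `doc_count` is omitted because it needs chunk metadata.

```python
from collections import deque
from itertools import combinations

def bfs_distances(adj, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def query_features(adj, query_nodes):
    """adj: node -> set of neighbors (symmetric, every node is a key)."""
    dist = {q: bfs_distances(adj, q) for q in query_nodes}
    # Shortest-path lengths for query-node pairs that can reach each other.
    lengths = [dist[a][b] for a, b in combinations(query_nodes, 2) if b in dist[a]]
    # Count components by flood-filling from each unseen node.
    seen, components = set(), 0
    for node in adj:
        if node not in seen:
            components += 1
            seen.update(bfs_distances(adj, node))
    degrees = [len(neighbors) for neighbors in adj.values()]
    return {
        "query_node_count": len(query_nodes),
        "max_path_length": max(lengths, default=0),
        "avg_path_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "component_count": components,
        "avg_degree": sum(degrees) / len(degrees) if degrees else 0.0,
        "max_degree": max(degrees, default=0),
        "subgraph_node_count": len(adj),
        "subgraph_edge_count": sum(degrees) // 2,
    }

adj = {
    "b-tree": {"index"},
    "index": {"b-tree", "query plan"},
    "query plan": {"index"},
    "wal": {"recovery"},
    "recovery": {"wal"},
}
features = query_features(adj, ["b-tree", "query plan", "wal"])
```

The toy query touches two disconnected concept clusters, which shows up as `component_count == 2`.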

Scoring (DifficultyScore): five orthogonal dimensions each contribute 0–2 points:

| Dimension | Proxy for |
| --- | --- |
| `multihop` | Reasoning chain length (path hops) |
| `fragmentation` | Disconnected concept clusters |
| `subgraph_size` | Breadth of knowledge needed |
| `branching` | Ambiguity / fan-out |
| `dispersion` | Spread across source documents |

Total score (0–10) maps to EASY / MEDIUM / HARD. The analyze_query script outputs results as JSON.
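
The scoring scheme can be sketched as a set of two-cutoff buckets. Everything numeric below is hypothetical: the PR does not state the per-dimension cutoffs, which feature feeds each dimension, or the boundaries between EASY, MEDIUM, and HARD, so the values here are placeholders chosen only to make the example run.

```python
def bucket(value, low, high):
    """Map a raw feature to 0, 1, or 2 points using two cutoffs."""
    return 0 if value <= low else (1 if value <= high else 2)

# Hypothetical mapping: dimension -> (feature name, low cutoff, high cutoff).
CUTOFFS = {
    "multihop": ("max_path_length", 1, 3),
    "fragmentation": ("component_count", 1, 2),
    "subgraph_size": ("subgraph_node_count", 5, 15),
    "branching": ("avg_degree", 2.0, 4.0),
    "dispersion": ("doc_count", 2, 5),
}

def difficulty(features):
    points = {
        dim: bucket(features[name], low, high)
        for dim, (name, low, high) in CUTOFFS.items()
    }
    total = sum(points.values())  # five dimensions x 0-2 points = 0-10
    # Hypothetical label boundaries on the 0-10 total.
    label = "EASY" if total <= 3 else ("MEDIUM" if total <= 6 else "HARD")
    return {"points": points, "total": total, "label": label}

result = difficulty({
    "max_path_length": 2,
    "component_count": 2,
    "subgraph_node_count": 5,
    "avg_degree": 1.2,
    "doc_count": 6,
})
```

Keeping each dimension on the same 0 to 2 scale means no single feature can dominate the total.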

4. KG Retriever (`query.py`)

Adds a KGRetriever class that implements the duck-typed Retriever interface (a name attribute plus a get_scores method), making it a drop-in component for the RAG EnsembleRanker.

Node-match scoring (retrieve_from_kg):

Chunks referenced by directly-matched query nodes receive +1.0.
BFS traversal up to num_hops propagates neighbor contributions with geometric decay: neighbor_weight^hop × (edge_weight / max_edge_weight).
Each graph node is visited only once at its closest hop. Scores are normalized to [0, 1].
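
The traversal above can be sketched with a plain BFS. This is an illustration, not `retrieve_from_kg` itself: the function name `kg_chunk_scores` and the dict-based graph are hypothetical, and two details are assumptions, namely that a neighbor's contribution uses the weight of the edge through which it is first reached, and that normalization divides by the maximum score.

```python
from collections import deque

def kg_chunk_scores(adj, node_chunks, query_nodes, num_hops, neighbor_weight):
    """adj: node -> {neighbor: edge_weight}; node_chunks: node -> chunk ids."""
    max_edge_weight = max(
        (w for nbrs in adj.values() for w in nbrs.values()), default=1.0
    )
    scores = {}
    visited = set(query_nodes)
    queue = deque()
    for node in query_nodes:
        # Directly matched nodes give each of their chunks +1.0.
        for chunk in node_chunks.get(node, ()):
            scores[chunk] = scores.get(chunk, 0.0) + 1.0
        queue.append((node, 0))
    while queue:
        node, hop = queue.popleft()
        if hop == num_hops:
            continue
        for nbr, weight in adj.get(node, {}).items():
            if nbr in visited:
                continue  # each node counts once, at its closest hop
            visited.add(nbr)
            contribution = neighbor_weight ** (hop + 1) * (weight / max_edge_weight)
            for chunk in node_chunks.get(nbr, ()):
                scores[chunk] = scores.get(chunk, 0.0) + contribution
            queue.append((nbr, hop + 1))
    # Assumed normalization: divide by the largest score.
    top = max(scores.values(), default=0.0)
    return {c: (s / top if top > 0 else 0.0) for c, s in scores.items()}

adj = {
    "b-tree": {"index": 2.0},
    "index": {"b-tree": 2.0, "query plan": 1.0},
    "query plan": {"index": 1.0},
}
node_chunks = {"b-tree": [1], "index": [2], "query plan": [3]}
two_hops = kg_chunk_scores(adj, node_chunks, ["b-tree"], num_hops=2, neighbor_weight=0.5)
one_hop = kg_chunk_scores(adj, node_chunks, ["b-tree"], num_hops=1, neighbor_weight=0.5)
```

In the toy graph, chunk 3 is two hops away over a weaker edge, so it receives 0.5² × (1.0 / 2.0) = 0.125.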

Section blending: when a SectionTree is provided, the final score is a weighted combination:

combined = beta × section_score + (1 - beta) × node_score

Set beta = 0.0 to fall back to pure node-match scoring.
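
A sketch of the blend wrapped in a class that exposes the two duck-typed members, `name` and `get_scores`. The class name `BlendedRetrieverSketch`, the `get_scores(query)` signature, and the scorer callables are assumptions made for illustration; the PR does not show KGRetriever's actual signature.

```python
class BlendedRetrieverSketch:
    """Hypothetical retriever exposing `name` and `get_scores`."""

    name = "kg_sketch"

    def __init__(self, node_scorer, section_scorer=None, beta=0.5):
        self.node_scorer = node_scorer        # callable: query -> {chunk: score}
        self.section_scorer = section_scorer  # same shape, or None without a tree
        self.beta = beta

    def get_scores(self, query):
        node_scores = self.node_scorer(query)
        if self.section_scorer is None or self.beta == 0.0:
            # Pure node-match scoring.
            return dict(node_scores)
        section_scores = self.section_scorer(query)
        chunks = set(node_scores) | set(section_scores)
        # combined = beta * section_score + (1 - beta) * node_score
        return {
            c: self.beta * section_scores.get(c, 0.0)
            + (1 - self.beta) * node_scores.get(c, 0.0)
            for c in chunks
        }

node_scorer = lambda query: {1: 1.0, 2: 0.5}
section_scorer = lambda query: {2: 1.0, 3: 0.5}
blended = BlendedRetrieverSketch(node_scorer, section_scorer, beta=0.5).get_scores("q")
node_only = BlendedRetrieverSketch(node_scorer, section_scorer, beta=0.0).get_scores("q")
```

Chunks that appear in only one signal are treated as scoring 0.0 in the other, which is also an assumption of this sketch.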

CanonicalLookup is threaded through extract_query_nodes so synonym-aware matching is available at retrieval time.

Supporting Changes

io.py — clean I/O utilities: load_graph, load_run_chunks, resolve_run_dir (handles both specific run dirs and latest symlinks), load_graph_and_chunks, load_graph_chunks_and_tree, load_canonicalization_data.
utils/normalizer.py, utils/ngrams.py — shared text normalization and n-gram extraction used across canonicalization, section tree, and query matching.
scripts/ — scripts reorganized into a subdirectory; analyze_query.py and inspect_run.py added. inspect_run.py prints graph statistics, degree distribution, and section tree summaries for a given run directory.
persisters/networkx_json_persister.py — extended to optionally persist canonicalization artifacts (synonym_table.json, canonical_keywords.json, canonical_embeddings.npy) and the section tree.
pipeline.py — canonicalization step wired in as an optional stage between extraction and linking.
tests/test_knowledge_graph.py — unit tests covering the new modules.


santo0 · 2026-04-14T19:18:07Z

I'm trying to simplify this PR; I will notify when I'm done.

feat: Add LLM-based canonicalization, section tree knowledge graph pipeline

e2e1d88

santo0 changed the title ~~feat: Add LLM-based canonicalization, section tree knowledge graph~~ Knowledge Graph Enhancements: Canonicalization, Section Tree, Query Difficulty and KG Retriever Apr 3, 2026

santo0 changed the title ~~Knowledge Graph Enhancements: Canonicalization, Section Tree, Query Difficulty and KG Retriever~~ Canonicalization, Section Tree, Summary Tree and KG Retriever Apr 3, 2026

santo0 added 7 commits April 8, 2026 22:20

refactor: Simplify synonym handling and enhance prompt clarity in canonicalization

70e0fdf

feat: Enhance canonicalization process with new caching and configuration features

440fdc2

refactor: KGNodeRetriever and SectionTreeRetriever

bf7a1e9

feat: Add benchmark retrieval script with LLM grading and metrics evaluation

cd6dcea

feat: Enhance difficulty analysis by integrating canonical lookup in query processing

c8a7bc3

feat: Update embedding model to Sentence-Transformers and enhance README with new commands

0dc609f

Merge remote-tracking branch 'origin/kg-query-difficulty' into kg-enhancement

d05c7ce

santo0 changed the title ~~Canonicalization, Section Tree, Summary Tree and KG Retriever~~ Canonicalization, Section Tree and KG Retriever Apr 10, 2026

santo0 marked this pull request as ready for review April 10, 2026 13:59

santo0 added 3 commits April 10, 2026 10:11

Merge branch 'kg-query-difficulty' into kg-enhancement

8583c4b

refactor: Remove commented-out print statements and unused ideal metrics calculations in benchmark retrieval script

fc14262

fix: Update JSON dump to handle non-serializable objects in benchmark results

3d51cfa

santo0 changed the title ~~Canonicalization, Section Tree and KG Retriever~~ 2.Canonicalization, Section Tree and KG Retriever Apr 15, 2026

This was referenced Apr 15, 2026

1.Knowledge Graph Building Pipeline #97

Open

3.Summary Tree & Retriever Integration #105

Open

santo0 added 4 commits April 15, 2026 15:58

Merge branch 'kg-query-difficulty' into kg-enhancement

7c0d997

Merge branch 'kg-query-difficulty' into kg-enhancement

f1c2719

fix: persist canonicalization results correctly

04ea0f2

fix: import errors and deprecate tests

9702a0e


santo0 wants to merge 15 commits into kg-query-difficulty from kg-enhancement
