
2. Canonicalization, Section Tree and KG Retriever #98

Open: santo0 wants to merge 15 commits into kg-query-difficulty from kg-enhancement

santo0 commented Mar 30, 2026

Knowledge Graph Enhancements: Canonicalization, Section Tree, Query Difficulty and KG Retriever

Builds on the kg-query-difficulty pipeline with four major additions: LLM-based keyword canonicalization, a section-tree retrieval layer, a formal query difficulty model, and a RAG-compatible KGRetriever.

PR structure

This PR is the second in a series of three; each PR depends on the one before it.

  1. Knowledge Graph Building Pipeline #97
  2. Current (this PR)
  3. Summary Tree & Retriever Integration #105

1. Keyword Canonicalization (canonicalizer.py)

Reduces synonym noise in the KG by collapsing semantically equivalent keywords into a single canonical form before graph construction.

The process runs in four stages:

  1. Embed — all unique keywords are encoded with a sentence-transformer (all-MiniLM-L6-v2).
  2. Cluster — complete-linkage hierarchical clustering with a cosine similarity threshold groups likely synonyms. Oversized clusters are force-split to respect the max_group_size limit.
  3. LLM verification — groups are sent to an OpenRouter LLM in batches. The model decides which keywords are true synonyms and elects a canonical form per group. Small groups (≤5 keywords) are batched together to reduce API calls.
  4. Apply — ExtractionResult objects are rewritten to use canonical forms; any keyword not resolved by the LLM falls back to itself.
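
The clustering stage can be sketched as follows. This is a minimal, pure-Python illustration and not the code in canonicalizer.py: the function names are hypothetical, the toy 2-D vectors stand in for all-MiniLM-L6-v2 embeddings, and chunking oversized clusters in order is only one possible way to "force-split" them.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def complete_linkage_groups(vectors, threshold, max_group_size):
    """Greedy agglomerative clustering over vector indices.

    Two clusters merge only if every cross-pair is at least `threshold`
    similar, which is what complete linkage requires.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Complete linkage: the worst (lowest) pairwise similarity.
        return min(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)

    while True:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            sim = linkage(clusters[a], clusters[b])
            if sim >= threshold and (best is None or sim > best[0]):
                best = (sim, a, b)
        if best is None:
            break
        _, a, b = best  # a < b, so deleting b keeps index a valid
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    # ASSUMPTION: oversized clusters are split into consecutive chunks.
    groups = []
    for cluster in clusters:
        for start in range(0, len(cluster), max_group_size):
            groups.append(cluster[start:start + max_group_size])
    return groups


toy_vectors = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0), (0.1, 0.99)]
groups = complete_linkage_groups(toy_vectors, threshold=0.9, max_group_size=5)
```

Each resulting group of indices would then be mapped back to its keywords and sent to the LLM for verification.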

At query time, a CanonicalLookup object resolves new keywords: first via an O(1) synonym table lookup, then via embedding nearest-neighbor search gated by a configurable fallback_threshold.
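
A rough sketch of that two-stage resolution is shown below. The class name, constructor arguments, and the choice to return None on a miss are assumptions made for illustration; the real CanonicalLookup may differ in all of them. A dictionary of toy vectors stands in for the sentence-transformer.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class CanonicalLookupSketch:
    """Hypothetical stand-in for CanonicalLookup."""

    def __init__(self, synonym_table, canonical_vectors, embed, fallback_threshold=0.8):
        self.synonym_table = synonym_table          # surface form -> canonical form
        self.canonical_vectors = canonical_vectors  # canonical form -> embedding
        self.embed = embed                          # callable: str -> vector
        self.fallback_threshold = fallback_threshold

    def resolve(self, keyword):
        # Stage 1: O(1) dictionary hit on a known synonym.
        hit = self.synonym_table.get(keyword)
        if hit is not None:
            return hit
        # Stage 2: nearest canonical keyword by cosine similarity.
        vec = self.embed(keyword)
        best, best_sim = None, -1.0
        for canon, canon_vec in self.canonical_vectors.items():
            sim = cosine(vec, canon_vec)
            if sim > best_sim:
                best, best_sim = canon, sim
        # ASSUMPTION: a neighbour below the threshold is treated as a miss.
        return best if best_sim >= self.fallback_threshold else None


toy_embeddings = {"btree index": (0.95, 0.05), "quantum": (0.7, 0.7)}
lookup = CanonicalLookupSketch(
    synonym_table={"b+ tree": "b-tree"},
    canonical_vectors={"b-tree": (1.0, 0.0), "hash index": (0.0, 1.0)},
    embed=lambda kw: toy_embeddings[kw],
)
```

Here "b+ tree" resolves through the table, "btree index" through the nearest-neighbor fallback, and "quantum" is rejected because neither canonical vector is close enough.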


2. Section Tree (section_tree.py)

Provides a structural retrieval signal by mirroring the textbook's heading hierarchy as a tree of SectionNode objects, each carrying aggregated KG keyword sets.

Construction (build_section_tree):

  1. Unique sections are extracted from chunk metadata (section field).
  2. Parent–child relationships are inferred from section numbers (e.g., 13.1 → parent 13).
  3. Chunk IDs are assigned to their leaf section nodes.
  4. Leaf keyword sets are populated from the KG graph's chunk_ids node attributes.
  5. Keyword sets are propagated bottom-up so every ancestor contains the union of all descendant keywords.
  6. Heading keywords (normalized n-grams) are extracted for each section to enable heading-level matching independent of KG vocabulary.
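
Steps 2 and 5 can be illustrated with a small sketch. The helper names are hypothetical, plain dictionaries replace SectionNode objects, and keywords are given per section directly instead of being looked up through chunk IDs.

```python
def parent_of(section_number):
    """'13.1.2' -> '13.1', '13.1' -> '13', '13' -> None."""
    head, sep, _ = section_number.rpartition(".")
    return head if sep else None


def build_keyword_tree(section_keywords):
    """section_keywords: section number -> set of KG keywords in its chunks.

    Returns section number -> keyword set including all descendants.
    """
    nodes = {}
    for number, keywords in section_keywords.items():
        nodes.setdefault(number, set()).update(keywords)
        # Create any missing ancestors so every node has a chain to its root.
        parent = parent_of(number)
        while parent is not None:
            nodes.setdefault(parent, set())
            parent = parent_of(parent)
    # Deepest sections first, so children are final before they feed parents.
    for number in sorted(nodes, key=lambda n: n.count("."), reverse=True):
        parent = parent_of(number)
        if parent is not None:
            nodes[parent] |= nodes[number]
    return nodes


tree = build_keyword_tree({
    "13.1": {"index"},
    "13.1.1": {"b-tree"},
    "13.2": {"hash"},
})
```

After propagation, section 13 holds the union of everything beneath it, while 13.1 holds only its own keyword plus that of 13.1.1.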

Scoring (get_chunk_scores): a hybrid signal per section node that blends two components:

  • KG keyword overlap (coverage × α + specificity × (1 − α)) using the node's aggregated keyword set.
  • Heading keyword match — same formula applied to independently-tokenized query tokens against the heading keyword set. Captures queries phrased differently from KG vocabulary.
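
The shared formula can be sketched as below. The description does not define coverage or specificity, so this example assumes coverage is the fraction of query terms found in the node's set and specificity is the fraction of the node's set that the query hits; both definitions, and the function name, are assumptions.

```python
def overlap_score(query_terms, node_terms, alpha=0.5):
    """coverage * alpha + specificity * (1 - alpha) for two term sets."""
    query_terms, node_terms = set(query_terms), set(node_terms)
    if not query_terms or not node_terms:
        return 0.0
    shared = len(query_terms & node_terms)
    coverage = shared / len(query_terms)     # ASSUMPTION: share of the query matched
    specificity = shared / len(node_terms)   # ASSUMPTION: share of the node matched
    return alpha * coverage + (1 - alpha) * specificity


kg_score = overlap_score({"b-tree", "index"}, {"b-tree", "index", "hash", "join"})
heading_score = overlap_score({"tree", "index"}, {"index", "structures"})
```

Under these definitions a large ancestor section covers many queries but scores low on specificity, so alpha trades recall against focus.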

A top-down inheritance pass then propagates effective scores to children:

effective(node) = own_score(node) + inheritance_decay × effective(parent)

This ensures that relevance detected at a section level flows through to its subsections. Final scores are normalized to [0, 1].
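
A minimal sketch of the inheritance pass, assuming parents are given as a simple mapping and that normalization divides by the maximum effective score (the description does not say which normalization is used):

```python
def effective_scores(own_scores, parents, inheritance_decay=0.5):
    """own_scores: node -> own score; parents: node -> parent (None for roots)."""
    effective = {}

    def resolve(node):
        # Memoised top-down recurrence: a node needs its parent's value first.
        if node not in effective:
            parent = parents.get(node)
            inherited = inheritance_decay * resolve(parent) if parent is not None else 0.0
            effective[node] = own_scores.get(node, 0.0) + inherited
        return effective[node]

    for node in own_scores:
        resolve(node)

    top = max(effective.values(), default=0.0)
    # ASSUMPTION: max-normalisation maps the scores into [0, 1].
    return {n: (s / top if top > 0 else 0.0) for n, s in effective.items()}


scores = effective_scores(
    own_scores={"13": 0.8, "13.1": 0.2, "13.1.1": 0.0},
    parents={"13": None, "13.1": "13", "13.1.1": "13.1"},
    inheritance_decay=0.5,
)
```

In this example section 13.1.1 has no match of its own but still receives a non-zero score through two levels of decayed inheritance.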

The tree serializes to/from section_tree.json in each run directory.


3. Query Difficulty Analysis (analysis.py, models.py)

Estimates how hard a query is to answer based on graph-structural properties of the subgraph induced by the query's matched nodes.

Features (QueryFeatures):

| Feature | Description |
| --- | --- |
| query_node_count | Number of matched graph nodes |
| max_path_length | Longest shortest path between any two query nodes |
| avg_path_length | Average shortest path length |
| component_count | Connected components in the induced subgraph |
| avg_degree / max_degree | Fan-out of the subgraph |
| subgraph_node_count / subgraph_edge_count | Size of the concept subgraph |
| doc_count | Number of distinct source chunks touched |
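
Most of these features can be computed with plain breadth-first search. The sketch below is an illustration only: it takes an already-extracted undirected subgraph as an adjacency dictionary, ignores unreachable pairs when averaging path lengths (an assumption), and omits doc_count because that needs chunk metadata.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def query_features(adj, query_nodes):
    """adj: undirected subgraph as node -> set of neighbours."""
    dist = {q: bfs_distances(adj, q) for q in query_nodes}
    # ASSUMPTION: only connected pairs contribute a path length.
    lengths = [dist[a][b] for a, b in combinations(query_nodes, 2) if b in dist[a]]

    seen, components = set(), 0
    for node in adj:
        if node not in seen:
            components += 1
            seen.update(bfs_distances(adj, node))

    degrees = [len(neighbours) for neighbours in adj.values()]
    return {
        "query_node_count": len(query_nodes),
        "max_path_length": max(lengths, default=0),
        "avg_path_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "component_count": components,
        "avg_degree": sum(degrees) / len(degrees) if degrees else 0.0,
        "max_degree": max(degrees, default=0),
        "subgraph_node_count": len(adj),
        "subgraph_edge_count": sum(degrees) // 2,  # each edge is counted twice
    }


toy_subgraph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": set()}
features = query_features(toy_subgraph, ["a", "c", "d"])
```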

Scoring (DifficultyScore): five orthogonal dimensions each contribute 0–2 points:

| Dimension | Proxy for |
| --- | --- |
| multihop | Reasoning chain length (path hops) |
| fragmentation | Disconnected concept clusters |
| subgraph_size | Breadth of knowledge needed |
| branching | Ambiguity / fan-out |
| dispersion | Spread across source documents |

Total score (0–10) maps to EASY / MEDIUM / HARD. The analyze_query script outputs results as JSON.
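
The aggregation step can be sketched as follows. The description does not give the cutoffs between EASY, MEDIUM and HARD, nor how each feature is bucketed into 0–2 points, so the cutoffs below (0–3, 4–6, 7–10) and the class name are purely illustrative assumptions.

```python
from dataclasses import dataclass

DIMENSIONS = ("multihop", "fragmentation", "subgraph_size", "branching", "dispersion")


@dataclass
class DifficultySketch:
    """Hypothetical stand-in for DifficultyScore."""

    points: dict  # dimension name -> points in {0, 1, 2}

    def __post_init__(self):
        for name in DIMENSIONS:
            value = self.points.get(name, 0)
            if not 0 <= value <= 2:
                raise ValueError(f"{name} must be between 0 and 2, got {value}")

    @property
    def total(self):
        return sum(self.points.get(name, 0) for name in DIMENSIONS)

    @property
    def label(self):
        # ASSUMPTION: these cutoffs are illustrative, not taken from the PR.
        if self.total <= 3:
            return "EASY"
        if self.total <= 6:
            return "MEDIUM"
        return "HARD"


score = DifficultySketch(
    {"multihop": 2, "fragmentation": 1, "subgraph_size": 1, "branching": 0, "dispersion": 1}
)
```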


4. KG Retriever (query.py)

A KGRetriever class implementing the duck-typed Retriever interface (name attribute + get_scores method), making it a drop-in component for the RAG EnsembleRanker.
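
Duck typing of this kind can be made explicit with typing.Protocol. The sketch below assumes get_scores takes a query string and returns a mapping from chunk ID to score; the actual signature expected by EnsembleRanker is not shown in this description, and ConstantRetriever is a hypothetical placeholder.

```python
from typing import Dict, Protocol, runtime_checkable


@runtime_checkable
class Retriever(Protocol):
    """Structural interface: any object with these members qualifies."""

    name: str

    def get_scores(self, query: str) -> Dict[str, float]:
        ...


class ConstantRetriever:
    """Toy retriever that never inherits from Retriever."""

    name = "constant"

    def get_scores(self, query: str) -> Dict[str, float]:
        # ASSUMPTION: scores are keyed by chunk ID.
        return {"chunk-0": 1.0}


is_retriever = isinstance(ConstantRetriever(), Retriever)
```

Because the check is structural, a KGRetriever needs no shared base class to be accepted wherever a retriever is expected.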

Node-match scoring (retrieve_from_kg):

  • Chunks referenced by directly-matched query nodes receive +1.0.
  • BFS traversal up to num_hops propagates neighbor contributions with geometric decay: neighbor_weight^hop × (edge_weight / max_edge_weight).
  • Each graph node is visited only once at its closest hop. Scores are normalized to [0, 1].
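
The three rules above can be sketched as a single BFS. This is an illustration under assumptions, not the body of retrieve_from_kg: the graph is a weighted adjacency dictionary, normalization divides by the maximum chunk score, and when several edges reach a node at the same hop the first one encountered supplies the edge weight.

```python
from collections import deque


def node_match_scores(adj, node_chunks, query_nodes, num_hops=2, neighbor_weight=0.5):
    """adj: node -> {neighbour: edge_weight}; node_chunks: node -> chunk IDs."""
    max_w = max((w for nbs in adj.values() for w in nbs.values()), default=1.0)
    scores = {}
    visited = set(query_nodes)
    queue = deque()

    for q in query_nodes:
        # Directly matched nodes give each of their chunks +1.0.
        for chunk in node_chunks.get(q, ()):
            scores[chunk] = scores.get(chunk, 0.0) + 1.0
        queue.append((q, 0))

    while queue:
        node, hop = queue.popleft()
        if hop == num_hops:
            continue  # do not expand beyond the hop limit
        for neighbour, weight in adj.get(node, {}).items():
            if neighbour in visited:
                continue
            visited.add(neighbour)  # BFS order makes the first visit the closest hop
            contribution = neighbor_weight ** (hop + 1) * (weight / max_w)
            for chunk in node_chunks.get(neighbour, ()):
                scores[chunk] = scores.get(chunk, 0.0) + contribution
            queue.append((neighbour, hop + 1))

    top = max(scores.values(), default=0.0)
    # ASSUMPTION: max-normalisation maps the scores into [0, 1].
    return {c: s / top for c, s in scores.items()} if top > 0 else {}


toy_graph = {"a": {"b": 2.0}, "b": {"a": 2.0, "c": 1.0}, "c": {"b": 1.0}}
toy_chunks = {"a": ["c1"], "b": ["c2"], "c": ["c3"]}
two_hop = node_match_scores(toy_graph, toy_chunks, ["a"], num_hops=2)
one_hop = node_match_scores(toy_graph, toy_chunks, ["a"], num_hops=1)
```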

Section blending: when a SectionTree is provided, the final score is a weighted combination:

combined = beta × section_score + (1 - beta) × node_score

Set beta = 0.0 to fall back to pure node-match scoring.
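
As a small sketch of the blend, assuming both signals are dictionaries keyed by chunk ID and that a chunk missing from one signal counts as 0.0 there (the function name and that default are assumptions):

```python
def blend_scores(node_scores, section_scores, beta=0.5):
    """combined = beta * section_score + (1 - beta) * node_score, per chunk."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    chunks = set(node_scores) | set(section_scores)
    return {
        # ASSUMPTION: a chunk absent from one signal scores 0.0 for that signal.
        c: beta * section_scores.get(c, 0.0) + (1 - beta) * node_scores.get(c, 0.0)
        for c in chunks
    }


node_only = blend_scores({"c1": 1.0, "c2": 0.5}, {"c2": 1.0}, beta=0.0)
blended = blend_scores({"c1": 1.0, "c2": 0.5}, {"c2": 1.0}, beta=0.5)
```

With beta = 0.0 the result equals the node-match scores, which matches the fallback behavior described above.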

CanonicalLookup is threaded through extract_query_nodes so synonym-aware matching is available at retrieval time.


Supporting Changes

  • io.py — clean I/O utilities: load_graph, load_run_chunks, resolve_run_dir (handles both specific run dirs and latest symlinks), load_graph_and_chunks, load_graph_chunks_and_tree, load_canonicalization_data.
  • utils/normalizer.py, utils/ngrams.py — shared text normalization and n-gram extraction used across canonicalization, section tree, and query matching.
  • scripts/ — scripts reorganized into a subdirectory; analyze_query.py and inspect_run.py added. inspect_run.py prints graph statistics, degree distribution, and section tree summaries for a given run directory.
  • persisters/networkx_json_persister.py — extended to optionally persist canonicalization artifacts (synonym_table.json, canonical_keywords.json, canonical_embeddings.npy) and the section tree.
  • pipeline.py — canonicalization step wired in as an optional stage between extraction and linking.
  • tests/test_knowledge_graph.py — unit tests covering the new modules.

santo0 changed the title from "feat: Add LLM-based canonicalization, section tree knowledge graph" to "Knowledge Graph Enhancements: Canonicalization, Section Tree, Query Difficulty and KG Retriever" on Apr 3, 2026
santo0 changed the title from "Knowledge Graph Enhancements: Canonicalization, Section Tree, Query Difficulty and KG Retriever" to "Canonicalization, Section Tree, Summary Tree and KG Retriever" on Apr 3, 2026
santo0 changed the title from "Canonicalization, Section Tree, Summary Tree and KG Retriever" to "Canonicalization, Section Tree and KG Retriever" on Apr 10, 2026
santo0 marked this pull request as ready for review on April 10, 2026 at 13:59

santo0 commented Apr 14, 2026

I'm trying to simplify this PR; I will notify you when I'm done.

santo0 changed the title from "Canonicalization, Section Tree and KG Retriever" to "2.Canonicalization, Section Tree and KG Retriever" on Apr 15, 2026