GraphIntel — GraphRAG vs Flat RAG: A Multi-Hop Reasoning Benchmark on Knowledge Graphs

1. Overview

GraphIntel benchmarks GraphRAG against Flat RAG on 50 multi-hop QA questions over a 2,192-node Neo4j Game of Thrones (GoT) knowledge graph. Diagnosing and fixing three retrieval pipeline bugs raised GraphRAG accuracy from 10% to 56%.

2. The Research Question

"Does graph-structured retrieval outperform flat vector similarity for multi-hop factual queries — and by how much?"

3. Architecture

         User Query
             |
             v
      [ Query Embedding ]
             |
      +------+------+
      |             |
[Flat RAG]     [Graph RAG]
      |             |
 (Cosine Sim)  (Cosine Sim for Seeds)
      |             |
  (Top K)       (Top K Seeds)
      |             |
      |        [Graph Traversal]
      |             |
      |        [Community Context]
      |             |
[LLM Answer]   [LLM Answer]
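
The two branches of the diagram can be illustrated with a minimal sketch. This is not the repository's implementation: the function names, the toy vectors, and the adjacency dictionary are all hypothetical, and the community-context and LLM steps are left out. Both paths rank nodes by cosine similarity; the graph path then expands the top-k seeds by one hop.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, node_vecs, k):
    """Return the k node names most similar to the query, best first."""
    ranked = sorted(
        node_vecs,
        key=lambda name: cosine(query_vec, node_vecs[name]),
        reverse=True,
    )
    return ranked[:k]

def flat_rag_context(query_vec, node_vecs, k=2):
    """Flat path: the top-k nodes are the whole context."""
    return top_k(query_vec, node_vecs, k)

def graph_rag_context(query_vec, node_vecs, edges, k=2):
    """Graph path: top-k nodes act as seeds, then one hop of neighbors is added."""
    seeds = top_k(query_vec, node_vecs, k)
    context = list(seeds)
    for seed in seeds:
        for neighbor in edges.get(seed, []):
            if neighbor not in context:
                context.append(neighbor)
    return context

# Toy data (hypothetical embeddings and edges, for illustration only).
node_vecs = {
    "Jon Snow": [0.9, 0.1, 0.0],
    "Winterfell": [0.7, 0.3, 0.0],
    "Tywin Lannister": [0.0, 0.2, 0.9],
}
edges = {"Jon Snow": ["Ned Stark"], "Winterfell": ["House Stark"]}
query_vec = [1.0, 0.0, 0.0]

print(flat_rag_context(query_vec, node_vecs))
print(graph_rag_context(query_vec, node_vecs, edges))
```

The graph path returns the same seeds as the flat path plus their neighbors, which is where any multi-hop advantage would come from.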

4. Dataset: GoT Knowledge Graph

Node Types: 12 (Person, House, Location, Event, etc.)
Relationship Types: 50+ (Father, Mother, Allegiance, Conflict, etc.)
Total Nodes: 2,192
Total Relationships: 13,572
Source: gameofthrones.fandom.com wiki

5. Evaluation Design

The benchmark uses 50 hand-crafted multi-hop questions across 5 categories.
Difficulty ranges from 1-hop to 3-hop.
Each answer is evaluated on accuracy (exact and semantic match), faithfulness, and latency.
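
As a rough illustration of how an "exact + semantic" accuracy score could be computed, here is a minimal sketch. The README does not specify the scoring rules, so everything here is an assumption: the normalization, the containment-based exact match, the optional `semantic_score` callable, and the 0.8 threshold are all hypothetical.

```python
import re

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def is_correct(predicted, gold, semantic_score=None, threshold=0.8):
    """Count a prediction as correct on exact or (optionally) semantic match.

    Exact match here means the normalized gold answer appears inside the
    normalized prediction (an assumption). `semantic_score` is a hypothetical
    callable returning a similarity in [0, 1].
    """
    if normalize(gold) in normalize(predicted):
        return True
    if semantic_score is not None:
        return semantic_score(predicted, gold) >= threshold
    return False

def accuracy(predictions, golds, **kwargs):
    """Percentage of predictions judged correct."""
    hits = sum(is_correct(p, g, **kwargs) for p, g in zip(predictions, golds))
    return 100.0 * hits / len(golds)

# Toy usage: one hit out of two questions.
print(accuracy(["Ned Stark.", "Cersei"], ["ned stark", "Tywin Lannister"]))
```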

6. Benchmark Results

Metric	Flat RAG	GraphRAG	Delta
Overall Accuracy	54.0%	56.0%	+2.0 pp
2-hop Accuracy	52.5%	55.0%	+2.5 pp
3+-hop Accuracy	60.0%	60.0%	0.0 pp
Hallucination Rate	30.0%	26.0%	-4.0 pp
Avg Latency	49 ms	75 ms	+26 ms

7. Key Findings

GraphRAG's advantage is largest on 2-hop reasoning queries (+2.5 percentage points), consistent with the theoretical expectation that graph traversal aids multi-step inference. On 3+-hop queries the two systems tie at 60.0%.
GraphRAG lowers the hallucination rate from 30% to 26% (4 percentage points, a 13% relative reduction), suggesting that graph-structured context produces more faithful answers.
Three critical pipeline bugs were identified and resolved during evaluation: INNER MATCH node dropout, context truncation severing reasoning chains, and similarity threshold over-filtering of isolated entities.
The latency overhead of GraphRAG (+26 ms) is acceptable for accuracy-critical retrieval applications.
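
The difference between absolute and relative deltas in the findings above can be checked with a few lines of arithmetic. The helper names are hypothetical; the input figures are the ones reported in the results table, and the question count of 50 comes from the evaluation design.

```python
def absolute_delta(baseline, new):
    """Difference in percentage points."""
    return new - baseline

def relative_change(baseline, new):
    """Change relative to the baseline, in percent."""
    return 100.0 * (new - baseline) / baseline

def questions_correct(accuracy_pct, total_questions):
    """Number of correctly answered questions implied by an accuracy figure."""
    return round(accuracy_pct * total_questions / 100.0)

# Hallucination rate: 30% for Flat RAG, 26% for GraphRAG.
print(absolute_delta(30.0, 26.0))             # -4.0 percentage points
print(round(relative_change(30.0, 26.0), 1))  # -13.3 percent relative

# Overall accuracy on 50 questions: 54% vs 56%.
print(questions_correct(54.0, 50), questions_correct(56.0, 50))  # 27 28
```

On 50 questions, the 2-point overall accuracy gap corresponds to a single question, which is worth keeping in mind when reading the deltas.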

8. Pipeline Debugging & Root Cause Analysis

During evaluation, GraphRAG initially scored 10% accuracy against Flat RAG's 54%, a 44-point deficit. Systematic debugging identified three root causes:

Bug	Root Cause	Fix
INNER MATCH Dropout	Cypher MATCH dropped isolated nodes with zero edges, blinding LLM to relevant entities	Changed to OPTIONAL MATCH to preserve isolated but semantically relevant nodes
Context Truncation	Arbitrary `connections[:3]` slicing severed multi-hop reasoning chains	Added `ORDER BY in_subgraph DESC` to prioritize bridge connections
Similarity Over-filtering	`sim > 0.25` threshold excluded proper nouns with naturally low embedding similarity to long queries	Embedded full neighborhood string instead of entity name alone
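
The first bug in the table can be illustrated with a minimal sketch. The Cypher strings below are hypothetical (the `name` property and the `$seeds` parameter are assumptions, not the repository's queries), and the two Python functions only simulate the row semantics in memory: a plain `MATCH` on the relationship pattern yields no row for a seed without edges, while `OPTIONAL MATCH` keeps the seed and fills the missing parts with nulls.

```python
# Hypothetical queries; property and parameter names are assumptions.
INNER_QUERY = """
MATCH (n) WHERE n.name IN $seeds
MATCH (n)-[r]-(m)
RETURN n.name AS seed, type(r) AS rel, m.name AS neighbor
"""

OPTIONAL_QUERY = """
MATCH (n) WHERE n.name IN $seeds
OPTIONAL MATCH (n)-[r]-(m)
RETURN n.name AS seed, type(r) AS rel, m.name AS neighbor
"""

def inner_match(seeds, edges):
    """Simulate MATCH: seeds with no edges produce no rows at all."""
    return [(s, rel, nb) for s in seeds for rel, nb in edges.get(s, [])]

def optional_match(seeds, edges):
    """Simulate OPTIONAL MATCH: seeds with no edges produce one null-padded row."""
    rows = []
    for s in seeds:
        matches = edges.get(s, [])
        if matches:
            rows.extend((s, rel, nb) for rel, nb in matches)
        else:
            rows.append((s, None, None))
    return rows

# Toy graph: one connected seed and one isolated seed (hypothetical data).
edges = {"Connected Seed": [("ALLEGIANCE", "Some House")]}
seeds = ["Connected Seed", "Isolated Seed"]

print(inner_match(seeds, edges))     # the isolated seed is missing
print(optional_match(seeds, edges))  # the isolated seed survives with nulls
```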

Result: GraphRAG accuracy recovered from 10% → 56% after all three fixes were applied.
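
The other two fixes can be sketched in the same spirit. This is an in-memory Python analogue under stated assumptions, not the repository's code: the README describes the ordering fix as a Cypher `ORDER BY in_subgraph DESC`, whereas the sketch sorts a list of dictionaries, and the `rel` and `target` keys and the neighborhood string format are hypothetical.

```python
def select_connections(connections, limit=3):
    """Keep at most `limit` connections, preferring those inside the subgraph.

    sorted() is stable, so the original order is preserved within each group.
    """
    ranked = sorted(connections, key=lambda c: c["in_subgraph"], reverse=True)
    return ranked[:limit]

def neighborhood_string(name, connections):
    """Build the text to embed: the entity name plus its connections."""
    parts = [f"{c['rel']} {c['target']}" for c in connections]
    if not parts:
        return name
    return f"{name}: " + "; ".join(parts)

# Toy data: the only bridge connection sits in fourth position.
connections = [
    {"rel": "SEAT", "target": "A", "in_subgraph": False},
    {"rel": "CULTURE", "target": "B", "in_subgraph": False},
    {"rel": "TITLE", "target": "C", "in_subgraph": False},
    {"rel": "FATHER", "target": "D", "in_subgraph": True},
]

naive = connections[:3]                 # drops the bridge connection
fixed = select_connections(connections)  # bridge connection comes first

print([c["target"] for c in naive])
print([c["target"] for c in fixed])
print(neighborhood_string("Entity", fixed))
```

Embedding the longer neighborhood string gives a sparse proper noun more text to match against a long query, which is the stated motivation for the third fix.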

9. How to Run

Install Dependencies:
```
pip install -r requirements.txt
```
Setup Env: Copy .env.example to .env and fill in credentials.
Pre-Compute Embeddings:
```
python scripts/embed_graph.py
```
Run Benchmark:
```
python scripts/run_benchmark.py
```

10. Ablation Study Results

Configuration	Accuracy	Impact
Flat RAG (Baseline)	30.0%	-
Graph Traversal Only	30.0%	+0.0 pp
Graph + Community Context	40.0%	+10.0 pp
11. Future Work

Explore more sophisticated LLM-based query decomposition.
Evaluate against advanced graph query generation (Text-to-Cypher).
