Context Graph Explorer

A graph-based data exploration system with a natural language query interface. Business entities (orders, deliveries, invoices, payments) are modelled as a graph, visualised with React Flow, and queryable through an LLM-powered chat interface (Gemini or Groq).

Demo

I've made a few UI changes as you can see from the first video, the second video details the design decisions/thoughts. Click here to see the entire new UI demo : click here

Recording.2026-03-27.000203.mp4

Untitled.design.1.mp4

Architecture

┌─────────────────────────────────────────────────────────┐
│                     React (Vite)                        │
│  ┌──────────────────────┐  ┌───────────────────────┐   │
│  │   GraphCanvas        │  │     ChatPanel         │   │
│  │   (React Flow)       │  │   (NL Query UI)       │   │
│  │   EntityNode         │  │   SQL disclosure      │   │
│  │   NodeDetailPanel    │  │   Node highlighting   │   │
│  └──────────────────────┘  └───────────────────────┘   │
│                    global store                        │
└────────────────────────┬────────────────────────────────┘
                         │ REST /api/*
┌────────────────────────▼────────────────────────────────┐
│                  FastAPI (Python)                        │
│  /api/graph/           → graph_builder.py               │
│  /api/chat/            → llm_service.py                 │
│  /api/query/execute    → db.py (SELECT only)            │
└────────────────────────┬────────────────────────────────┘
                         │
        ┌────────────────┴────────────────┐
        │           SQLite                │
        │  (context_graph.db)             │
        └─────────────────────────────────┘
                         │
              Google Gemini/Groq

Design Decisions I made:

caching graph nodes server side so they load faster -> this was especially important because I used a free tier web service ( I added this strategy for graph generation to reduce repeated load on the backend server and it really helps with much faster reloads after the first request.)
adding a small mini index for all major hubs (nodes with the most connections) for quick access, and neighbourhood highlighting ( upto 2nd degree neighbours have been considered by me for this purpose)
I kept node meta data attached to each node for easy viewing.
entity types have specific colors and badges so you can differentiate between them
I added test coverage for graph nodes to assert unique node ids, valid endpoints, and no malformed nodes during/ from ingestion

Database Choice

I chose SQLite for local data portability (data would be accessible locally), I prioritized simplicity while validating product behavior since I wanted to reduce infrastructure complexity early and focus on graph logic, UX, and query behavior first.
It is a good fit for this dataset size, the data pipeline works naturally with a file-based DB I ingest JSONL data and convert it into relational tables. A single SQLite file makes this easy to manage, reset, back up, and move.
My workload is mostly read-heavy my backend reads relational data, builds a graph view, and serves it. SQLite performs well for this kind of local read-focused workload
I still kept a migration path open by using standard SQL patterns, so I can move to PostgreSQL later if I need higher concurrency or production-scale features.

The graph is constructed in-memory from relational queries in graph_builder.py, which means you get the flexibility of SQL (for the LLM query path) AND a graph representation (for the UI) from the same source of truth. For a production deployment with millions of rows, swap to PostgreSQL (same query layer) or add a Neo4j graph layer on top.

Graph Modelling

Nodes represent business entities: Customer, Order, OrderItem, Product, Delivery, Invoice, Payment, Address.

Edges capture the real flow of data/ relationships :

Customer → Order (placed)
Order → OrderItem (contains)
OrderItem → Product (references)
Order → Delivery (delivered_via)
Delivery → Invoice (triggers)
Order → Invoice (billed_as)
Invoice → Payment (settled_by)

LLM Prompting Strategy

The Gemini prompt is structured in two stages:

Stage 1 — Classification + SQL generation A system prompt containing the full schema is sent with every user message. The model is instructed to return a strict JSON object with two paths:

{"relevant": false, "message": "..."} — for off-topic queries
{"relevant": true, "sql": "...", "explanation": "..."} — for valid queries

This separation makes guardrails reliable: the model explicitly labels relevance, so the backend can reject without ever executing a query.

Stage 2 — Answer synthesis The raw query results (capped at 50 rows for context) are sent back to Gemini with the original question. The model is asked to produce a 2–4 sentence data-backed answer without mentioning SQL or database internals.

Guardrails

I only allow read queries (no writes) the advantage here is that the AI cannot change or delete important data.
I restricted the scope of questions that the AI accepts to only the dataset. This helps guard against prompt injections.
SQL validation before execution (check table+column existence). I validate table and column names before running SQL queries and test before full execution
Auto-retry: feed DB error back to LLM once and ask it to fix query
Data grounding guardrail: response generation happens only after SQL execution; if no rows, system returns explicit no-result message
The backend's execute_query() function enforces SELECT-only at the string level & any other statement raises a ValueError before execution.
Query guardrail: backend enforces SELECT-only execution and rejects mutating SQL.
Domain guardrail: out-of-scope prompts are rejected with a restrict to dataset-only message.
Provider/ops guardrail: LLM failures (quota/model/access) are caught and returned as non-crashing API responses.

Setup

Prerequisites

Python 3.11+
Node.js 18+
One API key for either:
- Google Gemini
- Groq

Backend

cd backend
python -m venv venv && source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env

# For Groq (recommended free option)
# LLM_PROVIDER=groq
# GROQ_API_KEY=...
# GROQ_MODEL=llama-3.1-8b-instant

# Or for Gemini
# LLM_PROVIDER=gemini
# GEMINI_API_KEY=...
# GEMINI_MODEL=gemini-1.5-flash-latest

Frontend

cd frontend
npm install

Dataset Ingestion

Place the extracted SAP O2C dataset folder (with subfolders like sales_order_headers/, billing_document_items/, etc.) anywhere on your machine.
Run ingestion with the dataset root path (JSONL-based, no mapping edits needed):

cd backend
python -m services.ingest --data-dir ../sap-o2c-data

Notes:

The loader truncates each mapped table before importing, so reruns are idempotent.
Missing folders are skipped safely.

Running

Terminal 1 — backend:

cd backend
source venv/bin/activate
uvicorn main:app --reload --port 8000

Terminal 2 — frontend:

cd frontend
npm run dev

Example Queries

Question	What it tests
Which products are associated with the highest number of billing documents?	JOIN across order_items, invoices, products
trace the data flow of product 3001456	Multi-table JOIN: suggests a centralized production and storage system for product 3001456
Show sales orders that were delivered but never billed	LEFT JOIN + NULL check
Which customers have the most orders?	GROUP BY + ORDER BY
What is the total payment amount by method?	Aggregation on payments

Project Structure

context-graph/
├── backend/
│   ├── main.py                  # FastAPI app entry point
│   ├── requirements.txt
│   ├── .env.example
│   ├── routers/
│   │   ├── graph.py             # /api/graph/*
│   │   ├── chat.py              # /api/chat/
│   │   └── query.py             # /api/query/execute
│   ├── services/
│   │   ├── db.py                # SQLite connection + execute_query
│   │   ├── ingest.py            # CSV → SQLite loader
│   │   ├── graph_builder.py     # Node/edge construction
│   │   └── llm_service.py       # Gemini NL→SQL pipeline
│   └── data/                    # context_graph.db (gitignored)
│
└── frontend/
    ├── index.html
    ├── vite.config.js
    ├── package.json
    └── src/
        ├── App.jsx
        ├── main.jsx
        ├── styles/global.css
        ├── store/useStore.js    # Zustand state
        ├── services/
        │   ├── api.js           # Axios wrappers
        │   └── layout.js        # Dagre auto-layout
        └── components/
            ├── Navbar.jsx
            ├── GraphCanvas.jsx  # React Flow canvas
            ├── EntityNode.jsx   # Custom RF node
            ├── NodeDetailPanel.jsx
            └── ChatPanel.jsx    # Chat + SQL disclosure

Live Deployment Link:

https://context-graph-for-sales-data-front-end.onrender.com/

Name		Name	Last commit message	Last commit date
Latest commit History 78 Commits
context-graph		context-graph
sap-o2c-data		sap-o2c-data
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Context Graph Explorer

Table of Contents

Demo

Architecture

Design Decisions I made:

Database Choice

Graph Modelling

LLM Prompting Strategy

Guardrails

Setup

Prerequisites

Backend

Frontend

Dataset Ingestion

Running

Example Queries

Project Structure

Live Deployment Link:

About

Uh oh!

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Context Graph Explorer

Table of Contents

Demo

Architecture

Design Decisions I made:

Database Choice

Graph Modelling

LLM Prompting Strategy

Guardrails

Setup

Prerequisites

Backend

Frontend

Dataset Ingestion

Running

Example Queries

Project Structure

Live Deployment Link:

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages