The VOC GraphRAG Engine is a computational framework designed to convert chemical ecology datasets, GC-MS experiments, and bioassay results into a structured scientific knowledge graph that can be queried using modern GraphRAG (Graph Retrieval-Augmented Generation) techniques.
Traditional analysis workflows rely on spreadsheets, static plots, and manual literature review. This project aims to transform these disconnected data sources into a machine-readable scientific knowledge base capable of answering complex biological questions.
The system integrates:
- GC-MS peak data (retention time, peak area, compound identification)
- Plant volatile emissions
- Insect behavioral bioassays
- Scientific literature and experimental notes
into a Neo4j knowledge graph enriched with LLM-powered information extraction and reasoning.
This repository provides the core engine for building such a system.
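As an illustration of the structured inputs listed above, the following is a minimal sketch of parsing a GC-MS peak table exported as CSV into typed records. The column names (`sample_id`, `rt_min`, `area`, `compound`), the `GCMSPeakRecord` class, and the example rows are assumptions made for this sketch; they are not taken from the repository's loaders.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class GCMSPeakRecord:
    """One GC-MS peak: retention time, peak area, compound identification."""
    sample_id: str
    retention_time_min: float
    area: float
    compound: str


def load_peaks(csv_text: str) -> list[GCMSPeakRecord]:
    """Parse CSV text (hypothetical column names) into peak records."""
    reader = csv.DictReader(io.StringIO(csv_text))
    peaks = []
    for row in reader:
        peaks.append(
            GCMSPeakRecord(
                sample_id=row["sample_id"].strip(),
                retention_time_min=float(row["rt_min"]),
                area=float(row["area"]),
                compound=row["compound"].strip(),
            )
        )
    return peaks


# Placeholder rows for illustration only, not measured data.
EXAMPLE_CSV = """sample_id,rt_min,area,compound
S1,10.0,1500,Limonene
S1,12.5,800,Hexanal
"""

peaks = load_peaks(EXAMPLE_CSV)
```

Converting numeric columns at load time means malformed rows fail early, before anything is written to the graph.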
In chemical ecology research, important relationships often span multiple data sources:
- GC-MS chromatograms
- Bioassay experiments
- Experimental metadata
- Scientific publications
Answering questions such as:
- Which volatile compounds emitted by a plant attract a specific aphid species?
- Which GC-MS peaks correspond to compounds that show behavioral activity in bioassays?
- Which compounds consistently appear across sampling methods or timepoints?
requires linking chemical data, biological responses, and experimental context.
The VOC GraphRAG Engine enables this integration through a structured graph representation.
The overall data flow is:
Raw Data Sources
│
├── GC-MS tables (RT, Area, Quality)
├── Bioassay results
├── Experimental notes
└── Scientific literature
│
▼
Information Extraction
│
├── LLM extraction of entities and relationships
└── Structured data ingestion (Excel / CSV)
│
▼
Scientific Knowledge Graph (Neo4j)
│
├── Plant nodes
├── Aphid nodes
├── VOC compound nodes
├── GC-MS peak nodes
└── Bioassay nodes
│
▼
GraphRAG Query Engine
│
└── Evidence-based scientific answers
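The LLM extraction stage in the diagram has to turn free-form model output into graph-ready facts. The sketch below shows one way this could be done, assuming the model is prompted to reply with a JSON object containing a `relations` list. That reply format, the `Triple` class, and `parse_llm_output` are hypothetical and not the repository's actual interface; no LLM is called here.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    source: str
    relation: str
    target: str


def parse_llm_output(raw: str) -> list[Triple]:
    """Turn an LLM's JSON reply into triples, skipping malformed items."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []  # the model did not return valid JSON
    if not isinstance(payload, dict):
        return []
    relations = payload.get("relations")
    if not isinstance(relations, list):
        return []
    triples = []
    for item in relations:
        if not isinstance(item, dict):
            continue
        values = [item.get(key) for key in ("source", "relation", "target")]
        # Keep only items where all three fields are non-empty strings.
        if all(isinstance(v, str) and v.strip() for v in values):
            triples.append(Triple(*(v.strip() for v in values)))
    return triples


# Assumed reply format; the second item is incomplete and gets dropped.
reply = json.dumps({
    "relations": [
        {"source": "Date Palm", "relation": "EMITS", "target": "Limonene"},
        {"source": "Date Palm", "relation": "EMITS"},
    ]
})
triples = parse_llm_output(reply)
```

Dropping malformed items instead of raising keeps one bad extraction from aborting a whole ingestion batch; a stricter pipeline might log or reject them.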
The knowledge graph is built around a domain-specific ontology.
| Entity | Description |
|---|---|
| Plant | Plant species or cultivar emitting VOCs |
| Aphid | Target insect species |
| VOCCompound | Identified volatile organic compounds |
| GCMSPeak | Observed GC-MS peak with RT and abundance |
| Sample | Experimental sample or extraction |
| Bioassay | Behavioral experiment testing insect responses |
| Document / Chunk | Source literature or experimental notes |
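One way to enforce this ontology in Neo4j is a uniqueness constraint per node label. The sketch below generates such statements from a label-to-key mapping. The key property names (`peak_id`, `sample_id`, and so on) and the constraint naming scheme are assumptions for illustration; the contents of `neo4j_schema.cypher` may differ.

```python
# Node labels from the entity table, each with a hypothetical key property.
NODE_KEYS = {
    "Plant": "name",
    "Aphid": "name",
    "VOCCompound": "name",
    "GCMSPeak": "peak_id",
    "Sample": "sample_id",
    "Bioassay": "bioassay_id",
    "Document": "doc_id",
    "Chunk": "chunk_id",
}


def constraint_statements(node_keys: dict[str, str]) -> list[str]:
    """Build one Cypher uniqueness constraint per node label."""
    statements = []
    for label, prop in node_keys.items():
        constraint_name = f"{label.lower()}_{prop}_unique"
        statements.append(
            f"CREATE CONSTRAINT {constraint_name} IF NOT EXISTS "
            f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
        )
    return statements


statements = constraint_statements(NODE_KEYS)
```

Each statement uses `IF NOT EXISTS`, so the generated script can be re-run safely against an existing database.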
Core relationships:
Plant ── EMITS ── VOCCompound
VOCCompound ── IDENTIFIED_FROM ── GCMSPeak
GCMSPeak ── OBSERVED_IN ── Sample
Sample ── FROM_PLANT ── Plant
Bioassay ── TESTS ── VOCCompound
Bioassay ── TARGETS ── Aphid
Aphid ── RESPONDS_TO ── VOCCompound
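The relationship list above can double as a validation whitelist during ingestion, so that extracted edges that do not fit the ontology are caught before they reach the graph. This is a minimal sketch: it assumes each relationship points from the left-hand entity to the right-hand one, and the function names are hypothetical.

```python
# Allowed (source label, relationship, target label) triples, read left to
# right from the relationship list. The direction is an assumption.
ALLOWED_EDGES = {
    ("Plant", "EMITS", "VOCCompound"),
    ("VOCCompound", "IDENTIFIED_FROM", "GCMSPeak"),
    ("GCMSPeak", "OBSERVED_IN", "Sample"),
    ("Sample", "FROM_PLANT", "Plant"),
    ("Bioassay", "TESTS", "VOCCompound"),
    ("Bioassay", "TARGETS", "Aphid"),
    ("Aphid", "RESPONDS_TO", "VOCCompound"),
}


def is_valid_edge(source_label: str, relation: str, target_label: str) -> bool:
    """Return True if the edge type is part of the ontology."""
    return (source_label, relation, target_label) in ALLOWED_EDGES


def find_invalid(edges: list[tuple[str, str, str]]) -> list[tuple[str, str, str]]:
    """Return the edges that do not match any allowed triple."""
    return [edge for edge in edges if not is_valid_edge(*edge)]


candidate_edges = [
    ("Plant", "EMITS", "VOCCompound"),
    ("VOCCompound", "EMITS", "Plant"),  # reversed direction, rejected
]
invalid = find_invalid(candidate_edges)
```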
This structure enables multi-hop reasoning across chemical, biological, and experimental layers.
Example question:
Which volatile compounds emitted by date palm attract aphids?
Graph traversal:
Plant: Date Palm
│
├── EMITS → Limonene
├── EMITS → β-Ocimene
└── EMITS → Hexanal
β-Ocimene
│
└── tested_in → Y-tube Bioassay
    │
    └── effect → attraction
The system returns:
- relevant compounds
- supporting GC-MS peaks
- associated bioassay evidence
- source documentation
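The traversal above can be mimicked in plain Python to show the two hops involved. This is only a sketch over an in-memory edge list built from the example; the real engine queries Neo4j, and storing the effect per (bioassay, compound) pair is an assumption about how the data is modeled.

```python
# Edges copied from the example traversal: (source, relation, target).
EDGES = [
    ("Date Palm", "EMITS", "Limonene"),
    ("Date Palm", "EMITS", "β-Ocimene"),
    ("Date Palm", "EMITS", "Hexanal"),
    ("Y-tube Bioassay", "TESTS", "β-Ocimene"),
]

# Observed effect per (bioassay, compound) pair, as in the example.
EFFECTS = {("Y-tube Bioassay", "β-Ocimene"): "attraction"}


def attractive_compounds(plant: str) -> list[dict[str, str]]:
    """Hop 1: plant -> emitted compounds. Hop 2: compounds -> bioassays."""
    emitted = {t for s, r, t in EDGES if s == plant and r == "EMITS"}
    results = []
    for bioassay, relation, compound in EDGES:
        if relation != "TESTS" or compound not in emitted:
            continue
        if EFFECTS.get((bioassay, compound)) == "attraction":
            results.append(
                {"compound": compound, "bioassay": bioassay, "effect": "attraction"}
            )
    return results


answer = attractive_compounds("Date Palm")
```

Limonene and Hexanal are emitted but have no bioassay evidence in this toy graph, so only β-Ocimene is returned along with the bioassay that supports it.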
Repository layout:
voc-graphrag-engine
│
├── ingestion
│ ├── excel_gcms_loader.py
│ └── dataset_ingestion_pipeline.py
│
├── extraction
│ ├── llm_entity_extractor.py
│ └── relation_extraction.py
│
├── graph_schema
│ └── neo4j_schema.cypher
│
├── graphrag
│ ├── graph_query_engine.py
│ └── rag_pipeline.py
│
├── examples
│ └── aphid_voc_demo.ipynb
│
└── README.md
Technology stack:
- Neo4j – graph database for scientific knowledge representation
- LangChain – orchestration of LLM workflows
- OpenAI / LLMs – information extraction and reasoning
- Python – data ingestion and analysis
- GraphRAG – graph-aware retrieval for explainable answers
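To show how the Python and Neo4j components could fit together, here is a minimal sketch that runs a parameterized Cypher query through the official `neo4j` Python driver. The query, the `name` property on `Aphid`, and the function name are assumptions for illustration, and the connection details are supplied by the caller.

```python
# Compounds emitted by a plant that were tested in a bioassay on some aphid.
# Relationship names follow the ontology; property names are assumptions.
QUERY = """
MATCH (p:Plant {name: $plant})-[:EMITS]->(v:VOCCompound)
MATCH (b:Bioassay)-[:TESTS]->(v)
MATCH (b)-[:TARGETS]->(a:Aphid)
RETURN v.name AS compound, a.name AS aphid
"""


def fetch_tested_compounds(uri: str, user: str, password: str, plant: str) -> list[dict]:
    """Run QUERY against a Neo4j instance and return rows as dictionaries."""
    # Imported lazily so this module loads even without the driver installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            result = session.run(QUERY, plant=plant)
            # Consume the result while the session is still open.
            return [record.data() for record in result]
```

Passing the plant name as the `$plant` parameter, instead of formatting it into the query string, avoids quoting problems and lets Neo4j reuse the query plan.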
Retrieve the most abundant compounds emitted by a plant:
MATCH (s:Sample)-[:FROM_PLANT]->(p:Plant {name: "date palm"})
MATCH (v:VOCCompound)-[:IDENTIFIED_FROM]->(pk:GCMSPeak)-[:OBSERVED_IN]->(s)
RETURN v.name, sum(pk.area) AS total_area
ORDER BY total_area DESC
LIMIT 10

This engine is used to build the following scientific knowledge bases:
- Aphid VOC Library
- Insect Volatile Compound Libraries
- Plant–Insect Chemical Interaction Datasets
These repositories store domain-specific datasets, while the present repository provides the computational framework for constructing and querying the knowledge graph.
Potential applications:
- Chemical ecology research
- Insect behavioral studies
- Plant–insect interaction analysis
- Biomarker discovery in VOC datasets
- Literature-aware chemical knowledge graphs
Planned improvements include:
- CAS / InChI / SMILES integration for compound identity resolution
- Automated GC-MS peak annotation pipelines
- Hybrid Graph + Vector retrieval
- Interactive scientific query interface
- Integration with chemical structure databases
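As a small building block for the planned CAS-based identity resolution, a CAS Registry Number can be checked for internal consistency before it is used as a compound key. The sketch below implements the standard CAS check-digit rule; the function name is hypothetical and nothing here reflects existing code in the repository.

```python
import re

# CAS numbers have the form NNNNNNN-NN-N: 2 to 7 digits, 2 digits, 1 check digit.
_CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas: str) -> bool:
    """Check the format and check digit of a CAS Registry Number."""
    match = _CAS_PATTERN.match(cas.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight digits 1, 2, 3, ... starting from the rightmost one,
    # then compare the weighted sum modulo 10 with the check digit.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == check_digit


water_ok = is_valid_cas("7732-18-5")  # CAS number of water
```

A passing check only shows that the number is well formed; confirming that it refers to the intended compound still requires a lookup in a chemical database.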
If you use this framework in research or publications, please cite:
VOC GraphRAG Engine: A Knowledge Graph Framework for Chemical Ecology and GC-MS Data Integration.
This work is developed as part of ongoing efforts to create open computational infrastructure for chemical ecology and insect–plant interaction research.
MIT License
For collaboration or inquiries, please open an issue in this repository.