Taif-ChemEcoLab/voc-graphrag-engine

VOC GraphRAG Engine

A Knowledge Graph + GraphRAG Framework for Chemical Ecology and GC-MS Data Analysis

Overview

The VOC GraphRAG Engine is a computational framework designed to convert chemical ecology datasets, GC-MS experiments, and bioassay results into a structured scientific knowledge graph that can be queried using modern GraphRAG (Graph Retrieval-Augmented Generation) techniques.

Traditional analysis workflows rely on spreadsheets, static plots, and manual literature review, which leaves GC-MS results, bioassay outcomes, and published findings disconnected. This project aims to link these sources in a machine-readable scientific knowledge base that can answer questions spanning all of them.

The system integrates:

  • GC-MS peak data (retention time, peak area, compound identification)
  • Plant volatile emissions
  • Insect behavioral bioassays
  • Scientific literature and experimental notes

into a Neo4j knowledge graph enriched with LLM-powered information extraction and reasoning.

This repository provides the core engine for building such a system.

Scientific Motivation

In chemical ecology research, important relationships often span multiple data sources:

  • GC-MS chromatograms
  • Bioassay experiments
  • Experimental metadata
  • Scientific publications

Answering questions such as:

  • Which volatile compounds emitted by a plant attract a specific aphid species?
  • Which GC-MS peaks correspond to compounds that show behavioral activity in bioassays?
  • Which compounds consistently appear across sampling methods or timepoints?

requires linking chemical data, biological responses, and experimental context.

The VOC GraphRAG Engine enables this integration through a structured graph representation.

System Architecture

Raw Data Sources
   │
   ├── GC-MS tables (RT, Area, Quality)
   ├── Bioassay results
   ├── Experimental notes
   └── Scientific literature
   │
   ▼
Information Extraction
   │
   ├── LLM extraction of entities and relationships
   └── Structured data ingestion (Excel / CSV)
   │
   ▼
Scientific Knowledge Graph (Neo4j)
   │
   ├── Plant nodes
   ├── Aphid nodes
   ├── VOC compound nodes
   ├── GC-MS peak nodes
   └── Bioassay nodes
   │
   ▼
GraphRAG Query Engine
   │
   └── Evidence-based scientific answers
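The structured-ingestion step in the diagram can be illustrated with a minimal sketch. It assumes a CSV export with the columns `Sample`, `RT`, `Area`, `Quality`, and `Compound`. These column names, the `parse_gcms_csv` function, and the quality threshold are hypothetical and are not taken from the repository's loaders. The rows are placeholder values, not measurements.

```python
import csv
import io
from typing import Iterator


def parse_gcms_csv(text: str, min_quality: float = 0.0) -> Iterator[dict]:
    """Yield one cleaned peak record per CSV row.

    Column names (Sample, RT, Area, Quality, Compound) are assumptions
    made for illustration; adapt them to the actual instrument export.
    """
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        quality = float(row["Quality"])
        if quality < min_quality:
            continue  # skip rows below the chosen match-quality threshold
        yield {
            "sample": row["Sample"].strip(),
            "rt": float(row["RT"]),
            "area": float(row["Area"]),
            "quality": quality,
            "compound": row["Compound"].strip(),
        }


# Placeholder rows for demonstration only.
EXAMPLE_CSV = """Sample,RT,Area,Quality,Compound
S1,5.12,120000,95,Hexanal
S1,9.87,450000,91,Limonene
S1,10.45,30000,40,Unknown
"""

peaks = list(parse_gcms_csv(EXAMPLE_CSV, min_quality=80))
```

Filtering on match quality before graph insertion is one possible design choice; a pipeline could instead keep every peak and store the quality score as a node property.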

Scientific Ontology

The knowledge graph is built around a domain-specific ontology.

Core Entities

Entity             Description
Plant              Plant species or cultivar emitting VOCs
Aphid              Target insect species
VOCCompound        Identified volatile organic compounds
GCMSPeak           Observed GC-MS peak with RT and abundance
Sample             Experimental sample or extraction
Bioassay           Behavioral experiment testing insect responses
Document / Chunk   Source literature or experimental notes
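As a rough illustration, the core entities could be mirrored in Python as typed records before they are written to the graph. The field names below (`sample_id`, `retention_time`, `area`, `effect`, and so on) are assumptions for this sketch and may not match the properties used in `neo4j_schema.cypher`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Plant:
    name: str


@dataclass(frozen=True)
class Aphid:
    species: str


@dataclass(frozen=True)
class VOCCompound:
    name: str


@dataclass(frozen=True)
class Sample:
    sample_id: str
    plant: Plant


@dataclass(frozen=True)
class GCMSPeak:
    sample: Sample
    retention_time: float
    area: float
    compound: Optional[VOCCompound] = None  # None until the peak is identified


@dataclass(frozen=True)
class Bioassay:
    assay_id: str
    compound: VOCCompound
    target: Aphid
    effect: str  # e.g. "attraction"; the vocabulary is an assumption


# Placeholder values for demonstration only.
palm = Plant("date palm")
sample = Sample("S1", palm)
peak = GCMSPeak(sample, retention_time=9.87, area=450000.0,
                compound=VOCCompound("Limonene"))
```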

Core Relationships

Plant ── EMITS ── VOCCompound
VOCCompound ── IDENTIFIED_FROM ── GCMSPeak
GCMSPeak ── OBSERVED_IN ── Sample
Sample ── FROM_PLANT ── Plant

Bioassay ── TESTS ── VOCCompound
Bioassay ── TARGETS ── Aphid

Aphid ── RESPONDS_TO ── VOCCompound

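This structure enables multi-hop reasoning across chemical, biological, and experimental layers: for example, from a Plant through its Samples and GCMSPeaks to a VOCCompound, and on to the Bioassay that tested that compound.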
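One way to keep ingested edges consistent with this ontology is to check each edge against the relationship list before generating Cypher. The sketch below reads each relationship left to right as source to target and matches nodes on a `name` property; both are assumptions, and `merge_statement` is a hypothetical helper rather than part of the repository.

```python
# Allowed (source label, relationship type, target label) triples,
# copied from the Core Relationships list and read left to right.
SCHEMA = {
    ("Plant", "EMITS", "VOCCompound"),
    ("VOCCompound", "IDENTIFIED_FROM", "GCMSPeak"),
    ("GCMSPeak", "OBSERVED_IN", "Sample"),
    ("Sample", "FROM_PLANT", "Plant"),
    ("Bioassay", "TESTS", "VOCCompound"),
    ("Bioassay", "TARGETS", "Aphid"),
    ("Aphid", "RESPONDS_TO", "VOCCompound"),
}


def merge_statement(src_label: str, rel_type: str, dst_label: str) -> str:
    """Return a parameterized Cypher MERGE for one schema-valid edge.

    Node values are passed as the $src and $dst parameters; labels and the
    relationship type are interpolated only after the whitelist check.
    Matching on a `name` property is an assumption of this sketch.
    """
    if (src_label, rel_type, dst_label) not in SCHEMA:
        raise ValueError(
            f"{src_label}-[{rel_type}]->{dst_label} is not in the ontology"
        )
    return (
        f"MERGE (a:{src_label} {{name: $src}}) "
        f"MERGE (b:{dst_label} {{name: $dst}}) "
        f"MERGE (a)-[:{rel_type}]->(b)"
    )


stmt = merge_statement("Plant", "EMITS", "VOCCompound")
```

Rejecting edges that are not in the whitelist is intended to stop LLM-extracted relations with unexpected types from silently extending the schema.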

Example Scientific Query

Example question:

Which volatile compounds emitted by date palm attract aphids?

Graph traversal:

Plant: Date Palm
   │
   ├── EMITS → Limonene
   ├── EMITS → β-Ocimene
   └── EMITS → Hexanal

β-Ocimene
   │
   └── TESTS ← Y-tube Bioassay
          │
          └── effect → attraction

The system returns:

  • relevant compounds
  • supporting GC-MS peaks
  • associated bioassay evidence
  • source documentation
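The two-hop traversal behind this example can be sketched in plain Python on toy data. The dictionaries below only mirror the illustration above; they are not experimental results, and `attractive_compounds` is a hypothetical helper, not the repository's query engine.

```python
# Toy edge lists mirroring the example traversal; not experimental data.
EMITS = {"date palm": ["Limonene", "β-Ocimene", "Hexanal"]}
BIOASSAYS = [
    {"assay": "Y-tube Bioassay", "compound": "β-Ocimene", "effect": "attraction"},
]


def attractive_compounds(plant: str) -> list:
    """Hop 1: plant to emitted compounds. Hop 2: compound to attraction bioassays."""
    results = []
    for compound in EMITS.get(plant, []):
        evidence = [
            b["assay"]
            for b in BIOASSAYS
            if b["compound"] == compound and b["effect"] == "attraction"
        ]
        if evidence:  # keep only compounds backed by at least one bioassay
            results.append({"compound": compound, "evidence": evidence})
    return results


answer = attractive_compounds("date palm")
```

Returning the supporting assay alongside each compound reflects the evidence-based answers described above; a fuller version would also attach GC-MS peaks and source documents.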

Repository Structure

voc-graphrag-engine
│
├── ingestion
│   ├── excel_gcms_loader.py
│   └── dataset_ingestion_pipeline.py
│
├── extraction
│   ├── llm_entity_extractor.py
│   └── relation_extraction.py
│
├── graph_schema
│   └── neo4j_schema.cypher
│
├── graphrag
│   ├── graph_query_engine.py
│   └── rag_pipeline.py
│
├── examples
│   └── aphid_voc_demo.ipynb
│
└── README.md

Technologies Used

  • Neo4j – graph database for scientific knowledge representation
  • LangChain – orchestration of LLM workflows
  • OpenAI / LLMs – information extraction and reasoning
  • Python – data ingestion and analysis
  • GraphRAG – graph-aware retrieval for explainable answers

Example Graph Query (Cypher)

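Retrieve the ten compounds with the largest summed peak area across samples from a plant, using the relationship types defined in the ontology:

// Samples collected from the plant of interest
MATCH (s:Sample)-[:FROM_PLANT]->(p:Plant {name: "date palm"})
// Peaks observed in those samples and the compounds identified from them
MATCH (v:VOCCompound)-[:IDENTIFIED_FROM]->(pk:GCMSPeak)-[:OBSERVED_IN]->(s)
RETURN v.name AS compound, sum(pk.area) AS total_area
ORDER BY total_area DESC
LIMIT 10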
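The aggregation this query performs (sum peak area per compound, sort descending, keep the top results) can be cross-checked against a small in-memory table. The sketch below is a plain-Python equivalent under the assumption that each peak row already carries its plant and compound name; the rows and the `top_compounds` helper are invented for illustration.

```python
from collections import defaultdict


def top_compounds(peaks, plant, limit=10):
    """Sum peak area per compound for one plant, largest totals first."""
    totals = defaultdict(float)
    for pk in peaks:
        if pk["plant"] == plant:
            totals[pk["compound"]] += pk["area"]
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:limit]


# Illustrative rows only; the areas are invented for the example.
rows = [
    {"plant": "date palm", "compound": "Limonene", "area": 300.0},
    {"plant": "date palm", "compound": "Hexanal", "area": 120.0},
    {"plant": "date palm", "compound": "Limonene", "area": 150.0},
    {"plant": "other plant", "compound": "Hexanal", "area": 999.0},
]
ranking = top_compounds(rows, "date palm")
```

Note that summed raw peak area depends on how many samples were collected, so comparisons between plants may need normalization first.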

Related Datasets

This engine is used to build the following scientific knowledge bases:

  • Aphid VOC Library
  • Insect Volatile Compound Libraries
  • Plant–Insect Chemical Interaction Datasets

These repositories store domain-specific datasets, while the present repository provides the computational framework for constructing and querying the knowledge graph.

Potential Applications

  • Chemical ecology research
  • Insect behavioral studies
  • Plant–insect interaction analysis
  • Biomarker discovery in VOC datasets
  • Literature-aware chemical knowledge graphs

Future Development

Planned improvements include:

  • CAS / InChI / SMILES integration for compound identity resolution
  • Automated GC-MS peak annotation pipelines
  • Hybrid Graph + Vector retrieval
  • Interactive scientific query interface
  • Integration with chemical structure databases

Citation

If you use this framework in research or publications, please cite:

VOC GraphRAG Engine: A Knowledge Graph Framework for Chemical Ecology and GC-MS Data Integration.

Acknowledgements

This work is developed as part of ongoing efforts to create open computational infrastructure for chemical ecology and insect–plant interaction research.

License

MIT License

Contact

For collaboration or inquiries, please open an issue in this repository.
