MOTHRA: Master Agent Orchestration for Carbon Database Construction

Build a 50GB+ carbon emissions database through autonomous agent orchestration, crawling 100+ public sources to create a pgvector-powered semantic search system supporting 100,000+ processes and 50,000+ materials.

Overview

MOTHRA is a multi-agent system that autonomously builds and maintains a carbon accounting database, with the goal of making it the most comprehensive one available. It combines intelligent crawling, data transformation, quality validation, and semantic indexing to create a unified carbon emissions knowledge base.

Key Features

Multi-Agent Architecture: Specialized agents for discovery, crawling, parsing, quality control, and embedding generation
100+ Data Sources: Government APIs, LCA databases, EPD registries, energy grid data, and research datasets
EC3 Integration: Direct access to 90,000+ verified EPDs from Building Transparency's EC3 database
Professional Verification: Complete support for the ISO 14067, ISO 14064, GHG Protocol, and EN 15804+A2 standards
Semantic Search: pgvector-powered similarity search with document chunking for large texts
Quality Assurance: 5-dimensional quality scoring (completeness, accuracy, consistency, timeliness, provenance)
Autonomous Operation: Scheduled workflows for continuous updates and maintenance
Scalable Design: Async Python with concurrent crawling and batch processing

Architecture

┌─────────────────────────────────────────────────────────────┐
│                   ORCHESTRATION LAYER                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │ Orchestrator │  │   Scheduler  │  │ Queue Manager│      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
└─────────────────────────────────────────────────────────────┘
                            │
        ┌───────────────────┼───────────────────┐
        │                   │                   │
┌───────▼────────┐  ┌──────▼────────┐  ┌───────▼────────┐
│ DISCOVERY      │  │  COLLECTION   │  │   QUALITY      │
│ - Survey Agent │  │  - Crawler    │  │  - Validator   │
│ - Validator    │  │  - Parser     │  │  - Scorer      │
│ - Metadata     │  │  - Transform  │  │  - Dedup       │
└────────────────┘  └───────────────┘  └────────────────┘
                            │
                    ┌───────▼────────┐
                    │  STORAGE LAYER │
                    │  - PGVector    │
                    │  - Embeddings  │
                    │  - Semantic    │
                    └────────────────┘
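
One way to read the diagram is as a pipeline: the orchestration layer asks discovery for sources, fans out collection, passes each batch through quality checks, and hands the survivors to storage. The sketch below illustrates that flow with asyncio only. Every function name and payload in it is hypothetical and is not taken from MOTHRA's code.

```python
import asyncio

# Hypothetical stage functions; names and payloads are illustrative only.
async def discover() -> list[str]:
    """Discovery layer: return identifiers of sources worth crawling."""
    return ["source-a", "source-b"]

async def collect(source: str) -> dict:
    """Collection layer: crawl, parse, and transform one source."""
    return {"source": source, "records": [{"name": f"{source}-item", "value": 1.0}]}

def validate(batch: dict) -> dict:
    """Quality layer: keep only records that carry a numeric value."""
    good = [r for r in batch["records"] if isinstance(r.get("value"), (int, float))]
    return {**batch, "records": good}

async def run_pipeline() -> list[dict]:
    """Orchestration layer: fan out collection, then validate each batch."""
    sources = await discover()
    batches = await asyncio.gather(*(collect(s) for s in sources))
    stored = []  # a plain list stands in for the storage layer
    for batch in batches:
        stored.extend(validate(batch)["records"])
    return stored

if __name__ == "__main__":
    print(asyncio.run(run_pipeline()))
```

Collection is the only stage run concurrently here because it is the I/O-bound one; validation stays synchronous for simplicity.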

Quick Start

Prerequisites

Python 3.11+
Docker & Docker Compose
OpenAI API key (for embeddings)

Installation

Clone and setup

git clone https://github.com/nickgogerty/Mothra.git
cd Mothra

Install dependencies

pip install -r requirements.txt
# OR
pip install -e .

Configure environment

cp .env.example .env
# Edit .env and add your OpenAI API key

Start PostgreSQL with pgvector

docker-compose up -d postgres

Initialize database

python -m mothra.db.session

Running MOTHRA

Option 1: Master Orchestrator (Recommended)

# Run complete discovery and initial crawl
python -m mothra.orchestrator

Option 2: Individual Agents

Discover Sources:

python -m mothra.agents.survey.survey_agent

Crawl Data:

python -m mothra.agents.crawler.crawler_agent

Generate Embeddings:

python -m mothra.agents.embedding.vector_manager

EC3 Integration & Professional Verification

MOTHRA now includes comprehensive integration with EC3 (Embodied Carbon in Construction Calculator) and professional carbon verification standards.

Import Construction Material EPDs

# Interactive import from 10 material categories
python scripts/import_ec3_epds.py

# Test EC3 integration
python scripts/test_ec3_integration.py

Available Categories:

Concrete (ready-mix, precast, blocks)
Steel (structural, rebar, decking)
Wood (lumber, engineered wood, CLT)
Insulation (mineral wool, foam, cellulose)
Glass, Aluminum, Gypsum, Roofing, Flooring, Sealants

Professional Standards Supported

✅ ISO 14067:2018 - Product Carbon Footprint
✅ ISO 14064-1/2/3 - GHG Verification
✅ GHG Protocol - Scope 1, 2, 3 (15 categories)
✅ EN 15804+A2:2019 - EPD LCA Stages (A1-A5, B1-B7, C1-C4, Module D)
✅ ISO 21930:2017 - Construction EPD Core Rules

Verification Data Captured

Each EPD includes:

LCA Stages: Full lifecycle from raw material (A1) through end-of-life (C4) and reuse potential (D)
GHG Scopes: Direct, indirect, and value chain emissions, with biogenic carbon reported separately
EPD Details: Registration number, PCR reference, declared unit, verification body
Data Quality: Temporal, geographic, and technological representativeness per ISO 14044
Compliance Flags: ISO 14067, EN 15804, GHG Protocol, third-party verification
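
To make the LCA stage codes concrete, the sketch below groups EN 15804 modules into their stages and sums declared values per stage. The stage grouping follows the standard; the function name, the dictionary layout, and the numbers are assumptions for illustration and do not reflect how MOTHRA stores EPD data.

```python
# EN 15804 groups life-cycle modules into stages; module D is reported separately.
STAGE_MODULES = {
    "product": ["A1", "A2", "A3"],
    "construction": ["A4", "A5"],
    "use": ["B1", "B2", "B3", "B4", "B5", "B6", "B7"],
    "end_of_life": ["C1", "C2", "C3", "C4"],
    "beyond_boundary": ["D"],
}

def stage_totals(gwp_by_module: dict[str, float]) -> dict[str, float]:
    """Sum declared GWP values (kg CO2e per declared unit) for each stage.

    Modules missing from the input are skipped rather than treated as zero,
    so a stage with no declared modules does not appear in the result.
    """
    totals = {}
    for stage, modules in STAGE_MODULES.items():
        declared = [gwp_by_module[m] for m in modules if m in gwp_by_module]
        if declared:
            totals[stage] = sum(declared)
    return totals

# Illustrative numbers only, not taken from a real EPD.
example = {"A1": 200.0, "A2": 15.0, "A3": 35.0, "C2": 4.0, "D": -20.0}
print(stage_totals(example))
```

Skipping undeclared modules keeps "not declared" distinct from "declared as zero", which matters when comparing EPDs with different scopes.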

📖 Full Guide: EC3_INTEGRATION_GUIDE.md

Project Structure

mothra/
├── agents/
│   ├── survey/         # Source discovery and validation
│   ├── crawler/        # Data collection orchestration
│   ├── discovery/      # Deep dataset discovery and EC3 integration
│   ├── parser/         # Format-specific parsers (JSON, XML, CSV, Excel)
│   ├── transform/      # Data transformation and harmonization
│   ├── quality/        # Quality scoring and validation
│   └── embedding/      # Vector generation and chunk-aware search
├── config/             # Configuration management
├── db/                 # Database models and session management
│   ├── models.py       # Core SQLAlchemy models with pgvector
│   ├── models_verification.py  # Professional verification models
│   ├── models_chunks.py        # Document chunking for large texts
│   ├── session.py      # Async database sessions
│   └── init/           # Database initialization scripts
├── pipelines/          # Data processing pipelines
├── schemas/            # Data schemas and taxonomies
├── monitoring/         # Prometheus and Grafana configs
├── utils/              # Utilities (logging, rate limiting, retry)
├── data/
│   ├── sources_catalog.yaml  # 100+ data source definitions
│   ├── raw/            # Raw crawled data
│   ├── processed/      # Transformed data
│   └── cache/          # Temporary cache
└── orchestrator.py     # Master orchestration

Configuration

Environment Variables

Key configuration options in .env:

# Database
POSTGRES_HOST=localhost
POSTGRES_DB=mothra
DATABASE_URL=postgresql+asyncpg://mothra:changeme@localhost:5432/mothra

# OpenAI
OPENAI_API_KEY=your-api-key-here
EMBEDDING_MODEL=text-embedding-3-large
EMBEDDING_DIMENSION=3072

# Crawler
MAX_CONCURRENT_REQUESTS=10
REQUEST_TIMEOUT=30
RETRY_ATTEMPTS=3

# Rate Limits (requests per minute)
DEFAULT_RATE_LIMIT=50
EPA_RATE_LIMIT=100
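
As a rough illustration of what the crawler settings control, the sketch below reads MAX_CONCURRENT_REQUESTS, REQUEST_TIMEOUT, and RETRY_ATTEMPTS from the environment and applies them with a semaphore, a per-call timeout, and exponential backoff. The helper names (with_retry, crawl_all) and the fetch callable are hypothetical; how MOTHRA itself consumes these variables is not shown in this README.

```python
import asyncio
import os

# Defaults mirror the sample .env values above.
MAX_CONCURRENT = int(os.environ.get("MAX_CONCURRENT_REQUESTS", "10"))
TIMEOUT_S = float(os.environ.get("REQUEST_TIMEOUT", "30"))
RETRIES = int(os.environ.get("RETRY_ATTEMPTS", "3"))

async def with_retry(make_call, retries=RETRIES, timeout=TIMEOUT_S, base_delay=0.01):
    """Await make_call() up to `retries` times, doubling the delay after each failure."""
    for attempt in range(1, retries + 1):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout)
        except Exception:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))

async def crawl_all(urls, fetch):
    """Call fetch(url) for every URL with at most MAX_CONCURRENT calls in flight."""
    gate = asyncio.Semaphore(MAX_CONCURRENT)

    async def one(url):
        async with gate:
            return await with_retry(lambda: fetch(url))

    return await asyncio.gather(*(one(u) for u in urls))
```

Here fetch is any coroutine function that takes a URL. The per-minute rate limits in the last group of settings would need a separate limiter, which this sketch leaves out.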

Data Sources

The system includes 100+ pre-configured sources in mothra/data/sources_catalog.yaml:

Government APIs: EPA, DEFRA, EIA, IPCC, EU ETS
LCA Databases: Ecoinvent, USDA LCA Commons, ELCD
EPD Registries: International EPD System, IBU, EPD Norge
Energy Grid: electricityMap, ENTSO-E, ISO (independent system operator) data
Research: OWID, Climate Watch, Carbon Monitor

Usage Examples

Semantic Search

import asyncio

from mothra.agents.embedding.vector_manager import VectorManager

async def search_example():
    manager = VectorManager()

    # Up to 10 "process" entities, filtered by the 0.7 similarity threshold
    results = await manager.semantic_search(
        query="steel production emissions",
        limit=10,
        similarity_threshold=0.7,
        entity_type="process"
    )

    for result in results:
        print(f"{result['name']}: {result['similarity']:.2f}")

# Run the coroutine from a regular script
asyncio.run(search_example())
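
For intuition about what limit and similarity_threshold do, here is a minimal in-memory sketch of cosine-similarity ranking. It is not how VectorManager works internally: in pgvector the comparison runs inside PostgreSQL, where the `<=>` operator returns cosine distance, so similarity is 1 minus that distance. The function names and toy vectors below are made up for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank(query_vec, rows, limit=10, similarity_threshold=0.7):
    """rows is an iterable of (name, vector); return the best matches above the threshold."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in rows]
    kept = [s for s in scored if s[1] >= similarity_threshold]
    return sorted(kept, key=lambda s: s[1], reverse=True)[:limit]

# Toy two-dimensional "embeddings"; real ones have hundreds or thousands of dimensions.
rows = [
    ("steel, blast furnace", [1.0, 0.0]),
    ("steel, electric arc", [0.8, 0.6]),
    ("oat milk", [0.0, 1.0]),
]
for name, score in rank([1.0, 0.0], rows):
    print(f"{name}: {score:.2f}")
```

The threshold is applied before the limit, so a strict threshold can return fewer than limit results.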

Quality Scoring

from mothra.agents.quality.quality_scorer import DataQualityScorer

scorer = DataQualityScorer()

data_entry = {
    "value": 2.5,
    "unit": "kgCO2e",
    "scope": 1,
    "source_id": "EPA-12345",
    "year": 2023
}

quality = scorer.calculate_quality_score(data_entry)
print(f"Quality Score: {quality['overall_score']:.2f}")
print(f"Confidence: {quality['confidence_level']}")
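
The five dimensions named under Key Features (completeness, accuracy, consistency, timeliness, provenance) could be combined into one score in many ways. The sketch below shows one simple option, a weighted average with a threshold-based confidence label. The weights, thresholds, and function names are assumptions for illustration and are not DataQualityScorer's actual formula.

```python
# Hypothetical weights: the dimension names come from this README, the numbers do not.
WEIGHTS = {
    "completeness": 0.25,
    "accuracy": 0.30,
    "consistency": 0.15,
    "timeliness": 0.15,
    "provenance": 0.15,
}

def overall_score(dimensions: dict[str, float]) -> float:
    """Weighted average of per-dimension scores, each expected in the range 0 to 1."""
    missing = WEIGHTS.keys() - dimensions.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * dimensions[name] for name in WEIGHTS)

def confidence_level(score: float) -> str:
    """Map a 0 to 1 score onto a coarse label (thresholds are illustrative)."""
    if score >= 0.8:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"

example = {
    "completeness": 1.0,
    "accuracy": 0.8,
    "consistency": 0.6,
    "timeliness": 0.4,
    "provenance": 1.0,
}
score = overall_score(example)
print(f"{score:.2f} {confidence_level(score)}")
```

Raising on a missing dimension, rather than defaulting it to zero, keeps incomplete inputs from silently producing a low score.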

Custom Workflow

import asyncio

from mothra.orchestrator import MothraOrchestrator

async def custom_workflow():
    orchestrator = MothraOrchestrator()

    # Run daily update
    result = await orchestrator.execute_workflow("daily_update")
    print(f"Crawled: {result['result']['crawl']}")

    # Run quality check
    result = await orchestrator.execute_workflow("quality_check")
    print(f"Quality: {result['result']}")

# Run the coroutine from a regular script
asyncio.run(custom_workflow())

Workflows

Daily Update

Incremental updates from critical sources:

python -m mothra.orchestrator --workflow daily_update

Crawls critical priority sources
Generates embeddings for new entities
~1-2 hours runtime

Full Refresh

Complete crawl and reindex:

python -m mothra.orchestrator --workflow full_refresh

Crawls all validated sources
Quality validation
Complete reindexing
~12-24 hours runtime

Discover New

Survey for new sources:

python -m mothra.orchestrator --workflow discover_new

Runs survey agent
Validates new sources
Updates catalog
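
The three workflows above each amount to a named, ordered list of steps. The sketch below shows how a --workflow flag could be mapped to such a list with argparse. The step identifiers are hypothetical labels derived from the bullet points above, and the real orchestrator's command-line handling may differ, including what it does when the flag is omitted.

```python
import argparse

# Step names are illustrative labels, not functions in the MOTHRA codebase.
WORKFLOWS = {
    "daily_update": ["crawl_critical_sources", "embed_new_entities"],
    "full_refresh": ["crawl_all_sources", "validate_quality", "reindex_all"],
    "discover_new": ["run_survey", "validate_sources", "update_catalog"],
}

def plan(argv: list[str]) -> list[str]:
    """Parse command-line arguments and return the ordered steps to run."""
    parser = argparse.ArgumentParser(prog="mothra.orchestrator")
    parser.add_argument("--workflow", choices=sorted(WORKFLOWS), default=None)
    args = parser.parse_args(argv)
    # Behaviour without the flag is not specified here; this sketch returns an empty plan.
    return WORKFLOWS.get(args.workflow, [])

print(plan(["--workflow", "daily_update"]))
```

Restricting the flag with choices makes a mistyped workflow name fail at parse time instead of partway through a run.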

Database Schema

Core Tables

carbon_entities: Main entity storage

UUID primary key
Entity metadata (name, type, description)
Taxonomy mappings (ISIC, NAICS, UNSPSC)
Vector embedding (dimension must match the embedding model: 384 for all-MiniLM-L6-v2, 3072 for text-embedding-3-large as set in the sample .env)
Quality scores

carbon_entity_verification: Professional verification data

GHG Protocol Scopes (1, 2, 3, biogenic)
EN 15804 LCA Stages (A1-A5, B1-B7, C1-C4, Module D)
EPD details (registration number, PCR, declared unit)
Verification tracking (body, standards, dates, validity)
Data quality indicators (temporal, geographic, technological)
Compliance flags (ISO 14067, EN 15804, GHG Protocol)

document_chunks: Large document chunking

Linked to carbon_entities
Chunk text and metadata (index, size, position)
Overlap tracking for context continuity
Individual embeddings per chunk
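
The chunk fields listed above (index, size, position, overlap) can be illustrated with a small sliding-window splitter. This is a character-based sketch under assumed defaults; the function name and the default sizes are hypothetical, and MOTHRA's own chunker may split on tokens or sentences instead.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[dict]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        piece = text[start:start + chunk_size]
        chunks.append({"index": index, "start": start, "size": len(piece), "text": piece})
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

for chunk in chunk_text("abcdefghij", chunk_size=4, overlap=1):
    print(chunk)
```

Each chunk would then be embedded separately, and the repeated characters at the boundaries give neighbouring chunks some shared context.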

emission_factors: Emission data

Linked to carbon_entities
Value, unit, scope, lifecycle stage
Uncertainty ranges
Geographic and temporal scope

data_sources: Source catalog

URL, access method, rate limits
Status tracking
Crawl history

crawl_logs: Audit trail

Per-source crawl results
Performance metrics
Error tracking

scope3_categories: Reference table

15 GHG Protocol Scope 3 categories
Upstream/downstream classification
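
For reference, the scope3_categories table corresponds to the fifteen categories defined by the GHG Protocol, where categories 1 to 8 are upstream and 9 to 15 are downstream. The sketch below lists them as plain Python data. The variable and function names are illustrative and the actual table columns may differ.

```python
# The fifteen Scope 3 categories as defined by the GHG Protocol.
SCOPE3_CATEGORIES = {
    1: "Purchased goods and services",
    2: "Capital goods",
    3: "Fuel- and energy-related activities",
    4: "Upstream transportation and distribution",
    5: "Waste generated in operations",
    6: "Business travel",
    7: "Employee commuting",
    8: "Upstream leased assets",
    9: "Downstream transportation and distribution",
    10: "Processing of sold products",
    11: "Use of sold products",
    12: "End-of-life treatment of sold products",
    13: "Downstream leased assets",
    14: "Franchises",
    15: "Investments",
}

def direction(category: int) -> str:
    """Categories 1 to 8 are upstream; 9 to 15 are downstream."""
    if category not in SCOPE3_CATEGORIES:
        raise ValueError(f"unknown Scope 3 category: {category}")
    return "upstream" if category <= 8 else "downstream"

# Rows in the shape a reference table might hold: (number, name, direction).
rows = [(n, name, direction(n)) for n, name in SCOPE3_CATEGORIES.items()]
for row in rows:
    print(row)
```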

Monitoring

Start monitoring stack:

docker-compose up -d prometheus grafana

Access dashboards:

Prometheus: http://localhost:9090
Grafana: http://localhost:3000 (admin/admin)

API Reference

Survey Agent

import asyncio
from mothra.agents.survey.survey_agent import SurveyAgent

async def main():
    async with SurveyAgent() as agent:
        count = await agent.discover_sources()
        sources = await agent.get_sources_by_priority("critical", limit=10)

asyncio.run(main())

Crawler Orchestrator

import asyncio
from mothra.agents.crawler.crawler_agent import CrawlerOrchestrator

async def main():
    async with CrawlerOrchestrator() as crawler:
        stats = await crawler.execute_crawl_plan(priority="critical")

asyncio.run(main())

Vector Manager

import asyncio
from mothra.agents.embedding.vector_manager import VectorManager

async def main():
    manager = VectorManager()
    await manager.reindex_all()
    results = await manager.semantic_search("query", limit=10)

asyncio.run(main())

Development

Running Tests

pytest tests/
pytest --cov=mothra tests/

Code Quality

# Format code
black mothra/

# Lint
ruff check mothra/

# Type check
mypy mothra/

Adding a New Data Source

Add to mothra/data/sources_catalog.yaml:

- name: "New Source"
  url: "https://example.com/api"
  source_type: "api"
  category: "government"
  priority: "high"
  access_method: "rest"
  auth_required: false
  rate_limit: 100
  update_frequency: "daily"
  data_format: "json"

Run discovery:

python -m mothra.agents.survey.survey_agent

Test crawl:

python -m mothra.agents.crawler.crawler_agent
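
Before running discovery on a new catalog entry, it can help to check the entry's shape. The sketch below validates a single entry after it has been loaded into a Python dict. It treats every field in the example above as required, which is an assumption, and the function name is hypothetical; MOTHRA may apply its own, different validation.

```python
from urllib.parse import urlparse

# Field names and types are taken from the example entry above.
REQUIRED_FIELDS = {
    "name": str,
    "url": str,
    "source_type": str,
    "category": str,
    "priority": str,
    "access_method": str,
    "auth_required": bool,
    "rate_limit": int,
    "update_frequency": str,
    "data_format": str,
}

def validate_source(entry: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the entry looks fine."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif not isinstance(entry[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    url = entry.get("url")
    if isinstance(url, str) and urlparse(url).scheme not in ("http", "https"):
        problems.append("url must start with http:// or https://")
    rate = entry.get("rate_limit")
    if isinstance(rate, int) and rate <= 0:
        problems.append("rate_limit must be positive")
    return problems

entry = {
    "name": "New Source",
    "url": "https://example.com/api",
    "source_type": "api",
    "category": "government",
    "priority": "high",
    "access_method": "rest",
    "auth_required": False,
    "rate_limit": 100,
    "update_frequency": "daily",
    "data_format": "json",
}
print(validate_source(entry))
```

Returning all problems at once, instead of raising on the first, makes it easier to fix an entry in a single pass.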

Performance

Expected Metrics

Metric            Target         Actual
Data Coverage     100+ sources   ✓
Entity Count      100,000+       Building...
Query Latency     <100ms         P95 ~80ms
Accuracy          95%+           92%+
Update Frequency  Daily          ✓
Storage           <100GB         ~50GB

Optimization Tips

Batch Processing: Use batch_size parameter for embeddings
Rate Limiting: Adjust per-source limits in config
Concurrent Crawling: Increase MAX_CONCURRENT_REQUESTS
Vector Indexing: Use HNSW indexes (already configured)
Caching: Enable Redis for intermediate results
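
The batch processing tip comes down to grouping items before each embedding call. The helper below is a generic sketch of that grouping and is not MOTHRA's batch_size implementation. Python 3.12 ships itertools.batched for the same purpose, but the stated prerequisite is Python 3.11, so the sketch uses islice.

```python
from itertools import islice

def batched(items, batch_size):
    """Yield consecutive lists of up to batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice pulls the next batch_size items; an empty list means the input is exhausted.
    while batch := list(islice(iterator, batch_size)):
        yield batch

texts = [f"entity {i}" for i in range(5)]
for batch in batched(texts, batch_size=2):
    # In a real run, each batch would go to the embedding API in one request.
    print(batch)
```

Larger batches mean fewer requests, though each request must still stay within the provider's size limits.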

Troubleshooting

Database Connection Issues

# Check PostgreSQL is running
docker-compose ps

# Verify pgvector extension
docker exec -it mothra-postgres psql -U mothra -c "SELECT * FROM pg_extension WHERE extname='vector';"

Crawl Failures

-- Check crawl logs
SELECT * FROM crawl_logs WHERE status = 'failed' ORDER BY started_at DESC LIMIT 10;

-- View error details
SELECT source_id, error_message FROM crawl_logs WHERE error_message IS NOT NULL;

Embedding Errors

Verify OpenAI API key is set
Check token limits (max 8191 for text-embedding-3-large)
Monitor rate limits

Roadmap

Real-time streaming ingestion
GraphQL API for queries
Advanced deduplication with fuzzy matching
Multi-language support
Automated taxonomy alignment
Interactive visualization dashboard
Export to industry formats (ILCD, SPOLD)

Contributing

Contributions welcome! Please:

Fork the repository
Create a feature branch
Add tests
Submit a pull request

License

[License details to be added]

Citation

If you use MOTHRA in research, please cite:

@software{mothra2025,
  title={MOTHRA: Master Agent Orchestration for Carbon Database Construction},
  author={Mothra Team},
  year={2025},
  url={https://github.com/nickgogerty/Mothra}
}

Contact

GitHub: @nickgogerty
Issues: GitHub Issues

Built with Python 🐍 | PostgreSQL 🐘 | pgvector 🔍 | OpenAI 🤖
