nickgogerty/Mothra

MOTHRA: Master Agent Orchestration for Carbon Database Construction

Python 3.11+ PostgreSQL pgvector

Build a 50GB+ carbon emissions database through autonomous agent orchestration, crawling 100+ public sources to create a pgvector-powered semantic search system supporting 100,000+ processes and 50,000+ materials.

Overview

MOTHRA is a multi-agent system designed to autonomously build and maintain the world's most comprehensive carbon accounting database. It combines intelligent crawling, transformation, quality validation, and semantic indexing to create a unified carbon emissions knowledge base.

Key Features

  • Multi-Agent Architecture: Specialized agents for discovery, crawling, parsing, quality control, and embedding generation
  • 100+ Data Sources: Government APIs, LCA databases, EPD registries, energy grid data, and research datasets
  • EC3 Integration: Direct access to 90,000+ verified EPDs from Building Transparency's EC3 database
  • Professional Verification: Complete support for the ISO 14067, ISO 14064, GHG Protocol, and EN 15804+A2 standards
  • Semantic Search: pgvector-powered similarity search with document chunking for large texts
  • Quality Assurance: 5-dimensional quality scoring (completeness, accuracy, consistency, timeliness, provenance)
  • Autonomous Operation: Scheduled workflows for continuous updates and maintenance
  • Scalable Design: Async Python with concurrent crawling and batch processing
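As a rough illustration of the 5-dimensional quality scoring listed above, the sketch below combines per-dimension scores into one overall score with a weighted mean. The weights and the `overall_quality` helper are assumptions made for this example; the README names the five dimensions but does not say how MOTHRA weights them.

```python
# Hypothetical weights: the five dimension names come from the feature list,
# but the weighting itself is an assumption for illustration only.
WEIGHTS = {
    "completeness": 0.25,
    "accuracy": 0.25,
    "consistency": 0.20,
    "timeliness": 0.15,
    "provenance": 0.15,
}


def overall_quality(scores: dict[str, float]) -> float:
    """Return the weighted mean of per-dimension scores, each expected in [0, 1]."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    example = {
        "completeness": 0.9,
        "accuracy": 0.8,
        "consistency": 1.0,
        "timeliness": 0.6,
        "provenance": 0.7,
    }
    print(f"Overall quality: {overall_quality(example):.2f}")
```

A weighted mean keeps the score interpretable: raising one weight makes that dimension count for more without changing the 0 to 1 range.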

Architecture

┌──────────────────────────────────────────────────────────────┐
│                   ORCHESTRATION LAYER                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐        │
│  │ Orchestrator │  │   Scheduler  │  │ Queue Manager│        │
│  └──────────────┘  └──────────────┘  └──────────────┘        │
└──────────────────────────────────────────────────────────────┘
                            │
        ┌───────────────────┼───────────────────┐
        │                   │                   │
┌───────▼────────┐  ┌───────▼───────┐  ┌────────▼───────┐
│ DISCOVERY      │  │  COLLECTION   │  │   QUALITY      │
│ - Survey Agent │  │  - Crawler    │  │  - Validator   │
│ - Validator    │  │  - Parser     │  │  - Scorer      │
│ - Metadata     │  │  - Transform  │  │  - Dedup       │
└────────────────┘  └───────────────┘  └────────────────┘
                            │
                    ┌───────▼────────┐
                    │  STORAGE LAYER │
                    │  - PGVector    │
                    │  - Embeddings  │
                    │  - Semantic    │
                    └────────────────┘

Quick Start

Prerequisites

  • Python 3.11+
  • Docker & Docker Compose
  • OpenAI API key (for embeddings)

Installation

  1. Clone and set up
git clone https://github.com/nickgogerty/Mothra.git
cd Mothra
  2. Install dependencies
pip install -r requirements.txt
# OR
pip install -e .
  3. Configure environment
cp .env.example .env
# Edit .env and add your OpenAI API key
  4. Start PostgreSQL with pgvector
docker-compose up -d postgres
  5. Initialize database
python -m mothra.db.session

Running MOTHRA

Option 1: Master Orchestrator (Recommended)

# Run complete discovery and initial crawl
python -m mothra.orchestrator

Option 2: Individual Agents

Discover Sources:

python -m mothra.agents.survey.survey_agent

Crawl Data:

python -m mothra.agents.crawler.crawler_agent

Generate Embeddings:

python -m mothra.agents.embedding.vector_manager

EC3 Integration & Professional Verification

MOTHRA now includes comprehensive integration with EC3 (Embodied Carbon in Construction Calculator) and professional carbon verification standards.

Import Construction Material EPDs

# Interactive import from 10 material categories
python scripts/import_ec3_epds.py

# Test EC3 integration
python scripts/test_ec3_integration.py

Available Categories:

  • Concrete (ready-mix, precast, blocks)
  • Steel (structural, rebar, decking)
  • Wood (lumber, engineered wood, CLT)
  • Insulation (mineral wool, foam, cellulose)
  • Glass, Aluminum, Gypsum, Roofing, Flooring, Sealants

Professional Standards Supported

✅ ISO 14067:2018 - Product Carbon Footprint
✅ ISO 14064-1/2/3 - GHG Verification
✅ GHG Protocol - Scope 1, 2, 3 (15 categories)
✅ EN 15804+A2:2019 - EPD LCA Stages (A1-A5, B1-B7, C1-C4, Module D)
✅ ISO 21930:2017 - Construction EPD Core Rules

Verification Data Captured

Each EPD includes:

  • LCA Stages: Full lifecycle from raw material (A1) through end-of-life (C4) and reuse potential (D)
  • GHG Scopes: Direct, indirect, and value chain emissions, with biogenic carbon reported separately
  • EPD Details: Registration number, PCR reference, declared unit, verification body
  • Data Quality: Temporal, geographic, and technological representativeness per ISO 14044
  • Compliance Flags: ISO 14067, EN 15804, GHG Protocol, third-party verification
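To make the EN 15804 stage structure above concrete, the sketch below groups per-module values into the standard's life cycle stages (A1-A3 product, A4-A5 construction, B1-B7 use, C1-C4 end of life, Module D beyond the system boundary). The `gwp_by_group` helper, the group labels, and the sample numbers are hypothetical; MOTHRA's own verification models may store these values differently.

```python
# EN 15804 information modules grouped by life cycle stage.
# The dictionary keys are labels chosen for this example.
STAGE_GROUPS = {
    "product": ["A1", "A2", "A3"],
    "construction": ["A4", "A5"],
    "use": [f"B{i}" for i in range(1, 8)],
    "end_of_life": [f"C{i}" for i in range(1, 5)],
    "beyond_boundary": ["D"],
}


def gwp_by_group(stage_values: dict[str, float]) -> dict[str, float]:
    """Sum declared module values per stage group, skipping undeclared modules."""
    totals: dict[str, float] = {}
    for group, modules in STAGE_GROUPS.items():
        declared = [stage_values[m] for m in modules if m in stage_values]
        if declared:
            totals[group] = sum(declared)
    return totals


if __name__ == "__main__":
    # Made-up values in kgCO2e per declared unit, for illustration only.
    epd = {"A1": 1.0, "A2": 0.5, "A3": 2.0, "C4": 0.25, "D": -0.5}
    for group, total in gwp_by_group(epd).items():
        print(f"{group}: {total}")
```

Module D is kept in its own group because EN 15804 reports it separately from the A to C totals rather than netting it against them.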

📖 Full Guide: EC3_INTEGRATION_GUIDE.md

Project Structure

mothra/
├── agents/
│   ├── survey/         # Source discovery and validation
│   ├── crawler/        # Data collection orchestration
│   ├── discovery/      # Deep dataset discovery and EC3 integration
│   ├── parser/         # Format-specific parsers (JSON, XML, CSV, Excel)
│   ├── transform/      # Data transformation and harmonization
│   ├── quality/        # Quality scoring and validation
│   └── embedding/      # Vector generation and chunk-aware search
├── config/             # Configuration management
├── db/                 # Database models and session management
│   ├── models.py       # Core SQLAlchemy models with pgvector
│   ├── models_verification.py  # Professional verification models
│   ├── models_chunks.py        # Document chunking for large texts
│   ├── session.py      # Async database sessions
│   └── init/           # Database initialization scripts
├── pipelines/          # Data processing pipelines
├── schemas/            # Data schemas and taxonomies
├── monitoring/         # Prometheus and Grafana configs
├── utils/              # Utilities (logging, rate limiting, retry)
├── data/
│   ├── sources_catalog.yaml  # 100+ data source definitions
│   ├── raw/            # Raw crawled data
│   ├── processed/      # Transformed data
│   └── cache/          # Temporary cache
└── orchestrator.py     # Master orchestration

Configuration

Environment Variables

Key configuration options in .env:

# Database
POSTGRES_HOST=localhost
POSTGRES_DB=mothra
DATABASE_URL=postgresql+asyncpg://mothra:changeme@localhost:5432/mothra

# OpenAI
OPENAI_API_KEY=your-api-key-here
EMBEDDING_MODEL=text-embedding-3-large
EMBEDDING_DIMENSION=3072

# Crawler
MAX_CONCURRENT_REQUESTS=10
REQUEST_TIMEOUT=30
RETRY_ATTEMPTS=3

# Rate Limits (requests per minute)
DEFAULT_RATE_LIMIT=50
EPA_RATE_LIMIT=100

Data Sources

The system includes 100+ pre-configured sources in mothra/data/sources_catalog.yaml:

  • Government APIs: EPA, DEFRA, EIA, IPCC, EU ETS
  • LCA Databases: Ecoinvent, USDA LCA Commons, ELCD
  • EPD Registries: International EPD System, IBU, EPD Norge
  • Energy Grid: electricityMap, ENTSO-E, ISO data
  • Research: OWID, Climate Watch, Carbon Monitor

Usage Examples

Semantic Search

import asyncio

from mothra.agents.embedding.vector_manager import VectorManager

async def search_example():
    manager = VectorManager()

    # Return up to 10 "process" entities with a similarity of at least 0.7
    results = await manager.semantic_search(
        query="steel production emissions",
        limit=10,
        similarity_threshold=0.7,
        entity_type="process"
    )

    for result in results:
        print(f"{result['name']}: {result['similarity']:.2f}")

asyncio.run(search_example())
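To show what a `similarity_threshold` such as 0.7 means in practice, the sketch below computes cosine similarity in plain Python and keeps only candidates at or above the threshold. It assumes the search ranks by cosine similarity, which the README does not state; in pgvector the `<=>` operator returns cosine distance, so the equivalent similarity is 1 minus that distance. The function names and toy vectors are made up for the example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_similarity(query, candidates, threshold=0.7, limit=10):
    """Return (name, similarity) pairs at or above the threshold, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    kept = [pair for pair in scored if pair[1] >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]


if __name__ == "__main__":
    # Toy 2-dimensional vectors; real embeddings have hundreds of dimensions.
    query = [1.0, 0.0]
    candidates = {"close": [0.9, 0.1], "orthogonal": [0.0, 1.0]}
    print(filter_by_similarity(query, candidates))
```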

Quality Scoring

from mothra.agents.quality.quality_scorer import DataQualityScorer

scorer = DataQualityScorer()

data_entry = {
    "value": 2.5,
    "unit": "kgCO2e",
    "scope": 1,
    "source_id": "EPA-12345",
    "year": 2023
}

quality = scorer.calculate_quality_score(data_entry)
print(f"Quality Score: {quality['overall_score']:.2f}")
print(f"Confidence: {quality['confidence_level']}")

Custom Workflow

import asyncio

from mothra.orchestrator import MothraOrchestrator

async def custom_workflow():
    orchestrator = MothraOrchestrator()

    # Run daily update
    result = await orchestrator.execute_workflow("daily_update")
    print(f"Crawled: {result['result']['crawl']}")

    # Run quality check
    result = await orchestrator.execute_workflow("quality_check")
    print(f"Quality: {result['result']}")

asyncio.run(custom_workflow())

Workflows

Daily Update

Incremental updates from critical sources:

python -m mothra.orchestrator --workflow daily_update
  • Crawls critical priority sources
  • Generates embeddings for new entities
  • ~1-2 hours runtime

Full Refresh

Complete crawl and reindex:

python -m mothra.orchestrator --workflow full_refresh
  • Crawls all validated sources
  • Quality validation
  • Complete reindexing
  • ~12-24 hours runtime

Discover New

Survey for new sources:

python -m mothra.orchestrator --workflow discover_new
  • Runs survey agent
  • Validates new sources
  • Updates catalog

Database Schema

Core Tables

carbon_entities: Main entity storage

  • UUID primary key
  • Entity metadata (name, type, description)
  • Taxonomy mappings (ISIC, NAICS, UNSPSC)
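  • Vector embedding (384 dimensions for all-MiniLM-L6-v2; the column dimension must match EMBEDDING_DIMENSION, which is 3072 in the sample .env for text-embedding-3-large)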
  • Quality scores

carbon_entity_verification: Professional verification data

  • GHG Protocol Scopes (1, 2, 3, biogenic)
  • EN 15804 LCA Stages (A1-A5, B1-B7, C1-C4, Module D)
  • EPD details (registration number, PCR, declared unit)
  • Verification tracking (body, standards, dates, validity)
  • Data quality indicators (temporal, geographic, technological)
  • Compliance flags (ISO 14067, EN 15804, GHG Protocol)

document_chunks: Large document chunking

  • Linked to carbon_entities
  • Chunk text and metadata (index, size, position)
  • Overlap tracking for context continuity
  • Individual embeddings per chunk
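The chunk fields above (index, size, position, overlap) can be pictured with a simple character-based splitter. This is a minimal sketch under stated assumptions: the `Chunk` class, the `chunk_text` function, and the default sizes are hypothetical, and the real `models_chunks.py` may split on tokens or sentences instead of characters.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    index: int   # position of the chunk in the document
    start: int   # character offset where the chunk begins
    end: int     # character offset just past the chunk
    text: str


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[Chunk]:
    """Split text into fixed-size chunks whose neighbours share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    chunks: list[Chunk] = []
    step = chunk_size - overlap
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(Chunk(len(chunks), start, end, text[start:end]))
        if end == len(text):
            break  # the final chunk already reaches the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, overlap=1):
        print(chunk)
```

Each chunk would then be embedded on its own, and the shared overlap keeps a sentence that straddles a boundary visible in both neighbours.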

emission_factors: Emission data

  • Linked to carbon_entities
  • Value, unit, scope, lifecycle stage
  • Uncertainty ranges
  • Geographic and temporal scope
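As a worked example of how an emission factor with an uncertainty range is applied, the sketch below multiplies an activity amount by the low, central, and high factor values. The `EmissionFactor` class and its field names are hypothetical stand-ins for the table columns listed above, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmissionFactor:
    """Hypothetical in-memory view of one emission_factors row."""
    value: float   # central estimate, e.g. kgCO2e per kg of material
    unit: str
    scope: int
    lower: float   # lower bound of the uncertainty range
    upper: float   # upper bound of the uncertainty range

    def emissions(self, activity: float) -> tuple[float, float, float]:
        """Return (low, central, high) emissions for the given activity amount."""
        return (self.lower * activity, self.value * activity, self.upper * activity)


if __name__ == "__main__":
    factor = EmissionFactor(value=2.5, unit="kgCO2e/kg", scope=1, lower=2.0, upper=3.0)
    low, central, high = factor.emissions(10)
    print(f"{central} kgCO2e (range {low} to {high})")
```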

data_sources: Source catalog

  • URL, access method, rate limits
  • Status tracking
  • Crawl history

crawl_logs: Audit trail

  • Per-source crawl results
  • Performance metrics
  • Error tracking

scope3_categories: Reference table

  • 15 GHG Protocol Scope 3 categories
  • Upstream/downstream classification
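For reference, the 15 Scope 3 categories and their upstream/downstream split, as defined by the GHG Protocol Corporate Value Chain Standard, can be written as a small lookup table. The Python structure and the `scope3_direction` helper are illustrative only; the actual `scope3_categories` table layout is not shown in this README.

```python
# GHG Protocol Scope 3 categories: 1-8 are upstream, 9-15 are downstream.
SCOPE3_CATEGORIES = {
    1: ("Purchased goods and services", "upstream"),
    2: ("Capital goods", "upstream"),
    3: ("Fuel- and energy-related activities", "upstream"),
    4: ("Upstream transportation and distribution", "upstream"),
    5: ("Waste generated in operations", "upstream"),
    6: ("Business travel", "upstream"),
    7: ("Employee commuting", "upstream"),
    8: ("Upstream leased assets", "upstream"),
    9: ("Downstream transportation and distribution", "downstream"),
    10: ("Processing of sold products", "downstream"),
    11: ("Use of sold products", "downstream"),
    12: ("End-of-life treatment of sold products", "downstream"),
    13: ("Downstream leased assets", "downstream"),
    14: ("Franchises", "downstream"),
    15: ("Investments", "downstream"),
}


def scope3_direction(category: int) -> str:
    """Return 'upstream' or 'downstream' for a Scope 3 category number."""
    if category not in SCOPE3_CATEGORIES:
        raise ValueError(f"unknown Scope 3 category: {category}")
    return SCOPE3_CATEGORIES[category][1]


if __name__ == "__main__":
    for number, (name, direction) in SCOPE3_CATEGORIES.items():
        print(f"{number:>2}. {name} ({direction})")
```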

Monitoring

Start monitoring stack:

docker-compose up -d prometheus grafana

Access dashboards:

API Reference

Survey Agent

import asyncio
from mothra.agents.survey.survey_agent import SurveyAgent

async def main():
    async with SurveyAgent() as agent:
        count = await agent.discover_sources()
        sources = await agent.get_sources_by_priority("critical", limit=10)

asyncio.run(main())

Crawler Orchestrator

import asyncio
from mothra.agents.crawler.crawler_agent import CrawlerOrchestrator

async def main():
    async with CrawlerOrchestrator() as crawler:
        stats = await crawler.execute_crawl_plan(priority="critical")

asyncio.run(main())

Vector Manager

import asyncio
from mothra.agents.embedding.vector_manager import VectorManager

async def main():
    manager = VectorManager()
    await manager.reindex_all()
    results = await manager.semantic_search("query", limit=10)

asyncio.run(main())

Development

Running Tests

pytest tests/
pytest --cov=mothra tests/

Code Quality

# Format code
black mothra/

# Lint
ruff check mothra/

# Type check
mypy mothra/

Adding a New Data Source

  1. Add to mothra/data/sources_catalog.yaml:
- name: "New Source"
  url: "https://example.com/api"
  source_type: "api"
  category: "government"
  priority: "high"
  access_method: "rest"
  auth_required: false
  rate_limit: 100
  update_frequency: "daily"
  data_format: "json"
  2. Run discovery:
python -m mothra.agents.survey.survey_agent
  3. Test crawl:
python -m mothra.agents.crawler.crawler_agent
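Before running discovery, it can help to sanity-check a new catalog entry. The sketch below validates an entry that has already been parsed into a Python dict, using the field names from the YAML example above. The `validate_source` function and its rules are assumptions for illustration; the survey agent may apply different or stricter checks.

```python
# Field names come from the catalog example; the expected types are assumptions.
REQUIRED_FIELDS = {
    "name": str,
    "url": str,
    "source_type": str,
    "category": str,
    "priority": str,
    "access_method": str,
    "auth_required": bool,
    "rate_limit": int,
    "update_frequency": str,
    "data_format": str,
}


def validate_source(entry: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the entry looks valid."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
            continue
        value = entry[field]
        # bool is a subclass of int, so reject True/False where a number is expected.
        wrong_bool = expected is int and isinstance(value, bool)
        if wrong_bool or not isinstance(value, expected):
            problems.append(f"{field} should be of type {expected.__name__}")
    url = entry.get("url")
    if isinstance(url, str) and not url.startswith(("http://", "https://")):
        problems.append("url must start with http:// or https://")
    return problems


if __name__ == "__main__":
    entry = {
        "name": "New Source", "url": "https://example.com/api", "source_type": "api",
        "category": "government", "priority": "high", "access_method": "rest",
        "auth_required": False, "rate_limit": 100, "update_frequency": "daily",
        "data_format": "json",
    }
    print(validate_source(entry) or "entry looks valid")
```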

Performance

Expected Metrics

| Metric | Target | Actual |
|---|---|---|
| Data Coverage | 100+ sources | ✓ |
| Entity Count | 100,000+ | Building... |
| Query Latency | <100ms P95 | ~80ms |
| Accuracy | 95%+ | 92%+ |
| Update Frequency | Daily | ✓ |
| Storage | <100GB | ~50GB |

Optimization Tips

  1. Batch Processing: Use batch_size parameter for embeddings
  2. Rate Limiting: Adjust per-source limits in config
  3. Concurrent Crawling: Increase MAX_CONCURRENT_REQUESTS
  4. Vector Indexing: Use HNSW indexes (already configured)
  5. Caching: Enable Redis for intermediate results
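To illustrate what a setting like MAX_CONCURRENT_REQUESTS controls, the sketch below caps the number of in-flight requests with `asyncio.Semaphore`. The `crawl_all` function and the `fetch` callable are hypothetical; this is a generic asyncio pattern, not necessarily how MOTHRA's crawler is implemented.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def crawl_all(
    urls: list[str],
    fetch: Callable[[str], Awaitable[object]],
    max_concurrent: int = 10,
) -> list[object]:
    """Fetch every URL, never running more than `max_concurrent` fetches at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str) -> object:
        async with semaphore:  # waits here while the limit is reached
            return await fetch(url)

    # gather preserves the order of `urls` in the returned list
    return await asyncio.gather(*(bounded(url) for url in urls))


if __name__ == "__main__":
    async def fake_fetch(url: str) -> str:
        await asyncio.sleep(0.01)  # stand-in for a real HTTP request
        return f"fetched {url}"

    urls = [f"https://example.com/{i}" for i in range(5)]
    print(asyncio.run(crawl_all(urls, fake_fetch, max_concurrent=2)))
```

Raising the limit speeds up crawling only until a source's rate limit or the network becomes the bottleneck, so the two settings are best tuned together.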

Troubleshooting

Database Connection Issues

# Check PostgreSQL is running
docker-compose ps

# Verify pgvector extension
docker exec -it mothra-postgres psql -U mothra -c "SELECT * FROM pg_extension WHERE extname='vector';"

Crawl Failures

-- Check crawl logs
SELECT * FROM crawl_logs WHERE status = 'failed' ORDER BY started_at DESC LIMIT 10;

-- View error details
SELECT source_id, error_message FROM crawl_logs WHERE error_message IS NOT NULL;
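When the logged errors turn out to be transient (timeouts, temporary 5xx responses), a common remedy is to retry with exponential backoff, bounded by a setting such as RETRY_ATTEMPTS. The sketch below shows that general pattern; `retry_async` and its parameters are hypothetical and are not taken from MOTHRA's `utils/` retry helper.

```python
import asyncio
import random
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


async def retry_async(
    operation: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
) -> T:
    """Call `operation` up to `attempts` times, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # A little random jitter avoids many clients retrying in lockstep.
            await asyncio.sleep(delay + random.uniform(0, delay / 10))
    raise AssertionError("unreachable")


if __name__ == "__main__":
    calls = {"count": 0}

    async def flaky() -> str:
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("temporary failure")
        return "ok"

    print(asyncio.run(retry_async(flaky, attempts=3, base_delay=0.01)))
```

In real use it is usually better to catch only the exception types known to be transient, so that permanent failures such as authentication errors are not retried.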

Embedding Errors

  • Verify OpenAI API key is set
  • Check token limits (max 8191 for text-embedding-3-large)
  • Monitor rate limits

Roadmap

  • Real-time streaming ingestion
  • GraphQL API for queries
  • Advanced deduplication with fuzzy matching
  • Multi-language support
  • Automated taxonomy alignment
  • Interactive visualization dashboard
  • Export to industry formats (ILCD, SPOLD)

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests
  4. Submit a pull request

License

[License details to be added]

Citation

If you use MOTHRA in research, please cite:

@software{mothra2025,
  title={MOTHRA: Master Agent Orchestration for Carbon Database Construction},
  author={Mothra Team},
  year={2025},
  url={https://github.com/nickgogerty/Mothra}
}

Contact


Built with Python 🐍 | PostgreSQL 🐘 | pgvector 🔍 | OpenAI 🤖
