Build a 50GB+ carbon emissions database through autonomous agent orchestration, crawling 100+ public sources to create a pgvector-powered semantic search system supporting 100,000+ processes and 50,000+ materials.
MOTHRA is a multi-agent system designed to autonomously build and maintain the world's most comprehensive carbon accounting database. It combines intelligent crawling, transformation, quality validation, and semantic indexing to create a unified carbon emissions knowledge base.
- Multi-Agent Architecture: Specialized agents for discovery, crawling, parsing, quality control, and embedding generation
- 100+ Data Sources: Government APIs, LCA databases, EPD registries, energy grid data, and research datasets
- EC3 Integration: Direct access to 90,000+ verified EPDs from Building Transparency's EC3 database
- Professional Verification: Complete support for ISO 14067, ISO 14064, GHG Protocol, EN 15804+A2 standards
- Semantic Search: pgvector-powered similarity search with document chunking for large texts
- Quality Assurance: 5-dimensional quality scoring (completeness, accuracy, consistency, timeliness, provenance)
- Autonomous Operation: Scheduled workflows for continuous updates and maintenance
- Scalable Design: Async Python with concurrent crawling and batch processing
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β ORCHESTRATION LAYER β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β
β β Orchestrator β β Scheduler β β Queue Managerβ β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βββββββββββββββββββββΌββββββββββββββββββββ
β β β
βββββββββΌβββββββββ ββββββββΌβββββββββ βββββββββΌβββββββββ
β DISCOVERY β β COLLECTION β β QUALITY β
β - Survey Agent β β - Crawler β β - Validator β
β - Validator β β - Parser β β - Scorer β
β - Metadata β β - Transform β β - Dedup β
ββββββββββββββββββ βββββββββββββββββ ββββββββββββββββββ
β
βββββββββΌβββββββββ
β STORAGE LAYER β
β - PGVector β
β - Embeddings β
β - Semantic β
ββββββββββββββββββ
- Python 3.11+
- Docker & Docker Compose
- OpenAI API key (for embeddings)
- Clone and setup
git clone https://github.com/nickgogerty/Mothra.git
cd Mothra- Install dependencies
pip install -r requirements.txt
# OR
pip install -e .- Configure environment
cp .env.example .env
# Edit .env and add your OpenAI API key- Start PostgreSQL with pgvector
docker-compose up -d postgres- Initialize database
python -m mothra.db.session# Run complete discovery and initial crawl
python -m mothra.orchestratorDiscover Sources:
python -m mothra.agents.survey.survey_agentCrawl Data:
python -m mothra.agents.crawler.crawler_agentGenerate Embeddings:
python -m mothra.agents.embedding.vector_managerMOTHRA now includes comprehensive integration with EC3 (Embodied Carbon in Construction Calculator) and professional carbon verification standards.
# Interactive import from 10 material categories
python scripts/import_ec3_epds.py
# Test EC3 integration
python scripts/test_ec3_integration.pyAvailable Categories:
- Concrete (ready-mix, precast, blocks)
- Steel (structural, rebar, decking)
- Wood (lumber, engineered wood, CLT)
- Insulation (mineral wool, foam, cellulose)
- Glass, Aluminum, Gypsum, Roofing, Flooring, Sealants
β ISO 14067:2018 - Product Carbon Footprint β ISO 14064-1/2/3 - GHG Verification β GHG Protocol - Scope 1, 2, 3 (15 categories) β EN 15804+A2:2019 - EPD LCA Stages (A1-A5, B1-B7, C1-C4, Module D) β ISO 21930:2017 - Construction EPD Core Rules
Each EPD includes:
- LCA Stages: Full lifecycle from raw material (A1) through end-of-life (C4) and reuse potential (D)
- GHG Scopes: Direct, indirect, and value chain emissions with biogenic carbon separate
- EPD Details: Registration number, PCR reference, declared unit, verification body
- Data Quality: Temporal, geographic, and technological representativeness per ISO 14044
- Compliance Flags: ISO 14067, EN 15804, GHG Protocol, third-party verification
π Full Guide: EC3_INTEGRATION_GUIDE.md
mothra/
βββ agents/
β βββ survey/ # Source discovery and validation
β βββ crawler/ # Data collection orchestration
β βββ discovery/ # Deep dataset discovery and EC3 integration
β βββ parser/ # Format-specific parsers (JSON, XML, CSV, Excel)
β βββ transform/ # Data transformation and harmonization
β βββ quality/ # Quality scoring and validation
β βββ embedding/ # Vector generation and chunk-aware search
βββ config/ # Configuration management
βββ db/ # Database models and session management
β βββ models.py # Core SQLAlchemy models with pgvector
β βββ models_verification.py # Professional verification models
β βββ models_chunks.py # Document chunking for large texts
β βββ session.py # Async database sessions
β βββ init/ # Database initialization scripts
βββ pipelines/ # Data processing pipelines
βββ schemas/ # Data schemas and taxonomies
βββ monitoring/ # Prometheus and Grafana configs
βββ utils/ # Utilities (logging, rate limiting, retry)
βββ data/
β βββ sources_catalog.yaml # 100+ data source definitions
β βββ raw/ # Raw crawled data
β βββ processed/ # Transformed data
β βββ cache/ # Temporary cache
βββ orchestrator.py # Master orchestration
Key configuration options in .env:
# Database
POSTGRES_HOST=localhost
POSTGRES_DB=mothra
DATABASE_URL=postgresql+asyncpg://mothra:changeme@localhost:5432/mothra
# OpenAI
OPENAI_API_KEY=your-api-key-here
EMBEDDING_MODEL=text-embedding-3-large
EMBEDDING_DIMENSION=3072
# Crawler
MAX_CONCURRENT_REQUESTS=10
REQUEST_TIMEOUT=30
RETRY_ATTEMPTS=3
# Rate Limits (requests per minute)
DEFAULT_RATE_LIMIT=50
EPA_RATE_LIMIT=100The system includes 100+ pre-configured sources in mothra/data/sources_catalog.yaml:
- Government APIs: EPA, DEFRA, EIA, IPCC, EU ETS
- LCA Databases: Ecoinvent, USDA LCA Commons, ELCD
- EPD Registries: International EPD System, IBU, EPD Norge
- Energy Grid: electricityMap, ENTSO-E, ISO data
- Research: OWID, Climate Watch, Carbon Monitor
from mothra.agents.embedding.vector_manager import VectorManager
async def search_example():
manager = VectorManager()
results = await manager.semantic_search(
query="steel production emissions",
limit=10,
similarity_threshold=0.7,
entity_type="process"
)
for result in results:
print(f"{result['name']}: {result['similarity']:.2f}")from mothra.agents.quality.quality_scorer import DataQualityScorer
scorer = DataQualityScorer()
data_entry = {
"value": 2.5,
"unit": "kgCO2e",
"scope": 1,
"source_id": "EPA-12345",
"year": 2023
}
quality = scorer.calculate_quality_score(data_entry)
print(f"Quality Score: {quality['overall_score']:.2f}")
print(f"Confidence: {quality['confidence_level']}")from mothra.orchestrator import MothraOrchestrator
async def custom_workflow():
orchestrator = MothraOrchestrator()
# Run daily update
result = await orchestrator.execute_workflow("daily_update")
print(f"Crawled: {result['result']['crawl']}")
# Run quality check
result = await orchestrator.execute_workflow("quality_check")
print(f"Quality: {result['result']}")Incremental updates from critical sources:
python -m mothra.orchestrator --workflow daily_update- Crawls critical priority sources
- Generates embeddings for new entities
- ~1-2 hours runtime
Complete crawl and reindex:
python -m mothra.orchestrator --workflow full_refresh- Crawls all validated sources
- Quality validation
- Complete reindexing
- ~12-24 hours runtime
Survey for new sources:
python -m mothra.orchestrator --workflow discover_new- Runs survey agent
- Validates new sources
- Updates catalog
carbon_entities: Main entity storage
- UUID primary key
- Entity metadata (name, type, description)
- Taxonomy mappings (ISIC, NAICS, UNSPSC)
- Vector embedding (384 dimensions for all-MiniLM-L6-v2)
- Quality scores
carbon_entity_verification: Professional verification data
- GHG Protocol Scopes (1, 2, 3, biogenic)
- EN 15804 LCA Stages (A1-A5, B1-B7, C1-C4, Module D)
- EPD details (registration number, PCR, declared unit)
- Verification tracking (body, standards, dates, validity)
- Data quality indicators (temporal, geographic, technological)
- Compliance flags (ISO 14067, EN 15804, GHG Protocol)
document_chunks: Large document chunking
- Linked to carbon_entities
- Chunk text and metadata (index, size, position)
- Overlap tracking for context continuity
- Individual embeddings per chunk
emission_factors: Emission data
- Linked to carbon_entities
- Value, unit, scope, lifecycle stage
- Uncertainty ranges
- Geographic and temporal scope
data_sources: Source catalog
- URL, access method, rate limits
- Status tracking
- Crawl history
crawl_logs: Audit trail
- Per-source crawl results
- Performance metrics
- Error tracking
scope3_categories: Reference table
- 15 GHG Protocol Scope 3 categories
- Upstream/downstream classification
Start monitoring stack:
docker-compose up -d prometheus grafanaAccess dashboards:
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin/admin)
from mothra.agents.survey.survey_agent import SurveyAgent
async with SurveyAgent() as agent:
count = await agent.discover_sources()
sources = await agent.get_sources_by_priority("critical", limit=10)from mothra.agents.crawler.crawler_agent import CrawlerOrchestrator
async with CrawlerOrchestrator() as crawler:
stats = await crawler.execute_crawl_plan(priority="critical")from mothra.agents.embedding.vector_manager import VectorManager
manager = VectorManager()
await manager.reindex_all()
results = await manager.semantic_search("query", limit=10)pytest tests/
pytest --cov=mothra tests/# Format code
black mothra/
# Lint
ruff check mothra/
# Type check
mypy mothra/- Add to
mothra/data/sources_catalog.yaml:
- name: "New Source"
url: "https://example.com/api"
source_type: "api"
category: "government"
priority: "high"
access_method: "rest"
auth_required: false
rate_limit: 100
update_frequency: "daily"
data_format: "json"- Run discovery:
python -m mothra.agents.survey.survey_agent- Test crawl:
python -m mothra.agents.crawler.crawler_agent| Metric | Target | Actual |
|---|---|---|
| Data Coverage | 100+ sources | β |
| Entity Count | 100,000+ | Building... |
| Query Latency | <100ms | P95 ~80ms |
| Accuracy | 95%+ | 92%+ |
| Update Frequency | Daily | β |
| Storage | <100GB | ~50GB |
- Batch Processing: Use
batch_sizeparameter for embeddings - Rate Limiting: Adjust per-source limits in config
- Concurrent Crawling: Increase
MAX_CONCURRENT_REQUESTS - Vector Indexing: Use HNSW indexes (already configured)
- Caching: Enable Redis for intermediate results
# Check PostgreSQL is running
docker-compose ps
# Verify pgvector extension
docker exec -it mothra-postgres psql -U mothra -c "SELECT * FROM pg_extension WHERE extname='vector';"# Check crawl logs
SELECT * FROM crawl_logs WHERE status = 'failed' ORDER BY started_at DESC LIMIT 10;
# View error details
SELECT source_id, error_message FROM crawl_logs WHERE error_message IS NOT NULL;- Verify OpenAI API key is set
- Check token limits (max 8191 for text-embedding-3-large)
- Monitor rate limits
- Real-time streaming ingestion
- GraphQL API for queries
- Advanced deduplication with fuzzy matching
- Multi-language support
- Automated taxonomy alignment
- Interactive visualization dashboard
- Export to industry formats (ILCD, SPOLD)
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests
- Submit a pull request
[License details to be added]
If you use MOTHRA in research, please cite:
@software{mothra2025,
title={MOTHRA: Master Agent Orchestration for Carbon Database Construction},
author={Mothra Team},
year={2025},
url={https://github.com/nickgogerty/Mothra}
}- GitHub: @nickgogerty
- Issues: GitHub Issues
Built with Python π | PostgreSQL π | pgvector π | OpenAI π€