Python parser for GAEB DA XML construction data exchange files, with LLM-powered item classification.
pyGAEB parses, validates, classifies, and writes GAEB DA XML files (versions 2.0 through 3.3), producing a unified Pydantic v2 domain model from all inputs. It supports the full GAEB exchange phase spectrum — procurement (X80–X89), trade (X93–X97), cost & calculation (X50–X52), and quantity determination (X31).
An optional LLM classification layer enriches each item with a semantic construction element type via LiteLLM (100+ providers), with pluggable caching and customisable taxonomy.
- Multi-version — DA XML 2.0, 2.1, 3.0, 3.1, 3.2, 3.3 auto-detected
- All exchange phases — Procurement, Trade, Cost & Calculation, Quantity Determination
- Security-hardened — XXE prevention, Billion Laughs protection, file size guards, recursion depth limits
- Extensible — Custom validators, post-parse hooks, raw XML data collection, custom LLM taxonomy
- LLM classification — 100+ provider support via LiteLLM with cost estimation and persistent caching
- Cross-phase validation — X83→X84 structural identity, X86→X89 unit price matching, X86→X88 addendum traceability
- Document diff — Compare two BoQs with significance-classified field changes, structural diff, and financial impact
- BoQ Builder — Programmatic document construction with auto OZ, Decimal convenience, phase rules, and version checks
- Excel export — Structured .xlsx workbooks with hierarchy-aware layout, phase-specific columns, and multi-sheet mode
- Round-trip — Parse → modify → write back to any DA XML version
- Version conversion — Upgrade/downgrade between DA XML 2.0–3.3
# Core parser + writer + export (zero LLM dependencies)
pip install pyGAEB
# With LLM classification (supports 100+ providers via LiteLLM)
pip install pyGAEB[llm]from pygaeb import GAEBParser
doc = GAEBParser.parse("tender.X83") # DA XML 3.x
doc = GAEBParser.parse("old.D83") # DA XML 2.x — same call
print(doc.source_version) # SourceVersion.DA_XML_33
print(doc.exchange_phase) # ExchangePhase.X83
print(doc.grand_total) # Decimal("1234567.89")Works for all document kinds — procurement, trade, cost, and quantity:
for item in doc.iter_items():
print(item.oz) # "01.02.0030"
print(item.short_text) # "Mauerwerk der Innenwand…"
print(item.qty) # Decimal("1170.000")
print(item.unit) # "m2"
print(item.unit_price) # Decimal("45.50")
print(item.total_price) # Decimal("53235.00")
print(item.item_type) # ItemType.NORMALfrom pygaeb import GAEBParser, ValidationMode
# Lenient (default) — collect warnings, keep parsing
doc = GAEBParser.parse("tender.X83")
for issue in doc.validation_results:
print(issue.severity, issue.message)
# Strict — raise on first ERROR
doc = GAEBParser.parse("tender.X83", validation=ValidationMode.STRICT)Register project-specific validation rules:
from pygaeb import register_validator, clear_validators
from pygaeb.models.item import ValidationResult
from pygaeb.models.enums import ValidationSeverity
def require_unit(doc):
issues = []
for item in doc.iter_items():
if not item.unit:
issues.append(
ValidationResult(
severity=ValidationSeverity.WARNING,
message=f"{item.oz}: missing unit",
)
)
return issues
register_validator(require_unit)
doc = GAEBParser.parse("tender.X83")
# require_unit results are now in doc.validation_results
# Or per-call (not added to the global registry):
doc = GAEBParser.parse("tender.X83", extra_validators=[require_unit])from pygaeb import GAEBWriter, ExchangePhase
from decimal import Decimal
doc = GAEBParser.parse("tender.X83")
item = doc.award.boq.get_item("01.02.0030")
item.unit_price = Decimal("48.00")
GAEBWriter.write(doc, "bid.X84", phase=ExchangePhase.X84)from pygaeb import GAEBConverter, SourceVersion
# Upgrade 2.x → 3.3
report = GAEBConverter.convert("old.D83", "modern.X83")
# Downgrade 3.3 → 3.2 for compatibility
report = GAEBConverter.convert(
"tender.X83", "compat.X83",
target_version=SourceVersion.DA_XML_32,
)
print(f"Converted {report.items_converted} items, data loss: {report.has_data_loss}")from pygaeb.convert import to_json, to_csv
to_json(doc, "boq.json") # full nested BoQ tree
to_csv(doc, "items.csv") # flat item table with classification columnsdoc = GAEBParser.parse("order.X96")
print(doc.document_kind) # DocumentKind.TRADE
print(doc.is_trade) # True
for item in doc.order.items:
print(item.art_no, item.short_text, item.net_price)
print(doc.order.supplier_info.address.name)doc = GAEBParser.parse("costing.X50")
print(doc.document_kind) # DocumentKind.COST
for elem in doc.elemental_costing.body.iter_cost_elements():
print(elem.ele_no, elem.short_text, elem.total_cost)doc = GAEBParser.parse("measurements.X31")
print(doc.document_kind) # DocumentKind.QUANTITY
for item in doc.qty_determination.boq.iter_items():
print(item.oz, item.qty_determ_items)doc = GAEBParser.parse("tender.X86")
# BoQ-level totals
totals = doc.award.boq.info.totals
print(totals.total_net, totals.total_gross, totals.vat_amount)
# Per-VAT-rate breakdown
for vp in totals.vat_parts:
print(f"{vp.vat_pcnt}%: net {vp.net_amount} → gross {vp.gross_amount}")
# Project metadata
print(doc.award.prj_id, doc.award.description, doc.award.currency_label)Navigate the BoQ with parent references, depth tracking, and indexed lookups:
from pygaeb import BoQTree, NodeKind
tree = BoQTree(doc.award.boq)
# Find an item and navigate up
node = tree.find_item("01.01.0010")
print(node.parent.label) # "Mauerwerk"
print(node.depth) # level in tree
print(node.label_path) # ["BoQ", "Default", "Rohbau", "Mauerwerk", "..."]
print(node.siblings) # sibling nodes
# Walk the hierarchy
for node in tree.walk():
indent = " " * node.depth
print(f"{indent}{node.kind.value}: {node.label}")
# Subtree queries
expensive = tree.root.find_all(
lambda n: n.kind == NodeKind.ITEM
and n.item.total_price
and n.item.total_price > 50000
)Compare two GAEB documents and get structured, significance-classified changes:
from pygaeb import GAEBParser, BoQDiff, DiffMode, Significance
doc_a = GAEBParser.parse("tender_v1.X83")
doc_b = GAEBParser.parse("tender_v2.X83")
result = BoQDiff.compare(doc_a, doc_b)
# Top-level summary
print(result.summary.total_changes) # 12
print(result.summary.financial_impact) # Decimal("45230.00")
print(result.summary.max_significance) # Significance.CRITICAL
# Items added / removed / modified
for item in result.items.added:
print(f"+ {item.oz}: {item.short_text}")
for item in result.items.removed:
print(f"- {item.oz}: {item.short_text}")
# Field-level changes with significance
for mod in result.items.modified:
for change in mod.changes:
print(f" {mod.oz} {change.field}: {change.old_value} → {change.new_value}"
f" [{change.significance.value}]")
# Filter by significance
critical_only = result.items.filter_modified(Significance.CRITICAL)
# Structural changes (sections added/removed/renamed)
for sec in result.structure.sections_added:
print(f"New section: {sec.label}")
# Strict mode: raises ValueError if documents are from different projects
result = BoQDiff.compare(doc_a, doc_b, mode=DiffMode.STRICT)from pygaeb import BoQBuilder, GAEBWriter
builder = BoQBuilder(phase="X83", version="3.3")
builder.project(no="PRJ-001", name="School Renovation", currency="EUR")
lot = builder.add_lot("1", "Structural Work")
concrete = lot.add_category("01", "Concrete")
concrete.add_item("01.0010", "Foundation", qty=120, unit="m3", unit_price=85)
concrete.add_item("01.0020", "Columns", qty=40, unit="m3", unit_price=95)
doc = builder.build() # GAEBDocument with auto totals
GAEBWriter.write(doc, "output.X83") # Write to GAEB XMLfrom pygaeb import GAEBParser
from pygaeb.convert import to_excel
doc = GAEBParser.parse("tender.X83")
# Single structured sheet with hierarchy
to_excel(doc, "tender.xlsx")
# Multi-sheet workbook (BoQ + Items + Summary + Info)
to_excel(doc, "tender_full.xlsx", mode="full")
# With optional columns
to_excel(doc, "detailed.xlsx", include_long_text=True, include_classification=True)from pygaeb import LLMClassifier
# Default: in-memory cache (no disk I/O, session-scoped)
classifier = LLMClassifier(model="anthropic/claude-sonnet-4-6")
# classifier = LLMClassifier(model="gpt-4o")
# classifier = LLMClassifier(model="ollama/llama3") # local, free, private
# Opt-in: persistent SQLite cache (survives across runs)
from pygaeb import SQLiteCache
classifier = LLMClassifier(cache=SQLiteCache("~/.pygaeb/cache"))
# Custom taxonomy and prompt
classifier = LLMClassifier(
model="openai/gpt-4o",
taxonomy={"Electrical": {"Cable": ["Ladder", "Perforated"]}},
prompt_template="You are a specialist classifying MEP items...",
)
# Check cost before running
estimate = await classifier.estimate_cost(doc)
print(f"Will classify {estimate.items_to_classify} items for ~${estimate.estimated_cost_usd:.2f}")
# Classify all items
await classifier.enrich(doc)
# Or synchronous
classifier.enrich_sync(doc)
for item in doc.iter_items():
if item.classification:
print(item.oz, item.classification.element_type, item.classification.confidence)After classification, extract typed attributes into your own Pydantic schema:
from pydantic import BaseModel, Field
from typing import Optional
from pygaeb import StructuredExtractor
class DoorSpec(BaseModel):
door_type: str = Field("", description="single, double, sliding")
width_mm: Optional[int] = Field(None, description="Width in mm")
fire_rating: Optional[str] = Field(None, description="T30, T60, T90")
glazing: bool = Field(False, description="Has glass panels")
material: str = Field("", description="wood, steel, aluminium")
extractor = StructuredExtractor(model="anthropic/claude-sonnet-4-6")
# Extract from all items classified as "Door"
doors = await extractor.extract(doc, schema=DoorSpec, element_type="Door")
for item, spec in doors:
print(item.oz, spec.door_type, spec.fire_rating, spec.width_mm)
# Filter by trade (broad) or sub_type (narrow)
pipes = await extractor.extract(doc, schema=PipeSpec, trade="MEP-Plumbing")
fire_doors = await extractor.extract(doc, schema=DoorSpec, sub_type="Fire Door")
# Or synchronous
doors = extractor.extract_sync(doc, schema=DoorSpec, element_type="Door")Built-in starter schemas: DoorSpec, WindowSpec, WallSpec, PipeSpec — or define your own.
Extract vendor-specific XML elements during parsing:
def extract_vendor_codes(item, el):
if el is None:
return
ns = {"g": "http://www.gaeb.de/GAEB_DA_XML/DA86/3.3"}
codes = el.findall(".//g:VendorCostCode", ns)
if codes:
item.raw_data = item.raw_data or {}
item.raw_data["vendor_codes"] = [c.text for c in codes]
doc = GAEBParser.parse("file.X83", post_parse_hook=extract_vendor_codes)Or automatically collect all unknown XML elements:
doc = GAEBParser.parse("file.X83", collect_raw_data=True)
for item in doc.iter_items():
if item.raw_data:
print(f"{item.oz}: extra fields = {item.raw_data}")doc = GAEBParser.parse("vendor_file.X83", keep_xml=True)
# XPath across the whole document
codes = doc.xpath("//g:VendorCostCode/text()")
# Per-item raw element access
for item in doc.iter_items():
el = item.source_element # original lxml element
# Free memory when done
doc.discard_xml()from pygaeb import CacheBackend, InMemoryCache, SQLiteCache
# Default: in-memory (LRU-bounded, session-scoped)
classifier = LLMClassifier()
# Persistent: SQLite
classifier = LLMClassifier(cache=SQLiteCache("~/.pygaeb/cache"))
# Bring your own: implement CacheBackend protocol
class RedisCache:
def get(self, key: str) -> str | None: ...
def put(self, key: str, value: str) -> None: ...
def delete(self, key: str) -> None: ...
def keys(self) -> list[str]: ...
def clear(self) -> None: ...
def close(self) -> None: ...
classifier = LLMClassifier(cache=RedisCache())from pygaeb import GAEBParser, CrossPhaseValidator
# Tender → Bid: structural identity check
tender = GAEBParser.parse("tender.X83")
bid = GAEBParser.parse("bid.X84")
issues = CrossPhaseValidator.check(source=tender, response=bid)
# Contract → Invoice: unit prices must match
contract = GAEBParser.parse("contract.X86")
invoice = GAEBParser.parse("invoice.X89")
issues = CrossPhaseValidator.check(source=contract, response=invoice)
# Contract → Addendum: change order traceability
addendum = GAEBParser.parse("nachtrag.X88")
issues = CrossPhaseValidator.check(source=contract, response=addendum)
for issue in issues:
print(issue.severity, issue.message)| Version | Parser Track | Status |
|---|---|---|
| DA XML 2.0 | Track A (German elements) | v1.0 |
| DA XML 2.1 | Track A (German elements) | v1.0 |
| DA XML 3.0 | Track B (English elements) | v1.0 |
| DA XML 3.1 | Track B (English elements) | v1.0 |
| DA XML 3.2 | Track B (English elements) | v1.0 |
| DA XML 3.3 | Track B (English elements) | v1.0 |
| GAEB 90 | Track C (fixed-width) | Planned |
| Phase | Description | Since |
|---|---|---|
| X31 | Quantity Determination | v1.4.0 |
| X50, X51, X52 | Cost & Calculation | v1.3.0 |
| X80–X86 | Procurement (tender, bid, award) | v1.0.0 |
| X88 | Addendum / Nachtrag (claims & variations) | v1.12.0 |
| X89, X89B | Invoice / extended invoice | v1.0.0 |
| X93, X94, X96, X97 | Trade (material ordering) | v1.2.0 |
from pygaeb import configure
configure(
default_model="ollama/llama3", # LLM model for classification
classifier_concurrency=10, # parallel LLM calls
xsd_dir="/opt/gaeb-schemas", # optional XSD validation
log_level="DEBUG", # applied to pygaeb.* loggers
max_file_size_mb=200, # input file size limit
)Or via environment variables:
export PYGAEB_DEFAULT_MODEL=ollama/llama3
export PYGAEB_XSD_DIR=/opt/gaeb-schemas
export PYGAEB_LOG_LEVEL=DEBUG
export PYGAEB_MAX_FILE_SIZE_MB=200pyGAEB includes security hardening since v1.6.0:
- XXE prevention — All XML parsing uses hardened parsers with
resolve_entities=Falseandno_network=True - Billion Laughs protection — Entity expansion bombs are blocked
- File size guard — Configurable limit (default 100 MB) prevents memory exhaustion
- Recursion depth limits — Hierarchy walkers cap at 50 levels to prevent stack overflow
- Bounded caching —
InMemoryCacheuses LRU eviction (default 10,000 entries)
Full documentation is available at Read the Docs.
- Quick Start
- Parsing Guide
- Trade Phases
- Cost & Calculation
- Quantity Determination
- Tree Navigation
- Document Diff
- BoQ Builder
- Excel Export
- Extensibility
- Classification
- Version Conversion
- API Reference
MIT — see LICENSE for details.