Skip to content

frameIQ/pygaeb

Repository files navigation

pyGAEB

Python parser for GAEB DA XML construction data exchange files, with LLM-powered item classification.

CI codecov PyPI version Python 3.9+ License: MIT

Deutsche Version (README.de.md)

pyGAEB parses, validates, classifies, and writes GAEB DA XML files (versions 2.0 through 3.3), producing a unified Pydantic v2 domain model from all inputs. It supports the full GAEB exchange phase spectrum — procurement (X80–X89), trade (X93–X97), cost & calculation (X50–X52), and quantity determination (X31).

An optional LLM classification layer enriches each item with a semantic construction element type via LiteLLM (100+ providers), with pluggable caching and customisable taxonomy.

Highlights

  • Multi-version — DA XML 2.0, 2.1, 3.0, 3.1, 3.2, 3.3 auto-detected
  • All exchange phases — Procurement, Trade, Cost & Calculation, Quantity Determination
  • Security-hardened — XXE prevention, Billion Laughs protection, file size guards, recursion depth limits
  • Extensible — Custom validators, post-parse hooks, raw XML data collection, custom LLM taxonomy
  • LLM classification — 100+ provider support via LiteLLM with cost estimation and persistent caching
  • Cross-phase validation — X83→X84 structural identity, X86→X89 unit price matching, X86→X88 addendum traceability
  • Document diff — Compare two BoQs with significance-classified field changes, structural diff, and financial impact
  • BoQ Builder — Programmatic document construction with auto OZ, Decimal convenience, phase rules, and version checks
  • Excel export — Structured .xlsx workbooks with hierarchy-aware layout, phase-specific columns, and multi-sheet mode
  • Round-trip — Parse → modify → write back to any DA XML version
  • Version conversion — Upgrade/downgrade between DA XML 2.0–3.3

Installation

# Core parser + writer + export (zero LLM dependencies)
pip install pyGAEB

# With LLM classification (supports 100+ providers via LiteLLM)
pip install pyGAEB[llm]

Quick Start

Parse any GAEB file

from pygaeb import GAEBParser

doc = GAEBParser.parse("tender.X83")    # DA XML 3.x
doc = GAEBParser.parse("old.D83")       # DA XML 2.x — same call

print(doc.source_version)               # SourceVersion.DA_XML_33
print(doc.exchange_phase)               # ExchangePhase.X83
print(doc.grand_total)                  # Decimal("1234567.89")

Iterate items

Works for all document kinds — procurement, trade, cost, and quantity:

for item in doc.iter_items():
    print(item.oz)              # "01.02.0030"
    print(item.short_text)      # "Mauerwerk der Innenwand…"
    print(item.qty)             # Decimal("1170.000")
    print(item.unit)            # "m2"
    print(item.unit_price)      # Decimal("45.50")
    print(item.total_price)     # Decimal("53235.00")
    print(item.item_type)       # ItemType.NORMAL

Validation

from pygaeb import GAEBParser, ValidationMode

# Lenient (default) — collect warnings, keep parsing
doc = GAEBParser.parse("tender.X83")
for issue in doc.validation_results:
    print(issue.severity, issue.message)

# Strict — raise on first ERROR
doc = GAEBParser.parse("tender.X83", validation=ValidationMode.STRICT)

Custom Validators

Register project-specific validation rules:

from pygaeb import register_validator, clear_validators
from pygaeb.models.item import ValidationResult
from pygaeb.models.enums import ValidationSeverity

def require_unit(doc):
    issues = []
    for item in doc.iter_items():
        if not item.unit:
            issues.append(
                ValidationResult(
                    severity=ValidationSeverity.WARNING,
                    message=f"{item.oz}: missing unit",
                )
            )
    return issues

register_validator(require_unit)
doc = GAEBParser.parse("tender.X83")
# require_unit results are now in doc.validation_results

# Or per-call (not added to the global registry):
doc = GAEBParser.parse("tender.X83", extra_validators=[require_unit])

Write / Round-trip

from pygaeb import GAEBWriter, ExchangePhase
from decimal import Decimal

doc = GAEBParser.parse("tender.X83")
item = doc.award.boq.get_item("01.02.0030")
item.unit_price = Decimal("48.00")

GAEBWriter.write(doc, "bid.X84", phase=ExchangePhase.X84)

Version Conversion

from pygaeb import GAEBConverter, SourceVersion

# Upgrade 2.x → 3.3
report = GAEBConverter.convert("old.D83", "modern.X83")

# Downgrade 3.3 → 3.2 for compatibility
report = GAEBConverter.convert(
    "tender.X83", "compat.X83",
    target_version=SourceVersion.DA_XML_32,
)
print(f"Converted {report.items_converted} items, data loss: {report.has_data_loss}")

Export to JSON / CSV

from pygaeb.convert import to_json, to_csv

to_json(doc, "boq.json")     # full nested BoQ tree
to_csv(doc, "items.csv")     # flat item table with classification columns

Trade Phases (X93–X97)

doc = GAEBParser.parse("order.X96")
print(doc.document_kind)    # DocumentKind.TRADE
print(doc.is_trade)         # True

for item in doc.order.items:
    print(item.art_no, item.short_text, item.net_price)

print(doc.order.supplier_info.address.name)

Cost & Calculation Phases (X50–X52)

doc = GAEBParser.parse("costing.X50")
print(doc.document_kind)    # DocumentKind.COST

for elem in doc.elemental_costing.body.iter_cost_elements():
    print(elem.ele_no, elem.short_text, elem.total_cost)

Quantity Determination (X31)

doc = GAEBParser.parse("measurements.X31")
print(doc.document_kind)    # DocumentKind.QUANTITY

for item in doc.qty_determination.boq.iter_items():
    print(item.oz, item.qty_determ_items)

Financial Summaries & Project Info

doc = GAEBParser.parse("tender.X86")

# BoQ-level totals
totals = doc.award.boq.info.totals
print(totals.total_net, totals.total_gross, totals.vat_amount)

# Per-VAT-rate breakdown
for vp in totals.vat_parts:
    print(f"{vp.vat_pcnt}%: net {vp.net_amount} → gross {vp.gross_amount}")

# Project metadata
print(doc.award.prj_id, doc.award.description, doc.award.currency_label)

Tree Navigation (BoQ Hierarchy)

Navigate the BoQ with parent references, depth tracking, and indexed lookups:

from pygaeb import BoQTree, NodeKind

tree = BoQTree(doc.award.boq)

# Find an item and navigate up
node = tree.find_item("01.01.0010")
print(node.parent.label)       # "Mauerwerk"
print(node.depth)              # level in tree
print(node.label_path)         # ["BoQ", "Default", "Rohbau", "Mauerwerk", "..."]
print(node.siblings)           # sibling nodes

# Walk the hierarchy
for node in tree.walk():
    indent = "  " * node.depth
    print(f"{indent}{node.kind.value}: {node.label}")

# Subtree queries
expensive = tree.root.find_all(
    lambda n: n.kind == NodeKind.ITEM
    and n.item.total_price
    and n.item.total_price > 50000
)

Document Diff (Compare Two BoQs)

Compare two GAEB documents and get structured, significance-classified changes:

from pygaeb import GAEBParser, BoQDiff, DiffMode, Significance

doc_a = GAEBParser.parse("tender_v1.X83")
doc_b = GAEBParser.parse("tender_v2.X83")

result = BoQDiff.compare(doc_a, doc_b)

# Top-level summary
print(result.summary.total_changes)      # 12
print(result.summary.financial_impact)   # Decimal("45230.00")
print(result.summary.max_significance)   # Significance.CRITICAL

# Items added / removed / modified
for item in result.items.added:
    print(f"+ {item.oz}: {item.short_text}")

for item in result.items.removed:
    print(f"- {item.oz}: {item.short_text}")

# Field-level changes with significance
for mod in result.items.modified:
    for change in mod.changes:
        print(f"  {mod.oz} {change.field}: {change.old_value}{change.new_value}"
              f" [{change.significance.value}]")

# Filter by significance
critical_only = result.items.filter_modified(Significance.CRITICAL)

# Structural changes (sections added/removed/renamed)
for sec in result.structure.sections_added:
    print(f"New section: {sec.label}")

# Strict mode: raises ValueError if documents are from different projects
result = BoQDiff.compare(doc_a, doc_b, mode=DiffMode.STRICT)

Build a Document from Scratch

from pygaeb import BoQBuilder, GAEBWriter

builder = BoQBuilder(phase="X83", version="3.3")
builder.project(no="PRJ-001", name="School Renovation", currency="EUR")

lot = builder.add_lot("1", "Structural Work")
concrete = lot.add_category("01", "Concrete")
concrete.add_item("01.0010", "Foundation", qty=120, unit="m3", unit_price=85)
concrete.add_item("01.0020", "Columns",   qty=40,  unit="m3", unit_price=95)

doc = builder.build()               # GAEBDocument with auto totals
GAEBWriter.write(doc, "output.X83") # Write to GAEB XML

Excel Export

from pygaeb import GAEBParser
from pygaeb.convert import to_excel

doc = GAEBParser.parse("tender.X83")

# Single structured sheet with hierarchy
to_excel(doc, "tender.xlsx")

# Multi-sheet workbook (BoQ + Items + Summary + Info)
to_excel(doc, "tender_full.xlsx", mode="full")

# With optional columns
to_excel(doc, "detailed.xlsx", include_long_text=True, include_classification=True)

LLM Classification

from pygaeb import LLMClassifier

# Default: in-memory cache (no disk I/O, session-scoped)
classifier = LLMClassifier(model="anthropic/claude-sonnet-4-6")
# classifier = LLMClassifier(model="gpt-4o")
# classifier = LLMClassifier(model="ollama/llama3")  # local, free, private

# Opt-in: persistent SQLite cache (survives across runs)
from pygaeb import SQLiteCache
classifier = LLMClassifier(cache=SQLiteCache("~/.pygaeb/cache"))

# Custom taxonomy and prompt
classifier = LLMClassifier(
    model="openai/gpt-4o",
    taxonomy={"Electrical": {"Cable": ["Ladder", "Perforated"]}},
    prompt_template="You are a specialist classifying MEP items...",
)

# Check cost before running
estimate = await classifier.estimate_cost(doc)
print(f"Will classify {estimate.items_to_classify} items for ~${estimate.estimated_cost_usd:.2f}")

# Classify all items
await classifier.enrich(doc)

# Or synchronous
classifier.enrich_sync(doc)

for item in doc.iter_items():
    if item.classification:
        print(item.oz, item.classification.element_type, item.classification.confidence)

Structured Extraction — Custom Schemas

After classification, extract typed attributes into your own Pydantic schema:

from pydantic import BaseModel, Field
from typing import Optional
from pygaeb import StructuredExtractor

class DoorSpec(BaseModel):
    door_type: str = Field("", description="single, double, sliding")
    width_mm: Optional[int] = Field(None, description="Width in mm")
    fire_rating: Optional[str] = Field(None, description="T30, T60, T90")
    glazing: bool = Field(False, description="Has glass panels")
    material: str = Field("", description="wood, steel, aluminium")

extractor = StructuredExtractor(model="anthropic/claude-sonnet-4-6")

# Extract from all items classified as "Door"
doors = await extractor.extract(doc, schema=DoorSpec, element_type="Door")
for item, spec in doors:
    print(item.oz, spec.door_type, spec.fire_rating, spec.width_mm)

# Filter by trade (broad) or sub_type (narrow)
pipes = await extractor.extract(doc, schema=PipeSpec, trade="MEP-Plumbing")
fire_doors = await extractor.extract(doc, schema=DoorSpec, sub_type="Fire Door")

# Or synchronous
doors = extractor.extract_sync(doc, schema=DoorSpec, element_type="Door")

Built-in starter schemas: DoorSpec, WindowSpec, WallSpec, PipeSpec — or define your own.

Post-Parse Hook & Raw Data Collection

Extract vendor-specific XML elements during parsing:

def extract_vendor_codes(item, el):
    if el is None:
        return
    ns = {"g": "http://www.gaeb.de/GAEB_DA_XML/DA86/3.3"}
    codes = el.findall(".//g:VendorCostCode", ns)
    if codes:
        item.raw_data = item.raw_data or {}
        item.raw_data["vendor_codes"] = [c.text for c in codes]

doc = GAEBParser.parse("file.X83", post_parse_hook=extract_vendor_codes)

Or automatically collect all unknown XML elements:

doc = GAEBParser.parse("file.X83", collect_raw_data=True)
for item in doc.iter_items():
    if item.raw_data:
        print(f"{item.oz}: extra fields = {item.raw_data}")

Custom & Vendor Tags (XPath)

doc = GAEBParser.parse("vendor_file.X83", keep_xml=True)

# XPath across the whole document
codes = doc.xpath("//g:VendorCostCode/text()")

# Per-item raw element access
for item in doc.iter_items():
    el = item.source_element  # original lxml element

# Free memory when done
doc.discard_xml()

Custom Cache Backend

from pygaeb import CacheBackend, InMemoryCache, SQLiteCache

# Default: in-memory (LRU-bounded, session-scoped)
classifier = LLMClassifier()

# Persistent: SQLite
classifier = LLMClassifier(cache=SQLiteCache("~/.pygaeb/cache"))

# Bring your own: implement CacheBackend protocol
class RedisCache:
    def get(self, key: str) -> str | None: ...
    def put(self, key: str, value: str) -> None: ...
    def delete(self, key: str) -> None: ...
    def keys(self) -> list[str]: ...
    def clear(self) -> None: ...
    def close(self) -> None: ...

classifier = LLMClassifier(cache=RedisCache())

Cross-Phase Validation

from pygaeb import GAEBParser, CrossPhaseValidator

# Tender → Bid: structural identity check
tender = GAEBParser.parse("tender.X83")
bid = GAEBParser.parse("bid.X84")
issues = CrossPhaseValidator.check(source=tender, response=bid)

# Contract → Invoice: unit prices must match
contract = GAEBParser.parse("contract.X86")
invoice = GAEBParser.parse("invoice.X89")
issues = CrossPhaseValidator.check(source=contract, response=invoice)

# Contract → Addendum: change order traceability
addendum = GAEBParser.parse("nachtrag.X88")
issues = CrossPhaseValidator.check(source=contract, response=addendum)

for issue in issues:
    print(issue.severity, issue.message)

Supported Versions & Exchange Phases

Version Parser Track Status
DA XML 2.0 Track A (German elements) v1.0
DA XML 2.1 Track A (German elements) v1.0
DA XML 3.0 Track B (English elements) v1.0
DA XML 3.1 Track B (English elements) v1.0
DA XML 3.2 Track B (English elements) v1.0
DA XML 3.3 Track B (English elements) v1.0
GAEB 90 Track C (fixed-width) Planned
Phase Description Since
X31 Quantity Determination v1.4.0
X50, X51, X52 Cost & Calculation v1.3.0
X80–X86 Procurement (tender, bid, award) v1.0.0
X88 Addendum / Nachtrag (claims & variations) v1.12.0
X89, X89B Invoice / extended invoice v1.0.0
X93, X94, X96, X97 Trade (material ordering) v1.2.0

Configuration

from pygaeb import configure

configure(
    default_model="ollama/llama3",        # LLM model for classification
    classifier_concurrency=10,            # parallel LLM calls
    xsd_dir="/opt/gaeb-schemas",          # optional XSD validation
    log_level="DEBUG",                    # applied to pygaeb.* loggers
    max_file_size_mb=200,                 # input file size limit
)

Or via environment variables:

export PYGAEB_DEFAULT_MODEL=ollama/llama3
export PYGAEB_XSD_DIR=/opt/gaeb-schemas
export PYGAEB_LOG_LEVEL=DEBUG
export PYGAEB_MAX_FILE_SIZE_MB=200

Security

pyGAEB includes security hardening since v1.6.0:

  • XXE prevention — All XML parsing uses hardened parsers with resolve_entities=False and no_network=True
  • Billion Laughs protection — Entity expansion bombs are blocked
  • File size guard — Configurable limit (default 100 MB) prevents memory exhaustion
  • Recursion depth limits — Hierarchy walkers cap at 50 levels to prevent stack overflow
  • Bounded cachingInMemoryCache uses LRU eviction (default 10,000 entries)

Documentation

Full documentation is available at Read the Docs.

License

MIT — see LICENSE for details.

About

Python parser for GAEB DA XML construction data exchange files (v2.0–3.3) with LLM-powered item classification and structured extraction.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Contributors

Languages