pyGAEB

Python parser for GAEB DA XML construction data exchange files, with LLM-powered item classification.

German version (README.de.md)

pyGAEB parses, validates, classifies, and writes GAEB DA XML files (versions 2.0 through 3.3), producing a unified Pydantic v2 domain model from all inputs. It supports the full GAEB exchange phase spectrum — procurement (X80–X89), trade (X93–X97), cost & calculation (X50–X52), and quantity determination (X31).

An optional LLM classification layer enriches each item with a semantic construction element type via LiteLLM (100+ providers), with pluggable caching and customisable taxonomy.

Highlights

Multi-version — DA XML 2.0, 2.1, 3.0, 3.1, 3.2, 3.3 auto-detected
All exchange phases — Procurement, Trade, Cost & Calculation, Quantity Determination
Security-hardened — XXE prevention, Billion Laughs protection, file size guards, recursion depth limits
Extensible — Custom validators, post-parse hooks, raw XML data collection, custom LLM taxonomy
LLM classification — 100+ provider support via LiteLLM with cost estimation and persistent caching
Cross-phase validation — X83→X84 structural identity, X86→X89 unit price matching, X86→X88 addendum traceability
Document diff — Compare two BoQs with significance-classified field changes, structural diff, and financial impact
BoQ Builder — Programmatic document construction with auto OZ, Decimal convenience, phase rules, and version checks
Excel export — Structured .xlsx workbooks with hierarchy-aware layout, phase-specific columns, and multi-sheet mode
Round-trip — Parse → modify → write back to any DA XML version
Version conversion — Upgrade/downgrade between DA XML 2.0–3.3

Installation

# Core parser + writer + export (zero LLM dependencies)
pip install pyGAEB

# With LLM classification (supports 100+ providers via LiteLLM)
pip install pyGAEB[llm]

Quick Start

Parse any GAEB file

from pygaeb import GAEBParser

doc = GAEBParser.parse("tender.X83")    # DA XML 3.x
doc = GAEBParser.parse("old.D83")       # DA XML 2.x — same call

print(doc.source_version)               # SourceVersion.DA_XML_33
print(doc.exchange_phase)               # ExchangePhase.X83
print(doc.grand_total)                  # Decimal("1234567.89")

Iterate items

Works for all document kinds — procurement, trade, cost, and quantity:

for item in doc.iter_items():
    print(item.oz)              # "01.02.0030"
    print(item.short_text)      # "Mauerwerk der Innenwand…"
    print(item.qty)             # Decimal("1170.000")
    print(item.unit)            # "m2"
    print(item.unit_price)      # Decimal("45.50")
    print(item.total_price)     # Decimal("53235.00")
    print(item.item_type)       # ItemType.NORMAL
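
Because each item exposes its OZ and Decimal prices, simple roll-ups need only the standard library. The following is a minimal sketch, assuming OZ strings are dot-separated with the top-level section first (as in "01.02.0030"). The `rows` list is made-up sample data standing in for `(item.oz, item.total_price)` pairs collected from `doc.iter_items()`, and `subtotals_by_section` is a hypothetical helper, not part of pyGAEB.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical sample data: (oz, total_price) pairs as they might be
# collected from doc.iter_items(); total_price is None for unpriced items.
rows = [
    ("01.01.0010", Decimal("1200.00")),
    ("01.02.0030", Decimal("53235.00")),
    ("02.01.0010", Decimal("800.50")),
    ("02.01.0020", None),
]


def subtotals_by_section(rows):
    """Sum total prices per top-level OZ segment, skipping unpriced items."""
    totals = defaultdict(Decimal)  # Decimal() is Decimal("0")
    for oz, total in rows:
        if total is None:
            continue
        section = oz.split(".", 1)[0]
        totals[section] += total
    return dict(totals)


section_totals = subtotals_by_section(rows)
for section, total in sorted(section_totals.items()):
    print(section, total)
```

Keeping the amounts as `Decimal` throughout avoids the rounding drift that float sums would introduce.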

Validation

from pygaeb import GAEBParser, ValidationMode

# Lenient (default) — collect warnings, keep parsing
doc = GAEBParser.parse("tender.X83")
for issue in doc.validation_results:
    print(issue.severity, issue.message)

# Strict — raise on first ERROR
doc = GAEBParser.parse("tender.X83", validation=ValidationMode.STRICT)

Custom Validators

Register project-specific validation rules:

from pygaeb import GAEBParser, register_validator, clear_validators
from pygaeb.models.item import ValidationResult
from pygaeb.models.enums import ValidationSeverity

def require_unit(doc):
    issues = []
    for item in doc.iter_items():
        if not item.unit:
            issues.append(
                ValidationResult(
                    severity=ValidationSeverity.WARNING,
                    message=f"{item.oz}: missing unit",
                )
            )
    return issues

register_validator(require_unit)
doc = GAEBParser.parse("tender.X83")
# require_unit results are now in doc.validation_results

# Or per-call (not added to the global registry):
doc = GAEBParser.parse("tender.X83", extra_validators=[require_unit])
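
Since a validator is a plain function over the document's items, the rule itself can be developed and tested on stand-in objects before it is registered. The sketch below makes several assumptions: the items are `SimpleNamespace` stand-ins exposing the `oz`, `qty`, `unit_price`, and `total_price` attributes shown earlier, the one-cent tolerance is arbitrary, and `find_price_mismatches` is a hypothetical helper that returns plain message strings, which would still need wrapping in `ValidationResult` as `require_unit` does.

```python
from decimal import Decimal
from types import SimpleNamespace


def find_price_mismatches(items, tolerance=Decimal("0.01")):
    """Return messages for items where qty * unit_price differs from total_price."""
    messages = []
    for item in items:
        # Skip items that are not fully priced.
        if item.qty is None or item.unit_price is None or item.total_price is None:
            continue
        expected = (item.qty * item.unit_price).quantize(Decimal("0.01"))
        if abs(expected - item.total_price) > tolerance:
            messages.append(
                f"{item.oz}: total {item.total_price} != qty x unit price {expected}"
            )
    return messages


# Stand-in items for illustration only (not real pyGAEB objects).
sample_items = [
    SimpleNamespace(oz="01.02.0030", qty=Decimal("1170.000"),
                    unit_price=Decimal("45.50"), total_price=Decimal("53235.00")),
    SimpleNamespace(oz="01.02.0040", qty=Decimal("10"),
                    unit_price=Decimal("2.50"), total_price=Decimal("30.00")),
    SimpleNamespace(oz="01.02.0050", qty=Decimal("5"),
                    unit_price=None, total_price=None),
]

mismatches = find_price_mismatches(sample_items)
print(mismatches)
```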

Write / Round-trip

from pygaeb import GAEBParser, GAEBWriter, ExchangePhase
from decimal import Decimal

doc = GAEBParser.parse("tender.X83")
item = doc.award.boq.get_item("01.02.0030")
item.unit_price = Decimal("48.00")

GAEBWriter.write(doc, "bid.X84", phase=ExchangePhase.X84)

Version Conversion

from pygaeb import GAEBConverter, SourceVersion

# Upgrade 2.x → 3.3
report = GAEBConverter.convert("old.D83", "modern.X83")

# Downgrade 3.3 → 3.2 for compatibility
report = GAEBConverter.convert(
    "tender.X83", "compat.X83",
    target_version=SourceVersion.DA_XML_32,
)
print(f"Converted {report.items_converted} items, data loss: {report.has_data_loss}")

Export to JSON / CSV

from pygaeb.convert import to_json, to_csv

to_json(doc, "boq.json")     # full nested BoQ tree
to_csv(doc, "items.csv")     # flat item table with classification columns

Trade Phases (X93–X97)

doc = GAEBParser.parse("order.X96")
print(doc.document_kind)    # DocumentKind.TRADE
print(doc.is_trade)         # True

for item in doc.order.items:
    print(item.art_no, item.short_text, item.net_price)

print(doc.order.supplier_info.address.name)

Cost & Calculation Phases (X50–X52)

doc = GAEBParser.parse("costing.X50")
print(doc.document_kind)    # DocumentKind.COST

for elem in doc.elemental_costing.body.iter_cost_elements():
    print(elem.ele_no, elem.short_text, elem.total_cost)

Quantity Determination (X31)

doc = GAEBParser.parse("measurements.X31")
print(doc.document_kind)    # DocumentKind.QUANTITY

for item in doc.qty_determination.boq.iter_items():
    print(item.oz, item.qty_determ_items)

Financial Summaries & Project Info

doc = GAEBParser.parse("tender.X86")

# BoQ-level totals
totals = doc.award.boq.info.totals
print(totals.total_net, totals.total_gross, totals.vat_amount)

# Per-VAT-rate breakdown
for vp in totals.vat_parts:
    print(f"{vp.vat_pcnt}%: net {vp.net_amount} → gross {vp.gross_amount}")

# Project metadata
print(doc.award.prj_id, doc.award.description, doc.award.currency_label)
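
The totals are related by simple arithmetic: for each VAT rate, the VAT amount is net times rate divided by 100, and gross is net plus VAT. The worked sketch below uses made-up amounts and assumes half-up rounding to cents; the rounding rule and the `gross_from_net` helper are assumptions for illustration, not taken from pyGAEB.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def gross_from_net(net, vat_pcnt):
    """Return (vat_amount, gross) for a net amount and a VAT rate in percent."""
    vat = (net * vat_pcnt / Decimal("100")).quantize(CENT, rounding=ROUND_HALF_UP)
    return vat, net + vat


# Hypothetical per-rate net amounts: {vat_pcnt: net_amount}
net_by_rate = {Decimal("19"): Decimal("1000.00"), Decimal("7"): Decimal("250.00")}

total_net = sum(net_by_rate.values(), Decimal("0"))
total_vat = Decimal("0")
for rate, net in net_by_rate.items():
    vat, gross = gross_from_net(net, rate)
    total_vat += vat
    print(f"{rate}%: net {net} -> gross {gross}")

total_gross = total_net + total_vat
print(total_net, total_vat, total_gross)
```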

Tree Navigation (BoQ Hierarchy)

Navigate the BoQ with parent references, depth tracking, and indexed lookups:

from pygaeb import BoQTree, NodeKind

tree = BoQTree(doc.award.boq)

# Find an item and navigate up
node = tree.find_item("01.01.0010")
print(node.parent.label)       # "Mauerwerk"
print(node.depth)              # level in tree
print(node.label_path)         # ["BoQ", "Default", "Rohbau", "Mauerwerk", "..."]
print(node.siblings)           # sibling nodes

# Walk the hierarchy
for node in tree.walk():
    indent = "  " * node.depth
    print(f"{indent}{node.kind.value}: {node.label}")

# Subtree queries
expensive = tree.root.find_all(
    lambda n: n.kind == NodeKind.ITEM
    and n.item.total_price
    and n.item.total_price > 50000
)
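
To make the ideas of parent references, depth, and label paths concrete, here is a small standalone sketch of such a tree. `SketchNode` is a hypothetical class written for this illustration; it is not how `BoQTree` is implemented, and the labels are sample values.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class SketchNode:
    label: str
    # Exclude links from repr/eq so parent <-> child cycles cause no recursion.
    parent: Optional["SketchNode"] = field(default=None, repr=False, compare=False)
    children: List["SketchNode"] = field(default_factory=list, repr=False, compare=False)

    def add(self, label: str) -> "SketchNode":
        """Create a child node that points back to this node."""
        child = SketchNode(label, parent=self)
        self.children.append(child)
        return child

    @property
    def depth(self) -> int:
        """Number of ancestors between this node and the root."""
        return 0 if self.parent is None else self.parent.depth + 1

    @property
    def label_path(self) -> List[str]:
        """Labels from the root down to this node."""
        prefix = [] if self.parent is None else self.parent.label_path
        return prefix + [self.label]

    def walk(self) -> Iterator["SketchNode"]:
        """Depth-first, pre-order traversal."""
        yield self
        for child in self.children:
            yield from child.walk()


root = SketchNode("BoQ")
masonry = root.add("Rohbau").add("Mauerwerk")
item_node = masonry.add("01.01.0010")

for node in root.walk():
    print("  " * node.depth + node.label)
```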

Document Diff (Compare Two BoQs)

Compare two GAEB documents and get structured, significance-classified changes:

from pygaeb import GAEBParser, BoQDiff, DiffMode, Significance

doc_a = GAEBParser.parse("tender_v1.X83")
doc_b = GAEBParser.parse("tender_v2.X83")

result = BoQDiff.compare(doc_a, doc_b)

# Top-level summary
print(result.summary.total_changes)      # 12
print(result.summary.financial_impact)   # Decimal("45230.00")
print(result.summary.max_significance)   # Significance.CRITICAL

# Items added / removed / modified
for item in result.items.added:
    print(f"+ {item.oz}: {item.short_text}")

for item in result.items.removed:
    print(f"- {item.oz}: {item.short_text}")

# Field-level changes with significance
for mod in result.items.modified:
    for change in mod.changes:
        print(f"  {mod.oz} {change.field}: {change.old_value} → {change.new_value}"
              f" [{change.significance.value}]")

# Filter by significance
critical_only = result.items.filter_modified(Significance.CRITICAL)

# Structural changes (sections added/removed/renamed)
for sec in result.structure.sections_added:
    print(f"New section: {sec.label}")

# Strict mode: raises ValueError if documents are from different projects
result = BoQDiff.compare(doc_a, doc_b, mode=DiffMode.STRICT)
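
The added / removed / modified split can be pictured as set operations on OZ keys followed by a field-by-field comparison. The sketch below shows that idea on plain dictionaries with made-up data. `diff_items` is a hypothetical helper, not the `BoQDiff` implementation, and it does not attempt significance classification or financial impact.

```python
from decimal import Decimal


def diff_items(old, new):
    """Compare two {oz: {field: value}} mappings.

    Returns (added, removed, modified), where modified maps each OZ to a
    {field: (old_value, new_value)} dict of changed fields.
    """
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    modified = {}
    for oz in sorted(old.keys() & new.keys()):
        fields = old[oz].keys() | new[oz].keys()
        changes = {
            name: (old[oz].get(name), new[oz].get(name))
            for name in sorted(fields)
            if old[oz].get(name) != new[oz].get(name)
        }
        if changes:
            modified[oz] = changes
    return added, removed, modified


# Made-up sample data keyed by OZ.
v1 = {
    "01.0010": {"qty": Decimal("120"), "unit": "m3"},
    "01.0020": {"qty": Decimal("40"), "unit": "m3"},
}
v2 = {
    "01.0010": {"qty": Decimal("150"), "unit": "m3"},
    "01.0030": {"qty": Decimal("5"), "unit": "St"},
}

added, removed, modified = diff_items(v1, v2)
print(added, removed, modified)
```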

Build a Document from Scratch

from pygaeb import BoQBuilder, GAEBWriter

builder = BoQBuilder(phase="X83", version="3.3")
builder.project(no="PRJ-001", name="School Renovation", currency="EUR")

lot = builder.add_lot("1", "Structural Work")
concrete = lot.add_category("01", "Concrete")
concrete.add_item("01.0010", "Foundation", qty=120, unit="m3", unit_price=85)
concrete.add_item("01.0020", "Columns",   qty=40,  unit="m3", unit_price=95)

doc = builder.build()               # GAEBDocument with auto totals
GAEBWriter.write(doc, "output.X83") # Write to GAEB XML

Excel Export

from pygaeb import GAEBParser
from pygaeb.convert import to_excel

doc = GAEBParser.parse("tender.X83")

# Single structured sheet with hierarchy
to_excel(doc, "tender.xlsx")

# Multi-sheet workbook (BoQ + Items + Summary + Info)
to_excel(doc, "tender_full.xlsx", mode="full")

# With optional columns
to_excel(doc, "detailed.xlsx", include_long_text=True, include_classification=True)

LLM Classification

from pygaeb import LLMClassifier

# Default: in-memory cache (no disk I/O, session-scoped)
classifier = LLMClassifier(model="anthropic/claude-sonnet-4-6")
# classifier = LLMClassifier(model="gpt-4o")
# classifier = LLMClassifier(model="ollama/llama3")  # local, free, private

# Opt-in: persistent SQLite cache (survives across runs)
from pygaeb import SQLiteCache
classifier = LLMClassifier(cache=SQLiteCache("~/.pygaeb/cache"))

# Custom taxonomy and prompt
classifier = LLMClassifier(
    model="openai/gpt-4o",
    taxonomy={"Electrical": {"Cable": ["Ladder", "Perforated"]}},
    prompt_template="You are a specialist classifying MEP items...",
)

# Check cost before running
# (await requires an async context: an async function, or a notebook/REPL with top-level await)
estimate = await classifier.estimate_cost(doc)
print(f"Will classify {estimate.items_to_classify} items for ~${estimate.estimated_cost_usd:.2f}")

# Classify all items
await classifier.enrich(doc)

# Or synchronously, from regular (non-async) code
classifier.enrich_sync(doc)

for item in doc.iter_items():
    if item.classification:
        print(item.oz, item.classification.element_type, item.classification.confidence)
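
In a plain script there is no running event loop, so a bare `await` is a syntax error at module level. A common pattern is to wrap the async calls in a coroutine and start it with `asyncio.run`. The sketch below shows the pattern with `FakeClassifier`, a stand-in written for this example that calls no LLM and only mimics an async `enrich` method.

```python
import asyncio


class FakeClassifier:
    """Stand-in with an async enrich method; it performs no real classification."""

    async def enrich(self, items):
        await asyncio.sleep(0)  # placeholder for network I/O
        return [f"{oz}: classified" for oz in items]


async def main():
    classifier = FakeClassifier()
    return await classifier.enrich(["01.0010", "01.0020"])


# In a plain script, start the event loop explicitly.
results = asyncio.run(main())
print(results)
```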

Structured Extraction — Custom Schemas

After classification, extract typed attributes into your own Pydantic schema:

from pydantic import BaseModel, Field
from typing import Optional
from pygaeb import StructuredExtractor

class DoorSpec(BaseModel):
    door_type: str = Field("", description="single, double, sliding")
    width_mm: Optional[int] = Field(None, description="Width in mm")
    fire_rating: Optional[str] = Field(None, description="T30, T60, T90")
    glazing: bool = Field(False, description="Has glass panels")
    material: str = Field("", description="wood, steel, aluminium")

extractor = StructuredExtractor(model="anthropic/claude-sonnet-4-6")

# Extract from all items classified as "Door"
# (await requires an async context, e.g. inside an async function)
doors = await extractor.extract(doc, schema=DoorSpec, element_type="Door")
for item, spec in doors:
    print(item.oz, spec.door_type, spec.fire_rating, spec.width_mm)

# Filter by trade (broad) or sub_type (narrow)
# PipeSpec is one of the built-in starter schemas; import it or define your own model first
pipes = await extractor.extract(doc, schema=PipeSpec, trade="MEP-Plumbing")
fire_doors = await extractor.extract(doc, schema=DoorSpec, sub_type="Fire Door")

# Or synchronously
doors = extractor.extract_sync(doc, schema=DoorSpec, element_type="Door")

Built-in starter schemas: DoorSpec, WindowSpec, WallSpec, PipeSpec — or define your own.

Post-Parse Hook & Raw Data Collection

Extract vendor-specific XML elements during parsing:

def extract_vendor_codes(item, el):
    if el is None:
        return
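    # The namespace URI must match the parsed file's phase and version
    # (DA83/3.3 for an X83 file in DA XML 3.3)
    ns = {"g": "http://www.gaeb.de/GAEB_DA_XML/DA83/3.3"}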
    codes = el.findall(".//g:VendorCostCode", ns)
    if codes:
        item.raw_data = item.raw_data or {}
        item.raw_data["vendor_codes"] = [c.text for c in codes]

doc = GAEBParser.parse("file.X83", post_parse_hook=extract_vendor_codes)

Or automatically collect all unknown XML elements:

doc = GAEBParser.parse("file.X83", collect_raw_data=True)
for item in doc.iter_items():
    if item.raw_data:
        print(f"{item.oz}: extra fields = {item.raw_data}")

Custom & Vendor Tags (XPath)

doc = GAEBParser.parse("vendor_file.X83", keep_xml=True)

# XPath across the whole document
codes = doc.xpath("//g:VendorCostCode/text()")

# Per-item raw element access
for item in doc.iter_items():
    el = item.source_element  # original lxml element

# Free memory when done
doc.discard_xml()
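
The `g:` prefix in these queries stands for the document's XML namespace; an unprefixed element name will not match elements that live in a default namespace. The sketch below demonstrates this with the standard library's ElementTree on a toy XML string. The XML is invented for illustration and is not a valid GAEB file, and the namespace URI is modelled on the one used in the hook example.

```python
import xml.etree.ElementTree as ET

# Toy document: element layout is made up, only the namespace handling matters.
XML = """<GAEB xmlns="http://www.gaeb.de/GAEB_DA_XML/DA83/3.3">
  <Item><VendorCostCode>KST-100</VendorCostCode></Item>
  <Item><VendorCostCode>KST-200</VendorCostCode></Item>
</GAEB>"""

# Map the prefix used in queries to the namespace URI declared in the file.
ns = {"g": "http://www.gaeb.de/GAEB_DA_XML/DA83/3.3"}

root = ET.fromstring(XML)
codes = [el.text for el in root.findall(".//g:VendorCostCode", ns)]
unqualified = root.findall(".//VendorCostCode")  # no prefix: finds nothing

print(codes)
print(len(unqualified))
```

If a query unexpectedly returns nothing, a mismatch between the prefix mapping and the file's declared namespace is a likely cause.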

Custom Cache Backend

from pygaeb import CacheBackend, InMemoryCache, LLMClassifier, SQLiteCache

# Default: in-memory (LRU-bounded, session-scoped)
classifier = LLMClassifier()

# Persistent: SQLite
classifier = LLMClassifier(cache=SQLiteCache("~/.pygaeb/cache"))

# Bring your own: implement CacheBackend protocol
class RedisCache:
    def get(self, key: str) -> str | None: ...
    def put(self, key: str, value: str) -> None: ...
    def delete(self, key: str) -> None: ...
    def keys(self) -> list[str]: ...
    def clear(self) -> None: ...
    def close(self) -> None: ...

classifier = LLMClassifier(cache=RedisCache())
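
As a concrete illustration of the method set outlined above, here is a small dictionary-backed cache with LRU eviction that can be exercised without any external service. `LRUDictCache` is a hypothetical class for this sketch, not pyGAEB's `InMemoryCache`; the method names follow the `RedisCache` outline, and string keys and values are assumed.

```python
from collections import OrderedDict
from typing import List, Optional


class LRUDictCache:
    """In-process cache that evicts the least recently used entry when full."""

    def __init__(self, max_entries: int = 10_000) -> None:
        self._data: "OrderedDict[str, str]" = OrderedDict()
        self._max = max_entries

    def get(self, key: str) -> Optional[str]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: str, value: str) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # drop the oldest entry

    def delete(self, key: str) -> None:
        self._data.pop(key, None)

    def keys(self) -> List[str]:
        return list(self._data)

    def clear(self) -> None:
        self._data.clear()

    def close(self) -> None:
        pass  # nothing to release for an in-memory store


cache = LRUDictCache(max_entries=2)
cache.put("a", "1")
cache.put("b", "2")
cache.get("a")        # "a" becomes the most recently used entry
cache.put("c", "3")   # exceeds the limit, so "b" is evicted
print(cache.keys())
```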

Cross-Phase Validation

from pygaeb import GAEBParser, CrossPhaseValidator

# Tender → Bid: structural identity check
tender = GAEBParser.parse("tender.X83")
bid = GAEBParser.parse("bid.X84")
issues = CrossPhaseValidator.check(source=tender, response=bid)

# Contract → Invoice: unit prices must match
contract = GAEBParser.parse("contract.X86")
invoice = GAEBParser.parse("invoice.X89")
issues = CrossPhaseValidator.check(source=contract, response=invoice)

# Contract → Addendum: change order traceability
addendum = GAEBParser.parse("nachtrag.X88")
issues = CrossPhaseValidator.check(source=contract, response=addendum)

# Each check returns its own list; this prints the issues from the last call
for issue in issues:
    print(issue.severity, issue.message)
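
The contract-to-invoice rule amounts to looking up each invoiced item in the contract and comparing unit prices. The sketch below shows that comparison on plain dictionaries of made-up prices keyed by OZ. `unit_price_issues` is a hypothetical helper, not the `CrossPhaseValidator` implementation, and its message wording is invented.

```python
from decimal import Decimal


def unit_price_issues(contract, invoice):
    """Return messages for invoiced items whose unit price deviates from the contract."""
    issues = []
    for oz in sorted(invoice):
        if oz not in contract:
            issues.append(f"{oz}: not in contract")
        elif invoice[oz] != contract[oz]:
            issues.append(
                f"{oz}: unit price {invoice[oz]} differs from contract {contract[oz]}"
            )
    return issues


# Made-up unit prices keyed by OZ.
contract_prices = {"01.0010": Decimal("85.00"), "01.0020": Decimal("95.00")}
invoice_prices = {
    "01.0010": Decimal("85.00"),
    "01.0020": Decimal("97.50"),
    "01.0099": Decimal("10.00"),
}

price_issues = unit_price_issues(contract_prices, invoice_prices)
for message in price_issues:
    print(message)
```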

Supported Versions & Exchange Phases

Version	Parser Track	Status
DA XML 2.0	Track A (German elements)	v1.0
DA XML 2.1	Track A (German elements)	v1.0
DA XML 3.0	Track B (English elements)	v1.0
DA XML 3.1	Track B (English elements)	v1.0
DA XML 3.2	Track B (English elements)	v1.0
DA XML 3.3	Track B (English elements)	v1.0
GAEB 90	Track C (fixed-width)	Planned

Phase	Description	Since
X31	Quantity Determination	v1.4.0
X50, X51, X52	Cost & Calculation	v1.3.0
X80–X86	Procurement (tender, bid, award)	v1.0.0
X88	Addendum / Nachtrag (claims & variations)	v1.12.0
X89, X89B	Invoice / extended invoice	v1.0.0
X93, X94, X96, X97	Trade (material ordering)	v1.2.0

Configuration

from pygaeb import configure

configure(
    default_model="ollama/llama3",        # LLM model for classification
    classifier_concurrency=10,            # parallel LLM calls
    xsd_dir="/opt/gaeb-schemas",          # optional XSD validation
    log_level="DEBUG",                    # applied to pygaeb.* loggers
    max_file_size_mb=200,                 # input file size limit
)
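
The `classifier_concurrency` setting caps how many LLM calls run in parallel. How pyGAEB enforces the cap is not described here; one common way to bound concurrency in asyncio code is a semaphore, sketched below. `classify_all` and the fake per-item work are hypothetical and call no LLM.

```python
import asyncio


async def classify_all(items, concurrency=10):
    """Run one fake task per item, with at most `concurrency` in flight at once."""
    semaphore = asyncio.Semaphore(concurrency)
    running = 0
    peak = 0

    async def classify_one(item):
        nonlocal running, peak
        async with semaphore:  # waits while `concurrency` tasks are active
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.001)  # placeholder for a network call
            running -= 1
            return item.upper()

    # gather returns results in the order of the inputs.
    results = await asyncio.gather(*(classify_one(i) for i in items))
    return list(results), peak


results, peak = asyncio.run(
    classify_all([f"item{i}" for i in range(25)], concurrency=10)
)
print(len(results), peak)
```

A higher limit shortens wall-clock time but increases the chance of hitting provider rate limits, so the right value depends on the provider.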

Or via environment variables:

export PYGAEB_DEFAULT_MODEL=ollama/llama3
export PYGAEB_XSD_DIR=/opt/gaeb-schemas
export PYGAEB_LOG_LEVEL=DEBUG
export PYGAEB_MAX_FILE_SIZE_MB=200

Security

pyGAEB has included security hardening since v1.6.0:

XXE prevention — All XML parsing uses hardened parsers with resolve_entities=False and no_network=True
Billion Laughs protection — Entity expansion bombs are blocked
File size guard — Configurable limit (default 100 MB) prevents memory exhaustion
Recursion depth limits — Hierarchy walkers cap at 50 levels to prevent stack overflow
Bounded caching — InMemoryCache uses LRU eviction (default 10,000 entries)
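
Two of these guards are easy to picture in isolation. The sketch below shows a file size check before reading and a depth cap in a recursive walker, using only the standard library. The function names, exception types, and the nested-dict tree shape are assumptions for illustration; the defaults mirror the 100 MB and 50-level figures listed above, but this is not pyGAEB's code.

```python
import os
import tempfile

MAX_FILE_SIZE_MB = 100
MAX_DEPTH = 50


def check_file_size(path, limit_mb=MAX_FILE_SIZE_MB):
    """Raise ValueError if the file is larger than limit_mb megabytes."""
    size = os.path.getsize(path)
    if size > limit_mb * 1024 * 1024:
        raise ValueError(f"{path}: {size} bytes exceeds the {limit_mb} MB limit")
    return size


def count_nodes(node, depth=0, max_depth=MAX_DEPTH):
    """Count nodes in a nested {'children': [...]} dict, refusing to go past max_depth."""
    if depth > max_depth:
        raise RecursionError(f"hierarchy deeper than {max_depth} levels")
    return 1 + sum(
        count_nodes(child, depth + 1, max_depth) for child in node.get("children", [])
    )


# Demo 1: a 2 KB temporary file passes the default limit but fails a ~1 KB limit.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"x" * 2048)
    tmp_path = handle.name
try:
    size_ok = check_file_size(tmp_path)
    try:
        check_file_size(tmp_path, limit_mb=0.001)
        too_big_rejected = False
    except ValueError:
        too_big_rejected = True
finally:
    os.remove(tmp_path)

# Demo 2: a chain nested 60 levels deep is rejected by the depth cap.
deep = {"children": []}
cursor = deep
for _ in range(60):
    child = {"children": []}
    cursor["children"].append(child)
    cursor = child
try:
    count_nodes(deep)
    depth_rejected = False
except RecursionError:
    depth_rejected = True

print(size_ok, too_big_rejected, depth_rejected)
```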

Documentation

Full documentation is available at Read the Docs.

License

MIT — see LICENSE for details.
