TUM.ai × Spherecast Hackathon 2026 challenge
CPG companies overpay for ingredients because sourcing is fragmented. The same ingredient gets purchased by multiple brands from different suppliers with no shared visibility. Nobody sees the combined demand — so nobody captures the volume discount, and nobody checks whether a cheaper alternative is actually safe to use in a given product.
The challenge has two parts:
- Which ingredients can actually replace each other? Not just same name — functionally equivalent in the context of a specific finished product.
- Is the substitute compliant? Does it preserve allergen claims, dietary certifications, regulatory status, and physical specifications for that product?
We built Agnes — an autonomous AI agent that reasons across fragmented supply chain data to answer sourcing questions. The system ingests BOM and supplier data for 61 CPG companies, enriches every raw material with external knowledge, finds substitutable ingredients via semantic similarity, and scores each candidate with a quantifiable compliance rubric.
An analyst can ask Agnes: "What can replace soy lecithin in product X?" — and get a ranked list of substitutes with a breakdown of why each one scores the way it does, backed by evidence from supplier datasheets and regulatory databases.
| # | Requirement | Status |
|---|---|---|
| R1 | Ingest BOM and supplier data | ✅ Done |
| R2 | Enrich materials with external knowledge | ✅ Done |
| R3 | Identify interchangeable components | ✅ Done |
| R4 | Infer the compliance bar a substitute must meet | ✅ Done |
| R5 | Score substitutes with explainable reasoning | ✅ Done |
| R6 | Preserve evidence trails | ✅ Done |
| R7 | Surface fragmentation across the portfolio | 🔲 Data ready |
| R8 | Generate consolidated sourcing proposals | 🔜 Future work |
| R9 | Conversational reasoning interface | ✅ Done |
R1 — Ingest: BOM and supplier data for all 61 companies is loaded into a structured database on startup.
R2 — Enrich: Every ingredient enriched with functional role, source origin, allergens, dietary flags, certifications, and regulatory status — pulled from supplier websites, scientific databases, and LLM knowledge. Each fact carries its source and confidence level.
R3 — Identify substitutes: Ingredients converted into semantic vectors. Similarity search finds functionally equivalent candidates even when names differ.
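The similarity search in R3 can be illustrated with a minimal sketch. It assumes each ingredient already has an embedding vector. The toy 3-dimensional vectors, the function names, and the use of cosine similarity as the distance measure are all assumptions for illustration, not a description of the actual implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query_name, vectors, k=3):
    """Return the k ingredients most similar to query_name as (name, similarity) pairs."""
    query = vectors[query_name]
    scored = [
        (name, cosine_similarity(query, vec))
        for name, vec in vectors.items()
        if name != query_name  # never suggest the ingredient as its own substitute
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy vectors standing in for real embeddings.
TOY_VECTORS = {
    "soy lecithin": [0.9, 0.1, 0.0],
    "sunflower lecithin": [0.8, 0.2, 0.1],
    "citric acid": [0.0, 0.1, 0.9],
}

candidates = top_k_similar("soy lecithin", TOY_VECTORS, k=2)
```

With these toy vectors, sunflower lecithin ranks ahead of citric acid even though the names only partly overlap, which is the behaviour R3 relies on.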
R4 — Infer compliance bar: Enriched spec of the original ingredient defines what any substitute must preserve — dietary claims, allergen profile, regulatory status, physical form.
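One way to read "the original defines the bar" in R4 is as a set of checks against the original's spec. The sketch below is an assumption about what "preserve" means (no new allergens, no lost dietary claims, same regulatory status and physical form). The `IngredientSpec` fields and the `compliance_violations` helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngredientSpec:
    # Hypothetical field names; the real enriched spec may look different.
    name: str
    allergens: frozenset = frozenset()
    dietary_flags: frozenset = frozenset()
    regulatory_status: str = ""
    physical_form: str = ""


def compliance_violations(original, candidate):
    """List the ways a candidate fails the bar set by the original ingredient."""
    violations = []
    new_allergens = candidate.allergens - original.allergens
    if new_allergens:
        violations.append(f"introduces allergens: {sorted(new_allergens)}")
    lost_flags = original.dietary_flags - candidate.dietary_flags
    if lost_flags:
        violations.append(f"loses dietary claims: {sorted(lost_flags)}")
    if candidate.regulatory_status != original.regulatory_status:
        violations.append("regulatory status differs")
    if candidate.physical_form != original.physical_form:
        violations.append("physical form differs")
    return violations


soy = IngredientSpec("soy lecithin", frozenset({"soy"}), frozenset({"vegan"}), "approved", "liquid")
sunflower = IngredientSpec("sunflower lecithin", frozenset(), frozenset({"vegan"}), "approved", "liquid")
egg = IngredientSpec("egg lecithin", frozenset({"egg"}), frozenset(), "approved", "liquid")
```

Here `compliance_violations(soy, sunflower)` returns an empty list, while the egg-based candidate fails on both allergens and dietary claims.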
R5 — Score with reasoning: GPT-4o scores each candidate across 5 dimensions (functional equivalence, spec compatibility, regulatory fit, dietary compliance, certification match), each 0–20, total 0–100. Every score includes a breakdown and written justification.
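The arithmetic of the rubric in R5 is simple enough to sketch: five dimensions, each scored 0 to 20, summed to a total out of 100. The dimension keys and the validation behaviour below are assumptions. In the real system the per-dimension scores come from the model, which this sketch does not call.

```python
DIMENSIONS = (
    "functional_equivalence",
    "spec_compatibility",
    "regulatory_fit",
    "dietary_compliance",
    "certification_match",
)
MAX_PER_DIMENSION = 20


def total_score(breakdown):
    """Validate a per-dimension breakdown and return the 0-100 total."""
    missing = set(DIMENSIONS) - set(breakdown)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for dim in DIMENSIONS:
        value = breakdown[dim]
        if not 0 <= value <= MAX_PER_DIMENSION:
            raise ValueError(f"{dim} must be in 0..{MAX_PER_DIMENSION}, got {value}")
    return sum(breakdown[dim] for dim in DIMENSIONS)


# Hypothetical breakdown for one candidate.
example = {
    "functional_equivalence": 18,
    "spec_compatibility": 15,
    "regulatory_fit": 20,
    "dietary_compliance": 20,
    "certification_match": 12,
}
score = total_score(example)  # 18 + 15 + 20 + 20 + 12 = 85
```

Validating the breakdown before summing guards against a model response that omits a dimension or returns a value outside the range.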
R6 — Evidence trails: Every property carries provenance: where it came from, the URL, and confidence level.
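A provenance-carrying property, as described in R6, might be modelled roughly as follows. This is only a sketch: the `SourcedFact` name, its fields, and the numeric 0 to 1 confidence scale are assumptions, and the real system may record confidence as a category instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourcedFact:
    # Hypothetical shape of one enriched property with its evidence trail.
    property_name: str
    value: str
    source: str            # e.g. "supplier_website", "llm_inference"
    url: Optional[str]     # None when the fact has no web source
    confidence: float      # assumed scale: 0.0 (guess) to 1.0 (authoritative)


def best_fact(facts):
    """Pick the highest-confidence fact among competing values for one property."""
    return max(facts, key=lambda fact: fact.confidence)


facts = [
    SourcedFact("allergens", "soy", "llm_inference", None, 0.6),
    SourcedFact("allergens", "soy", "supplier_website", "https://example.com/datasheet", 0.9),
]
chosen = best_fact(facts)
```

Keeping the source and URL next to each value means a score justification can cite the exact evidence it used.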
R7 — Fragmentation: Data layer ready. System can answer "which companies buy the same ingredient from different suppliers?" in a single query. Not yet surfaced as a dedicated UI feature.
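The "single query" mentioned in R7 could look something like the sketch below, written against an in-memory SQLite table. The `purchases` table, its columns, and the sample rows are hypothetical. The actual schema is not shown in this document.

```python
import sqlite3

# Ingredients bought by more than one company from more than one supplier.
FRAGMENTATION_SQL = """
SELECT ingredient,
       COUNT(DISTINCT company)  AS companies,
       COUNT(DISTINCT supplier) AS suppliers
FROM purchases
GROUP BY ingredient
HAVING COUNT(DISTINCT company) > 1
   AND COUNT(DISTINCT supplier) > 1
ORDER BY ingredient
"""


def find_fragmented_ingredients(rows):
    """Load (company, ingredient, supplier) rows and return the fragmented ingredients."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE purchases (company TEXT, ingredient TEXT, supplier TEXT)")
        conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
        return conn.execute(FRAGMENTATION_SQL).fetchall()
    finally:
        conn.close()


# Hypothetical sample data.
SAMPLE_ROWS = [
    ("BrandA", "soy lecithin", "SupplierX"),
    ("BrandB", "soy lecithin", "SupplierY"),
    ("BrandA", "citric acid", "SupplierZ"),
    ("BrandB", "citric acid", "SupplierZ"),
]

fragmented = find_fragmented_ingredients(SAMPLE_ROWS)
```

In the sample data only soy lecithin is reported: citric acid is bought by two brands, but from the same supplier, so its demand is already consolidated.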
R8 — Proposals: Future work. Database schema, API contract, and data models fully defined. The agent that generates and writes proposals is the remaining piece.
R9 — Conversational interface: Agnes is an autonomous chat agent. It decides which tools to call based on the analyst's question — no scripted flows. It surfaces its reasoning trace so the analyst sees which data sources were consulted.
We built an evaluation framework (benchmark runner + dataset) to measure how well the compliance engine finds the right substitutes.
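Results: ~50% Precision@3. For a given ingredient and product, the top 3 returned substitutes contain a correct answer roughly half the time. Measured this way, the figure is a hit rate at 3: a query counts as a success if at least one of its top 3 results is correct.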
Why not higher:
- AI non-determinism. GPT-4o produces slightly different rankings even at temperature=0. Borderline cases flip between runs.
- Dataset bias. Ground-truth dataset was generated using Claude Sonnet 4.7 and Opus with high reasoning effort — not manually verified by domain experts. The "correct" answers reflect what a powerful LLM thinks is substitutable, which may not match real-world procurement expertise. We did not have time to validate against human judgment.
- Thin specs. Several enrichment sources are implemented but disabled due to API reliability constraints. Many ingredient specs are filled by LLM inference rather than authoritative databases, weakening the similarity search signal.
50% precision on an open-ended substitution task with no organizer-provided ground truth is a reasonable starting point. We expect this number to improve with a verified benchmark and richer enrichment, though we have not yet measured either.
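The metric reported in the results (a query succeeds when its top 3 substitutes contain a correct answer) can be sketched as below, next to the stricter textbook precision at k for comparison. The function names and the toy data are assumptions. This is not the project's benchmark runner.

```python
def hit_rate_at_k(predictions, ground_truth, k=3):
    """Fraction of queries whose top-k results contain at least one correct answer."""
    if not predictions:
        return 0.0
    hits = sum(
        1
        for query, ranked in predictions.items()
        if set(ranked[:k]) & ground_truth.get(query, set())
    )
    return hits / len(predictions)


def precision_at_k(predictions, ground_truth, k=3):
    """Mean fraction of the top-k results that are correct (textbook definition)."""
    if not predictions:
        return 0.0
    per_query = [
        len(set(ranked[:k]) & ground_truth.get(query, set())) / k
        for query, ranked in predictions.items()
    ]
    return sum(per_query) / len(per_query)


# Hypothetical toy benchmark: two queries, one of which is answered correctly.
PREDICTIONS = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
GROUND_TRUTH = {"q1": {"a"}, "q2": {"w"}}

hit_rate = hit_rate_at_k(PREDICTIONS, GROUND_TRUTH)    # 1 of 2 queries hit -> 0.5
precision = precision_at_k(PREDICTIONS, GROUND_TRUTH)  # (1/3 + 0) / 2
```

The two numbers diverge whenever a query has fewer than k correct answers in its top k, so it is worth stating which definition a reported figure uses.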
API spend over the hackathon:
- Anthropic (Claude Haiku — enrichment): ~$1.14 across 334 calls
- OpenAI (GPT-4o compliance scoring + embeddings + Agnes chat): ~$X
Primary model: GPT-4o for compliance scoring and Agnes chat reasoning.











