Researchers often begin open-web data analysis with a vague analytical question, but the web rarely provides the exact table needed to answer it. SPECTABULAR studies this first step: specifying the target table from a public data portal before any cell values are extracted.
Given a natural-language query and the base URL of an open-web data portal, the task is to infer:
- the primary-key column,
- the primary-key values that define the table rows,
- and the attribute list that defines the non-key columns.
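
As a rough illustration of these three outputs, the sketch below bundles them into one small record and shows the empty table they imply. It is not taken from this repo's code: the `TableSpec` class, its field names, and the example values are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TableSpec:
    """Hypothetical container for a target-table specification (not the repo's API)."""

    primary_key: str              # name of the primary-key column
    key_values: tuple[str, ...]   # primary-key values, one per table row
    attributes: tuple[str, ...]   # names of the non-key columns

    def header(self) -> list[str]:
        """Column names of the target table, key column first."""
        return [self.primary_key, *self.attributes]

    def empty_rows(self) -> list[dict[str, str | None]]:
        """One row per key value, with every non-key cell left unfilled."""
        rows: list[dict[str, str | None]] = []
        for value in self.key_values:
            row: dict[str, str | None] = {self.primary_key: value}
            for name in self.attributes:
                row[name] = None  # cell values are extracted later, not here
            rows.append(row)
        return rows


# Illustrative values only; they do not come from the benchmark.
spec = TableSpec(
    primary_key="country",
    key_values=("Canada", "France", "Japan"),
    attributes=("population", "gdp_usd"),
)
print(spec.header())
print(len(spec.empty_rows()))
```

The record stops at the table's shape (which rows and columns exist), which mirrors the task's scope: the specification is produced before any cell values are extracted.
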

This repo contains:

| Folder | Description |
|---|---|
| `mario/` | TableMario: the three-stage AI agent (PK identification → PK value search → attribute generation). |
| `spectabench/` | SpecTaBench: 100-query benchmark, curation pipeline, and end-to-end evaluation. |
| `baselines/` | End-to-end baselines (AutoGen, AG2, AutoGPT, CrewAI, Sodium-Agent, GPT-WebSearch). All share `run_spectabench.py` as their entry point. |

See each subfolder's `README.md` for setup and usage details.
