OBELISK: Efficient Offline Query Planning with Bayesian Optimization-Informed Language Model Reasoning
Z. Pan, W. Sun, Y. Zhang, T. Purcell, Y. Dong, C. Yang, R. Zhang, X. Zhou, J. Xu. PVLDB 2026
This repository contains the implementation of OBELISK, a system for offline query plan optimization in TiDB. OBELISK integrates Bayesian Optimization with language model reasoning to efficiently navigate the high-dimensional space of cost-factor configurations, reducing optimization overhead while discovering high-performance plans.
OBELISK searches TiDB optimizer cost-factor configurations for each SQL query and records the best-performing plan/runtime profile.
Core flow:
- Run baseline execution for a query.
- Warm start with Latin Hypercube samples.
- Run BO (tcbo or vanilla_gp) with LLM-guided proposals.
- Persist per-query results and summary statistics.
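The warm-start step in the flow above can be illustrated with a small Latin Hypercube sampler. This is a minimal sketch using only the standard library, not the sampler OBELISK ships: the factor names and bounds in `BOUNDS` are hypothetical placeholders, and `latin_hypercube` is an illustrative helper rather than a function from this repository.

```python
import random


def latin_hypercube(bounds, n_samples, seed=0):
    """Draw n_samples points so that each dimension is split into n_samples
    equal-width strata and every stratum is hit exactly once."""
    rng = random.Random(seed)
    columns = {}
    for name, (low, high) in bounds.items():
        width = (high - low) / n_samples
        strata = list(range(n_samples))
        rng.shuffle(strata)  # independent permutation per dimension
        # One uniform draw inside each stratum, in shuffled order.
        columns[name] = [low + (s + rng.random()) * width for s in strata]
    return [{name: columns[name][i] for name in bounds} for i in range(n_samples)]


# Hypothetical cost-factor search space (names and ranges are assumptions).
BOUNDS = {
    "factor_a": (0.1, 10.0),
    "factor_b": (0.1, 10.0),
    "factor_c": (1.0, 100.0),
}

samples = latin_hypercube(BOUNDS, n_samples=10)
for config in samples[:3]:
    print({k: round(v, 3) for k, v in config.items()})
```

Compared with plain uniform sampling, stratifying each dimension spreads a small warm-start budget more evenly over the range of every factor, which is the usual reason to seed a BO run this way.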
uv venv .venv
source .venv/bin/activate
uv sync

Create env file:
cp .env.copy .env

Edit .env with your TiDB connection and OpenAI API key:
OPENAI_API_KEY=...
TIDB_HOST=...
TIDB_PORT=4000
TIDB_USER=root
TIDB_PASSWORD=...
TIDB_DB_NAME=...
CA_PATH=... # Optional

Install TiUP:
curl --proto '=https' --tlsv1.2 -sSf https://tiup-mirrors.pingcap.com/install.sh | sh
source ~/.bashrc
which tiup

Requires TiDB v8.5 or later, started with the --tag obelisk flag. A configuration file must also be supplied to disable the Coprocessor Cache through the tikv-client.copr-cache setting; otherwise cached coprocessor results would distort execution plan performance measurements:
tiup playground v8.5.3 --db 1 --pd 1 --kv 1 --tiflash 0 --tag obelisk --db.config /path/to/tidb.toml

Add the following settings to /path/to/tidb.toml:
[tikv-client.copr-cache]
capacity-mb = 0.0

mysql --comments --host 127.0.0.1 --port 4000 -u root

Fill TiDB connection fields in .env (TIDB_HOST, TIDB_PORT, TIDB_USER, TIDB_PASSWORD, TIDB_DB_NAME, CA_PATH).
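Before starting a run, it can help to check that the connection fields in .env are actually filled in. The following is a minimal sketch, assuming a simple KEY=VALUE file with optional trailing `#` comments; `parse_env` and `missing_fields` are hypothetical helpers, not part of this repository, and treating TIDB_PASSWORD and CA_PATH as allowed to be empty is an assumption based on the `mysql` command above connecting as root without a password and on CA_PATH being marked optional.

```python
# Fields assumed to need a non-empty value. TIDB_PASSWORD and CA_PATH are
# deliberately left out (assumption: both may legitimately be empty).
REQUIRED = ("TIDB_HOST", "TIDB_PORT", "TIDB_USER", "TIDB_DB_NAME")


def parse_env(text):
    """Parse KEY=VALUE lines; text after '#' is treated as a comment.
    Note: this simple rule would truncate values that contain '#'."""
    values = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


def missing_fields(values):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED if not values.get(key)]


sample = """\
TIDB_HOST=127.0.0.1
TIDB_PORT=4000
TIDB_USER=root
TIDB_PASSWORD=
TIDB_DB_NAME=
CA_PATH= # Optional
"""

values = parse_env(sample)
print(missing_fields(values))
```

For a real file, the same check can be run with `parse_env(open(".env").read())`.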
uv run src/run.py \
--sql-dir sql/job \
--results-dir results/job-run \
--trials 15 \
--warm_times 10 \
--strategy tcbo

| Option | Description |
|---|---|
| --sql-dir | Directory with SQL files |
| --results-dir | Output directory (results/<sql-dir-name>) |
| --trials | Total optimization iterations |
| --warm_times | Warm-start iterations |
| --strategy | BO strategy (tcbo or vanilla_gp) |
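For readers scripting around the CLI, the options in the table can be mirrored with `argparse`. This is a sketch of an equivalent parser, not the actual parser in src/run.py: which options are required, the absence of defaults, and the help strings are assumptions made here for illustration.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (assumed shapes)."""
    parser = argparse.ArgumentParser(description="Sketch of the OBELISK run options")
    parser.add_argument("--sql-dir", required=True, help="Directory with SQL files")
    parser.add_argument("--results-dir", default=None, help="Output directory")
    parser.add_argument("--trials", type=int, required=True,
                        help="Total optimization iterations")
    # The underscore spelling follows the documented flag name.
    parser.add_argument("--warm_times", type=int, required=True,
                        help="Warm-start iterations")
    parser.add_argument("--strategy", choices=("tcbo", "vanilla_gp"), required=True,
                        help="BO strategy")
    return parser


# Parse the same arguments as the example command above.
args = build_parser().parse_args([
    "--sql-dir", "sql/job",
    "--results-dir", "results/job-run",
    "--trials", "15",
    "--warm_times", "10",
    "--strategy", "tcbo",
])
print(args.sql_dir, args.results_dir, args.trials, args.warm_times, args.strategy)
```

Restricting `--strategy` with `choices` makes a typo such as `vanilla-gp` fail at parse time instead of partway through a run.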
obelisk-oqo/
├── src/
│ ├── run.py
│ ├── db/ # TiDB connection and SQL execution
│ ├── llm/ # LLM prompting and config generation
│ ├── optimization/ # BO strategies and optimization pipeline
│ ├── test/ # Script-style validation tools
│ └── util/ # Shared config/constants/log helpers
├── sql/ # Workload SQL files
├── results/ # Output artifacts (ignored in git)
├── logs/ # Runtime logs (ignored in git)
└── pyproject.toml
If you use OBELISK in your research, please cite:
@article{pan2026obelisk,
title={OBELISK: Efficient Offline Query Planning with Bayesian Optimization-Informed Language Model Reasoning},
author={Pan, Z. and Sun, W. and Zhang, Y. and Purcell, T. and Dong, Y. and Yang, C. and Zhang, R. and Zhou, X. and Xu, J.},
journal={Proceedings of the VLDB Endowment},
volume={19},
number={12},
year={2026}
}

License: Apache 2.0