DaSECandyLab/obelisk-offlineqo

OBELISK: Efficient Offline Query Planning with Bayesian Optimization-Informed Language Model Reasoning

Z. Pan, W. Sun, Y. Zhang, T. Purcell, Y. Dong, C. Yang, R. Zhang, X. Zhou, J. Xu. PVLDB 2026.

Python 3.12+

This repository contains the implementation of OBELISK, a system for offline query plan optimization in TiDB. OBELISK integrates Bayesian Optimization with language model reasoning to efficiently navigate the high-dimensional space of cost-factor configurations, reducing optimization overhead while discovering high-performance plans.

What This Repository Does

OBELISK searches TiDB optimizer cost-factor configurations for each SQL query and records the best-performing plan/runtime profile.

Core flow:

  1. Run baseline execution for a query.
  2. Warm start with Latin Hypercube samples.
  3. Run Bayesian Optimization (BO; strategy tcbo or vanilla_gp) with LLM-guided proposals.
  4. Persist per-query results and summary statistics.
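The warm-start step (step 2) can be illustrated with a minimal, standard-library sketch of Latin Hypercube sampling. This is not the repository's implementation: the function name `latin_hypercube`, the three-dimensional search space, and the bounds `(0.1, 10.0)` are hypothetical placeholders chosen only to show the technique.

```python
import random


def latin_hypercube(n_samples, bounds, seed=None):
    """Draw n_samples points so that, in every dimension, each of the
    n_samples equal-width strata contains exactly one point."""
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One stratum index per sample, visited in random order.
        strata = list(range(n_samples))
        rng.shuffle(strata)
        width = (high - low) / n_samples
        # Place each point uniformly at random inside its stratum.
        columns.append([low + (s + rng.random()) * width for s in strata])
    # Transpose per-dimension columns into per-sample points.
    return [list(point) for point in zip(*columns)]


# Hypothetical search space: three cost factors, each in [0.1, 10.0].
bounds = [(0.1, 10.0), (0.1, 10.0), (0.1, 10.0)]
warm_start = latin_hypercube(10, bounds, seed=0)
```

Compared with independent uniform draws, this stratification guarantees that every slice of each factor's range is covered once, which is why it is a common choice for seeding a surrogate model with few evaluations.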

Quick Start

Installation

uv venv .venv
source .venv/bin/activate
uv sync

Create env file:

cp .env.copy .env

Edit .env with your TiDB connection settings and OpenAI API key:

OPENAI_API_KEY=...
TIDB_HOST=...
TIDB_PORT=4000
TIDB_USER=root
TIDB_PASSWORD=...
TIDB_DB_NAME=...
CA_PATH=...  # Optional
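As an illustration of how these variables might be consumed, the sketch below validates a mapping of settings and converts it into connection parameters. The function `load_tidb_settings`, the output key names, and the choice of which variables are required are assumptions made for this example; the repository's own loader may differ. The sketch reads from `os.environ` by default, so the .env file must already have been loaded into the environment by whatever mechanism you use.

```python
import os

# Assumption: these three are treated as mandatory; the rest fall back to defaults.
REQUIRED = ("TIDB_HOST", "TIDB_USER", "TIDB_DB_NAME")


def load_tidb_settings(env=None):
    """Build a connection-settings dict from environment-style variables."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED if not env.get(key)]
    if missing:
        raise KeyError(f"missing required settings: {', '.join(missing)}")
    return {
        "host": env["TIDB_HOST"],
        "port": int(env.get("TIDB_PORT", "4000")),  # 4000 matches the template above
        "user": env["TIDB_USER"],
        "password": env.get("TIDB_PASSWORD", ""),
        "database": env["TIDB_DB_NAME"],
        "ca_path": env.get("CA_PATH") or None,  # optional TLS CA bundle
    }
```

Failing early on missing variables gives a clearer error than a connection timeout later in the run.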

TiDB Setup

1. Install TiUP

curl --proto '=https' --tlsv1.2 -sSf https://tiup-mirrors.pingcap.com/install.sh | sh
source ~/.bashrc
which tiup

2. Deploy Local TiDB Cluster

Requires TiDB v8.5 or later. Start the playground with the --tag obelisk flag and pass a configuration file that disables the Coprocessor Cache (the tikv-client.copr-cache section); otherwise cached coprocessor results distort execution-plan runtime measurements:

tiup playground v8.5.3 --db 1 --pd 1 --kv 1 --tiflash 0 --tag obelisk --db.config /path/to/tidb.toml

Add the following settings to /path/to/tidb.toml:

[tikv-client.copr-cache]
capacity-mb = 0.0

3. Test Connection

mysql --comments --host 127.0.0.1 --port 4000 -u root

4. Configure Environment

Fill in the TiDB connection fields in .env (TIDB_HOST, TIDB_PORT, TIDB_USER, TIDB_PASSWORD, TIDB_DB_NAME, CA_PATH).

Run Optimization

uv run src/run.py \
  --sql-dir sql/job \
  --results-dir results/job-run \
  --trials 15 \
  --warm_times 10 \
  --strategy tcbo

Usage

Command Line Options

Option         Description
--sql-dir      Directory with SQL files
--results-dir  Output directory (results/<sql-dir-name>)
--trials       Total optimization iterations
--warm_times   Warm-start iterations
--strategy     BO strategy (tcbo or vanilla_gp)
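For readers who want to see how such options fit together, here is a hedged `argparse` sketch that mirrors the table. It is not the parser in src/run.py. The defaults for --trials and --warm_times are simply copied from the example command above, and reading results/<sql-dir-name> as the default output directory is an assumption.

```python
import argparse
from pathlib import Path


def build_parser():
    parser = argparse.ArgumentParser(description="Illustrative option parser")
    parser.add_argument("--sql-dir", type=Path, required=True,
                        help="Directory with SQL files")
    parser.add_argument("--results-dir", type=Path, default=None,
                        help="Output directory (assumed default: results/<sql-dir-name>)")
    parser.add_argument("--trials", type=int, default=15,
                        help="Total optimization iterations")
    parser.add_argument("--warm_times", type=int, default=10,
                        help="Warm-start iterations")
    parser.add_argument("--strategy", choices=("tcbo", "vanilla_gp"), default="tcbo",
                        help="BO strategy")
    return parser


def parse_args(argv=None):
    args = build_parser().parse_args(argv)
    if args.results_dir is None:
        # Assumed fallback: derive the output folder from the SQL directory name.
        args.results_dir = Path("results") / args.sql_dir.name
    return args
```

With this sketch, `parse_args(["--sql-dir", "sql/job"])` yields a results directory of results/job, and an unknown strategy name is rejected by `argparse` before any query runs.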

Code Structure

obelisk-oqo/
├── src/
│   ├── run.py
│   ├── db/                # TiDB connection and SQL execution
│   ├── llm/               # LLM prompting and config generation
│   ├── optimization/      # BO strategies and optimization pipeline
│   ├── test/              # Script-style validation tools
│   └── util/              # Shared config/constants/log helpers
├── sql/                   # Workload SQL files
├── results/               # Output artifacts (ignored in git)
├── logs/                  # Runtime logs (ignored in git)
└── pyproject.toml

Citation

If you use OBELISK in your research, please cite:

@article{pan2026obelisk,
  title={OBELISK: Efficient Offline Query Planning with Bayesian Optimization-Informed Language Model Reasoning},
  author={Pan, Z. and Sun, W. and Zhang, Y. and Purcell, T. and Dong, Y. and Yang, C. and Zhang, R. and Zhou, X. and Xu, J.},
  journal={Proceedings of the VLDB Endowment},
  volume={19},
  number={12},
  year={2026}
}

License

Apache 2.0
