Official implementation of the Sonar-TS paper (ICML 2026).
Time-series data is everywhere in industry. Common examples include temperature readings, stock prices, and factory sensor logs. When a non-expert wants to extract specific information from such data, they often hit a serious wall. For instance, they might ask: "In the past month, on which day did the temperature rise sharply between 10 a.m. and 3 p.m. and then drop just as quickly?" Today, no existing method can directly answer this kind of natural-language question over a real time-series database.
Existing approaches fall short in two characteristic ways. Text-to-SQL methods are designed for relational data and cannot describe the shape-based morphology that defines time series (such as plateau, rapid fall, fluctuating stable). Time-series language models can answer questions over short windows, but they cannot scale to the database-scale histories that real applications need.
This leaves a real, unsolved problem: an interface where users describe time-series patterns in natural language and have them answered against the underlying database. We close that gap with three contributions:
- New problem. We formally define the problem as Natural Language Querying for Time Series Databases (NLQ4TSDB).
- New benchmark. We release NLQTSBench, the first benchmark for standardized evaluation of NLQ4TSDB.
- Novel framework. We propose Sonar-TS, a framework that solves NLQ4TSDB through a Search-Then-Verify pipeline.
NLQTSBench is the first standardized benchmark for NLQ4TSDB:
- Size & Diversity: Contains 1,153 tasks spanning 4 difficulty levels and 9 sub-tasks.
- Download: Hosted on HuggingFace at mrtan/NLQTSBench.
Sonar-TS is a three-stage pipeline:
- Offline data processing builds multi-scale feature tables on top of the raw series.
- Online querying runs an LLM through Task Planning, Code Generation, and Execute. Its Experiences (e.g., skills) come from the Prompt Cold Start loop (cold_start/).
- Post-processing renders the verified result as a natural-language answer, with an optional visualization.
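This README does not spell out how the search and verify steps are implemented. As a generic illustration of the search-then-verify idea only (not the authors' code), the sketch below first scans cheap coarse features for candidate spikes and then re-checks each candidate on the raw points. All function names, the block size, and the `min_jump` threshold are hypothetical.

```python
from statistics import fmean

# Illustrative sketch only: names and thresholds are assumptions, not Sonar-TS APIs.

def block_means(values, block):
    """Coarse feature: mean of each consecutive full block of `block` points."""
    return [fmean(values[i:i + block])
            for i in range(0, len(values) - block + 1, block)]

def search_spikes(means, min_jump):
    """Search: blocks whose mean sits at least `min_jump` above both neighbours."""
    return [i for i in range(1, len(means) - 1)
            if means[i] - means[i - 1] >= min_jump
            and means[i] - means[i + 1] >= min_jump]

def verify_spike(values, block, i, min_jump):
    """Verify: re-check candidate block i on the raw points around it."""
    span = values[(i - 1) * block:(i + 2) * block]
    peak = max(span)
    return peak - span[0] >= min_jump and peak - span[-1] >= min_jump

def search_then_verify(values, block=4, min_jump=5.0):
    """Return (start, end) index ranges of spikes that pass both stages."""
    means = block_means(values, block)
    return [((i - 1) * block, (i + 2) * block)
            for i in search_spikes(means, min_jump)
            if verify_spike(values, block, i, min_jump)]

readings = [0, 0, 0, 0, 9, 10, 9, 10, 0, 0, 0, 0, 1, 1, 1, 1]
print(search_then_verify(readings))  # [(0, 12)]
```

The point of the split is cost: the search stage touches only one value per block, so the more expensive raw-data check runs on a handful of candidates rather than the whole history.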
git clone https://github.com/Atlamtiz/Sonar-TS.git
cd Sonar-TS
conda create -n sonarts python=3.11 -y && conda activate sonarts
pip install -r requirements.txt
The raw CSVs (around 1.7 GB) live on HuggingFace. One command pulls them into the expected location:
python scripts/download_dataset.py
This places 1,153 CSVs under nlqtsbench/ts_data/. The benchmark spec (nlqtsbench/tasks.json) is already in this repository.
The framework defaults to DeepSeek (deepseek-v4-flash) with 10 worker threads, one dedicated API key per worker. With 10 keys, a full benchmark run takes about 25 minutes.
Copy the template and paste your keys:
cp -n configs/deepseek_api-key.txt.example configs/deepseek_api-key.txt
$EDITOR configs/deepseek_api-key.txt # paste one key per line
If you have fewer than 10 keys, also lower concurrency.workers in configs/online.yaml to match.
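Before a long run it can help to check how many keys the file actually holds. The snippet below is a minimal sketch, assuming only the one-key-per-line format described above; `count_api_keys` and `suggested_workers` are hypothetical helpers, not part of the repository, and the default of 10 mirrors the worker count stated in this README.

```python
import tempfile
from pathlib import Path

def count_api_keys(path):
    """Count non-empty lines in a one-key-per-line file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return sum(1 for line in lines if line.strip())

def suggested_workers(n_keys, default_workers=10):
    """One worker per key, never more than the default and never fewer than 1."""
    return max(1, min(n_keys, default_workers))

# Demo on a throwaway file; point count_api_keys at
# configs/deepseek_api-key.txt for a real check.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "keys.txt"
    demo.write_text("key-one\nkey-two\n\nkey-three\n", encoding="utf-8")
    n = count_api_keys(demo)
    print(n, suggested_workers(n))  # 3 3
```

If the suggested value is below 10, set concurrency.workers in configs/online.yaml accordingly, or pass --workers N on the command line.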
python -m scripts.load_benchmark # CSV → per-task database
python -m scripts.build_index # SAX feature tables per task
Note: For ease of reproduction, this release ships a lightweight SQLite-based implementation. The framework's data layer is backend-agnostic by design; production TSDBs (InfluxDB, TimescaleDB, etc.) can be supported by swapping the storage adapter.
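For readers unfamiliar with SAX (Symbolic Aggregate approXimation), here is a minimal textbook sketch of how a series becomes a short symbolic word at several resolutions. It is not the code in scripts/build_index.py; the 4-letter alphabet, the segment counts, and all function names are assumptions chosen for illustration.

```python
from bisect import bisect_left
from statistics import fmean, pstdev

# Breakpoints splitting a standard normal into 4 equiprobable regions.
BREAKPOINTS = (-0.6745, 0.0, 0.6745)
ALPHABET = "abcd"

def paa(values, n_segments):
    """Piecewise Aggregate Approximation: mean of each near-equal segment."""
    n = len(values)
    if n < n_segments:
        raise ValueError("need at least one point per segment")
    bounds = [i * n // n_segments for i in range(n_segments + 1)]
    return [fmean(values[lo:hi]) for lo, hi in zip(bounds, bounds[1:])]

def sax_word(values, n_segments):
    """Z-normalise, reduce with PAA, then map each segment mean to a letter."""
    mean, sd = fmean(values), pstdev(values)
    z = [(v - mean) / sd for v in values] if sd > 0 else [0.0] * len(values)
    return "".join(ALPHABET[bisect_left(BREAKPOINTS, m)] for m in paa(z, n_segments))

def multi_scale_sax(values, scales=(2, 4)):
    """One SAX word per resolution: a toy stand-in for a multi-scale table."""
    return {k: sax_word(values, k) for k in scales}

step = [0, 0, 0, 0, 10, 10, 10, 10]
print(multi_scale_sax(step))  # {2: 'ad', 4: 'aadd'}
```

Coarse words such as 'ad' summarise overall shape (low, then high), while finer words preserve more local detail, which is the general motivation for keeping features at multiple scales.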
python main.py
Useful flags (full list via python main.py --help):
| Flag | Effect |
|---|---|
| --limit N | Process only the first N tasks. Use for a quick smoke test before committing to the full ~25 min run. |
| --workers N | Override the worker thread count (default: 10, from configs/online.yaml). Lower this if you have fewer keys. |
| --figures | Also render one PNG per task (output/figures/). Adds ~15-20 min via a multi-process Kaleido pool. |
| --rebuild | Discard output/predict_partial.jsonl and re-run every task. Use after editing prompts, skills, or configs. |
| --out-dir | Write results to a custom directory instead of ./output/. |
Results print to the terminal as a paper-aligned per-category / per-level / overall table, and are written to ./output/:
output/
├── predict.json     submission-format predictions
├── summary.json     per-subtask / per-category / overall scores
└── per_task.json    one row per task with prediction + score
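To slice the per-task results yourself, something like the following may be a starting point. The exact schema of per_task.json is not documented here, so this sketch assumes a JSON array of objects with hypothetical `level` and `score` fields; adjust the field names to whatever the file actually contains.

```python
import json
from collections import defaultdict
from pathlib import Path

def load_rows(path="output/per_task.json"):
    """Load per-task rows (assumed to be a JSON array of objects)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))

def mean_score_by(rows, key, score_field="score"):
    """Average `score_field` over rows grouped by `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(float(row[score_field]))
    return {k: sum(v) / len(v) for k, v in groups.items()}

# In-memory stand-in for load_rows(); field names are assumptions.
sample = [
    {"level": 1, "score": 1.0},
    {"level": 1, "score": 0.0},
    {"level": 2, "score": 0.5},
]
print(mean_score_by(sample, "level"))  # {1: 0.5, 2: 0.5}
```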
Visualizations produced by python main.py --figures. Curated samples live in output/figures/examples/; the corresponding score breakdown is in output/summary.json and output/per_task.json.
Left: Shape Identification. Right: Composite Trend.
Sonar-TS
├── main.py
├── requirements.txt
├── LICENSE
│
├── configs/
│   ├── online.yaml
│   ├── offline.yaml
│   ├── deepseek_api-key.txt.example
│   └── deepseek_api-key.txt (gitignored)
│
├── sonar_ts/
│   ├── pipeline.py
│   ├── planner.py
│   ├── generator.py
│   ├── executor.py
│   ├── evaluator.py
│   ├── llm.py
│   ├── schema.py
│   ├── storage.py
│   ├── offline.py
│   ├── prompts/
│   ├── postprocess/
│   └── skills/
│
├── scripts/
│   ├── download_dataset.py
│   ├── load_benchmark.py
│   ├── build_index.py
│   ├── run_benchmark.py
│   └── render_samples.py
│
├── cold_start/
│   ├── orchestrator.py
│   ├── run_cold_start.py
│   ├── download_train_data.py
│   ├── agents/
│   ├── train_data/
│   └── discovered_skills/
│
├── nlqtsbench/
│   ├── tasks.json
│   ├── predict_perfect.json
│   └── ts_data/
│
├── docs/figures/
│
├── databases/
└── output/
See cold_start/README.md and nlqtsbench/README.md for sub-system details.
@misc{tan2026sonartssearchthenverifynaturallanguage,
title={Sonar-TS: Search-Then-Verify Natural Language Querying for Time Series Databases},
author={Zhao Tan and Yiji Zhao and Shiyu Wang and Chang Xu and Yuxuan Liang and Xiping Liu and Shirui Pan and Ming Jin},
year={2026},
eprint={2602.17001},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2602.17001},
}






