Fix SQL timeout: use wall-clock time, not accumulated sleep counter by MikhailKuklin · Pull Request #557 · databricks-solutions/ai-dev-kit

MikhailKuklin · 2026-06-04T13:58:27Z

Summary

SQLExecutor.execute() measures its timeout by incrementing an elapsed counter by poll_interval (2s) each iteration. The counter only counts sleep time and ignores how long each get_statement RPC takes. Under warehouse load, get_statement can take several seconds per call, so the counter undercounts real wall clock and the configured timeout fires much later than intended — or, for sufficiently slow RPCs, never fires at all before the statement completes naturally.

This PR switches the timeout boundary to time.monotonic(), so the timeout fires at real wall clock regardless of RPC latency.

Reproduction

A heavy cross-join query with timeout=900 was expected to raise SQLExecutionError at ~900s wall clock. Actual behavior on a warehouse running another query in parallel:

Query ran 1999 seconds wall clock and completed naturally.
The elapsed counter only reached ~666 over the full 1999s, because each iteration was averaging ~6s wall (2s sleep + ~4s get_statement RPC under load) but only incrementing the counter by 2.

Fix

         poll_interval = 2
-        elapsed = 0
+        start_time = time.monotonic()

-        while elapsed < timeout:
+        while time.monotonic() - start_time < timeout:
             try:
                 status = self.client.statement_execution.get_statement(...)
             ...
             time.sleep(poll_interval)
-            elapsed += poll_interval

         # Timeout reached - cancel the statement
         self._cancel_statement(statement_id)
+        elapsed_wall = time.monotonic() - start_time
         raise SQLExecutionError(
-            f"SQL query timed out after {timeout} seconds and was canceled. "
+            f"SQL query timed out after {elapsed_wall:.1f} seconds (limit: {timeout}s) and was canceled. "
             ...
         )

The error message now reports actual elapsed wall clock alongside the configured limit, so callers can see at a glance whether the timeout fired close to the boundary or whether something delayed cancellation.

Validation

Validated against the patched code on a live Databricks SQL warehouse:

Test	Setup	Result
Short timeout under load	`timeout=30` on a query naturally taking minutes	Fired at 35.5s wall clock (one `poll_interval` overshoot), statement state `CANCELED` ✓
Full-scale	`timeout=900` on the same query that previously ran 1999s without the fix	Fired at 903s wall clock, statement state `CANCELED` ✓

Full-scale post-fix error message: "SQL query timed out after 902.3 seconds (limit: 900s) and was canceled. Statement ID: 01f16014-883e-1d8b-8694-de041fd3db03".

Scope

Single-file change. Strict tightening of timeout behavior; no caller observably depended on timeouts being lenient. parallel_executor.py uses the same SQLExecutor and inherits the fix transparently.

Note

We discovered this downstream and shipped the fix internally on a fork. Filing here so the broader ai-dev-kit user base benefits — the latent bug only manifests under warehouse load, which is why it has gone unnoticed for ~5 months despite the file being touched several times for unrelated changes.

/cc @QuentinAmbard since you authored this file and are the most active maintainer — happy to address any review comments.

The polling loop in SQLExecutor.execute() incremented an `elapsed` counter by `poll_interval` each iteration, which only tracks sleep time and ignores how long each `get_statement` RPC takes. Under warehouse load, get_statement can take several seconds per call, so each iteration is much longer than poll_interval. The counter undercounts real wall clock, and the configured timeout fires later than intended — or, for sufficiently slow RPCs, the query completes naturally before the timeout ever fires. Empirically reproduced (2026-06-04): a query with `timeout=900` on a loaded warehouse ran 1999 seconds wall clock before completing naturally, never hitting the timeout — the counter only reached ~666. Fix: use `time.monotonic()` for the timeout boundary. The timeout now fires at real wall clock regardless of RPC latency. Validated against the fix: same query, same timeout, fired at 903s wall clock with the statement reaching `CANCELED`. Error message also now reports actual elapsed time, not the configured limit. The error message changes from: 'SQL query timed out after {timeout} seconds...' to: 'SQL query timed out after {elapsed_wall:.1f} seconds (limit: {timeout}s)...' so users can tell at a glance whether the timeout fired close to the configured boundary or whether something delayed cancellation.

MikhailKuklin added 2 commits June 4, 2026 16:56

Fix trailing whitespace in previous commit

f870874

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix SQL timeout: use wall-clock time, not accumulated sleep counter#557

Fix SQL timeout: use wall-clock time, not accumulated sleep counter#557
MikhailKuklin wants to merge 2 commits into
databricks-solutions:mainfrom
MikhailKuklin:fix/sql-timeout-wall-clock

MikhailKuklin commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

MikhailKuklin commented Jun 4, 2026

Summary

Reproduction

Fix

Validation

Scope

Note

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant