Skip to content

Fix SQL timeout: use wall-clock time, not accumulated sleep counter#557

Open
MikhailKuklin wants to merge 2 commits into
databricks-solutions:mainfrom
MikhailKuklin:fix/sql-timeout-wall-clock
Open

Fix SQL timeout: use wall-clock time, not accumulated sleep counter#557
MikhailKuklin wants to merge 2 commits into
databricks-solutions:mainfrom
MikhailKuklin:fix/sql-timeout-wall-clock

Conversation

@MikhailKuklin
Copy link
Copy Markdown

Summary

SQLExecutor.execute() measures its timeout by incrementing an elapsed counter by poll_interval (2s) each iteration. The counter only counts sleep time and ignores how long each get_statement RPC takes. Under warehouse load, get_statement can take several seconds per call, so the counter undercounts real wall clock and the configured timeout fires much later than intended — or, for sufficiently slow RPCs, never fires at all before the statement completes naturally.

This PR switches the timeout boundary to time.monotonic(), so the timeout fires at real wall clock regardless of RPC latency.

Reproduction

A heavy cross-join query with timeout=900 was expected to raise SQLExecutionError at ~900s wall clock. Actual behavior on a warehouse running another query in parallel:

  • Query ran 1999 seconds wall clock and completed naturally.
  • The elapsed counter only reached ~666 over the full 1999s, because each iteration was averaging ~6s wall (2s sleep + ~4s get_statement RPC under load) but only incrementing the counter by 2.

Fix

         poll_interval = 2
-        elapsed = 0
+        start_time = time.monotonic()

-        while elapsed < timeout:
+        while time.monotonic() - start_time < timeout:
             try:
                 status = self.client.statement_execution.get_statement(...)
             ...
             time.sleep(poll_interval)
-            elapsed += poll_interval

         # Timeout reached - cancel the statement
         self._cancel_statement(statement_id)
+        elapsed_wall = time.monotonic() - start_time
         raise SQLExecutionError(
-            f"SQL query timed out after {timeout} seconds and was canceled. "
+            f"SQL query timed out after {elapsed_wall:.1f} seconds (limit: {timeout}s) and was canceled. "
             ...
         )

The error message now reports actual elapsed wall clock alongside the configured limit, so callers can see at a glance whether the timeout fired close to the boundary or whether something delayed cancellation.

Validation

Validated against the patched code on a live Databricks SQL warehouse:

Test Setup Result
Short timeout under load timeout=30 on a query naturally taking minutes Fired at 35.5s wall clock (one poll_interval overshoot), statement state CANCELED
Full-scale timeout=900 on the same query that previously ran 1999s without the fix Fired at 903s wall clock, statement state CANCELED

Full-scale post-fix error message: "SQL query timed out after 902.3 seconds (limit: 900s) and was canceled. Statement ID: 01f16014-883e-1d8b-8694-de041fd3db03".

Scope

Single-file change. Strict tightening of timeout behavior; no caller observably depended on timeouts being lenient. parallel_executor.py uses the same SQLExecutor and inherits the fix transparently.

Note

We discovered this downstream and shipped the fix internally on a fork. Filing here so the broader ai-dev-kit user base benefits — the latent bug only manifests under warehouse load, which is why it has gone unnoticed for ~5 months despite the file being touched several times for unrelated changes.

/cc @QuentinAmbard since you authored this file and are the most active maintainer — happy to address any review comments.

The polling loop in SQLExecutor.execute() incremented an `elapsed`
counter by `poll_interval` each iteration, which only tracks sleep
time and ignores how long each `get_statement` RPC takes. Under
warehouse load, get_statement can take several seconds per call, so
each iteration is much longer than poll_interval. The counter
undercounts real wall clock, and the configured timeout fires later
than intended — or, for sufficiently slow RPCs, the query completes
naturally before the timeout ever fires.

Empirically reproduced (2026-06-04): a query with `timeout=900` on
a loaded warehouse ran 1999 seconds wall clock before completing
naturally, never hitting the timeout — the counter only reached ~666.

Fix: use `time.monotonic()` for the timeout boundary. The timeout
now fires at real wall clock regardless of RPC latency.

Validated against the fix: same query, same timeout, fired at 903s
wall clock with the statement reaching `CANCELED`. Error message
also now reports actual elapsed time, not the configured limit.

The error message changes from:
    'SQL query timed out after {timeout} seconds...'
to:
    'SQL query timed out after {elapsed_wall:.1f} seconds (limit:
    {timeout}s)...'
so users can tell at a glance whether the timeout fired close to the
configured boundary or whether something delayed cancellation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant