Retry transient failures in Xylem Data Lake adapter#221

Merged: dmarulli merged 1 commit into main from dmarulli/xylem-datalake-resilience on May 7, 2026
dmarulli commented May 7, 2026

Summary

Adds retry-with-backoff at the two points in the Xylem Data Lake adapter that have killed standard DAG runs this past week, plus the same retry path for transient network errors on the query POST. The same pattern (10s/20s/40s backoff, max 3 attempts) applies at all three points.
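
For reference, the 10s/20s/40s schedule is a doubling backoff from a 10-second base. A minimal sketch of how such a schedule could be generated follows; the function name `backoff_delays` and its parameters are illustrative and are not taken from the adapter.

```python
def backoff_delays(base_seconds=10, retries=3):
    """Return a doubling delay schedule: base, 2 * base, 4 * base, ..."""
    # Hypothetical helper; the adapter may simply hard-code these values.
    return [base_seconds * 2**i for i in range(retries)]


print(backoff_delays())  # [10, 20, 40]
```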

Failures observed (Crescent, May 3–7)

| Date | Failure | Where it died |
| --- | --- | --- |
| 5/03, 5/07 | 500 INTERNAL SERVER ERROR from /api/v1/sqllab/execute/ mid-run | _query at raise_for_status() |
| 5/05, 5/06 | 401 from /api/v1/me/ after successful Keycloak handshake | _create_superset_session at the unexpected-shape check |

In both 5xx cases the in-query session-expired path successfully re-authenticated just before the run died — so the existing re-auth mechanism is fine. The gap is that 5xx and initial-auth failures had no retry.

Changes

  1. 5xx retry in _query — new server_failures counter, triggered when 500 ≤ status < 600, backs off and retries before falling through to raise_for_status().
  2. ConnectionError / Timeout retry in _query — extends the existing try/except (which only caught TooManyRedirects) to route transient network failures through the same backoff path.
  3. _create_superset_session_with_retry wrapper — used for both the initial auth call in _extract and for re-auths from _reauth. Catches RuntimeError (e.g., the unexpected-shape 401) and RequestException, retries with backoff.
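
As a rough illustration of changes 1 and 2, the retry loop might look something like the sketch below. This is not the adapter's code. `query_with_retry`, `FakeResponse`, and the injectable `post` and `sleep` parameters are hypothetical. The built-in `ConnectionError` and `TimeoutError` stand in for the `requests` exceptions so the sketch runs without third-party packages. It also assumes each delay precedes one retry, which the PR text does not spell out.

```python
import time

BACKOFF_DELAYS = (10, 20, 40)  # seconds, per the PR summary


class FakeResponse:
    """Tiny stand-in for an HTTP response (hypothetical, for illustration)."""

    def __init__(self, status_code):
        self.status_code = status_code

    def raise_for_status(self):
        if self.status_code >= 400:
            raise RuntimeError(f"HTTP {self.status_code}")


def query_with_retry(post, delays=BACKOFF_DELAYS, sleep=time.sleep):
    """Call post() and retry on 5xx responses or transient network errors.

    Assumption: each entry in `delays` precedes one retry, so the request
    is sent at most len(delays) + 1 times.
    """
    server_failures = 0
    while True:
        try:
            response = post()
        except (ConnectionError, TimeoutError):
            # Transient network failure: same backoff path as a 5xx.
            if server_failures >= len(delays):
                raise
            sleep(delays[server_failures])
            server_failures += 1
            continue
        if 500 <= response.status_code < 600 and server_failures < len(delays):
            sleep(delays[server_failures])
            server_failures += 1
            continue
        # Success, a non-5xx error, or retries exhausted: fall through.
        response.raise_for_status()
        return response


# Usage: one 500 followed by a 200, with sleep replaced by a recorder.
responses = iter([FakeResponse(500), FakeResponse(200)])
slept = []
result = query_with_retry(lambda: next(responses), sleep=slept.append)
print(result.status_code, slept)  # 200 [10]
```

Passing `sleep` in as a parameter keeps the backoff observable in tests without waiting out the real delays.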

Test plan

  • Unit tests added for: 5xx retry succeeds, 5xx retry exhausts, ConnectionError/Timeout retry, auth-wrapper success, auth-wrapper retry-then-succeed, auth-wrapper exhausts attempts
  • Full test suite passes (239 tests)
  • Black formatting clean
  • Confirm the next standard run on Crescent succeeds despite (or gracefully retries past) any vendor flake
  • Watch a few Hills/Crescent runs to validate behavior in production
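
The auth-wrapper cases in the test plan (retry-then-succeed, exhausts attempts) could be exercised along the lines of the sketch below. The wrapper here is a hypothetical stand-in for `_create_superset_session_with_retry`, not the merged implementation. `create_session_with_retry`, its parameters, and the test names are all assumptions, and the built-in `OSError` stands in for `RequestException` so the example needs only the standard library.

```python
import time


def create_session_with_retry(create_session, delays=(10, 20, 40), sleep=time.sleep):
    """Hypothetical stand-in for the adapter's auth-retry wrapper."""
    last_error = None
    for attempt in range(len(delays) + 1):
        try:
            return create_session()
        except (RuntimeError, OSError) as exc:
            last_error = exc
            # Back off only if another attempt will follow.
            if attempt < len(delays):
                sleep(delays[attempt])
    raise last_error


def test_retry_then_succeed():
    calls = []
    slept = []

    def flaky_login():
        calls.append(1)
        if len(calls) < 3:
            raise RuntimeError("unexpected /api/v1/me/ response")
        return "session"

    assert create_session_with_retry(flaky_login, sleep=slept.append) == "session"
    assert slept == [10, 20]


def test_exhausts_attempts():
    slept = []

    def always_fails():
        raise RuntimeError("still 401")

    try:
        create_session_with_retry(always_fails, sleep=slept.append)
    except RuntimeError:
        pass
    else:
        raise AssertionError("expected RuntimeError")
    assert slept == [10, 20, 40]


test_retry_then_succeed()
test_exhausts_attempts()
```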

Vendor-side instability this past week killed standard DAG runs in two
places that previously had no retry:
  - 5xx from /api/v1/sqllab/execute/ during chunk queries (fatal at
    raise_for_status)
  - 401 from /api/v1/me/ during initial Superset login (fatal in
    _create_superset_session)

Adds retry-with-backoff (10s/20s/40s, 3 attempts) at both points, plus
the same retry path for ConnectionError and Timeout exceptions on the
query POST. Re-auths triggered mid-query also flow through the new
auth-retry wrapper.
dmarulli merged commit d1bd624 into main May 7, 2026
2 checks passed
dmarulli deleted the dmarulli/xylem-datalake-resilience branch May 7, 2026 17:35