Skip to content

AI Restyle: fal client poll/result missing raise_for_status (silent error masking) #39

@vansteenbergenmatisse

Description

@vansteenbergenmatisse

Surfaced by the Codex adversarial security audit on PR #35. One of 4 deferred MEDIUMs.

Where

backend/app/integrations/fal.py — two spots in submit_and_poll():

Poll loop (line 146-154):

while True:
    if time.time() - start > timeout:
        raise FalError(...)

    with httpx.Client(timeout=30.0) as client:
        poll_resp = client.get(status_url, headers=poll_headers)

    try:
        status_data = poll_resp.json()
    except json.JSONDecodeError:
        # Treat malformed poll as transient; sleep + retry.
        time.sleep(poll_interval)
        continue
    # ... falls through with no raise_for_status()

Result fetch (line 158-161):

if status == "COMPLETED":
    with httpx.Client(timeout=120.0) as client:
        result_resp = client.get(response_url, headers=poll_headers)
    return result_resp.json()

What's wrong

The submit step at line 120-121 checks resp.status_code >= 400 and raises. The poll + result GETs do not. Consequences:

  • Permanent 401/403/404 during poll — gets retried for the full timeout budget (default 600s). Burns a poll slot every 5s for 10 minutes before raising a generic timeout. Real cause was a bad token or revoked job; user waits in the dark.
  • 5xx during poll — same: retried for the full timeout, then raises timeout instead of the original error.
  • Result fetch failureresult_resp.json() will either parse error JSON as if it were the result (data-shape confusion downstream) or raise JSONDecodeError, which the caller doesn't expect.

upload_file() later in the same module DOES call raise_for_status() (line 189, 204). Pattern is inconsistent.

Severity

MEDIUM. Not a security issue — fal's API key is already protected by require_fal_queue_url() and only the trusted queue.fal.run host receives the auth header. This is purely about error legibility + faster failure.

Suggested fix

Two small changes:

  1. Poll loop: If status code is in [400, 500), break the retry loop and raise — these are permanent. Keep the silent retry only for 5xx and JSONDecodeError (transient).

    if 400 <= poll_resp.status_code < 500:
        raise FalError(f"fal.ai poll failed ({poll_resp.status_code}): {poll_resp.text[:300]}")
    if poll_resp.status_code >= 500:
        time.sleep(poll_interval)
        continue
  2. Result fetch: Add result_resp.raise_for_status() before .json().

Tests: add a httpx mock returning 401 on poll → assert FalError raised within <10s instead of timeout budget.

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions