Surfaced by the Codex adversarial security audit on PR #35. One of 4 deferred MEDIUMs.
Where
backend/app/integrations/fal.py — two spots in submit_and_poll():
Poll loop (line 146-154):
while True:
if time.time() - start > timeout:
raise FalError(...)
with httpx.Client(timeout=30.0) as client:
poll_resp = client.get(status_url, headers=poll_headers)
try:
status_data = poll_resp.json()
except json.JSONDecodeError:
# Treat malformed poll as transient; sleep + retry.
time.sleep(poll_interval)
continue
# ... falls through with no raise_for_status()
Result fetch (line 158-161):
if status == "COMPLETED":
with httpx.Client(timeout=120.0) as client:
result_resp = client.get(response_url, headers=poll_headers)
return result_resp.json()
What's wrong
The submit step at line 120-121 checks resp.status_code >= 400 and raises. The poll + result GETs do not. Consequences:
- Permanent 401/403/404 during poll — gets retried for the full timeout budget (default 600s). Burns a poll slot every 5s for 10 minutes before raising a generic timeout. Real cause was a bad token or revoked job; user waits in the dark.
- 5xx during poll — same: retried for the full timeout, then raises timeout instead of the original error.
- Result fetch failure —
result_resp.json() will either parse error JSON as if it were the result (data-shape confusion downstream) or raise JSONDecodeError, which the caller doesn't expect.
upload_file() later in the same module DOES call raise_for_status() (line 189, 204). Pattern is inconsistent.
Severity
MEDIUM. Not a security issue — fal's API key is already protected by require_fal_queue_url() and only the trusted queue.fal.run host receives the auth header. This is purely about error legibility + faster failure.
Suggested fix
Two small changes:
-
Poll loop: If status code is in [400, 500), break the retry loop and raise — these are permanent. Keep the silent retry only for 5xx and JSONDecodeError (transient).
if 400 <= poll_resp.status_code < 500:
raise FalError(f"fal.ai poll failed ({poll_resp.status_code}): {poll_resp.text[:300]}")
if poll_resp.status_code >= 500:
time.sleep(poll_interval)
continue
-
Result fetch: Add result_resp.raise_for_status() before .json().
Tests: add a httpx mock returning 401 on poll → assert FalError raised within <10s instead of timeout budget.
References
Surfaced by the Codex adversarial security audit on PR #35. One of 4 deferred MEDIUMs.
Where
backend/app/integrations/fal.py— two spots insubmit_and_poll():Poll loop (line 146-154):
Result fetch (line 158-161):
What's wrong
The submit step at line 120-121 checks
resp.status_code >= 400and raises. The poll + result GETs do not. Consequences:result_resp.json()will either parse error JSON as if it were the result (data-shape confusion downstream) or raiseJSONDecodeError, which the caller doesn't expect.upload_file()later in the same module DOES callraise_for_status()(line 189, 204). Pattern is inconsistent.Severity
MEDIUM. Not a security issue — fal's API key is already protected by
require_fal_queue_url()and only the trustedqueue.fal.runhost receives the auth header. This is purely about error legibility + faster failure.Suggested fix
Two small changes:
Poll loop: If status code is in [400, 500), break the retry loop and raise — these are permanent. Keep the silent retry only for 5xx and JSONDecodeError (transient).
Result fetch: Add
result_resp.raise_for_status()before.json().Tests: add a httpx mock returning 401 on poll → assert FalError raised within <10s instead of timeout budget.
References
require_fal_queue_urlatintegrations/fal.py:44-56)