
fix(api): tolerate undecryptable rows in /api/settings/keys#52

Merged
JFK merged 2 commits into main from fix/settings-keys-decrypt-tolerance on Apr 7, 2026

Conversation

@JFK
Owner

JFK commented Apr 7, 2026

Summary

  • Catch decryption failures per row in `GET /api/settings/keys` instead of letting a single `InvalidToken` raise out of the endpoint and return HTTP 500
  • Surface broken rows with `decryption_error: true` + a masked placeholder so the UI can prompt re-entry
  • Add a unit test that seeds an undecryptable row and asserts the endpoint stays 200 with the failed row flagged

Why

Discovered while debugging: the dev server's Settings page rendered blank because `/api/settings/keys` returned 500. Root cause was a stale `api_key.openai` row that had been encrypted with a different `ENCRYPTION_KEY` and could no longer be decrypted (likely from a previous `.env` rotation).

The previous code:
```python
return [{"masked": _mask_key(decrypt(k.value)), ...} for k in keys]
```
A single `Fernet.InvalidToken` would raise out of the comprehension → 500 → entire Settings page unusable.

This is also the most likely root cause of the misleading "Model not found. Check Settings → LLM Models." error users see when an `api_key.*` row is undecryptable: `_get_credential` raises, the keyword heuristic in `errors.py:131-137` misclassifies the exception, and the user is pointed at the wrong setting.

Test plan

  • `pytest tests/test_settings_api.py` — 16 passing, including new `test_list_keys_tolerates_undecryptable_row`
  • `pytest --ignore=tests/e2e` — 202 passing, no regressions
  • `ruff check` / `ruff format --check` clean
  • Manual verification: restart dev container, hit `/settings`, confirm the page renders even with the stale row present

Follow-up

  • The misleading error translation in `src/errors.py:131-137` should still be addressed separately. See the v1.0.0 robustness Issues to be filed (LLM model validation + raw error preservation).

🤖 Generated with Claude Code

When ENCRYPTION_KEY is rotated (or a DB is restored from another env),
existing api_key.* rows can no longer be decrypted. The previous
implementation called decrypt() inside a list comprehension, so a single
InvalidToken would raise out of the endpoint and return HTTP 500 — which
in turn blanked the entire Settings page in the UI.

Catch the exception per row, log a warning, and surface the row with
`decryption_error: true` and a masked placeholder. The UI can then prompt
re-entry instead of disappearing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Improves robustness of the Settings API by preventing a single undecryptable encrypted setting from crashing GET /api/settings/keys, enabling the Settings UI to load even when stale/invalid encrypted rows exist (e.g., after ENCRYPTION_KEY rotation).

Changes:

  • Update GET /api/settings/keys to handle decryption failures per-row, returning a masked placeholder and decryption_error: true instead of raising.
  • Add logging for failed decrypt attempts in list_keys.
  • Add a unit test that seeds an undecryptable row and asserts the endpoint stays HTTP 200 and flags the broken entry.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
src/api/settings.py Wrap per-row decryption in try/except; return placeholder + decryption_error on failure.
tests/test_settings_api.py Add coverage ensuring /api/settings/keys tolerates an undecryptable encrypted row.


Comment on lines +63 to +72
```python
try:
    entry["masked"] = _mask_key(decrypt(k.value))
except Exception as e:
    # Decryption typically fails when ENCRYPTION_KEY has been rotated
    # since this row was written. Surface the row in an error state
    # so the UI can prompt re-entry, instead of crashing the whole
    # endpoint and blanking the settings page.
    logger.warning("Failed to decrypt %s: %s", k.key, e)
    entry["masked"] = "****"
    entry["decryption_error"] = True
```

Copilot AI Apr 7, 2026


The except Exception here will also swallow non-rotation failures (e.g., decrypt() raises RuntimeError when ENCRYPTION_KEY is unset/invalid in src/services/crypto.py), causing this endpoint to return 200 with placeholder data and potentially masking a fatal misconfiguration. Consider narrowing the exception to the specific decryption failures you want to tolerate (e.g., cryptography.fernet.InvalidToken, possibly decode/type errors) and letting unexpected exceptions propagate (or be handled separately) so real bugs/config issues aren’t hidden.
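A minimal sketch of the narrowed handler the review is asking for, as a self-contained illustration. The `decrypt` and `_mask_key` stubs below are stand-ins for the repo's helpers (assumed, not the actual implementation); the point is that `InvalidToken` is tolerated per row while configuration errors such as `RuntimeError` still propagate:

```python
import logging

from cryptography.fernet import Fernet, InvalidToken

logger = logging.getLogger(__name__)
fernet = Fernet(Fernet.generate_key())  # stand-in for the app's ENCRYPTION_KEY


def decrypt(token: str) -> str:
    # Stand-in for src/services/crypto.py: raises InvalidToken on a
    # wrong-key or corrupt token; a missing-key RuntimeError would not
    # be raised here and thus would not be swallowed below.
    return fernet.decrypt(token.encode()).decode()


def _mask_key(value: str) -> str:
    # Illustrative masking only; the real _mask_key may differ.
    return value[:3] + "****"


def build_entry(key: str, value: str) -> dict:
    entry = {"key": key}
    try:
        entry["masked"] = _mask_key(decrypt(value))
    except InvalidToken as e:
        # Tolerate only genuine decryption failures (e.g. key rotation);
        # anything else propagates so misconfiguration is not hidden.
        logger.warning("Failed to decrypt %s: %s", key, e)
        entry["masked"] = "****"
        entry["decryption_error"] = True
    return entry
```

With this shape, a row written under a rotated key comes back flagged, while an unset or invalid `ENCRYPTION_KEY` still fails loudly at startup or at the call site.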

Comment on lines +18 to +25
```python
# Insert a row with raw bytes that are NOT a valid Fernet token.
# Use delete-then-insert in case a previous test left a google row behind.
from sqlalchemy import delete

async with async_session() as s:
    await s.execute(delete(Setting).where(Setting.key == "api_key.google"))
    s.add(Setting(key="api_key.google", value="not-a-valid-fernet-token", encrypted=True))
    await s.commit()
```

Copilot AI Apr 7, 2026


This test inserts value="not-a-valid-fernet-token" with encrypted=True, which isn’t actually “encrypted with a different ENCRYPTION_KEY” (it’s not a valid Fernet token at all). To better match the scenario described (key rotation), consider generating a syntactically valid Fernet token using a different Fernet key and inserting that instead, so the failure mode is specifically “valid token but cannot be decrypted with the current key.”
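The suggested fixture can be sketched like this (a hedged illustration, not the repo's test code): the token is syntactically valid Fernet but signed with a key the app does not hold, so the failure mode is "wrong key", not "malformed input":

```python
from cryptography.fernet import Fernet, InvalidToken

# A token encrypted with a key the app does NOT have. It is valid Fernet,
# so decryption fails on the signature check, matching the key-rotation
# scenario rather than the malformed-input path.
foreign_key = Fernet.generate_key()
undecryptable_value = Fernet(foreign_key).encrypt(b"sk-foreign").decode()

# The app's current key cannot decrypt it:
current = Fernet(Fernet.generate_key())
try:
    current.decrypt(undecryptable_value.encode())
    raise AssertionError("should not decrypt")
except InvalidToken:
    pass  # exactly the post-rotation failure the test should exercise
```

In the test, `undecryptable_value` would take the place of the `"not-a-valid-fernet-token"` literal.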

- Narrow exception in list_keys from `Exception` to `InvalidToken` so
  config errors (e.g. RuntimeError from missing ENCRYPTION_KEY in
  src/services/crypto.py) propagate instead of being silently masked
  as a "decryption_error" row
- Replace the placeholder `not-a-valid-fernet-token` test fixture with
  a real Fernet token generated from a *different* key, so the test
  exercises the actual key-rotation scenario (valid token, signature
  fails) rather than a malformed-input path
- Re-seed `api_key.openai` with a fresh test-key token at the start of
  the tolerance test so the assertion isn't poisoned by stale rows in
  a shared dev DB

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@JFK
Owner Author

JFK commented Apr 7, 2026

Addressed both Copilot review comments in faa4dd0:

  1. src/api/settings.py:62 — narrow exception
    Replaced except Exception with except InvalidToken. RuntimeError from crypto.py (missing ENCRYPTION_KEY) now propagates correctly instead of being masked as a decryption_error: true row. Added a comment documenting the rationale.

  2. tests/test_settings_api.py:25 — realistic Fernet token
    The test now generates a syntactically valid Fernet token using a freshly generated different key (`Fernet.generate_key()`, then `Fernet(foreign_key).encrypt(b"sk-foreign")`), so the failure mode is "valid token, wrong key" — exactly the post-rotation scenario described in the PR. Also re-seeds api_key.openai with a fresh test-key token to keep the assertion stable in shared dev DBs.

pytest tests/test_settings_api.py → 16 passed. Full suite (202) clean.

JFK merged commit 8e502e3 into main Apr 7, 2026
2 checks passed
JFK deleted the fix/settings-keys-decrypt-tolerance branch April 7, 2026 13:51
JFK added a commit that referenced this pull request Apr 7, 2026
Failed jobs now persist a structured error_detail JSON blob alongside the
user-facing error_message. The keyword-based translation in
classify_error() loses information and can mistranslate (e.g. an
InvalidToken decrypt failure surfaces as "Model not found"); the new
field keeps the original exception class, raw message, stage, provider,
model, and timestamp so users can debug from the History page instead of
reading server logs.

- Add Job.error_detail Text column + Alembic migration
- src/errors.py: build_error_detail / serialize_error_detail / parse_error_detail
- transcribe.py: populate at refine/verify/metadata catch sites; resolve
  provider_name and refine/metadata model BEFORE the try block so the
  catch site does not re-query a possibly degraded session
- jobs.py: populate at top-level _process_job and _generate_meta_job
  catch sites; expose error_detail in GET /api/jobs/{id}
- templating.py: register parse_error_detail Jinja filter
- job_status.html: add <details> "Show details" affordance
- i18n: history.show_details (en/ja)
- tests: 4 new tests covering field preservation, InvalidToken
  distinction (#52 scenario), serialize/parse round-trip, and full
  pipeline catch path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
JFK added a commit that referenced this pull request Apr 7, 2026
… (#56)

* feat(observability): preserve raw API error in error_detail field (#54)

Failed jobs now persist a structured error_detail JSON blob alongside the
user-facing error_message. The keyword-based translation in
classify_error() loses information and can mistranslate (e.g. an
InvalidToken decrypt failure surfaces as "Model not found"); the new
field keeps the original exception class, raw message, stage, provider,
model, and timestamp so users can debug from the History page instead of
reading server logs.

- Add Job.error_detail Text column + Alembic migration
- src/errors.py: build_error_detail / serialize_error_detail / parse_error_detail
- transcribe.py: populate at refine/verify/metadata catch sites; resolve
  provider_name and refine/metadata model BEFORE the try block so the
  catch site does not re-query a possibly degraded session
- jobs.py: populate at top-level _process_job and _generate_meta_job
  catch sites; expose error_detail in GET /api/jobs/{id}
- templating.py: register parse_error_detail Jinja filter
- job_status.html: add <details> "Show details" affordance
- i18n: history.show_details (en/ja)
- tests: 4 new tests covering field preservation, InvalidToken
  distinction (#52 scenario), serialize/parse round-trip, and full
  pipeline catch path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(observability): address Copilot review on PR #56

- _process_job: use get_provider_name(job.provider) for error_detail.provider
  so internal "whisper" doesn't leak through where other catch sites record
  the user-facing "openai".
- _generate_meta_job: pre-compute provider_name and resolve effective model
  before the try block, then record exactly that pair in error_detail.
- transcribe.py metadata step: pass the resolved metadata_model to
  _run_metadata_generation so the call, cost log, and error_detail all
  reference the same model (was re-querying _get_model internally).
- errors.build_error_detail: emit occurred_at as second-precision ISO-8601
  with Z suffix (e.g. 2026-04-07T15:32:18Z) instead of microsecond +00:00
  for predictable UI rendering.
- tests: assert the new occurred_at format.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>