add backend vault metadata and hidden media flags#284
Conversation
|
Warning Review limit reached
More reviews will be available in 33 minutes and 23 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (15)
📝 WalkthroughWalkthroughThis PR establishes hidden media filtering as a first-class feature by introducing vault state columns and encryption metadata models, then applies consistent visibility constraints across all public API surfaces and background jobs while preserving encryption material secrecy in responses. ChangesHidden Media & Vault State Implementation
Sequence Diagram(s)sequenceDiagram
participant Client
participant APIEndpoint
participant Gallery
participant Database
Client->>APIEndpoint: GET /image/{id}
APIEndpoint->>Gallery: _load_public_media_or_404
Gallery->>Database: SELECT * WHERE media_id=? AND is_hidden=false
alt Media found and visible
Database-->>Gallery: Media record
Gallery-->>APIEndpoint: Media object
APIEndpoint-->>Client: 200 Image response
else Media hidden or not found
Database-->>Gallery: No rows
Gallery-->>APIEndpoint: HTTPException 404
APIEndpoint-->>Client: 404 Not found
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
PR Context Summary
Suggested issue links
Use |
There was a problem hiding this comment.
🟠 High
Find/backend/src/find_api/workers/jobs.py
Line 454 in 4361ccb
The person_id reset at line 454 clears assignments for ALL faces in the database, but the new filtering at lines 391-396 only loads faces from non-hidden media for clustering. Faces from hidden media will have their person_id set to None and never reassigned, permanently losing their person associations. If a user hides an image, runs clustering, then unhides it, those faces will appear as unclustered noise.
🤖 Copy this AI Prompt to have your agent fix this:
In file backend/src/find_api/workers/jobs.py around line 454:
The `person_id` reset at line 454 clears assignments for ALL faces in the database, but the new filtering at lines 391-396 only loads faces from non-hidden media for clustering. Faces from hidden media will have their `person_id` set to `None` and never reassigned, permanently losing their person associations. If a user hides an image, runs clustering, then unhides it, those faces will appear as unclustered noise.
Evidence trail:
backend/src/find_api/workers/jobs.py lines 391-396 (filtered query with Media.is_hidden.is_(False)), line 432 (face_ids derived from face_rows), line 454 (unfiltered db.query(Face).update({Face.person_id: None})), lines 497-504 (only face_ids get person_id reassigned). Commit: REVIEWED_COMMIT.
There was a problem hiding this comment.
🟡 Medium routers/clusters.py:64
When the first 5 member_ids are all hidden but later members are visible, sample_media returns empty because it only queries those first 5 IDs filtered by is_hidden.is_(False). The cluster passes the visible_member_count > 0 check but returns samples: [], causing cluster cards to display without preview images.
- sample_media = (
- db.query(Media)
- .filter(Media.id.in_(sample_ids), Media.is_hidden.is_(False))
- .all()
- )🤖 Copy this AI Prompt to have your agent fix this:
In file backend/src/find_api/routers/clusters.py around lines 64-68:
When the first 5 `member_ids` are all hidden but later members are visible, `sample_media` returns empty because it only queries those first 5 IDs filtered by `is_hidden.is_(False)`. The cluster passes the `visible_member_count > 0` check but returns `samples: []`, causing cluster cards to display without preview images.
Evidence trail:
backend/src/find_api/routers/clusters.py lines 52, 56-60, 61-62, 64-68 at REVIEWED_COMMIT. Line 52 takes first 5 IDs for sampling; lines 56-60 count visible members across ALL member_ids; lines 64-68 query sample_media using only first 5 IDs filtered by is_hidden. The mismatch between the visibility check scope (all members) and sample scope (first 5) allows a cluster to pass the visibility gate but have zero sample images.
ApprovabilityVerdict: Needs human review 2 blocking correctness issues found. Schema changes add vault metadata columns and modify hidden media filtering across multiple endpoints and clustering jobs. An unresolved HIGH severity comment identifies a bug that permanently loses face-person associations for hidden media during clustering. You can customize Macroscope's approvability policy. Learn more. |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR strengthens “hidden/vault” privacy boundaries by adding vault state metadata to Media and ensuring hidden media is excluded from public-facing endpoints and background clustering/search behavior.
Changes:
- Add
vault_state,hidden_at,encrypted_attoMediaplus vault persistence models (VaultConfig,VaultMetadata) and a corresponding Alembic migration. - Update API routes and workers to systematically exclude
is_hiddenmedia from gallery/people/clusters and clustering jobs. - Expand backend tests to cover hidden-media exclusion and to prevent leaking encryption material via vault listing.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| backend/tests/test_vault.py | Adds test to ensure /api/vault/list response doesn’t expose encryption material. |
| backend/tests/test_search.py | Enhances mocked search helper and adds tests for OCR boosting, similarity bounding, OCR inclusion, and hidden filtering. |
| backend/tests/test_people.py | Expands seeding to support hidden media and adds tests to exclude hidden groups/media from people endpoints. |
| backend/tests/test_gallery.py | Updates seeding for hidden media and adds tests ensuring hidden media isn’t exposed on gallery/media routes. |
| backend/tests/test_clusters.py | Updates seeding and adds tests ensuring clusters exclude hidden members and hide hidden-only clusters. |
| backend/src/find_api/workers/jobs.py | Excludes hidden media from image and face clustering jobs. |
| backend/src/find_api/routers/vault.py | Persists additional vault metadata and sets/rolls back media vault timestamps/state. |
| backend/src/find_api/routers/people.py | Skips people entries when no visible sample faces are available. |
| backend/src/find_api/routers/gallery.py | Centralizes “public media” filtering and blocks hidden media from multiple public routes. |
| backend/src/find_api/routers/clusters.py | Filters/adjusts cluster payloads to exclude hidden media, including member counts. |
| backend/src/find_api/models/vault.py | Introduces SQLAlchemy models for vault config and per-media vault metadata. |
| backend/src/find_api/models/media.py | Adds vault state and timestamps to the Media model. |
| backend/src/find_api/models/init.py | Exposes vault models via package exports. |
| backend/src/find_api/core/database.py | Registers vault models and creates/updates DB schema elements for new vault fields. |
| backend/alembic/versions/20260528_vault_state_metadata.py | Adds migration for vault state/timestamps and vault metadata columns. |
| item = response.json()[0] | ||
| assert "encrypted_path" not in item | ||
| assert "iv" not in item | ||
| assert "verifier_ciphertext" not in item | ||
| assert "salt" not in item |
| "Media", | ||
| "Cluster", | ||
| "Face", | ||
| "Person", | ||
| "PersonFeedback", | ||
| "GeneralFeedback", | ||
| "VaultConfig", | ||
| "VaultMetadata", |
| visible_member_count = ( | ||
| db.query(Media.id) | ||
| .filter(Media.id.in_(cluster.member_ids or []), Media.is_hidden.is_(False)) | ||
| .count() | ||
| ) |
| if not member_list: | ||
| raise HTTPException(404, "Cluster not found") | ||
|
|
||
| payload = _cluster_payload(cluster, members=member_list) | ||
| payload["member_count"] = len(member_list) | ||
| return payload |
| count_sql = str(mock_db.execute.call_args_list[0].args[0]) | ||
| result_sql = str(mock_db.execute.call_args_list[1].args[0]) | ||
| assert "is_hidden = false" in count_sql | ||
| assert "is_hidden = false" in result_sql |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/src/find_api/routers/clusters.py (1)
52-68:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winCritical: Sample selection may omit visible members.
The current logic takes
sample_idsfrom the first 5 entries incluster.member_idsbefore filtering for visibility (line 52), then queries only those IDs for visible media (line 66). If the first 5 member IDs belong to hidden media, thesampleslist will be empty even whenvisible_member_count > 0, violating the contract that clusters with visible members should display visible samples.Impact: A cluster with member_ids
[hidden1, hidden2, hidden3, hidden4, hidden5, visible1, visible2]would showmember_count: 2butsamples: [], breaking the UI thumbnail display.🔧 Proposed fix: Select sample IDs from visible members
result = [] for cluster in clusters: - # Get sample images from cluster - sample_ids = (cluster.member_ids or [])[:5] - if not sample_ids: - continue - visible_member_count = ( db.query(Media.id) .filter(Media.id.in_(cluster.member_ids or []), Media.is_hidden.is_(False)) .count() ) if visible_member_count == 0: continue + # Get sample images from visible members only + sample_ids = ( + db.query(Media.id) + .filter(Media.id.in_(cluster.member_ids or []), Media.is_hidden.is_(False)) + .limit(5) + .all() + ) + sample_ids = [row.id for row in sample_ids] + sample_media = ( db.query(Media) .filter(Media.id.in_(sample_ids), Media.is_hidden.is_(False)) .all() )This ensures samples are always drawn from the visible member pool.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/src/find_api/routers/clusters.py` around lines 52 - 68, The code builds sample_ids from cluster.member_ids before filtering out hidden media, which can yield empty samples even when visible_member_count > 0; change the logic in the clusters router to first query for visible member IDs (e.g., query Media.id filtering Media.id.in_(cluster.member_ids or []) and Media.is_hidden.is_(False)), derive visible_ids and set visible_member_count = len(visible_ids) (or use the DB count of that same query), then pick sample_ids = visible_ids[:5] and finally query Media for those sample_ids to populate sample_media; update references to sample_ids, visible_member_count, and sample_media accordingly so samples always come from the visible member pool.
🧹 Nitpick comments (3)
backend/src/find_api/routers/gallery.py (1)
65-67: ⚡ Quick winMove public-media DB helpers out of the router layer.
Line [65] to Line [74] adds persistence/query logic directly in
gallery.py. Please move_public_media_queryand_load_public_media_or_404into an existing backend module/service and keep the router as a thin transport layer.As per coding guidelines,
**/*.py: "Keep FastAPI routers thin and place storage, queue, database, and ML logic in existing backend modules."Also applies to: 69-74
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/src/find_api/routers/gallery.py` around lines 65 - 67, The router contains persistence helpers _public_media_query and _load_public_media_or_404; move these functions out of gallery.py into an existing backend service/repository module (e.g., a media service) so the router is only a transport layer, keeping their signatures (accepting db: Session and raising HTTPException or returning Media) and using Media/Session as before; then update gallery.py to import and call the moved functions (replace local definitions with imports) and ensure any tests or callers reference the new module.backend/src/find_api/routers/vault.py (2)
300-302: ⚡ Quick winExtract hardcoded encryption parameters to constants.
The encryption algorithm and key derivation method are hardcoded string literals. Consider defining these as module-level constants or importing them from the crypto module to maintain a single source of truth and improve maintainability.
♻️ Proposed refactor to use constants
At the top of the file (after imports):
# Encryption metadata constants ENCRYPTION_ALGORITHM = "AES-256-GCM" KEY_DERIVATION_METHOD = "PBKDF2-HMAC-SHA256"Then update the INSERT:
{ "media_id": media.id, "encrypted_path": str(encrypted_path), "iv": iv, - "encryption_algorithm": "AES-256-GCM", - "key_derivation": "PBKDF2-HMAC-SHA256", + "encryption_algorithm": ENCRYPTION_ALGORITHM, + "key_derivation": KEY_DERIVATION_METHOD, "ciphertext_size": encrypted_path.stat().st_size, },🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/src/find_api/routers/vault.py` around lines 300 - 302, Define module-level constants for the encryption parameters and replace the hardcoded literals: add ENCRYPTION_ALGORITHM = "AES-256-GCM" and KEY_DERIVATION_METHOD = "PBKDF2-HMAC-SHA256" near the top of the vault module (after imports or import them from the crypto module if available), then update the code that builds the metadata dictionary (the keys "encryption_algorithm" and "key_derivation" in the insertion/response code around the encrypted_path handling) to use ENCRYPTION_ALGORITHM and KEY_DERIVATION_METHOD instead of the string literals so the values are centralized and maintainable.
262-327: ⚖️ Poor tradeoffConsider extracting hide_media logic to a service module.
The
hide_mediaendpoint contains significant database, storage, and cryptography logic (~65 lines). As per coding guidelines, routers should be kept thin with storage, database, and other complex logic placed in dedicated backend modules.Consider extracting the core hide operation logic to a service module (e.g.,
backend/src/find_api/services/vault_service.py) to improve testability, separation of concerns, and adherence to the thin-router pattern.As per coding guidelines: "Keep routers thin; put storage, queue, database, and ML logic in their existing modules in backend code."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/src/find_api/routers/vault.py` around lines 262 - 327, The hide_media route implements too much business logic (DB, storage, crypto) and should be refactored into a service; extract the core operation into a new function e.g., vault_service.hide_media_operation(session_token, media_id) that encapsulates resolving the master key (using _get_cached_master_key), loading media and metadata (_load_media_or_404, _load_vault_metadata), downloading (download_file_to_path), encrypting (encrypt_file), storing vault metadata (the INSERT logic), marking media hidden and committing/rolling back, deleting original (delete_file) and invoking _rollback_hidden_state_after_delete_failure on delete errors, and cleaning up temp files and encrypted_path on errors; then make the router hide_media only resolve session_token (using _resolve_session_token), call vault_service.hide_media_operation(payload.session_token or resolved token, payload.media_id) and return the simple {"status":"hidden","media_id":...} response, preserving original exception-to-HTTP mappings and transaction/cleanup behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@backend/alembic/versions/20260528_vault_state_metadata.py`:
- Around line 61-67: The backfill UPDATE uses a WHERE clause that will never
match because the new column vault_state is added with server_default="visible"
and nullable=False; change the migration so the correct values from is_hidden
are written: either move the UPDATE on vault_state (the op.execute(...) block)
to run before adding the vault_state column with server_default, or remove the
WHERE clause and perform an unconditional UPDATE of media setting vault_state =
CASE WHEN is_hidden THEN 'hidden_encrypted' ELSE 'visible' END so all rows are
updated; update the migration around the op.add_column call and the op.execute
call referencing vault_state and is_hidden accordingly.
In `@backend/src/find_api/core/database.py`:
- Around line 225-232: The UPDATE's WHERE clause never matches because
vault_state was created with DEFAULT 'visible'; change the condition to target
hidden rows that currently have the default/empty value so they become
'hidden_encrypted'. Modify the SQL in the UPDATE (the block shown in
backend/src/find_api/core/database.py) to use WHERE is_hidden = TRUE AND
(vault_state IS NULL OR vault_state = '' OR vault_state = 'visible') or simply
WHERE is_hidden = TRUE AND vault_state = 'visible' so hidden media are correctly
backfilled to 'hidden_encrypted' without touching intentionally set vault_state
values.
In `@backend/tests/test_clusters.py`:
- Around line 100-122: Add two unit tests for the cluster detail endpoint to
mirror the list tests: create a mixed cluster (visible + hidden) using
_seed_media and _seed_cluster, call client.get(f"/api/cluster/{cluster.id}") and
assert status_code == 200, assert response body['member_count'] == 1 and that
the visible sample id is present while the hidden sample id is not in response
body['samples']; then create a hidden-only cluster, call
client.get(f"/api/cluster/{cluster.id}") and assert it is not exposed (expect
status_code == 404 and/or response body indicates not found, matching list
behavior where hidden-only clusters produce total == 0). Ensure tests reference
the existing helper functions _seed_media and _seed_cluster and use the same
response field names ("member_count", "samples") used in the current tests.
In `@backend/tests/test_gallery.py`:
- Around line 295-310: Extend the existing
test_hidden_media_not_available_on_public_media_routes to also exercise the new
delete flows: call DELETE on /api/image/{hidden.id} (e.g.,
client.delete(f"/api/image/{hidden.id}")) and call POST on
/api/images/bulk-delete with a payload that includes the hidden.id (e.g.,
{"ids":[hidden.id]}) and assert both responses return 404; this mirrors the
existing detail/thumbnail/like/reprocess checks and ensures the DELETE
/api/image/{id} and POST /api/images/bulk-delete regressions are covered.
---
Outside diff comments:
In `@backend/src/find_api/routers/clusters.py`:
- Around line 52-68: The code builds sample_ids from cluster.member_ids before
filtering out hidden media, which can yield empty samples even when
visible_member_count > 0; change the logic in the clusters router to first query
for visible member IDs (e.g., query Media.id filtering
Media.id.in_(cluster.member_ids or []) and Media.is_hidden.is_(False)), derive
visible_ids and set visible_member_count = len(visible_ids) (or use the DB count
of that same query), then pick sample_ids = visible_ids[:5] and finally query
Media for those sample_ids to populate sample_media; update references to
sample_ids, visible_member_count, and sample_media accordingly so samples always
come from the visible member pool.
---
Nitpick comments:
In `@backend/src/find_api/routers/gallery.py`:
- Around line 65-67: The router contains persistence helpers _public_media_query
and _load_public_media_or_404; move these functions out of gallery.py into an
existing backend service/repository module (e.g., a media service) so the router
is only a transport layer, keeping their signatures (accepting db: Session and
raising HTTPException or returning Media) and using Media/Session as before;
then update gallery.py to import and call the moved functions (replace local
definitions with imports) and ensure any tests or callers reference the new
module.
In `@backend/src/find_api/routers/vault.py`:
- Around line 300-302: Define module-level constants for the encryption
parameters and replace the hardcoded literals: add ENCRYPTION_ALGORITHM =
"AES-256-GCM" and KEY_DERIVATION_METHOD = "PBKDF2-HMAC-SHA256" near the top of
the vault module (after imports or import them from the crypto module if
available), then update the code that builds the metadata dictionary (the keys
"encryption_algorithm" and "key_derivation" in the insertion/response code
around the encrypted_path handling) to use ENCRYPTION_ALGORITHM and
KEY_DERIVATION_METHOD instead of the string literals so the values are
centralized and maintainable.
- Around line 262-327: The hide_media route implements too much business logic
(DB, storage, crypto) and should be refactored into a service; extract the core
operation into a new function e.g.,
vault_service.hide_media_operation(session_token, media_id) that encapsulates
resolving the master key (using _get_cached_master_key), loading media and
metadata (_load_media_or_404, _load_vault_metadata), downloading
(download_file_to_path), encrypting (encrypt_file), storing vault metadata (the
INSERT logic), marking media hidden and committing/rolling back, deleting
original (delete_file) and invoking _rollback_hidden_state_after_delete_failure
on delete errors, and cleaning up temp files and encrypted_path on errors; then
make the router hide_media only resolve session_token (using
_resolve_session_token), call
vault_service.hide_media_operation(payload.session_token or resolved token,
payload.media_id) and return the simple {"status":"hidden","media_id":...}
response, preserving original exception-to-HTTP mappings and transaction/cleanup
behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 142f37bf-c08b-479f-bb84-ce24f0f8aaea
📒 Files selected for processing (15)
backend/alembic/versions/20260528_vault_state_metadata.pybackend/src/find_api/core/database.pybackend/src/find_api/models/__init__.pybackend/src/find_api/models/media.pybackend/src/find_api/models/vault.pybackend/src/find_api/routers/clusters.pybackend/src/find_api/routers/gallery.pybackend/src/find_api/routers/people.pybackend/src/find_api/routers/vault.pybackend/src/find_api/workers/jobs.pybackend/tests/test_clusters.pybackend/tests/test_gallery.pybackend/tests/test_people.pybackend/tests/test_search.pybackend/tests/test_vault.py
| def test_hidden_media_not_available_on_public_media_routes(self, client, db): | ||
| hidden = _seed( | ||
| db, filename="hidden-detail.jpg", status="indexed", is_hidden=True | ||
| ) | ||
|
|
||
| detail_response = client.get(f"/api/image/{hidden.id}") | ||
| thumbnail_response = client.get( | ||
| f"/api/image/{hidden.id}/thumbnail", follow_redirects=False | ||
| ) | ||
| like_response = client.post(f"/api/image/{hidden.id}/like") | ||
| reprocess_response = client.post(f"/api/image/{hidden.id}/reprocess") | ||
|
|
||
| assert detail_response.status_code == 404 | ||
| assert thumbnail_response.status_code == 404 | ||
| assert like_response.status_code == 404 | ||
| assert reprocess_response.status_code == 404 |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Add hidden-media coverage for delete endpoints.
This covers detail/thumbnail/like/reprocess, but the PR also changes DELETE /api/image/{id} and POST /api/images/bulk-delete to reject hidden media from public flows. A focused hidden-ID case there would close the main regression gap.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@backend/tests/test_gallery.py` around lines 295 - 310, Extend the existing
test_hidden_media_not_available_on_public_media_routes to also exercise the new
delete flows: call DELETE on /api/image/{hidden.id} (e.g.,
client.delete(f"/api/image/{hidden.id}")) and call POST on
/api/images/bulk-delete with a payload that includes the hidden.id (e.g.,
{"ids":[hidden.id]}) and assert both responses return 404; this mirrors the
existing detail/thumbnail/like/reprocess checks and ensures the DELETE
/api/image/{id} and POST /api/images/bulk-delete regressions are covered.
Summary
This PR introduces backend support for hidden/encrypted media state and vault metadata, applies default filtering so hidden media is excluded from normal/public surfaces, keeps backward compatibility for existing records, and adds regression tests.
Fixes #184
Type of change
What changed
vault_state(default:visible)hidden_atencrypted_atVaultConfigVaultMetadataencryption_algorithmkey_derivationciphertext_sizeFiles:
vault_state,hidden_at,encrypted_at)Files:
Applied default exclusion (
is_hidden = false) for normal/public flows.Updated API behavior:
Files:
vault_state, timestamps).File:
File:
Screenshots / recordings (for UI changes)
Attach before/after screenshots or a short video.
How to test
Checklist
GSSoC'26 checklist
Summary by CodeRabbit
Release Notes