feat: improve natural-language search relevance for text-heavy and unrelated images#246
feat: improve natural-language search relevance for text-heavy and unrelated images#246shubh-gitpush wants to merge 5 commits into
Conversation
|
Warning Review limit reached
More reviews will be available in 38 minutes and 43 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughIncorporates OCR into hybrid embeddings and search reranking: processor fusion uses OCR-aware weights, mock embedder and tests updated for OCR signals, a new ranking module computes textual boosts and bounds similarity, the /search endpoint applies composite scoring and reranks in-memory, and frontend tests are added for gallery and search UIs. ChangesBackend OCR-aware search relevance
Frontend test coverage
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds OCR-aware relevance improvements across embedding generation and search ranking, plus new UI regression tests for search and gallery empty states.
Changes:
- Introduces OCR-aware hybrid embedding fusion (weighted when OCR text is present).
- Adds textual/OCR boost-based reranking in
/api/searchresponses and returnsocr_textmetadata. - Adds frontend React Query-backed tests for Search and Gallery empty states.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| frontend/src/tests/search-page.test.tsx | Adds coverage for search submit → results render → preview modal flow. |
| frontend/src/tests/gallery-empty-state.test.tsx | Adds coverage for gallery empty state CTA and filter-clearing behavior. |
| backend/tests/test_search.py | Adds test asserting OCR/text boost reranks search results. |
| backend/tests/test_hybrid_embedding.py | Updates embedding tests for new ocr_text signal and adds OCR-weighted fusion test. |
| backend/src/find_api/workers/processors.py | Implements OCR-aware weighted hybrid embedding path and updates logging/docs. |
| backend/src/find_api/routers/search.py | Adds token-based textual boost + candidate set reranking and exposes ocr_text. |
| backend/src/find_api/ml/mock_embedder.py | Updates mock embedder to incorporate OCR text into mock hybrid vector. |
Comments suppressed due to low confidence (1)
backend/src/find_api/ml/mock_embedder.py:1
- Using
str(metadata.get(...))will turnNoneinto the literal string'None', which then becomes a real text signal and affects embeddings. Also,text_partscurrently includes whitespace-only values (they pass theif partfilter before.strip()), potentially passing empty strings intoembed_textand reintroducing the empty-string bias that the full-mode embedder avoids. Prefercaption = (metadata.get(\"caption\") or \"\")/ocr_text = (metadata.get(\"ocr_text\") or \"\"), and buildtext_partsby stripping first and only keeping non-empty stripped strings.
"""Deterministic lightweight embeddings for contributor development."""
| results.sort( | ||
| key=lambda item: ( | ||
| float(item["similarity"]), | ||
| float(item.get("_vector_similarity", 0.0)), | ||
| -int(item["media_id"]), | ||
| ), | ||
| reverse=True, | ||
| ) |
| "caption": caption_text or None, | ||
| "objects": metadata_payload.get("objects") or [], | ||
| "ocr_text": ocr_text or None, | ||
| } |
| """ | ||
| Generate hybrid embedding from image, caption, and detected objects. | ||
| Generate hybrid embedding from image, caption, detected objects, and OCR text. | ||
|
|
||
| Weighted average depends on which text signals are present: | ||
| Weighted average depends on which text signals are present: | ||
| - image + caption + objects → equal thirds (1/3 each) | ||
| - image + caption only → halves (1/2 each) | ||
| - image + objects only → halves (1/2 each) | ||
| - image only → image vector directly | ||
|
|
||
| When OCR text is present, we apply OCR-aware weights and normalise them | ||
| across active signals to prioritize text relevance for document-like images. | ||
|
|
||
| Empty strings are never passed to embed_text() because CLIP encodes |
af05698 to
3c9f274
Compare
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
backend/src/find_api/ml/mock_embedder.py (1)
79-97:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winStrip before filtering
text_parts.Whitespace-only
caption/ocr_textvalues still passif part, so" "becomes""after stripping and feeds the deterministic empty-text vector back into mock mode. Filter on the stripped values instead.Suggested fix
- text_parts = [part.strip() for part in [caption, object_text, ocr_text] if part] + text_parts = [ + part + for part in (caption.strip(), object_text.strip(), ocr_text.strip()) + if part + ]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/src/find_api/ml/mock_embedder.py` around lines 79 - 97, The code builds text_parts using [part.strip() for part in [caption, object_text, ocr_text] if part] which lets whitespace-only strings through; change the filtering to operate on the stripped strings (e.g., compute stripped parts first and then filter with if part) so that values like " " are removed before calling embed_text and producing hybrid_vector; update the text_parts creation in the block where caption, object_text, ocr_text are combined (used before calling self.embed_text and assigning hybrid_vector) to filter on part.strip() rather than the original part.backend/src/find_api/routers/search.py (1)
101-119:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftApply the OCR/text boost before thresholding and pagination.
similarity > :thresholdandLIMIT/OFFSETstill run on raw vector similarity, so this rerank only reorders the current vector-ranked page. Rows just below the vector cutoff or outside the first SQL window can never be promoted, which makesresults,total, andhas_moreinconsistent with the final score you expose.Also applies to: 123-150, 243-258
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/src/find_api/routers/search.py` around lines 101 - 119, The current code computes threshold and runs count_query (using vector <=> with embedding_str) and pagination on raw vector similarity, then applies OCR/text boost only after fetching a page, which prevents promoted items outside the initial vector cutoff; fix by applying the OCR/text boost before thresholding/pagination: either incorporate the text/ocr boost into the SQL scoring expression (replace 1 - (vector <=> ...) with a combined_score that adds the OCR/text boost) so count_query, the main SELECT, and LIMIT/OFFSET operate on the boosted score, or fetch a sufficiently large candidate set, compute boosted scores in Python, then compute total/has_more/pagination from those boosted scores; update uses of threshold, count_query, the main SELECT (vector <=>), embedding_str, and the results/total/has_more calculation to rely on the boosted score instead of raw similarity.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@backend/src/find_api/routers/search.py`:
- Around line 225-237: final_similarity (computed as vector_similarity +
textual_boost in search.py) can exceed the normalized upper bound of 1.0; clamp
the computed score before assigning it to the returned "similarity" field: after
computing final_similarity (from vector_similarity and textual_boost, and/or
_compute_textual_boost), apply a bounding step to ensure the value is within
[0.0, 1.0] (and use the clamped value when building the results dict for
"media_id"/"similarity").
- Around line 24-62: The router contains ML/search logic (_tokenize,
_compute_textual_boost and the reranking loop) that should be moved into the
search/ML layer: extract _tokenize and _compute_textual_boost into the existing
search/ML module as exported utilities, move the reranking loop (the code that
applies textual boost to search results and reorders them) into a single
function (e.g., rerank_results(results, query)) in that module, update the
router to call those exported functions (importing
tokenize/compute_textual_boost or rerank_results) so the router only wires
request -> service -> response, and update or add unit tests for the new
functions in the search/ML tests suite.
In `@backend/tests/test_hybrid_embedding.py`:
- Around line 360-380: The file fails Ruff formatting; run the formatter and
commit the changes: run `ruff format` on the test file (the reviewer suggested
`uv run ruff format backend/tests/test_hybrid_embedding.py`) to auto-fix style
issues in the test function test_ocr_present_uses_weighted_hybrid (and any
surrounding helpers like _run) so that backend-check no longer reports rewrites;
after formatting, re-run tests and ensure assertions and the embedder.embed_text
call remain unchanged before committing.
In `@backend/tests/test_search.py`:
- Around line 178-221: The file fails Ruff formatting; run the formatter and
commit the result: run the ruff formatter on the test file and re-stage the
changes (e.g., execute the equivalent of `uv run ruff format` for the file
containing test_ocr_text_boost_reranks_results and _mock_search), then verify
`ruff format --check` passes before pushing so backend/tests/test_search.py
(including the MagicMock blocks for text_heavy and portrait and the
test_ocr_text_boost_reranks_results function) is properly formatted.
In `@frontend/src/__tests__/search-page.test.tsx`:
- Around line 132-152: The test for the empty search state lacks an assertion
that the search API was invoked; update the test to assert that
apiMocks.searchImages was called with the expected query after submitting the
form (e.g., verify apiMocks.searchImages was calledOnce and calledWith a payload
containing "nothing" or the expected query params) so the UI assertions only
pass when the search flow actually executed; locate the assertions around
apiMocks.searchImages, renderWithQueryClient, fireEvent.change and
fireEvent.click in the "shows the empty search state when no matches are
returned" test and add the call/argument assertions immediately after awaiting
the UI results.
- Around line 27-33: The mock ImagePreviewModal currently always renders a
dialog so tests can’t detect open-on-click; change the test mock implementation
for ImagePreviewModal to accept the modal props (e.g., { media, open, onClose })
and only render the element with role="dialog" and
data-testid="preview-filename" when open is true (when open is false render null
or a non-dialog element/aria-hidden) so the existing click test can assert that
the dialog appears only after the Preview click; keep the same data-testid and
span content so other assertions still work.
---
Outside diff comments:
In `@backend/src/find_api/ml/mock_embedder.py`:
- Around line 79-97: The code builds text_parts using [part.strip() for part in
[caption, object_text, ocr_text] if part] which lets whitespace-only strings
through; change the filtering to operate on the stripped strings (e.g., compute
stripped parts first and then filter with if part) so that values like " " are
removed before calling embed_text and producing hybrid_vector; update the
text_parts creation in the block where caption, object_text, ocr_text are
combined (used before calling self.embed_text and assigning hybrid_vector) to
filter on part.strip() rather than the original part.
In `@backend/src/find_api/routers/search.py`:
- Around line 101-119: The current code computes threshold and runs count_query
(using vector <=> with embedding_str) and pagination on raw vector similarity,
then applies OCR/text boost only after fetching a page, which prevents promoted
items outside the initial vector cutoff; fix by applying the OCR/text boost
before thresholding/pagination: either incorporate the text/ocr boost into the
SQL scoring expression (replace 1 - (vector <=> ...) with a combined_score that
adds the OCR/text boost) so count_query, the main SELECT, and LIMIT/OFFSET
operate on the boosted score, or fetch a sufficiently large candidate set,
compute boosted scores in Python, then compute total/has_more/pagination from
those boosted scores; update uses of threshold, count_query, the main SELECT
(vector <=>), embedding_str, and the results/total/has_more calculation to rely
on the boosted score instead of raw similarity.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 92dbafbc-b21c-423b-8ecd-6590ed6c08b2
📒 Files selected for processing (7)
backend/src/find_api/ml/mock_embedder.pybackend/src/find_api/routers/search.pybackend/src/find_api/workers/processors.pybackend/tests/test_hybrid_embedding.pybackend/tests/test_search.pyfrontend/src/__tests__/gallery-empty-state.test.tsxfrontend/src/__tests__/search-page.test.tsx
| vi.mock("@/components/image-preview-modal", () => ({ | ||
| ImagePreviewModal: ({ media }: { media: { filename?: string } }) => ( | ||
| <div role="dialog" aria-label="preview"> | ||
| <span data-testid="preview-filename">{media.filename}</span> | ||
| </div> | ||
| ), | ||
| })); |
There was a problem hiding this comment.
Mocked preview modal is always “open”, so this test can’t verify open-on-click behavior.
ImagePreviewModal currently renders role="dialog" unconditionally, so the modal assertion can pass even before clicking Preview.
Proposed fix
vi.mock("`@/components/image-preview-modal`", () => ({
- ImagePreviewModal: ({ media }: { media: { filename?: string } }) => (
- <div role="dialog" aria-label="preview">
- <span data-testid="preview-filename">{media.filename}</span>
- </div>
- ),
+ ImagePreviewModal: ({
+ media,
+ isOpen,
+ }: {
+ media: { filename?: string };
+ isOpen?: boolean;
+ }) =>
+ isOpen ? (
+ <div role="dialog" aria-label="preview">
+ <span data-testid="preview-filename">{media.filename}</span>
+ </div>
+ ) : null,
}));🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@frontend/src/__tests__/search-page.test.tsx` around lines 27 - 33, The mock
ImagePreviewModal currently always renders a dialog so tests can’t detect
open-on-click; change the test mock implementation for ImagePreviewModal to
accept the modal props (e.g., { media, open, onClose }) and only render the
element with role="dialog" and data-testid="preview-filename" when open is true
(when open is false render null or a non-dialog element/aria-hidden) so the
existing click test can assert that the dialog appears only after the Preview
click; keep the same data-testid and span content so other assertions still
work.
| it("shows the empty search state when no matches are returned", async () => { | ||
| apiMocks.searchImages.mockResolvedValueOnce({ | ||
| results: [], | ||
| total: 0, | ||
| query: "nothing", | ||
| }); | ||
|
|
||
| renderWithQueryClient(); | ||
|
|
||
| fireEvent.change( | ||
| screen.getByPlaceholderText(/a visual memory, object, scene, or mood/i), | ||
| { target: { value: "nothing" } }, | ||
| ); | ||
| fireEvent.click(screen.getByRole("button", { name: /^search$/i })); | ||
|
|
||
| await waitFor(() => { | ||
| expect(screen.getByText(/no results found/i)).toBeInTheDocument(); | ||
| expect( | ||
| screen.getByText(/try a broader phrase or a visible object/i), | ||
| ).toBeInTheDocument(); | ||
| }); |
There was a problem hiding this comment.
Assert the search API call in the empty-state test to avoid false positives.
Right now this test only checks text rendering; adding a call assertion ensures submit actually exercised the search flow.
Proposed fix
fireEvent.click(screen.getByRole("button", { name: /^search$/i }));
+ await waitFor(() => {
+ expect(apiMocks.searchImages).toHaveBeenCalledWith({
+ query: "nothing",
+ limit: 24,
+ });
+ });
+
await waitFor(() => {
expect(screen.getByText(/no results found/i)).toBeInTheDocument();
expect(
screen.getByText(/try a broader phrase or a visible object/i),
).toBeInTheDocument();
});🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@frontend/src/__tests__/search-page.test.tsx` around lines 132 - 152, The test
for the empty search state lacks an assertion that the search API was invoked;
update the test to assert that apiMocks.searchImages was called with the
expected query after submitting the form (e.g., verify apiMocks.searchImages was
calledOnce and calledWith a payload containing "nothing" or the expected query
params) so the UI assertions only pass when the search flow actually executed;
locate the assertions around apiMocks.searchImages, renderWithQueryClient,
fireEvent.change and fireEvent.click in the "shows the empty search state when
no matches are returned" test and add the call/argument assertions immediately
after awaiting the UI results.
PR Context Summary
Suggested issue links
Use |
ApprovabilityVerdict: Needs human review Unable to check for correctness in e85b0ff. This PR introduces new search ranking algorithms and OCR-aware embedding weights that significantly change search result scoring and ordering. Unresolved review comments identify a potential bug in tie-breaker sorting logic and data exposure concerns. Human review recommended for these runtime behavior changes. You can customize Macroscope's approvability policy. Learn more. |
Abhash-Chakraborty
left a comment
There was a problem hiding this comment.
Reviewed against issue #17. I resolved the merge conflicts with current main, preserved the existing search query cache/debug diagnostics, restored unrelated frontend test coverage, fixed include_ocr cache-key behavior, and verified the OCR-aware embedding/reranking tests. Backend suite and relevant frontend tests pass.
|
@macroscope-app review Please review this PR against its linked issue, local-first privacy rules, and the current Find repo instructions. |
Summary
This change improves search relevance for text-heavy images by making OCR text a first-class signal in both indexing and retrieval.
Updated hybrid embedding generation to include OCR text alongside image, caption, and detected objects, with weighting tuned for better text-aware matching.
Search quality was weak for document-like queries because OCR text was extracted but not meaningfully used in ranking.
As a result, visually similar but text-irrelevant images could rank close to or above text-relevant images.
By incorporating OCR into embeddings and ranking, the system now better prioritizes images that actually contain matching text, improving relevance for queries such as calendars, notes, receipts, and other document-style content.
Fixes #17
Type of change
What changed
Added OCR-aware reranking in search so results with strong query-to-text matches are boosted appropriately.
Tuned similarity thresholds separately for mock mode and full ML mode to keep behavior stable across environments.
Added/updated backend tests to validate OCR-weighted embeddings and OCR-influenced ranking behavior.
Screenshots / recordings (for UI changes)
Attach before/after screenshots or a short video.
How to test
uv run pytest tests -v
Checklist
GSSoC'26 checklist
Summary by CodeRabbit
New Features
Tests