Skip to content

feat: improve natural-language search relevance for text-heavy and unrelated images#246

Open
shubh-gitpush wants to merge 5 commits into
Abhash-Chakraborty:mainfrom
shubh-gitpush:nlp
Open

feat: improve natural-language search relevance for text-heavy and unrelated images#246
shubh-gitpush wants to merge 5 commits into
Abhash-Chakraborty:mainfrom
shubh-gitpush:nlp

Conversation

@shubh-gitpush
Copy link
Copy Markdown
Contributor

@shubh-gitpush shubh-gitpush commented May 25, 2026

Summary

This change improves search relevance for text-heavy images by making OCR text a first-class signal in both indexing and retrieval.
Updated hybrid embedding generation to include OCR text alongside image, caption, and detected objects, with weighting tuned for better text-aware matching.
Search quality was weak for document-like queries because OCR text was extracted but not meaningfully used in ranking.
As a result, visually similar but text-irrelevant images could rank close to or above text-relevant images.

By incorporating OCR into embeddings and ranking, the system now better prioritizes images that actually contain matching text, improving relevance for queries such as calendars, notes, receipts, and other document-style content.

Fixes #17

Type of change

  • Bug fix
  • Feature
  • Documentation update
  • Refactor
  • CI / tooling

What changed

Added OCR-aware reranking in search so results with strong query-to-text matches are boosted appropriately.
Tuned similarity thresholds separately for mock mode and full ML mode to keep behavior stable across environments.
Added/updated backend tests to validate OCR-weighted embeddings and OCR-influenced ranking behavior.

Screenshots / recordings (for UI changes)

Attach before/after screenshots or a short video.

How to test

uv run pytest tests -v

Checklist

  • I linked the related issue
  • I ran required checks from CONTRIBUTING.md
  • I updated docs/env notes if needed
  • My PR is scoped to a single issue
  • I followed commit message conventions
  • I am not committing secrets or local artifacts

GSSoC'26 checklist

  • I requested issue assignment before starting
  • I have meaningful commits (no spam commits)
  • I am ready to explain my implementation in review comments

Summary by CodeRabbit

  • New Features

    • Search now uses OCR, captions, and detected objects for improved semantic reranking and bounded similarity scores.
    • Hybrid embeddings use OCR-aware weighted fusion, biasing toward available text signals when OCR is present.
    • Search responses can include raw OCR text when explicitly requested.
  • Tests

    • Added and updated tests covering OCR-driven reranking, similarity bounds, hybrid-embedding behavior, and frontend gallery/search flows.

Review Change Stack

Copilot AI review requested due to automatic review settings May 25, 2026 18:17
@github-actions github-actions Bot added the do-not-merge Blocks merging. Remove this label only when the PR is fully ready for production. label May 25, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 25, 2026

Warning

Review limit reached

@Abhash-Chakraborty, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 38 minutes and 43 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4e0415cf-e5a7-4095-b51b-d762281c8cbb

📥 Commits

Reviewing files that changed from the base of the PR and between a1ab67c and e85b0ff.

📒 Files selected for processing (3)
  • backend/src/find_api/routers/search.py
  • backend/src/find_api/services/query_cache.py
  • backend/tests/test_search.py
📝 Walkthrough

Walkthrough

Incorporates OCR into hybrid embeddings and search reranking: processor fusion uses OCR-aware weights, mock embedder and tests updated for OCR signals, a new ranking module computes textual boosts and bounds similarity, the /search endpoint applies composite scoring and reranks in-memory, and frontend tests are added for gallery and search UIs.

Changes

Backend OCR-aware search relevance

Layer / File(s) Summary
OCR-aware hybrid embedding weights and processor logic
backend/src/find_api/workers/processors.py
Adds OCR_AWARE_SIGNAL_WEIGHTS and updates generate_hybrid_embedding() to embed multiple text signals together and apply OCR-aware normalized weighted fusion when OCR text is present; falls back to equal averaging when absent; updates logging.
Search endpoint textual boost and reranking
backend/src/find_api/routers/search.py
Imports ranking helpers, introduces MOCK_SIMILARITY_THRESHOLD and FULL_SIMILARITY_THRESHOLD, computes query_tokens once, extracts caption/ocr/objects per result, computes a capped textual_boost, combines it with vector_similarity into bounded final_similarity, stores _vector_similarity, and invokes rerank_results() to reorder results.
Search ranking helpers
backend/src/find_api/ml/search_ranking.py
New module providing tokenize, extract_object_labels, compute_textual_boost, bound_similarity, TEXT_QUERY_TERMS, and rerank_results for overlap-based textual boosts and in-memory sorting.
Mock embedder OCR support
backend/src/find_api/ml/mock_embedder.py
embed_metadata() now reads ocr_text, builds text_parts from caption/object_text/ocr_text, embeds combined text when present, and forms a weighted hybrid (image0.45 + text0.55) or falls back to image-only; normalization preserved.
Hybrid embedding test suite with OCR coverage
backend/tests/test_hybrid_embedding.py
Updates _run to accept ocr_text, adds OCR_VEC, updates existing tests to pass ocr_text=\"\", and adds test_ocr_present_uses_weighted_hybrid asserting weighted fusion and a single embed_text call with the combined non-empty text list.
Search API reranking tests
backend/tests/test_search.py
Adds tests verifying OCR-text boosting reranks results, similarity is clamped to <=1.0, and ocr_text is returned only when include_ocr=true.

Frontend test coverage

Layer / File(s) Summary
Gallery empty state test suite
frontend/src/__tests__/gallery-empty-state.test.tsx
New Vitest tests for gallery empty states with mocked Next.js/router and gallery APIs; includes renderWithQueryClient and resetSearchParams; asserts empty messages and correct navigation for filtered empty-state.
Search page test suite
frontend/src/__tests__/search-page.test.tsx
New Vitest tests for Search page with mocked API and UI components; tests search submission shows results and preview modal, and empty-results UI displays guidance text.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

  • #17: This PR integrates OCR into hybrid embeddings and ranking, addressing the issue's objective to improve relevance for text-heavy queries.

Possibly related PRs

Suggested labels

api, testing

Suggested reviewers

  • Abhash-Chakraborty

Poem

🐰 I hopped through code with curious nose,

Found OCR text where the calendar grows,
Weighed image and words with rabbity art,
Tests now sing; search ranks play their part,
Hooray — results find the text-first heart.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 72.41% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title accurately describes the main feature: improving search relevance for text-heavy images, which is the primary change across all modified files.
Description check ✅ Passed The PR description covers the summary, type of change (feature), what changed, testing instructions, and addresses most checklist items. However, 'Screenshots/recordings' section is empty and 'CONTRIBUTING.md checks' checkbox is unchecked.
Linked Issues check ✅ Passed The PR implements all core acceptance criteria: OCR text is now integrated into embeddings (processors.py, mock_embedder.py), reranking is applied for text-aware matching (search.py, search_ranking.py), thresholds are tuned separately (MOCK_SIMILARITY_THRESHOLD, FULL_SIMILARITY_THRESHOLD), and comprehensive tests validate the changes.
Out of Scope Changes check ✅ Passed All changes directly support the PR objective of improving OCR-aware search relevance. Frontend test files (gallery and search page tests) validate UI integration with the new ranking features, while backend changes implement the core OCR weighting and reranking logic.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds OCR-aware relevance improvements across embedding generation and search ranking, plus new UI regression tests for search and gallery empty states.

Changes:

  • Introduces OCR-aware hybrid embedding fusion (weighted when OCR text is present).
  • Adds textual/OCR boost-based reranking in /api/search responses and returns ocr_text metadata.
  • Adds frontend React Query-backed tests for Search and Gallery empty states.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
frontend/src/tests/search-page.test.tsx Adds coverage for search submit → results render → preview modal flow.
frontend/src/tests/gallery-empty-state.test.tsx Adds coverage for gallery empty state CTA and filter-clearing behavior.
backend/tests/test_search.py Adds test asserting OCR/text boost reranks search results.
backend/tests/test_hybrid_embedding.py Updates embedding tests for new ocr_text signal and adds OCR-weighted fusion test.
backend/src/find_api/workers/processors.py Implements OCR-aware weighted hybrid embedding path and updates logging/docs.
backend/src/find_api/routers/search.py Adds token-based textual boost + candidate set reranking and exposes ocr_text.
backend/src/find_api/ml/mock_embedder.py Updates mock embedder to incorporate OCR text into mock hybrid vector.
Comments suppressed due to low confidence (1)

backend/src/find_api/ml/mock_embedder.py:1

  • Using str(metadata.get(...)) will turn None into the literal string 'None', which then becomes a real text signal and affects embeddings. Also, text_parts currently includes whitespace-only values (they pass the if part filter before .strip()), potentially passing empty strings into embed_text and reintroducing the empty-string bias that the full-mode embedder avoids. Prefer caption = (metadata.get(\"caption\") or \"\") / ocr_text = (metadata.get(\"ocr_text\") or \"\"), and build text_parts by stripping first and only keeping non-empty stripped strings.
"""Deterministic lightweight embeddings for contributor development."""

Comment thread backend/src/find_api/routers/search.py Outdated
Comment on lines +244 to +251
results.sort(
key=lambda item: (
float(item["similarity"]),
float(item.get("_vector_similarity", 0.0)),
-int(item["media_id"]),
),
reverse=True,
)
Comment on lines +214 to 217
"caption": caption_text or None,
"objects": metadata_payload.get("objects") or [],
"ocr_text": ocr_text or None,
}
Comment on lines 144 to 156
"""
Generate hybrid embedding from image, caption, and detected objects.
Generate hybrid embedding from image, caption, detected objects, and OCR text.

Weighted average depends on which text signals are present:
Weighted average depends on which text signals are present:
- image + caption + objects → equal thirds (1/3 each)
- image + caption only → halves (1/2 each)
- image + objects only → halves (1/2 each)
- image only → image vector directly

When OCR text is present, we apply OCR-aware weights and normalise them
across active signals to prioritize text relevance for document-like images.

Empty strings are never passed to embed_text() because CLIP encodes
@shubh-gitpush shubh-gitpush force-pushed the nlp branch 2 times, most recently from af05698 to 3c9f274 Compare May 25, 2026 18:21
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
backend/src/find_api/ml/mock_embedder.py (1)

79-97: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Strip before filtering text_parts.

Whitespace-only caption/ocr_text values still pass if part, so " " becomes "" after stripping and feeds the deterministic empty-text vector back into mock mode. Filter on the stripped values instead.

Suggested fix
-        text_parts = [part.strip() for part in [caption, object_text, ocr_text] if part]
+        text_parts = [
+            part
+            for part in (caption.strip(), object_text.strip(), ocr_text.strip())
+            if part
+        ]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/src/find_api/ml/mock_embedder.py` around lines 79 - 97, The code
builds text_parts using [part.strip() for part in [caption, object_text,
ocr_text] if part] which lets whitespace-only strings through; change the
filtering to operate on the stripped strings (e.g., compute stripped parts first
and then filter with if part) so that values like "   " are removed before
calling embed_text and producing hybrid_vector; update the text_parts creation
in the block where caption, object_text, ocr_text are combined (used before
calling self.embed_text and assigning hybrid_vector) to filter on part.strip()
rather than the original part.
backend/src/find_api/routers/search.py (1)

101-119: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Apply the OCR/text boost before thresholding and pagination.

similarity > :threshold and LIMIT/OFFSET still run on raw vector similarity, so this rerank only reorders the current vector-ranked page. Rows just below the vector cutoff or outside the first SQL window can never be promoted, which makes results, total, and has_more inconsistent with the final score you expose.

Also applies to: 123-150, 243-258

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/src/find_api/routers/search.py` around lines 101 - 119, The current
code computes threshold and runs count_query (using vector <=> with
embedding_str) and pagination on raw vector similarity, then applies OCR/text
boost only after fetching a page, which prevents promoted items outside the
initial vector cutoff; fix by applying the OCR/text boost before
thresholding/pagination: either incorporate the text/ocr boost into the SQL
scoring expression (replace 1 - (vector <=> ...) with a combined_score that adds
the OCR/text boost) so count_query, the main SELECT, and LIMIT/OFFSET operate on
the boosted score, or fetch a sufficiently large candidate set, compute boosted
scores in Python, then compute total/has_more/pagination from those boosted
scores; update uses of threshold, count_query, the main SELECT (vector <=>),
embedding_str, and the results/total/has_more calculation to rely on the boosted
score instead of raw similarity.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/src/find_api/routers/search.py`:
- Around line 225-237: final_similarity (computed as vector_similarity +
textual_boost in search.py) can exceed the normalized upper bound of 1.0; clamp
the computed score before assigning it to the returned "similarity" field: after
computing final_similarity (from vector_similarity and textual_boost, and/or
_compute_textual_boost), apply a bounding step to ensure the value is within
[0.0, 1.0] (and use the clamped value when building the results dict for
"media_id"/"similarity").
- Around line 24-62: The router contains ML/search logic (_tokenize,
_compute_textual_boost and the reranking loop) that should be moved into the
search/ML layer: extract _tokenize and _compute_textual_boost into the existing
search/ML module as exported utilities, move the reranking loop (the code that
applies textual boost to search results and reorders them) into a single
function (e.g., rerank_results(results, query)) in that module, update the
router to call those exported functions (importing
tokenize/compute_textual_boost or rerank_results) so the router only wires
request -> service -> response, and update or add unit tests for the new
functions in the search/ML tests suite.

In `@backend/tests/test_hybrid_embedding.py`:
- Around line 360-380: The file fails Ruff formatting; run the formatter and
commit the changes: run `ruff format` on the test file (the reviewer suggested
`uv run ruff format backend/tests/test_hybrid_embedding.py`) to auto-fix style
issues in the test function test_ocr_present_uses_weighted_hybrid (and any
surrounding helpers like _run) so that backend-check no longer reports rewrites;
after formatting, re-run tests and ensure assertions and the embedder.embed_text
call remain unchanged before committing.

In `@backend/tests/test_search.py`:
- Around line 178-221: The file fails Ruff formatting; run the formatter and
commit the result: run the ruff formatter on the test file and re-stage the
changes (e.g., execute the equivalent of `uv run ruff format` for the file
containing test_ocr_text_boost_reranks_results and _mock_search), then verify
`ruff format --check` passes before pushing so backend/tests/test_search.py
(including the MagicMock blocks for text_heavy and portrait and the
test_ocr_text_boost_reranks_results function) is properly formatted.

In `@frontend/src/__tests__/search-page.test.tsx`:
- Around line 132-152: The test for the empty search state lacks an assertion
that the search API was invoked; update the test to assert that
apiMocks.searchImages was called with the expected query after submitting the
form (e.g., verify apiMocks.searchImages was calledOnce and calledWith a payload
containing "nothing" or the expected query params) so the UI assertions only
pass when the search flow actually executed; locate the assertions around
apiMocks.searchImages, renderWithQueryClient, fireEvent.change and
fireEvent.click in the "shows the empty search state when no matches are
returned" test and add the call/argument assertions immediately after awaiting
the UI results.
- Around line 27-33: The mock ImagePreviewModal currently always renders a
dialog so tests can’t detect open-on-click; change the test mock implementation
for ImagePreviewModal to accept the modal props (e.g., { media, open, onClose })
and only render the element with role="dialog" and
data-testid="preview-filename" when open is true (when open is false render null
or a non-dialog element/aria-hidden) so the existing click test can assert that
the dialog appears only after the Preview click; keep the same data-testid and
span content so other assertions still work.

---

Outside diff comments:
In `@backend/src/find_api/ml/mock_embedder.py`:
- Around line 79-97: The code builds text_parts using [part.strip() for part in
[caption, object_text, ocr_text] if part] which lets whitespace-only strings
through; change the filtering to operate on the stripped strings (e.g., compute
stripped parts first and then filter with if part) so that values like "   " are
removed before calling embed_text and producing hybrid_vector; update the
text_parts creation in the block where caption, object_text, ocr_text are
combined (used before calling self.embed_text and assigning hybrid_vector) to
filter on part.strip() rather than the original part.

In `@backend/src/find_api/routers/search.py`:
- Around line 101-119: The current code computes threshold and runs count_query
(using vector <=> with embedding_str) and pagination on raw vector similarity,
then applies OCR/text boost only after fetching a page, which prevents promoted
items outside the initial vector cutoff; fix by applying the OCR/text boost
before thresholding/pagination: either incorporate the text/ocr boost into the
SQL scoring expression (replace 1 - (vector <=> ...) with a combined_score that
adds the OCR/text boost) so count_query, the main SELECT, and LIMIT/OFFSET
operate on the boosted score, or fetch a sufficiently large candidate set,
compute boosted scores in Python, then compute total/has_more/pagination from
those boosted scores; update uses of threshold, count_query, the main SELECT
(vector <=>), embedding_str, and the results/total/has_more calculation to rely
on the boosted score instead of raw similarity.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 92dbafbc-b21c-423b-8ecd-6590ed6c08b2

📥 Commits

Reviewing files that changed from the base of the PR and between ae515d6 and 15d0e2f.

📒 Files selected for processing (7)
  • backend/src/find_api/ml/mock_embedder.py
  • backend/src/find_api/routers/search.py
  • backend/src/find_api/workers/processors.py
  • backend/tests/test_hybrid_embedding.py
  • backend/tests/test_search.py
  • frontend/src/__tests__/gallery-empty-state.test.tsx
  • frontend/src/__tests__/search-page.test.tsx

Comment thread backend/src/find_api/routers/search.py Outdated
Comment thread backend/src/find_api/routers/search.py
Comment thread backend/tests/test_hybrid_embedding.py
Comment thread backend/tests/test_search.py
Comment on lines +27 to +33
vi.mock("@/components/image-preview-modal", () => ({
ImagePreviewModal: ({ media }: { media: { filename?: string } }) => (
<div role="dialog" aria-label="preview">
<span data-testid="preview-filename">{media.filename}</span>
</div>
),
}));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Mocked preview modal is always “open”, so this test can’t verify open-on-click behavior.

ImagePreviewModal currently renders role="dialog" unconditionally, so the modal assertion can pass even before clicking Preview.

Proposed fix
 vi.mock("`@/components/image-preview-modal`", () => ({
-  ImagePreviewModal: ({ media }: { media: { filename?: string } }) => (
-    <div role="dialog" aria-label="preview">
-      <span data-testid="preview-filename">{media.filename}</span>
-    </div>
-  ),
+  ImagePreviewModal: ({
+    media,
+    isOpen,
+  }: {
+    media: { filename?: string };
+    isOpen?: boolean;
+  }) =>
+    isOpen ? (
+      <div role="dialog" aria-label="preview">
+        <span data-testid="preview-filename">{media.filename}</span>
+      </div>
+    ) : null,
 }));
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/src/__tests__/search-page.test.tsx` around lines 27 - 33, The mock
ImagePreviewModal currently always renders a dialog so tests can’t detect
open-on-click; change the test mock implementation for ImagePreviewModal to
accept the modal props (e.g., { media, open, onClose }) and only render the
element with role="dialog" and data-testid="preview-filename" when open is true
(when open is false render null or a non-dialog element/aria-hidden) so the
existing click test can assert that the dialog appears only after the Preview
click; keep the same data-testid and span content so other assertions still
work.

Comment on lines +132 to +152
it("shows the empty search state when no matches are returned", async () => {
apiMocks.searchImages.mockResolvedValueOnce({
results: [],
total: 0,
query: "nothing",
});

renderWithQueryClient();

fireEvent.change(
screen.getByPlaceholderText(/a visual memory, object, scene, or mood/i),
{ target: { value: "nothing" } },
);
fireEvent.click(screen.getByRole("button", { name: /^search$/i }));

await waitFor(() => {
expect(screen.getByText(/no results found/i)).toBeInTheDocument();
expect(
screen.getByText(/try a broader phrase or a visible object/i),
).toBeInTheDocument();
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert the search API call in the empty-state test to avoid false positives.

Right now this test only checks text rendering; adding a call assertion ensures submit actually exercised the search flow.

Proposed fix
   fireEvent.click(screen.getByRole("button", { name: /^search$/i }));

+  await waitFor(() => {
+    expect(apiMocks.searchImages).toHaveBeenCalledWith({
+      query: "nothing",
+      limit: 24,
+    });
+  });
+
   await waitFor(() => {
     expect(screen.getByText(/no results found/i)).toBeInTheDocument();
     expect(
       screen.getByText(/try a broader phrase or a visible object/i),
     ).toBeInTheDocument();
   });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/src/__tests__/search-page.test.tsx` around lines 132 - 152, The test
for the empty search state lacks an assertion that the search API was invoked;
update the test to assert that apiMocks.searchImages was called with the
expected query after submitting the form (e.g., verify apiMocks.searchImages was
calledOnce and calledWith a payload containing "nothing" or the expected query
params) so the UI assertions only pass when the search flow actually executed;
locate the assertions around apiMocks.searchImages, renderWithQueryClient,
fireEvent.change and fireEvent.click in the "shows the empty search state when
no matches are returned" test and add the call/argument assertions immediately
after awaiting the UI results.

@Abhash-Chakraborty Abhash-Chakraborty added gssoc26 Related to GirlScript Summer of Code 2026. level:advanced GSSoC difficulty level: advanced. Base contributor points: 55. backend FastAPI, database, storage, and API work frontend Frontend UI, state, and Next.js work ml Model inference, embeddings, OCR, captions, and search relevance type:feature Feature PR. GSSoC type bonus. type:testing Testing-related PR. GSSoC type bonus: +10 points. under-review Maintainer needs to verify labels May 27, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 27, 2026

PR Context Summary

Suggested issue links

  • No strong issue match found yet.

Use Fixes #123 or Closes #123 in the PR body when one of the suggestions is the intended issue.
Manual rerun: Actions > PR Context Triage > Run workflow > set pr_number and force_review=true.

@macroscopeapp
Copy link
Copy Markdown
Contributor

macroscopeapp Bot commented May 27, 2026

Approvability

Verdict: Needs human review

Unable to check for correctness in e85b0ff. This PR introduces new search ranking algorithms and OCR-aware embedding weights that significantly change search result scoring and ordering. Unresolved review comments identify a potential bug in tie-breaker sorting logic and data exposure concerns. Human review recommended for these runtime behavior changes.

You can customize Macroscope's approvability policy. Learn more.

@Abhash-Chakraborty Abhash-Chakraborty added quality:clean Clean and maintainable PR. GSSoC contributor multiplier: 1.2x. ready-to-merge Fully approved, tested, and cleared for immediate merging. and removed frontend Frontend UI, state, and Next.js work under-review Maintainer needs to verify do-not-merge Blocks merging. Remove this label only when the PR is fully ready for production. labels May 31, 2026
Copy link
Copy Markdown
Owner

@Abhash-Chakraborty Abhash-Chakraborty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed against issue #17. I resolved the merge conflicts with current main, preserved the existing search query cache/debug diagnostics, restored unrelated frontend test coverage, fixed include_ocr cache-key behavior, and verified the OCR-aware embedding/reranking tests. Backend suite and relevant frontend tests pass.

@github-actions
Copy link
Copy Markdown

@macroscope-app review

Please review this PR against its linked issue, local-first privacy rules, and the current Find repo instructions.
Linked issue(s): #17.
Trigger source: label-gated review (ready-to-merge).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend FastAPI, database, storage, and API work gssoc26 Related to GirlScript Summer of Code 2026. level:advanced GSSoC difficulty level: advanced. Base contributor points: 55. ml Model inference, embeddings, OCR, captions, and search relevance quality:clean Clean and maintainable PR. GSSoC contributor multiplier: 1.2x. ready-to-merge Fully approved, tested, and cleared for immediate merging. type:feature Feature PR. GSSoC type bonus. type:testing Testing-related PR. GSSoC type bonus: +10 points.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: improve natural-language search relevance for text-heavy and unrelated images

3 participants