OpenConceptLab/ocl_issues#2388 | custom encoder model for reranking#8

Open
snyaggarwal wants to merge 1 commit into main from issues#2388

@snyaggarwal
Contributor

Linked Issue

Closes OpenConceptLab/ocl_issues#2388

Member

@paynejd paynejd left a comment

Review — OpenConceptLab/ocl_issues#2388

1. Add drop-in models to ENCODER_MODEL_OPTIONS with descriptions

Currently rerankerModels.js only has the default. Add 3 additional drop-in compatible models (no backend code changes needed — these all work with the existing CrossEncoder in sentence-transformers 3.3.1):

export const ENCODER_MODEL_OPTIONS = [
  {
    id: 'BAAI/bge-reranker-v2-m3',
    description: 'Multilingual, general-purpose (0.6B)',
  },
  {
    id: 'cross-encoder/ms-marco-MiniLM-L-6-v2',
    description: 'Fast and lightweight, English-only (23M)',
  },
  {
    id: 'ncbi/MedCPT-Cross-Encoder',
    description: 'Biomedical domain, trained on PubMed (110M)',
  },
  {
    id: 'Alibaba-NLP/gte-reranker-modernbert-base',
    description: 'Balanced quality, supports longer descriptions (149M)',
  },
]

This changes the data shape from string array to object array, so RerankerConfig.jsx needs updating to use option.id and display option.description in the dropdown. Each model offers a genuinely different tradeoff:

  • ms-marco-MiniLM: 27x smaller than default, ~10x faster — good for latency-critical or large batch runs
  • MedCPT: only biomedical-domain cross-encoder available, trained on 18M PubMed query-article pairs — most relevant for health terminology mapping
  • gte-modernbert: near-default quality at 4x smaller, 8192 token context window (vs 128 default) for longer concept descriptions
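A minimal sketch of how the dropdown could consume the new object shape. `normalizeOption` and `toDropdownItems` are hypothetical helper names, not functions that exist in RerankerConfig.jsx, and the `value`/`label`/`helperText` fields are assumptions about what the select component wants. The string fallback assumes a saved project might still hold a bare model id from the old string-array shape.

```javascript
// Hypothetical helpers; names are illustrative and not taken from RerankerConfig.jsx.
const ENCODER_MODEL_OPTIONS = [
  { id: 'BAAI/bge-reranker-v2-m3', description: 'Multilingual, general-purpose (0.6B)' },
  { id: 'cross-encoder/ms-marco-MiniLM-L-6-v2', description: 'Fast and lightweight, English-only (23M)' },
];

// Accept either the old string shape or the new { id, description } shape.
function normalizeOption(option) {
  return typeof option === 'string' ? { id: option, description: '' } : option;
}

// Build entries for a select component: the id is the stored value,
// the description is shown as secondary text in the dropdown.
function toDropdownItems(options) {
  return options.map(normalizeOption).map(({ id, description }) => ({
    value: id,
    label: id,
    helperText: description,
  }));
}
```

Keeping `id` as the stored value means the string sent to the backend is unchanged; only the display layer needs to know about `description`.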

2. Log the encoder model on rerank events

At line 2226, the rerank_finished log should include which model was used:

log({action: 'rerank_finished', description: `Reranked with ${encoderModel}`}, index)

Same for rerank_failed at line 2231:

log({action: 'rerank_failed', description: `Rerank failed with ${encoderModel}`}, index)

This is visible in the row's Discuss/log panel and persists with the project — important for debugging and reproducibility when users are experimenting with different models.
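One way to keep the two log calls consistent is a small helper that builds the entry for both outcomes. This is only a sketch: `buildRerankLogEntry` is a hypothetical name, the `'unknown model'` fallback is an assumption, and `log`/`index` are taken to be the existing function and row index already in scope at those lines.

```javascript
// Hypothetical helper: builds the payload for the existing log(entry, index) call.
function buildRerankLogEntry(succeeded, encoderModel) {
  // Assumed fallback so the log line stays readable if no model is set.
  const model = encoderModel || 'unknown model';
  return succeeded
    ? { action: 'rerank_finished', description: `Reranked with ${model}` }
    : { action: 'rerank_failed', description: `Rerank failed with ${model}` };
}

// Usage inside the rerank handler (log and index come from the surrounding code):
//   log(buildRerankLogEntry(true, encoderModel), index)
//   log(buildRerankLogEntry(false, encoderModel), index)
```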

3. Fix prop naming inconsistency

MapProject.jsx passes rerankerConfig={encoderModel} and setRerankerConfig={setEncoderModel} to ConfigurationForm, but the value is a plain string, not a config object. The naming is misleading. Suggest renaming to encoderModel/setEncoderModel or rerankerModel/setRerankerModel throughout.

4. Fix Spanish translation accents

In es/translations.json:

  • "Configuracion del reranker" → "Configuración del reranker"
  • "automaticamente" → "automáticamente"

5. Coordinate Closes keyword with oclapi2#839

Both PRs say "Closes OpenConceptLab/ocl_issues#2388". Whichever merges first will auto-close the issue prematurely. Suggest this PR (oclmap) keeps the Closes since it's the user-facing final piece, and oclapi2#839 changes to Ref OpenConceptLab/ocl_issues#2388 (a bare #2388 inside oclapi2 would resolve against that repo's own issue numbers).


Related follow-up tickets filed:


Development

Successfully merging this pull request may close these issues.

Add option to configure a custom reranker in OCL Mapper Project Config