Semantic Caching with Bi-Direction Encoder and Cross Encoder #76

Merged
shahmeer99 merged 28 commits into main from semantic-caching on Apr 17, 2026

Conversation

@sureshkumarsrinath
Contributor

@sureshkumarsrinath sureshkumarsrinath commented Feb 6, 2026

Adding a simple Cross-Encoder similarity-based semantic cache with cache benchmark tests. This is a simple benchmark to establish the baseline for cache performance.

@sureshkumarsrinath sureshkumarsrinath marked this pull request as ready for review March 6, 2026 03:07
@sureshkumarsrinath sureshkumarsrinath changed the title from "[WIP] Semantic Caching with Bi-Direction Encoder and Cross Encoder" to "Semantic Caching with Bi-Direction Encoder and Cross Encoder" Mar 6, 2026
Comment thread src/cache.py Outdated
Comment thread src/cache.py Outdated
Comment thread src/cache.py Outdated
Comment thread src/cache.py Outdated
Comment thread tests/test_benchmarks.py Outdated
@jarulraj jarulraj requested a review from shahmeer99 March 6, 2026 16:26
Comment thread src/cache.py Outdated
@@ -0,0 +1,159 @@
from yaml import Node
Contributor

Not used

Comment thread src/cache.py Outdated
# Global cache and constants
# -----------------------------
SEMANTIC_CACHE: Dict[str, Deque[Dict[str, Any]]] = {}
SEMANTIC_CACHE_THRESHOLD = 0.85
Contributor

Not used

Comment thread src/main.py Outdated

# Step 1: Get chunks (golden, retrieved, or none)

normalized_question = normalize_question(question)
Contributor

Why are these definitions computed when cfg.semantic_cache_enabled is not True? This is wasted computation whenever semantic caching is turned off.
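
One way to address this is to move the cache-only work behind the flag. The following is a minimal sketch, not the PR's actual code: `Config`, `answer_question`, `cache_lookup`, and `generate` are hypothetical names, and `normalize_question` is a placeholder for the function used in src/main.py.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Config:
    # Hypothetical stand-in for the project's config object.
    semantic_cache_enabled: bool = False


def normalize_question(question: str) -> str:
    # Placeholder normalization; the real function in the PR may differ.
    return " ".join(question.lower().split())


def answer_question(
    question: str,
    cfg: Config,
    cache_lookup: Callable[[str], Optional[str]],
    generate: Callable[[str], str],
) -> str:
    if cfg.semantic_cache_enabled:
        # Cache-only work happens inside the guard, so it costs nothing
        # when the feature is turned off.
        normalized_question = normalize_question(question)
        cached = cache_lookup(normalized_question)
        if cached is not None:
            return cached
    return generate(question)
```

With the flag off, neither the normalization nor the lookup runs, and the request goes straight to generation.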

Comment thread src/generator.py
Contributor

@shahmeer99 shahmeer99 left a comment

I have merged the current main and added some comments in main to make it clearer (those were my second-to-last and third-to-last commits in this branch). I have also updated the testing (my last commit) to cover the concern (CONCERN 1) I explain below. The last commit:

  • Adds "trick_variations" to the cache benchmark YAML. These were partially model generated and partially written by me. They are not the best at the moment and you should update them later (as will I). The idea is to check whether the cache gives us a hit for a variation that sounds very similar to the original question but does NOT have the same answer (such a hit is a false positive and results in the wrong answer). The script has been updated to reflect this part of the test as well. Your original hit rate metric remains as is (only the printed output has changed). I want to EMPHASIZE that the trick_variations were quickly created by me and Claude, and we need to come up with better trick_variation cases to really test this problem.
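
The two metrics described here can be sketched as follows. This is an illustration only, not the benchmark script from the PR: the `question` and `variations` field names and the `lookup` callable are assumptions, and only `trick_variations` comes from the comment above.

```python
from typing import Any, Callable, Dict, List, Optional


def evaluate_cache(
    cases: List[Dict[str, Any]],
    lookup: Callable[[str], Optional[str]],
) -> Dict[str, float]:
    """Compute hit rate on paraphrases and false-positive rate on trick variations.

    Each case is assumed to look like:
      {"question": str, "variations": [str, ...], "trick_variations": [str, ...]}
    `lookup` returns the cached question that matched, or None on a miss.
    """
    hits = total_variations = 0
    false_positives = total_tricks = 0
    for case in cases:
        original = case["question"]
        for variation in case.get("variations", []):
            total_variations += 1
            # A paraphrase should resolve to the original cached question.
            hits += lookup(variation) == original
        for trick in case.get("trick_variations", []):
            total_tricks += 1
            # Any hit on a trick variation would serve a wrong answer.
            false_positives += lookup(trick) is not None
    return {
        "hit_rate": hits / total_variations if total_variations else 0.0,
        "false_positive_rate": false_positives / total_tricks if total_tricks else 0.0,
    }
```

Keeping the two rates separate makes the trade-off visible: a threshold change that raises the hit rate can be rejected if it also raises the false-positive rate.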

CONCERN 1: how to choose optimal thresholds?

Where are these 2 values coming from:
BI_ENCODER_THRESHOLD = 0.40
CROSS_ENCODER_THRESHOLD = 0.75

These thresholds determine what counts as a hit in the semantic cache. Set them too low and we risk returning a completely irrelevant answer on a false positive hit (this is the worst case scenario!); set them too high and we get no hits at all (wasted computation). We need a comprehensive benchmark and suite of tests to make sure we are hitting some kind of sweet spot between these 2 issues. Could you elaborate on how you chose these 2 values to begin with?
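
To make the role of the two thresholds concrete, here is a minimal sketch of a two-stage lookup. It assumes the bi-encoder acts as a cheap first-pass filter and the cross-encoder as the final check; the actual logic in src/cache.py is not shown in this thread, and `two_stage_lookup`, `bi_score`, and `cross_score` are hypothetical names.

```python
from typing import Callable, List, Optional, Tuple

# Values quoted in the review comment above.
BI_ENCODER_THRESHOLD = 0.40
CROSS_ENCODER_THRESHOLD = 0.75


def two_stage_lookup(
    query: str,
    cached_questions: List[str],
    bi_score: Callable[[str, str], float],
    cross_score: Callable[[str, str], float],
    bi_threshold: float = BI_ENCODER_THRESHOLD,
    cross_threshold: float = CROSS_ENCODER_THRESHOLD,
) -> Optional[Tuple[str, float]]:
    """Return (cached_question, cross_score) for the best match, or None."""
    # Stage 1: cheap bi-encoder similarity filters the candidate set.
    candidates = [q for q in cached_questions if bi_score(query, q) >= bi_threshold]
    # Stage 2: the more expensive cross-encoder re-scores only the survivors.
    best: Optional[Tuple[str, float]] = None
    for candidate in candidates:
        score = cross_score(query, candidate)
        if score >= cross_threshold and (best is None or score > best[1]):
            best = (candidate, score)
    return best
```

In this arrangement the bi-encoder threshold mostly controls cost (how many candidates reach stage 2), while the cross-encoder threshold decides whether anything is served from the cache at all.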

@shahmeer99
Contributor

@sureshkumarsrinath The idea is good and I think this feature will help immensely in the long run. But we need better testing of what makes a semantic cache good (threshold values, etc). I have left a detailed comment explaining my changes, my concerns, and some future steps (and also some small syntax comments here and there). One more thing: have you considered a ColBERT approach (MaxSim at the token level across the 2 queries)? It is a bit expensive but can be heavily parallelized and could potentially be used at the final stage, though I am not sure whether it will help.
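
For reference, the MaxSim idea mentioned above can be sketched in plain Python. This is only an illustration under assumptions: it takes token embeddings as given lists of floats (no model is loaded), and it averages the per-token maxima so the score stays in a fixed range for thresholding, whereas ColBERT itself sums them.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def max_sim(query_tokens: List[Vector], cached_tokens: List[Vector]) -> float:
    """Late-interaction score: for each query token embedding, take its best
    cosine match among the cached question's token embeddings, then average."""
    if not query_tokens or not cached_tokens:
        return 0.0
    per_token_best = [max(cosine(q, c) for c in cached_tokens) for q in query_tokens]
    return sum(per_token_best) / len(per_token_best)
```

Note that the score is asymmetric: `max_sim(a, b)` and `max_sim(b, a)` can differ, which matters when deciding which side is the incoming query and which is the cached question.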

@sureshkumarsrinath
Contributor Author

@shahmeer99 Thanks. I will note your suggestions here and work on improving this feature.

@shahmeer99 shahmeer99 added the enhancement New feature or request label Mar 13, 2026
@sureshkumarsrinath
Contributor Author

I performed an exhaustive benchmark across several combinations of the threshold values against an adversarial dataset to determine the sweet spot between Accuracy (correctly retrieving the paraphrased queries) and False Positive Rate (returning entirely wrong answers).

Serving an irrelevant answer because of a false positive match is not acceptable and undermines the user experience, so I've intentionally optimized the system for safety over aggressive caching. I've turned both of these values up high to ensure a 0% false positive rate:

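| BI Threshold | Cross Threshold | Accuracy | False Positive Rate |
|--------------|-----------------|----------|---------------------|
| 0.85         | 0.99            | 99.3%    | 1.3%                |
| 0.90         | 0.99            | 97.3%    | 0.7%                |
| 0.95         | 0.99            | 62.0%    | 0.0%                |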
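
The selection rule described above (no false positives first, then the best accuracy) can be written down directly. This is a sketch using only the three rows reported in this comment; `SweepResult` and `pick_thresholds` are hypothetical names, not part of the PR.

```python
from typing import List, NamedTuple, Optional


class SweepResult(NamedTuple):
    bi_threshold: float
    cross_threshold: float
    accuracy: float             # fraction of paraphrases served from cache
    false_positive_rate: float  # fraction of trick queries served from cache


# The three rows reported above, as fractions.
RESULTS = [
    SweepResult(0.85, 0.99, 0.993, 0.013),
    SweepResult(0.90, 0.99, 0.973, 0.007),
    SweepResult(0.95, 0.99, 0.620, 0.000),
]


def pick_thresholds(results: List[SweepResult], max_fpr: float = 0.0) -> Optional[SweepResult]:
    """Among rows within the false-positive budget, pick the most accurate."""
    safe = [r for r in results if r.false_positive_rate <= max_fpr]
    return max(safe, key=lambda r: r.accuracy) if safe else None
```

With a budget of zero false positives, only the 0.95 / 0.99 row qualifies; relaxing `max_fpr` to 0.01 would select the 0.90 / 0.99 row instead.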

@sureshkumarsrinath
Contributor Author

This is just the baseline cache implementation for the system. Further improvements will be added as flag-enabled implementations.

Comment thread config/config.yaml Outdated
use_double_prompt: false
enable_history: true
max_history_turns: 3
semantic_cache_enabled: true
Contributor

default should be false

@shahmeer99
Contributor

@sureshkumarsrinath This looks good to me. Thanks for considering the extra false positive testing. Like you said, it's a great start for this feature and can be improved upon in later iterations.

Here are some proposed changes:

  1. MINOR - Change the ABC class name from Cache to something more specific. Cache itself is too generic and may be confusing.
  2. MINOR - Keep the default value of semantic_cache_enabled in config.yaml as false.
  3. MAJOR - We still need a log file for queries that hit the cache. The log should have a bool (something like semantic_cache_hit = T/F) that captures whether the query hit the cache. It should still carry the same retrieved_chunks and config info as the original query whose answer it reuses (the one in the cache). This requires some thought: (1) Do you copy the log of the original query and set semantic_cache_hit to T? (2) Do you just reference the log file of the query that was used from the cache? (3) Is there a better way to do this? Either way, it is important that each query generates a log file regardless of whether it hits the cache.
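
One possible shape for point 3, combining options (1) and (2): copy the chunk and config info from the cached query's log and also keep a reference to it. This is a sketch only. Apart from `semantic_cache_hit`, `retrieved_chunks`, and `config`, every field name here is an assumption, and the real structure will depend on the logging changes in PR 102.

```python
from typing import Any, Dict, List, Optional


def build_query_log(
    query_id: str,
    question: str,
    answer: str,
    retrieved_chunks: List[Any],
    config: Dict[str, Any],
    source_log: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build one log record per query, whether or not it hit the cache.

    On a cache hit, pass the cached query's record as `source_log`; its
    retrieved_chunks and config are copied and its id is kept as a reference.
    """
    hit = source_log is not None
    return {
        "query_id": query_id,
        "question": question,
        "answer": answer,
        "semantic_cache_hit": hit,
        "source_query_id": source_log["query_id"] if hit else None,
        "retrieved_chunks": source_log["retrieved_chunks"] if hit else retrieved_chunks,
        "config": source_log["config"] if hit else config,
    }
```

Copying the fields keeps each log file self-contained, while the hypothetical `source_query_id` still lets a reader trace which earlier query produced the answer.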

Note: PR 102 changes the logging structure. I am hoping it will be reviewed and merged by tomorrow, so ideally you can resolve point 3 (log file creation) after that PR is merged.

@shahmeer99 shahmeer99 merged commit f146e92 into main Apr 17, 2026
1 check passed

Labels

enhancement New feature or request

3 participants