Semantic Caching with Bi-Direction Encoder and Cross Encoder #76
@@ -0,0 +1,159 @@
from yaml import Node
# Global cache and constants
# -----------------------------
SEMANTIC_CACHE: Dict[str, Deque[Dict[str, Any]]] = {}
SEMANTIC_CACHE_THRESHOLD = 0.85

# Step 1: Get chunks (golden, retrieved, or none)

normalized_question = normalize_question(question)
Why are these definitions computed when cfg.semantic_cache_enabled is not True? This is wasted computation whenever semantic caching is turned off.
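A minimal sketch of the guard being asked for, assuming the normalized question is only needed for the cache lookup. `Config` and `prepare_cache_key` are hypothetical names used for illustration, and the body of `normalize_question` is a placeholder, not the PR's actual helper:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Config:
    # Hypothetical stand-in for the project's config object.
    semantic_cache_enabled: bool = False


def normalize_question(question: str) -> str:
    # Placeholder normalization (lowercase, collapse whitespace);
    # the real helper in the PR may do something different.
    return " ".join(question.lower().split())


def prepare_cache_key(question: str, cfg: Config) -> Optional[str]:
    """Compute the cache key only when semantic caching is enabled."""
    if not cfg.semantic_cache_enabled:
        # Skip all cache-related work when the feature is off.
        return None
    return normalize_question(question)
```

With this shape, every cache-only computation sits behind the flag, so a disabled cache costs a single boolean check.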
shahmeer99 left a comment
I have merged the current main and added some comments in main to make it clearer (those were my second-to-last and third-to-last commits in this branch). I have also updated the testing (my last commit) to cover the concern (CONCERN 1) I explain below. The last commit adds:
- "trick_variations" in the cache benchmark YAML. These were partially model-generated and partially written by me. They are not the best at the moment and you should update them later (as will I). The idea is to check whether the cache returns a hit for a variation that sounds very similar to the original question but in reality does NOT have the same answer (such a hit is a false positive and results in a wrong answer). The script has been updated to reflect this part of the test as well. Your original hit rate metric remains as is (only the printed output has changed). I want to EMPHASIZE that the trick_variations were quickly created by me and Claude, and we need to come up with better trick_variation cases to really test this problem.
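For reference, the two metrics described above can be sketched roughly as follows. The case layout (`question`, `variations`, `trick_variations`) and the `is_hit` callable are assumptions for illustration; the actual YAML schema and benchmark script in this branch may differ:

```python
from typing import Callable, Dict, List


def score_cache_benchmark(
    cases: List[Dict],
    is_hit: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Return the hit rate on paraphrases and the false positive rate on trick variations.

    `is_hit(cached_question, query)` is a hypothetical hook that reports
    whether the cache would return the cached answer for `query`.
    """
    hits = total = false_positives = trick_total = 0
    for case in cases:
        original = case["question"]
        # Paraphrases share the answer, so a hit is the desired outcome.
        for variation in case.get("variations", []):
            total += 1
            hits += bool(is_hit(original, variation))
        # Trick variations have a different answer, so a hit is a false positive.
        for trick in case.get("trick_variations", []):
            trick_total += 1
            false_positives += bool(is_hit(original, trick))
    return {
        "hit_rate": hits / total if total else 0.0,
        "false_positive_rate": false_positives / trick_total if trick_total else 0.0,
    }
```

Reporting the two numbers separately keeps the original hit rate metric intact while making false positives visible.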
CONCERN 1: how to choose optimal thresholds?
Where do these two values come from?
BI_ENCODER_THRESHOLD = 0.40
CROSS_ENCODER_THRESHOLD = 0.75
These thresholds essentially determine what constitutes a hit in the semantic cache. With them we run the risk of either returning a completely irrelevant answer on a false positive hit (the worst-case scenario!) or not getting any hits at all (wasted computation). We need a comprehensive benchmark and test suite to make sure we hit a sweet spot between these two issues. Could you elaborate on how you chose these two values to begin with?
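To make the role of the two thresholds concrete, here is a rough sketch of a two-stage lookup, assuming the bi-encoder score is used as a cheap prefilter and the cross-encoder score as the final accept/reject decision. The function name, the entry layout, and the injected scoring callables are hypothetical; the PR's actual lookup logic may be structured differently:

```python
from typing import Any, Callable, Dict, List, Optional

BI_ENCODER_THRESHOLD = 0.40
CROSS_ENCODER_THRESHOLD = 0.75

Score = Callable[[str, str], float]


def two_stage_lookup(
    query: str,
    entries: List[Dict[str, Any]],
    bi_score: Score,
    cross_score: Score,
) -> Optional[Dict[str, Any]]:
    """Return the best cached entry that passes both thresholds, or None."""
    # Stage 1: cheap prefilter. A low threshold keeps recall high.
    candidates = [
        entry for entry in entries
        if bi_score(query, entry["question"]) >= BI_ENCODER_THRESHOLD
    ]
    # Stage 2: expensive rerank. A high threshold guards against false positives.
    best, best_score = None, CROSS_ENCODER_THRESHOLD
    for entry in candidates:
        score = cross_score(query, entry["question"])
        if score >= best_score:
            best, best_score = entry, score
    return best
```

In this arrangement, raising CROSS_ENCODER_THRESHOLD trades hits for fewer false positives, while BI_ENCODER_THRESHOLD mainly controls how many candidates reach the expensive stage.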
@sureshkumarsrinath The idea is good and I think this feature will help immensely in the long run. But we need better testing of what makes a semantic cache good (threshold values, etc.). I have left a detailed comment explaining my changes, my concerns, and some future steps (and also a few small syntax comments here and there). One more thing beyond that: have you considered a ColBERT approach (MaxSim at the token level across the two queries)? It is a bit expensive, but it can be heavily parallelized and could potentially be used at the final stage. I am not sure whether it will help or not.
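For context, a MaxSim-style score between two queries can be sketched in plain Python as below, assuming per-token embedding vectors are already available from some encoder. ColBERT itself sums the per-token maxima; averaging here is my own choice so the score stays in a fixed range that is easier to threshold:

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _cosine(u: Vector, v: Vector) -> float:
    # Cosine similarity, returning 0.0 for zero-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def maxsim_score(query_tokens: List[Vector], cached_tokens: List[Vector]) -> float:
    """Average, over query tokens, of the best cosine match among cached tokens."""
    if not query_tokens or not cached_tokens:
        return 0.0
    total = sum(
        max(_cosine(q, c) for c in cached_tokens)
        for q in query_tokens
    )
    return total / len(query_tokens)
```

Note that the score is asymmetric: every token of the incoming query has to find a match in the cached query, but not the other way around.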
I ran an exhaustive benchmark across several combinations of the threshold values against an adversarial dataset to find the sweet spot between accuracy (correctly retrieving paraphrased queries) and false positive rate (returning entirely wrong answers). Serving an irrelevant answer because of a false positive match is not acceptable and undermines the user experience, so I have intentionally optimized the system for safety over aggressive caching. I have set both values high to ensure a 0% false positive rate.
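A sweep of this kind could be sketched as follows. This is an illustration only, not the benchmark script from this PR: it assumes each labeled pair has already been scored by both encoders, and the function names and tuple layout are hypothetical:

```python
from itertools import product
from typing import Dict, Iterable, List, Optional, Tuple

# (bi_encoder_score, cross_encoder_score, should_hit) for one query pair.
ScoredPair = Tuple[float, float, bool]


def sweep_thresholds(
    pairs: List[ScoredPair],
    bi_grid: Iterable[float],
    cross_grid: Iterable[float],
) -> List[Dict[str, float]]:
    """Evaluate hit rate and false positive rate for every threshold combination."""
    results = []
    for bi_t, cross_t in product(bi_grid, cross_grid):
        true_pos = false_pos = positives = negatives = 0
        for bi_s, cross_s, should_hit in pairs:
            hit = bi_s >= bi_t and cross_s >= cross_t
            if should_hit:
                positives += 1
                true_pos += hit
            else:
                negatives += 1
                false_pos += hit
        results.append({
            "bi": bi_t,
            "cross": cross_t,
            "hit_rate": true_pos / positives if positives else 0.0,
            "fpr": false_pos / negatives if negatives else 0.0,
        })
    return results


def safest_setting(results: List[Dict[str, float]]) -> Optional[Dict[str, float]]:
    """Among settings with zero false positives, pick the highest hit rate."""
    zero_fp = [r for r in results if r["fpr"] == 0.0]
    return max(zero_fp, key=lambda r: r["hit_rate"]) if zero_fp else None
```

Selecting only among zero-false-positive settings encodes the safety-first choice described above: hit rate is maximized only after wrong answers have been ruled out on the test set.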
This is just the baseline cache implementation for the system. Further improvements will be added as flag-enabled implementations.
use_double_prompt: false
enable_history: true
max_history_turns: 3
semantic_cache_enabled: true
@sureshkumarsrinath This looks good to me. Thanks for considering the extra false positive testing. Like you said, it's a great start for this feature and can be improved upon in later iterations. Here are some proposed changes:
Note: PR 102 changes the logging structure. I am hoping it will be reviewed and merged by tomorrow, so ideally you can resolve point 3 (log file creation) after that PR is merged.
Adding a simple Cross-Encoder similarity-based semantic cache with cache benchmark tests. This is a simple benchmark to establish the baseline for cache performance.