An overview of the LLM2Vec methodology and of the items that need to be completed in order to apply LLM2Vec to dicta-il/dictalm2.0.
Overview
LLM2Vec is a recipe for converting a decoder-only LLM into an embedding model (encoder model). Given a base decoder-only LLM, the recipe is as follows:
- Enable bidirectional attention
- Finetune on the MNTP (masked next token prediction) task - yields a word-level encoder model (like BERT)
- Unsupervised contrastive-learning training - yields a sentence-level encoder model (like SBert)
Additionally, in the paper the authors also apply a supervised approach that replaces step 3 (supervised contrastive learning) and evaluate its performance as well.
As a result, starting from a base model (for example, mistralai/Mistral-7B-Instruct-v0.2), the authors created 3 new models:
- McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp - By applying steps 1+2, creating the base model, which is suitable for word-level embeddings
- McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-unsup-simcse - By finetuning the base model with step 3 (unsupervised contrastive learning), creating a model suitable for sentence-level embeddings
- McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-supervised - By replacing step 3 with the supervised contrastive-learning approach
1. Enabling bidirectional attention
As decoder-only LLMs use a causal attention mask (that is, each token attends only to previous tokens), the first step is to replace the additive mask (a matrix with minus infinity above the diagonal) with an all-zeros matrix (equivalently, an all-ones 0/1 attention mask), so that each token can attend to every other token.
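A minimal NumPy sketch of this step, assuming the additive-mask convention (masked positions get minus infinity added to the attention scores before the softmax); `attention_weights` is an illustrative helper, not part of the LLM2Vec codebase, which instead patches the attention modules of the transformer itself:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(scores, bidirectional=False):
    """Turn raw (n, n) attention scores for one head into weights.

    causal:        -inf above the diagonal blocks attention to future tokens
    bidirectional: an all-zeros additive mask blocks nothing (LLM2Vec step 1)
    """
    n = scores.shape[0]
    if bidirectional:
        mask = np.zeros((n, n))
    else:
        mask = np.triu(np.full((n, n), -np.inf), k=1)
    return softmax(scores + mask)

# With uniform scores, the causal weights are lower-triangular,
# while the bidirectional weights spread over all positions.
w_causal = attention_weights(np.zeros((4, 4)))
w_bidir = attention_weights(np.zeros((4, 4)), bidirectional=True)
```

Note that flipping the mask alone does not make a good encoder yet; the model was never trained to use future context, which is exactly what the MNTP step below addresses.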
2. Finetune on MNTP (Masked next token prediction) task
A finetuning phase that closely resembles MLM (the masked language modeling task), but with a slight difference:
- In traditional next-token prediction, we predict the next token from the activations of the previous token.
- In masked language modeling, we mask out a portion of the input tokens (the rest are left intact) and attempt to recover them. The attention is bidirectional - tokens can be influenced both by the future and by the past.
- MNTP is a mixture of the two approaches: we mask out a portion of the input tokens as in MLM, but we recover each masked token from the activations of the previous token, as in NTP.
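A minimal sketch of how MNTP training examples could be built, assuming a toy tokenizer; `MASK_ID`, the masking probability, and the `-100` ignore value are illustrative choices (the `-100` follows the common convention for label positions excluded from the loss):

```python
import random

MASK_ID = 0  # hypothetical id of the [MASK] token in a toy vocabulary

def mntp_inputs_and_labels(token_ids, mask_prob=0.3, seed=0, ignore=-100):
    """Build an MNTP training pair: mask tokens as in MLM, but place each
    masked token's label one position earlier, so it is predicted from the
    *previous* token's activations, as in next-token prediction."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [ignore] * len(token_ids)
    for i in range(1, len(token_ids)):  # position 0 has no previous token
        if rng.random() < mask_prob:
            labels[i - 1] = inputs[i]   # predict token i from position i-1
            inputs[i] = MASK_ID         # replace token i with [MASK]
    return inputs, labels

orig = [10, 11, 12, 13, 14, 15]
inputs, labels = mntp_inputs_and_labels(orig)
```

The shift of the label to position i-1 is the only change relative to a standard MLM collator; the loss itself is still cross-entropy over the masked positions.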
3. Unsupervised contrastive-learning training
Similar to the way SBert was trained, but in an unsupervised way:
- We generate two representations of a given passage using different dropout masks
- We randomly sample a negative item
- We train the model to predict higher similarity for the positive samples and lower similarity for the negative samples
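A minimal NumPy sketch of this (SimCSE-style) objective, where the negative items are simply the other passages in the batch; `embed_with_dropout` is a stand-in for a real forward pass with dropout, and the temperature value is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_with_dropout(x, p=0.1):
    """Stand-in for a model forward pass with dropout: randomly zero features.
    Two calls on the same input yield two different 'views' of each passage."""
    keep = rng.random(x.shape) >= p
    return np.where(keep, x, 0.0) / (1 - p)

def simcse_loss(z1, z2, tau=0.05):
    """Contrastive objective: for passage i, its second view z2[i] is the
    positive; the other passages in the batch serve as negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sims = z1 @ z2.T / tau  # (batch, batch) scaled cosine similarities
    # cross-entropy with the diagonal (the positive pairs) as the target
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

x = rng.normal(size=(4, 8))  # toy batch: 4 passage embeddings, 8 dims
z1, z2 = embed_with_dropout(x), embed_with_dropout(x)
loss = simcse_loss(z1, z2)
```

Minimizing this loss pulls the two dropout views of the same passage together while pushing apart the representations of different passages, which is what turns the word-level MNTP model into a sentence-level embedder.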