Original Task
Citing from the original course task:
Training a strong Hebrew Sentence Encoder from a pretrained Decoder
While recent years have brought many additions to the open-source set of pretrained LMs in high-resource
languages such as English, most of these tools are not directly useful on Hebrew inputs. Recently, a new
project aiming to bridge this gap has introduced new tools and, most importantly, benchmarks for Hebrew LMs.
Concurrently, some strong new open-source models have been trained on Hebrew text, most recently
DictaLM 2.0. In this project, you will modify the DictaLM model to be a strong Encoder model using
the LLM2Vec method. To evaluate the result, you will train a linear classifier for a Hebrew sentiment
analysis task on top of embeddings from your trained model, and compare it against some baselines. Such
baselines can be strong English and multilingual pretrained models, and existing pretrained Hebrew encoders
(for example, AlephBERT and AlephBERTGimmel).
See this GitHub issue - huggingface/sentence-transformers#2547 (comment)
And read - https://huggingface.co/docs/setfit/conceptual_guides/setfit#classifier-training-phase
Data
Hebrew sentiment analysis dataset - https://huggingface.co/datasets/HebArabNlpProject/HebrewSentiment
We use this dataset because the Hebrew sentiment benchmarks used by AlephBERT and similar models were shown to be leaked.
Classifier
We trained a Logistic Regression classifier on top of the model embeddings, an approach recommended in:
- Mistral's classifier tutorial (our model is Mistral-based) - https://github.com/mistralai/mistral-inference/blob/main/tutorials/classifier.ipynb
- Jay Alammar's sentence classification notebook - https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/nlp/03_Sentence_Classification_with_BERT.ipynb
- The book by Hugging Face's creators - Natural Language Processing with Transformers, Revised Edition, chapter 2
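The classifier step can be sketched roughly as follows. This is a minimal illustration, not our actual training script. It assumes the embeddings have already been computed and stored as a NumPy array, and it substitutes synthetic Gaussian clusters for real model embeddings so that it runs on its own. The helper names (`make_fake_embeddings`, `train_linear_probe`), the three-class setup, the embedding dimension, and the 80/20 split are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def make_fake_embeddings(n_per_class=100, dim=32, n_classes=3, seed=0):
    """Hypothetical stand-in for real sentence embeddings.

    Draws one Gaussian cluster per class. In the real pipeline, X would be
    the frozen encoder's sentence embeddings and y the sentiment labels.
    """
    rng = np.random.default_rng(seed)
    centers = rng.normal(scale=3.0, size=(n_classes, dim))
    X = np.vstack([c + rng.normal(size=(n_per_class, dim)) for c in centers])
    y = np.repeat(np.arange(n_classes), n_per_class)
    return X, y


def train_linear_probe(X, y, seed=0):
    """Fit Logistic Regression on embeddings and score it on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    # The encoder is not updated here; only the linear classifier is trained.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    return clf, acc


X, y = make_fake_embeddings()
clf, acc = train_linear_probe(X, y)
print(f"held-out accuracy: {acc:.3f}")
```

Because only the linear layer is trained, the resulting score mostly reflects the quality of the embeddings themselves, which is what makes this setup suitable for comparing encoders against baselines.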