Evaluation #8

@omriel1

Original Task

Citing from the original course task:

Training a strong Hebrew Sentence Encoder from a pretrained Decoder. While recent years
have brought many additions to the open-source set of pretrained LMs in high-resource languages such
as English, most of these tools are not directly useful for Hebrew inputs. Recently, a new project
aiming to bridge this gap has introduced new tools and, most importantly, benchmarks for Hebrew LMs.
Concurrently, some new open-source strong models have been trained on Hebrew text, most recently
DictaLM 2.0. In this project, you will modify the DictaLM model to be a strong Encoder-model using
the LLM2Vec method. To evaluate the result, you will train a linear classifier for a Hebrew sentiment
analysis task on top of embeddings from your trained model, and compare it against some baselines. Such baselines
can be strong English and multilingual pretrained models, and existing pretrained Hebrew encoders (for
example, AlephBERT and AlephBERTGimmel).

See this GitHub issue: huggingface/sentence-transformers#2547 (comment)
Also read: https://huggingface.co/docs/setfit/conceptual_guides/setfit#classifier-training-phase

Data

Hebrew sentiment analysis dataset - https://huggingface.co/datasets/HebArabNlpProject/HebrewSentiment

We use this dataset because the Hebrew sentiment benchmarks used for AlephBERT and similar models were shown to be leaked.

Classifier

We follow the commonly recommended approach of training a logistic regression classifier on top of the model embeddings, as described in:

  1. The Mistral classifier tutorial (our model is Mistral-based) - https://github.com/mistralai/mistral-inference/blob/main/tutorials/classifier.ipynb
  2. Jay Alammar's notebook on sentence classification with BERT - https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/nlp/03_Sentence_Classification_with_BERT.ipynb
  3. The book by members of the Hugging Face team - Natural Language Processing with Transformers, Revised Edition, chapter 2
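
A minimal sketch of this evaluation setup, assuming scikit-learn is used for the classifier. The encoder is not called here: `make_placeholder_embeddings` is a hypothetical stand-in that produces synthetic vectors, and the three classes, the embedding size, the 80/20 split and the feature scaling step are all assumptions made for illustration, not details from this issue. In a real run the embeddings would come from the model under evaluation and the labels from the HebrewSentiment dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_placeholder_embeddings(n_samples=600, dim=32, n_classes=3, seed=0):
    """Hypothetical stand-in for encoder output: one Gaussian cluster per class."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, n_classes, size=n_samples)
    centers = rng.normal(scale=3.0, size=(n_classes, dim))
    embeddings = centers[labels] + rng.normal(size=(n_samples, dim))
    return embeddings, labels


def evaluate_linear_probe(embeddings, labels, seed=0):
    """Fit logistic regression on frozen embeddings and score a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, labels, test_size=0.2, stratify=labels, random_state=seed
    )
    # Scaling is an assumption: it usually helps the solver converge.
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, preds),
        "macro_f1": f1_score(y_test, preds, average="macro"),
    }


if __name__ == "__main__":
    X, y = make_placeholder_embeddings()
    print(evaluate_linear_probe(X, y))
```

Because only the classifier is trained, the same function can be applied unchanged to embeddings from each baseline encoder, which keeps the comparison between models about the embeddings rather than the classifier.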
