Update lighteval fork by ljvmiranda921 · Pull Request #29 · filbench/lighteval

ljvmiranda921 · 2025-05-11T02:34:40Z

Merge after submission

Enables the evaluation of any system in the user's control. Fixes [Issue 430](#430).

Try with

```
python -m lighteval custom google-translate /path/to/google_translate_model.py "lighteval|wmt20:fr-de|0|0" --max-samples 10
```

google_translate_model.py

```
import logging
from typing import Optional

from tqdm import tqdm
from transformers import AutoTokenizer

from lighteval.data import GenerativeTaskDataset
from lighteval.models.abstract_model import LightevalModel, ModelInfo
from lighteval.models.model_output import (
    GenerativeResponse,
    LoglikelihoodResponse,
    LoglikelihoodSingleTokenResponse,
)
from lighteval.tasks.requests import (
    GreedyUntilRequest,
    LoglikelihoodRequest,
    LoglikelihoodRollingRequest,
    LoglikelihoodSingleTokenRequest,
)


logger = logging.getLogger(__name__)


class GoogleTranslateClient(LightevalModel):
    def __init__(self, config, env_config) -> None:
        self.model = config.model
        self.model_definition_file_path = config.model_definition_file_path
        self.model_info = ModelInfo(
            model_name=config.model,
            model_sha="",
            model_dtype=None,
            model_size="",
        )
        self._tokenizer = AutoTokenizer.from_pretrained("gpt2")  # Use a dummy tokenizer for compatibility

        import httpcore

        # Needed to fix some googletrans bug
        # https://stackoverflow.com/questions/72796594/attributeerror-module-httpcore-has-no-attribute-synchttptransport#comment136664963_77334618
        setattr(httpcore, 'SyncHTTPTransport', 'AsyncHTTPProxy')

        from googletrans import Translator

        self.translator = Translator()

    def greedy_until(
        self,
        requests: list[GreedyUntilRequest],
        override_bs: Optional[int] = None,
    ) -> list[GenerativeResponse]:
        """
        Generates responses using a greedy decoding strategy until certain ending conditions are met.

        Args:
            requests (list[Request]): list of requests containing the context and ending conditions.
            disable_tqdm (bool, optional): Whether to disable the progress bar. Defaults to False.
            override_bs (int, optional): Override the batch size for generation. Defaults to None.

        Returns:
            list[GenerativeResponse]: list of generated responses.
        """
        for request in requests:
            request.tokenized_context = self.tok_encode(request.context)

        dataset = GenerativeTaskDataset(requests=requests, num_dataset_splits=self.DATASET_SPLITS)
        results = []

        for _ in tqdm(
            dataset.splits_start_end_iterator(),
            total=dataset.num_dataset_splits,
            desc="Splits",
            position=0,
            disable=False,  # self.disable_tqdm,
        ):
            for r in tqdm(dataset, desc="Batch", position=1, disable=False):
                context = r.context.replace("French phrase: ", "")
                # TODO: Get src and dest from request
                translation = self.translator.translate(context, src='fr', dest='de')

                result = translation.text
                cur_response = GenerativeResponse(
                    result=result,
                    logits=None,
                    generated_tokens=[],
                    input_tokens=[],
                )
                results.append(cur_response)

        return dataset.get_original_order(results)

    @property
    def tokenizer(self):
        return self._tokenizer

    def tok_encode(self, text: str):
        return self.tokenizer.encode(text)

    @property
    def add_special_tokens(self) -> bool:
        return False

    @property
    def max_length(self) -> int:
        """Return the maximum sequence length of the model."""
        return 4096

    def loglikelihood(
        self, requests: list[LoglikelihoodRequest], override_bs: Optional[int] = None
    ) -> list[LoglikelihoodResponse]:
        """Tokenize the context and continuation and compute the log likelihood of those
        tokenized sequences.
        """
        raise NotImplementedError

    def loglikelihood_rolling(
        self, requests: list[LoglikelihoodRollingRequest], override_bs: Optional[int] = None
    ) -> list[LoglikelihoodResponse]:
        """This function is used to compute the log likelihood of the context for perplexity metrics."""
        raise NotImplementedError

    def loglikelihood_single_token(
        self, requests: list[LoglikelihoodSingleTokenRequest], override_bs: Optional[int] = None
    ) -> list[LoglikelihoodSingleTokenResponse]:
        """Tokenize the context and continuation and compute the log likelihood of those
        tokenized sequences.
        """
        raise NotImplementedError
```
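The `# TODO: Get src and dest from request` in the example hard-codes `fr` and `de`. One possible way to lift that, sketched under the assumption that the language pair can be read off a task name shaped like `wmt20:fr-de`, as in the command above. The helper `parse_language_pair` is hypothetical and not part of lighteval; how the task name would reach the model class is left open here.

```python
def parse_language_pair(task_spec: str) -> tuple[str, str]:
    """Return (src, dest) from a spec such as "lighteval|wmt20:fr-de|0|0".

    Hypothetical helper: assumes the task is named "<benchmark>:<src>-<dest>".
    """
    # Accept either the bare task name or the full pipe-separated spec.
    fields = task_spec.split("|")
    task_name = fields[1] if len(fields) > 1 else fields[0]

    # Everything after the first ":" is expected to be the language pair.
    _, colon, pair = task_name.partition(":")
    src, dash, dest = pair.partition("-")
    if not (colon and dash and src and dest):
        raise ValueError(f"no '<src>-<dest>' language pair in {task_spec!r}")
    return src, dest
```

With such a helper, the translate call could become `self.translator.translate(context, src=src, dest=dest)` instead of using fixed language codes.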

* lift protobuf restriction
* fix typos
* remove incorrect comment
* initial attempt to fix the from_model() method
* ensure input tensors are moved to proper device
* remove conditional, since device should never be None here

---------

Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>

This PR aims to make iterating over splits a bit more intuitive, at least in my opinion. Open to feedback though! If the current behavior was intentional, feel free to close.

---------

Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>

* add smolm generative tasks
* add jeopardy
* pretty 🥰
* consistent stop sequences
* add versions + change names

---------

Co-authored-by: Hynek Kydlicek <kydlicek.hynek@huggingface.co>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>

* adding relevant languages + sorting everyone
* added chrf++
* updated language list to rm redundancies, renamed task to follow usual pattern
* fix adapter plus manage languages simply - in the future we might want to have a custom enum with one key several values

## Pull Request Overview

This PR introduces support for passing custom keyword arguments when loading pretrained transformer models, enabling more flexible configuration of model loading. It also replaces the fixed "generation_size" parameter with a more general "model_loading_kwargs" field.

- Removed the fixed generation_size parameter.
- Added a new model_loading_kwargs field to the configuration.
- Updated the auto model creation to copy the provided kwargs.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* suggestion from copilot

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
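A minimal sketch of the pattern described in this overview, not the actual lighteval code. The class `ModelConfigSketch` and the helper `build_load_kwargs` are hypothetical names; only the `model_loading_kwargs` field name and the "copy the provided kwargs" behavior come from the description. Letting user-supplied values win over defaults is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ModelConfigSketch:
    """Hypothetical stand-in for a model config with free-form loading kwargs."""

    model_name: str
    model_loading_kwargs: dict[str, Any] = field(default_factory=dict)


def build_load_kwargs(config: ModelConfigSketch, **defaults: Any) -> dict[str, Any]:
    """Merge defaults with the user's kwargs without mutating the config."""
    # Copy first, so later changes to the result never leak back into the config.
    kwargs = dict(config.model_loading_kwargs)
    for key, value in defaults.items():
        # Assumption: values supplied by the user take precedence over defaults.
        kwargs.setdefault(key, value)
    return kwargs


config = ModelConfigSketch("some-org/some-model", {"attn_implementation": "eager"})
load_kwargs = build_load_kwargs(config, torch_dtype="auto")
# The merged dict could then be unpacked into a loader call, for example
# AutoModelForCausalLM.from_pretrained(config.model_name, **load_kwargs).
```

Copying before merging matters when the same config object is reused to create several models, since each load then starts from the unmodified user input.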

## Pull Request Overview

This PR adds support for using a custom template to determine where evaluation results are saved. The changes include adding a new parameter "results_path_template" across multiple main modules and updating the EvaluationTracker to honor this template for saving results; the associated tests and documentation have been updated accordingly.

- Added a new test for the custom results template.
- Extended CLI options in several main modules to accept "results_path_template".
- Updated EvaluationTracker logic and documentation to reflect the new functionality.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
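To illustrate how a results path template of this kind might be resolved, here is a hedged sketch. The function `resolve_results_dir` is hypothetical, and both the placeholder names (`output_dir`, `org`, `model`) and the fallback directory layout are assumptions of this example rather than details stated in the overview.

```python
from pathlib import Path
from typing import Optional


def resolve_results_dir(
    output_dir: str,
    model_name: str,
    results_path_template: Optional[str] = None,
) -> Path:
    """Return the directory where results for `model_name` would be written."""
    # Split "org/model" on the last slash; org is "" when there is no slash.
    org, _, model = model_name.rpartition("/")
    if results_path_template is None:
        # Assumed fallback layout, used only when no template is given.
        return Path(output_dir) / "results" / org / model
    # Fill the assumed placeholders with plain str.format.
    return Path(results_path_template.format(output_dir=output_dir, org=org, model=model))


custom_dir = resolve_results_dir(
    "out", "HuggingFaceTB/SmolVLM-Instruct", "{output_dir}/runs/{org}_{model}"
)
```

A template gives users one flat directory per model, or any other layout, without changing the tracker code itself.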

NathanHB and others added 30 commits April 28, 2025 13:32

fix slow tests (#689)

818a2cf

Tests on the AWS runner were hanging; the culprit was multiprocessing when loading datasets.

fix wrong 'custom_task_directory' in python api doc (#671)

534a5b8

Co-authored-by: xgw <xinguang.wxg@alibaba-inc.com>
Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>

Fix yourbench custom task (#695)

71fec80

--------- Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>

Lift protobuf restriction (#683)

b07d418

* lift protobuf restriction
* fix typos
* remove incorrect comment

---------

Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>

fix gsm8k metric (#688)

c776694

Replaces the gsm8k metric and its stop words to yield better results.

Nathan adds wandb logging (#685)

96e885d

Add pass@1 for GPQA-D and MATH-500 (#698)

d50bc30

* Add pass@1 for GPQA-D and clean up AIME
* Add pass@1 for math_500
* Add pass@1 for MATH-500
* Update test
* Fix
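The commit does not show how pass@1 is computed. As a rough illustration, the sketch below uses the unbiased pass@k estimator that is common in code and math evaluation, which may differ from lighteval's own metric. The function name `pass_at_k` and its signature are assumptions of this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int = 1) -> float:
    """Estimate pass@k from n sampled answers of which c are correct.

    Uses 1 - C(n - c, k) / C(n, k): the probability that a random subset
    of k samples contains at least one correct answer.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every subset of size k has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# For k = 1 the estimator reduces to the fraction of correct samples.
score = pass_at_k(n=4, c=1, k=1)
```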

reorder authors (#699)

dfeb234

docs: improve consistency in punctuation of metric list (#605)

40626e7

Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>

[FIX] Inference providers (#701)

039debc

* added option to bill to org in inference providers
* removed tokenizer logging

Update README.md (#703)

9bf210c

* Update README.md
* Update README.md

fix typos (#702)

77efde0

better release notes (#704)

42c9c61

Bump dev version to 0.9.1.dev0 (#705)

20cff95

* Release: v0.9.0
* Bump dev version to 0.9.1.dev0

Patch 0.9.1 (#708)

c871456

* fix inference endpoints bugs
* Release: v0.9.1

add livecodebench v6 (#712)

7477de0

Fix tqdm logging (#711)

f7392fa

* fix tqdm logging
* suppress stderr draws
* add env var for custom tqdm behavior

Added support for quantization in vLLM backend (#690)

04a74a2

* Added support for quantization in vllm backend
* Fixed style issues

---------

Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>

Fix revision arg for vLLM tokenizer (#721)

d3da6b9

* Fix revision arg for vLLM tokenizer
* Add unit test
* Update test
* Move test repo

Update README.md (#733)

f684d35

fix litellm (#736)

a590376

Update main_endpoint.py (#739)

d18f11a

Fix #738

Adds multimodal support and MMMU pro (#675)

1607dc1

```
uv run lighteval accelerate "model_name=HuggingFaceTB/SmolVLM-Instruct" "lighteval|mmmu_pro|0|0" --use-chat-template --vision-model
```

---------

Co-authored-by: qubvel <qubvel@gmail.com>

Fix task metric type mismatch (#743)

3bb8a50

* fixing task metric type
* fix style

Add MCQ support to Yourbench evaluation (#734)

317cb50

* Add MCQ support to Yourbench evaluation

---------

Co-authored-by: Hynek Kydlíček <kydlicek.hynek@gmail.com>

NathanHB and others added 4 commits May 21, 2025 10:13

fix custom model example (#766)

cce0bfc

add dependencies to run after pip install (#767)

2651750

ljvmiranda921 merged commit 5791574 into filbench:main May 21, 2025

ljvmiranda921 merged 34 commits into filbench:main from huggingface:main

15 participants
