Update lighteval fork #29
Merged
Tests on the AWS runner were hanging; the culprit was multiprocessing when loading datasets.
Co-authored-by: xgw <xinguang.wxg@alibaba-inc.com> Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
--------- Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
* lift protobuf restriction * fix typos * remove incorrect comment --------- Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Replaces the metric and the stop words for gsm8k to yield better results.
* Add pass@1 for GPQA-D and clean up AIME * Add pass@1 for math_500 * Add pass@1 for MATH-500 * Update test * Fix
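For readers unfamiliar with the metric, the snippet below is a minimal sketch of the commonly used unbiased pass@k estimator; the commit message does not say how lighteval computes pass@1 internally, so the function name `pass_at_k` and its signature are illustrative assumptions, not the library's API. For k = 1 the estimate reduces to the fraction of correct samples.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Hypothetical helper for illustration; not taken from lighteval.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples are wrong, every draw of k contains a correct one.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples are wrong), drawing without replacement.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 4 samples and 2 correct, pass@1 is simply 2/4.
print(pass_at_k(n=4, c=2, k=1))  # 0.5
```

Sampling several generations per problem and averaging this value over problems gives a less noisy pass@1 than a single greedy generation would.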
Enables the evaluation of any system in the user's control. Fixes [Issue 430](#430).

Try with

```
python -m lighteval custom google-translate /path/to/google_translate_model.py "lighteval|wmt20:fr-de|0|0" --max-samples 10
```

google_translate_model.py

```
import logging
from typing import Optional

from tqdm import tqdm
from transformers import AutoTokenizer

from lighteval.data import GenerativeTaskDataset
from lighteval.models.abstract_model import LightevalModel, ModelInfo
from lighteval.models.model_output import (
    GenerativeResponse,
    LoglikelihoodResponse,
    LoglikelihoodSingleTokenResponse,
)
from lighteval.tasks.requests import (
    GreedyUntilRequest,
    LoglikelihoodRequest,
    LoglikelihoodRollingRequest,
    LoglikelihoodSingleTokenRequest,
)

logger = logging.getLogger(__name__)


class GoogleTranslateClient(LightevalModel):
    def __init__(self, config, env_config) -> None:
        self.model = config.model
        self.model_definition_file_path = config.model_definition_file_path
        self.model_info = ModelInfo(
            model_name=config.model,
            model_sha="",
            model_dtype=None,
            model_size="",
        )
        self._tokenizer = AutoTokenizer.from_pretrained("gpt2")  # Use a dummy tokenizer for compatibility

        import httpcore

        # Needed to fix some googletrans bug
        # https://stackoverflow.com/questions/72796594/attributeerror-module-httpcore-has-no-attribute-synchttptransport#comment136664963_77334618
        setattr(httpcore, 'SyncHTTPTransport', 'AsyncHTTPProxy')

        from googletrans import Translator

        self.translator = Translator()

    def greedy_until(
        self,
        requests: list[GreedyUntilRequest],
        override_bs: Optional[int] = None,
    ) -> list[GenerativeResponse]:
        """
        Generates responses using a greedy decoding strategy until certain ending conditions are met.

        Args:
            requests (list[Request]): list of requests containing the context and ending conditions.
            disable_tqdm (bool, optional): Whether to disable the progress bar. Defaults to False.
            override_bs (int, optional): Override the batch size for generation. Defaults to None.

        Returns:
            list[GenerativeResponse]: list of generated responses.
        """
        for request in requests:
            request.tokenized_context = self.tok_encode(request.context)

        dataset = GenerativeTaskDataset(requests=requests, num_dataset_splits=self.DATASET_SPLITS)
        results = []

        for _ in tqdm(
            dataset.splits_start_end_iterator(),
            total=dataset.num_dataset_splits,
            desc="Splits",
            position=0,
            disable=False,  # self.disable_tqdm,
        ):
            for r in tqdm(dataset, desc="Batch", position=1, disable=False):
                # Strip the task's prompt prefix so only the source sentence is translated
                context = r.context.replace("French phrase: ", "")
                # TODO: Get src and dest from request
                translation = self.translator.translate(context, src='fr', dest='de')
                result = translation.text
                cur_response = GenerativeResponse(
                    result=result,
                    logits=None,
                    generated_tokens=[],
                    input_tokens=[],
                )
                results.append(cur_response)

        return dataset.get_original_order(results)

    @property
    def tokenizer(self):
        return self._tokenizer

    def tok_encode(self, text: str):
        return self.tokenizer.encode(text)

    @property
    def add_special_tokens(self) -> bool:
        return False

    @property
    def max_length(self) -> int:
        """Return the maximum sequence length of the model."""
        return 4096

    def loglikelihood(
        self, requests: list[LoglikelihoodRequest], override_bs: Optional[int] = None
    ) -> list[LoglikelihoodResponse]:
        """Tokenize the context and continuation and compute the log likelihood of those
        tokenized sequences.
        """
        raise NotImplementedError

    def loglikelihood_rolling(
        self, requests: list[LoglikelihoodRollingRequest], override_bs: Optional[int] = None
    ) -> list[LoglikelihoodResponse]:
        """This function is used to compute the log likelihood of the context for perplexity metrics."""
        raise NotImplementedError

    def loglikelihood_single_token(
        self, requests: list[LoglikelihoodSingleTokenRequest], override_bs: Optional[int] = None
    ) -> list[LoglikelihoodSingleTokenResponse]:
        """Tokenize the context and continuation and compute the log likelihood of those
        tokenized sequences.
        """
        raise NotImplementedError
```
* lift protobuf restriction * fix typos * remove incorrect comment * initial attempt to fix the from_model() method * ensure input tensors are moved to proper device * remove conditional, since device should never be None here --------- Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
This PR aims to make iterating over splits a bit more intuitive, at least in my opinion. Open to feedback though! If the current behavior was intentional, feel free to close. --------- Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
* added option to bill to org in inference providers * removed tokenizer logging
* Update README.md * Update README.md
* Release: v0.9.0 * Bump dev version to 0.9.1.dev0
* fix inference endpoints bugs * Release: v0.9.1
* fix tqdm logging * suppress stderr draws * add env var for custom tqdm behavior
* Added support for quantization in vllm backend * Fixed style issues --------- Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
* Fix revision arg for vLLM tokenizer * Add unit test * Update test * Move test repo
* add smolm generative tasks * add jeopardy * pretty 🥰 * consistent stop sequences * add versions + change names --------- Co-authored-by: Hynek Kydlicek <kydlicek.hynek@huggingface.co> Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
* adding relevant languages + sorting everyone * added chrf++ * updated language list to rm redundancies, renamed task to follow usual pattern * fix adapter plus manage languages simply - in the future we might want to have a custom enum with one key several values
```
uv run lighteval accelerate "model_name=HuggingFaceTB/SmolVLM-Instruct" "lighteval|mmmu_pro|0|0" --use-chat-template --vision-model
```

---------

Co-authored-by: qubvel <qubvel@gmail.com>
* fixing task metric type * fix style
* Add MCQ support to Yourbench evaluation --------- Co-authored-by: Hynek Kydlíček <kydlicek.hynek@gmail.com>
## Pull Request Overview

This PR introduces support for passing custom keyword arguments when loading pretrained transformer models, enabling more flexible configuration of model loading. It also replaces the fixed "generation_size" parameter with a more general "model_loading_kwargs" field.

- Removed the fixed generation_size parameter.
- Added a new model_loading_kwargs field to the configuration.
- Updated the auto model creation to copy the provided kwargs.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* suggestion from copilot

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
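The pattern described above (a free-form kwargs field that is copied before being forwarded to the loader) can be sketched as follows. This is an illustration only: `ModelConfigSketch`, `load_model`, and `fake_loader` are hypothetical names, and the stand-in loader replaces the real pretrained-model loading call so the snippet runs without any model download.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ModelConfigSketch:
    """Hypothetical config; only `model_loading_kwargs` mirrors the PR text."""

    model_name: str
    model_loading_kwargs: dict[str, Any] = field(default_factory=dict)


def load_model(config: ModelConfigSketch, loader: Callable[..., Any]) -> Any:
    # Copy the user-supplied kwargs so defaults added here never leak
    # back into the config object (the "copy the provided kwargs" step).
    kwargs = dict(config.model_loading_kwargs)
    kwargs.setdefault("trust_remote_code", False)
    return loader(config.model_name, **kwargs)


def fake_loader(name: str, **kwargs: Any) -> dict[str, Any]:
    """Stand-in for a real loading function; just echoes its arguments."""
    return {"name": name, **kwargs}


config = ModelConfigSketch("my-org/my-model", {"low_cpu_mem_usage": True})
loaded = load_model(config, fake_loader)
print(loaded)
print(config.model_loading_kwargs)  # unchanged by load_model
```

Copying matters because the same config object may be reused or serialized later; mutating its dictionary in place would silently change what gets logged or reloaded.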
## Pull Request Overview

This PR adds support for using a custom template to determine where evaluation results are saved. The changes include adding a new parameter "results_path_template" across multiple main modules and updating the EvaluationTracker to honor this template for saving results; the associated tests and documentation have been updated accordingly.

- Added a new test for the custom results template.
- Extended CLI options in several main modules to accept "results_path_template".
- Updated EvaluationTracker logic and documentation to reflect the new functionality.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
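As a rough illustration of how a results path template could be resolved, here is a minimal sketch. The PR text above does not list the supported placeholders or the default layout, so `{output_dir}`, `{org}`, `{model}`, the fallback path, and the function `resolve_results_dir` are all assumptions made for this example rather than the EvaluationTracker's actual behavior.

```python
from pathlib import Path
from typing import Optional


def resolve_results_dir(
    output_dir: str,
    model_name: str,
    template: Optional[str] = None,
) -> Path:
    """Hypothetical helper: pick the directory where results are written."""
    # Split "org/model" into its parts; org is "" when there is no slash.
    org, _, model = model_name.rpartition("/")
    if template is None:
        # Assumed fallback layout, for illustration only.
        return Path(output_dir) / "results" / model_name
    # Fill the assumed placeholders with plain str.format.
    return Path(template.format(output_dir=output_dir, org=org, model=model))


print(resolve_results_dir("out", "my-org/my-model"))
print(resolve_results_dir("out", "my-org/my-model", "{output_dir}/custom/{org}_{model}"))
```

A template keeps the directory layout under the user's control without adding a separate CLI flag for every path component.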
Merge after submission