
Update lighteval fork #29

Merged
ljvmiranda921 merged 34 commits into filbench:main from huggingface:main
May 21, 2025


@ljvmiranda921 ljvmiranda921 commented May 11, 2025

Merge after submission

NathanHB and others added 30 commits April 28, 2025 13:32
Tests on the AWS runner were hanging; the culprit was multiprocessing when loading datasets.
Co-authored-by: xgw <xinguang.wxg@alibaba-inc.com>
Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
---------

Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
* lift protobuf restriction

* fix typos

* remove incorrect comment

---------

Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Replaces the metric and the stop words for gsm8k to yield better results.
* Add pass@1 for GPQA-D and clean up AIME

* Add pass@1 for math_500

* Add pass@1 for MATH-500

* Update test

* Fix
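
For context, pass@1 is the probability that a single sampled answer is correct. Below is a minimal sketch of the commonly used unbiased pass@k estimator, given `n` samples of which `c` are correct. The function name `pass_at_k` is hypothetical, and this is not necessarily how lighteval computes the metric internally.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int = 1) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Hypothetical helper for illustration only; not lighteval's API.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples are wrong, every draw of k contains a correct one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `k=1` this reduces to the fraction of correct samples, `c / n`.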
Enables the evaluation of any system in the user's control. Fixes [Issue 430](#430).


Try with:
```
python -m lighteval custom google-translate /path/to/google_translate_model.py "lighteval|wmt20:fr-de|0|0" --max-samples 10
```

google_translate_model.py
```
import logging
from typing import Optional

from tqdm import tqdm
from transformers import AutoTokenizer

from lighteval.data import GenerativeTaskDataset
from lighteval.models.abstract_model import LightevalModel, ModelInfo
from lighteval.models.model_output import (
    GenerativeResponse,
    LoglikelihoodResponse,
    LoglikelihoodSingleTokenResponse,
)
from lighteval.tasks.requests import (
    GreedyUntilRequest,
    LoglikelihoodRequest,
    LoglikelihoodRollingRequest,
    LoglikelihoodSingleTokenRequest,
)


logger = logging.getLogger(__name__)


class GoogleTranslateClient(LightevalModel):

    def __init__(self, config, env_config) -> None:
        self.model = config.model
        self.model_definition_file_path = config.model_definition_file_path

        self.model_info = ModelInfo(
            model_name=config.model,
            model_sha="",
            model_dtype=None,
            model_size="",
        )
        
        self._tokenizer = AutoTokenizer.from_pretrained("gpt2")  # Use a dummy tokenizer for compatibility

        import httpcore
        # Needed to fix some googletrans bug
        # https://stackoverflow.com/questions/72796594/attributeerror-module-httpcore-has-no-attribute-synchttptransport#comment136664963_77334618
        setattr(httpcore, 'SyncHTTPTransport', 'AsyncHTTPProxy')  
        from googletrans import Translator
        self.translator = Translator()

    def greedy_until(
        self,
        requests: list[GreedyUntilRequest],
        override_bs: Optional[int] = None,
    ) -> list[GenerativeResponse]:
        """
        Generates responses using a greedy decoding strategy until certain ending conditions are met.

        Args:
            requests (list[GreedyUntilRequest]): list of requests containing the context and ending conditions.
            override_bs (int, optional): Override the batch size for generation. Unused here. Defaults to None.

        Returns:
            list[GenerativeResponse]: list of generated responses.
        """
        for request in requests:
            request.tokenized_context = self.tok_encode(request.context)

        dataset = GenerativeTaskDataset(requests=requests, num_dataset_splits=self.DATASET_SPLITS)
        results = []

        for _ in tqdm(
            dataset.splits_start_end_iterator(),
            total=dataset.num_dataset_splits,
            desc="Splits",
            position=0,
            disable=False,  # self.disable_tqdm,
        ):
            for r in tqdm(dataset, desc="Batch", position=1, disable=False):
                context = r.context.replace("French phrase: ", "")
                # TODO: Get src and dest from request
                translation = self.translator.translate(context, src='fr', dest='de')


                result = translation.text
                cur_response = GenerativeResponse(
                    result=result,
                    logits=None, 
                    generated_tokens=[],
                    input_tokens=[],
                )
                results.append(cur_response)


        return dataset.get_original_order(results)

    @property
    def tokenizer(self):
        return self._tokenizer

    def tok_encode(self, text: str):
        return self.tokenizer.encode(text)

    @property
    def add_special_tokens(self) -> bool:
        return False

    @property
    def max_length(self) -> int:
        """Return the maximum sequence length of the model."""
        return 4096

    def loglikelihood(
        self, requests: list[LoglikelihoodRequest], override_bs: Optional[int] = None
    ) -> list[LoglikelihoodResponse]:
        """Tokenize the context and continuation and compute the log likelihood of those
        tokenized sequences.
        """
        raise NotImplementedError

    def loglikelihood_rolling(
        self, requests: list[LoglikelihoodRollingRequest], override_bs: Optional[int] = None
    ) -> list[LoglikelihoodResponse]:
        """This function is used to compute the log likelihood of the context for perplexity metrics."""
        raise NotImplementedError

    def loglikelihood_single_token(
        self, requests: list[LoglikelihoodSingleTokenRequest], override_bs: Optional[int] = None
    ) -> list[LoglikelihoodSingleTokenResponse]:
        """Tokenize the context and continuation and compute the log likelihood of those
        tokenized sequences.
        """
        raise NotImplementedError

```
* lift protobuf restriction

* fix typos

* remove incorrect comment

* initial attempt to fix the from_model() method

* ensure input tensors are moved to proper device

* remove conditional, since device should never be None here

---------

Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
This PR aims to make iterating over splits a bit more intuitive, at least in my opinion. Open to feedback though! If the current behavior was intentional, feel free to close.

---------

Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
* added option to bill to org in inference providers

* removed tokenizer logging
* Update README.md

* Update README.md
* Release: v0.9.0

* Bump dev version to 0.9.1.dev0
* fix inference endpoints bugs

* Release: v0.9.1
* fix tqdm logging

* suppress stderr draws

* add env var for custom tqdm behavior
* Added support for quantization in vllm backend

* Fixed style issues

---------

Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
* Fix revision arg for vLLM tokenizer

* Add unit test

* Update test

* Move test repo
* add smolm generative tasks

* add jeopardy

* pretty 🥰

* consistent stop sequences

* add versions + change names

---------

Co-authored-by: Hynek Kydlicek <kydlicek.hynek@huggingface.co>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
* adding relevant languages + sorting everyone
* added chrf++
* updated language list to rm redundancies, renamed task to follow usual pattern
* fix adapter and manage languages simply; in the future we might want a custom enum where one key maps to several values
```
uv run lighteval accelerate "model_name=HuggingFaceTB/SmolVLM-Instruct" "lighteval|mmmu_pro|0|0" --use-chat-template --vision-model
```

---------

Co-authored-by: qubvel <qubvel@gmail.com>
* fixing task metric type

* fix style
* Add MCQ support to Yourbench evaluation

---------

Co-authored-by: Hynek Kydlíček <kydlicek.hynek@gmail.com>
NathanHB and others added 4 commits May 21, 2025 10:13
## Pull Request Overview

This PR introduces support for passing custom keyword arguments when loading pretrained transformer models, enabling more flexible configuration of model loading. It also replaces the fixed "generation_size" parameter with a more general "model_loading_kwargs" field.
- Removed the fixed generation_size parameter.
- Added a new model_loading_kwargs field to the configuration.
- Updated the auto model creation to copy the provided kwargs.
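
A minimal sketch of the idea, assuming the config simply carries a dict that is copied and forwarded to the loader. `ModelConfigSketch`, `load_model`, and `fake_from_pretrained` are hypothetical names used for illustration; only the `model_loading_kwargs` field name comes from this PR, and the real loader would be something like a `from_pretrained` call.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ModelConfigSketch:
    """Hypothetical stand-in for the real model config."""

    model_name: str
    # Extra keyword arguments forwarded to the model loader.
    model_loading_kwargs: dict[str, Any] = field(default_factory=dict)


def load_model(config: ModelConfigSketch, loader: Callable[..., Any]) -> Any:
    # Copy so that loader-side changes never leak back into the config.
    kwargs = dict(config.model_loading_kwargs)
    return loader(config.model_name, **kwargs)


def fake_from_pretrained(name: str, **kwargs: Any) -> dict[str, Any]:
    """Fake loader so the sketch runs without downloading a model."""
    return {"name": name, **kwargs}


config = ModelConfigSketch("some-org/some-model", {"low_cpu_mem_usage": True})
model = load_model(config, fake_from_pretrained)
```

Copying the dict keeps the config reusable across several loads even if a loader pops or rewrites entries.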

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* suggestion from copilot

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
## Pull Request Overview

This PR adds support for using a custom template to determine where evaluation results are saved. The changes include adding a new parameter "results_path_template" across multiple main modules and updating the EvaluationTracker to honor this template for saving results; the associated tests and documentation have been updated accordingly.
- Added a new test for the custom results template.
- Extended CLI options in several main modules to accept "results_path_template".
- Updated EvaluationTracker logic and documentation to reflect the new functionality.
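
A rough sketch of how such a template could be resolved. The placeholder names `{output_dir}`, `{org}`, and `{model}`, the fallback layout, and the helper `resolve_results_dir` are assumptions made for illustration; check the updated documentation for the placeholders actually supported.

```python
from pathlib import Path
from typing import Optional


def resolve_results_dir(
    output_dir: str, model_name: str, template: Optional[str] = None
) -> Path:
    """Hypothetical helper: choose the directory where results are saved."""
    # Split "org/model" into its parts; org is empty when there is no slash.
    org, _, model = model_name.rpartition("/")
    if template is None:
        # Assumed fallback layout when no template is given.
        return Path(output_dir) / "results" / model_name
    return Path(template.format(output_dir=output_dir, org=org, model=model))


custom = resolve_results_dir(
    "out", "my-org/my-model", "{output_dir}/custom/{org}_{model}"
)
default = resolve_results_dir("out", "my-org/my-model")
```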


---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@ljvmiranda921 ljvmiranda921 merged commit 5791574 into filbench:main May 21, 2025