Open
41 commits
1e1f8c5
factory reset repo to make it usable
lisadunlap Oct 10, 2024
feda1e6
added main fig and env
lisadunlap Oct 10, 2024
e33781e
refactor pt something
lisadunlap Feb 2, 2025
0e5b27b
change paper code file
lisadunlap Feb 2, 2025
9b44bc0
confirmed preset vibes works
lisadunlap Feb 2, 2025
ab12246
got random position switching working
lisadunlap Feb 4, 2025
c6ef369
create proposer class
lisadunlap Feb 4, 2025
fa07183
renaming
lisadunlap Feb 4, 2025
6065340
added ranker class skeleton
lisadunlap Feb 4, 2025
9cd218f
moved ranker to class, set shuffle to false, prototype emb model
lisadunlap Feb 6, 2025
3f75f1e
merge conflicts
lisadunlap Feb 6, 2025
d2303d4
added batch to vibe ranking
lisadunlap Feb 6, 2025
7acf0a2
removed unused code
lisadunlap Feb 8, 2025
0831d42
removed redundant configs
lisadunlap Feb 10, 2025
611427c
added some paper code configs
lisadunlap Feb 10, 2025
07c708b
Merge pull request #15 from lisadunlap/single_model
lisadunlap Feb 10, 2025
a12419e
removed lotus from proposer, cleaned up code
lisadunlap Feb 11, 2025
74ec819
updated readme
lisadunlap Feb 11, 2025
26d1ffb
Merge pull request #16 from lisadunlap/single_model
lisadunlap Feb 11, 2025
1c0ec89
before clean up
lisadunlap Feb 16, 2025
4833916
movin more functions 'round
lisadunlap Feb 16, 2025
f54c5a1
cleaned up ranker
lisadunlap Feb 16, 2025
186ca34
Merge pull request #17 from lisadunlap/adding_fun_things
lisadunlap Feb 18, 2025
6b5b885
the app is slightly better
lisadunlap Mar 14, 2025
baeffb6
app kinda works i think
lisadunlap Mar 14, 2025
d09050f
added results loading
lisadunlap Mar 14, 2025
e5d8160
y
lisadunlap Mar 17, 2025
92f8266
okay i thiiiink the gradio vis is working well
lisadunlap Mar 18, 2025
f115610
updated utils llm
lisadunlap Mar 22, 2025
2281506
llama script with wandb logs
nazcol Mar 30, 2025
b9e10ab
cleaning up
lisadunlap Apr 3, 2025
13373f9
configs/helm/math_cot.yaml
lisadunlap Apr 3, 2025
78a5fc3
adding in more configs
lisadunlap Apr 3, 2025
6692363
updated readme
lisadunlap Apr 3, 2025
44a66fb
fixing configs
lisadunlap Apr 7, 2025
7f226ce
uncommented embedding classifier
lisadunlap Apr 7, 2025
15faeb7
Merge pull request #18 from lisadunlap/adding_fun_things
lisadunlap Apr 7, 2025
41b7c07
updated requirements
lisadunlap Apr 7, 2025
f4610e2
removed bad files
lisadunlap Apr 7, 2025
1e9e250
changed link to website
lisadunlap Apr 7, 2025
48ff662
Update README.md
lisadunlap Jul 6, 2025
10 changes: 9 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -45,6 +45,8 @@ MANIFEST
cache/
serve/global_vars.py

vibecheck_results/

# Installer logs
pip-log.txt
pip-delete-this-directory.txt
@@ -168,4 +170,10 @@ wandb/
*/test.zip

# config runs
/pipeline_results
/pipeline_results

venv/
__pycache__/
*.pyc
uploads/*
!uploads/.gitkeep
26 changes: 8 additions & 18 deletions README.md
@@ -1,21 +1,21 @@
# Give your generative models a ✨vibe check✨


### [VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models](https://arxiv.org/abs/2410.12851)
The title sucks I know, but the paper's alright.
### [VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models](https://bench-mark.org)

<p align="center">
<img src="method_vibecheck.png" width="800">
</p>


**This is a simplified and more user-friendly version of the VibeCheck paper.** Original code is in `_deprecated` and should run, it's just very messy. Still working on adding all the functionality of the orignal code but the core functionality is here and the visualizations are much better. Namely we moved to using [LOTUS](https://lotus-ai.readthedocs.io/en/latest/), a pandas wrapper to easily run LLM/embedding calls on your data. It reduced my many thousand lines of code to like 2 files. I'm telling you it's the bees knees.
**This is a simplified and more user-friendly version of the VibeCheck paper.** Original code is in `paper_code` and should run, it's just very messy.
<!-- Still working on adding all the functionality of the original code but the core functionality is here and the visualizations are much better. Namely we moved to using [LOTUS](https://lotus-ai.readthedocs.io/en/latest/), a pandas wrapper to easily run LLM/embedding calls on your data. It reduced my many thousand lines of code to like 2 files. I'm telling you it's the bee's knees. -->

## Data

* [Link to chatbot arena data](https://huggingface.co/datasets/lmarena-ai/Llama-3-70b-battles)
* [Human VS GPT (HC3)](https://huggingface.co/datasets/Hello-SimpleAI/HC3)
* [HELM Predictions](https://crfm.stanford.edu/helm/classic/latest/) (fair warning, this is a real pain to download)
* [HELM Predictions](https://crfm.stanford.edu/helm/classic/latest/)

## Quickstart

@@ -35,23 +35,21 @@ pip install -r requirements.txt

3. Set env variables for your LLM API keys (e.g. OPENAI_API_KEY, ANTHROPIC_API_KEY, etc)

To run local models, you can use the [LiteLLM library](https://docs.litellm.ai/docs/) with notes on how to set up with LOTUS [here](https://lotus-ai.readthedocs.io/en/latest/llm.html)
<!-- To run local models, you can use the [LiteLLM library](https://docs.litellm.ai/docs/) with notes on how to set up with LOTUS [here](https://lotus-ai.readthedocs.io/en/latest/llm.html) -->

4. Example run
```
python main.py --data_path data/friendly_and_cold_sample.csv --models friendly cold --num_final_vibes 3
python main.py data_path=data/friendly_and_cold_sample.csv models=[friendly,cold] num_final_vibes=3
```
This runs a toy example on LLM outputs: one model is prompted to be friendly, the other to be cold and factual. I randomly assigned preferences so that the friendly responses are favored 80% of the time.

**Note:** We use a slightly different definition of vibe than the paper (e.g. "friendly tone" instead of "Tone: High: friendly Low: cold"). I think this definition is more intuitive, but if you want to use the paper definition, you can run `main_old.py` with the same arguments.

*Gradio Visualization:* Add the `--gradio` flag to see a gradio visualization of the data. This is useful for debugging the ranker outputs by looking at the pairwise comparisons.
Alternatively, you can set a custom [config](configs/base.yaml) and run with `python main.py --config configs/my_config.yaml [any other args you want to override]`

## Data Structure

All data needs to contain the columns "question", model_name_1, model_name_2, and optionally "preference". If the preference column is not provided, run `generate_preference_labels.py` to compute the preference via LLM as a judge.

Say your two models are gpt-4o and gemini-1.5-flash. Your CSV should have the columns "question", "gpt-4o", "gemini-1.5-flash" and in your command, set your data path and set `--models gpt-4o gemini-1.5-flash`. Sometime soon I will add an option to only optimize for model matching if you only care to find differentiating qualities, so get excited for that.
Say your two models are gpt-4o and gemini-1.5-flash. Your CSV should have the columns "question", "gpt-4o", "gemini-1.5-flash" and in your command, set your data path and set `models=['gpt-4o', 'gemini-1.5-flash']`. If you only care to find differentiating qualities, you can set `filter.min_pref_score_diff=0`.
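
Before launching a run, it can help to check that a CSV actually has the columns described above. The following is a minimal sketch using only the standard library; `check_columns` is a hypothetical helper, not part of this repo, and it only inspects the header row.

```python
import csv
import io


def check_columns(csv_text, models, require_preference=False):
    """Return the required column names that are missing from the CSV header.

    Hypothetical helper for illustration; the repo may validate data differently.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    # "question" plus one column per model name, as described in the README.
    required = ["question", *models]
    if require_preference:
        required.append("preference")
    return [col for col in required if col not in header]


sample = (
    "question,gpt-4o,gemini-1.5-flash\n"
    "What is 2+2?,4,The answer is 4.\n"
)

# No required columns are missing when "preference" is treated as optional.
print(check_columns(sample, ["gpt-4o", "gemini-1.5-flash"]))
# With require_preference=True the missing "preference" column is reported.
print(check_columns(sample, ["gpt-4o", "gemini-1.5-flash"], require_preference=True))
```

If `preference` is reported missing, that is the case where the README suggests running `generate_preference_labels.py` first.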

## 🎯 Citation

@@ -66,11 +64,3 @@ If you use this repo in your research, please cite it as follows and ideally use
}
```

## TODO

- [ ] Add a way to only optimize for model matching if you only care to find differentiating qualities
- [ ] Add different vibe selection methods (LARS, filter by coeffs, etc)
- [ ] Add multi-model comparison
- [ ] Add absolute ranking option
- [ ] Add causal inference stuff to try to match model vibes and check preference
- [ ] Multimodal support
17 changes: 0 additions & 17 deletions _depreciated/configs/human_vs_gpt.yaml

This file was deleted.

1 change: 1 addition & 0 deletions components/__init__.py
@@ -0,0 +1 @@
# Empty file to make components a Python package
10 changes: 10 additions & 0 deletions components/prompts/preference_judge.py
@@ -0,0 +1,10 @@
preference_judge_prompt = """You are an impartial judge evaluating the quality of the responses provided by two AI assistants (A and B) to the user question displayed below. You should choose the assistant that you think is better. Begin your evaluation by comparing the two responses and providing a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. Only output tie if the two responses are almost exactly the same.

Here is the prompt and the outputs of A and B respectively:

{judge_input}

Please respond with the model which contains a higher quality response. Based on your analysis, please explain your reasoning before assigning a score. Use the following format for your response:
Analysis: {{reasoning}}
Model: {{A, B, tie}}
"""
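
For readers of this prompt, a rough sketch of how it might be filled in and how the `Analysis:` / `Model:` reply could be parsed. This is not the repo's code: `template` is a shortened stand-in for `preference_judge_prompt`, the layout of `judge_input` is an assumption, and `build_prompt` and `parse_verdict` are hypothetical names. The doubled braces in the prompt are what `str.format` turns into literal `{reasoning}` and `{A, B, tie}`.

```python
import re

# Shortened stand-in for preference_judge_prompt, keeping the same placeholders.
template = """Here is the prompt and the outputs of A and B respectively:

{judge_input}

Use the following format for your response:
Analysis: {{reasoning}}
Model: {{A, B, tie}}
"""


def build_prompt(question, output_a, output_b):
    # The exact layout of judge_input is an assumption made for this sketch.
    judge_input = f"Prompt: {question}\nA: {output_a}\nB: {output_b}"
    return template.format(judge_input=judge_input)


def parse_verdict(reply):
    """Return 'A', 'B', or 'tie'; None if no 'Model:' line can be parsed."""
    match = re.search(
        r"^\s*Model:\s*\{?\s*(A|B|tie)\b",
        reply,
        flags=re.IGNORECASE | re.MULTILINE,
    )
    if match is None:
        return None
    verdict = match.group(1)
    return "tie" if verdict.lower() == "tie" else verdict.upper()


prompt = build_prompt("Hi?", "Hello!", "Hey.")
verdict = parse_verdict("Analysis: B is clearer.\nModel: B")
```

Returning `None` for unparseable replies leaves it to the caller to decide whether to retry or drop the sample.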
80 changes: 80 additions & 0 deletions components/prompts/proposer_prompts.py
@@ -0,0 +1,80 @@
proposer_freeform = """You are a machine learning researcher trying to figure out the major differences between the behaviors of two llms by finding differences in their responses to the same set of questions and seeing if these differences correspond with user preferences. Write down as many differences as you can find between the two outputs. Please format your differences as a list of properties that appear more in one output than the other.

Below are multiple sets of questions and responses, separated by dashed lines. For each set, analyze the differences between Model 1 and Model 2. What properties are seen in the responses from Model 1 that are not seen in the responses from Model 2? What properties are seen in the responses from Model 2 that are not seen in the responses from Model 1?

{combined_responses}

The format should be a list of properties that appear more in one output than the other in the format of a short description of the property.

Consider differences on many different axes such as tone, language, structure, content, safety, and any other axis that you can think of. If the questions have a specific property or cover a specific topic (e.g. coding, creative writing, math, etc.), also consider differences which are relevant to that property or topic.

Remember that these differences should be human interpretable and that the differences should be concise, substantive and objective. Write down as many properties as you can find. Do not explain which model has which property, simply describe the property. Your response should not include any mention of Model 1 or Model 2.

Respond with a list of properties, each on a new line separated by *. Do NOT include any other text in your response. If there are no substantive differences between the outputs, please respond with only "No differences found."
"""
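
The prompt above asks for properties "each on a new line separated by *" or the sentinel "No differences found." A minimal parsing sketch under that reading follows; `parse_properties` is a hypothetical helper, and the proposer class in this PR may handle replies differently.

```python
import re


def parse_properties(reply):
    """Split a proposer reply into a list of property descriptions.

    Assumes items are delimited by newlines and/or '*' characters.
    """
    text = reply.strip()
    # The sentinel reply means there is nothing to extract.
    if text.strip('"').rstrip(".").lower() == "no differences found":
        return []
    parts = re.split(r"[*\n]+", text)
    # Drop empty fragments and stray bullet characters around each item.
    cleaned = (part.strip(" \t-•") for part in parts)
    return [item for item in cleaned if item]


properties = parse_properties("* uses bullet points\n* friendly tone\n")
```

Splitting on `*` also breaks up markdown emphasis such as `**bold**`, which is acceptable for short property descriptions but is a limitation of this sketch.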

proposer_freeform_iteration = """You are a machine learning researcher trying to figure out the major differences between the behaviors of two llms by finding differences in their responses to the same set of questions and seeing if these differences correspond with user preferences. I have already found some differences between the two outputs, but there are many more differences to find. Write down as many differences as you can find between the two outputs which are not already in the list of differences. Please format your differences as a list of properties that appear more in one output than the other.

Below are multiple sets of questions and responses, separated by dashed lines. For each set, analyze the differences between Model 1 and Model 2. What properties are seen in the responses from Model 1 that are not seen in the responses from Model 2? What properties are seen in the responses from Model 2 that are not seen in the responses from Model 1? Here are the differences I have already found and the questions and responses:

{combined_responses}

The format should be a list of properties that appear more in one output than the other in the format of a short description of the property.

Consider differences on many different axes such as tone, language, structure, content, safety, and any other axis that you can think of. If the questions have a specific property or cover a specific topic (e.g. coding, creative writing, math, etc.), also consider differences which are relevant to that property or topic.

Remember that these differences should be human interpretable and that the differences should be concise, substantive and objective. Write down as many properties as you can find which are not already represented in the list of differences. Do not explain which model has which property, simply describe the property. Your response should not include any mention of Model 1 or Model 2.
Respond with a list of new properties, each on a new line separated by *. Do NOT include any other text in your response. If there are no substantive differences between the outputs, please respond with only "No differences found."
"""

proposer_onesided = """You are a machine learning researcher trying to figure out the major differences between the behaviors of two llms by finding differences in their responses to the same set of questions and seeing if these differences correspond with user preferences. Write down as many properties as you can find that are present in Model 1 but not in Model 2. Please format your differences as a list of properties that appear more in one output than the other.

{combined_responses}

The format should be a list of properties that appear more in the output of Model 1 than the output of Model 2 in the format of a short description of the property. Respond with a list of properties, each on a new line.

Consider differences on many different axes such as tone, language, structure, content, safety, and any other axis that you can think of. If the questions have a specific property or cover a specific topic (e.g. coding, creative writing, math, etc.), also consider differences which are relevant to that property or topic.

Remember that these differences should be human interpretable and that the differences should be concise, substantive and objective. Write down as many properties as you can find. Do not explain which model has which property, simply describe the property. Your response should not include any mention of Model 1 or Model 2, only the properties that are present more in Model 1 than Model 2.
If there are no substantive differences between the outputs, please respond with only "No differences found."
"""

proposer_onesided_iteration = """You are a machine learning researcher trying to figure out the major differences between the behaviors of two llms by finding differences in their responses to the same set of questions and seeing if these differences correspond with user preferences. I have already found some properties, but there are many more properties to find. Write down as many properties as you can find that are present in Model 1 but not in Model 2.

Below are multiple sets of questions and responses, separated by dashed lines. For each set, analyze the differences between Model 1 and Model 2. What properties are seen in the responses from Model 1 that are not seen in the responses from Model 2? What properties are seen in the responses from Model 2 that are not seen in the responses from Model 1? Here are the differences I have already found and the questions and responses:

{combined_responses}

The format should be a list of properties that appear more in the output of Model 1 than the output of Model 2 in the format of a short description of the property. Respond with a list of properties, each on a new line.

Consider differences on many different axes such as tone, language, structure, content, safety, and any other axis that you can think of. If the questions have a specific property or cover a specific topic (e.g. coding, creative writing, math, etc.), also consider differences which are relevant to that property or topic.

Remember that these differences should be human interpretable and that the differences should be concise, substantive and objective. Write down as many properties as you can find which are not already represented in the list of differences. Do not explain which model has which property, simply describe the property. Your response should not include any mention of Model 1 or Model 2, only the properties that are present more in Model 1 than Model 2. If there are no substantive differences between the outputs, please respond with only "No differences found."
"""

proposer_freeform_axis = """You are a machine learning researcher trying to figure out the major differences between the behaviors of two llms by finding differences in their responses to the same set of questions. Write down as many differences as you can find between the two outputs. Please format your differences as a list of axes of variation and differences between the two outputs. Try to give axes which represent a property that a human could easily interpret and they could categorize a pair of text outputs as higher or lower on that specific axis.

Here are the questions and responses:
{combined_responses}

The format should be a list of axes in the format of {{axis}}: High: {{high description}} Low: {{low description}} for each axis, with each axis on a new line separated by *. Do NOT include any other text in your response.

Consider differences on many different axes such as tone, language, structure, content, safety, and any other axis that you can think of. If the questions have a specific property or cover a specific topic (e.g. coding, creative writing, math, etc.), also consider differences which are relevant to that property or topic.
If there are no substantive differences between the outputs, please respond with only "No differences found."
"""
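
The axis prompts request lines of the form `{axis}: High: {high description} Low: {low description}`, for example "Tone: High: friendly Low: cold". A hedged sketch of parsing that format is below; `Axis` and `parse_axes` are hypothetical names, and the sketch assumes each axis fits on a single line.

```python
import re
from typing import NamedTuple


class Axis(NamedTuple):
    name: str
    high: str
    low: str


# One axis per line: "<name>: High: <high description> Low: <low description>"
AXIS_PATTERN = re.compile(
    r"^(?P<name>.+?):\s*High:\s*(?P<high>.+?)\s*Low:\s*(?P<low>.+)$",
    re.IGNORECASE,
)


def parse_axes(reply):
    """Extract Axis tuples from a reply; lines that do not match are skipped."""
    axes = []
    for chunk in re.split(r"[*\n]+", reply):
        match = AXIS_PATTERN.match(chunk.strip())
        if match:
            axes.append(
                Axis(
                    match["name"].strip(),
                    match["high"].strip(),
                    match["low"].strip(),
                )
            )
    return axes


axes = parse_axes(
    "* Tone: High: friendly Low: cold\n"
    "* Length: High: long, detailed answers Low: short answers"
)
```

Because unmatched lines are skipped, a "No differences found." reply simply yields an empty list.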

proposer_freeform_iteration_axis = """You are a machine learning researcher trying to figure out the major differences between the behaviors of two llms by finding differences in their responses to the same set of questions and seeing if these differences correspond with user preferences. I have already found some differences between the two outputs, but there are many more differences to find. Write down as many differences as you can find between the two outputs which are not already in the list of differences. Please format your differences as a list of properties that appear more in one output than the other.

Below are multiple sets of questions and responses, separated by dashed lines. For each set, analyze the differences between Model 1 and Model 2. Please format your differences as a list of axes of variation and differences between the two outputs. Try to give axes which represent a property that a human could easily interpret and they could categorize a pair of text outputs as higher or lower on that specific axis.

Here are the differences I have already found and the questions and responses:

{combined_responses}

The format should be a list of axes in the format of {{axis}}: High: {{high description}} Low: {{low description}} for each axis, with each axis on a new line separated by *. Do NOT include any other text in your response.

Consider differences on many different axes such as tone, language, structure, content, safety, and any other axis that you can think of. If the questions have a specific property or cover a specific topic (e.g. coding, creative writing, math, etc.), also consider differences which are relevant to that property or topic.

Remember that these differences should be human interpretable and that the differences should be concise, substantive and objective. Write down as many properties as you can find which are not already represented in the list of differences. Do not explain which model has which property, simply describe the property. Your response should not include any mention of Model 1 or Model 2.
Respond with a list of new properties, each on a new line separated by *. Do NOT include any other text in your response. If there are no substantive differences between the outputs, please respond with only "No differences found."
"""
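
All of the proposer prompts take a single `{combined_responses}` string described as "multiple sets of questions and responses, separated by dashed lines". One way such a string could be assembled is sketched below. The field labels, the separator length, and the `combine_responses` name are assumptions for illustration, not taken from this PR.

```python
def combine_responses(rows, models):
    """Join question/response sets into one string separated by dashed lines.

    rows: list of dicts with a "question" key and one key per model name.
    models: (model_1, model_2) column names; shown to the LLM only as
    "Model 1" and "Model 2" so the real names do not leak into the prompt.
    """
    model_1, model_2 = models
    blocks = []
    for row in rows:
        blocks.append(
            f"Question: {row['question']}\n"
            f"Model 1: {row[model_1]}\n"
            f"Model 2: {row[model_2]}"
        )
    separator = "\n" + "-" * 20 + "\n"
    return separator.join(blocks)


rows = [
    {"question": "What is the capital of France?", "friendly": "Paris! Happy to help.", "cold": "Paris."},
    {"question": "What is 2+2?", "friendly": "That would be 4!", "cold": "4."},
]
combined = combine_responses(rows, ("friendly", "cold"))
```

The resulting string can be passed as `combined_responses` to any of the prompts via `str.format`.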