lisadunlap · nazcol · Oct 10, 2024 · Feb 2, 2025
diff --git a/.gitignore b/.gitignore
@@ -45,6 +45,8 @@ MANIFEST
 cache/
 serve/global_vars.py
 
+vibecheck_results/
+
 # Installer logs
 pip-log.txt
 pip-delete-this-directory.txt
@@ -168,4 +170,10 @@ wandb/
 */test.zip
 
 # config runs
-/pipeline_results
+/pipeline_results
+
+venv/
+__pycache__/
+*.pyc
+uploads/*
+!uploads/.gitkeep
diff --git a/README.md b/README.md
@@ -1,21 +1,21 @@
 # Give your generative models a ✨vibe check✨
 
 
-### [VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models](https://arxiv.org/abs/2410.12851)
-The title sucks I know, but the paper's alright.
+### [VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models](https://bench-mark.org)
 
 <p align="center">
   <img src="method_vibecheck.png" width="800">
 </p>
 
 
-**This is a simplified and more user-friendly version of the VibeCheck paper.** Original code is in `_deprecated` and should run, it's just very messy. Still working on adding all the functionality of the orignal code but the core functionality is here and the visualizations are much better. Namely we moved to using [LOTUS](https://lotus-ai.readthedocs.io/en/latest/), a pandas wrapper to easily run LLM/embedding calls on your data. It reduced my many thousand lines of code to like 2 files. I'm telling you it's the bees knees.
+**This is a simplified and more user-friendly version of the VibeCheck paper.** Original code is in `paper_code` and should run, it's just very messy. 
+<!-- Still working on adding all the functionality of the orignal code but the core functionality is here and the visualizations are much better. Namely we moved to using [LOTUS](https://lotus-ai.readthedocs.io/en/latest/), a pandas wrapper to easily run LLM/embedding calls on your data. It reduced my many thousand lines of code to like 2 files. I'm telling you it's the bees knees. -->
 
 ## Data
 
 * [Link to chatbot arena data](https://huggingface.co/datasets/lmarena-ai/Llama-3-70b-battles)
 * [Human VS GPT (HC3)](https://huggingface.co/datasets/Hello-SimpleAI/HC3)
-* [HELM Predictions](https://crfm.stanford.edu/helm/classic/latest/) (fair warning, this is a real pain to download)
+* [HELM Predictions](https://crfm.stanford.edu/helm/classic/latest/)
 
 ## Quickstart
 
@@ -35,23 +35,21 @@ pip install -r requirements.txt
 
 3. Set env variables for your LLM API keys (e.g. OPENAI_API_KEY, ANTHROPIC_API_KEY, etc)
 
-To run local models, you can use the [LiteLLM library](https://docs.litellm.ai/docs/) with notes on how to set up with LOTUS [here](https://lotus-ai.readthedocs.io/en/latest/llm.html)
+<!-- To run local models, you can use the [LiteLLM library](https://docs.litellm.ai/docs/) with notes on how to set up with LOTUS [here](https://lotus-ai.readthedocs.io/en/latest/llm.html) -->
 
 4. Example run
 ```
-python main.py --data_path data/friendly_and_cold_sample.csv --models friendly cold --num_final_vibes 3
+python main.py data_path=data/friendly_and_cold_sample.csv models=[friendly,cold] num_final_vibes=3
 ```
 This runs a toy example on LLM outputs, one model is prompted to be friendly, the other cold and factual. I randomly assigned preference so friendly results are favored 80% of the time
 
-**Note:** We use a slightly different definition of vibe than the paper (e.g. "friendly tone" instead of "Tone: High: friendly Low: cold"). I think this definition is more intuitive, but if you want to use the paper definition, you can run `main_old.py` with the same arguments.
-
-*Gradio Visualization:* Add the `--gradio` flag to see a gradio visualization of the data. This is useful for debugging the ranker outputs by looking at the pairwise comparisons.
+Alternatively, you can set a custom [config](configs/base.yaml) and run with `python main.py --config configs/my_config.yaml [any other args you want to override]`
 
 ## Data Structure
 
 All data needs to contain the columns "question", model_name_1, model_name_2, and optionally "preference". If the preference column is not provided, run `generate_preference_labels.py` to compute the preference via LLM as a judge.
 
-Say your two models are gpt-4o and gemini-1.5-flash. Your CSV should have the columns "question", "gpt-4o", "gemini-1.5-flash" and in your command, set your data path and set `--models gpt-4o gemini-1.5-flash`. Sometime soon I will add an option to only optimize for model matching if you only care to find differentiating qualities, so get excited for that. 
+Say your two models are gpt-4o and gemini-1.5-flash. Your CSV should have the columns "question", "gpt-4o", "gemini-1.5-flash" and in your command, set your data path and set `models=['gpt-4o', 'gemini-1.5-flash']`. If you only care to find differentiating qualities, you can set `filter.min_pref_score_diff=0`.
 
 ## 🎯 Citation
 
@@ -66,11 +64,3 @@ If you use this repo in your research, please cite it as follows and ideally use
 }
 ```
 
-## TODO
-
-- [ ] Add a way to only optimize for model matching if you only care to find differentiating qualities
-- [ ] Add different vibe selection methods (LARS, filter by coeffs, etc)
-- [ ] Add multi-model comparison
-- [ ] Add absolute ranking option
-- [ ] Add causal inference stuff to try to match model vibes and check preference
-- [ ] Multimodal support
diff --git a/_depreciated/configs/human_vs_gpt.yaml b/_depreciated/configs/human_vs_gpt.yaml
diff --git a/components/__init__.py b/components/__init__.py
@@ -0,0 +1 @@
+# Empty file to make components a Python package
diff --git a/components/prompts/preference_judge.py b/components/prompts/preference_judge.py
@@ -0,0 +1,10 @@
+preference_judge_prompt = """You are an impartial judge and evaluate the quality of the responses provided by two AI assistants (A and B) to the user question displayed below. You should choose the assistant that you think is better. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. Only output tie if the two responses are almost exactly the same.
+
+Here is the prompt and the outputs of A and B respectively:
+
+{judge_input}
+
+Please respond with the model which contains a higher quality response. Based on your analysis, please explain your reasoning before assigning a score. Use the following format for your response:
+Analysis: {{reasoning}}
+Model: {{A, B, tie}}
+"""
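Note on the template above: it has a single `{judge_input}` placeholder and uses doubled braces (`{{reasoning}}`, `{{A, B, tie}}`), which suggests it is meant to be filled with `str.format`. The sketch below shows one way that could work and how the `Model:` line of a reply might be read back. It uses a shortened stand-in template; the layout of `judge_input` and the helper names `build_judge_prompt` and `parse_judge_verdict` are assumptions for illustration, not code from this PR.

```python
import re

# Shortened stand-in for preference_judge_prompt; only the placeholders matter here.
prompt_template = """Here is the prompt and the outputs of A and B respectively:

{judge_input}

Use the following format for your response:
Analysis: {{reasoning}}
Model: {{A, B, tie}}
"""


def build_judge_prompt(question, response_a, response_b):
    """Fill the single {judge_input} placeholder (hypothetical helper)."""
    # Assumed layout for judge_input; the repo may assemble it differently.
    judge_input = f"Prompt: {question}\n\nA: {response_a}\n\nB: {response_b}"
    # str.format turns the doubled braces into literal single braces.
    return prompt_template.format(judge_input=judge_input)


def parse_judge_verdict(reply):
    """Return 'A', 'B', 'tie', or None if no 'Model:' line is found."""
    match = re.search(
        r"^Model:\s*\{?\s*(A|B|tie)\b", reply, flags=re.IGNORECASE | re.MULTILINE
    )
    if match is None:
        return None
    verdict = match.group(1)
    return "tie" if verdict.lower() == "tie" else verdict.upper()
```

Returning `None` for an unparseable reply leaves it to the caller to decide whether to retry or drop the row, rather than silently counting it as a tie.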
diff --git a/components/prompts/proposer_prompts.py b/components/prompts/proposer_prompts.py
@@ -0,0 +1,80 @@
+proposer_freeform = """You are a machine learning researcher trying to figure out the major differences between the behaviors of two llms by finding differences in their responses to the same set of questions and seeing if these differences correspond with user preferences. Write down as many differences as you can find between the two outputs. Please format your differences as a list of properties that appear more in one output than the other.
+
+Below are multiple sets of questions and responses, separated by dashed lines. For each set, analyze the differences between Model 1 and Model 2. What properties are seen in the responses from Model 1 that are not seen in the responses from Model 2? What properties are seen in the responses from Model 2 that are not seen in the responses from Model 1?
+
+{combined_responses}
+
+The format should be a list of properties that appear more in one output than the other in the format of a short description of the property. 
+
+Consider differences on many different axes such as tone, language, structure, content, safety, and any other axis that you can think of. If the questions have a specific property or cover a specific topic (e.g. coding, creative writing, math, etc.), also consider differences which are relevant to that property or topic.
+
+Remember that these differences should be human interpretable and that the differences should be concise, substantive and objective. Write down as many properties as you can find. Do not explain which model has which property, simply describe the property. Your response should not include any mention of Model 1 or Model 2.
+
+Respond with a list of properties, each on a new line separated by *. Do NOT include any other text in your response. If there are no substantive differences between the outputs, please respond with only "No differences found."
+"""
+
+proposer_freeform_iteration = """You are a machine learning researcher trying to figure out the major differences between the behaviors of two llms by finding differences in their responses to the same set of questions and seeing if these differences correspond with user preferences. I have already found some differences between the two outputs, but there are many more differences to find. Write down as many differences as you can find between the two outputs which are not already in the list of differences. Please format your differences as a list of properties that appear more in one output than the other.
+
+Below are multiple sets of questions and responses, separated by dashed lines. For each set, analyze the differences between Model 1 and Model 2. What properties are seen in the responses from Model 1 that are not seen in the responses from Model 2? What properties are seen in the responses from Model 2 that are not seen in the responses from Model 1? Here are the differences I have already found and the questions and responses:
+
+{combined_responses}
+
+The format should be a list of properties that appear more in one output than the other in the format of a short description of the property.
+
+Consider differences on many different axes such as tone, language, structure, content, safety, and any other axis that you can think of. If the questions have a specific property or cover a specific topic (e.g. coding, creative writing, math, etc.), also consider differences which are relevant to that property or topic.
+
+Remember that these differences should be human interpretable and that the differences should be concise, substantive and objective. Write down as many properties as you can find which are not already represented in the list of differences. Do not explain which model has which property, simply describe the property. Your response should not include any mention of Model 1 or Model 2.
+Respond with a list of new properties, each on a new line separated by *. Do NOT include any other text in your response. If there are no substantive differences between the outputs, please respond with only "No differences found."
+"""
+
+proposer_onesided = """You are a machine learning researcher trying to figure out the major differences between the behaviors of two llms by finding differences in their responses to the same set of questions and seeing if these differences correspond with user preferences. Write down as many properties as you can find that are present in Model 1 but not in Model 2. Please format your differences as a list of properties that appear more in one output than the other.
+
+{combined_responses}
+
+The format should be a list of properties that appear more in the output of Model 1 than the output of Model 2 in the format of a short description of the property. Respond with a list of properties, each on a new line.
+
+Consider differences on many different axes such as tone, language, structure, content, safety, and any other axis that you can think of. If the questions have a specific property or cover a specific topic (e.g. coding, creative writing, math, etc.), also consider differences which are relevant to that property or topic.
+
+Remember that these differences should be human interpretable and that the differences should be concise, substantive and objective. Write down as many properties as you can find. Do not explain which model has which property, simply describe the property. Your response should not include any mention of Model 1 or Model 2, only the properties that are present more in Model 1 than Model 2.
+If there are no substantive differences between the outputs, please respond with only "No differences found."
+"""
+
+proposer_onesided_iteration = """You are a machine learning researcher trying to figure out the major differences between the behaviors of two llms by finding differences in their responses to the same set of questions and seeing if these differences correspond with user preferences. I have already found some properties, but there are many more properties to find. Write down as many properties as you can find that are present in Model 1 but not in Model 2.
+
+Below are multiple sets of questions and responses, separated by dashed lines. For each set, analyze the differences between Model 1 and Model 2. What properties are seen in the responses from Model 1 that are not seen in the responses from Model 2? What properties are seen in the responses from Model 2 that are not seen in the responses from Model 1? Here are the differences I have already found and the questions and responses:
+
+{combined_responses}
+
+The format should be a list of properties that appear more in the output of Model 1 than the output of Model 2 in the format of a short description of the property. Respond with a list of properties, each on a new line.
+
+Note that this example is not at all exhaustive, but rather just an example of the format. Consider differences on many different axes such as tone, language, structure, content, safety, and any other axis that you can think of. If the questions have a specific property or cover a specific topic (e.g. coding, creative writing, math, etc.), also consider differences which are relevant to that property or topic.
+
+Remember that these differences should be human interpretable and that the differences should be concise, substantive and objective. Write down as many properties as you can find which are not already represented in the list of differences. Do not explain which model has which property, simply describe the property. Your response should not include any mention of Model 1 or Model 2, only the properties that are present more in Model 1 than Model 2. If there are no substantive differences between the outputs, please respond with only "No differences found."
+"""
+
+proposer_freeform_axis = """You are a machine learning researcher trying to figure out the major differences between the behaviors of two llms by finding differences in their responses to the same set of questions. Write down as many differences as you can find between the two outputs. Please format your differences as a list of axes of variation and differences between the two outputs. Try to give axes which represent a property that a human could easily interpret and they could categorize a pair of text outputs as higher or lower on that specific axis. 
+
+Here are the questions and responses:
+{combined_responses}
+
+The format should be a list of axes in the format of {{axis}}: High: {{high description}} Low: {{low description}} for each axis, with each axis on a new line separated by *. Do NOT include any other text in your response.
+
+Consider differences on many different axes such as tone, language, structure, content, safety, and any other axis that you can think of. If the questions have a specific property or cover a specific topic (e.g. coding, creative writing, math, etc.), also consider differences which are relevant to that property or topic.
+If there are no substantive differences between the outputs, please respond with only "No differences found."
+"""
+
+proposer_freeform_iteration_axis = """You are a machine learning researcher trying to figure out the major differences between the behaviors of two llms by finding differences in their responses to the same set of questions and seeing if these differences correspond with user preferences. I have already found some differences between the two outputs, but there are many more differences to find. Write down as many differences as you can find between the two outputs which are not already in the list of differences. Please format your differences as a list of properties that appear more in one output than the other.
+
+Below are multiple sets of questions and responses, separated by dashed lines. For each set, analyze the differences between Model 1 and Model 2. Please format your differences as a list of axes of variation and differences between the two outputs. Try to give axes which represent a property that a human could easily interpret and they could categorize a pair of text outputs as higher or lower on that specific axis. 
+
+Here are the differences I have already found and the questions and responses:
+
+{combined_responses}
+
+The format should be a list of axes in the format of {{axis}}: High: {{high description}} Low: {{low description}} for each axis, with each axis on a new line separated by *. Do NOT include any other text in your response.
+
+Consider differences on many different axes such as tone, language, structure, content, safety, and any other axis that you can think of. If the questions have a specific property or cover a specific topic (e.g. coding, creative writing, math, etc.), also consider differences which are relevant to that property or topic.
+
+Remember that these differences should be human interpretable and that the differences should be concise, substantive and objective. Write down as many properties as you can find which are not already represented in the list of differences. Do not explain which model has which property, simply describe the property. Your response should not include any mention of Model 1 or Model 2.
+Respond with a list of new properties, each on a new line separated by *. Do NOT include any other text in your response. If there are no substantive differences between the outputs, please respond with only "No differences found."
+"""
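Note on the proposer prompts above: most of them ask for properties "each on a new line separated by *", with the literal reply "No differences found." when nothing stands out. A minimal sketch of reading such a reply into a Python list follows. The helper name `parse_proposer_reply` is hypothetical and this is not the parsing code used in the repo.

```python
def parse_proposer_reply(reply):
    """Split a '*'-separated proposer reply into a list of property strings.

    Hypothetical helper: returns an empty list for the sentinel reply
    "No differences found." (with or without quotes or a trailing period).
    """
    text = reply.strip().strip('"').strip()
    if text.lower().rstrip(".") == "no differences found":
        return []

    properties = []
    for chunk in text.split("*"):
        # Drop surrounding whitespace and stray list dashes.
        cleaned = chunk.strip(" \t\r\n-")
        if cleaned:
            properties.append(cleaned)
    return properties
```

Splitting on every `*` is deliberately naive: a reply that uses Markdown bold (`**Tone**`) would be broken into extra fragments, so a stricter parser may be needed if the model does not follow the requested format.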