AI Testbed

A flexible toolkit for testing and comparing responses from multiple AI models (OpenAI, Claude, Gemini, and Ollama).

Screenshot

GUI

(see: readme_gui.md)[https://github.com/jmoral4/AITestbed/blob/main/README_GUI.md]

CLI

Overview

AI Testbed provides two primary workflows:

Sequential Testing - Run a prompt through multiple AI models one after another with aitestbed.py
Concurrent Testing - Test multiple AI models in parallel using separate windows with concurrent_ai_test.py

This toolkit is designed for researchers, developers, and enthusiasts who want to:

Compare model outputs side-by-side
Test the same prompt across different models
Experiment with various model parameters
Benchmark performance and response quality

Setup Requirements

Dependencies

Install required packages:

pip install openai anthropic google-generativeai halo requests

API Keys

Create an apikeys.json file in the same directory as the scripts with the following structure:

{
  "openai": "sk-your-openai-api-key",
  "anthropic": "sk-ant-your-anthropic-api-key",
  "gemini": "your-gemini-api-key"
}

Notes:

For Ollama, no API key is required (it runs locally)
Place the apikeys.json file in the same directory as the scripts
API keys for services you don't plan to use can be omitted

Prompt File

Create a prompt.txt file with your question or instructions. For example:

Explain quantum computing in simple terms that a 10-year-old could understand.

This can be one line or hundreds of lines long. It all depends on the context length of the models you've using. I've successfully provided multiple files from a codebase or entire chapters of writings.

It's usually helpful to denote files in a larger query like so:

# Task: Refactor the webpage mypage.html below to introduce charts from chartjs. 
# Criteria 1: Only use ChartJS
# Criteria 2: Don't add new text content but feel free to re-arrange items.
# Criteria 3: Don't use the color BLUE, I hate it.

# mypage.html
<lots of content

# another_reference_page.html
<content>

# mycustom.js
<content>

You can also create a system_prompt.txt file for Ollama models.

Usage

Method 0: Debug usage (within an IDE like PyCharm/VSCode)

Sequentially

Open aitestbed.py , ensure you have a prompt in prompt.txt, ensure you have the keys within apikeys.json (example in examplesettings.json) for the models you want to run. Run it. Your query will output sequentially in your IDE's console/terminal output.

Concurrently

Open concurrent_ai_test.py, ensure you have a prompt in prompt.txt, ensure you have the keys within apikeys.json (example in examplesettings.json) for the models you want to run. Run it. Your query will output in multiple Windows Terminal windows.

Method 1: Sequential Testing with aitestbed.py

This mode runs the same prompt through multiple models sequentially in the same terminal window.

python aitestbed.py

This will:

Load prompt from prompt.txt (or prompt for input if not found)
Run the prompt through OpenAI, Claude, Gemini, and Ollama models
Display responses one after another

Method 2: Concurrent Testing with concurrent_ai_test.py

This mode launches separate windows/tabs for each model, allowing you to see responses develop in parallel.

python concurrent_ai.py --prompt-file "prompt.txt" --openai-model "gpt-4o" --claude-model "claude-3-7-sonnet-latest" --ollama-model "llama3.1" --gemini-model "gemini-2.0-flash"

Arguments:

--prompt-file: Path to the prompt file (required)
--system-prompt-file: Path to system prompt file for Ollama (optional)
--openai-model: OpenAI model to use (default: "o3-mini")
--claude-model: Claude model to use (default: "claude-3-7-sonnet-latest")
--ollama-model: Ollama model to use (default: "llama3.1")
--gemini-model: Gemini model to use (default: "gemini-2.0-flash")
--reasoning-effort: Reasoning effort for OpenAI (optional, "high", "medium", "low")

Using Individual AI Runners

You can also run queries against specific models using ai_runner.py:

python ai_runner.py --model openai --model-name "gpt-4o" --prompt-file "prompt.txt" --wait

Arguments:

--model: Which AI model to use (required, choices: openai, claude, ollama, gemini)
--prompt-file: Path to the prompt file (optional)
--system-prompt-file: Path to system prompt file for Ollama (optional)
--model-name: Specific model name to use (e.g., "gpt-4o", "llama3.1")
--wait: Wait for user input before closing (optional)
--reasoning-effort: Reasoning effort for OpenAI (optional, "high", "auto", "off")

Supported Models

Any model can be added with a simple json config entry in aitestbed.py like so:

    "gemini-2.0-flash": {
        "max_tokens": 8192,
    }

Models without configs fallback to 4096 context size but otherwise work fine. Note: This means models that are "undefined" will clip output to 4096.

Preconfigured Models

OpenAI

gpt-4o
o3-mini (supports reasoning_effort)
o1 (supports reasoning_effort)

Claude

claude-3-7-sonnet-latest (supports thinking mode)
claude-3-5-haiku

Gemini

gemini-2.5-pro-exp-03-25
gemini-2.0-pro
gemini-2.0-flash
gemini-2.0-flash-lite

Ollama (Local Models)

llama3.1
gemma3
(and any other models you have installed locally)

Certainly! Here's the section formatted in Markdown:

## Helper Script: Collecting Codebase Files for Prompt Context

To simplify the process of collecting C# source files from your codebase into a single prompt context, a helper script is included: `prep_context.py`.

### Purpose
This script helps you:
- Scan a directory (optionally recursive) for `.cs` files (or any files you define)
- Interactively select which files to include
- Output a single combined `prompt_context.txt` that can be used with AI models in your prompt.txt file.

### Usage

```bash
python prep_context.py ./myproject -r -o prompt_context.txt

Arguments:

directory (required): Path to the directory where .cs files are located.
-r, --recursive: Search subdirectories recursively.
-o, --output: Output filename (default: prompt_context.txt)

Features

Interactive selection of files (choose individual files or type all)
Adds clear file headers (based on filenames) to separate content blocks
Handles invalid input gracefully and ensures consistent formatting for AI consumption

Example Output Snippet

--------------
MyClass.cs
--------------
public class MyClass {
    public void DoSomething() {
        Console.WriteLine("Hello, world!");
    }
}

--------------
Program.cs
--------------
static void Main(string[] args) {
    var obj = new MyClass();
    obj.DoSomething();
}

This file can then be used as your prompt.txt or appended to a larger task instruction for more complex AI prompts.

Advanced Usage

Model Configuration

The system includes built-in configurations for various models (max tokens, reasoning support, etc.) in aitestbed.py. If you're using a model not in the configuration, it will use sensible defaults.

Colored Output

Responses from different models are color-coded for easy identification:

OpenAI: Magenta
Claude: Cyan
Gemini: Yellow
Ollama: Green

Windows Terminal Integration

On Windows, the concurrent tester will attempt to use Windows Terminal to launch separate tabs. If not available, it will fall back to using command prompt windows.

Troubleshooting

Common Issues

API Key Errors: Double-check your apikeys.json file is in the correct location and properly formatted.
Model Not Found: Ensure you're using the correct model identifiers for each service.
Ollama Connection Errors: Make sure Ollama is running locally (ollama serve). If you've used a different/custom url or port then update run_ollama_query() and pass a base_url
Parameter Errors: Some models don't support certain parameters (e.g., reasoning_effort). The tool will attempt to filter these out automatically.

Notes

Claude models with thinking enabled will show their reasoning process before providing the final answer
The system automatically handles appropriate token limits for different models
For best results with local models, ensure Ollama is running before starting tests (kick it if it's asleep via Ollama list on a console window)

Name		Name	Last commit message	Last commit date
Latest commit History 61 Commits
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
README_GUI.md		README_GUI.md
ai_runner.py		ai_runner.py
ai_testbed_gui.py		ai_testbed_gui.py
aitestbed.py		aitestbed.py
concurrent_ai.py		concurrent_ai.py
examplesettings.json		examplesettings.json
extract_responses.py		extract_responses.py
prep_context.py		prep_context.py
prompt.example.txt		prompt.example.txt
prompt.example2.txt		prompt.example2.txt
prompt.example3.txt		prompt.example3.txt
requirements.txt		requirements.txt
requirements_gui.txt		requirements_gui.txt

Folders and files

Latest commit

History

Repository files navigation

AI Testbed

Screenshot

GUI

CLI

Overview

Setup Requirements

Dependencies

API Keys

Prompt File

Usage

Method 0: Debug usage (within an IDE like PyCharm/VSCode)

Sequentially

Concurrently

Method 1: Sequential Testing with aitestbed.py

Method 2: Concurrent Testing with concurrent_ai_test.py

Using Individual AI Runners

Supported Models

Preconfigured Models

OpenAI

Claude

Gemini

Ollama (Local Models)

Features

Example Output Snippet

Advanced Usage

Model Configuration

Colored Output

Windows Terminal Integration

Troubleshooting

Common Issues

Notes

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages