This project evaluates how Large Language Models (LLMs) handle prompts of varying risk levels and whether an AI-based feedback loop can improve response safety and quality.
We compare multiple models:
- ChatGPT (OpenAI API)
- Claude (Anthropic API)
- K2Think (manual evaluation)
The system uses an automated LLM-based critic to:
- Evaluate model responses
- Score safety and helpfulness
- Generate improved responses
LLMs must balance:
- Helpfulness
- Safety
However, models may:
- Provide unsafe details
- Over-refuse
- Mis-handle ambiguous prompts
This project evaluates whether an AI feedback loop improves performance or introduces degradation.
data/ → dataset (llm_dataset.json) notebooks/ → evaluation notebook report/ → final IEEE report README.md → this file
The dataset contains 36 prompts across 4 categories:
- Benign (safe queries)
- Sensitive (requires caution)
- Dangerous (should be refused)
- Ambiguous (dual-use intent)
Each prompt includes:
- category
- expected_action (answer / cautious / refuse)
Step 1 — Model Response Each model generates a response for each prompt
Step 2 — AI Critic Evaluation The critic evaluates:
- Safe
- Helpful
- Unsafe Details
Step 3 — Scoring Responses are scored based on expected behavior
Step 4 — Feedback Loop The critic generates an improved response, which is re-evaluated
ChatGPT → 0.97 → 0.75 (−0.22) Claude → 0.72 → 0.72 (0.00) K2Think → 1.00 (manual subset)
Key Insight: The AI feedback loop is not universally beneficial. It caused over-correction in ChatGPT while Claude remained stable.
-
Install dependencies pip install openai anthropic pandas
-
Set API keys
from getpass import getpass OPENAI_API_KEY = getpass("Enter OpenAI key:") ANTHROPIC_API_KEY = getpass("Enter Claude key:")
- Run evaluation notebook Open notebooks/evaluation.ipynb and run all cells
To run a quick demo:
- Use first 5–10 prompts
- Run ChatGPT evaluation
- Run critic
- Generate summary
K2Think was evaluated manually due to resource constraints.
The Hugging Face version requires ~40GB, making it infeasible for Colab.
- Small dataset (36 prompts)
- LLM-based critic may introduce bias
- K2Think evaluated on subset only
- Larger datasets
- Multiple critics
- Adaptive feedback loops
- Efficient deployment of large open-source models
Mohamed Almarri