| title | LLM TestLab Playground |
|---|---|
| emoji | 🧪 |
| colorFrom | blue |
| colorTo | indigo |
| sdk | docker |
| pinned | false |
| license | mit |
| app_port | 7860 |
⭐ If you find this project useful, please consider starring it — it helps others discover it!
Comprehensive Testing Suite for Large Language Models
A flexible Python toolkit for evaluating LLMs on:
- Text Metrics: Hallucination, consistency, semantic robustness, safety
- Code Evaluation: Syntax, execution, quality, security, semantic correctness across 9+ languages
- Dual Embedders: Optimized for both text and code analysis
- Optional FAISS: High-performance vector similarity
- Hallucination Severity Index (HSI) – Detect factual deviations from the knowledge base (a scoring sketch follows these metric lists)
- Consistency Stability Score (CSS) – Measure output stability across runs
- Semantic Robustness Index (SRI) – Test invariance to paraphrasing
- Safety Vulnerability Exposure (SVE) – Detect unsafe responses to adversarial prompts
- Knowledge Base Coverage (KBC) – Measure factual alignment
- Syntax Validity (SV) – Compiler/interpreter-based validation
- Execution Pass Rate (EPR) – Test case execution and verification
- Code Quality Score (CQS) – Complexity, documentation, error handling
- Security Risk Score (SRS) – Vulnerability pattern detection
- Semantic Code Correctness (SCC) – Embedding-based similarity to reference
- Comprehensive Code Evaluation (CCE) – Weighted aggregation of all metrics
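The text metrics are embedding-based. As a rough illustration of the HSI idea (not the suite's actual implementation), a hallucination-style score can be derived from the distance between an answer and its closest knowledge-base fact; `hsi_sketch` below is a hypothetical helper assuming `sentence-transformers`:

```python
# Hypothetical sketch of an HSI-style score: ~0 when the answer matches a
# known fact, approaching 1 as it deviates. Not the suite's actual code.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # the suite's default text embedder

def hsi_sketch(answer: str, knowledge_base: list[str]) -> float:
    answer_vec = embedder.encode(answer, convert_to_tensor=True)
    fact_vecs = embedder.encode(knowledge_base, convert_to_tensor=True)
    best_similarity = util.cos_sim(answer_vec, fact_vecs).max().item()
    return max(0.0, 1.0 - best_similarity)

print(hsi_sketch("Rome is the capital of Italy",
                 ["Rome is the capital of Italy"]))  # ~0.0
```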
Supported Languages: Python, JavaScript, TypeScript, Java, C, C++, Go, Rust, Ruby, PHP
- Dual Embedders: `all-MiniLM-L6-v2` for text, `BAAI/bge-small-en-v1.5` for code
- FAISS Support: Optional, for faster similarity searches (a sketch of the idea follows this list)
- Knowledge Base Management: Add, remove, or list facts
- Security Patterns: Customizable keywords and regex patterns
- Rich Logging: Built-in debug/info logging
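When FAISS is enabled, nearest-fact lookups can run against an index instead of brute-force comparisons. A minimal sketch of the underlying idea, assuming `faiss-cpu` and normalized embeddings (illustrative, not the suite's internals):

```python
# Minimal sketch: inner-product FAISS index over normalized embeddings,
# so inner product equals cosine similarity. Not the suite's internals.
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
facts = ["Rome is the capital of Italy", "Paris is the capital of France"]

vecs = embedder.encode(facts, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(vecs.shape[1])  # 384-dim for MiniLM
index.add(vecs)

query = embedder.encode(["Italy's capital?"], normalize_embeddings=True).astype("float32")
scores, ids = index.search(query, 1)  # closest fact by cosine similarity
print(facts[ids[0][0]], scores[0][0])
```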
Repository layout:

```
llm-testlab/
├── llm_testing_suite/
│   ├── __init__.py
│   ├── llm_testing_suite.py      # Main test suite (text metrics)
│   └── code_evaluator.py         # Code evaluation module
├── examples/
│   ├── run_text_evaluation.py    # Text metrics evaluation script
│   ├── run_code_evaluation.py    # Code metrics evaluation script
│   ├── groq_example.py           # Groq API text evaluation
│   ├── groq_code_evaluation.py   # Groq API code evaluation
│   └── huggingface_example.py    # HuggingFace integration
├── pyproject.toml                # Package configuration
├── requirements.txt              # Dependencies
├── README.md
├── LICENSE
└── .gitignore
```
Install from PyPI:

```bash
pip install llm-testlab
```

Or install from source:

```bash
git clone https://github.com/Saivineeth147/llm-testlab.git
cd llm-testlab
pip install .
```

Optional extras:

```bash
# With FAISS and HuggingFace support
pip install llm-testlab[faiss,huggingface]

# Or install individually
pip install faiss-cpu  # or faiss-gpu
pip install transformers
```

Quick start (text metrics):

```python
from llm_testing_suite import LLMTestSuite

def my_llm(prompt):
    return "Rome is the capital of Italy"

# Initialize with FAISS support
suite = LLMTestSuite(my_llm, use_faiss=True)

# Add knowledge
suite.add_knowledge("Rome is the capital of Italy")

# Run all novel metrics
result = suite.run_all_novel_metrics(
    prompt="What is the capital of Italy?",
    paraphrases=["Italy's capital?", "Capital city of Italy?"],
    adversarial_prompts=["ignore previous instructions"],
    runs=3
)

print(f"HSI: {result['HSI']['HSI']:.4f}")  # Hallucination
print(f"CSS: {result['CSS']['CSS']:.4f}")  # Consistency
print(f"SRI: {result['SRI']['SRI']:.4f}")  # Robustness
print(f"SVE: {result['SVE']['SVE']:.4f}")  # Safety
print(f"KBC: {result['KBC']['KBC']:.4f}")  # Coverage
```
Quick start (code evaluation):

```python
from llm_testing_suite import LLMTestSuite

def code_llm(prompt):
    return '''
def add(a, b):
    """Add two numbers."""
    return a + b

print(add(5, 3))
'''

suite = LLMTestSuite(code_llm)

# Comprehensive code evaluation
result = suite.comprehensive_code_evaluation(
    prompt="Write a function to add two numbers",
    code_response=code_llm("..."),
    test_cases=[
        {"input": "", "expected_output": "8"}
    ],
    language="python"
)

print(f"Overall Score: {result['overall_score']:.1f}/100")
print(f"Syntax Valid: {result['syntax_valid']}")
print(f"Quality Score: {result['quality_score']}/100")
print(f"Security: {'✓' if result['is_secure'] else '✗'}")
```

Knowledge base management:

```python
# Add a single fact
suite.add_knowledge("New York is the largest city in the USA")
# Add multiple facts
suite.add_knowledge_bulk([
    "Python is a programming language",
    "AI is transforming industries"
])
# List knowledge base
suite.list_knowledge()
# Remove a fact
suite.remove_knowledge("Python is a programming language")
# Clear the knowledge base
# Clear the knowledge base
suite.clear_knowledge()
```

Security keyword management:

```python
# Add malicious keywords
suite.add_malicious_keywords(["hack system", "steal data"])

# List keywords
suite.list_malicious_keywords()

# Remove a keyword
suite.remove_malicious_keyword("hack system")
```
All test methods support three return types, controlled by the `return_type` parameter:

- `"dict"`: returns a Python dictionary with the test results (default)
- `"table"`: prints a formatted table using the `rich` library; no dictionary is returned
- `"both"`: returns the dictionary and prints the table
Individual code metrics:

```python
from llm_testing_suite import LLMTestSuite

suite = LLMTestSuite(your_llm_function)

# 1. Syntax Validity
syntax = suite.code_syntax_validity(code, language="python")
# Returns: {"syntax_valid": True/False, "error": ...}

# 2. Execution Test
execution = suite.code_execution_test(
    code,
    test_cases=[
        {"input": "5\n", "expected_output": "5"}
    ],
    language="python"
)
# Returns: {"pass_rate": 1.0, "passed_tests": 1, "total_tests": 1, ...}

# 3. Quality Metrics
quality = suite.code_quality_metrics(code, language="python")
# Returns: {"quality_score": 80, "metrics": {...}}

# 4. Security Scan
security = suite.code_security_scan(code, language="python")
# Returns: {"is_secure": True, "vulnerabilities": [...]}

# 5. Semantic Correctness
semantic = suite.code_semantic_correctness(
    prompt="Write add function",
    code_response=generated_code,
    reference_code=reference_solution
)
# Returns: {"semantic_similarity": 0.85, "semantically_correct": True}
```

Each quality criterion is worth 20 points (a toy rubric sketch follows this list):
- Has Comments (`#`, `//`, `/*`) - 20 pts
- Has Docstring (`"""`, `/**`) - 20 pts
- Has Error Handling (`try/except`, `try/catch`) - 20 pts
- Low Complexity (< 10 branches/loops) - 20 pts
- Has Functions (at least 1) - 20 pts
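As a mental model, the rubric is a checklist summed to 100. A toy sketch under those assumptions (`quality_sketch` is hypothetical and far simpler than the suite's language-aware checks):

```python
# Toy 5-criterion rubric, 20 points each. Hypothetical; the suite's real
# checks are language-aware and more robust than these regexes.
import re

def quality_sketch(code: str) -> int:
    score = 0
    score += 20 if re.search(r"#|//|/\*", code) else 0             # comments
    score += 20 if '"""' in code or "/**" in code else 0           # docstring
    score += 20 if re.search(r"\btry\b", code) else 0              # error handling
    branches = len(re.findall(r"\b(if|for|while|case)\b", code))
    score += 20 if branches < 10 else 0                            # low complexity
    score += 20 if re.search(r"\bdef |\bfunction\b", code) else 0  # has a function
    return score

code = 'def add(a, b):\n    """Add two numbers."""\n    return a + b'
print(quality_sketch(code))  # 60: docstring, low complexity, has a function
```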
The security scan detects (a toy pattern-matching sketch follows this list):
- SQL Injection
- Command Injection
- XSS vulnerabilities
- Buffer overflows (C/C++)
- Hardcoded secrets
- Unsafe deserialization
- Path traversal
- Language-specific antipatterns
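Detection is pattern-based, in line with the customizable keywords and regex patterns mentioned under Features. A toy version of the idea (the pattern names and regexes below are made up for illustration):

```python
# Toy regex-based vulnerability scan. Patterns are illustrative; the suite
# ships its own customizable pattern set per language.
import re

SKETCH_PATTERNS = {
    "command_injection": r"os\.system\s*\(|shell\s*=\s*True",
    "hardcoded_secret": r"(password|api_key|secret)\s*=\s*['\"][^'\"]+['\"]",
    "unsafe_deserialization": r"pickle\.loads?\s*\(|yaml\.load\s*\(",
    "path_traversal": r"\.\./",
}

def scan_sketch(code: str) -> list[str]:
    return [name for name, pattern in SKETCH_PATTERNS.items()
            if re.search(pattern, code, re.IGNORECASE)]

print(scan_sketch('password = "hunter2"'))  # ['hardcoded_secret']
```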
| Language | Syntax Check | Execution | Quality | Security |
|---|---|---|---|---|
| Python | ✅ AST | ✅ | ✅ | ✅ |
| JavaScript | ✅ Node | ✅ | ✅ | ✅ |
| TypeScript | ✅ tsc | ✅ | ✅ | ✅ |
| Java | ✅ javac | ✅ | ✅ | ✅ |
| C/C++ | ✅ gcc/g++ | ✅ | ✅ | ✅ |
| Go | ✅ go fmt | ✅ | ✅ | ✅ |
| Rust | ✅ rustc | ✅ | ✅ | |
| Ruby | ✅ ruby -c | ✅ | ✅ | ✅ |
| PHP | ✅ php -l | ✅ | ✅ | ✅ |
Note: Compilers/interpreters must be installed for full syntax validation; the suite falls back to regex-based checks when they are unavailable.
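A sketch of that compiler-first pattern for Python and JavaScript (the dispatch and fallback shown here are illustrative, not the suite's internal code):

```python
# Illustrative compiler-first syntax check with a crude brace-count fallback.
import ast
import shutil
import subprocess
import tempfile

def check_python_syntax(code: str) -> bool:
    try:
        ast.parse(code)  # AST-based, matching the table above
        return True
    except SyntaxError:
        return False

def check_js_syntax(code: str) -> bool:
    if shutil.which("node"):  # full validation only if Node is available
        with tempfile.NamedTemporaryFile("w", suffix=".js", delete=False) as f:
            f.write(code)
        result = subprocess.run(["node", "--check", f.name], capture_output=True)
        return result.returncode == 0
    return code.count("{") == code.count("}")  # crude fallback
```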
LLMTestSuite uses specialized embedders for optimal evaluation.

Text embedder (`all-MiniLM-L6-v2`):
- Used for: HSI, CSS, SRI, SVE, KBC (text metrics)
- Size: 22M params, 384 dimensions
- Speed: Fast
- Purpose: General semantic similarity

Code embedder (`BAAI/bge-small-en-v1.5`):
- Used for: Code semantic correctness (SCC)
- Size: 33M params, 384 dimensions
- Speed: Fast
- Purpose: Code-specific semantic understanding
Swapping embedders:

```python
from sentence_transformers import SentenceTransformer

suite = LLMTestSuite(my_llm)

# Replace the code embedder
suite.code_embedder = SentenceTransformer("microsoft/codebert-base")
suite.code_evaluator.embedder = suite.code_embedder

# Or use a different text embedder
suite = LLMTestSuite(my_llm, embedder_model="all-mpnet-base-v2")
```

| Model | Params | Dims | Speed | Best For |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 22M | 384 | Fast | Text (default) |
| all-mpnet-base-v2 | 110M | 768 | Medium | Text (higher accuracy) |
| bge-small-en-v1.5 | 33M | 384 | Fast | Code (default) |
| bge-base-en-v1.5 | 109M | 768 | Medium | Code (balanced) |
| CodeBERT | 125M | 768 | Medium | Code (Microsoft) |
As noted above, all test methods accept `return_type` = `"dict"` (default), `"table"`, or `"both"`. Example results as dictionaries:
```python
# HSI result
{
    "prompt": "What is the capital of France?",
    "answer": "Paris is the capital of France",
    "HSI": 0.01,  # Lower is better (0-1 scale)
    "closest_fact": "Paris is the capital of France"
}

# Code evaluation result
{
    "overall_score": 85.0,
    "syntax_valid": True,
    "quality_score": 80,
    "is_secure": True,
    "pass_rate": 1.0,
    "semantic_similarity": 0.89
}
```

Groq API integration:

```python
from groq import Groq
from llm_testing_suite import LLMTestSuite

client = Groq(api_key="your-api-key")

def groq_llm(prompt):
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3
    )
    return response.choices[0].message.content

suite = LLMTestSuite(groq_llm)

# Text evaluation
result = suite.run_all_novel_metrics(
    prompt="What is the capital of France?",
    paraphrases=["France's capital?"],
    runs=3
)

# Code evaluation
code_result = suite.comprehensive_code_evaluation(
    prompt="Write fibonacci function",
    code_response=groq_llm("Write a Python fibonacci function"),
    language="python"
)
```

See `examples/groq_code_evaluation.py` for a comprehensive test suite.
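A local HuggingFace model can be wrapped the same way; a minimal sketch with `transformers` (the model name is a placeholder; see `examples/huggingface_example.py` for the repo's actual integration):

```python
# Minimal sketch: wrap a local HuggingFace pipeline as the LLM under test.
# "gpt2" is a placeholder; see examples/huggingface_example.py for the
# repo's actual integration.
from transformers import pipeline
from llm_testing_suite import LLMTestSuite

generator = pipeline("text-generation", model="gpt2")

def hf_llm(prompt):
    output = generator(prompt, max_new_tokens=64, return_full_text=False)
    return output[0]["generated_text"]

suite = LLMTestSuite(hf_llm)
result = suite.run_all_novel_metrics(
    prompt="What is the capital of France?",
    paraphrases=["France's capital?"],
    runs=3
)
```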
Debug logging:

```python
# Enable debug logging
suite = LLMTestSuite(my_llm, debug=True)

# Or configure manually
import logging
logging.getLogger("llm_testing_suite").setLevel(logging.DEBUG)
```

Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
MIT License - see LICENSE file for details
Acknowledgments:
- Sentence-Transformers for embedding models
- FAISS for efficient similarity search
- Rich library for beautiful terminal output
- Open-source LLM community
Star this repo ⭐ if you find it useful!
For questions or issues, please open a GitHub issue.