A framework for experimenting with different strategies for selecting in-context learning exemplars (few-shot examples) for LLM-based agents.
```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              run_baselines.py                               │
│                             (Main Entry Point)                              │
└─────────────────────────────────────────────────────────────────────────────┘
                                     │
          ┌──────────────────────────┼───────────────────────────┐
          ▼                          ▼                           ▼
  ┌───────────────┐         ┌───────────────┐           ┌───────────────┐
  │   BaseAgent   │◄────────│    Buffer     │◄──────────│   Exemplar    │
  │     (SLM)     │         │               │           │   Selector    │
  └───────────────┘         └───────────────┘           └───────────────┘
          │                                                     │
          ▼                                                     ▼
  ┌───────────────┐                                     ┌───────────────┐
  │ Llama/Mistral │                                     │ 5 Strategies: │
  │  via Bedrock  │                                     │ none, random, │
  └───────────────┘                                     │  semantic,    │
          │                                             │  recency,     │
          ▼                                             │  success      │
  ┌───────────────┐                                     └───────────────┘
  │    Episode    │
  │    Output     │
  └───────────────┘
          │
          ▼ (async)
  ┌───────────────┐
  │  LLM Scorer   │
  │   (Claude)    │
  │  via Bedrock  │
  └───────────────┘
```
| Component | Model | Purpose |
|---|---|---|
| SLM Agent | Llama-3-8B, Mistral-7B | Fast plan generation |
| LLM Scorer | Claude 3 Sonnet/Haiku | Async episode evaluation |
| Strategy | Description |
|---|---|
| `none` | Zero-shot (no exemplars) |
| `random` | Random sample of k episodes |
| `semantic` | k most similar by instruction embedding |
| `recency` | k most recent episodes |
| `success` | k highest-reward episodes |
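As a rough sketch of how these strategies could rank a buffer of episodes (the episode dicts and field names below are illustrative, not the repo's actual `Episode` schema):

```python
import random

def select_random(episodes, k, rng=None):
    # Uniform sample without replacement.
    rng = rng or random.Random(0)
    return rng.sample(episodes, min(k, len(episodes)))

def select_recency(episodes, k):
    # Most recently stored episodes first.
    return sorted(episodes, key=lambda e: e["timestamp"], reverse=True)[:k]

def select_success(episodes, k):
    # Highest-reward episodes first.
    return sorted(episodes, key=lambda e: e["reward"], reverse=True)[:k]

def select_semantic(episodes, k, query_vec):
    # Cosine similarity between stored instruction embeddings and the query.
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb)
    return sorted(episodes, key=lambda e: cos(e["embedding"], query_vec), reverse=True)[:k]
```

The `none` strategy simply returns an empty list, yielding a zero-shot prompt.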
```bash
# Clone the repository
git clone <repo-url>
cd OptEx

# Install dependencies
pip install boto3 numpy sentence-transformers

# Configure AWS credentials for Bedrock
aws configure
```

Quick run with mock LLM calls (no AWS needed):

```bash
python run_baselines.py --baseline random --episodes 50 --use-mock
```

Full run against Bedrock:

```bash
python run_baselines.py \
  --baseline semantic \
  --episodes 100 \
  --k 4 \
  --slm-model llama-3-8b \
  --scorer-model claude-3-sonnet \
  --region us-west-2
```

```
--baseline            Exemplar selection strategy [none, random, semantic, recency, success]
--episodes            Number of episodes to run (default: 100)
--k                   Number of exemplars in context (default: 4)
--seed                Random seed (default: 0)
--save                Output file path (default: runs/baseline_results.jsonl)
--slm-model           SLM for plan generation [llama-3-8b, llama-3-70b, mistral-7b, mixtral-8x7b]
--scorer-model        LLM for scoring [claude-3-haiku, claude-3-sonnet, claude-3-opus, claude-3.5-sonnet]
--region              AWS region for Bedrock (default: us-west-2)
--use-mock            Use mock LLM calls (for testing)
--score-batch-size    Batch size for async scoring (default: 10)
--max-concurrency     Max concurrent scoring requests (default: 5)
--embedder            Sentence transformer model for semantic search
```
```
OptEx/
├── run_baselines.py              # Main entry point
├── optex/
│   └── src/
│       ├── agent/                # Agent implementation
│       │   └── _base_agent.py
│       ├── buffer/               # Episode storage
│       │   └── _base_exemplar_buffer.py
│       ├── episode/              # Episode data structure
│       │   └── _base_episode.py
│       ├── exemplars/            # Selection strategies
│       │   ├── _base_exemplar.py
│       │   ├── _no_exemplars.py
│       │   ├── _random_exemplar.py
│       │   ├── _semantic_exemplar.py
│       │   ├── _recency_exemplar.py
│       │   └── _success_only_exemplar.py
│       ├── llm/                  # Bedrock LLM integration
│       │   ├── _bedrock_client.py   # Unified Bedrock client
│       │   ├── _slm_agent.py        # SLM for plan generation
│       │   └── _llm_scorer.py       # Async LLM scorer
│       └── scorer/               # Dataset difficulty estimation
│           └── generation/
│               └── difficulty_estimation.py
```
For each episode:

1. Sample Instruction - Get a task from the benchmark
2. Select Exemplars - Choose k relevant past episodes using the selected strategy
3. Build Prompt - Construct the prompt from exemplars + current instruction
4. Generate Plan - The SLM (Llama/Mistral) generates an action plan
5. Execute Actions - Run the plan in the environment (mock or real)
6. Store Episode - Add to the buffer for future exemplar selection
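The loop above can be sketched as follows; `select_exemplars`, `generate_plan`, and `execute` are illustrative placeholders, not the repo's actual API (the reward is attached later by the async scorer):

```python
# Hypothetical sketch of the per-episode loop; names are placeholders.
def run_episode(benchmark, buffer, select_exemplars, generate_plan, execute, k=4):
    instruction = benchmark.pop(0)                        # 1. sample a task
    exemplars = select_exemplars(buffer, instruction, k)  # 2. choose k past episodes
    prompt = "\n\n".join(                                 # 3. build the prompt
        e["instruction"] + "\n" + e["plan"] for e in exemplars
    )
    prompt += "\n\n" + instruction
    plan = generate_plan(prompt)                          # 4. SLM generates a plan
    actions = execute(plan)                               # 5. run it in the environment
    episode = {"instruction": instruction, "plan": plan, "actions": actions}
    buffer.append(episode)                                # 6. store for future selection
    return episode
```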
Episodes are scored in batches using Claude via Bedrock:
- Evaluates task completion, plan quality, action relevance, efficiency
- Returns reward (0-1) with reasoning
- Non-blocking to maximize throughput
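A minimal sketch of bounded-concurrency batch scoring with `asyncio`; `score_one` stands in for the actual Bedrock call and is not OptEx's real scorer API:

```python
import asyncio

async def score_batch(episodes, score_one, max_concurrency=5):
    # Bound the number of in-flight scoring requests.
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(ep):
        async with sem:
            return await score_one(ep)

    # gather() returns results in input order even though calls overlap.
    return await asyncio.gather(*(bounded(ep) for ep in episodes))
```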
The episode buffer:

- FIFO queue with a configurable max size (default: 2000)
- Stores instruction, plan, actions, reward, timestamp, and embedding
- Enables experience replay for in-context learning
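A FIFO buffer with eviction can be sketched with a bounded `deque`; the class and method names here are illustrative, not the repo's actual buffer API:

```python
from collections import deque

class EpisodeBuffer:
    # FIFO: once max_size is reached, appending evicts the oldest episode.
    def __init__(self, max_size=2000):
        self._episodes = deque(maxlen=max_size)

    def append(self, episode):
        self._episodes.append(episode)

    def all(self):
        return list(self._episodes)
```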
To add a custom exemplar selection strategy, subclass `BaseExemplar`:

```python
# optex/src/exemplars/_my_strategy.py
from optex.src.exemplars import BaseExemplar

class MyStrategy(BaseExemplar):
    def select(self, buffer, instruction):
        eps = buffer.all()
        # Your selection logic here
        selected_episodes = eps
        return selected_episodes[:self.k]
```

To use a different SLM for plan generation:

```python
from optex.src.llm import BedrockClient, SLMAgent

client = BedrockClient(region_name="us-west-2")
agent = SLMAgent(
    bedrock_client=client,
    model_name="mistral-7b",  # or any Bedrock model ID
    temperature=0.7,
)
```

To customize the scorer:

```python
from optex.src.llm import LLMScorer

scorer = LLMScorer(
    model_name="claude-3-sonnet",
    scoring_prompt="Your custom evaluation prompt...",
)
```

See LICENSE file.