A framework for experimenting with different strategies for selecting in-context learning exemplars (few-shot examples) for LLM-based agents.
```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              run_baselines.py                               │
│                             (Main Entry Point)                              │
└─────────────────────────────────────────────────────────────────────────────┘
                                     │
          ┌──────────────────────────┼───────────────────────────┐
          ▼                          ▼                           ▼
  ┌───────────────┐         ┌───────────────┐           ┌───────────────┐
  │   BaseAgent   │◄────────│    Buffer     │◄──────────│   Exemplar    │
  │     (SLM)     │         │               │           │   Selector    │
  └───────────────┘         └───────────────┘           └───────────────┘
          │                                                     │
          ▼                                                     ▼
  ┌───────────────┐                                     ┌───────────────┐
  │ Llama/Mistral │                                     │ 5 Strategies: │
  │  via Bedrock  │                                     │ none, random, │
  └───────────────┘                                     │  semantic,    │
          │                                             │  recency,     │
          ▼                                             │  success      │
  ┌───────────────┐                                     └───────────────┘
  │    Episode    │
  │    Output     │
  └───────────────┘
          │
          ▼ (async)
  ┌───────────────┐
  │  LLM Scorer   │
  │   (Claude)    │
  │  via Bedrock  │
  └───────────────┘
```
| Component | Model | Purpose |
|---|---|---|
| SLM Agent | Llama-3-8B, Mistral-7B | Fast plan generation |
| LLM Scorer | Claude 3 Sonnet/Haiku | Async episode evaluation |
| Strategy | Description |
|---|---|
| `none` | Zero-shot (no exemplars) |
| `random` | Random sample of k episodes |
| `semantic` | k most similar by instruction embedding |
| `recency` | k most recent episodes |
| `success` | k highest-reward episodes |
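As a rough sketch of how these strategies could rank a buffer of episodes (the episode dicts and field names below are illustrative, not the repo's actual `Episode` schema):

```python
import random

def select_random(episodes, k, rng=None):
    # Uniform sample without replacement.
    rng = rng or random.Random(0)
    return rng.sample(episodes, min(k, len(episodes)))

def select_recency(episodes, k):
    # Most recently stored episodes first.
    return sorted(episodes, key=lambda e: e["timestamp"], reverse=True)[:k]

def select_success(episodes, k):
    # Highest-reward episodes first.
    return sorted(episodes, key=lambda e: e["reward"], reverse=True)[:k]

def select_semantic(episodes, k, query_vec):
    # Cosine similarity between stored instruction embeddings and the query.
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb)
    return sorted(episodes, key=lambda e: cos(e["embedding"], query_vec), reverse=True)[:k]
```

The `none` strategy simply returns an empty list, yielding a zero-shot prompt.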
```bash
# Clone the repository
git clone <repo-url>
cd OptEx

# Install dependencies
pip install boto3 numpy sentence-transformers

# Configure AWS credentials for Bedrock
aws configure
```

Quick run with mock LLM calls (no AWS needed):

```bash
python run_baselines.py --baseline random --episodes 50 --use-mock
```

Full run against Bedrock:

```bash
python run_baselines.py \
  --baseline semantic \
  --episodes 100 \
  --k 4 \
  --slm-model llama-3-8b \
  --scorer-model claude-3-sonnet \
  --region us-west-2
```

```
--baseline            Exemplar selection strategy [none, random, semantic, recency, success]
--episodes            Number of episodes to run (default: 100)
--k                   Number of exemplars in context (default: 4)
--seed                Random seed (default: 0)
--save                Output file path (default: runs/baseline_results.jsonl)
--slm-model           SLM for plan generation [llama-3-8b, llama-3-70b, mistral-7b, mixtral-8x7b]
--scorer-model        LLM for scoring [claude-3-haiku, claude-3-sonnet, claude-3-opus, claude-3.5-sonnet]
--region              AWS region for Bedrock (default: us-west-2)
--use-mock            Use mock LLM calls (for testing)
--score-batch-size    Batch size for async scoring (default: 10)
--max-concurrency     Max concurrent scoring requests (default: 5)
--embedder            Sentence transformer model for semantic search
```
```
OptEx/
├── run_baselines.py              # Main entry point
├── optex/
│   └── src/
│       ├── agent/                # Agent implementation
│       │   └── _base_agent.py
│       ├── buffer/               # Episode storage
│       │   └── _base_exemplar_buffer.py
│       ├── episode/              # Episode data structure
│       │   └── _base_episode.py
│       ├── exemplars/            # Selection strategies
│       │   ├── _base_exemplar.py
│       │   ├── _no_exemplars.py
│       │   ├── _random_exemplar.py
│       │   ├── _semantic_exemplar.py
│       │   ├── _recency_exemplar.py
│       │   └── _success_only_exemplar.py
│       ├── llm/                  # Bedrock LLM integration
│       │   ├── _bedrock_client.py   # Unified Bedrock client
│       │   ├── _slm_agent.py        # SLM for plan generation
│       │   └── _llm_scorer.py       # Async LLM scorer
│       └── scorer/               # Dataset difficulty estimation
│           └── generation/
│               └── difficulty_estimation.py
```
For each episode:

1. Sample Instruction - Get a task from the benchmark
2. Select Exemplars - Choose k relevant past episodes using the selected strategy
3. Build Prompt - Construct the prompt from exemplars + current instruction
4. Generate Plan - The SLM (Llama/Mistral) generates an action plan
5. Execute Actions - Run the plan in the environment (mock or real)
6. Store Episode - Add to the buffer for future exemplar selection
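The loop above can be sketched as follows; `select_exemplars`, `generate_plan`, and `execute` are illustrative placeholders, not the repo's actual API (the reward is attached later by the async scorer):

```python
# Hypothetical sketch of the per-episode loop; names are placeholders.
def run_episode(benchmark, buffer, select_exemplars, generate_plan, execute, k=4):
    instruction = benchmark.pop(0)                        # 1. sample a task
    exemplars = select_exemplars(buffer, instruction, k)  # 2. choose k past episodes
    prompt = "\n\n".join(                                 # 3. build the prompt
        e["instruction"] + "\n" + e["plan"] for e in exemplars
    )
    prompt += "\n\n" + instruction
    plan = generate_plan(prompt)                          # 4. SLM generates a plan
    actions = execute(plan)                               # 5. run it in the environment
    episode = {"instruction": instruction, "plan": plan, "actions": actions}
    buffer.append(episode)                                # 6. store for future selection
    return episode
```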
Episodes are scored in batches using Claude via Bedrock:
- Evaluates task completion, plan quality, action relevance, efficiency
- Returns reward (0-1) with reasoning
- Non-blocking to maximize throughput
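A minimal sketch of bounded-concurrency batch scoring with `asyncio`; `score_one` stands in for the actual Bedrock call and is not OptEx's real scorer API:

```python
import asyncio

async def score_batch(episodes, score_one, max_concurrency=5):
    # Bound the number of in-flight scoring requests.
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(ep):
        async with sem:
            return await score_one(ep)

    # gather() returns results in input order even though calls overlap.
    return await asyncio.gather(*(bounded(ep) for ep in episodes))
```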
The episode buffer:

- FIFO queue with a configurable max size (default: 2000)
- Stores instruction, plan, actions, reward, timestamp, and embedding
- Enables experience replay for in-context learning
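A FIFO buffer with eviction can be sketched with a bounded `deque`; the class and method names here are illustrative, not the repo's actual buffer API:

```python
from collections import deque

class EpisodeBuffer:
    # FIFO: once max_size is reached, appending evicts the oldest episode.
    def __init__(self, max_size=2000):
        self._episodes = deque(maxlen=max_size)

    def append(self, episode):
        self._episodes.append(episode)

    def all(self):
        return list(self._episodes)
```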
To add a custom exemplar selection strategy, subclass `BaseExemplar`:

```python
# optex/src/exemplars/_my_strategy.py
from optex.src.exemplars import BaseExemplar

class MyStrategy(BaseExemplar):
    def select(self, buffer, instruction):
        eps = buffer.all()
        # Your selection logic here
        selected_episodes = eps
        return selected_episodes[:self.k]
```

To use a different SLM for plan generation:

```python
from optex.src.llm import BedrockClient, SLMAgent

client = BedrockClient(region_name="us-west-2")
agent = SLMAgent(
    bedrock_client=client,
    model_name="mistral-7b",  # or any Bedrock model ID
    temperature=0.7,
)
```

To customize the scorer:

```python
from optex.src.llm import LLMScorer

scorer = LLMScorer(
    model_name="claude-3-sonnet",
    scoring_prompt="Your custom evaluation prompt...",
)
```

See LICENSE file.