AI agent teams race to approximate a mystery function, each using a different embedding model to search 1,468 ArXiv papers for numerical methods techniques.
Built with Mastra, the Vercel AI SDK, and Next.js.
Three teams of AI agents compete to approximate an unknown function f(x) on [-5, 5]. Each team gets 64 sample points and must build the best approximation — scored on accuracy x speed. The teams search the same corpus of 1,468 numerical methods papers, but each uses a different embedding model for retrieval:
| Team | Embedding Model | Max Iterations | Agents |
|---|---|---|---|
| ZeroEntropy | zembed-1 (2560-dim) | 64 | 3 |
| OpenAI | text-embedding-3-small (1536-dim) | 64 | 3 |
| Cohere | embed-english-v3.0 (1024-dim) | 64 | 3 |
Same corpus, same LLM, same iteration budget, same scoring — the only variable is which embedding model powers the search.
The dashboard shows the race live with score charts and function approximation plots updating in real time.
```
f(x) = sin(2x^2 + 3x) * exp(-0.08x^2)   // chirp — frequency sweeps with x
     + 2/(1 + 100*(x-1.7)^2)            // sharp spike — Runge-like pole
     + 0.4*|sin(3x)|                    // periodic kinks — non-differentiable
     + 1.5*exp(-8*(x+3)^2)*cos(25x)     // localized burst — high-freq near x=-3
     + 0.6*tanh(15*(x-3.5))             // steep step — sigmoid transition
```
This function is designed to be hard: it combines features that break naive polynomial interpolation. Agents need to discover techniques like Chebyshev node placement, barycentric interpolation, and piecewise/rational methods from the retrieved papers.
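For illustration, here is a minimal sketch of one such technique: barycentric interpolation at Chebyshev nodes of the second kind. The function names and structure are illustrative only, not the repo's actual code.

```typescript
// Chebyshev points of the second kind mapped from [-1, 1] to [a, b].
function chebyshevNodes(n: number, a: number, b: number): number[] {
  const xs: number[] = [];
  for (let j = 0; j < n; j++) {
    const t = Math.cos((Math.PI * j) / (n - 1)); // node in [-1, 1]
    xs.push(a + ((t + 1) / 2) * (b - a));
  }
  return xs;
}

// Barycentric weights for Chebyshev points of the second kind:
// w_j = (-1)^j, halved at the two endpoints.
function barycentricWeights(n: number): number[] {
  return Array.from({ length: n }, (_, j) => {
    let w = j % 2 === 0 ? 1 : -1;
    if (j === 0 || j === n - 1) w /= 2;
    return w;
  });
}

// Second (true) form of the barycentric formula: numerically stable,
// O(n) per evaluation once the weights are precomputed.
function barycentricEval(
  x: number,
  xs: number[],
  ys: number[],
  ws: number[]
): number {
  let num = 0;
  let den = 0;
  for (let j = 0; j < xs.length; j++) {
    const diff = x - xs[j];
    if (diff === 0) return ys[j]; // x coincides with a node
    const q = ws[j] / diff;
    num += q * ys[j];
    den += q;
  }
  return num / den;
}
```

With 64 Chebyshev nodes, smooth components of f(x) converge geometrically; the kinks and the Runge-like spike still need piecewise or rational treatment, which is why retrieval quality matters.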
```
digits = -log10(mean_absolute_error)   // averaged over 10,001 test points, capped to [0, 15]
speed  = ops_per_second / baseline     // relative to naive 64-point Lagrange interpolation
score  = digits * sqrt(speed)
```
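The scoring formula can be sketched as follows (function names are illustrative, not the repo's eval.ts):

```typescript
// digits = -log10(MAE), clamped to [0, 15].
function digitsOfAccuracy(meanAbsError: number): number {
  const d = -Math.log10(meanAbsError);
  return Math.min(15, Math.max(0, d));
}

// score = digits * sqrt(speed), where speed is relative to the baseline.
function raceScore(
  meanAbsError: number,
  opsPerSecond: number,
  baselineOps: number
): number {
  const speed = opsPerSecond / baselineOps;
  return digitsOfAccuracy(meanAbsError) * Math.sqrt(speed);
}
```

The square root dampens the speed term, so a 4x speedup only doubles the score while each extra digit of accuracy adds linearly; accuracy dominates.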
- `download_arxiv.py` — Fetches metadata for 1,468 papers on numerical methods from the ArXiv API
- `download_full_papers.py` — Downloads full LaTeX source for each paper (1,345 succeeded)
- `src/scripts/build-chunks.ts` — Chunks papers at newline boundaries into ~1500-char segments (69,602 chunks)
- `src/scripts/pre-embed.ts` / `embed_ze_modal.py` — Embeds all chunks with all 3 providers, saves to binary files on disk
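The chunking step can be sketched as a simple newline-boundary accumulator (this is an assumption about the approach; the actual build-chunks.ts may differ in detail):

```typescript
// Accumulate whole lines until a chunk approaches the target size,
// so chunks only ever break at newline boundaries.
function chunkText(text: string, target = 1500): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const line of text.split("\n")) {
    // Flush the current chunk before it would exceed the target.
    if (current.length + line.length + 1 > target && current.length > 0) {
      chunks.push(current);
      current = "";
    }
    current += (current ? "\n" : "") + line;
  }
  if (current) chunks.push(current);
  return chunks;
}
```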
- Server loads 69,602 pre-computed embeddings per provider into memory (~1.4 GB total)
- User clicks "Start Race"
- For each of 3 providers, 3 Mastra agents launch in parallel (9 agents total)
- Each agent loop: `generate()` -> LLM calls tools (maxSteps=5) -> `searchPapers` (embed query, cosine similarity, return top-5 chunks) -> `evalCode` (run candidate function, measure accuracy + speed) -> repeat
- Agents within a team share a notebook: best code so far, findings from searches, scores
- Dashboard polls `/api/status` every 1s and `/api/plot` every 3s
- Score chart shows best-so-far (solid lines) and raw iteration scores (dotted lines) per team
- Function plots show ground truth vs. each team's best approximation
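The retrieval step in the loop above (embed the query, rank by cosine similarity, return the top chunks) can be sketched as a brute-force in-memory search; names are illustrative and the repo's vectorstore.ts may differ:

```typescript
// Cosine similarity between two dense vectors.
function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Exhaustive top-k search over pre-computed chunk embeddings,
// returning the indices of the k most similar chunks.
function topK(query: Float32Array, chunks: Float32Array[], k: number): number[] {
  return chunks
    .map((emb, i) => ({ i, sim: cosine(query, emb) }))
    .sort((x, y) => y.sim - x.sim)
    .slice(0, k)
    .map((r) => r.i);
}
```

At 69,602 chunks a brute-force scan per query is still fast enough that no approximate-nearest-neighbor index is needed.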
```
auto-optimize/
├── corpus/
│   ├── arxiv_papers.jsonl        # 1,468 paper metadata (title, abstract, authors)
│   ├── papers/                   # 1,345 full LaTeX source files
│   ├── chunks_medium.json        # 69,602 text chunks
│   └── embeddings/
│       ├── zeroentropy_embeddings.bin
│       ├── openai_embeddings.bin
│       └── cohere_embeddings.bin
│
├── src/
│   ├── app/                      # Next.js app
│   │   ├── page.tsx              # Server wrapper
│   │   ├── dashboard.tsx         # Main dashboard UI (client component)
│   │   ├── layout.tsx            # HTML shell, fonts, theme
│   │   └── api/
│   │       ├── race/route.ts     # POST = start race, DELETE = reset
│   │       ├── status/route.ts   # GET = poll race state
│   │       ├── eval/route.ts     # POST = evaluate code standalone
│   │       └── plot/route.ts     # GET = function plot data
│   │
│   ├── lib/                      # Core logic
│   │   ├── eval.ts               # Evaluation harness — mystery function, scoring
│   │   ├── corpus.ts             # Loads chunks from chunks_medium.json
│   │   ├── vectorstore.ts        # Embedding + cosine similarity search
│   │   ├── race-state.ts         # In-memory race state management
│   │   └── race-runner.ts        # Orchestrator: 3 teams x 3 agents
│   │
│   ├── mastra/                   # Mastra agent framework
│   │   ├── index.ts              # Mastra instance with 3 agents
│   │   ├── agents/
│   │   │   └── optimizer.ts      # Agent definition + system prompt
│   │   └── tools/
│   │       ├── search-papers.ts  # Tool: semantic search over paper corpus
│   │       └── eval-code.ts      # Tool: evaluate an approximation function
│   │
│   └── scripts/
│       ├── build-chunks.ts       # Chunk papers into segments
│       ├── pre-embed.ts          # Embed chunks with all 3 providers
│       └── test-eval.ts          # Quick test of the evaluation harness
│
├── download_arxiv.py             # Fetch paper metadata from ArXiv
├── download_full_papers.py       # Download full LaTeX/PDF source
├── embed_ze_modal.py             # ZeroEntropy embedding via Modal
├── server.ts                     # Standalone Express server (alternative to Next.js)
├── next.config.js
├── tsconfig.json
└── package.json
```
- Node.js 20+
- API keys in `.env`:
  - `GOOGLE_GENERATIVE_AI_API_KEY` — Gemini (used by all agents)
  - `ZEROENTROPY_API_KEY`
  - `OPENAI_API_KEY`
  - `COHERE_API_KEY`
```bash
# Download corpus data and pre-computed embeddings (~1.5 GB)
./download.sh
npm install
npm run build
NODE_OPTIONS="--max-old-space-size=8192" npx next start -p 3000 -H 0.0.0.0
```

Then open http://localhost:3000 and click "Start Race".
If you need to re-embed (e.g., after changing chunks):
```bash
# ZeroEntropy (via Modal for parallelism)
python3 embed_ze_modal.py

# OpenAI + Cohere (via Vercel AI SDK)
npx tsx src/scripts/pre-embed.ts
```

The embedding scripts are idempotent — they resume from where they left off.
- Mastra — TypeScript AI agent framework (agent definitions, tool calling)
- Vercel AI SDK — Model routing (Gemini via `@ai-sdk/google`), embedding provider pattern
- Next.js 16 — Server + frontend (API routes + React dashboard)
- Gemini 3 Flash Preview — LLM for all agents (same model across all teams)
**What LLM are the agents using?** Gemini 3 Flash Preview, the same for all teams. The LLM is not the variable — retrieval is.

**Why 64 sample points?** Enough for a good approximation with the right technique, but not enough to brute-force it. Agents need to discover optimal node placement and stable evaluation methods from the papers.

**How is scoring done?** `score = digits * sqrt(speed)`, where `digits = -log10(mean_absolute_error)` capped at [0, 15], and `speed` = ops/sec relative to a naive 64-point Lagrange baseline. Both accuracy and efficiency matter, but accuracy dominates.
Apache-2.0