RL training task: Teach LLMs to generate graphs matching precise network topology metrics (clustering coefficient, average path length, connectivity).
This task teaches the model to work with network science and graph theory by generating graphs that satisfy strict topological constraints. The model must:
- Choose appropriate graph generation algorithms (Erdős-Rényi, Barabási-Albert, Holme-Kim triadic closure, etc.)
- Understand how different parameters affect network properties
- Use iterative refinement techniques (edge rewiring, pruning) to hit precise targets
- Balance multiple competing objectives under resource constraints
Why this matters for ML engineers: Graph generation with constraints is fundamental in many domains:
- Synthetic network datasets for GNN training
- Network topology optimization for distributed systems
- Social network simulation and analysis
- Biological network modeling (protein interactions, neural connectivity)
Given a target profile for three key graph metrics:
- C (Clustering Coefficient): Measures local triangle density, typically 0.08-0.12
- L (Average Path Length): Average shortest path in the giant connected component, typically 4.2-6.0
- CC (Connected Components): Must be exactly 1 (fully connected graph)
The model must generate a graph of 400 nodes that achieves a score ≥ 80-85% (randomly chosen per run).
Score = 100 × average(closeness(C), closeness(L), closeness(CC))
where closeness(metric) = 1 - |actual - target| / tolerance
Hard constraint: If ANY metric achieves < 30% closeness, the total score is zero. This prevents the model from ignoring difficult metrics.
- Triadic closure models (Holme-Kim): Best for achieving realistic C values through triangle formation
- Barabási-Albert: Power-law degree distribution, but lower clustering
- Erdős-Rényi: Uniform random, poor for this task (C too low)
- Configuration model: Custom degree sequences
- m parameter: Higher m → denser graph → shorter L, higher C
- Triadic probability p: Higher p → more triangles → higher C
- Trade-offs: C and L are often anti-correlated (densification reduces distances)
- Standard rewiring: Preserves degree distribution, adjusts C/L
- Triangle-boosting rewiring: Smart swaps that increase clustering
- Edge pruning: Carefully remove edges to increase L while maintaining connectivity
- 120 rewire operations max
- 60 metric computation calls max
- 25 agent steps max
- Penalty for excessive metric calls (>25)
The task includes a "patience" mechanism: if the model shows no improvement for 6 consecutive checks after 10 attempts, it should abandon the current approach and try a different strategy.
Linux/macOS:
curl -LsSf https://astral.sh/uv/install.sh | shWindows (PowerShell):
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"git clone https://github.com/ElkinStas/graph-generation-rl-task
cd graph-generation-rl-taskexport ANTHROPIC_API_KEY='your_api_key_here'uv run main.pyuv will automatically:
- Create a virtual environment
- Install dependencies (
anthropic>=0.67.0,networkx>=3.0) - Execute the test suite
The script runs 10 independent trials by default. Each trial:
- Randomizes target metrics (C, L) and pass threshold
- Gives the model a unique seed for reproducibility
- Runs the agent loop with tool access
- Grades the final submission
- Logs detailed execution to
logs/
To change number of runs, edit the last line in main.py:
asyncio.run(main(concurrent=False, num_runs=20)) # Run 20 trialsExecution modes:
concurrent=False: Sequential (easier to debug, recommended)concurrent=True: Parallel execution (faster)
✅ Correct answers are accepted if:
- Score ≥ randomly chosen threshold (80-85%)
- Graph is connected (CC = 1)
- All metrics within tolerance bands
❌ Common failure modes:
- Using Erdős-Rényi (C too low, always fails)
- Generating disconnected graphs (CC > 1)
- Getting stuck in local optima (not exploring enough)
- Excessive metric calls (penalty applied after 25 calls)
- Early termination due to patience threshold
Expected pass rate: 10-40% (per XOR requirements)
This is challenging because:
- Tight constraints: Each metric must be ≥30% close to target
- Multi-objective optimization: C and L trade off (harder to satisfy both)
- No direct analytical solution: Model must learn heuristics through trial-and-error
- Resource constraints: Limited rewire/metric budgets force efficiency
- Randomized targets: Each run has different C/L goals (generalization required)
Why models fail:
- 40%: Wrong graph generator choice (e.g., ER instead of triadic)
- 30%: Poor parameter selection (m/p values)
- 20%: Insufficient refinement iterations
- 10%: Budget exhaustion or early stopping
Why models succeed:
- Use triadic closure model (Holme-Kim) with appropriate m/p
- Compute metrics immediately after generation
- Apply targeted refinement (rewire_triangles for C, prune_edges for L)
- Submit as soon as threshold reached (avoid over-optimization)
The model has access to 6 tools:
| Tool | Purpose | Budget |
|---|---|---|
generate_graph(kind, params, seed) |
Create initial graph | Unlimited |
rewire(k) |
Random edge swaps (1-5) | 120 total |
rewire_triangles(k) |
Clustering-aware swaps (1-5) | 120 total |
prune_edges(k) |
Remove edges safely (1-3) | 120 total |
compute_metrics() |
Get C, L, CC + score | 60 calls |
submit_graph() |
Final submission | Once |
Key design decisions:
- Shared rewire budget: All rewire operations (rewire + rewire_triangles + prune_edges) share the 120-call budget
- Auto-submission: When score ≥ threshold, graph is auto-submitted (no need to call submit explicitly)
- Deterministic connectivity: When CC=1 is required, rewire operations that break connectivity are automatically rejected
- Efficient L computation: Uses 22 sampled BFS sources (exact APSP would be O(n³) on 400 nodes)
Every run generates:
- Detailed log:
logs/run_<id>_<timestamp>.log- step-by-step execution - Conversation history:
logs/run_<id>_<timestamp>.conversation.json- full transcript (on failure) - Summary files:
logs/summary_<timestamp>.jsonl- Machine-readable resultslogs/summary_<timestamp>.csv- Excel-friendly format
Seed logging: Each run logs its seed prominently for exact reproducibility.
============================================================
Running 10 test iterations sequentially...
Target ranges: C=(0.08, 0.12), L=(4.2, 6.0)
Pass threshold range: (80, 85)
============================================================
Run 1: SUCCESS - score=82.3 (threshold=81.2), best=82.3
Metrics: C=0.0987, L=5.124, CC=1, GCC_size=400
Duration: 45.2s
Log: logs/run_1_20250103_143022.log
Run 2: FAILURE - score=68.4 < 83.7, best=68.4
Metrics: C=0.0654, L=5.892, CC=1, GCC_size=400
C<30% (C=0.0654, target≈0.095±0.035, closeness=26.3%)
Duration: 38.1s
Log: logs/run_2_20250103_143101.log
[... 8 more runs ...]
============================================================
TEST RESULTS
============================================================
Passed: 3/10
Failed: 7/10
Pass Rate: 30.0%
Avg Score: 64.2
Score Range: [0.0, 86.1]
Pass rate in target range (10-40%)
- Lines of code: ~850 (main task logic)
- Dependencies: 2 (anthropic, networkx)
- Avg inference cost: ~$0.50-1.00 per 10 runs (with Claude Sonnet 4)
- Avg runtime: 40-60 seconds per run (sequential)