diff --git a/experiments/kdd 2026/new_tests_results.md b/experiments/kdd 2026/new_tests_results.md
new file mode 100644
index 0000000..da7a0d1
--- /dev/null
+++ b/experiments/kdd 2026/new_tests_results.md
@@ -0,0 +1,17 @@
+# Results: Newly Added Tests
+
+Assertion-weighted mean scores (0-100) on the **36 newly added tests** only: 17 Box, 5 Google Calendar, 7 Linear, and 6 Slack. All runs included API documentation.
+
+| Model | Weighted Mean |
+|---|---|
+| openai/gpt-5 | 88.10 |
+| openai/gpt-5-mini | 87.61 |
+| deepseek/deepseek-v3.2 | 84.26 |
+| mistralai/devstral-2512 | 80.38 |
+| google/gemini-3.1-pro-preview | 77.69 |
+| moonshotai/kimi-k2-0905 | 74.06 |
+| qwen/qwen3-vl-235b-a22b-instruct | 74.06 |
+| google/gemini-3-flash-preview | 67.78 |
+| x-ai/grok-4.1-fast | 65.44 |
+| meta-llama/llama-4-scout | 28.63 |
+| openai/gpt-oss-120b | 27.90 |