From 4a7fc8a84a68aa417b961b46eb125c77d50d0083 Mon Sep 17 00:00:00 2001
From: Artem Zhuravel
Date: Sat, 11 Apr 2026 17:13:04 +0530
Subject: [PATCH] Create new_tests_results.md

---
 experiments/kdd 2026/new_tests_results.md | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)
 create mode 100644 experiments/kdd 2026/new_tests_results.md

diff --git a/experiments/kdd 2026/new_tests_results.md b/experiments/kdd 2026/new_tests_results.md
new file mode 100644
index 0000000..da7a0d1
--- /dev/null
+++ b/experiments/kdd 2026/new_tests_results.md
@@ -0,0 +1,17 @@
+# Results: Newly Added Tests
+
+Assertion-weighted mean scores (0-100) on the **36 newly added tests** only: 17 Box, 5 Google Calendar, 7 Linear, and 6 Slack. All runs included API documentation.
+
+| Model | Weighted Mean
+|---|---|
+| openai/gpt-5 | 88.10
+| openai/gpt-5-mini | 87.61
+| deepseek/deepseek-v3.2 | 84.26
+| mistralai/devstral-2512 | 80.38
+| google/gemini-3.1-pro-preview | 77.69
+| moonshotai/kimi-k2-0905 | 74.06
+| qwen/qwen3-vl-235b-a22b-instruct | 74.06
+| google/gemini-3-flash-preview | 67.78
+| x-ai/grok-4.1-fast | 65.44
+| meta-llama/llama-4-scout | 28.63
+| openai/gpt-oss-120b | 27.90
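
The results file does not define "assertion-weighted mean". One plausible reading, sketched here as an assumption and not as the benchmark's actual scoring code, is that each test's pass rate is weighted by the number of assertions it contains. Under that reading the score equals total passed assertions divided by total assertions, scaled to 0-100. The names `CaseResult` and `assertion_weighted_mean` are hypothetical, and the example numbers are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    """Hypothetical record holding the assertion counts for one test."""
    passed: int  # assertions that passed
    total: int   # assertions defined by the test


def assertion_weighted_mean(results: list[CaseResult]) -> float:
    """Mean of per-test pass rates (0-100), weighted by assertion count.

    Assumed definition: sum(w_i * s_i) / sum(w_i), where w_i is the number
    of assertions in test i and s_i = 100 * passed_i / total_i.
    """
    total_assertions = sum(r.total for r in results)
    if total_assertions == 0:
        raise ValueError("no assertions to score")
    weighted_sum = sum(
        r.total * (100.0 * r.passed / r.total)
        for r in results
        if r.total > 0  # tests without assertions carry zero weight
    )
    return weighted_sum / total_assertions


# Illustrative numbers only, not taken from the benchmark.
example = [
    CaseResult(passed=3, total=4),
    CaseResult(passed=1, total=1),
    CaseResult(passed=0, total=5),
]
score = assertion_weighted_mean(example)
unweighted = sum(100.0 * r.passed / r.total for r in example) / len(example)
print(f"assertion-weighted: {score:.2f}, unweighted: {unweighted:.2f}")
```

With this weighting, a test with many assertions moves the score more than a test with one, which is why the weighted figure in the example sits below the plain per-test average.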