Add OpenClaw + DeepSeek V4 Pro results (49 tasks, 0.918)#12

Open
konghuihua wants to merge 2 commits into agentscope-ai:main from konghuihua:main

@konghuihua

No description provided.

@konghuihua

Author

Zhiyin OpenClaw + DeepSeek V4 Pro Submission

Overview

Slice Averages

| Slice               | Score | Tasks |
|---------------------|-------|-------|
| Text                | 0.941 | 32    |
| Multimodal          | 0.895 | 9     |
| Code Generation     | 0.801 | 6     |
| Document Extraction | 1.000 | 2     |
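
As a sanity check, the headline figure can be reproduced from the slice table. The sketch below assumes the overall score is the task-weighted mean of the slice averages (the PR does not state the aggregation rule), and the slice averages are themselves rounded, so the result is approximate. The function name `weighted_overall` is illustrative only.

```python
# Slice averages and task counts, copied from the slice table.
slices = {
    "Text": (0.941, 32),
    "Multimodal": (0.895, 9),
    "Code Generation": (0.801, 6),
    "Document Extraction": (1.000, 2),
}


def weighted_overall(slices):
    """Return (task-weighted mean score, total task count).

    Assumption: the overall score weights each slice by its task count.
    """
    total_tasks = sum(count for _, count in slices.values())
    total_score = sum(score * count for score, count in slices.values())
    return total_score / total_tasks, total_tasks


overall, n_tasks = weighted_overall(slices)
print(f"{n_tasks} tasks, overall {overall:.3f}")
```

Under that assumption the four slices sum to 49 tasks and the weighted mean rounds to 0.918, matching the PR title.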

Notes

  • Auto scores only from official graders (auto_avg field from _results_v2.jsonl)
  • LLM judge dimension not reproduced (requires separate judge run)
  • ~100 tasks produced no output files or timed out; only scorable tasks are included
  • Docker-based evaluation with zhiyin-pawbench:latest image
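
For anyone re-deriving the auto scores, a minimal sketch of the aggregation follows. It assumes each line of `_results_v2.jsonl` is a JSON object with a numeric `auto_avg` field and that records without one are unscorable; the actual schema is not shown in this PR, and `mean_auto_avg` is a hypothetical helper, not part of any official tooling.

```python
import json


def mean_auto_avg(lines):
    """Average the auto_avg field over JSONL records that carry a numeric value.

    Assumption: one JSON object per line; records with a missing or
    non-numeric auto_avg are treated as unscorable and skipped.
    Returns (mean or None, number of scored records).
    """
    scores = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        value = record.get("auto_avg")
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            scores.append(float(value))
    if not scores:
        return None, 0
    return sum(scores) / len(scores), len(scores)


# Usage against the results file (path relative to wherever the run wrote it):
# with open("_results_v2.jsonl", encoding="utf-8") as f:
#     avg, scored = mean_auto_avg(f)
```

Skipping unscorable records is what yields a partial-run average; counting them as 0 instead would give a different number.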

Submitted by

Zhiyin / CausalMind

@helloml0326

Collaborator

Thanks for the submission. A couple of quick questions:

We already have official deepseek-v4-pro × openclaw results (150 tasks, overall 0.754). This PR adds 49 tasks at 0.918 — what does it add beyond the existing data?

Also, ~101 tasks had no output or timed out. What caused that, and do you plan to re-run the full 150?

For leaderboard inclusion we need the complete task set (missing tasks count as 0). A partial run isn't directly comparable to existing submissions.
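
To illustrate why the partial figure is not comparable, here is a rough calculation of what 0.918 over 49 tasks becomes when the remaining tasks count as 0. It assumes 0.918 is a plain mean over the 49 scored tasks and that the full set has 150 tasks; `full_set_score` is an illustrative helper, not the leaderboard's scoring code.

```python
def full_set_score(partial_mean, scored_tasks, total_tasks):
    """Rescale a partial-run mean to the full task set, counting missing tasks as 0."""
    if not 0 <= scored_tasks <= total_tasks or total_tasks == 0:
        raise ValueError("scored_tasks must be between 0 and a positive total_tasks")
    return partial_mean * scored_tasks / total_tasks


rescaled = full_set_score(0.918, scored_tasks=49, total_tasks=150)
print(f"{rescaled:.3f}")
```

Under these assumptions the submission would come out at roughly 0.300 on the full set, well below the existing 0.754.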

2 participants