Hi team,
Congratulations on the ICLR acceptance, awesome work! I tried setting up the WebArena-Lite-v2 evaluation from this repo. After fixing a few small issues (e.g., parsing actions like `answer(xxxx)` and handling `write` after cleaning up), I ran the 7B checkpoint with a 15-step budget and got the following results:
| Domain | Pass@1 | Pass@4 | Count |
|---|---|---|---|
| Gitlab | 0.2833 | 0.4000 | 30 |
| Map | 0.1635 | 0.2308 | 26 |
| Reddit | 0.2632 | 0.3684 | 19 |
| Shopping | 0.2330 | 0.3182 | 44 |
| ShoppingAdmin | 0.3143 | 0.3714 | 35 |
| Average | 0.2532 | 0.3377 | 154 |
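
As a sanity check on the Average row: it appears to be the task-count-weighted mean of the per-domain rates, not a plain mean over the five domains. A minimal sketch of that check, with the numbers copied from the table in row order (the `weighted_average` helper is only illustrative and is not part of the repo):

```python
# Rows in table order: (pass@1, pass@4, task count), copied from the table above.
rows = [
    (0.2833, 0.4000, 30),
    (0.1635, 0.2308, 26),
    (0.2632, 0.3684, 19),
    (0.2330, 0.3182, 44),
    (0.3143, 0.3714, 35),
]


def weighted_average(rows, col):
    """Mean of column `col`, weighted by each row's task count (index 2)."""
    total = sum(row[2] for row in rows)
    return sum(row[col] * row[2] for row in rows) / total


total_tasks = sum(count for _, _, count in rows)
avg_pass1 = weighted_average(rows, 0)
avg_pass4 = weighted_average(rows, 1)

print(f"tasks={total_tasks} pass@1={avg_pass1:.4f} pass@4={avg_pass4:.4f}")
```

Because the per-domain values are already rounded to four decimals, the recomputed averages can differ from the Average row in the last digit.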
Both averages (Pass@1 ≈ 25.3%, Pass@4 ≈ 33.8%) are lower than the reported number (~37). Do you have any insight into which settings or evaluation details might explain the gap (e.g., step limit, prompts, environment versions, or action parsing)?
Thanks a lot!