GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents (ACL2025)

This repository is the official implementation of *Benchmarking Domain-Oriented Guideline Following for LLM Agents*. The paper is available on [arXiv](https://arxiv.org/abs/2505.11368).

⚙️ Requirements

To install requirements:

```shell
pip install -r requirements.txt
```

📝 Data

All GuideBench data is located in `./data`. Each task category of the dataset is organized as a list, where each element of the list is a dataset instance. An instance has the following format.

  • Instruction (string): the overarching task objective.
  • Guidelines (list of objects): a set of domain-specific rules that inform the task structure. Each rule has:
    • type (string): the type of the guideline rule
    • rule_id (string): the index of the guideline rule
    • rule_text (string): the actual text of the guideline rule
  • Context (string): a relevant text passage
  • Groundtruth (object): the human-checked ground truth, designed to verify the correctness of answers generated by LLMs.
    • ReferenceAnswer (integer or string): 0/1, a number, or A/B/C/D, depending on whether the task is QA or multiple-choice.
    • ReferenceAnalysis (string): the reference analysis explaining why the reference answer is correct.
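As a sanity check, the field structure above can be validated with a short Python sketch. The field names follow the list above; the assumption that each category file under `./data` is a JSON list is ours:

```python
# Sketch of a loader for GuideBench instances. Field names come from the
# format description above; the JSON-list file layout is an assumption.
import json

REQUIRED_KEYS = {"Instruction", "Guidelines", "Context", "Groundtruth"}
GUIDELINE_KEYS = {"type", "rule_id", "rule_text"}

def validate_instance(instance: dict) -> bool:
    """Check that one dataset element carries all documented fields."""
    if not REQUIRED_KEYS <= instance.keys():
        return False
    if not all(GUIDELINE_KEYS <= g.keys() for g in instance["Guidelines"]):
        return False
    gt = instance["Groundtruth"]
    return "ReferenceAnswer" in gt and "ReferenceAnalysis" in gt

def load_category(path: str) -> list[dict]:
    """Load one task-category file (a JSON list of instances)."""
    with open(path, encoding="utf-8") as f:
        return [x for x in json.load(f) if validate_instance(x)]
```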

Here is an example from GuideBench (a coupon-arithmetic task, in Chinese).

```json
{
    "Instruction": "优惠券数学计算题目",
    "Guidelines": [
      {
        "type": "折扣优惠券",
        "rule_id": "rule_2",
        "rule_text": "折扣优惠券:7 折优惠券,适用于除运动器材外的所有家居用品类商品,每个订单仅限使用一次,可与满减优惠券、固定金额优惠券叠加使用"
      },
      {
        "type": "固定金额优惠券",
        "rule_id": "rule_11",
        "rule_text": "固定金额优惠券:25 元优惠券,无消费门槛,可用于购买平台内除食品饮料类和美妆类外的所有商品,但每个订单仅限使用一次,可与折扣优惠券叠加使用,不可与满减优惠券叠加使用"
      },
      {
        "type": "满减优惠券",
        "rule_id": "rule_15",
        "rule_text": "满减优惠券:满 700 减 180,可与固定金额优惠券叠加使用,但不可与折扣优惠券叠加,每个订单仅限使用一次,适用于平台内除食品饮料类、电子产品类和运动器材类外的所有商品"
      },
      {
        "type": "组合优惠券使用限制",
        "rule_id": "rule_20",
        "rule_text": "同一订单中,若同时使用满减优惠券和固定金额优惠券,需先计算满减金额,再计算固定金额优惠券的抵扣金额"
      },
      {
        "type": "折扣优惠券",
        "rule_id": "rule_8",
        "rule_text": "折扣优惠券:9 折优惠券,适用于平台内除美妆类商品外的所有饰品类商品,每个订单仅限使用一次,不可与满减优惠券、固定金额优惠券叠加使用"
      },
      {
        "type": "满减优惠券",
        "rule_id": "rule_44",
        "rule_text": "满减优惠券:新增满 350 减 90,可与折扣优惠券叠加使用,不可与固定金额优惠券叠加,每个订单仅限使用一次,无适用时间限制,适用于平台内玩具类商品"
      }
    ],
    "Context": "在某电商平台购物,小张准备购买一些家居用品(非运动器材)、一批饰品(非美妆类)以及一些玩具。家居用品总价为 400 元,饰品总价为 150 元,玩具总价为 380 元。那么小张购买这些商品,最少需要支付多少钱?",
    "Groundtruth": {
      "ReferenceAnswer": "695",
      "ReferenceAnalysis": "首先计算家居用品使用7折优惠券后的价格为400×0.7=280元。饰品使用9折优惠券后的价格为150×0.9=135元,此时总计280+135+380=795元。若不使用9折劵,玩具总价380元,可使用满350减90的满减优惠券,玩具实际需支付380-90=290元,此时商品总价为280+150+290=720元,但此时无法使用700满减券,再使用25元固定金额优惠券,最终需支付720-25=695元。"
    }
}
```
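The arithmetic in the instance's `ReferenceAnalysis` can be reproduced step by step; the variable names below are ours, and the coupon choices simply mirror the reference analysis (take the 30%-off home-goods coupon, the "spend 350, save 90" toy coupon, skip the 10%-off jewelry coupon so the 25-yuan fixed-amount coupon remains usable):

```python
# Reproducing the ReferenceAnalysis arithmetic for the example above.
home, jewelry, toys = 400, 150, 380

# rule_2: 30%-off ("7 折") coupon on home goods.
home_after = home * 0.7            # 280

# rule_44: "spend 350, save 90" full-reduction coupon on toys.
toys_after = toys - 90             # 290

# Skipping the 10%-off jewelry coupon (rule_8) keeps the 25-yuan
# fixed-amount coupon (rule_11) applicable to the order.
subtotal = home_after + jewelry + toys_after   # 720
total = subtotal - 25              # 695, matching ReferenceAnswer
```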

💡 Evaluation

Step 1: Generating responses

First, deploy your target LLM and generate its responses. The prompt template is in `generate_resp.py`, which supports both API-based models and open-source ones.

We suggest using greedy decoding to avoid randomness in the generated outputs.

Step 2: Evaluation

Then, evaluate any model via `evaluate.py`, which produces a text file containing the model's accuracy. Just fill in YOUR_ANSWER_PATH with the path to your generated responses.
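For orientation, the core accuracy metric amounts to an exact match against `ReferenceAnswer`. The sketch below is our minimal reading, not the actual `evaluate.py` logic; the one-prediction-per-instance answer format is an assumption:

```python
# Minimal sketch of exact-match accuracy against ReferenceAnswer.
# Assumes predictions are aligned one-to-one with dataset instances.
def accuracy(predictions: list[str], instances: list[dict]) -> float:
    correct = sum(
        str(pred).strip() == str(inst["Groundtruth"]["ReferenceAnswer"]).strip()
        for pred, inst in zip(predictions, instances)
    )
    return correct / len(instances) if instances else 0.0
```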

👏 Citation

```bibtex
@misc{diao2025guidebenchbenchmarkingdomainorientedguideline,
      title={GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents},
      author={Lingxiao Diao and Xinyue Xu and Wanxuan Sun and Cheng Yang and Zhuosheng Zhang},
      year={2025},
      eprint={2505.11368},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.11368},
}
```

Please cite our paper if you find it or the code helpful.
