This repository is the official implementation of **GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents** ([paper](https://arxiv.org/abs/2505.11368)).
To install requirements:

```bash
pip install -r requirements.txt
```
All data of GuideBench is placed in `./data`. Each task category of the dataset is organized as a separate list, where each element of the list is an instance of the dataset. The format of an instance is as follows.
- `Instruction` (string): the overarching task objective.
- `Guidelines` (list): a set of domain-specific rules that inform the task structure. Each rule contains:
  - `type` (string): the type of the guideline rule.
  - `rule_id` (string): the index of the guideline rule.
  - `rule_text` (string): the actual text of the guideline rule.
- `Context` (string): a relevant text passage.
- `Groundtruth` (object): the human-checked ground truth, designed to verify the correctness of the answers generated by LLMs. It contains:
  - `ReferenceAnswer` (integer or string): 0/1, a number, or A/B/C/D, corresponding to QA tasks and multiple-choice tasks.
  - `ReferenceAnalysis` (string): the reference analysis of why the reference answer is correct.
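As a minimal sketch, instances in this format can be loaded with the standard `json` module (this assumes each category file under `./data` is a JSON list of such instances; `rule_index` is a hypothetical helper, not part of the repo):

```python
import json

def load_category(path):
    """Load one GuideBench category file: a JSON list of instances."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def rule_index(instance):
    """Map rule_id -> rule_text for quick lookup (hypothetical helper)."""
    return {r["rule_id"]: r["rule_text"] for r in instance["Guidelines"]}
```
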
Here is an example of GuideBench:

```json
{
    "Instruction": "优惠券数学计算题目",
    "Guidelines": [
        {
            "type": "折扣优惠券",
            "rule_id": "rule_2",
            "rule_text": "折扣优惠券:7 折优惠券,适用于除运动器材外的所有家居用品类商品,每个订单仅限使用一次,可与满减优惠券、固定金额优惠券叠加使用"
        },
        {
            "type": "固定金额优惠券",
            "rule_id": "rule_11",
            "rule_text": "固定金额优惠券:25 元优惠券,无消费门槛,可用于购买平台内除食品饮料类和美妆类外的所有商品,但每个订单仅限使用一次,可与折扣优惠券叠加使用,不可与满减优惠券叠加使用"
        },
        {
            "type": "满减优惠券",
            "rule_id": "rule_15",
            "rule_text": "满减优惠券:满 700 减 180,可与固定金额优惠券叠加使用,但不可与折扣优惠券叠加,每个订单仅限使用一次,适用于平台内除食品饮料类、电子产品类和运动器材类外的所有商品"
        },
        {
            "type": "组合优惠券使用限制",
            "rule_id": "rule_20",
            "rule_text": "同一订单中,若同时使用满减优惠券和固定金额优惠券,需先计算满减金额,再计算固定金额优惠券的抵扣金额"
        },
        {
            "type": "折扣优惠券",
            "rule_id": "rule_8",
            "rule_text": "折扣优惠券:9 折优惠券,适用于平台内除美妆类商品外的所有饰品类商品,每个订单仅限使用一次,不可与满减优惠券、固定金额优惠券叠加使用"
        },
        {
            "type": "满减优惠券",
            "rule_id": "rule_44",
            "rule_text": "满减优惠券:新增满 350 减 90,可与折扣优惠券叠加使用,不可与固定金额优惠券叠加,每个订单仅限使用一次,无适用时间限制,适用于平台内玩具类商品"
        }
    ],
    "Context": "在某电商平台购物,小张准备购买一些家居用品(非运动器材)、一批饰品(非美妆类)以及一些玩具。家居用品总价为 400 元,饰品总价为 150 元,玩具总价为 380 元。那么小张购买这些商品,最少需要支付多少钱?",
    "Groundtruth": {
        "ReferenceAnswer": "695",
        "ReferenceAnalysis": "首先计算家居用品使用7折优惠券后的价格为400×0.7=280元。饰品使用9折优惠券后的价格为150×0.9=135元,此时总计280+135+380=795元。若不使用9折券,玩具总价380元,可使用满350减90的满减优惠券,玩具实际需支付380-90=290元,此时商品总价为280+150+290=720元,但此时无法使用700满减券,再使用25元固定金额优惠券,最终需支付720-25=695元。"
    }
}
```

First, deploy your target LLM and generate its responses. The prompt template in `generate_resp.py` covers both API-based models and open-source ones.
We suggest using greedy decoding to avoid randomness in the decoding process.
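Greedy decoding picks the argmax token at every step, so repeated runs produce identical outputs. A toy sketch of the idea (`step_logits_fn` below is a stand-in for your model's next-token logits; with Hugging Face `transformers`, passing `do_sample=False` to `generate()` has the same effect):

```python
def greedy_decode(step_logits_fn, start_ids, max_new_tokens, eos_id):
    """Deterministic decoding: append the argmax token at each step."""
    ids = list(start_ids)
    for _ in range(max_new_tokens):
        logits = step_logits_fn(ids)  # next-token scores given the prefix
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:  # stop at end-of-sequence
            break
    return ids
```

Because no sampling is involved, two calls with the same prefix always return the same token sequence, which makes benchmark results reproducible.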
Then, you can evaluate any desired model via `evaluate.py`, which generates a text file containing the model's accuracy. Just fill in YOUR_ANSWER_PATH with the path to your model's responses.
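At its core, the evaluation compares each model response against `ReferenceAnswer`. A sketch under the assumption of exact-match scoring (the actual scoring rule lives in `evaluate.py`):

```python
def accuracy(predictions, instances):
    """Fraction of predictions that exactly match the reference answer.

    Exact-match after whitespace stripping is an assumption about
    evaluate.py's scoring rule; adapt as needed.
    """
    correct = sum(
        str(pred).strip() == str(inst["Groundtruth"]["ReferenceAnswer"]).strip()
        for pred, inst in zip(predictions, instances)
    )
    return correct / len(instances)
```
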
```bibtex
@misc{diao2025guidebenchbenchmarkingdomainorientedguideline,
    title={GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents},
    author={Lingxiao Diao and Xinyue Xu and Wanxuan Sun and Cheng Yang and Zhuosheng Zhang},
    year={2025},
    eprint={2505.11368},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2505.11368},
}
```
Please cite our paper if you find the paper or code helpful.