This repository is the official implementation of **GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents** ([paper](https://arxiv.org/abs/2505.11368)).
To install requirements:

```bash
pip install -r requirements.txt
```
All data of GuideBench is placed in `./data`. Each task category of the dataset is organized as a separate list, where each element of the list is an instance of the dataset. The format of an instance is as follows.
- `Instruction` (string): the overarching task objective.
- `Guidelines` (list): a set of domain-specific rules that inform the task structure. Each rule contains:
  - `type` (string): the type of the guideline rule.
  - `rule_id` (string): the index of the guideline rule.
  - `rule_text` (string): the actual text of the guideline rule.
- `Context` (string): a relevant text passage.
- `Groundtruth` (object): the human-checked ground truth, designed to verify the correctness of the answers generated by LLMs. It contains:
  - `ReferenceAnswer` (integer or string): 0/1, a number, or A/B/C/D, corresponding to QA tasks and multiple-choice tasks.
  - `ReferenceAnalysis` (string): the reference analysis of why the reference answer is correct.
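As a minimal sketch, instances in this format can be loaded with the standard `json` module (this assumes each category file under `./data` is a JSON list of such instances; `rule_index` is a hypothetical helper, not part of the repo):

```python
import json

def load_category(path):
    """Load one GuideBench category file: a JSON list of instances."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def rule_index(instance):
    """Map rule_id -> rule_text for quick lookup (hypothetical helper)."""
    return {r["rule_id"]: r["rule_text"] for r in instance["Guidelines"]}
```
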
Here is an example of GuideBench:

```json
{
    "Instruction": "优惠券数学计算题目",
    "Guidelines": [
        {
            "type": "折扣优惠券",
            "rule_id": "rule_2",
            "rule_text": "折扣优惠券:7 折优惠券,适用于除运动器材外的所有家居用品类商品,每个订单仅限使用一次,可与满减优惠券、固定金额优惠券叠加使用"
        },
        {
            "type": "固定金额优惠券",
            "rule_id": "rule_11",
            "rule_text": "固定金额优惠券:25 元优惠券,无消费门槛,可用于购买平台内除食品饮料类和美妆类外的所有商品,但每个订单仅限使用一次,可与折扣优惠券叠加使用,不可与满减优惠券叠加使用"
        },
        {
            "type": "满减优惠券",
            "rule_id": "rule_15",
            "rule_text": "满减优惠券:满 700 减 180,可与固定金额优惠券叠加使用,但不可与折扣优惠券叠加,每个订单仅限使用一次,适用于平台内除食品饮料类、电子产品类和运动器材类外的所有商品"
        },
        {
            "type": "组合优惠券使用限制",
            "rule_id": "rule_20",
            "rule_text": "同一订单中,若同时使用满减优惠券和固定金额优惠券,需先计算满减金额,再计算固定金额优惠券的抵扣金额"
        },
        {
            "type": "折扣优惠券",
            "rule_id": "rule_8",
            "rule_text": "折扣优惠券:9 折优惠券,适用于平台内除美妆类商品外的所有饰品类商品,每个订单仅限使用一次,不可与满减优惠券、固定金额优惠券叠加使用"
        },
        {
            "type": "满减优惠券",
            "rule_id": "rule_44",
            "rule_text": "满减优惠券:新增满 350 减 90,可与折扣优惠券叠加使用,不可与固定金额优惠券叠加,每个订单仅限使用一次,无适用时间限制,适用于平台内玩具类商品"
        }
    ],
    "Context": "在某电商平台购物,小张准备购买一些家居用品(非运动器材)、一批饰品(非美妆类)以及一些玩具。家居用品总价为 400 元,饰品总价为 150 元,玩具总价为 380 元。那么小张购买这些商品,最少需要支付多少钱?",
    "Groundtruth": {
        "ReferenceAnswer": "695",
        "ReferenceAnalysis": "首先计算家居用品使用7折优惠券后的价格为400×0.7=280元。饰品使用9折优惠券后的价格为150×0.9=135元,此时总计280+135+380=795元。若不使用9折券,玩具总价380元,可使用满350减90的满减优惠券,玩具实际需支付380-90=290元,此时商品总价为280+150+290=720元,但此时无法使用700满减券,再使用25元固定金额优惠券,最终需支付720-25=695元。"
    }
}
```

First, deploy your target LLM and generate its responses. The prompt template in `generate_resp.py` covers both API-based models and open-source ones.
We suggest using greedy decoding to avoid randomness in the decoding process.
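Greedy decoding picks the argmax token at every step, so repeated runs produce identical outputs. A toy sketch of the idea (`step_logits_fn` below is a stand-in for your model's next-token logits; with Hugging Face `transformers`, passing `do_sample=False` to `generate()` has the same effect):

```python
def greedy_decode(step_logits_fn, start_ids, max_new_tokens, eos_id):
    """Deterministic decoding: append the argmax token at each step."""
    ids = list(start_ids)
    for _ in range(max_new_tokens):
        logits = step_logits_fn(ids)  # next-token scores given the prefix
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:  # stop at end-of-sequence
            break
    return ids
```

Because no sampling is involved, two calls with the same prefix always return the same token sequence, which makes benchmark results reproducible.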
Then, you can evaluate any desired model via `evaluate.py`, which generates a text file containing the model's accuracy. Just fill in YOUR_ANSWER_PATH with the path to your model's responses.
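At its core, the evaluation compares each model response against `ReferenceAnswer`. A sketch under the assumption of exact-match scoring (the actual scoring rule lives in `evaluate.py`):

```python
def accuracy(predictions, instances):
    """Fraction of predictions that exactly match the reference answer.

    Exact-match after whitespace stripping is an assumption about
    evaluate.py's scoring rule; adapt as needed.
    """
    correct = sum(
        str(pred).strip() == str(inst["Groundtruth"]["ReferenceAnswer"]).strip()
        for pred, inst in zip(predictions, instances)
    )
    return correct / len(instances)
```
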
```bibtex
@misc{diao2025guidebenchbenchmarkingdomainorientedguideline,
    title={GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents},
    author={Lingxiao Diao and Xinyue Xu and Wanxuan Sun and Cheng Yang and Zhuosheng Zhang},
    year={2025},
    eprint={2505.11368},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2505.11368},
}
```
Please cite our paper if you find the paper or code helpful.