- 2024/09/26: The WenMind Benchmark paper has been accepted by NeurIPS 2024.
WenMind is a comprehensive benchmark dedicated to evaluating Large Language Models (LLMs) in Chinese Classical Literature and Language Arts (CCLLA). WenMind covers the sub-domains of Ancient Prose, Ancient Poetry, and Ancient Literary Culture, comprising 4,875 question-answer pairs that span 42 fine-grained tasks (as shown in Figure 1), 3 question formats (Fill-in-the-Blank, Multiple-Choice, and Question-and-Answer), and 2 evaluation scenarios (domain-oriented and capability-oriented).
Figure 1: Overview of the WenMind Benchmark, which covers 3 sub-domains and 42 fine-grained tasks.
You can obtain the complete WenMind evaluation dataset from the WenMind Benchmark folder on GitHub.
{
"id": 2464,
"domain": "ancient literary culture",
"capability": "knowledge",
"question_format": "QA",
"coarse_grained_task_zh": "成语",
"coarse_grained_task_en": "idiom",
"fine_grained_task_zh": "成语解释",
"fine_grained_task_en": "idiom explanation",
"question": "解释下面成语的意思:\n暮去朝来",
"answer": "黄昏过去,清晨又到来。形容时光流逝。"
}
The following is an explanation of the various fields in the data samples:
- `id`: The unique identifier for the data sample, used to distinguish different samples.
- `domain`: The domain to which the data sample belongs, including ancient prose, ancient poetry and ancient literary culture.
- `capability`: The type of capability of the data sample, including knowledge, understanding and generation.
- `question_format`: The format of the question, indicating the type of question in the sample, including FB, MCQ and QA.
- `coarse_grained_task_zh`: The Chinese name of the coarse-grained task classification. Describes the coarse-grained task category of the sample, with a total of 26 categories.
- `coarse_grained_task_en`: The English name of the coarse-grained task classification. Corresponds to `coarse_grained_task_zh`, describing the coarse-grained task category of the sample, with a total of 26 categories.
- `fine_grained_task_zh`: The Chinese name of the fine-grained task classification. Describes the fine-grained task category of the sample, with a total of 42 categories.
- `fine_grained_task_en`: The English name of the fine-grained task classification. Corresponds to `fine_grained_task_zh`, describing the fine-grained task category of the sample, with a total of 42 categories.
- `question`: The actual content of the question. The question to be answered in the sample.
- `answer`: The answer to the corresponding question. Provides a detailed response to the question.
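As a rough illustration of how these fields can be used, the sketch below parses samples and groups or filters them by field. It assumes the dataset is a JSON array of objects shaped like the sample above; the actual file name and layout inside the WenMind Benchmark folder are not assumed here, so an inline string stands in for the file. The helper names `load_samples`, `count_by`, and `select` are hypothetical.

```python
import json
from collections import Counter

# Stand-in for the real dataset file: one record copied from the sample above.
# Assumption: the dataset is a JSON array of such objects.
SAMPLE_JSON = """
[
  {
    "id": 2464,
    "domain": "ancient literary culture",
    "capability": "knowledge",
    "question_format": "QA",
    "coarse_grained_task_zh": "成语",
    "coarse_grained_task_en": "idiom",
    "fine_grained_task_zh": "成语解释",
    "fine_grained_task_en": "idiom explanation",
    "question": "解释下面成语的意思:\\n暮去朝来",
    "answer": "黄昏过去,清晨又到来。形容时光流逝。"
  }
]
"""


def load_samples(text):
    """Parse a JSON array of samples into a list of dicts (hypothetical helper)."""
    return json.loads(text)


def count_by(samples, field):
    """Count samples per value of the given field, e.g. 'domain' or 'capability'."""
    return Counter(sample[field] for sample in samples)


def select(samples, **criteria):
    """Return the samples whose fields equal every given keyword value."""
    return [
        sample
        for sample in samples
        if all(sample.get(key) == value for key, value in criteria.items())
    ]


samples = load_samples(SAMPLE_JSON)
print(count_by(samples, "domain"))
print([s["id"] for s in select(samples, capability="knowledge", question_format="QA")])
```

To run this against the real data, replace `SAMPLE_JSON` with the contents of the downloaded file, adjusting the parsing if the file is laid out differently.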
- Task Description: Correct the word order of inverted sentences.
- Capability: Understanding
- Scale: 18
- Task Description: Supply the information omitted from an elliptical sentence.
- Capability: Understanding
- Scale: 32
- Task Description: Identify the inversion type of inverted sentences.
- Capability: Understanding
- Scale: 7
- Task Description: Identify the sentence's syntactic type.
- Capability: Understanding
- Scale: 43
- Task Description: Translate classical Chinese into modern Chinese.
- Capability: Understanding
- Scale: 200
- Task Description: Translate modern Chinese into classical Chinese.
- Capability: Understanding
- Scale: 200
- Task Description: Extract named entities from Classical Chinese sentences.
- Capability: Understanding
- Scale: 200
- Task Description: Add punctuation to Classical Chinese sentences.
- Capability: Understanding
- Scale: 200
- Task Description: Select theme categories based on Classical Chinese sentences.
- Capability: Understanding
- Scale: 200
- Task Description: Explain the words and phrases in Classical Chinese sentences.
- Capability: Understanding
- Scale: 100
- Task Description: Read Classical Chinese texts and answer related questions.
- Capability: Understanding
- Scale: 100
- Task Description: Explain the usage of function words in Classical Chinese sentences.
- Capability: Understanding
- Scale: 100
- Task Description: Identify whether a character is a homophone.
- Capability: Understanding
- Scale: 200
- Task Description: Distinguish between different meanings of the same character.
- Capability: Understanding
- Scale: 200
- Task Description: Write in Classical Chinese.
- Capability: Generation
- Scale: 100
- Task Description: Answer appreciation questions based on ancient poetry.
- Capability: Understanding
- Scale: 150
- Task Description: Conduct a free and detailed analysis of ancient poetry.
- Capability: Understanding
- Scale: 100
- Task Description: Compose a poem based on the theme.
- Capability: Generation
- Scale: 30
- Task Description: Compose a ci based on the theme.
- Capability: Generation
- Scale: 50
- Task Description: Compose a qu based on the theme.
- Capability: Generation
- Scale: 20
- Task Description: Provide the complete text of an ancient poem given its title and author.
- Capability: Knowledge
- Scale: 200
- Task Description: Provide the title and author of an ancient poem given its content.
- Capability: Knowledge
- Scale: 200
- Task Description: Write the next sentence of an ancient poem given the preceding sentence.
- Capability: Knowledge
- Scale: 100
- Task Description: Write the preceding sentence of an ancient poem given the next sentence.
- Capability: Knowledge
- Scale: 100
- Task Description: Provide ancient poetry sentences that meet the requirements.
- Capability: Knowledge
- Scale: 30
- Task Description: Judge the genre of ancient poetry.
- Capability: Knowledge
- Scale: 120
- Task Description: Translate ancient poetry into modern Chinese.
- Capability: Understanding
- Scale: 200
- Task Description: Judge the sentiment contained in ancient poetry.
- Capability: Understanding
- Scale: 200
- Task Description: Translate ancient poetry into English.
- Capability: Understanding
- Scale: 50
- Task Description: Provide a detailed introduction of the poet.
- Capability: Knowledge
- Scale: 110
- Task Description: Provide the meanings of the imagery.
- Capability: Knowledge
- Scale: 185
- Task Description: Write the second line of a couplet given the first line.
- Capability: Generation
- Scale: 100
- Task Description: Write a couplet based on the theme.
- Capability: Generation
- Scale: 100
- Task Description: Write a HengPi (the horizontal scroll that accompanies a couplet) based on the content of a couplet.
- Capability: Generation
- Scale: 100
- Task Description: Provide the synonym for the idiom.
- Capability: Knowledge
- Scale: 100
- Task Description: Provide the source of the idiom.
- Capability: Knowledge
- Scale: 100
- Task Description: Extract idioms from ancient Chinese sentences and provide their meanings.
- Capability: Knowledge
- Scale: 100
- Task Description: Provide the meaning of idioms.
- Capability: Knowledge
- Scale: 100
- Task Description: Guess the answer based on clues or clever hints.
- Capability: Knowledge
- Scale: 100
- Task Description: Complete the second half of the proverb based on the first half.
- Capability: Knowledge
- Scale: 100
- Task Description: Answer questions about ancient Chinese phonetics and rhymes.
- Capability: Knowledge
- Scale: 100
- Task Description: Answer questions about Sinology.
- Capability: Knowledge
- Scale: 130
The construction pipeline of WenMind includes data collection and data processing, as illustrated in Figure 2.
Figure 2: Construction pipeline of WenMind Benchmark.
Table 1 provides the statistics of the WenMind dataset.
Table 1: The statistics of the WenMind Benchmark. "Q" represents "Question" and "A" represents "Answer".
| Domain | Tasks | #Q | Max. #Q | Min. #Q | Avg. Q Tokens | Avg. A Tokens |
|---|---|---|---|---|---|---|
| Ancient Prose | 15 | 1,900 | 200 | 7 | 107.51 | 62.12 |
| Ancient Poetry | 16 | 1,845 | 200 | 20 | 73.42 | 94.93 |
| Ancient Literary Culture | 11 | 1,130 | 130 | 100 | 26.68 | 14.26 |
| Overall | 42 | 4,875 | 200 | 7 | 75.87 | 63.44 |
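The Overall row appears to be the question-count-weighted mean of the three domain rows rather than a simple average of them. The short check below, using only the numbers from Table 1, reproduces the overall question count and average token figures. The function name `overall_stats` is just an illustrative choice.

```python
# Per-domain rows copied from Table 1:
# (number of questions, avg. question tokens, avg. answer tokens)
TABLE_1 = {
    "Ancient Prose": (1900, 107.51, 62.12),
    "Ancient Poetry": (1845, 73.42, 94.93),
    "Ancient Literary Culture": (1130, 26.68, 14.26),
}


def overall_stats(rows):
    """Return (total questions, weighted avg. Q tokens, weighted avg. A tokens)."""
    total = sum(n for n, _, _ in rows.values())
    # Weight each domain's average by its number of questions.
    avg_q = sum(n * q for n, q, _ in rows.values()) / total
    avg_a = sum(n * a for n, _, a in rows.values()) / total
    return total, round(avg_q, 2), round(avg_a, 2)


print(overall_stats(TABLE_1))  # (4875, 75.87, 63.44)
```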
For open-source models, we perform inference locally, only requiring the model path and the output file path for the answers.
--model_path: The path to the model. By default, the model is loaded from Hugging Face.
--output_path: The file path for the model's answer output. Defaults to {model_name}_result.json.
e.g.
CUDA_VISIBLE_DEVICES=0,1 python Evaluation_Code/Inference/Test_Baichuan2-7B-Chat.py \
--model_path baichuan-inc/Baichuan2-7B-Chat \
--output_path Baichuan2-7B-Chat_result.json
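For readers adapting the inference scripts to another model, the two options described above could be handled roughly as follows. This is a minimal sketch, not the repository's actual script: the function name `parse_args`, the default model identifier, and the way `{model_name}` is derived from the last path component are all assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse --model_path and --output_path (hypothetical sketch of the CLI)."""
    parser = argparse.ArgumentParser(description="Local inference options (sketch)")
    parser.add_argument(
        "--model_path",
        # Assumption: a Hugging Face model id is used when no local path is given.
        default="baichuan-inc/Baichuan2-7B-Chat",
        help="Local directory or Hugging Face model id",
    )
    parser.add_argument(
        "--output_path",
        default=None,
        help="Where to write the answers; defaults to {model_name}_result.json",
    )
    args = parser.parse_args(argv)
    if args.output_path is None:
        # Assumption: model_name is the last component of the model path.
        model_name = os.path.basename(os.path.normpath(args.model_path))
        args.output_path = f"{model_name}_result.json"
    return args


# An empty argument list exercises the defaults.
print(parse_args([]).output_path)  # Baichuan2-7B-Chat_result.json
```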
For GPT-3.5 and GPT-4 models, provide two parameters: api_base and api_key.
For ERNIE-3.5 and ERNIE-4.0 models, provide two parameters: api_key and secret_key.
For Spark models, provide three parameters: api_key, secret_key, and appid.
Refer to the official documentation of each API model for details.
e.g.
python Test_ERNIE-3.5-8K-0329.py \
--API_KEY {api_key} \
--SECRET_KEY {secret_key} \
--output_path {output_path}
Step 1: Check whether the LLM response file is consistent with the format of the JSON/LLM_Response_Examples.json file.
Step 2: Open the Evaluation_Code/LLM_Scoring.py file, input the API_KEY and SECRET_KEY for the scoring model ERNIE-3.5, replace LLM_response_path with the storage path of the LLM response file, replace LLM_score_path with the path where the scoring results will be saved, and replace LLM_prompt_path with the storage path of JSON/Task_Score_Prompt.json.
Step 3: Run the following command to obtain the scoring results:
python Evaluation_Code/LLM_Scoring.py
Step 1: Check whether the scoring file is consistent with the format of the JSON/LLM_Score_Examples.json file.
Step 2: Open the Evaluation_Code/Calculate_Score.py file and replace LLM_score_path with the storage path of the scoring file.
Step 3: Run the following command to obtain the model's score:
python Evaluation_Code/Calculate_Score.py
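Since results are reported per domain and per capability, the aggregation step can be pictured as a grouped average. The sketch below is only an illustration of that idea and is not taken from Calculate_Score.py: it assumes each scoring record carries `domain` and `capability` fields plus a numeric `score` field, which is a hypothetical layout (the real one is defined by JSON/LLM_Score_Examples.json). The record values are invented toy numbers, not benchmark results.

```python
from collections import defaultdict


def mean_score_by(records, field):
    """Average the (hypothetical) 'score' field over records grouped by `field`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        totals[record[field]] += record["score"]
        counts[record[field]] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Toy records with invented scores, for illustration only.
records = [
    {"domain": "ancient prose", "capability": "understanding", "score": 80},
    {"domain": "ancient prose", "capability": "generation", "score": 60},
    {"domain": "ancient poetry", "capability": "knowledge", "score": 90},
]

print(mean_score_by(records, "domain"))
print(mean_score_by(records, "capability"))
```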
Table 2: Results of all evaluated models on different domains and capabilities.
- SCUT-C2MChn
- WYWEB
- Daizhige
- ACLUE
- Websites-A Related to Ancient Poetry
- Websites-B Related to Ancient Poetry
- Sou Yun
- THU-FSPC
- Han Dian
The work is licensed under an MIT License.
The WenMind benchmark is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Important Notice:
The original data of this dataset are collected from publicly accessible sources such as the Internet, and the copyright remains with the original content providers. The curated and annotated dataset reported in this case is intended for non-commercial use only and is currently licensed exclusively to universities and research institutions. If you wish to apply for access to this dataset, please complete the required application form in accordance with the instructions provided on the dataset website. The signature section of the application must be signed by a full-time staff member of a university or research institute. Where possible, please affix an official institutional seal (a seal from a secondary-level unit is acceptable) to facilitate the review and approval process.


