WenMind Benchmark

2024/09/26: The WenMind Benchmark paper has been accepted by NeurIPS 2024.

WenMind is a comprehensive benchmark dedicated to evaluating Large Language Models (LLMs) in Chinese Classical Literature and Language Arts (CCLLA). It covers three sub-domains (Ancient Prose, Ancient Poetry, and Ancient Literary Culture) and comprises 4,875 question-answer pairs spanning 42 fine-grained tasks (see Figure 1), 3 question formats (Fill-in-the-Blank, Multiple-Choice, and Question-and-Answer), and 2 evaluation scenarios (domain-oriented and capability-oriented).

Figure 1: Overview of the WenMind Benchmark, which covers 3 sub-domains and 42 fine-grained tasks.

Download

You can obtain the complete WenMind evaluation dataset from the WenMind_Benchmark folder on GitHub.

Data Format

  {
    "id": 2464,
    "domain": "ancient literary culture",
    "capability": "knowledge",
    "question_format": "QA",
    "coarse_grained_task_zh": "成语",
    "coarse_grained_task_en": "idiom",
    "fine_grained_task_zh": "成语解释",
    "fine_grained_task_en": "idiom explanation",
    "question": "解释下面成语的意思：\n暮去朝来",
    "answer": "黄昏过去，清晨又到来。形容时光流逝。"
  }

The following is an explanation of the various fields in the data samples:

id: The unique identifier for the data sample, used to distinguish different samples.
domain: The domain to which the data sample belongs, including ancient prose, ancient poetry and ancient literary culture.
capability: The type of capability of the data sample, including knowledge, understanding and generation.
question_format: The format of the question, indicating the type of question in the sample, including FB, MCQ and QA.
coarse_grained_task_zh: The Chinese name of the coarse-grained task classification. Describes the coarse-grained task category of the sample, with a total of 26 categories.
coarse_grained_task_en: The English name of the coarse-grained task classification. Corresponds to coarse_grained_task_zh, describing the coarse-grained task category of the sample, with a total of 26 categories.
fine_grained_task_zh: The Chinese name of the fine-grained task classification. Describes the fine-grained task category of the sample, with a total of 42 categories.
fine_grained_task_en: The English name of the fine-grained task classification. Corresponds to fine_grained_task_zh, describing the fine-grained task category of the sample, with a total of 42 categories.
question: The text of the question to be answered in the sample.
answer: The answer to the corresponding question, given as a detailed response.
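As a rough illustration of how these fields might be consumed, the sketch below parses one record and checks it against the field list above. It assumes the dataset file is a JSON array of such records (not confirmed here); the helper names `missing_fields` and `count_by` are hypothetical, and an inline one-record sample stands in for the real file.

```python
import json
from collections import Counter

# Field names taken from the sample record shown above.
EXPECTED_FIELDS = {
    "id", "domain", "capability", "question_format",
    "coarse_grained_task_zh", "coarse_grained_task_en",
    "fine_grained_task_zh", "fine_grained_task_en",
    "question", "answer",
}

def missing_fields(record):
    """Return the expected fields that a record lacks, sorted by name."""
    return sorted(EXPECTED_FIELDS - set(record))

def count_by(records, field):
    """Count how many records share each value of the given field."""
    return Counter(r[field] for r in records)

# Assumption: the dataset is a JSON array of records like the sample above.
# This inline string stands in for the real file contents.
sample_json = """[{
  "id": 2464, "domain": "ancient literary culture", "capability": "knowledge",
  "question_format": "QA",
  "coarse_grained_task_zh": "成语", "coarse_grained_task_en": "idiom",
  "fine_grained_task_zh": "成语解释", "fine_grained_task_en": "idiom explanation",
  "question": "解释下面成语的意思：\\n暮去朝来",
  "answer": "黄昏过去，清晨又到来。形容时光流逝。"
}]"""
records = json.loads(sample_json)

for record in records:
    problems = missing_fields(record)
    if problems:
        print(f"record {record.get('id')} is missing: {problems}")

print(count_by(records, "domain"))
```

With the real file, `records` would instead come from `json.load` on a file opened with `encoding="utf-8"`, so that the Chinese text is decoded correctly.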

Task List

T1-1: Inverted Sentence Structure (倒装句语序)

Task Description: Restore the normal word order of inverted sentences.
Capability: Understanding
Scale: 18

T1-2: Elliptical Sentence (省略句)

Task Description: Supply the information omitted from the elliptical sentence.
Capability: Understanding
Scale: 32

T1-3: Inverted Sentence Types (倒装句类型)

Task Description: Identify the inversion type of inverted sentences.
Capability: Understanding
Scale: 7

T1-4: Sentence Structure Identification (判断句式)

Task Description: Identify the sentence's syntactic type.
Capability: Understanding
Scale: 43

T2: Classical Chinese to Modern Chinese (文白翻译)

Task Description: Translate classical Chinese into modern Chinese.
Capability: Understanding
Scale: 200

T3: Modern Chinese to Classical Chinese (白文翻译)

Task Description: Translate modern Chinese into classical Chinese.
Capability: Understanding
Scale: 200

T4: Named Entity Recognition (命名实体识别)

Task Description: Extract named entities from Classical Chinese sentences.
Capability: Understanding
Scale: 200

T5: Punctuation (句读)

Task Description: Add punctuation to Classical Chinese sentences.
Capability: Understanding
Scale: 200

T6: Topic Classification (主题分类)

Task Description: Select theme categories based on Classical Chinese sentences.
Capability: Understanding
Scale: 200

T7: Word Explanation (字词解释)

Task Description: Explain the words and phrases in Classical Chinese sentences.
Capability: Understanding
Scale: 100

T8: Reading Comprehension (阅读理解)

Task Description: Read Classical Chinese texts and answer related questions.
Capability: Understanding
Scale: 100

T9: Function Words (虚词)

Task Description: Explain the usage of function words in Classical Chinese sentences.
Capability: Understanding
Scale: 100

T10: Homophones (通假字)

Task Description: Identify whether a character is a homophone.
Capability: Understanding
Scale: 200

T11: Polysemy (单字多义)

Task Description: Distinguish between different meanings of the same character.
Capability: Understanding
Scale: 200

T12: Classical Chinese Writing (文言文写作)

Task Description: Write in Classical Chinese.
Capability: Generation
Scale: 100

T13-1: Appreciation Exam Questions (赏析真题)

Task Description: Answer appreciation questions based on ancient poetry.
Capability: Understanding
Scale: 150

T13-2: Free Appreciation (自由赏析)

Task Description: Conduct a free and detailed analysis of ancient poetry.
Capability: Understanding
Scale: 100

T14-1: Poetry Writing (诗创作)

Task Description: Compose a poem based on the theme.
Capability: Generation
Scale: 30

T14-2: Ci Writing (词创作)

Task Description: Compose a ci based on the theme.
Capability: Generation
Scale: 50

T14-3: Qu Writing (曲创作)

Task Description: Compose a qu based on the theme.
Capability: Generation
Scale: 20

T15-1: Content Q&A (内容问答)

Task Description: Provide the complete text of an ancient poem given its title and author.
Capability: Knowledge
Scale: 200

T15-2: Title and Author Q&A (题目作者问答)

Task Description: Answer the title and author according to the content of ancient poetry.
Capability: Knowledge
Scale: 200

T15-3: Write the Next Sentence (下句默写)

Task Description: Write the next sentence according to the previous sentence in the ancient poem.
Capability: Knowledge
Scale: 100

T15-4: Write the Previous Sentence (上句默写)

Task Description: Write the previous sentence according to the next sentence in the ancient poem.
Capability: Knowledge
Scale: 100

T15-5: Comprehension Dictation (理解性默写)

Task Description: Provide ancient poetry sentences that meet the requirements.
Capability: Knowledge
Scale: 30

T15-6: Genre Judgment (判断体裁)

Task Description: Judge the genre of ancient poetry.
Capability: Knowledge
Scale: 120

T16: Ancient Poetry Translation (古诗词翻译)

Task Description: Translate ancient poetry into modern Chinese.
Capability: Understanding
Scale: 200

T17: Sentiment Classification (情感分类)

Task Description: Judge the sentiment contained in ancient poetry.
Capability: Understanding
Scale: 200

T18: Ancient Poetry to English (古诗词英文翻译)

Task Description: Translate ancient poetry into English.
Capability: Understanding
Scale: 50

T19: Poet Introduction (诗人介绍)

Task Description: Provide a detailed introduction of the poet.
Capability: Knowledge
Scale: 110

T20: Analysis of Imagery (意象解析)

Task Description: Provide the meanings of the imagery.
Capability: Knowledge
Scale: 185

T21-1: Couplet Following (接下联)

Task Description: Write the second line of a couplet based on the first line.
Capability: Generation
Scale: 100

T21-2: Couplet Writing (主题创作)

Task Description: Write a couplet based on the theme.
Capability: Generation
Scale: 100

T21-3: HengPi Writing (拟横批)

Task Description: Write HengPi based on the content of a couplet.
Capability: Generation
Scale: 100

T22-1: Synonyms (近义词)

Task Description: Provide the synonym for the idiom.
Capability: Knowledge
Scale: 100

T22-2: The Origin of Idiom (成语出处)

Task Description: Provide the source of the idiom.
Capability: Knowledge
Scale: 100

T22-3: Idiom Finding (成语蕴含)

Task Description: Extract idioms from ancient Chinese sentences and provide their meanings.
Capability: Knowledge
Scale: 100

T22-4: Idiom Explanation (解释含义)

Task Description: Provide the meaning of idioms.
Capability: Knowledge
Scale: 100

T23: Riddle (谜语)

Task Description: Guess the answer based on clues or clever hints.
Capability: Knowledge
Scale: 100

T24: Xiehouyu (歇后语)

Task Description: Complete the second half of the proverb based on the first half.
Capability: Knowledge
Scale: 100

T25: Historical Chinese Phonology (古汉语音韵)

Task Description: Answer questions about ancient Chinese phonetics and rhymes.
Capability: Knowledge
Scale: 100

T26: Knowledge of Sinology Q&A (国学常识问答)

Task Description: Answer questions about Sinology.
Capability: Knowledge
Scale: 130
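The scales listed above can be cross-checked against the headline figures (42 fine-grained tasks, 4,875 question-answer pairs). The sketch below copies the scales from the task list. The grouping of task IDs into domains (T1 to T12, T13 to T20, T21 to T26) is an inference from the task contents, not something stated explicitly in this list, and `domain_of` is a hypothetical helper.

```python
# Scales copied from the task list above, keyed by task ID.
SCALES = {
    "T1-1": 18, "T1-2": 32, "T1-3": 7, "T1-4": 43,
    "T2": 200, "T3": 200, "T4": 200, "T5": 200, "T6": 200,
    "T7": 100, "T8": 100, "T9": 100, "T10": 200, "T11": 200, "T12": 100,
    "T13-1": 150, "T13-2": 100, "T14-1": 30, "T14-2": 50, "T14-3": 20,
    "T15-1": 200, "T15-2": 200, "T15-3": 100, "T15-4": 100,
    "T15-5": 30, "T15-6": 120,
    "T16": 200, "T17": 200, "T18": 50, "T19": 110, "T20": 185,
    "T21-1": 100, "T21-2": 100, "T21-3": 100,
    "T22-1": 100, "T22-2": 100, "T22-3": 100, "T22-4": 100,
    "T23": 100, "T24": 100, "T25": 100, "T26": 130,
}

def domain_of(task_id):
    """Map a task ID to its domain (grouping inferred, see lead-in)."""
    number = int(task_id[1:].split("-")[0])
    if number <= 12:
        return "Ancient Prose"
    if number <= 20:
        return "Ancient Poetry"
    return "Ancient Literary Culture"

# Sum the scales per domain.
totals = {}
for task_id, scale in SCALES.items():
    domain = domain_of(task_id)
    totals[domain] = totals.get(domain, 0) + scale

print(len(SCALES), sum(SCALES.values()))
print(totals)
```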

Data Construction

The construction pipeline of WenMind includes data collection and data processing, as illustrated in Figure 2.

Figure 2: Construction pipeline of WenMind Benchmark.

Data Statistics

Table 1 provides the statistics of the WenMind dataset.

Table 1: The statistics of the WenMind Benchmark. "Q" represents "Question" and "A" represents "Answer".

Domain	Tasks	#Q	Max. #Q	Min. #Q	Avg. Q Tokens	Avg. A Tokens
Ancient Prose	15	1,900	200	7	107.51	62.12
Ancient Poetry	16	1,845	200	20	73.42	94.93
Ancient Literary Culture	11	1,130	130	100	26.68	14.26
Overall	42	4,875	200	7	75.87	63.44
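The count columns of Table 1 can in principle be recomputed from the dataset records. The sketch below is one way to do so, assuming records in the format shown under Data Format and assuming fine-grained task names are not shared across domains. `domain_statistics` is a hypothetical helper, the toy records are not real WenMind data, and the token averages are left out because they depend on a tokenizer that is not specified here.

```python
from collections import Counter, defaultdict

def domain_statistics(records):
    """Per domain: number of fine-grained tasks, #Q, and max/min #Q per task."""
    per_task = defaultdict(Counter)
    for r in records:
        per_task[r["domain"]][r["fine_grained_task_en"]] += 1

    stats = {}
    for domain, task_counts in per_task.items():
        sizes = list(task_counts.values())
        stats[domain] = {
            "tasks": len(sizes),
            "questions": sum(sizes),
            "max_q": max(sizes),
            "min_q": min(sizes),
        }
    return stats

# Toy records for illustration only (not real WenMind data).
toy = (
    [{"domain": "ancient prose", "fine_grained_task_en": "punctuation"}] * 3
    + [{"domain": "ancient prose", "fine_grained_task_en": "function words"}] * 1
    + [{"domain": "ancient poetry", "fine_grained_task_en": "poet introduction"}] * 2
)
stats = domain_statistics(toy)
for domain, row in stats.items():
    print(domain, row)
```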

Inference

a. Obtain the model’s responses

Open-source Model

For open-source models, we perform inference locally, only requiring the model path and the output file path for the answers.

--model_path The path to the model, defaults to loading from huggingface
--output_path The file path for the model's answer output, defaults to {model_name}_result.json

e.g.

CUDA_VISIBLE_DEVICES=0,1 python Evaluation_Code/Inference/Test_Baichuan2-7B-Chat.py \
    --model_path baichuan-inc/Baichuan2-7B-Chat \
    --output_path Baichuan2-7B-Chat_result.json
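To illustrate the default described above for --output_path, here is a minimal argparse sketch. It is not the repository's actual script: the function `parse_args`, the default value chosen for --model_path, and the way the model name is derived from the path are all assumptions made for illustration.

```python
import argparse
from pathlib import PurePosixPath

def parse_args(argv=None):
    """Parse the two options described above (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="Inference options (sketch)")
    parser.add_argument(
        "--model_path",
        default="baichuan-inc/Baichuan2-7B-Chat",  # assumed default, a Hugging Face model ID
        help="Local path or Hugging Face model ID",
    )
    parser.add_argument(
        "--output_path",
        default=None,
        help="Output file for answers; defaults to {model_name}_result.json",
    )
    args = parser.parse_args(argv)
    if args.output_path is None:
        # Assumption: the model name is the last component of the model path.
        model_name = PurePosixPath(args.model_path).name
        args.output_path = f"{model_name}_result.json"
    return args

args = parse_args(["--model_path", "baichuan-inc/Baichuan2-7B-Chat"])
print(args.output_path)
```

Under these assumptions, omitting --output_path yields the same file name as the one passed explicitly in the command above.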

API Model

For GPT-3.5 and GPT-4 models, provide two parameters: api_base and api_key.
For ERNIE-3.5 and ERNIE-4.0 models, provide two parameters: api_key and secret_key.
For Spark models, provide three parameters: api_key, secret_key, and appid.
Refer to the official documentation of each API model for details.

e.g.

python Test_ERNIE-3.5-8K-0329.py \
    --API_KEY {api_key} \
    --SECRET_KEY {secret_key} \
    --output_path {output_path}

b. Use ERNIE-3.5 to score the responses

Step 1: Check whether the LLM response file is consistent with the format of the JSON/LLM_Response_Examples.json file.

Step 2: Open the Evaluation_Code/LLM_Scoring.py file, input the API_KEY and SECRET_KEY for the scoring model ERNIE-3.5, replace LLM_response_path with the storage path of the LLM response file, replace LLM_score_path with the path where the scoring results will be saved, and replace LLM_prompt_path with the storage path of JSON/Task_Score_Prompt.json.

Step 3: Run the following command to obtain the scoring results:

python Evaluation_Code/LLM_Scoring.py
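The format check in Step 1 can be partly automated. The sketch below compares the keys of each response record against the first record of an example file. It assumes both files are JSON arrays of objects, which is not confirmed here; the helper names and the keys in the toy data ("id", "model_response", "reply") are hypothetical placeholders, not the actual keys in JSON/LLM_Response_Examples.json.

```python
import json

def load_json(path):
    """Load a UTF-8 JSON file (assumed to contain a list of objects)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def format_problems(records, example_records):
    """Return (index, missing_keys, extra_keys) for each record whose
    keys differ from those of the first example record."""
    expected = set(example_records[0])
    problems = []
    for i, record in enumerate(records):
        keys = set(record)
        if keys != expected:
            problems.append((i, sorted(expected - keys), sorted(keys - expected)))
    return problems

# Toy data with placeholder keys (hypothetical, for illustration only).
example = [{"id": 0, "model_response": ""}]
responses = [
    {"id": 1, "model_response": "x"},
    {"id": 2, "reply": "y"},  # wrong key name, should be reported
]
problems = format_problems(responses, example)
print(problems)
```

In practice, `example` and `responses` would be loaded with `load_json` from the example file and from the LLM response file, respectively.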

c. Calculate the model’s score

Step 1: Check whether the scoring file is consistent with the format of the JSON/LLM_Score_Examples.json file.

Step 2: Open the Evaluation_Code/Calculate_Score.py file and replace LLM_score_path with the storage path of the scoring file.

Step 3: Run the following command to obtain the model's score:

python Evaluation_Code/Calculate_Score.py
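For intuition about the two evaluation scenarios (domain-oriented and capability-oriented), the sketch below averages per-item scores by domain and by capability. The "score" field, the toy values, and the use of a plain mean are assumptions. Evaluation_Code/Calculate_Score.py may aggregate differently, so treat this as an illustration rather than a description of that script.

```python
from collections import defaultdict

def mean_score_by(scored_items, field):
    """Average the (assumed) "score" value over items sharing the same field value."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for item in scored_items:
        totals[item[field]] += item["score"]
        counts[item[field]] += 1
    return {key: totals[key] / counts[key] for key in totals}

# Toy scored items (hypothetical structure and values).
toy_scores = [
    {"domain": "ancient prose", "capability": "understanding", "score": 80},
    {"domain": "ancient prose", "capability": "generation", "score": 60},
    {"domain": "ancient poetry", "capability": "understanding", "score": 100},
]

by_domain = mean_score_by(toy_scores, "domain")
by_capability = mean_score_by(toy_scores, "capability")
print(by_domain)
print(by_capability)
```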

Evaluation Result

Table 2: Results of all evaluated models on different domains and capabilities.

Acknowledgement

License

This work is licensed under the MIT License.

The WenMind benchmark is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Important Notice:

The original data of this dataset were collected from publicly accessible sources such as the Internet, and the copyright remains with the original content providers. The curated and annotated dataset described here is intended for non-commercial use only and is currently licensed exclusively to universities and research institutions. To apply for access to this dataset, please complete the required application form in accordance with the instructions provided on the dataset website. The signature section of the application must be signed by a full-time staff member of a university or research institute. Where possible, please affix an official institutional seal (a seal from a secondary-level unit is acceptable) to facilitate the review and approval process.

SCUT-DLVCLab/WenMind
