
LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models

Paper


Note: For the Chinese version of this README, please refer to README_zh.md.

📚 Benchmark Content and Format

LLMEval-Fair focuses on evaluating professional knowledge capabilities, covering the 13 academic disciplines defined by the Ministry of Education: Philosophy, Economics, Law, Education, Literature, History, Science, Engineering, Agriculture, Medicine, Military Science, Management, and Arts. It spans more than 50 sub-disciplines and contains approximately 200,000 standardized generative question-answering items, and we will continue expanding the question bank toward 1 million items.

Academic Disciplines Coverage

Question sources mainly include undergraduate homework, undergraduate mid-term and final exams, and graduate entrance exams. To prevent large models from having been exposed to a significant portion of the evaluation data during pre-training, LLMEval-Fair sources its questions from non-public channels where possible. Source materials arrive as PDF and Word documents, which undergo OCR and data cleaning before being converted into a standardized format. A standardized interface is provided for the different question types so that the models under test can be evaluated in a fully automated process.

Unlike other knowledge benchmarks that use a multiple-choice format, LLMEval-Fair treats all questions as generative knowledge question-answering. It includes a variety of formats such as short answer, calculation, true/false, analysis, and essay questions. Compared to standardized multiple-choice questions, the generative format used in LLMEval-Fair better reflects real-world user needs and the language capabilities of the models.

The question bank dataset is available in data/.
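
To make the standardized item format more concrete, the sketch below shows one plausible way to represent a cleaned question. The field names (item_id, discipline, question_type, reference_answer, and so on) and the example values are illustrative assumptions only; they are not guaranteed to match the actual schema shipped in data/.

```python
from dataclasses import dataclass

# Hypothetical representation of one standardized item after OCR and cleaning.
# Field names are illustrative assumptions, not the actual schema used in data/.
@dataclass
class QAItem:
    item_id: str           # unique identifier within the question bank
    discipline: str        # one of the 13 top-level disciplines, e.g. "Law"
    sub_discipline: str    # finer-grained subject, e.g. "Criminal Law"
    question_type: str     # "short_answer", "calculation", "true_false", "analysis", or "essay"
    question: str          # question text sent to the model under test
    reference_answer: str  # gold answer used by the automated scorer

example_item = QAItem(
    item_id="law-000123",
    discipline="Law",
    sub_discipline="Criminal Law",
    question_type="short_answer",
    question="Briefly explain the principle of legality in criminal law.",
    reference_answer="No conduct is criminal and no punishment may be imposed "
                     "without a pre-existing legal provision.",
)
```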

🔬 Methodology

Evaluation Pipeline

Preventing cheating is a key consideration for LLMEval-Fair. Many existing public benchmarks suffer from test set leakage, which can lead to unfair practices such as leaderboard hacking and score inflation. In LLMEval-Fair, each participating system must complete 1,000 questions randomly sampled from the total question bank. For models from the same institution, we ensure that questions are not repeated across subsequent evaluations. The evaluation is conducted online, with questions sent sequentially in a single round; the next question is sent only after the previous one has been answered, which prevents malicious crawling of the question bank.
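
A minimal sketch of this sequential protocol, under the assumptions that items carry an identifier and that the system under test is exposed as a blocking callable, is shown below. It is illustrative only and not the project's actual evaluation service.

```python
import random

def run_online_evaluation(question_bank, ask_model, used_item_ids, n=1000, seed=None):
    """Illustrative sketch of the sequential online protocol (not the actual service).

    question_bank  -- list of dicts with at least "item_id" and "question" keys
    ask_model      -- callable that sends one question to the system under test
                      and blocks until it answers
    used_item_ids  -- item IDs already assigned to models from the same institution,
                      so that questions are not repeated in subsequent evaluations
    """
    rng = random.Random(seed)
    candidates = [q for q in question_bank if q["item_id"] not in used_item_ids]
    sampled = rng.sample(candidates, n)  # 1,000 questions sampled at random

    transcripts = []
    for q in sampled:                          # strictly one question at a time:
        answer = ask_model(q["question"])      # the next question is only sent after
        transcripts.append({"item_id": q["item_id"], "answer": answer})  # this reply
    return transcripts
```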

This round of evaluation uses an automated scoring method, with GPT-4 Turbo as the current evaluation model. Each question is scored on a scale of 0-3 points. The scoring focuses on the core correctness of the answer and the validity of the explanation, with core correctness being the primary metric. The evaluation prompt used is as follows:

Please evaluate the following response from the LLM regarding a discipline-specific question based on the following criteria. You must score it on a scale of 0, 1, 2 or 3 stars:

Overall Rating:
0 stars indicate wrong answer with a wrong explanation
1 star indicates wrong answer but a partially reasonable explanation
2 stars indicate a correct answer with a partially reasonable explanation
3 stars indicate a correct answer with a reasonable explanation

User: {question}

LLM:{answer_from_llm}

The correct answer to user's question is: {correct_answer}

You must provide your feedback in the following format:
{"Overall Rating":numbers of its stars(int)}

Scoring

To mitigate systematic bias introduced by randomly sampling 1,000 questions, LLMEval-Fair uses both relative scores and absolute scores.

Relative Score Calculation: Given the rapid development of large language model technology, we introduce a relative score to measure the gap between a model and the current state-of-the-art performance. We select the top-performing model on the leaderboard as the SOTA baseline, which is currently Doubao-1.5-Thinking-Pro:

$$R_{\text{SOTA}}^{\text{model}}=\frac{S_{\text{model}}}{S_{\text{SOTA}}} \times 100$$

Absolute Score Calculation: The absolute score represents the model's raw performance on the N = 1,000 sampled questions. It is calculated by normalizing each question's score (0-3 points) to a 0-100 scale and averaging over all questions:

$$S_{\text{model}}=\frac{1}{N}\sum_{i=1}^{N}\frac{s_i}{s_{\max}} \times 100$$

where $s_i$ is the score for question $i$ and $s_{\max}=3$ is the maximum per-question score.

Scoring Notes: $S_{\text{model}}$ is the absolute score (0-100 scale), $R_{\text{SOTA}}^{\text{model}}$ is the relative score (with the SOTA model as the 100-point baseline), and discipline-specific scores use a 10-point scale.
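
For concreteness, both scores can be computed directly from the per-question ratings as in the sketch below. The function names are ours, and the 93.67 SOTA absolute score is taken from the current leaderboard that follows.

```python
def absolute_score(per_question_scores, s_max=3):
    # Absolute score on a 0-100 scale: average per-question score normalized by s_max.
    n = len(per_question_scores)
    return sum(s / s_max * 100 for s in per_question_scores) / n

def relative_score(model_abs, sota_abs):
    # Relative score with the current SOTA model (Doubao-1.5-Thinking-Pro) as the 100 baseline.
    return model_abs / sota_abs * 100

# Example: a model scoring 3 points on 700 questions and 2 points on the other 300.
ratings = [3] * 700 + [2] * 300
s_model = absolute_score(ratings)            # 90.0
r_model = relative_score(s_model, 93.67)     # about 96.08
```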

🏆 Current Leaderboard (As of December 2025)

📋 Overall Scores

| Model Name | Organization | Access Type | Evaluation Date | Relative Score | Absolute Score |
|---|---|---|---|---|---|
| Doubao-1.5-Thinking-Pro | ByteDance | API | 2025.7.21 | 100.00 | 93.67 |
| DeepSeek-R1 | DeepSeek | API | 2025.7.21 | 97.40 | 91.23 |
| Gemini-2.5-Pro-Preview | Google | API | 2025.7.21 | 97.22 | 91.07 |
| Gemini-2.5-Pro-Preview-Thinking | Google | API | 2025.7.21 | 97.15 | 91.00 |
| DeepSeek-V3 | DeepSeek | API | 2025.7.21 | 96.48 | 90.37 |
| Qwen3-235B | Alibaba Cloud | API | 2025.7.21 | 96.44 | 90.33 |
| Doubao-1.5-Pro-256K | ByteDance | API | 2025.7.21 | 95.69 | 89.63 |
| GLM-4.6 | Zhipu AI | API | 2025.9.30 | 95.26 | 89.23 |
| QwQ-32B | Alibaba Cloud | API | 2025.8.1 | 94.52 | 88.54 |
| Kimi-K2 | Moonshot AI | API | 2025.9.5 | 94.27 | 88.30 |
| GPT-5 | OpenAI | API | 2025.8.7 | 93.84 | 87.90 |
| Claude-Sonnet-4.5-Thinking | Anthropic | API | 2025.9.29 | 93.48 | 87.57 |
| o1-2024-12-17 | OpenAI | API | 2025.7.21 | 93.35 | 87.43 |
| Claude-Sonnet-4.5 | Anthropic | API | 2025.9.29 | 93.31 | 87.40 |
| Gemini-2.5-Flash-Thinking | Google | API | 2025.8.1 | 92.74 | 86.87 |
| DeepSeek-V3.2 | DeepSeek | API | 2025.12.1 | 92.27 | 86.43 |
| Qwen3-32B | Alibaba Cloud | API | 2025.7.21 | 92.21 | 86.37 |
| Claude-Sonnet-4-Thinking | Anthropic | API | 2025.7.21 | 91.03 | 85.27 |
| Claude-Sonnet-4 | Anthropic | API | 2025.7.21 | 91.00 | 85.23 |
| GPT-4o-Search-Preview | OpenAI | API | 2025.7.21 | 89.40 | 83.73 |
| GLM-4-32B | Tsinghua&Zhipu.AI | API | 2025.8.1 | 88.43 | 82.83 |
| GPT-4o-2024-11-20 | OpenAI | API | 2025.7.21 | 88.08 | 82.50 |
| Gemini-1.5-Pro | Google | API | 2025.8.1 | 85.92 | 80.47 |
| Qwen2.5-32B-Instruct | Alibaba Cloud | API | 2025.8.1 | 85.07 | 79.68 |
| o3-Mini | OpenAI | API | 2025.7.21 | 84.13 | 78.80 |
| Qwen-Turbo-1101 | Alibaba Cloud | API | 2025.8.1 | 83.71 | 78.41 |
| Claude-3.5-Sonnet | Anthropic | API | 2025.8.1 | 83.38 | 78.10 |
| o1-Mini-2024-09-12 | OpenAI | API | 2025.8.1 | 78.93 | 73.93 |
| GPT-4 Turbo(gpt-4-1106-preview) | OpenAI | API | 2023.11.18 | 78.56 | 73.6 |
| GPT-4-0125-Preview | OpenAI | API | 2024.1.28 | 76.44 | 71.6 |
| Baidu-4.0 | Baidu | API | 2023.11.1 | 75.09 | 70.33 |
| Yi-34B-Chat | 01.AI | API | 2023.12.1 | 70.17 | 65.70 |
| Baidu-3.5 | Baidu | API | 2023.11.1 | 69.14 | 64.73 |
| ChatGLM-Pro | Tsinghua&Zhipu.AI | API | 2023.11.1 | 69.14 | 64.73 |
| Megrez-3B-Instruct | Megrez | API | 2024.12.16 | 67.01 | 62.77 |
| GPT-4-0613 | OpenAI | API | 2023.9.29 | 66.17 | 61.97 |
| iFlytek-Spark-v3.0 | iFlytek | API | 2023.11.7 | 65.64 | 61.47 |
| Qwen2-7B-Instruct | Alibaba Cloud | API | 2024.6.6 | 65.15 | 61.03 |
| Nanbeige-Plus | NanBeiGe LLM Lab | API | 2023.12.1 | 65.14 | 61.00 |
| Phi-4-Final | Microsoft | API | 2024.12.12 | 63.98 | 59.93 |
| Claude-3-Haiku | Anthropic | API | 2025.8.1 | 62.95 | 58.97 |
| Llama-3.2-90B-Vision-Instruct | Meta | API | 2025.8.1 | 61.74 | 57.83 |
| Llama-3.3-70B | Meta | API | 2025.8.1 | 60.85 | 57.00 |
| Baichuan2-13B-Chat | Baichuan | Weights | 2023.9.29 | 58.31 | 54.6 |
| Gemini-Pro | Google | API | 2024.1.10 | 58.20 | 54.5 |
| Qwen-Plus | Alibaba Cloud | API | 2023.11.1 | 56.60 | 53.0 |
| Qwen-Turbo | Alibaba Cloud | API | 2023.11.1 | 55.78 | 52.23 |
| Nanbeige-16B | NanBeiGe LLM Lab | API | 2023.10.23 | 55.46 | 51.93 |
| GPT-3.5-Turbo | OpenAI | API | 2023.9.29 | 55.42 | 51.9 |
| MiniMax-Abab5 | MiniMax | Weights | 2023.11.1 | 55.33 | 51.83 |
| Mixtral-8x7B-Instruct | Mistral AI | Weights | 2024.1.10 | 51.69 | 48.4 |
| ChatGLM2-6B | Tsinghua&Zhipu.AI | Weights | 2023.9.29 | 42.32 | 39.63 |
| Llama-3.1-8B | Meta | API | 2024.7.23 | 41.24 | 38.63 |
| Ziya-v1.1-13B | IDEA | Weights | 2023.9.29 | 40.18 | 37.63 |
| InternLM-Chat-7B | Shanghai AI Lab&SenseTime | Weights | 2023.9.29 | 38.73 | 36.27 |
| Linly-Chinese-Llama-2-13B-HF | National Engineering Lab | Weights | 2023.10.3 | 37.06 | 34.7 |
| Phi-3-Medium-128K-Instruct | Microsoft | API | 2025.8.1 | 36.94 | 34.60 |
| BELLE-Llama2-13B-Chat-0.4M | LianjiaTech | Weights | 2023.10.1 | 36.28 | 33.97 |
| Llama-2-7B-Chat-HF | Meta | Weights | 2023.9.29 | 25.24 | 23.63 |

📊 Discipline-Specific Performance

| Model Name | Overall | Engineering | Economics | Education | Law | Literature | Management | Science | History | Medicine | Military |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Doubao-1.5-Thinking-Pro | 93.67 | 9.47 | 9.67 | 9.43 | 9.77 | 8.93 | 9.53 | 9.23 | 9.70 | 8.97 | 8.97 |
| DeepSeek-R1 | 91.23 | 9.47 | 9.43 | 9.27 | 9.37 | 8.83 | 9.37 | 9.03 | 9.53 | 8.50 | 8.43 |
| Gemini-2.5-Pro-Preview | 91.07 | 9.20 | 9.47 | 9.20 | 9.30 | 8.43 | 9.63 | 9.07 | 9.40 | 8.50 | 8.87 |
| Gemini-2.5-Pro-Preview-Thinking | 91.00 | 9.13 | 9.50 | 9.37 | 9.47 | 8.40 | 9.63 | 9.20 | 9.27 | 8.30 | 8.73 |
| DeepSeek-V3 | 90.37 | 9.30 | 9.57 | 8.93 | 9.23 | 8.60 | 9.13 | 8.97 | 9.47 | 8.83 | 8.33 |
| Qwen3-235B | 90.33 | 9.23 | 9.43 | 9.03 | 9.50 | 8.23 | 9.43 | 8.97 | 9.17 | 8.73 | 8.60 |
| Doubao-1.5-Pro-256K | 89.63 | 8.83 | 9.03 | 9.13 | 9.43 | 8.57 | 9.27 | 8.83 | 9.10 | 8.60 | 8.83 |
| GLM-4.6 | 89.23 | 8.80 | 9.27 | 8.70 | 9.23 | 8.40 | 9.63 | 8.90 | 9.30 | 8.43 | 8.57 |
| QwQ-32B | 88.54 | 8.30 | 9.46 | 9.23 | 9.33 | 7.83 | 9.46 | 8.65 | 9.27 | 8.57 | 8.43 |
| Kimi-K2 | 88.30 | 9.23 | 9.17 | 8.80 | 9.00 | 8.40 | 9.17 | 8.77 | 9.13 | 8.53 | 8.10 |
| GPT-5 | 87.90 | 8.83 | 9.37 | 8.90 | 8.87 | 8.10 | 9.10 | 8.90 | 9.03 | 8.50 | 8.30 |
| Claude-Sonnet-4.5-Thinking | 87.57 | 8.90 | 9.17 | 8.80 | 8.97 | 8.00 | 9.23 | 8.90 | 9.00 | 8.27 | 8.33 |
| o1-2024-12-17 | 87.43 | 8.90 | 9.30 | 8.67 | 8.77 | 7.73 | 9.27 | 8.90 | 8.97 | 8.17 | 8.77 |
| Claude-Sonnet-4.5 | 87.40 | 8.80 | 8.97 | 8.93 | 8.73 | 8.37 | 9.10 | 8.97 | 8.93 | 8.13 | 8.47 |
| Gemini-2.5-Flash-Thinking | 86.87 | 8.67 | 9.27 | 8.70 | 9.00 | 7.80 | 8.93 | 8.90 | 9.00 | 8.03 | 8.57 |
| DeepSeek-V3.2 | 86.43 | 8.73 | 9.13 | 8.53 | 8.70 | 7.40 | 9.33 | 8.87 | 9.37 | 8.53 | 7.83 |
| Qwen3-32B | 86.37 | 8.43 | 9.10 | 8.57 | 9.10 | 7.77 | 9.47 | 8.67 | 9.30 | 7.70 | 8.27 |
| Claude-Sonnet-4-Thinking | 85.27 | 8.57 | 9.00 | 8.63 | 8.73 | 7.57 | 9.10 | 8.93 | 8.70 | 7.97 | 8.07 |
| Claude-Sonnet-4 | 85.23 | 8.57 | 8.80 | 8.50 | 8.70 | 7.80 | 9.03 | 8.80 | 8.80 | 8.17 | 8.07 |
| GPT-4o-Search-Preview | 83.73 | 8.27 | 8.77 | 8.43 | 8.67 | 7.77 | 8.80 | 8.20 | 8.73 | 8.27 | 7.83 |
| GLM-4-32B | 82.83 | 7.77 | 8.97 | 8.33 | 8.33 | 7.03 | 9.13 | 8.27 | 8.77 | 8.23 | 8.00 |
| GPT-4o-2024-11-20 | 82.50 | 7.90 | 8.67 | 8.30 | 8.33 | 7.17 | 8.97 | 8.57 | 8.67 | 7.63 | 8.30 |
| Gemini-1.5-Pro | 80.47 | 8.13 | 8.45 | 8.30 | 8.37 | 7.04 | 8.17 | 8.43 | 8.50 | 7.48 | 7.60 |
| Qwen2.5-32B-Instruct | 79.68 | 7.70 | 8.57 | 8.33 | 8.33 | 6.70 | 8.50 | 8.17 | 7.70 | 7.60 | 8.08 |
| o3-Mini | 78.80 | 7.97 | 8.60 | 8.30 | 8.20 | 6.73 | 8.57 | 8.53 | 7.17 | 7.03 | 7.70 |
| Qwen-Turbo-1101 | 78.41 | 7.97 | 8.37 | 8.03 | 8.23 | 6.40 | 8.50 | 8.10 | 7.50 | 7.27 | 8.05 |
| Claude-3.5-Sonnet | 78.10 | 7.97 | 8.53 | 8.27 | 7.93 | 7.03 | 8.50 | 8.00 | 7.57 | 6.70 | 7.60 |
| o1-Mini-2024-09-12 | 73.93 | 7.27 | 8.43 | 7.90 | 7.53 | 6.27 | 8.27 | 8.17 | 6.43 | 6.63 | 7.03 |
| GPT-4 Turbo(gpt-4-1106-preview) | 73.6 | 6.97 | 8.17 | 8.33 | 7.8 | 6.0 | 7.57 | 8.13 | 7.0 | 6.43 | 7.2 |
| GPT-4-0125-Preview | 71.6 | 6.9 | 7.4 | 8.03 | 7.3 | 6.0 | 7.47 | 7.63 | 6.87 | 6.33 | 7.67 |
| Baidu-4.0 | 70.33 | 7.27 | 7.23 | 7.67 | 7.43 | 5.63 | 6.47 | 6.8 | 7.63 | 7.8 | 6.4 |
| Yi-34B-Chat | 65.70 | 5.77 | 6.63 | 7.37 | 7.53 | 5.47 | 5.77 | 5.47 | 7.47 | 6.3 | 7.93 |
| Baidu-3.5 | 64.73 | 6.2 | 6.7 | 7.8 | 6.83 | 5.2 | 5.5 | 6.0 | 7.23 | 6.57 | 6.7 |
| ChatGLM-Pro | 64.73 | 5.9 | 7.07 | 7.03 | 7.9 | 5.43 | 6.33 | 5.0 | 6.67 | 5.97 | 7.43 |
| Megrez-3B-Instruct | 62.77 | 5.80 | 6.77 | 6.80 | 7.13 | 5.40 | 6.87 | 5.70 | 6.53 | 5.70 | 6.07 |
| GPT-4-0613 | 61.97 | 6.5 | 6.73 | 6.6 | 6.73 | 5.43 | 6.1 | 6.47 | 5.3 | 5.2 | 6.9 |
| iFlytek-Spark-v3.0 | 61.47 | 5.77 | 6.5 | 7.27 | 7.3 | 5.7 | 5.9 | 5.03 | 6.5 | 5.23 | 6.27 |
| Qwen2-7B-Instruct | 61.03 | 5.47 | 6.73 | 6.33 | 7.60 | 5.13 | 6.17 | 6.17 | 5.73 | 5.33 | 6.37 |
| Nanbeige-Plus | 61.00 | 5.78 | 5.57 | 6.77 | 7.37 | 5.37 | 5.93 | 5.45 | 6.3 | 5.67 | 6.77 |
| Phi-4-Final | 59.93 | 5.80 | 6.47 | 6.23 | 6.53 | 5.53 | 6.30 | 6.27 | 5.50 | 5.43 | 5.87 |
| Claude-3-Haiku | 58.97 | 5.80 | 6.60 | 6.97 | 6.63 | 4.83 | 5.93 | 6.33 | 4.80 | 5.23 | 5.83 |
| Llama-3.2-90B-Vision-Instruct | 57.83 | 5.63 | 6.33 | 6.20 | 5.80 | 4.73 | 6.10 | 6.57 | 5.03 | 5.27 | 6.17 |
| Llama-3.3-70B | 57.00 | 5.80 | 6.90 | 5.63 | 5.70 | 5.47 | 5.70 | 6.30 | 4.70 | 4.87 | 5.93 |
| Baichuan2-13B-Chat | 54.6 | 4.47 | 5.53 | 7.4 | 6.9 | 4.63 | 4.8 | 4.33 | 6.23 | 4.6 | 5.7 |
| Gemini-Pro | 54.5 | 4.87 | 5.43 | 7.07 | 6.43 | 5.10 | 4.5 | 4.65 | 6.33 | 4.42 | 5.7 |
| Qwen-Plus | 53.0 | 4.4 | 5.1 | 6.53 | 6.53 | 5.0 | 4.77 | 4.87 | 5.17 | 5.13 | 5.5 |
| Qwen-Turbo | 52.23 | 4.1 | 6.07 | 6.63 | 6.43 | 4.43 | 4.53 | 4.97 | 5.27 | 4.37 | 5.43 |
| Nanbeige-16B | 51.93 | 4.37 | 5.3 | 6.5 | 6.3 | 3.97 | 4.7 | 4.07 | 5.9 | 4.73 | 6.1 |
| GPT-3.5-Turbo | 51.9 | 4.97 | 5.37 | 6.4 | 6.47 | 4.43 | 4.67 | 5.43 | 4.2 | 4.37 | 5.6 |
| MiniMax-Abab5 | 51.83 | 3.87 | 5.63 | 6.87 | 6.97 | 4.33 | 4.4 | 2.93 | 6.13 | 4.27 | 6.43 |
| Mixtral-8x7B-Instruct | 48.4 | 4.27 | 5.47 | 6.47 | 6.4 | 3.13 | 4.5 | 5.07 | 3.57 | 4.37 | 5.17 |
| ChatGLM2-6B | 39.63 | 2.33 | 3.77 | 5.97 | 6.13 | 2.83 | 3.83 | 2.6 | 3.8 | 4.0 | 4.37 |
| Llama-3.1-8B | 38.63 | 3.87 | 4.20 | 4.27 | 4.17 | 3.50 | 3.83 | 4.30 | 3.17 | 3.20 | 4.13 |
| Ziya-v1.1-13B | 37.63 | 2.77 | 3.97 | 5.17 | 5.33 | 2.8 | 3.77 | 2.53 | 3.7 | 3.03 | 4.57 |
| InternLM-Chat-7B | 36.27 | 2.63 | 3.67 | 4.87 | 5.57 | 3.17 | 3.33 | 2.33 | 4.03 | 3.13 | 3.53 |
| Linly-Chinese-Llama-2-13B-HF | 34.7 | 2.2 | 3.77 | 4.5 | 5.0 | 2.43 | 3.33 | 2.53 | 3.9 | 2.5 | 4.53 |
| Phi-3-Medium-128K-Instruct | 34.60 | 2.27 | 4.17 | 3.70 | 4.23 | 2.87 | 4.50 | 3.57 | 3.20 | 2.27 | 3.83 |
| BELLE-Llama2-13B-Chat-0.4M | 33.97 | 2.57 | 3.07 | 4.93 | 4.73 | 2.83 | 3.8 | 2.43 | 3.33 | 2.4 | 3.87 |
| Llama-2-7B-Chat-HF | 23.63 | 1.53 | 3.43 | 3.0 | 3.73 | 1.73 | 2.43 | 1.97 | 2.17 | 0.8 | 2.83 |

Note: Discipline scores are on a 10-point scale.

The performance distribution over time for the currently ranked models is shown in the figure below:

Model Performance Trends

For more experimental details and analysis, please refer to our paper.

📞 Contact Us

This project is open to the public, and we welcome you to participate in our evaluation.

Institutional evaluation requires certification. After registering an account, please contact the administrators for verification and to apply for evaluation permissions.

Unless there are special circumstances, all evaluation results will be added to the leaderboard upon completion.


LLMEval-Fair | Building the Future of LLM Evaluation
