Huxiu (虎嗅)

For this year's college entrance examination, I had 12 top AI models sit the Chinese and mathematics tests together, and the results were somewhat unexpected.

Original title (Chinese): 今年高考,我让12个顶级AI一起考了语文和数学,结果有点意外。

Summary of Key Points

During the 2026 college entrance examination season, the author had 12 major domestic and international large language models (including GPT-5.5, Claude Opus 4.8, and Xiaomi MiMo v2.5 Pro) take Chinese and mathematics tests modeled on the official exam format. To keep the comparison fair, the author used a unified API for all model interactions, standardized on plain-text LaTeX, and had experienced teachers score the responses anonymously. The total scores of the top models were extremely close: the top 9 models differed by only 2 points, and MiMo and Kimi, the champion and runner-up, were separated by a mere 0.01 point. Some models were lopsided: DeepSeek was strong in mathematics but weaker in Chinese, while GLM5.1 performed well in Chinese but slightly less so in mathematics. In the essay portion, teachers emphasized structure, clarity of argumentation, and relevance to current events. Accuracy on the mathematics questions was high overall, except for the last fill-in-the-blank question, which every model answered incorrectly. Comparing AI performance across the four years of testing (2023–2026) highlights how rapidly AI has improved in foundational academic subjects.

I. Fairness on par with the College Entrance Examination: No Model Gets an Advantage

To ensure that the AI models competed fairly like real students, the author implemented several anti-cheating measures:

  • Unified testing rules: All models were called through the same kind of API, and tool use such as searching for answers or writing code to solve problems was prohibited. Apart from iFlytek and Baidu, the other 10 models were accessed through the OpenRouter platform to avoid discrepancies caused by different interfaces.
  • Standardized answer format: Both Chinese and mathematics questions were input in LaTeX, which created a consistent format for scoring. Scripts were also used to verify the accuracy of the LaTeX conversion to prevent any errors in question content.
  • Anonymous grading: Teachers did not see the names of the models, only codes (e.g., Test Paper A, Test Paper B), to avoid bias based on brand preferences. For subjective questions, the score was the average of the marks given by three Chinese-language teachers, to minimize individual bias.
  • Flexible scoring for fill-in-the-blank questions: Different notations for numbers (e.g., 1/2 vs. 0.5) were accepted as long as the value was correct, without strict adherence to a particular format.
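
The flexible fill-in-the-blank rule can be illustrated with a minimal sketch. This is not the author's grading script; the function names `parse_value` and `answers_match` are hypothetical, and the sketch only covers plain numeric answers such as `1/2` or `0.5`, falling back to a literal comparison for anything else.

```python
from fractions import Fraction


def parse_value(answer: str) -> Fraction:
    """Parse a plain numeric answer such as '1/2' or '0.5' into an exact Fraction."""
    # Remove spaces so that '1 / 2' is read the same way as '1/2'.
    return Fraction(answer.replace(" ", ""))


def answers_match(submitted: str, reference: str) -> bool:
    """Return True when both answers denote the same value, whatever the notation."""
    try:
        return parse_value(submitted) == parse_value(reference)
    except (ValueError, ZeroDivisionError):
        # Non-numeric answers (hypothetical fallback): compare the trimmed text literally.
        return submitted.strip() == reference.strip()


if __name__ == "__main__":
    print(answers_match("1/2", "0.5"))
    print(answers_match("1/3", "0.33"))
```

Exact fractions are used instead of floating-point numbers so that `1/2` and `0.5` compare as equal while a rounded approximation such as `0.33` is not mistaken for `1/3`.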

These measures put all AI models on an equal footing, just as real students all use the same 2B pencils and sealed test papers in the college entrance examination.
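
The anonymous-grading step described above might be sketched as follows. This is an illustration under stated assumptions, not the author's pipeline: `assign_codes` and `average_scores` are hypothetical names, the model names are placeholders, and the marks in the usage example are made up purely to show the arithmetic.

```python
import random
import string
from statistics import mean


def assign_codes(model_names, seed=None):
    """Shuffle the models and label them 'Test Paper A', 'Test Paper B', and so on."""
    if len(model_names) > len(string.ascii_uppercase):
        raise ValueError("this sketch supports at most 26 papers")
    rng = random.Random(seed)
    shuffled = list(model_names)
    rng.shuffle(shuffled)  # random order hides which model gets which letter
    return {
        f"Test Paper {letter}": name
        for letter, name in zip(string.ascii_uppercase, shuffled)
    }


def average_scores(marks_by_code):
    """Average the marks that several graders gave to each anonymized paper."""
    return {code: round(mean(marks), 2) for code, marks in marks_by_code.items()}


if __name__ == "__main__":
    # Placeholder model names; the mapping stays with the organizer, not the graders.
    codes = assign_codes(["model-1", "model-2", "model-3"], seed=2026)
    # Made-up essay marks from three graders, keyed by the anonymous code.
    marks = {code: [48, 50, 52] for code in codes}
    print(average_scores(marks))
```

Graders would only ever see the keys of `codes`; the mapping back to model names is applied after the averaged marks are final.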

II. Results: The Gap Between Top Models Is Imperceptible

The most surprising outcome was the minimal difference in scores among the top models:

  • MiMo and Kimi, the champion and runner-up, were separated by only 0.01 point (MiMo scored lower in Chinese and higher in mathematics).
  • The models ranked 3rd to 9th (including Claude Opus, GLM5.1, and Gemini) were also within 2 points of one another.

This indicates that today’s top AI models are nearly indistinguishable in their foundational abilities in Chinese and mathematics, much like the top students in a class, where a tiny difference in score can decide the ranking.

III. AI Models Also Have Strengths and Weaknesses

Like human students, AI models exhibit areas of expertise and weakness:

  • Chinese proficiency: GLM5.1 and Gemini 3.1 Pro led the rankings in Chinese but were weaker in mathematics.
  • Mathematics prowess: DeepSeek V4 Pro, MiMo, and Wenxin Ernie 5.1 all performed well in mathematics, but DeepSeek’s Chinese scores were the lowest (mainly due to poor essay performance).
  • Balanced performers: MiMo and Kimi had high overall scores because they were strong in both subjects without any significant weaknesses.

These differences may be related to the training focus of the models; for instance, DeepSeek might be more focused on mathematical reasoning, whereas GLM5.1 may have invested more resources in language understanding, just as some students excel in certain academic areas.

IV. Essay Evaluation: Teachers Focus on Structure and Relevance

The evaluation of Chinese essays revealed common issues with AI-generated responses:

  • Problems encountered: Unclear writing styles, disorganized structures, vague arguments, insufficient examples, and a lack of relevance to current events.
  • Case Study: Although GLM5.1’s essay scored the highest, it was criticized for its unclear structure. DeepSeek’s essay received low scores due to similar issues.

This suggests that AI models have not yet fully grasped the criteria for a successful college entrance examination essay. Such essays reward adherence to specific conventions rather than creativity: clear structure, clear arguments, and relevance to contemporary themes.

V. Progress Over Four Years

The improvement in AI performance is evident when comparing tests from 2023 to 2026:

  • 2023: Only GPT-4 was capable of writing an essay, with few domestic models participating.
  • 2024: Domestic models began to show progress but often made mistakes (e.g., answering questions irrelevantly).
  • 2025: Some models reached a level comparable to that of top students in mathematics.
  • 2026: The total scores of the top models were very close, and the testing method evolved from manual copying to automated scripts combined with professional grading platforms.

This four-year progression reflects not only the improvement in AI capabilities but also the author’s increasingly rigorous approach, which now resembles scientific research, a fitting standard given the great importance of the college entrance examination in China.

The author notes that the results are for entertainment purposes only, but they suggest that AI models are rapidly closing the gap with human performance in foundational academic areas and may soon be capable of taking on complex tasks. However, the models’ uneven strengths and weaknesses (such as subject biases and essay shortcomings) indicate that AI still has a way to go before fully understanding and replicating human thinking and expression.