41072@AAAI


EssayBench: Evaluating Large Language Models in Multi-Genre Chinese Essay Writing

Authors: Fan Gao, Dongyuan Li, Ding Xia, Fei Mi, Yasheng Wang, Lifeng Shang, Baojun Wang

Prompt-based essay writing is an effective and common way to assess students' critical thinking skills. Recent work has evaluated the impressive capabilities of Large Language Models (LLMs) on this task. However, most studies focus primarily on English, and those examining LLMs' performance in Chinese often rely on coarse-grained text quality metrics, overlooking the structural and rhetorical complexities of Chinese essays, particularly across diverse genres. We therefore propose EssayBench, a multi-genre benchmark specifically designed for Chinese essay writing, along with a fine-grained, genre-specific scoring framework that hierarchically aggregates scores to better align with human preferences. The dataset comprises 728 real-world prompts across four major genres (Argumentative, Narrative, Descriptive, and Expository), covering both Open-Ended and Constrained prompt types. Our evaluation protocol is validated through a comprehensive human agreement study. The results show that our protocol aligns well with human judgments, achieving a Spearman's correlation of up to 0.816 and outperforming coarse-grained evaluation methods by an average of 8.6%. Finally, we benchmark 15 LLMs, analyzing their strengths and limitations across genres and instruction types. We believe EssayBench offers a more reliable framework for evaluating Chinese essay generation and provides valuable insights for improving LLMs in this domain.

Subject: AAAI.2026 - Special Track on AI Alignment