PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts

Authors: Yiming Wang, Pei Zhang, Jialong Tang, Haoran Wei, Baosong Yang, Rui Wang, Chenshu Sun, Feitong Sun, Jiran Zhang, Junxuan Wu, Qiqian Cang, Yichang Zhang, Fei Huang, Junyang Lin, Fei Huang, Jingren Zhou

In this paper, we introduce PolyMath, a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation, making it a highly discriminative multilingual mathematical benchmark in the era of reasoning LLMs. We conduct a comprehensive evaluation of advanced LLMs and find that even Deepseek-R1-671B and Qwen-QwQ-32B achieve benchmark scores of only 43.4 and 41.8, respectively, with less than 30% accuracy at the highest difficulty level. From a language perspective, our benchmark reveals several key challenges of LLMs in multilingual reasoning: (1) reasoning performance varies widely across languages for current LLMs; (2) input-output language consistency is low in reasoning LLMs and may be correlated with performance; (3) thinking length differs significantly by language for current LLMs. Additionally, we demonstrate that controlling the output language in the instructions can affect reasoning performance, especially for some low-resource languages, suggesting a promising direction for improving multilingual capabilities in LLMs.

Subject: Computation and Language

Published: 2025-04-25 15:39:04 UTC

arXiv: 2504.18428
