2507.15581

Total: 1

#1 Metric assessment protocol in the context of answer fluctuation on MCQ tasks [PDF1] [Copy] [Kimi2] [REL]

Authors: Ekaterina Goliakova, Xavier Renard, Marie-Jeanne Lesot, Thibault Laugel, Christophe Marsala, Marcin Detyniecki

Using multiple-choice questions (MCQs) has become a standard for assessing LLM capabilities efficiently. A variety of metrics can be employed for this task. However, previous research has not conducted a thorough assessment of them. At the same time, MCQ evaluation suffers from answer fluctuation: models produce different results given slight changes in prompts. We suggest a metric assessment protocol in which evaluation methodologies are analyzed through their connection with fluctuation rates, as well as original performance. Our results show that there is a strong link between existing metrics and the answer changing, even when computed without any additional prompt variants. A novel metric, worst accuracy, demonstrates the highest association on the protocol.

Subject: Artificial Intelligence

Publish: 2025-07-21 13:01:46 UTC