2025.naacl-industry.40@ACL


#1 Evaluating Large Language Models with Enterprise Benchmarks

Authors: Bing Zhang, Mikio Takeuchi, Ryo Kawahara, Shubhi Asthana, Md. Maruf Hossain, Guang-Jie Ren, Kate Soule, Yifan Mai, Yada Zhu

The advancement of large language models (LLMs) has made rigorous and systematic evaluation of the complex tasks they perform increasingly challenging, especially in enterprise applications. LLMs therefore need to be benchmarked on enterprise datasets across a variety of NLP tasks. This work explores benchmarking strategies for LLM evaluation, with a specific emphasis on both English and Japanese. The proposed evaluation framework encompasses 25 publicly available domain-specific English benchmarks from diverse enterprise domains such as financial services, legal, climate, and cyber security, as well as 2 public Japanese finance benchmarks. The diverse performance of 8 models across different enterprise tasks highlights the importance of selecting the right model based on the specific requirements of each task. Code and prompts are available on GitHub.
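To illustrate the kind of per-task scoring such a benchmarking framework performs, the sketch below evaluates a model on one domain task by normalized exact match. It is a minimal, hypothetical example: the function names (`exact_match_accuracy`, `query_model`), the toy financial-QA item, and the metric choice are assumptions for illustration and do not reproduce the paper's actual harness, prompts, or metrics.

```python
# Hypothetical sketch of a per-task evaluation loop; the paper's actual
# benchmark harness, tasks, and metrics are not reproduced here.
from typing import Callable, List, Tuple

def exact_match_accuracy(
    examples: List[Tuple[str, str]],      # (prompt, reference answer) pairs
    query_model: Callable[[str], str],    # wrapper around any LLM API (placeholder)
) -> float:
    """Score a model on one benchmark task by normalized exact match."""
    correct = 0
    for prompt, reference in examples:
        prediction = query_model(prompt)
        if prediction.strip().lower() == reference.strip().lower():
            correct += 1
    return correct / len(examples) if examples else 0.0

if __name__ == "__main__":
    # Toy financial-QA style item; real enterprise benchmarks supply many
    # such items per domain (finance, legal, climate, cyber security, ...).
    task = [("Q: What does EPS stand for?\nA:", "earnings per share")]
    dummy_model = lambda prompt: "Earnings per share"  # stand-in for a real LLM call
    print(f"accuracy = {exact_match_accuracy(task, dummy_model):.2f}")
```

Running the same loop over each benchmark and each of the evaluated models yields the per-task scores from which model selection decisions can be made.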

Subject: NAACL.2025 - Industry Track