
ACID Test: A Benchmark for Cultural Safety and Alignment in LALMs

Authors: Bikash Dutta, Adit Jain, Rishabh Ranjan, Mayank Vatsa, Richa Singh

Large Audio Language Models (LALMs) are transforming AI by processing and generating human language directly from audio. As these models proliferate in real-world applications, it becomes critical to evaluate their performance to ensure equitable and safe use across diverse linguistic and cultural contexts. We present the first comprehensive study of cultural bias in LALMs, extending text-based harm frameworks to the audio modality to analyze how linguistic diversity influences model behavior and to uncover challenges in interpreting audio nuances. To support this analysis, we introduce the Audio Cultural Intelligence Dataset (ACID), a multilingual audio–text benchmark spanning 1,315 hours across diverse languages and cultural contexts, and we conduct a systematic evaluation of ten open-source and two closed-source models. Our results reveal substantial performance disparities across languages and cultural settings and show that biases manifest distinctly when models process audio inputs. These findings highlight the need to evaluate LALMs not only for technical accuracy but also for fair and culturally sensitive behavior, motivating the development of inclusive datasets and culturally aware training practices for safer and more equitable audio language models.

Subject: AAAI.2026 - Special Track on AI Alignment