SafetyQuizzer: Timely and Dynamic Evaluation on the Safety of LLMs

#1 SafetyQuizzer: Timely and Dynamic Evaluation on the Safety of LLMs [PDF] [Copy] [Kimi] [REL]

Authors: Zhichao Shi, Shaoling Jing, Yi Cheng, Hao Zhang, Yuanzhuo Wang, Jie Zhang, Huawei Shen, Xueqi Cheng

With the expansion of the application of Large Language Models (LLMs), concerns about their safety have grown among researchers. Numerous studies have demonstrated the potential risks of LLMs generating harmful content and have proposed various safety assessment benchmarks to evaluate these risks. However, the evaluation questions in current benchmarks, especially for Chinese, are too straightforward, making them easily rejected by target LLMs, and difficult to update with practical relevance due to their lack of correlation with real-world events. This hinders the effective application of these benchmarks in continuous evaluation tasks. To address these limitations, we propose SafetyQuizzer, a question-generation framework designed to evaluate the safety of LLMs more sustainably in the Chinese context. SafetyQuizzer leverages a finetuned LLM and jailbreaking attack templates to generate subtly offensive questions, which reduces the decline rate. Additionally, by utilizing retrieval-augmented generation, SafetyQuizzer incorporates the latest real-world events into evaluation questions, improving the adaptability of the benchmarks. Our experiments demonstrate that evaluation questions generated by SafetyQuizzer significantly reduce the decline rate compared to other benchmarks while maintaining a comparable attack success rate. Our code is available at https://github.com/zhichao-stone/SafetyQuizzer. Warning: this paper contains examples that may be offensive or upsetting.

Subject: NAACL.2025 - Long Papers

2025.naacl-long.85@ACL

#1 SafetyQuizzer: Timely and Dynamic Evaluation on the Safety of LLMs [PDF] [Copy] [Kimi] [REL]