2025.acl-long.814@ACL

Total: 1

#1 Sheep’s Skin, Wolf’s Deeds: Are LLMs Ready for Metaphorical Implicit Hate Speech? [PDF3] [Copy] [Kimi4] [REL]

Authors: Jingjie Zeng, Liang Yang, Zekun Wang, Yuanyuan Sun, Hongfei Lin

Implicit hate speech has become a significant challenge for online platforms, as it often avoids detection by large language models (LLMs) due to its indirectly expressed hateful intent. This study identifies the limitations of LLMs in detecting implicit hate speech, particularly when disguised as seemingly harmless expressions in a rhetorical device. To address this challenge, we employ a Jailbreaking strategy and Energy-based Constrained Decoding techniques, and design a small model for measuring the energy of metaphorical rhetoric. This approach can lead to LLMs generating metaphorical implicit hate speech. Our research reveals that advanced LLMs, like GPT-4o, frequently misinterpret metaphorical implicit hate speech, and fail to prevent its propagation effectively. Even specialized models, like ShieldGemma and LlamaGuard, demonstrate inadequacies in blocking such content, often misclassifying it as harmless speech. This work points out the vulnerability of current LLMs to implicit hate speech, and emphasizes the improvements to address hate speech threats better.

Subject: ACL.2025 - Long Papers