Large Language Models (LLMs) have gained significant attention for their exceptional performance across various domains. Despite these advancements, concerns persist about their implicit bias, which often leads to negative social impacts. It is therefore essential to identify implicit bias in LLMs and investigate the potential threats it poses. Our study focused on a specific type of implicit bias, termed the "Yes-No" implicit bias, which refers to an LLM's inherent tendency to favor "Yes" or "No" responses to a given instruction. By comparing the probabilities of an LLM generating a series of "Yes" versus "No" responses, we observed that LLMs exhibit different inherent response tendencies when faced with different instructions. To further investigate the impact of this bias, we developed an attack method called Implicit Bias In-Context Manipulation, which attempts to manipulate LLMs' behavior. Specifically, we explored whether the "Yes" implicit bias could flip "No" responses to "Yes" when LLMs answer malicious instructions, leading to harmful outputs. Our findings revealed that the "Yes" implicit bias poses a significant security threat, comparable to that of carefully designed attack methods. Moreover, we offered a comprehensive analysis from multiple perspectives to deepen the understanding of this threat, emphasizing the need for ongoing improvement in LLM security.
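To make the probability comparison concrete, the sketch below shows one plausible way to measure a model's "Yes" versus "No" tendency for a single instruction: compare the log-probabilities the model assigns to "Yes" and "No" as the first token of its answer. This is an illustration under stated assumptions, not the paper's exact protocol; the model name, prompt template, and helper function are placeholders chosen for the example.

```python
# Minimal sketch (assumed setup, not the authors' implementation): estimate an
# LLM's "Yes"/"No" response tendency for one instruction by comparing the
# log-probabilities assigned to "Yes" and "No" as the first answer token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM with a compatible tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def yes_no_tendency(instruction: str) -> dict:
    # Frame the instruction so that the next token is the model's answer.
    prompt = f"Instruction: {instruction}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    log_probs = torch.log_softmax(next_token_logits, dim=-1)

    # Leading-space variants match how GPT-2-style vocabularies tokenize a
    # word that follows "Answer:"; take the first sub-token of each word.
    yes_id = tokenizer.encode(" Yes")[0]
    no_id = tokenizer.encode(" No")[0]
    return {
        "log_p_yes": log_probs[yes_id].item(),
        "log_p_no": log_probs[no_id].item(),
    }

print(yes_no_tendency("Is it acceptable to share this dataset publicly?"))
```

A positive gap between `log_p_yes` and `log_p_no` would indicate a "Yes" tendency for that instruction; repeating the measurement across many instructions is one way to surface the differing inherent tendencies described above.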