Empirical Evidence for Alignment Faking in a Small LLM and Prompt-Based Mitigation Techniques

Current literature suggests that alignment faking (deceptive alignment) is an emergent property of large language models. We present the first empirical evidence that a small instruction-tuned model, specifically LLaMA 3 8B, can exhibit alignment faking. We further show that prompt-only interventions, including deontological moral framing and scratchpad reasoning, significantly reduce this behavior without modifying model internals. This challenges the assumption that prompt-based ethics are trivial and that deceptive alignment requires scale. We introduce a taxonomy distinguishing shallow deception, shaped by context and suppressible through prompting, from deep deception, which reflects persistent, goal-driven misalignment. Our findings refine the understanding of deception in language models and underscore the need for alignment evaluations across model sizes and deployment settings.

Subjects: Computation and Language, Artificial Intelligence, Computers and Society

Published: 2025-06-17 10:59:51 UTC

2506.21584
