Aligning to Illusions: Choice Blindness in Human and AI Feedback

Reinforcement Learning from Human Feedback (RLHF) assumes annotator preferences reflect stable internal states. We challenge this through three experiments spanning the preference pipeline. In a human choice blindness study, 91% of surreptitiously swapped preferences go undetected, extending choice blindness to third-person evaluative comparison of unfamiliar text. Testing fifteen LLM judges as potential replacements, we find detection relies on shallow text matching rather than genuine self-monitoring: removing prior reasoning from context causes blindness to surge from near-zero to over 50%, while explicit social pressure induces near-universal compliance. In a dose-response experiment across two architectures from 86M to 2B parameters, one-sixth to one-third of labels must be corrupted before the reward signal halves, yet standard pairwise accuracy remains virtually unchanged. A Best-of-N evaluation confirms this translates to downstream policy degradation: at 50% corruption, reward-guided selection produces no improvement over random sampling, while the proxy model reports monotonically increasing scores. Together, these results reveal a preference construction problem: the signal entering RLHF is shaped by elicitation context in ways that neither human metacognition, LLM self-monitoring, nor standard evaluation metrics can detect.
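The label-corruption and Best-of-N steps mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `corrupt_labels` and `best_of_n`, the uniform random flipping scheme, and the toy reward function are all assumptions made for illustration, and the abstract does not specify how corruption was actually applied.

```python
import random


def corrupt_labels(pairs, fraction, seed=0):
    """Swap (chosen, rejected) for a given fraction of preference pairs.

    Hypothetical helper: assumes corruption means uniformly random label flips.
    """
    rng = random.Random(seed)
    n_flip = round(fraction * len(pairs))
    flip_idx = set(rng.sample(range(len(pairs)), n_flip))
    return [
        (rejected, chosen) if i in flip_idx else (chosen, rejected)
        for i, (chosen, rejected) in enumerate(pairs)
    ]


def best_of_n(candidates, reward_fn):
    """Return the candidate that the (proxy) reward function scores highest."""
    return max(candidates, key=reward_fn)


# Toy data: each pair is (chosen, rejected).
pairs = [(f"good_{i}", f"bad_{i}") for i in range(100)]
corrupted = corrupt_labels(pairs, fraction=0.5)
n_flipped = sum(1 for chosen, _ in corrupted if chosen.startswith("bad_"))

# Toy proxy reward: response length stands in for a learned reward model.
selected = best_of_n(["a", "bbb", "cc"], reward_fn=len)
```

In a real pipeline, a reward model would be trained on `corrupted` at each corruption level, and `best_of_n` would be scored against a trusted reference reward rather than the proxy alone, which is the comparison the abstract describes.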

Subjects: Computation and Language, Artificial Intelligence

Published: 2026-03-09 14:10:36 UTC

arXiv: 2603.08412
