Uncovering Gaps in How Humans and LLMs Interpret Subjective Language

Authors: Erik Jones, Arjun Patrawala, Jacob Steinhardt

Humans often rely on subjective natural language to direct large language models (LLMs); for example, users might instruct the LLM to write an *enthusiastic* blog post, while developers might train models to be *helpful* and *harmless* using LLM-based edits. The LLM's *operational semantics* for such subjective phrases, i.e., how it adjusts its behavior when each phrase is included in the prompt, thus dictates how well it is aligned with human intent. In this work, we uncover instances of *misalignment* between LLMs' actual operational semantics and what humans expect. Our method, TED (thesaurus error detector), first constructs a thesaurus that captures whether two phrases have similar operational semantics according to the LLM. It then elicits failures by unearthing disagreements between this thesaurus and a reference semantic thesaurus. TED routinely produces surprising instances of misalignment; for example, Mistral 7B Instruct produces more *harassing* outputs when it edits text to be *witty*, and Llama 3 8B Instruct produces *dishonest* articles when instructed to make them *enthusiastic*. Our results demonstrate that we can uncover unexpected LLM behavior by characterizing relationships between abstract concepts, rather than by supervising individual outputs directly.
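To make the two-step procedure concrete, here is a minimal Python sketch of the disagreement-mining step, under stated assumptions: the function name `ted_sketch`, the two similarity callables, the threshold, and the toy scores are all hypothetical illustrations, not the paper's implementation, and the actual construction of the operational thesaurus from model behavior is abstracted behind `operational_similarity`.

```python
from itertools import combinations

def ted_sketch(phrases, operational_similarity, reference_similarity, threshold=0.5):
    """Flag phrase pairs where the LLM's operational thesaurus disagrees
    with a reference semantic thesaurus (hypothetical interface).

    operational_similarity(a, b) -> float in [0, 1]: how similarly the LLM
        behaves when phrase `a` vs. `b` is included in the prompt (e.g.,
        estimated by comparing outputs on a shared set of inputs).
    reference_similarity(a, b) -> float in [0, 1]: how similar the phrases
        are according to a human reference (e.g., a curated thesaurus).
    """
    mismatches = []
    for a, b in combinations(phrases, 2):
        op_sim = operational_similarity(a, b)
        ref_sim = reference_similarity(a, b)
        # Disagreement in either direction is a candidate failure: the LLM
        # treats semantically unrelated phrases as interchangeable (op high,
        # ref low), or distinguishes phrases humans consider synonymous
        # (op low, ref high).
        if abs(op_sim - ref_sim) > threshold:
            mismatches.append((a, b, op_sim, ref_sim))
    # Surface the largest disagreements first.
    mismatches.sort(key=lambda m: abs(m[2] - m[3]), reverse=True)
    return mismatches

# Toy stand-ins: in practice these tables would come from querying an LLM
# and a human reference, respectively. The scores below are invented.
op = {("witty", "harassing"): 0.8, ("witty", "enthusiastic"): 0.4,
      ("enthusiastic", "harassing"): 0.1}
ref = {("witty", "harassing"): 0.05, ("witty", "enthusiastic"): 0.3,
       ("enthusiastic", "harassing"): 0.0}

def lookup(table):
    # Symmetric lookup: similarity is order-independent.
    return lambda a, b: table.get((a, b), table.get((b, a), 0.0))

for a, b, o, r in ted_sketch(["witty", "enthusiastic", "harassing"],
                             lookup(op), lookup(ref)):
    print(f"{a!r} vs {b!r}: LLM similarity {o:.2f}, reference {r:.2f}")
```

On this toy data the sketch flags only the *witty*/*harassing* pair, the op-high/ref-low pattern that presumably underlies the surprising failures reported above, where a phrase's operational effect drifts toward a concept humans consider unrelated.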

Subject: ICLR.2025 - Spotlight