Despite advances in large language models (LLMs) on reasoning and instruction-following benchmarks, it remains unclear whether they can reliably produce outputs aligned with a broad variety of user goals, a concept called steerability. We highlight two gaps in current LLM evaluations for assessing steerability. First, many benchmarks are built from past LLM chats and text scraped from the Internet, which may skew toward common requests and underrepresent less common requests from potential users. Second, prior work measures performance as a single scalar, which can conceal behavioral shifts in LLM outputs during open-ended generation. To address these gaps, we introduce a framework based on a multi-dimensional goal space that models user goals and LLM outputs as vectors whose dimensions correspond to text attributes (e.g., reading difficulty). Applied to a text-rewriting task, this framework reveals that current LLMs induce unintended changes, or "side effects," to text attributes, impeding steerability. Interventions to improve steerability, such as prompt engineering, best-of-N sampling, and reinforcement learning fine-tuning, vary in effectiveness, and side effects remain problematic under all of them. Our findings suggest that even strong LLMs struggle with steerability, and that existing alignment strategies may be insufficient.
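As a minimal sketch of the goal-space framing described above (not the paper's implementation), one can score a source text, the user's goal, and the model's rewrite along a few attribute dimensions, then separate error on the targeted dimension from drift, i.e., side effects, on the non-targeted ones. The attribute names, scores, and the `steering_report` helper below are hypothetical placeholders.

```python
import numpy as np

# Hypothetical attribute dimensions of the goal space (illustrative only).
ATTRIBUTES = ["reading_difficulty", "formality", "length"]

def steering_report(source, goal, output, targeted):
    """Compare an LLM output against a user goal in the goal space.

    source, goal, output: attribute-score vectors, one entry per ATTRIBUTES dim.
    targeted: names of the dimensions the user asked to change.
    """
    report = {}
    for i, name in enumerate(ATTRIBUTES):
        if name in targeted:
            # Miss along an intended dimension: steering error.
            report[name] = {"role": "target", "error": abs(goal[i] - output[i])}
        else:
            # Any movement along a non-targeted dimension is a side effect.
            report[name] = {"role": "side_effect", "drift": abs(output[i] - source[i])}
    return report

# Toy example: the user asks only to lower reading difficulty, but the
# rewrite also drifts in formality and length (side effects).
source = np.array([0.8, 0.5, 0.6])
goal = np.array([0.3, 0.5, 0.6])  # change reading_difficulty only
output = np.array([0.4, 0.7, 0.3])
print(steering_report(source, goal, output, targeted={"reading_difficulty"}))
```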