Paved with True Intents: Intent-Aware Training Improves LLM Safety Classification Across Training Regimes


Authors: Jeremias Ferrao, Niclas Müller-Hof, Iustin Sîrbu, Traian Rebedea, Yftah Ziser

We argue that safety classifiers should model user intent as an explicit signal between the prompt and the final label. To study this, we introduce AIMS, a human-annotated dataset of 1,724 difficult safety prompts, each paired with an intent description and a harm label. We use AIMS to evaluate intent-aware training across supervised fine-tuning, preference learning, reasoning distillation, and reinforcement learning. Despite its small size, AIMS enables competitive safety classifiers across training regimes: DPO on model-generated intent errors improves over SFT, and intent-conditioned distillation outperforms reasoning-only distillation in most teacher-student pairs. Most notably, directly rewarding intent faithfulness with GRPO yields the strongest average performance across five external safety benchmarks, and our intent-aware models form the Pareto frontier of inference latency versus F1. These results show that faithful intent modeling is a compact, high-quality supervision signal for more robust safety classifiers.

Subject: Computation and Language

Publish: 2026-06-25 16:03:57 UTC

2606.27210
