Large Language Models (LLMs) continue to exhibit vulnerabilities despite deliberate safety alignment efforts, posing significant risks to users and society. To safeguard against policy-violating content, system-level moderation via external guard models—designed to monitor LLM inputs and outputs and block potentially harmful content—has emerged as a prevalent mitigation strategy. Existing approaches to training guard models rely heavily on extensive, human-curated datasets and struggle with out-of-distribution threats, such as emerging harmful categories or jailbreak attacks. To address these limitations, we propose RSafe, an adaptive reasoning-based safeguard that conducts guided safety reasoning to provide robust protection within the scope of specified safety policies. RSafe operates in two stages: (1) guided reasoning, where it analyzes the safety risks of input content through policy-guided, step-by-step reasoning, and (2) reinforced alignment, where rule-based reinforcement learning (RL) optimizes its reasoning paths to align with accurate safety predictions. This two-stage training paradigm enables RSafe to internalize safety principles and generalize its protection capability to unseen or adversarial safety-violation scenarios. During inference, RSafe accepts user-specified safety policies to provide enhanced safeguards tailored to specific safety requirements. Experiments demonstrate that RSafe matches state-of-the-art guard models in both prompt- and response-level harmfulness detection while using only a limited amount of public data, and achieves superior out-of-distribution generalization on both emerging harmful categories and jailbreak attacks. Furthermore, RSafe provides human-readable explanations for its safety judgments, improving interpretability. RSafe offers a robust, adaptive, and interpretable solution for LLM safety moderation, advancing the development of reliable safeguards in dynamic real-world environments. Our code is available at https://anonymous.4open.science/r/RSafe-996D.
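As a rough illustration of the reinforced-alignment stage, the sketch below shows a rule-based reward of the kind the abstract alludes to: a reasoning rollout is rewarded only when it ends in a well-formed safety verdict that matches the ground-truth label. The verdict tag format, parsing logic, and reward values are illustrative assumptions, not the paper's exact specification.

```python
import re

# Hypothetical rule-based reward for the reinforced-alignment stage:
# the policy model's output is expected to contain step-by-step reasoning
# followed by a final verdict tag such as <verdict>unsafe</verdict>.
# Tag names and reward values are assumptions for illustration.

VERDICT_RE = re.compile(r"<verdict>\s*(safe|unsafe)\s*</verdict>", re.IGNORECASE)


def rule_based_reward(model_output: str, gold_label: str) -> float:
    """Return a scalar reward for one rollout.

    1.0  if the output contains a parseable verdict matching the gold label
    0.1  if a verdict is parseable but wrong (format credit only)
    0.0  if no verdict can be parsed at all
    """
    match = VERDICT_RE.search(model_output)
    if match is None:
        return 0.0
    predicted = match.group(1).lower()
    return 1.0 if predicted == gold_label.lower() else 0.1


if __name__ == "__main__":
    # Toy rollout: policy-guided reasoning ending in a verdict tag.
    rollout = (
        "Step 1: the request asks for instructions to build a weapon.\n"
        "Step 2: this falls under the prohibited 'violence' policy category.\n"
        "<verdict>unsafe</verdict>"
    )
    print(rule_based_reward(rollout, gold_label="unsafe"))  # 1.0
```

In this kind of setup, the scalar reward would be fed to a standard policy-gradient RL loop over sampled reasoning traces; the exact optimization algorithm and reward shaping used by RSafe are not specified in the abstract.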