Do Safety Monitors Stay Reliable After an Update? Benchmarking and Predicting Activation-Monitor Staleness

#1 Do Safety Monitors Stay Reliable After an Update? Benchmarking and Predicting Activation-Monitor Staleness [PDF] [Copy] [Kimi] [REL]

Activation monitors-lightweight probes trained on a language model's internal representations-are an increasingly common layer in deployment safety stacks. Deployed models however are rarely static: they are quantized, fine-tuned, adapted with LoRA, or served with merged adapters while the monitor remains frozen. We present the first systematic test of whether this implicit contract holds: whether activation monitors trained on a base model remain reliable after these routine model updates. Across multiple safety-relevant monitors, model depths, update families, and open-weight models, we find a sharp split: quantization-style updates largely preserve frozen probe performance, while fine-tuning-style updates frequently make probes stale. Fragility is highly monitor-dependent, with privacy/PII probes most affected and refusal-compliance probes comparatively stable, showing that retraining a behavior need not stale its corresponding monitor. QLoRA is especially damaging despite NF4 quantization alone being relatively benign, suggesting that quantization becomes riskier when combined with adaptation. We further show that degradation is predictable from pre-deployment features, enabling revalidation budgets to be triaged toward the monitors most likely to fail. These results suggest that fine-tuning should trigger activation-monitor revalidation by default, while prediction can help prioritize which monitors to check first.

Subjects: Machine Learning , Artificial Intelligence , Computation and Language

Publish: 2026-06-14 19:07:22 UTC

2606.15980

#1 Do Safety Monitors Stay Reliable After an Update? Benchmarking and Predicting Activation-Monitor Staleness [PDF] [Copy] [Kimi] [REL]