OffQ: Taming Structured Outliers in LLM Quantization by Offsetting


Authors: Haoqi Wang, Lorenz K. Mueller, Jiawei Zhuang, Mathieu Salzmann, Lukas Cavigelli

Low-bit quantization has been widely adopted to accelerate the inference of large language models (LLMs) by significantly reducing computational cost and memory usage. However, activation outliers pose a major challenge to effective quantization and often cause notable performance degradation. In this paper, we introduce OffQ, a method that mitigates activation outliers in low-bit quantization through a novel offsetting mechanism. Specifically, OffQ first identifies a low-dimensional outlier subspace in the activations using a proposed top-1 PCA, and then concentrates the high-magnitude activations into a single channel via rotation. OffQ then absorbs this concentrated outlier channel by converting its magnitude into a shared offset, thereby reducing the standard deviation of the activations. This offsetting strategy enables effective W4A4KV4 quantization of LLMs with deployment-friendly uniform-grid, uniform-precision quantization. Extensive experiments across diverse LLM architectures and benchmarks show that OffQ outperforms state-of-the-art baselines, consistently improving model accuracy while preserving low-bit efficiency.
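The abstract only outlines the mechanism, so the following is a minimal pure-Python sketch of one possible reading of it, not the authors' implementation. Assumptions (ours, not from the paper): "top-1 PCA" is taken to be the dominant eigenvector of the uncentered second moment X^T X, found by power iteration; the rotation is a Householder reflection that maps this direction onto channel 0; and the "shared offset" is produced by a second reflection that maps channel 0 onto the uniform vector 1/sqrt(d), so that every token row becomes (x·v)/sqrt(d) in every channel plus a zero-mean residual. A per-token asymmetric 4-bit quantizer then absorbs that offset in its zero point, and the row's standard deviation drops to that of the residual. The synthetic data, the outlier pattern, and all helper names (`top1_direction`, `offset_transform`, `fake_quant_asym`, and so on) are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def unit(v):
    n = math.sqrt(dot(v, v))
    return [vi / n for vi in v]

def top1_direction(X, iters=50):
    """Dominant eigenvector of the uncentered second moment X^T X (power iteration)."""
    d = len(X[0])
    v = unit([1.0] * d)                       # fixed start vector, fine for this demo
    for _ in range(iters):
        proj = [dot(x, v) for x in X]         # X v
        v = unit([sum(p * x[j] for p, x in zip(proj, X)) for j in range(d)])  # X^T X v
    return v

def householder_vector(a, b):
    """w such that the reflection I - 2 w w^T / (w^T w) maps unit vector a onto unit vector b."""
    return [ai - bi for ai, bi in zip(a, b)]

def reflect(x, w):
    """Apply the symmetric, orthogonal Householder reflection defined by w to x."""
    ww = dot(w, w)
    if ww < 1e-12:                            # a == b: the reflection is the identity
        return list(x)
    c = 2.0 * dot(x, w) / ww
    return [xi - c * wi for xi, wi in zip(x, w)]

def fake_quant_asym(x, bits=4):
    """Per-token asymmetric uniform (min/max) quantization, returned dequantized."""
    lo, hi = min(x), max(x)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + scale * min(levels, max(0, round((xi - lo) / scale))) for xi in x]

def mse(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

def pstd(x):
    m = sum(x) / len(x)
    return math.sqrt(sum((xi - m) ** 2 for xi in x) / len(x))

# Synthetic activations: Gaussian noise plus a hypothetical structured outlier that
# always hits channels 3 and 7 with a shared per-token magnitude (a rank-1 pattern).
random.seed(0)
n, d = 64, 16
pattern = [0.0] * d
pattern[3], pattern[7] = 1.0, -0.5
X = []
for _ in range(n):
    a = 40.0 + 10.0 * random.random()
    X.append([random.gauss(0.0, 1.0) + a * p for p in pattern])

v = top1_direction(X)                          # estimated outlier direction
e0 = [1.0] + [0.0] * (d - 1)
u = [1.0 / math.sqrt(d)] * d
w1 = householder_vector(v, e0)                 # step 1: outlier direction -> channel 0
w2 = householder_vector(e0, u)                 # step 2 (assumed): channel 0 -> uniform offset

def offset_transform(x):
    return reflect(reflect(x, w1), w2)

def inverse_transform(y):
    return reflect(reflect(y, w2), w1)         # each reflection is its own inverse

base_std = sum(pstd(x) for x in X) / n
off_std = sum(pstd(offset_transform(x)) for x in X) / n
base_err = sum(mse(x, fake_quant_asym(x)) for x in X) / n
off_err = sum(mse(x, inverse_transform(fake_quant_asym(offset_transform(x)))) for x in X) / n
print(f"mean per-token std: {base_std:.2f} -> {off_std:.2f}")
print(f"4-bit per-token MSE: {base_err:.4f} -> {off_err:.4f}")
```

Because both reflections are orthogonal, in a real network their product Q could be folded into the adjacent weight matrix, as rotation-based quantization schemes do, so that x W = (x Q)(Q^T W) leaves the layer output unchanged. The sketch leaves out weights and the KV cache (the W4 and KV4 parts of W4A4KV4) and only illustrates the activation side.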

Subjects: Machine Learning, Artificial Intelligence, Computation and Language

Published: 2026-06-05 10:11:34 UTC

arXiv: 2606.07116
