2025.naacl-long.543@ACL

HIGGS: Pushing the Limits of Large Language Model Quantization via the Linearity Theorem

Authors: Vladimir Malinovskii, Andrei Panferov, Ivan Ilin, Han Guo, Peter Richtárik, Dan Alistarh

Quantizing large language models has become a standard way to reduce their memory and computational costs. Existing methods typically break the problem into individual layer-wise sub-problems and minimize per-layer error, measured via various metrics. Yet this approach currently lacks theoretical justification, and the metrics employed may be sub-optimal. In this paper, we present a “linearity theorem” establishing a direct relationship between the layer-wise reconstruction error and the model perplexity increase due to quantization. This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, which outperforms all prior data-free approaches, including the extremely popular NF4 format; and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels that match a given compression constraint, obtained by reduction to dynamic programming. On the practical side, we demonstrate improved accuracy-compression trade-offs on Llama-family models, advancing both data-free and non-uniform quantization for large language models.
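
Code sketch (not from the paper): a minimal, hypothetical Python illustration of the data-free recipe the abstract describes, assuming a randomized Hadamard rotation applied to fixed-size groups of weights, per-group rescaling, and rounding to a small 8-level Lloyd-Max-style grid for a standard normal. The group size g, the illustrative grid values, and all function names are assumptions for exposition, not the authors' implementation.

import numpy as np

def hadamard(n):
    """Sylvester-construction Hadamard matrix; n must be a power of two."""
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

# Approximate Lloyd-Max (MSE-optimal) levels for N(0, 1) at 3 bits -- illustrative only.
GRID = np.array([-2.152, -1.344, -0.756, -0.245, 0.245, 0.756, 1.344, 2.152])

def quantize_groupwise(W, g=64, seed=0):
    """Rotate each length-g group of columns with a randomized Hadamard
    transform, rescale per group, snap to the nearest grid level, and
    rotate back. Assumes W.shape[1] is a multiple of g."""
    rng = np.random.default_rng(seed)
    H = hadamard(g) / np.sqrt(g)                 # orthonormal Hadamard rotation
    R = H * rng.choice([-1.0, 1.0], size=g)      # random sign flips keep it orthogonal
    W_hat = np.empty(W.shape, dtype=np.float64)
    for start in range(0, W.shape[1], g):
        block = W[:, start:start + g] @ R        # rotated entries are approximately Gaussian
        scale = block.std(axis=1, keepdims=True) + 1e-12
        idx = np.abs(block[..., None] / scale[..., None] - GRID).argmin(-1)
        q = GRID[idx] * scale                    # nearest grid level, scaling undone
        W_hat[:, start:start + g] = q @ R.T      # rotate back to the original basis
    return W_hat

# Toy usage: quantize a random matrix and report the relative reconstruction error.
W = np.random.default_rng(1).standard_normal((128, 256))
W_hat = quantize_groupwise(W, g=64)
print("relative MSE:", np.mean((W - W_hat) ** 2) / np.mean(W ** 2))

Under these assumptions, the rotation makes the group entries approximately Gaussian, which is what makes a fixed MSE-optimal grid for N(0, 1) a reasonable data-free choice; the paper's second contribution, allocating non-uniform per-layer bit-widths under a compression budget via dynamic programming, is not shown here.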

Subject: NAACL.2025 - Long Papers