phVWcUSGYP@OpenReview

Total: 1

#1 Matryoshka Quantization

Authors: Pranav Nair, Puranjay Datta, Jeff Dean, Prateek Jain, Aditya Kusupati

Quantizing model weights is critical for reducing the communication and inference costs of large models. However, quantizing models – especially to low precisions like int4 or int2 – requires a trade-off in model quality; int2, in particular, is known to severely degrade model quality. Consequently, practitioners are often forced to maintain multiple models with different quantization levels or serve a single model that best satisfies the quality-latency trade-off. On the other hand, integer data types, such as int8, inherently possess a nested (Matryoshka) structure where smaller bit-width integers, like int4 or int2, are nested within the most significant bits. Leveraging this insight, in this paper, we propose Matryoshka Quantization (MatQuant), a novel multi-scale quantization technique that alleviates the aforementioned challenge. This technique allows us to train and maintain a single quantized model but serve it with the precision demanded by the deployment. Furthermore, leveraging MatQuant's co-training and co-distillation, int2 precision models extracted by MatQuant outperform standard int2 quantization by up to 4% and 7% with OmniQuant and QAT as base algorithms, respectively. Finally, we demonstrate that by using an extra bit to represent outliers, a model with an effective precision of 2.05-bit improves further by 6% with OmniQuant as the base algorithm.
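The nested-integer insight lends itself to a short illustration. The sketch below is an assumption about the slicing mechanics described in the abstract, not the authors' released code: it shows how an int4 or int2 weight can be read off the most significant bits of an int8 weight with an arithmetic right shift, so a single int8 model implicitly contains its lower-precision counterparts.

```python
import numpy as np

# Hypothetical int8-quantized weights (symmetric, two's complement).
w_int8 = np.array([-100, -37, 0, 5, 58, 127], dtype=np.int8)

def slice_msbs(w, source_bits=8, target_bits=4):
    """Keep only the `target_bits` most significant bits of each weight.

    An arithmetic right shift discards the low-order bits, yielding a signed
    integer in the target bit-width (e.g. int4 values in [-8, 7]).
    """
    shift = source_bits - target_bits
    return (w.astype(np.int32) >> shift).astype(np.int8)

w_int4 = slice_msbs(w_int8, target_bits=4)   # nested int4 view of the weights
w_int2 = slice_msbs(w_int8, target_bits=2)   # nested int2 view of the weights

# When dequantizing at the lower precision, the per-channel scale would be
# multiplied by 2**shift so the sliced weights span a comparable range.
print(w_int8, w_int4, w_int2)
```

Under this reading, MatQuant's co-training optimizes the shared int8 weights so that every such sliced view (int8, int4, int2) remains accurate, rather than optimizing a single precision in isolation.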

Subject: ICML.2025 - Poster