a201nfn3xX@OpenReview

Total: 1

#1 Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression

Authors: Hanqi Xiao, Yi-Lin Sung, Elias Stengel-Eskin, Mohit Bansal

Post-training quantization reduces a model's memory footprint by mapping full-precision weights to low-bit weights without costly retraining, but it can degrade downstream performance, especially in low 2- to 3-bit settings. Existing methods mitigate these drops by keeping some important weights in higher precision; we develop a new mixed-precision approach, Task-Circuit Quantization (TCQ), that directly conditions the quantization process on specific circuits -- which we define as sets of weights associated with downstream task performance. TCQ draws parallels to automated circuit discovery, introducing a novel method for identifying a small number of key weights that are particularly important to task performance; these weights are kept in 16-bit precision while the rest are quantized, maintaining performance at only a marginal memory cost. Specifically, TCQ contrasts unquantized model weights with a uniformly quantized model to estimate the expected change in each weight due to quantization, and uses gradient information to predict the resulting impact on task performance, allowing task-specific weights to be preserved. We compare TCQ to existing mixed-precision quantization methods and GPTQ when conditioning on both general-purpose and task-specific data. Across QA, math reasoning, and text-to-SQL tasks, and for both Llama-3 and Qwen2.5 models, we find that TCQ outperforms baselines such as SPQR and Slim-LLM using the same calibration data and a lower weight budget, achieving major improvements in the 2- and 3-bit regimes. With only 3.1 bits, we recover 97% of the unquantized 16-bit model's MMLU performance, a 5.25% absolute improvement over SPQR. We also observe consistently large gains over existing methods in the 2-bit regime, with an average gain of 14.74% over the strongest baseline, Slim-LLM. Code: [https://github.com/The-Inscrutable-X/TACQ](https://github.com/The-Inscrutable-X/TACQ)
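
The weight-selection idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `uniform_quantize`, `saliency_mask`, `bits`, and `top_fraction` are illustrative assumptions rather than identifiers from the TACQ codebase, and the real method works per layer with task-specific calibration data. The sketch only shows the first-order score |ΔW · ∇L| used to decide which weights stay in 16-bit.

```python
# Hedged sketch of the TCQ saliency idea: estimate each weight's expected change
# under uniform quantization, weight it by the task gradient, and keep the
# highest-scoring weights in 16-bit. Names and defaults are assumptions.

import torch

def uniform_quantize(w: torch.Tensor, bits: int = 3) -> torch.Tensor:
    """Simple per-tensor uniform quantization, used only to estimate delta_w."""
    qmax = 2 ** bits - 1
    scale = (w.max() - w.min()).clamp(min=1e-8) / qmax
    zero = w.min()
    return torch.round((w - zero) / scale) * scale + zero

def saliency_mask(weight: torch.Tensor, grad: torch.Tensor,
                  bits: int = 3, top_fraction: float = 0.003) -> torch.Tensor:
    """Boolean mask of weights to keep in 16-bit precision.

    Each weight is scored by |expected quantization-induced change| * |task
    gradient|, a first-order estimate of its impact on the task loss; the top
    fraction of weights by score is preserved.
    """
    delta_w = uniform_quantize(weight, bits) - weight   # expected change from quantization
    scores = (delta_w * grad).abs()                     # first-order impact on task loss
    k = max(1, int(top_fraction * scores.numel()))
    threshold = scores.flatten().topk(k).values.min()
    return scores >= threshold

# Usage (assumed workflow): backpropagate a task-specific calibration loss to get
# gradients, then quantize everything except the masked weights.
# for name, p in model.named_parameters():
#     keep16 = saliency_mask(p.data, p.grad, bits=3)
#     p.data = torch.where(keep16, p.data, uniform_quantize(p.data, bits=3))
```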

Subject: COLM.2025