MLWQ: Efficient Small Language Model Deployment via Multi-Level Weight Quantization

Authors: Chun Hu, Junhui He, Shangyu Wu, Yuxin He, Chun Jason Xue, Qingan Li

Small language models (SLMs) are gaining attention for their lower computational and memory requirements while maintaining strong performance. However, deploying SLMs efficiently on resource-constrained devices remains a significant challenge. Post-training quantization (PTQ) is a widely used compression technique that reduces memory usage and inference computation, yet existing methods suffer from inefficient bit-width allocation and insufficiently fine-grained quantization adjustment, leading to suboptimal performance, particularly at lower bit-widths. To address these challenges, we propose multi-level weight quantization (MLWQ), which facilitates the efficient deployment of SLMs. Our method enables more effective bit-width allocation by jointly considering inter-layer loss and intra-layer salience. Furthermore, we propose a fine-grained partitioning of intra-layer salience to support the tuning of quantization parameters within each group. Experimental results indicate that MLWQ achieves competitive performance compared to state-of-the-art methods, providing an effective approach for deploying SLMs efficiently while maintaining model accuracy.
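
To make the idea of salience-driven bit-width allocation concrete, the following is a minimal sketch of group-wise mixed-precision weight quantization. It is not the MLWQ algorithm: the abstract does not specify the salience measure, the inter-layer loss, or the parameter-tuning step, so the mean-squared-magnitude salience proxy, the two-level bit budget, and all function names here (`quantize_group`, `allocate_bits`, `quantize_row`) are assumptions chosen only for illustration.

```python
from typing import List, Tuple


def quantize_group(values: List[float], bits: int) -> List[float]:
    """Uniform min/max quantization of one group, returned in dequantized form."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1  # number of steps between lo and hi
    if hi == lo or levels == 0:
        # Degenerate group: every value maps to the same point.
        return [lo for _ in values]
    scale = (hi - lo) / levels
    # Snap each value to the nearest grid point lo + k * scale.
    return [lo + round((v - lo) / scale) * scale for v in values]


def allocate_bits(salience: List[float], low_bits: int, high_bits: int,
                  high_fraction: float) -> List[int]:
    """Assign high_bits to the most salient groups and low_bits to the rest."""
    n_high = int(len(salience) * high_fraction)
    ranked = sorted(range(len(salience)), key=lambda i: salience[i], reverse=True)
    bits = [low_bits] * len(salience)
    for i in ranked[:n_high]:
        bits[i] = high_bits
    return bits


def quantize_row(weights: List[float], group_size: int, low_bits: int = 2,
                 high_bits: int = 4,
                 high_fraction: float = 0.5) -> Tuple[List[float], List[int]]:
    """Split a weight row into groups, pick a bit-width per group, quantize."""
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    # Assumed salience proxy: mean squared magnitude of the group's weights.
    salience = [sum(w * w for w in g) / len(g) for g in groups]
    bits = allocate_bits(salience, low_bits, high_bits, high_fraction)
    dequantized: List[float] = []
    for group, b in zip(groups, bits):
        dequantized.extend(quantize_group(group, b))
    return dequantized, bits


if __name__ == "__main__":
    row = [0.01, -0.02, 0.03, 0.0, 1.0, -2.0, 0.5, 1.5]
    approx, bits = quantize_row(row, group_size=4)
    print("bit-widths per group:", bits)
    print("dequantized weights:", approx)
```

In this sketch each group keeps its own scale and offset, so the rounding error of any weight is at most half a quantization step of its group. A practical scheme would derive salience from calibration data rather than from weight magnitude alone.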

Subject: EMNLP.2025 - Main

2025.emnlp-main.408@ACL
