Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient

Authors: Yuan Gao, Zujing Liu, Weizhong Zhang, Bo Du, Gui-Song Xia

Recent pruning methods for Large Language Models (LLMs) typically operate in the post-training phase without expensive weight fine-tuning; however, their pruning criteria often rely on **heuristically hand-crafted metrics**, potentially leading to suboptimal performance. We instead propose a novel **optimization-based structural pruning** method that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. To preserve efficiency, our method **eliminates back-propagation** through the LLM *per se* during the optimization, requiring only **the forward pass of the LLM**. We achieve this by learning an underlying Bernoulli distribution from which binary pruning masks are sampled, and by decoupling the Bernoulli parameters from the LLM loss, which enables efficient optimization via a *policy gradient estimator* without back-propagation. As a result, our method is able to 1) *support global and heterogeneous pruning* (*i.e.*, our method automatically determines different redundancy for different layers), and 2) *optionally initialize with a metric-based method* (for our Bernoulli distributions). Extensive experiments conducted on LLaMA, LLaMA-2, LLaMA-3, Vicuna, and Mistral models using the C4 and WikiText2 datasets demonstrate the promising performance of our method in efficiency and effectiveness.
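
The core idea in the abstract, learning Bernoulli keep-probabilities with a policy gradient (score-function) estimator that needs only forward-pass losses, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy linear model standing in for the LLM, the sparsity penalty, the running-mean baseline, and all names (`forward_loss`, `learn_mask_probs`, `sparsity_weight`) are assumptions made purely for illustration.

```python
import numpy as np

def forward_loss(masks, weights, inputs, targets):
    """Forward-only loss of a masked toy linear model (a stand-in for the pruned LLM).

    masks has shape (num_samples, num_units); one loss is returned per mask.
    """
    preds = (masks * weights) @ inputs.T              # (num_samples, num_examples)
    return np.mean((preds - targets) ** 2, axis=1)

def learn_mask_probs(weights, inputs, targets, steps=1000, samples=16,
                     lr=0.2, sparsity_weight=2.0, seed=0):
    """Learn Bernoulli keep-probabilities without differentiating through the model."""
    rng = np.random.default_rng(seed)
    logits = np.zeros_like(weights)                   # keep-probability starts at 0.5
    baseline = None                                   # running mean of the objective
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-logits))
        # Sample binary masks from the current Bernoulli distribution.
        masks = (rng.random((samples, weights.size)) < probs).astype(float)
        # Objective = task loss + hypothetical penalty on the fraction of kept units.
        objective = (forward_loss(masks, weights, inputs, targets)
                     + sparsity_weight * masks.mean(axis=1))
        if baseline is None:
            baseline = objective.mean()
        # Score function of a Bernoulli w.r.t. its logit: d log p(m) / d logit = m - p.
        score = masks - probs
        grad = ((objective - baseline)[:, None] * score).mean(axis=0)
        logits -= lr * grad                           # gradient descent on the expected objective
        baseline = 0.9 * baseline + 0.1 * objective.mean()
    return 1.0 / (1.0 + np.exp(-logits))

# Toy problem: the first four units matter, the last four are nearly redundant.
data_rng = np.random.default_rng(1)
weights = np.array([1.0, 1.0, 1.0, 1.0, 0.01, 0.01, 0.01, 0.01])
inputs = data_rng.normal(size=(64, 8))
targets = inputs @ weights                            # outputs of the unpruned model
probs = learn_mask_probs(weights, inputs, targets)
print(np.round(probs, 3))
```

In this sketch the model is only ever evaluated, never differentiated: the gradient with respect to the Bernoulli parameters comes entirely from the sampled masks and their scalar losses. The learned probabilities should end up high for the units that matter and low for the redundant ones, which is the toy analogue of assigning different redundancy to different layers.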

Subject: ACL.2025 - Long Papers

2025.acl-long.1421@ACL
