Masks Can be Learned as an Alternative to Experts

Authors: Peiyu Liu, Tianwen Wei, Bo Zhu, Xin Zhao, Shuicheng Yan

In this work, we investigate how to sparsify a pre-trained dense large language model into a mixture-of-experts (MoE) architecture for faster inference. Our approach applies a mask matrix to the activations for each expert, constrained by L0 regularization to minimize the number of activated parameters. Starting with all parameters active, the model is progressively sparsified during training, ensuring minimal performance loss. This approach proves more efficient than one-shot sparsification techniques, which typically require significant resources for performance recovery. Moreover, our approach automatically identifies shared, token-specific, and inactive experts, allowing for more efficient allocation of computational resources. Through extensive experiments, we achieve up to 97% performance retention on downstream tasks with only 50% of the feed-forward parameters activated in dense models. Beyond enhancing inference efficiency, this strategy of sharing computational units among experts presents a valuable framework for designing more generalized and efficient MoE architectures, opening avenues for future advancements in expert-based models.
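The abstract does not say how the L0-constrained mask is parameterized. As a rough illustration only, the sketch below uses the hard-concrete gate, a common differentiable relaxation of the L0 penalty, which may differ from what the authors actually use. All names (`sample_gate`, `deterministic_gate`, `expected_l0`, `mask_activations`) and the constants `BETA`, `GAMMA`, `ZETA` are assumptions introduced for this example and do not come from the paper.

```python
import math
import random

# Assumed hard-concrete hyperparameters (commonly used defaults, not from the paper).
BETA, GAMMA, ZETA = 2.0 / 3.0, -0.1, 1.1


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_gate(log_alpha, rng):
    """Stochastic gate in [0, 1], as it would be sampled during training."""
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)  # keep the logs finite
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    # Stretch to (GAMMA, ZETA), then clip so exact 0 and exact 1 are reachable.
    return min(1.0, max(0.0, s * (ZETA - GAMMA) + GAMMA))


def deterministic_gate(log_alpha):
    """Noise-free gate, as it would be used at inference time."""
    return min(1.0, max(0.0, sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA))


def expected_l0(log_alphas):
    """Differentiable surrogate for the number of non-zero gates."""
    shift = BETA * math.log(-GAMMA / ZETA)
    return sum(sigmoid(la - shift) for la in log_alphas)


def mask_activations(activations, log_alphas):
    """Scale each hidden activation by its gate; a gate of 0 switches the unit off."""
    return [a * deterministic_gate(la) for a, la in zip(activations, log_alphas)]


if __name__ == "__main__":
    log_alphas = [-5.0, 0.0, 5.0]  # one learnable gate parameter per hidden unit
    print(mask_activations([2.0, 3.0, 4.0], log_alphas))
    print(expected_l0(log_alphas))
    print(sample_gate(0.0, random.Random(0)))
```

Under this assumed parameterization, initializing every `log_alpha` to a large positive value keeps all units active at the start, and adding a weighted `expected_l0` term to the training loss pushes gates toward zero over time, which mirrors the progressive sparsification the abstract describes.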

Subject: ACL.2025 - Long Papers

2025.acl-long.768@ACL