
#1 DA-KD: Difficulty-Aware Knowledge Distillation for Efficient Large Language Models

Authors: Changyi He, Yifu Ding, Jinyang Guo, Ruihao Gong, Haotong Qin, Xianglong Liu

Although knowledge distillation (KD) is an effective approach to improving the performance of a smaller LLM (i.e., the student model) by transferring knowledge from a large LLM (i.e., the teacher model), it still suffers from high training cost. Existing LLM distillation methods ignore the difficulty differences among samples, so easy samples that no longer need distillation are still processed, which leads to high distillation cost. In this paper, we propose a difficulty-aware knowledge distillation (DA-KD) framework for efficient knowledge distillation, in which we dynamically adjust the distillation dataset based on the difficulty of samples. We further observe that existing KD losses perform poorly when most samples in the distillation dataset are difficult, due to unstable optimization and the neglect of hard samples. Therefore, we also propose a new KD loss, called bidirectional discrepancy loss (BDL), for effective KD. Extensive experiments demonstrate that our DA-KD framework is both effective and efficient. Without bells and whistles, DA-KD outperforms existing state-of-the-art KD methods by 2\% with half the training cost and even surpasses the teacher model with 4.7$\times$ compression.
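To make the idea concrete, below is a minimal, hedged PyTorch sketch of the two ingredients the abstract describes: scoring samples by their current distillation loss and keeping only the hard ones, plus a two-direction KD loss. The abstract does not give DA-KD's actual difficulty criterion or the BDL formula, so the specific choices here (per-sample KD loss as the difficulty score, a weighted mixture of forward and reverse KL as a stand-in "bidirectional" loss, and the `keep_ratio` / `alpha` parameters and function names) are illustrative assumptions, not the authors' method.

```python
# Illustrative sketch only: difficulty metric, loss form, and all names below
# are assumptions; they are not taken from the DA-KD paper itself.
import torch
import torch.nn.functional as F


def forward_kl(student_logits, teacher_logits, T=1.0):
    """Per-sample KL(teacher || student), the standard KD objective."""
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.log_softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, log_target=True, reduction="none").sum(-1)


def reverse_kl(student_logits, teacher_logits, T=1.0):
    """Per-sample KL(student || teacher), which penalizes mass the student places where the teacher does not."""
    return forward_kl(teacher_logits, student_logits, T)


def bidirectional_loss(student_logits, teacher_logits, alpha=0.5, T=1.0):
    """Stand-in for a bidirectional discrepancy: a weighted sum of both KL directions."""
    return (alpha * forward_kl(student_logits, teacher_logits, T)
            + (1.0 - alpha) * reverse_kl(student_logits, teacher_logits, T))


def select_hard_samples(per_sample_loss, keep_ratio=0.5):
    """Keep only the hardest fraction of samples (largest loss) for the next distillation round."""
    k = max(1, int(keep_ratio * per_sample_loss.numel()))
    return torch.topk(per_sample_loss, k).indices


if __name__ == "__main__":
    batch, vocab = 8, 32
    student_logits = torch.randn(batch, vocab)
    teacher_logits = torch.randn(batch, vocab)

    loss = bidirectional_loss(student_logits, teacher_logits)
    hard_idx = select_hard_samples(loss, keep_ratio=0.5)
    print("per-sample loss:", loss)
    print("retained (hard) sample indices:", hard_idx)
```

In this sketch, `select_hard_samples` would be run periodically over the distillation set so that easy samples are dropped and compute is spent only on difficult ones, which is the efficiency argument the abstract makes; how DA-KD actually schedules this is not specified in the abstract.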

Subject: ICML.2025 - Poster