Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs

Authors: Anshumann, Mohd Abbas Zaidi, Akhil Kedia, Jinwoo Ahn, Taehwak Kwon, Kangwook Lee, Haejun Lee, Joohyung Lee

Knowledge distillation can be a cost-effective technique to distill knowledge in Large Language Models, if the teacher output logits can be pre-computed and cached. However, successfully applying this to pre-training remains largely unexplored. In this work, we prove that naive approaches for sparse knowledge distillation, such as caching Top-K probabilities, while intuitive, provide biased estimates of the teacher probability distribution to the student, resulting in suboptimal performance and calibration. We propose an importance-sampling-based method, "Random Sampling Knowledge Distillation", which provides unbiased estimates, preserves the gradient in expectation, and requires storing significantly sparser logits. Our method enables faster training of student models with marginal overhead (<10%) compared to cross-entropy-based training, while maintaining competitive performance compared to full distillation, across a range of model sizes from 300M to 3B.
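The contrast between Top-K caching and sampling-based caching can be illustrated with a small sketch. It is not the paper's implementation: it assumes the simplest case, where tokens are drawn directly from the teacher distribution so that the stored estimate is just the normalized sample count, and it uses a hypothetical six-token toy distribution. The function names (`top_k_estimate`, `sampled_estimate`, `average_estimate`) are illustrative only.

```python
import random
from collections import Counter

def top_k_estimate(probs, k):
    """Keep the k largest teacher probabilities and renormalize them."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}

def sampled_estimate(probs, n_samples, rng):
    """Draw n_samples token ids from the teacher distribution and
    store count / n_samples for each distinct id that was drawn."""
    draws = rng.choices(range(len(probs)), weights=probs, k=n_samples)
    return {i: c / n_samples for i, c in Counter(draws).items()}

def average_estimate(estimator, vocab_size, trials):
    """Average sparse estimates over many trials into a dense vector."""
    avg = [0.0] * vocab_size
    for _ in range(trials):
        for i, p in estimator().items():
            avg[i] += p / trials
    return avg

# Hypothetical toy teacher distribution over a 6-token vocabulary.
teacher = [0.40, 0.25, 0.15, 0.10, 0.06, 0.04]
rng = random.Random(0)

# Top-K is deterministic, so a single trial already gives its expectation.
topk_avg = average_estimate(lambda: top_k_estimate(teacher, 3), len(teacher), 1)
# The sampled estimate is random; average many trials to approximate its expectation.
rs_avg = average_estimate(lambda: sampled_estimate(teacher, 3, rng), len(teacher), 20000)

topk_err = max(abs(a - b) for a, b in zip(topk_avg, teacher))
rs_err = max(abs(a - b) for a, b in zip(rs_avg, teacher))
print(f"max |E[estimate] - teacher|: top-k={topk_err:.3f}, sampled={rs_err:.3f}")
```

Both estimators store at most three entries per token position here. The Top-K estimate assigns zero mass to the tail and inflates the head, so its error does not shrink with more data, whereas the sampled estimate averages toward the teacher distribution.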

Subjects: Machine Learning, Artificial Intelligence, Computation and Language

Publish: 2025-03-21 05:58:18 UTC

arXiv: 2503.16870
