Distilling Cross-Modal Knowledge via Feature Disentanglement

Authors: Junhong Liu, Yuan Zhang, Tao Huang, Wenchao Xu, Renyu Yang

Knowledge distillation (KD) has proven highly effective for compressing large models and enhancing the performance of smaller ones. However, its effectiveness diminishes in cross-modal scenarios, such as vision-to-language distillation, where inconsistent representations across modalities make knowledge transfer difficult. To address this challenge, we propose frequency-decoupled cross-modal knowledge distillation, a method that decouples and balances knowledge transfer across modalities by leveraging frequency-domain features. We observe that low-frequency features are highly consistent across modalities, whereas high-frequency features show extremely low cross-modal similarity. Accordingly, we apply distinct losses to the two groups of features: strong alignment in the low-frequency domain and relaxed alignment for high-frequency features. We also propose a scale consistency loss to address distributional shifts between modalities, and employ a shared classifier to unify the feature spaces. Extensive experiments across multiple benchmark datasets show that our method substantially outperforms traditional KD and state-of-the-art cross-modal KD approaches. Code is available at https://github.com/Johumliu/FD-CMKD.
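The frequency-decoupled idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). Everything specific here is an assumption made for illustration: the split is done with a 1-D FFT along the feature dimension, the strong low-frequency alignment is a mean squared error, the relaxed high-frequency alignment is a scale-invariant cosine distance, and the names `split_frequencies`, `decoupled_loss`, `cutoff`, and `high_weight` are hypothetical. The scale consistency loss and the shared classifier are omitted.

```python
import numpy as np

def split_frequencies(features, cutoff):
    """Split (batch, dim) features into low- and high-frequency parts.

    Assumption: frequencies are taken along the feature axis with a real FFT,
    and the first `cutoff` bins count as "low frequency".
    """
    spectrum = np.fft.rfft(features, axis=-1)
    low_spec = spectrum.copy()
    low_spec[..., cutoff:] = 0           # keep only the lowest bins
    high_spec = spectrum - low_spec      # everything else is "high"
    n = features.shape[-1]
    low = np.fft.irfft(low_spec, n=n, axis=-1)
    high = np.fft.irfft(high_spec, n=n, axis=-1)
    return low, high

def strong_alignment(student, teacher):
    """Assumed strong alignment: plain mean squared error."""
    return float(np.mean((student - teacher) ** 2))

def relaxed_alignment(student, teacher, eps=1e-8):
    """Assumed relaxed alignment: 1 - cosine similarity (ignores magnitude)."""
    num = np.sum(student * teacher, axis=-1)
    den = np.linalg.norm(student, axis=-1) * np.linalg.norm(teacher, axis=-1) + eps
    return float(np.mean(1.0 - num / den))

def decoupled_loss(student, teacher, cutoff=4, high_weight=0.1):
    """Combine strong low-frequency and relaxed high-frequency alignment."""
    s_low, s_high = split_frequencies(student, cutoff)
    t_low, t_high = split_frequencies(teacher, cutoff)
    return strong_alignment(s_low, t_low) + high_weight * relaxed_alignment(s_high, t_high)

# Usage with random stand-ins for teacher and student features.
rng = np.random.default_rng(0)
teacher_feats = rng.normal(size=(8, 64))
student_feats = teacher_feats + 0.1 * rng.normal(size=(8, 64))
loss = decoupled_loss(student_feats, teacher_feats)
```

In this sketch the low- and high-frequency parts sum back to the original features, so the split only changes how each part is penalised. The cosine term leaves the student free to differ from the teacher in high-frequency magnitude, which is one possible reading of "relaxed alignment".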

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Published: 2025-11-25 03:45:37 UTC

2511.19887
