Most knowledge distillation (KD) methods focus on teacher-student pairs with similar architectures, such as both being CNN models. The potential and flexibility of KD can be greatly extended by expanding it to Cross-Architecture KD (CAKD), where knowledge from both homogeneous and heterogeneous teachers can be distilled selectively. However, the substantial feature gaps between heterogeneous models (e.g., a ViT teacher vs. a CNN student), which arise from their distinct inherent inductive biases and module functions, make CAKD extremely challenging. To this end, we fuse heterogeneous knowledge before transferring it from teacher to student. This fusion combines the advantages of cross-architecture inductive biases and module functions by merging different combinations of convolution, attention, and MLP modules derived directly from the student's and teacher's module functions. Furthermore, heterogeneous features exhibit diverse spatial distributions, which hinders the effectiveness of the conventional pixel-wise MSE loss; we therefore replace it with a spatial-agnostic InfoNCE loss. Our method is evaluated across various homogeneous models and arbitrary heterogeneous combinations of CNNs, ViTs, and MLPs, yielding promising performance for the distilled models, with maximum gains of 11.47% on CIFAR-100 and 3.67% on ImageNet-1K. Code is available at https://github.com/liguopeng0923/FBT.
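To illustrate the spatial-agnostic contrastive objective mentioned above, the following is a minimal sketch of an InfoNCE-style distillation loss, assuming teacher and student features are first projected to a shared embedding dimension and globally pooled so that no pixel-wise spatial correspondence is required. Function and parameter names (infonce_distill_loss, temperature) are illustrative assumptions, not the released FBT implementation; consult the linked repository for the authors' exact formulation.

```python
# Hedged sketch: spatial-agnostic InfoNCE distillation loss (not the official FBT code).
import torch
import torch.nn.functional as F

def infonce_distill_loss(f_s: torch.Tensor, f_t: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """f_s, f_t: (B, D) pooled student and teacher embeddings for the same batch."""
    f_s = F.normalize(f_s, dim=1)
    f_t = F.normalize(f_t, dim=1)
    # Batch-wise similarity matrix; entry (i, j) compares student i with teacher j.
    logits = f_s @ f_t.t() / temperature
    targets = torch.arange(f_s.size(0), device=f_s.device)
    # Matching student/teacher embeddings of the same image act as positives;
    # all other pairs in the batch serve as negatives.
    return F.cross_entropy(logits, targets)
```

Because the loss only compares pooled embeddings across the batch, it sidesteps the mismatch between, e.g., ViT token layouts and CNN feature maps that would break a pixel-wise MSE term.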