Medical multi-modal learning requires effective fusion of heterogeneous modalities. One vital challenge is how to fuse modalities effectively when data quality varies across modalities and patients. For example, in the TCGA benchmark, the performance of the same modality can differ across cancer types. Moreover, data collected at different times, at different locations, and with varying reagents can introduce inter-modal data quality differences ($i.e.$, the $\textbf{Modality Batch Effect}$). In response, we propose ${\textbf{A}}$daptive ${\textbf{M}}$odality Token Re-Balan${\textbf{C}}$ing ($\texttt{AMC}$), a novel top-down dynamic multi-modal fusion approach. The core of $\texttt{AMC}$ is to quantify the significance of each modality (top) and then fuse the modalities according to their importance (down). Specifically, we assess the quality of each input modality and replace its uninformative tokens with inter-modal tokens accordingly: the more important a modality is, the more of its informative tokens are retained. Self-attention then integrates these mixed tokens to fuse multi-modal knowledge. Comprehensive experiments on both medical and general multi-modal datasets demonstrate the effectiveness and generalizability of $\texttt{AMC}$.
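The importance-proportional token mixing described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes all modalities share the same sequence length, and it uses the token embedding's L2 norm as a stand-in informativeness score (the actual quality/importance criterion of $\texttt{AMC}$ is not specified here).

```python
import numpy as np

def rebalance_tokens(tokens, importance):
    """Hypothetical sketch of importance-proportional token re-balancing.

    tokens: dict mapping modality name -> (L, dim) token array (same L for all).
    importance: dict mapping modality name -> non-negative importance score.
    Returns one mixed sequence of length L in which each modality contributes
    a number of tokens proportional to its importance; the mixed sequence
    would then be passed to self-attention for fusion.
    """
    names = list(tokens)
    L = tokens[names[0]].shape[0]
    w = np.array([importance[m] for m in names], dtype=float)
    quota = np.floor(w / w.sum() * L).astype(int)
    quota[np.argmax(w)] += L - quota.sum()  # give rounding leftovers to the top modality

    mixed = []
    for m, q in zip(names, quota):
        t = tokens[m]
        info = np.linalg.norm(t, axis=1)   # stand-in per-token informativeness
        keep = np.argsort(info)[::-1][:q]  # retain the q most informative tokens of m
        mixed.append(t[keep])
    return np.concatenate(mixed, axis=0)
```

With two modalities of importance 3 and 1 and sequence length 8, the mixed sequence keeps the 6 most informative tokens of the first modality and replaces the remaining slots with the 2 most informative tokens of the second, so a less reliable modality contributes fewer tokens to the fused representation.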