Molecule-Space: Free Lunch in Unified Multimodal Space via Knowledge Fusion

Authors: Zehan Wang ; Ziang Zhang ; Xize Cheng ; Rongjie Huang ; Luping Liu ; Zhenhui Ye ; Haifeng Huang ; Yang Zhao ; Tao Jin ; Peng Gao ; Zhou Zhao

Unified multimodal representation spaces are the foundation of multimodal understanding and generation. However, the billions of model parameters and the risk of catastrophic forgetting make it challenging to further enhance pre-trained unified spaces. In this work, we propose Molecule-Space, an idea that treats multimodal representation spaces as "molecules" and augments a pre-trained unified space by integrating knowledge from extra expert spaces via "molecule space reactions". Specifically, we introduce two kinds of basic space reactions: 1) Space Displacement Reaction and 2) Space Combination Reaction. Based on these basic reactions, we design Complex Sequential & Parallel Reactions to effectively integrate multiple spaces simultaneously. Benefiting from the modularization concept, we further propose a coarse-to-fine customized inference strategy to flexibly adjust the enhanced unified space for different purposes. Experimentally, we fuse the audio-image-text space of ImageBind with image-text and audio-text expert spaces. The resulting space outperforms ImageBind on 5 downstream tasks across 9 datasets. Moreover, via customized inference, it even surpasses the image-text and audio-text expert spaces used for fusion.

arXiv: 2405.04883
