Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data

Authors: Shuhao Gu ; Jialing Zhang ; Siyuan Zhou ; Kevin Yu ; Zhaohu Xing ; Liangdong Wang ; Zhou Cao ; Jintao Jia ; Zhuoyi Zhang ; Yixuan Wang ; Zhenchong Hu ; Bo-Wen Zhang ; Jijie Li ; Dong Liang ; Yingli Zhao ; Yulong Ao ; Yaoqi Liu ; Fangxiang Feng ; Guang Liu

Vision-Language Models (VLMs) have recently made significant progress, but the limited scale and quality of open-source instruction data hinder their performance compared to closed-source models. In this work, we address this limitation by introducing Infinity-MM, a large-scale multimodal instruction dataset with 40 million samples, enhanced through rigorous quality filtering and deduplication. We also propose a synthetic instruction generation method based on open-source VLMs, using detailed image annotations and diverse question generation. Using this data, we trained a 2-billion-parameter VLM, Aquila-VL-2B, achieving state-of-the-art (SOTA) performance for models of similar scale. This demonstrates that expanding instruction data and generating synthetic data can significantly improve the performance of open-source models.

Subject: Computation and Language

Published: 2024-10-24 09:03:48 UTC

arXiv: 2410.18558
