Humans naturally use multimodal information, integrating vision, speech, and text to understand the world and solve problems. For artificial intelligence to achieve human-level capability, it must process multimodal information in a similar manner. However, large-scale open-domain datasets that cover all three modalities (vision, speech, and text) with high-quality speech transcriptions remain scarce. To address this gap, we introduce GALAXY, a large-scale, open-domain dataset designed for multimodal learning, containing 8,270 hours of video, speech, and transcriptions across 16 diverse domains. We describe the data creation pipeline and provide detailed statistics and analyses of the dataset. Using multimodal speech recognition as a case study, we validate GALAXY's effectiveness and evaluate baseline models' performance across different data volumes and domains. The results highlight GALAXY's potential as a valuable resource for advancing multimodal understanding.