Recent video large language models (Video LLMs) often depend on costly human annotations or proprietary APIs (e.g., GPT-4) to produce training data, which limits their training at scale. In this paper, we explore large-scale training of Video LLMs with cheap automatic speech recognition (ASR) transcripts. Specifically, we propose a novel streaming training approach that densely interleaves ASR words and video frames according to their timestamps. Compared with previous studies on vision-language representation learning with ASR, our method enables the model to learn temporally fine-grained vision-language correlations. To support this, we introduce a series of data processing techniques for YouTube videos and their closed captions (CC), yielding 30M samples for pre-training and 1.5M for instruction tuning. Benefiting from this training paradigm, the resulting model excels in streaming applications and can naturally support real-time video commentary. We also introduce a new benchmark focused on sports commentary and event understanding, a domain where live performance is critical. Experiments show that our model outperforms state-of-the-art models in both accuracy and latency. Additionally, our model achieves state-of-the-art or competitive results on several mainstream benchmarks, demonstrating broad generalizability. We will release the code, datasets, and models to facilitate further research.
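To make the interleaving idea concrete, the sketch below is a minimal, assumed illustration (not the authors' implementation) of merging timestamped ASR words and video frames into one temporally ordered stream; the `Frame`, `Word`, and `interleave` names are hypothetical placeholders.

```python
# Minimal sketch (assumed, not the paper's code): densely interleave
# timestamped ASR words with video frames into one streaming sequence.
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Frame:
    timestamp: float  # seconds into the video
    index: int        # frame index (stand-in for visual tokens)


@dataclass
class Word:
    timestamp: float  # ASR-provided word start time in seconds
    text: str


def interleave(frames: List[Frame], words: List[Word]) -> List[Union[Frame, Word]]:
    """Merge frames and ASR words into a single stream ordered by timestamp.

    Both inputs are assumed sorted by time; on a tie, the frame is placed
    first so that each word follows the frames visible at that moment.
    """
    stream: List[Union[Frame, Word]] = []
    i = j = 0
    while i < len(frames) and j < len(words):
        if frames[i].timestamp <= words[j].timestamp:
            stream.append(frames[i])
            i += 1
        else:
            stream.append(words[j])
            j += 1
    stream.extend(frames[i:])
    stream.extend(words[j:])
    return stream


if __name__ == "__main__":
    frames = [Frame(t, k) for k, t in enumerate([0.0, 0.5, 1.0, 1.5])]
    words = [Word(0.2, "the"), Word(0.4, "player"), Word(1.1, "shoots")]
    for item in interleave(frames, words):
        print(item)
```

Ordered this way, a causal language model trained on the interleaved stream would predict each ASR word conditioned only on the frames and words preceding it in time, which is consistent with the streaming, real-time commentary setting described in the abstract.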