Recent video large language models (Video LLMs) often depend on costly human annotations or proprietary APIs (e.g., GPT-4) to produce training data, which limits their training at scale. In this paper, we explore large-scale training of Video LLMs with cheap automatic speech recognition (ASR) transcripts. Specifically, we propose a novel streaming training approach that densely interleaves ASR words and video frames according to their timestamps. Compared with previous studies on vision-language representation learning with ASR, our method enables the model to learn temporally fine-grained vision-language correlations. To support this, we introduce a series of data processing techniques for YouTube videos and their closed captions (CC), yielding 30M samples for pre-training and 1.5M for instruction tuning. Benefiting from this training paradigm, the resulting model excels in streaming applications and can naturally support real-time video commentary. We also introduce a new benchmark focused on sports commentary and event understanding, a domain where live performance is critical. Experiments show that our model outperforms state-of-the-art models in both accuracy and latency. Additionally, our model achieves state-of-the-art or competitive results on several mainstream benchmarks, demonstrating broad generalizability. We will release the code, datasets, and models to facilitate further research.
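To make the interleaving idea concrete, the sketch below is a minimal, assumed illustration (not the authors' implementation) of merging timestamped ASR words and video frames into one temporally ordered stream; the `Frame`, `Word`, and `interleave` names are hypothetical placeholders.

```python
# Minimal sketch (assumed, not the paper's code): densely interleave
# timestamped ASR words with video frames into one streaming sequence.
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Frame:
    timestamp: float  # seconds into the video
    index: int        # frame index (stand-in for visual tokens)


@dataclass
class Word:
    timestamp: float  # ASR-provided word start time in seconds
    text: str


def interleave(frames: List[Frame], words: List[Word]) -> List[Union[Frame, Word]]:
    """Merge frames and ASR words into a single stream ordered by timestamp.

    Both inputs are assumed sorted by time; on a tie, the frame is placed
    first so that each word follows the frames visible at that moment.
    """
    stream: List[Union[Frame, Word]] = []
    i = j = 0
    while i < len(frames) and j < len(words):
        if frames[i].timestamp <= words[j].timestamp:
            stream.append(frames[i])
            i += 1
        else:
            stream.append(words[j])
            j += 1
    stream.extend(frames[i:])
    stream.extend(words[j:])
    return stream


if __name__ == "__main__":
    frames = [Frame(t, k) for k, t in enumerate([0.0, 0.5, 1.0, 1.5])]
    words = [Word(0.2, "the"), Word(0.4, "player"), Word(1.1, "shoots")]
    for item in interleave(frames, words):
        print(item)
```

Ordered this way, a causal language model trained on the interleaved stream would predict each ASR word conditioned only on the frames and words preceding it in time, which is consistent with the streaming, real-time commentary setting described in the abstract.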