Multimedia

2025-02-07 | Total: 5

#1 Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment

Authors: Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao

Recent advances in large language models, particularly following GPT-4o, have sparked increasing interest in developing omni-modal models capable of understanding more modalities. While some open-source alternatives have emerged, they still lag notably behind specialized single-modality models in performance. In this paper, we present Ola, an omni-modal language model that achieves competitive performance across image, video, and audio understanding compared to specialized counterparts. The core design of Ola lies in its progressive modality alignment strategy, which progressively extends the modalities supported by the language model. Our training pipeline begins with the most distinct modalities, image and text, then gradually expands the skill sets of the model using speech data that connects language and audio knowledge, and video data that connects all modalities. The progressive learning pipeline also enables us to keep the cross-modal alignment data relatively small, making it easier and less costly to develop omni-modal models from existing vision-language models. Moreover, to unlock an advanced interactive experience like GPT-4o, we further design a sentence-wise decoding solution for streaming speech generation. Extensive experiments demonstrate that Ola surpasses existing open omni-modal LLMs across all modalities while achieving highly competitive performance compared to state-of-the-art specialized models of similar sizes. We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field. Model weights, code, and data are open-sourced at https://github.com/Ola-Omni/Ola.
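
The sentence-wise decoding idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe Ola's actual implementation, so everything here is an assumption: the function name `sentence_chunks`, the punctuation-based sentence boundary rule, and the idea that each emitted sentence would be handed to a speech decoder while text generation continues.

```python
from typing import Iterable, Iterator

# Assumed boundary rule: a sentence ends at ., ! or ? (hypothetical, not from the paper).
SENTENCE_END = (".", "!", "?")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each complete sentence as soon as it appears in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Emit the buffered text once it ends with sentence-final punctuation.
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush any trailing text that never reached a sentence boundary.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    stream = ["Hello", " world", ".", " How", " are", " you", "?", " Fine"]
    for sentence in sentence_chunks(stream):
        # In a streaming system, each sentence would go to the speech decoder here.
        print(sentence)
```

Chunking at sentence boundaries lets speech synthesis start before the full text response is finished, which is the latency benefit that streaming generation is after.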

Subjects: Computer Vision and Pattern Recognition, Computation and Language, Multimedia, Sound, Audio and Speech Processing, Image and Video Processing

Publish: 2025-02-06 18:59:55 UTC


#2 Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

Authors: Zhen Ye, Xinfa Zhu, Chi-Min Chan, Xinsheng Wang, Xu Tan, Jiahe Lei, Yi Peng, Haohe Liu, Yizhu Jin, Zheqi DAI, Hongzhan Lin, Jianyi Chen, Xingjian Du, Liumeng Xue, Yunlin Chen, Zhifei Li, Lei Xie, Qiuqiang Kong, Yike Guo, Wei Xue

Recent advances in text-based large language models (LLMs), particularly in the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute. However, current state-of-the-art TTS systems leveraging LLMs are often multi-stage, requiring separate models (e.g., diffusion models after the LLM), which complicates the decision of whether to scale a particular model during training or testing. This work makes the following contributions: First, we explore the scaling of train-time and inference-time compute for speech synthesis. Second, we propose a simple framework, Llasa, for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as Llama. Our experiments reveal that scaling train-time compute for Llasa consistently improves the naturalness of synthesized speech and enables the generation of more complex and accurate prosody patterns. Furthermore, from the perspective of scaling inference-time compute, we employ speech understanding models as verifiers during the search, finding that scaling inference-time compute shifts the sampling modes toward the preferences of specific verifiers, thereby improving emotional expressiveness, timbre consistency, and content accuracy. In addition, we have publicly released the checkpoints and training code for our TTS models (1B, 3B, 8B) and codec model.
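
Verifier-guided inference-time scaling can be sketched in its simplest form, best-of-N sampling. This is a stand-in technique chosen for illustration: the abstract does not specify which search procedure Llasa uses. The names `best_of_n`, `toy_generate`, and `toy_verifier` are hypothetical, and the toy generator and verifier only mimic a TTS sampler and a speech understanding model.

```python
import random
from typing import Callable, Tuple


def best_of_n(
    generate: Callable[[random.Random], float],
    verifier: Callable[[float], float],
    n: int,
    seed: int = 0,
) -> Tuple[float, float]:
    """Draw n candidates and keep the one the verifier scores highest."""
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    scores = [verifier(c) for c in candidates]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]


def toy_generate(rng: random.Random) -> float:
    # Placeholder for sampling one utterance; here a sample is just a number.
    return rng.random()


def toy_verifier(sample: float) -> float:
    # Placeholder for a verifier score (e.g. content accuracy); higher is better.
    return sample


if __name__ == "__main__":
    for n in (1, 4, 16):
        _, score = best_of_n(toy_generate, toy_verifier, n)
        print(n, round(score, 3))
```

With a fixed seed, the candidates for a smaller N are a prefix of those for a larger N, so the selected score never decreases as N grows. This mirrors the abstract's observation that spending more inference-time compute shifts outputs toward what the verifier prefers.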

Subjects: Audio and Speech Processing, Artificial Intelligence, Computation and Language, Multimedia, Sound

Publish: 2025-02-06 15:04:00 UTC


#3 UniForm: A Unified Diffusion Transformer for Audio-Video Generation

Authors: Lei Zhao, Linfeng Feng, Dongxu Ge, Fangqiu Yi, Chi Zhang, Xiao-Lei Zhang, Xuelong Li

As a naturally multimodal form of content, audible video delivers an immersive sensory experience. Consequently, audio-video generation systems have substantial potential. However, existing diffusion-based studies mainly employ relatively independent modules for generating each modality and leave shared-weight generative modules largely unexplored. This approach may under-use the intrinsic correlations between audio and visual modalities, potentially resulting in sub-optimal generation quality. To address this, we propose UniForm, a unified diffusion transformer designed to enhance cross-modal consistency. By concatenating auditory and visual information, UniForm learns to generate audio and video simultaneously within a unified latent space, facilitating the creation of high-quality and well-aligned audio-visual pairs. Extensive experiments demonstrate the superior performance of our method in joint audio-video generation, audio-guided video generation, and video-guided audio generation tasks. Our demos are available at https://uniform-t2av.github.io/.

Subjects: Multimedia, Artificial Intelligence, Computer Vision and Pattern Recognition, Sound, Audio and Speech Processing

Publish: 2025-02-06 09:18:30 UTC


#4 MD-BERT: Action Recognition in Dark Videos via Dynamic Multi-Stream Fusion and Temporal Modeling

Authors: Sharana Dharshikgan Suresh Dass, Hrishav Bakul Barua, Ganesh Krishnasamy, Raveendran Paramesran, Raphael C.-W. Phan

Action recognition in dark, low-light (under-exposed) or noisy videos is a challenging task due to visibility degradation, which can obscure critical spatiotemporal details. This paper proposes MD-BERT, a novel multi-stream approach that integrates complementary pre-processing techniques such as gamma correction and histogram equalization alongside raw dark frames to address these challenges. We introduce the Dynamic Feature Fusion (DFF) module, extending existing attentional fusion methods to a three-stream setting, thereby capturing fine-grained and global contextual information across different brightness and contrast enhancements. The fused spatiotemporal features are then processed by a BERT-based temporal model, which leverages its bidirectional self-attention to effectively capture long-range dependencies and contextual relationships across frames. Extensive experiments on the ARID V1.0 and ARID V1.5 dark video datasets show that MD-BERT outperforms existing methods, establishing a new state of the art. Ablation studies further highlight the individual contributions of each input stream and the effectiveness of the proposed DFF and BERT modules. The official website of this work is available at: https://github.com/HrishavBakulBarua/DarkBERT
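
The two pre-processing steps named in the abstract, gamma correction and histogram equalization, can be sketched on a single 8-bit grayscale frame. This is a plain-Python illustration of the textbook operations, not the authors' pipeline: the function names, the default `gamma=2.2`, the brightening convention `out = 255 * (in / 255) ** (1 / gamma)`, and the `three_streams` helper are all assumptions.

```python
from typing import Dict, List

Frame = List[List[int]]  # 8-bit grayscale frame as rows of pixel values (0-255)


def gamma_correct(frame: Frame, gamma: float = 2.2) -> Frame:
    """Brighten a frame; with gamma > 1 this convention lifts dark pixels."""
    return [[round(255 * (p / 255) ** (1 / gamma)) for p in row] for row in frame]


def equalize_hist(frame: Frame) -> Frame:
    """Spread pixel intensities over 0-255 using the cumulative histogram."""
    flat = [p for row in frame for p in row]
    hist = [0] * 256
    for p in flat:
        hist[p] += 1
    cdf, total = [], 0
    for count in hist:
        total += count
        cdf.append(total)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(flat)
    if n == cdf_min:  # constant frame: nothing to equalize
        return [row[:] for row in frame]
    lut = [max(0, round((cdf[v] - cdf_min) / (n - cdf_min) * 255)) for v in range(256)]
    return [[lut[p] for p in row] for row in frame]


def three_streams(frame: Frame) -> Dict[str, Frame]:
    """Build the raw, gamma-corrected, and equalized views of one frame."""
    return {
        "raw": frame,
        "gamma": gamma_correct(frame),
        "histeq": equalize_hist(frame),
    }


if __name__ == "__main__":
    dark = [[0, 10], [20, 30]]
    for name, view in three_streams(dark).items():
        print(name, view)
```

Each view exposes different detail in an under-exposed frame, which is why a fusion module has something complementary to combine across the three streams.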

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Human-Computer Interaction, Machine Learning, Multimedia

Publish: 2025-02-06 02:26:47 UTC


#5 CDIO: Cross-Domain Inference Optimization with Resource Preference Prediction for Edge-Cloud Collaboration

Authors: Zheming Yang, Wen Ji, Qi Guo, Dieli Hu, Chang Zhao, Xiaowei Li, Xuanlei Zhao, Yi Zhao, Chaoyu Gong, Yang You

Currently, massive numbers of video tasks are processed by edge-cloud collaboration. However, the diversity of task requirements and the dynamics of resources pose great challenges to efficient inference and waste substantial resources. In this paper, we present CDIO, a cross-domain inference optimization framework designed for edge-cloud collaboration. For diverse input tasks, CDIO can predict resource preference types by analyzing the spatial complexity and processing requirements of the task. Subsequently, a cross-domain collaborative optimization algorithm is employed to guide resource allocation in the edge-cloud system. By ensuring that each task is matched with the ideal servers, the edge-cloud system can achieve more efficient inference. The evaluation results on public datasets demonstrate that CDIO can effectively meet the accuracy and delay requirements for task processing. Compared to state-of-the-art edge-cloud solutions, CDIO reduces computing and bandwidth consumption by 20%-40% and energy consumption by more than 40%.

Subject: Multimedia

Publish: 2025-02-06 13:42:07 UTC