Date: Fri, 14 Jun 2024 | Total: 6

#1 Explore the Limits of Omni-modal Pretraining at Scale

Authors: Yiyuan Zhang ; Handong Li ; Jing Liu ; Xiangyu Yue

We propose to build omni-modal intelligence, which is capable of understanding any modality and learning universal representations. Specifically, we propose a scalable pretraining paradigm, named Multimodal Context (MiCo), which can scale up the number of modalities and the amount of data, together with the model parameters, in the pretraining process. With MiCo, the pretrained models show significant emergent abilities in multimodal learning, which are evaluated on the following tasks: i) single-modality perception benchmarks of 10 different modalities, ii) 25 cross-modality understanding tasks of retrieval, question-answering, and captioning, and iii) 18 multimodal large language model benchmarks. Our models establish 37 new records for state-of-the-art performance. We hope that our research could contribute to the development of omni-modal intelligence. Code and Models are at

Subjects: Computer Vision and Pattern Recognition ; Artificial Intelligence ; Machine Learning ; Multimedia

Publish: 2024-06-13 17:59:53 UTC

#2 PianoMotion10M: Dataset and Benchmark for Hand Motion Generation in Piano Performance

Authors: Qijun Gan ; Song Wang ; Shengtao Wu ; Jianke Zhu

Recently, artificial intelligence techniques for education have received increasing attention, while designing effective musical instrument instruction systems remains an open problem. Although key presses can be directly derived from sheet music, the transitional movements among key presses require more extensive guidance in piano performance. In this work, we construct a piano-hand motion generation benchmark to guide hand movements and fingerings for piano playing. To this end, we collect an annotated dataset, PianoMotion10M, consisting of 116 hours of piano playing videos from a bird's-eye view with 10 million annotated hand poses. We also introduce a powerful baseline model that generates hand motions from piano audio through a position predictor and a position-guided gesture generator. Furthermore, a series of evaluation metrics is designed to assess the performance of the baseline model, including motion similarity, smoothness, positional accuracy of left and right hands, and overall fidelity of movement distribution. Although piano key presses corresponding to music scores or audio are already accessible, PianoMotion10M aims to provide guidance on piano fingering for instruction purposes. The dataset and source code can be accessed at

Subjects: Sound ; Artificial Intelligence ; Computer Vision and Pattern Recognition ; Multimedia ; Audio and Speech Processing

Publish: 2024-06-13 17:05:23 UTC

#3 Towards Multilingual Audio-Visual Question Answering

Authors: Orchid Chetia Phukan ; Priyabrata Mallick ; Swarup Ranjan Behera ; Aalekhya Satya Narayani ; Arun Balaji Buduru ; Rajesh Sharma

In this paper, we work towards extending Audio-Visual Question Answering (AVQA) to multilingual settings. Existing AVQA research has predominantly revolved around English, and replicating it for other languages requires a substantial allocation of resources. As a scalable solution, we leverage machine translation and present two multilingual AVQA datasets for eight languages created from existing benchmark AVQA datasets. This avoids the extra human annotation effort of collecting questions and answers manually. To this end, we propose the MERA framework, which leverages state-of-the-art (SOTA) video, audio, and textual foundation models for AVQA in multiple languages. We introduce a suite of models, namely MERA-L, MERA-C, and MERA-T, with varied model architectures to benchmark the proposed datasets. We believe our work will open new research directions and act as a reference benchmark for future work in multilingual AVQA.

Subjects: Machine Learning ; Computer Vision and Pattern Recognition ; Multimedia ; Sound ; Audio and Speech Processing

Publish: 2024-06-13 14:18:56 UTC

#4 Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding

Authors: Yue Xu ; Kaizhi Yang ; Jiebo Luo ; Xuejin Chen

3D visual grounding is an emerging research area dedicated to making connections between the 3D physical world and natural language, which is crucial for achieving embodied intelligence. In this paper, we propose DASANet, a Dual Attribute-Spatial relation Alignment Network that separately models and aligns object attributes and spatial relation features between language and 3D vision modalities. We decompose both the language and 3D point cloud input into two separate parts and design a dual-branch attention module to separately model the decomposed inputs while preserving global context in attribute-spatial feature fusion through cross-attention. Our DASANet achieves the highest grounding accuracy of 65.1% on the Nr3D dataset, 1.3% higher than the best competitor. In addition, the visualization of the two branches shows that our method is efficient and highly interpretable.

Subjects: Computer Vision and Pattern Recognition ; Multimedia

Publish: 2024-06-13 08:06:57 UTC

#5 Gaussian-Forest: Hierarchical-Hybrid 3D Gaussian Splatting for Compressed Scene Modeling

Authors: Fengyi Zhang ; Tianjun Zhang ; Lin Zhang ; Helen Huang ; Yadan Luo

The field of novel-view synthesis has recently witnessed the emergence of 3D Gaussian Splatting, which represents scenes in a point-based manner and renders through rasterization. This methodology, in contrast to Radiance Fields that rely on ray tracing, demonstrates superior rendering quality and speed. However, the explicit and unstructured nature of 3D Gaussians poses a significant storage challenge, impeding its broader application. To address this challenge, we introduce the Gaussian-Forest modeling framework, which hierarchically represents a scene as a forest of hybrid 3D Gaussians. Each hybrid Gaussian retains its unique explicit attributes while sharing implicit ones with its sibling Gaussians, thus optimizing parameterization with significantly fewer variables. Moreover, adaptive growth and pruning strategies are designed, ensuring detailed representation in complex regions and a notable reduction in the number of required Gaussians. Extensive experiments demonstrate that Gaussian-Forest not only maintains comparable speed and quality but also achieves a compression rate surpassing 10 times, marking a significant advancement in efficient scene modeling. Codes are available at
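
To illustrate why sharing implicit attributes among sibling Gaussians reduces storage, here is a minimal parameter-counting sketch. It is not the authors' implementation: the split between explicit and implicit attributes, the group size, and all names (`AttributeLayout`, `flat_parameter_count`, `shared_parameter_count`) are hypothetical choices made only for illustration. The 59 values per Gaussian match a common unstructured 3D Gaussian Splatting layout (position, scale, rotation, opacity, and degree-3 spherical harmonics), but which of them the paper treats as shared is an assumption here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeLayout:
    """Hypothetical split of per-Gaussian values (not taken from the paper)."""
    explicit_per_gaussian: int  # values every Gaussian stores on its own
    implicit_per_gaussian: int  # values assumed to be shareable among siblings


def flat_parameter_count(num_gaussians: int, layout: AttributeLayout) -> int:
    """Unstructured storage: every Gaussian keeps all of its values."""
    per_gaussian = layout.explicit_per_gaussian + layout.implicit_per_gaussian
    return num_gaussians * per_gaussian


def shared_parameter_count(
    num_gaussians: int, siblings_per_group: int, layout: AttributeLayout
) -> int:
    """Hybrid storage: explicit values per Gaussian, implicit values per group."""
    # Ceiling division so a partially filled last group still gets its own copy.
    num_groups = -(-num_gaussians // siblings_per_group)
    return (
        num_gaussians * layout.explicit_per_gaussian
        + num_groups * layout.implicit_per_gaussian
    )


# Assumed layout: 3 explicit values (e.g. position) and 56 shareable values.
layout = AttributeLayout(explicit_per_gaussian=3, implicit_per_gaussian=56)
num_gaussians = 1_000_000   # assumed scene size
siblings_per_group = 10     # assumed number of siblings sharing one implicit set

flat = flat_parameter_count(num_gaussians, layout)
shared = shared_parameter_count(num_gaussians, siblings_per_group, layout)
print(f"flat:   {flat:,} values")
print(f"shared: {shared:,} values")
print(f"ratio:  {flat / shared:.2f}x")
```

Under these assumed numbers, sharing alone gives a reduction below 10 times; in this toy model, further savings would have to come from using fewer Gaussians overall, which is the role the abstract assigns to the growth and pruning strategies.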

Subjects: Computer Vision and Pattern Recognition ; Multimedia

Publish: 2024-06-13 02:41:11 UTC

#6 A new approach for predicting the Quality of Experience in multimedia services using machine learning

Authors: Parsa Hassani Shariat Panahi ; Amir Hossein Jalilvand ; Abolfazl Diyanat

In today's world, the Internet is recognized as one of the essentials of human life, playing a significant role in communications, business, and lifestyle. The quality of internet services can have widespread negative impacts on individual and social levels. Consequently, Quality of Service (QoS) has become a fundamental necessity for service providers in a competitive market aiming to offer superior services. The success and survival of these providers depend on their ability to maintain high service quality and ensure satisfaction. Alongside QoS, the concept of Quality of Experience (QoE) has emerged with the development of telephony networks. QoE focuses on the user's satisfaction with the service, helping operators adjust their services to meet user expectations. Recent research shows a trend towards utilizing machine learning and deep learning techniques to predict QoE. Researchers aim to develop accurate models by leveraging large volumes of data from network and user interactions, considering various real-world scenarios. Despite the complexity of network environments, this research provides a practical framework for improving and evaluating QoE. This study presents a comprehensive framework for evaluating QoE in multimedia services that adheres to the ITU-T P.1203 standard, includes automated data collection processes, and uses machine learning algorithms to predict user satisfaction based on key network parameters. Trained on over 20,000 data records collected from different network conditions and users, the Random Forest model achieved a prediction accuracy of 95.8% for user satisfaction. This approach allows operators to dynamically allocate network resources in real time, maintaining high levels of customer satisfaction with minimal costs.

Subjects: Networking and Internet Architecture ; Artificial Intelligence ; Multimedia

Publish: 2024-06-12 18:07:06 UTC