Multimedia

#1 VRAgent-R1: Boosting Video Recommendation with MLLM-based Agents via Reinforcement Learning

Authors: Siran Chen, Boyu Chen, Chenyun Yu, Yuxiao Luo, Ouyang Yi, Lei Cheng, Chengxiang Zhuo, Zang Li, Yali Wang

Owing to their powerful natural language processing and generative capabilities, large language model (LLM) agents have emerged as a promising solution for enhancing recommendation systems via user simulation. However, in video recommendation, existing studies predominantly resort to prompt-based simulation using frozen LLMs and struggle with multimodal content understanding. This frequently results in suboptimal item modeling and user preference learning, ultimately constraining recommendation performance. To address these challenges, we introduce VRAgent-R1, a novel agent-based paradigm that incorporates human-like intelligence in user simulation. Specifically, VRAgent-R1 comprises two distinct agents: the Item Perception (IP) Agent and the User Simulation (US) Agent, designed for interactive user-item modeling. First, the IP Agent emulates human-like progressive thinking based on MLLMs, effectively capturing hidden recommendation semantics in videos. With the more comprehensive multimodal content understanding provided by the IP Agent, the video recommendation system can provide higher-quality candidate items. Subsequently, the US Agent refines the recommended video sets based on in-depth chain-of-thought (CoT) reasoning and achieves better alignment with real user preferences through reinforcement learning. Experimental results on a large-scale video recommendation benchmark demonstrate the effectiveness of the proposed VRAgent-R1 method, e.g., the IP Agent achieves a 6.0% improvement in NDCG@10 on the MicroLens-100k dataset, while the US Agent shows approximately 45.0% higher accuracy in user decision simulation compared to state-of-the-art baselines.

Subject: Multimedia

Publish: 2025-07-03 13:52:24 UTC
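
The VRAgent-R1 abstract above reports its IP Agent gain in NDCG@10. As a reminder of what that metric measures, here is a minimal sketch of NDCG@k under binary relevance. The function names `dcg_at_k` and `ndcg_at_k`, the binary-relevance assumption, and the assumption that the ranked list contains no duplicate items are illustrative choices, not details taken from the paper or its evaluation code.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance scores.

    The item at 0-based rank r is discounted by log2(r + 2), so the top
    item has discount log2(2) = 1.
    """
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_items, relevant_items, k=10):
    """NDCG@k with binary relevance (assumed here, not taken from the paper).

    ranked_items: item ids in the order the recommender returned them
                  (assumed to contain no duplicates).
    relevant_items: set of item ids the user actually interacted with.
    """
    gains = [1.0 if item in relevant_items else 0.0 for item in ranked_items]
    # The ideal ranking places every relevant item first, capped at k.
    ideal = [1.0] * min(len(relevant_items), k)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(gains, k) / ideal_dcg
```

A ranking that puts the single relevant item first scores 1.0, and the score drops as that item moves down the list.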

#2 TAGF: Time-aware Gated Fusion for Multimodal Valence-Arousal Estimation

Authors: Yubeen Lee, Sangeun Lee, Chaewon Park, Junyeop Cha, Eunil Park

Multimodal emotion recognition often suffers from performance degradation in valence-arousal estimation due to noise and misalignment between audio and visual modalities. To address this challenge, we introduce TAGF, a Time-aware Gated Fusion framework for multimodal emotion recognition. TAGF adaptively modulates the contribution of recursive attention outputs based on temporal dynamics. Specifically, TAGF incorporates a BiLSTM-based temporal gating mechanism to learn the relative importance of each recursive step and integrate multistep cross-modal features. By embedding temporal awareness into the recursive fusion process, TAGF captures the sequential evolution of emotional expressions and the complex interplay between modalities. Experimental results on the Aff-Wild2 dataset demonstrate that TAGF achieves competitive performance compared with existing recursive attention-based models. Furthermore, TAGF exhibits strong robustness to cross-modal misalignment and reliably models dynamic emotional transitions in real-world conditions.

Subjects: Multimedia, Sound

Publish: 2025-07-02 18:31:24 UTC
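
The TAGF abstract above describes weighting the outputs of several recursive attention steps by their learned importance. The sketch below illustrates only that general idea of gated fusion across steps. It is not the authors' method: in the paper the gate is BiLSTM-based, whereas here the per-step gate scores are passed in directly, and the softmax normalisation and the names `softmax` and `gated_fusion` are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(step_features, gate_scores):
    """Fuse per-step feature vectors using softmax-normalised gate weights.

    step_features: one feature vector (list of floats) per recursive step.
    gate_scores: one unnormalised importance score per step. A learned
                 temporal model would produce these; here they are inputs.
    Returns the fused vector and the weights that were applied.
    """
    if len(step_features) != len(gate_scores):
        raise ValueError("need exactly one gate score per step")
    weights = softmax(gate_scores)
    dim = len(step_features[0])
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, step_features))
        for i in range(dim)
    ]
    return fused, weights
```

With equal gate scores the fusion reduces to a plain average of the steps; a much larger score for one step makes the fused vector follow that step almost exclusively.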

#3 Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation

Authors: Feizhen Huang, Yu Wu, Yutian Lin, Bo Du

Video-to-Audio (V2A) generation has made significant progress and plays a crucial role in film and video post-production. However, current methods overlook cinematic language, a critical component of artistic expression in filmmaking. As a result, their performance deteriorates in scenarios where Foley targets are only partially visible. To address this challenge, we propose a simple self-distillation approach to extend V2A models to cinematic language scenarios. By simulating cinematic language variations, the student model learns to align the video features of training pairs with the same audio-visual correspondences, enabling it to effectively capture the associations between sounds and partial visual information. Our method not only achieves impressive improvements under partial visibility across all evaluation metrics, but also enhances performance on the large-scale V2A dataset, VGGSound.

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Multimedia

Publish: 2025-07-03 03:23:11 UTC

#4 Why Multi-Interest Fairness Matters: Hypergraph Contrastive Multi-Interest Learning for Fair Conversational Recommender System

Authors: Yongsen Zheng, Zongxuan Xie, Guohua Wang, Ziyao Liu, Liang Lin, Kwok-Yan Lam

Unfairness is a well-known challenge in Recommender Systems (RSs), often resulting in biased outcomes that disadvantage users or items based on attributes such as gender, race, age, or popularity. Although some approaches have started to improve recommendation fairness in offline or static contexts, unfairness often worsens over time, leading to significant problems like the Matthew effect, filter bubbles, and echo chambers. To address these challenges, we propose a novel framework, Hypergraph Contrastive Multi-Interest Learning for Fair Conversational Recommender System (HyFairCRS), aiming to promote multi-interest diversity fairness in dynamic and interactive Conversational Recommender Systems (CRSs). HyFairCRS first captures a wide range of user interests by establishing diverse hypergraphs through contrastive learning. These interests are then utilized in conversations to generate informative responses and ensure fair item predictions within the dynamic user-system feedback loop. Experiments on two CRS-based datasets show that HyFairCRS achieves new state-of-the-art performance while effectively alleviating unfairness. Our code is available at https://github.com/zysensmile/HyFairCRS.

Subjects: Information Retrieval, Computation and Language, Multimedia

Publish: 2025-07-01 11:39:42 UTC
