Can Hierarchical Cross-Modal Fusion Predict Human Perception of AI Dubbed Content?


Authors: Ashwini Dasare, Nirmesh Shah, Ashishkumar Gudmalwar, Pankaj Wasnik

Evaluating AI-generated dubbed content is inherently multi-dimensional: perceived quality is shaped by synchronization, intelligibility, speaker consistency, emotional alignment, and semantic context. Human Mean Opinion Scores (MOS) remain the gold standard but are costly and impractical at scale. We present a hierarchical multimodal architecture for perceptually meaningful dubbing evaluation that integrates complementary cues from audio, video, and text. The model captures fine-grained features such as speaker identity, prosody, and content from audio; facial expressions and scene-level cues from video; and semantic context from text. These features are progressively fused through intra- and inter-modal layers. Lightweight LoRA adapters enable parameter-efficient fine-tuning across modalities. To overcome the scarcity of subjective labels, we derive proxy MOS by aggregating objective metrics with weights optimized via active learning. The proposed architecture was trained on 12k Hindi-English bidirectional dubbed clips and then fine-tuned with human MOS. Our approach achieves strong perceptual alignment (Pearson correlation coefficient, PCC > 0.75), providing a scalable solution for automatic evaluation of AI-dubbed content.
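As a rough illustration of the proxy-MOS idea and the PCC criterion mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes each objective metric has already been normalized to [0, 1], uses fixed hypothetical weights in place of the weights the paper optimizes via active learning, and maps the weighted average onto a 1-5 MOS scale. The metric names and function names are placeholders chosen for illustration.

```python
import math


def proxy_mos(metrics, weights):
    """Weighted average of normalized metrics, mapped onto a 1-5 scale.

    Assumption: every metric value lies in [0, 1]. The weights here are
    fixed by hand; the paper instead optimizes them via active learning.
    """
    total = sum(weights.values())
    score = sum(weights[name] * metrics[name] for name in weights) / total
    return 1.0 + 4.0 * score


def pearson(xs, ys):
    """Pearson correlation coefficient (PCC) between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical metric names, loosely following the dimensions in the abstract.
weights = {"sync": 0.4, "intelligibility": 0.3, "speaker": 0.2, "emotion": 0.1}
clip = {"sync": 0.9, "intelligibility": 0.8, "speaker": 0.7, "emotion": 0.6}
print(proxy_mos(clip, weights))
```

Under this sketch, perceptual alignment would be checked by computing `pearson` between predicted scores and human MOS over a held-out set and comparing the result against a threshold such as the 0.75 reported above.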

Subject: Audio and Speech Processing

Publish: 2026-03-30 17:33:58 UTC

2603.28717
