Reconstruction-Driven Multimodal Representation Learning for Automated Media Understanding

Authors: Yassir Benhammou, Suman Kalyan, Sujay Kumar

Broadcast and media organizations increasingly rely on artificial intelligence to automate the labor-intensive processes of content indexing, tagging, and metadata generation. However, existing AI systems typically operate on a single modality (such as video, audio, or text), limiting their understanding of complex, cross-modal relationships in broadcast material. In this work, we propose a Multimodal Autoencoder (MMAE) that learns unified representations across text, audio, and visual data, enabling end-to-end automation of metadata extraction and semantic clustering. The model is trained on the recently introduced LUMA dataset, a fully aligned benchmark of multimodal triplets representative of real-world media content. By minimizing joint reconstruction losses across modalities, the MMAE discovers modality-invariant semantic structures without relying on large paired or contrastive datasets. We demonstrate significant improvements in clustering and alignment metrics (Silhouette, ARI, NMI) compared to linear baselines, indicating that reconstruction-based multimodal embeddings can serve as a foundation for scalable metadata generation and cross-modal retrieval in broadcast archives. These results highlight the potential of reconstruction-driven multimodal learning to enhance automation, searchability, and content management efficiency in modern broadcast workflows.
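The abstract does not spell out the form of the joint reconstruction loss. As a minimal sketch, assuming it is a weighted sum of per-modality mean squared errors between each input and its reconstruction, the objective could look like the following. The function names, the `weights` parameter, and the toy vectors are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional

Vector = List[float]


def mse(x: Vector, x_hat: Vector) -> float:
    """Mean squared error between an input vector and its reconstruction."""
    if len(x) != len(x_hat):
        raise ValueError("input and reconstruction must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def joint_reconstruction_loss(
    inputs: Dict[str, Vector],
    reconstructions: Dict[str, Vector],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted sum of per-modality MSE terms (assumed form, not the paper's)."""
    total = 0.0
    for name, x in inputs.items():
        # Modalities without an explicit weight default to 1.0.
        w = 1.0 if weights is None else weights.get(name, 1.0)
        total += w * mse(x, reconstructions[name])
    return total


# Toy aligned triplet: text and audio are reconstructed exactly,
# the visual vector is off by 1.0 in one of its two coordinates.
inputs = {"text": [1.0, 0.0], "audio": [0.5, 0.5, 0.0], "visual": [0.0, 2.0]}
recons = {"text": [1.0, 0.0], "audio": [0.5, 0.5, 0.0], "visual": [0.0, 1.0]}

loss = joint_reconstruction_loss(inputs, recons)
weighted_loss = joint_reconstruction_loss(inputs, recons, {"visual": 2.0})
```

In an actual autoencoder the reconstructions would come from per-modality decoders applied to a shared latent code, and this scalar would be minimized by gradient descent; the sketch only shows how the per-modality terms might be combined.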

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Published: 2025-11-17 19:13:51 UTC

2511.17596
