MDC3: A Novel Multimodal Dataset for Commercial Content Classification in Bengali

2025.naacl-srw.31@ACL

Authors: Anik Mahmud Shanto, Mst. Sanjida Jamal Priya, Fahim Shakil Tamim, Mohammed Moshiul Hoque

Identifying commercial posts among diverse, unstructured content in resource-constrained languages remains a significant challenge for automatic text classification. To address this, this work introduces MDC3 (Multimodal Dataset for Commercial Content Classification), a novel dataset of 5,007 annotated Bengali social media posts, each labeled as commercial or noncommercial. A comprehensive annotation guideline accompanies the dataset to aid future dataset creation in resource-constrained languages. We also performed extensive experiments on MDC3 in both unimodal and multimodal settings. Specifically, the late fusion of a textual model (mBERT) and a visual model (ViT), i.e., ViT+mBERT, achieves the highest F1 score of 90.91, significantly surpassing the other baselines.
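The abstract names late fusion of text and image models but does not state the fusion rule. As a minimal sketch, assuming each model emits two-class logits and that fusion is a weighted average of the per-model softmax probabilities, the idea might look like the following. The function names, the label order, and the `text_weight` parameter are hypothetical and are not taken from the paper.

```python
import math

# Hypothetical label order; the paper's actual encoding is not given in the abstract.
LABELS = ["noncommercial", "commercial"]


def softmax(logits):
    """Convert raw logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(text_logits, image_logits, text_weight=0.5):
    """Fuse two models at the decision level.

    Assumption: fusion is a weighted average of class probabilities.
    text_weight is an illustrative knob, not a value from the paper.
    """
    p_text = softmax(text_logits)
    p_image = softmax(image_logits)
    return [
        text_weight * t + (1.0 - text_weight) * i
        for t, i in zip(p_text, p_image)
    ]


def predict(text_logits, image_logits, text_weight=0.5):
    """Return the label with the highest fused probability."""
    fused = late_fusion(text_logits, image_logits, text_weight)
    best = max(range(len(fused)), key=fused.__getitem__)
    return LABELS[best]


if __name__ == "__main__":
    # Made-up logits: the text model leans noncommercial,
    # the image model leans commercial more strongly.
    print(predict([2.0, 0.0], [0.0, 3.0]))
```

In this sketch each model is trained and run independently and only their output distributions are combined, which is what distinguishes late fusion from approaches that merge features before classification.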

Subject: NAACL.2025 - Student Research Workshop