Extending Compositional Attention Networks for Social Reasoning in Videos

Authors: Christina Sartzetaki, Georgios Paraskevopoulos, Alexandros Potamianos

We propose a novel deep architecture for reasoning about social interactions in videos. We leverage the multistep reasoning capabilities of Compositional Attention Networks (MAC) [1] and propose a multimodal extension (MAC-X). MAC-X is based on a recurrent cell that performs iterative mid-level fusion of the input modalities (visual, auditory, text) over multiple reasoning steps, using a temporal attention mechanism. We then combine MAC-X with LSTMs for temporal input processing in an end-to-end architecture. Our ablation studies show that the proposed MAC-X architecture can effectively leverage multimodal input cues using mid-level fusion mechanisms. We apply MAC-X to the task of Social Video Question Answering on the Social IQ dataset and obtain a 2.5% absolute improvement in binary accuracy over the current state of the art.

Index Terms: Video Question Answering, Social Reasoning, Compositional Attention Networks, MAC
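
As a rough illustration of the idea in the abstract (temporal attention over each modality, followed by mid-level fusion, repeated over several reasoning steps), the following is a minimal pure-Python sketch. It is not the authors' MAC-X cell: the dot-product scoring, the element-wise mean used in `fuse`, the additive query, and the 0.5/0.5 memory update are all assumptions chosen for brevity, and every function name here is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def temporal_attention(query, sequence):
    """Attend over the time steps of one modality.

    Returns the attention-weighted summary vector and the weights.
    """
    weights = softmax([dot(query, frame) for frame in sequence])
    dim = len(sequence[0])
    summary = [
        sum(w * frame[d] for w, frame in zip(weights, sequence))
        for d in range(dim)
    ]
    return summary, weights


def fuse(summaries):
    """Assumed fusion rule: element-wise mean of per-modality summaries."""
    n = len(summaries)
    return [sum(vals) / n for vals in zip(*summaries)]


def reason(question, modalities, steps=3):
    """Run several reasoning steps, fusing modalities at every step."""
    memory = list(question)
    for _ in range(steps):
        # Assumed control signal: question plus current memory.
        query = [q + m for q, m in zip(question, memory)]
        summaries = [
            temporal_attention(query, seq)[0] for seq in modalities.values()
        ]
        fused = fuse(summaries)
        # Assumed memory update: equal blend of old memory and fused input.
        memory = [0.5 * m + 0.5 * f for m, f in zip(memory, fused)]
    return memory
```

Calling `reason(question, {"visual": v, "audio": a, "text": t})` with equal-dimension feature sequences returns a memory vector of the same dimension as `question`. The point of the sketch is only the control flow: fusion happens inside the loop at every reasoning step, rather than once before or after it.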

Subject: INTERSPEECH.2022 - Language and Multimodal

sartzetaki22@interspeech_2022@ISCA
