Combine to Describe: Evaluating Compositional Generalization in Image Captioning

Authors: George Pantazopoulos ; Alessandro Suglia ; Arash Eshghi

Compositionality, the ability to combine simpler concepts to understand and generate arbitrarily more complex conceptual structures, has long been thought to be the cornerstone of human language capacity. With the recent, notable success of neural models in various NLP tasks, attention has now naturally turned to the compositional capacity of these models. In this paper, we study the compositional generalization properties of image captioning models. We perform a set of experiments under controlled conditions using model and data ablations, each designed to benchmark a particular facet of compositional generalization: systematicity is the ability of a model to create novel combinations of concepts out of those observed during training, productivity is here operationalised as the capacity of a model to extend its predictions beyond the length distribution it has observed during training, and substitutivity is concerned with the robustness of the model against synonym substitutions. While previous work has focused primarily on systematicity, here we provide a more in-depth analysis of the strengths and weaknesses of state-of-the-art captioning models. Our findings demonstrate that the models we study here do not compositionally generalize in terms of systematicity and productivity; however, they are robust, to some degree, to synonym substitutions.

2022.acl-srw.11@ACL
