Adapting Vision-Language Models for E-commerce Understanding at Scale

Authors: Matteo Nulli, Vladimir Orshulevich, Tala Bazazo, Christian Herold, Michael Kozielski, Marcin Mazur, Szymon Tuzel, Cees G. M. Snoek, Seyyed Hadi Hashemi, Omar Javed, Yannick Versley, Shahram Khadivi

E-commerce product understanding by its nature demands strong multimodal comprehension of text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is no well-documented strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data without sacrificing general performance. In this work, we show, through a large-scale experimental study, how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel, extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Published: 2026-02-12 08:59:22 UTC

2602.11733
