fUvuEfZSEE@OpenReview

Total: 1

#1 Textural or Textual: How Vision-Language Models Read Text in Images

Authors: Hanzhang Wang, Qingyuan Ma

Typographic attacks are often attributed to the ability of multimodal pre-trained models to fuse textual semantics into visual representations, yet the mechanisms and locus of such interference remain unclear. We examine whether such models genuinely encode textual semantics or primarily rely on texture-based visual features. To disentangle orthographic form from meaning, we introduce the ToT dataset, which contains controlled word pairs that either share meaning but differ in appearance (synonyms) or share appearance but differ in meaning (paronyms). A layer-wise analysis of Intrinsic Dimension (ID) reveals that early layers exhibit competing dynamics between orthographic and semantic representations. In later layers, semantic accuracy increases as ID decreases, but this improvement largely stems from orthographic disambiguation. Notably, clear semantic differentiation emerges only in the final block, challenging the common assumption that semantic understanding is progressively constructed across depth. These findings reveal how current vision-language models construct text representations through texture-dependent processes, prompting a reconsideration of the gap between visual perception and semantic understanding. The code is available at: https://github.com/Ovsia/Textural-or-Textual
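
The layer-wise Intrinsic Dimension analysis described in the abstract can be illustrated with a minimal sketch. The snippet below uses the TwoNN estimator as one common way to estimate ID; the estimator choice, the random per-layer activations, and the layer count are illustrative assumptions, not the paper's method or released code (see the linked repository for the authors' implementation).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def two_nn_id(features: np.ndarray) -> float:
    """Estimate intrinsic dimension with the TwoNN estimator (Facco et al., 2017).

    features: (n_samples, n_dims) array of activations from one layer,
    one row per image.
    """
    # Distances to the two nearest neighbours (column 0 is the point itself).
    nn = NearestNeighbors(n_neighbors=3).fit(features)
    dists, _ = nn.kneighbors(features)
    r1, r2 = dists[:, 1], dists[:, 2]
    mu = r2 / r1
    # Maximum-likelihood estimate: d = N / sum(log mu_i).
    return len(mu) / np.sum(np.log(mu))

# Toy usage: replace the random matrices with per-layer hidden states
# extracted from a CLIP-like vision tower over the dataset images
# (12 layers and 768 dimensions here are hypothetical placeholders).
rng = np.random.default_rng(0)
layer_activations = [rng.normal(size=(500, 768)) for _ in range(12)]
id_profile = [two_nn_id(acts) for acts in layer_activations]
print([round(d, 1) for d in id_profile])
```

Plotting such an ID profile against depth is what would expose the early-layer competition and late-layer compression the abstract refers to.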

Subject: ICML.2025 - Poster