
CoTMR: Chain-of-Thought Multi-Scale Reasoning for Training-Free Zero-Shot Composed Image Retrieval

Authors: Zelong Sun, Dong Jing, Zhiwu Lu

Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images by integrating information from a composed query (a reference image and a modification text) without any training samples. Existing methods primarily combine caption models and Large Language Models (LLMs) to generate target captions from composed queries, but they face issues such as modality incompatibility, visual information loss, and insufficient reasoning. In this work, we propose CoTMR, a training-free framework with novel Chain-of-Thought (CoT) and Multi-Scale Reasoning. Instead of relying on caption models for modality transformation, CoTMR directly employs a Large Vision-Language Model (LVLM) to achieve unified understanding and reasoning over composed queries. To enhance reasoning reliability, we devise CIRCoT, which guides the LVLM to perform step-by-step reasoning by following predefined subtasks. Additionally, while most existing approaches focus solely on global-level reasoning, CoTMR introduces fine-grained predictions about the presence or absence of key elements at the object scale, enabling more comprehensive reasoning. Furthermore, we design a Multi-Grained Scoring (MGS) mechanism, which integrates CLIP similarity scores between the above reasoning outputs and candidate images to realize precise retrieval. Extensive experiments demonstrate that CoTMR not only substantially outperforms previous methods across four prominent benchmarks but also offers appealing interpretability.
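The abstract describes combining a global-scale score (similarity between the reasoned target caption and each candidate image) with object-scale scores (signed similarities for key elements predicted present or absent). The paper does not give the exact formula here, so the following is only a minimal sketch of how such a multi-grained combination could look, assuming precomputed CLIP embeddings; the function name, the signed-mean aggregation, and the `alpha` weight are all hypothetical illustrations, not the authors' actual MGS mechanism.

```python
import numpy as np

def normalize(x):
    # L2-normalize along the last axis so dot products become cosine similarities
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def multi_grained_score(cand_embs, caption_emb, element_embs, element_signs, alpha=0.7):
    """Hypothetical multi-grained scoring sketch.

    cand_embs     : (N, D) CLIP embeddings of candidate images
    caption_emb   : (D,)   CLIP embedding of the reasoned target caption (global scale)
    element_embs  : (M, D) CLIP embeddings of key-element phrases (object scale)
    element_signs : (M,)   +1 if the element should be present, -1 if it should be absent
    alpha         : weight on the global-scale score (assumed hyperparameter)
    """
    cand = normalize(cand_embs)
    # Global scale: cosine similarity to the target caption
    global_score = cand @ normalize(caption_emb[None, :])[0]          # (N,)
    # Object scale: reward present elements, penalize absent ones
    elem_scores = cand @ normalize(element_embs).T                    # (N, M)
    object_score = (elem_scores * element_signs).mean(axis=1)         # (N,)
    return alpha * global_score + (1 - alpha) * object_score
```

A candidate that matches the target caption and contains the required elements (while lacking the forbidden ones) receives the highest combined score; ranking candidates by this score yields the retrieval list.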

Subject: ICCV.2025 - Poster