
#1 ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning

Authors: Jiaqi Liao, Zhengyuan Yang, Linjie Li, Dianqi Li, Kevin Lin, Yu Cheng, Lijuan Wang

In this work, we study the problem of Text-to-Image In-Context Learning (T2I-ICL). While Unified Multimodal LLMs (MLLMs) have advanced rapidly in recent years, they struggle with contextual reasoning in T2I-ICL scenarios. To address this limitation, we propose a novel framework that inserts a reasoning chain, termed ImageGen-CoT, before image generation. To avoid generating ineffective reasoning steps, we develop an automatic pipeline to curate a high-quality ImageGen-CoT dataset. We then fine-tune MLLMs on this dataset to enhance their contextual reasoning capabilities. To further improve performance, we explore test-time scaling strategies and propose a novel hybrid scaling approach, which first generates multiple reasoning chains and then produces multiple images for each chain via sampling. Extensive experiments demonstrate the effectiveness of the proposed method. Notably, fine-tuning with the ImageGen-CoT dataset yields a substantial 80% performance gain for SEED-X on T2I-ICL tasks. See our project page at https://ImageGen-CoT.github.io/. Code will be open-sourced.
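The hybrid scaling approach can be pictured as two nested sampling loops: sample several ImageGen-CoT reasoning chains, then sample several images conditioned on each chain, and keep the best candidate. The Python sketch below is only a conceptual illustration under that reading; the callables (`sample_cot`, `sample_image`, `score`) and the best-of-N selection rule are hypothetical placeholders, not the authors' released API.

```python
# Minimal sketch of hybrid test-time scaling as described in the abstract:
# (1) sample multiple reasoning chains, (2) sample multiple images per chain,
# then pick the highest-scoring candidate. All hooks are hypothetical.
from typing import Callable, List, Tuple


def hybrid_scale(
    prompt: str,
    in_context_examples: List[str],
    sample_cot: Callable[[str, List[str]], str],   # MLLM call returning one ImageGen-CoT chain
    sample_image: Callable[[str, str], object],    # generator call: (prompt, chain) -> image
    score: Callable[[str, object], float],         # quality/reward scorer: higher is better
    num_chains: int = 4,
    images_per_chain: int = 4,
) -> Tuple[str, object]:
    """Return the (reasoning chain, image) pair with the highest score."""
    best_score = float("-inf")
    best_chain, best_image = "", None
    for _ in range(num_chains):
        chain = sample_cot(prompt, in_context_examples)    # step 1: sample a reasoning chain
        for _ in range(images_per_chain):
            image = sample_image(prompt, chain)            # step 2: sample an image conditioned on the chain
            s = score(prompt, image)
            if s > best_score:
                best_score, best_chain, best_image = s, chain, image
    return best_chain, best_image
```

Note that the final selection criterion is an assumption made for this sketch; the abstract specifies only the two-stage sampling, not how the final output is chosen among the candidates.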

Subject: ICCV.2025 - Poster