Hu_Instruct-Imagen_Image_Generation_with_Multi-modal_Instruction@CVPR2024@CVF

Total: 1

#1 Instruct-Imagen: Image Generation with Multi-modal Instruction [PDF154] [Copy] [Kimi143] [REL]

Authors: Hexiang Hu, Kelvin C.K. Chan, Yu-Chuan Su, Wenhu Chen, Yandong Li, Kihyuk Sohn, Yang Zhao, Xue Ben, William Cohen, Ming-Wei Chang, Xuhui Jia

This paper presents Instruct-Imagen, a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks.We introduce multi-modal instruction for image generation, a task representation articulating a range of generation intents with precision.It uses natural language to amalgamate disparate modalities (e.g., text, edge, style, subject, \etc), such that abundant generation intents can be standardized in a uniform format.We then build Instruct-Imagen by fine-tuning a pre-trained text-to-image diffusion model with two stages. First, we adapt the model using the retrieval-augmented training, to enhance model's capabilities to ground its generation on external multi-modal context.Subsequently, we fine-tune the adapted model on diverse image generation tasks that requires vision-language understanding (e.g., subject-driven generation, etc.), each paired with a multi-modal instruction encapsulating the task's essence. Human evaluation on various image generation datasets reveals that Instruct-Imagen matches or surpasses prior task-specific models in-domain and demonstrates promising generalization to unseen and more complex tasks. Our evaluation suite will be made publicly available.

Subjects: CVPR.2024 - Oral, CVPR.2024 - Highlight