hUHRTaTfvZ@OpenReview

Total: 1

#1 Learning Vision and Language Concepts for Controllable Image Generation

Authors: Shaoan Xie, Lingjing Kong, Yujia Zheng, Zeyu Tang, Eric Xing, Guangyi Chen, Kun Zhang

Concept learning seeks to extract semantic and interpretable representations of atomic concepts from high-dimensional data such as images and text, which can be instrumental to a variety of downstream tasks (e.g., image generation/editing). Despite its importance, the theoretical foundations for learning atomic concepts and their interactions, especially from multimodal distributions, remain underexplored. In this work, we establish fundamental conditions for learning atomic multimodal concepts and their underlying interactions with identifiability guarantees. We formulate concept learning as a latent variable identification problem, representing atomic concepts in each modality as latent variables, with a graphical model to specify their interactions across modalities. Our theoretical contribution is to provide component-wise identifiability of atomic concepts under flexible, nonparametric conditions that accommodate both continuous and discrete modalities. Building on these theoretical insights, we demonstrate the practical utility of our theory in a downstream task: text-to-image (T2I) generation. We develop a principled T2I model that explicitly learns atomic textual and visual concepts with sparse connections between them, allowing us to achieve image generation and editing at the atomic concept level. Empirical evaluations show that our model outperforms existing methods in T2I generation tasks, offering superior controllability and interpretability.
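A minimal sketch of the latent-variable formulation described in the abstract, written with hypothetical notation (textual concepts z^t, visual concepts z^v, a sparse binary matrix A for their cross-modal interactions, and decoders g_t, g_v); the paper's actual notation, graphical model, and identifiability conditions may differ:

% Hypothetical notation, not the paper's: z^t = atomic textual concepts,
% z^v = atomic visual concepts, A = sparse binary matrix of cross-modal edges,
% g_t / g_v = modality-specific generating functions.
\begin{align*}
  z^t &= (z^t_1, \dots, z^t_m), \qquad z^v = (z^v_1, \dots, z^v_n), \\
  z^v_j &\sim p\!\left(z^v_j \,\middle|\, \{\, z^t_i : A_{ij} = 1 \,\}\right),
      \qquad A \in \{0,1\}^{m \times n} \ \text{sparse}, \\
  x_{\text{text}} &= g_t(z^t), \qquad x_{\text{image}} = g_v(z^v).
\end{align*}
% Component-wise identifiability would mean each latent concept is recovered from
% the observed pair (x_text, x_image) up to an invertible element-wise transformation.

Under such a formulation, editing an image at the atomic-concept level would amount to intervening on a single z^v_j (or the textual concepts connected to it through A) while holding the remaining latents fixed.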

Subject: ICML.2025 - Poster