Recent progress in diffusion models has significantly advanced image generation; however, existing models remain task-specific, limiting their efficiency and generalizability. While universal models attempt to address these limitations, they face critical challenges, including generalizable instruction design, appropriate task distributions, and unified architectural design. In this work, we propose VisualCloze, a universal image generation framework, to tackle these challenges. Unlike existing methods that rely on language-based task descriptions, leading to task ambiguity and weak generalization, we integrate visual in-context learning, allowing models to identify tasks from demonstrations. Meanwhile, the inherent sparsity of visual task distributions hampers the learning of transferable knowledge across tasks. To this end, we introduce Graph200K, a graph-structured dataset that establishes various interrelated tasks, enhancing task density and knowledge transfer. Furthermore, we uncover an intrinsic alignment between image infilling and in-context learning, enabling us to leverage the strong generative priors of pre-trained infilling models without modifying their architectures. Experiments demonstrate that VisualCloze achieves strong performance across more than 100 in-domain tasks while generalizing to unseen tasks in few-shot and zero-shot settings. Our code and dataset are available at https://visualcloze.github.io/.
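As a rough illustration of how visual in-context learning can be cast as image infilling, the sketch below composes demonstration (condition, target) pairs and a query condition into a single grid image and masks only the missing output cell, which a pretrained infilling model could then complete. The function name, two-column layout, and tile size are illustrative assumptions and not the exact VisualCloze implementation.

```python
import numpy as np

def build_incontext_infilling_query(demo_pairs, query_image, tile=512):
    """Arrange in-context demonstrations and a query as one grid image plus a
    binary mask, so a pretrained image-infilling model can complete the
    missing output cell. Each demonstration is a (condition, target) pair of
    H x W x 3 uint8 arrays; query_image is the condition whose output we want.
    Layout and tile size are illustrative assumptions."""
    rows = len(demo_pairs) + 1           # one row per demonstration + one query row
    grid = np.zeros((rows * tile, 2 * tile, 3), dtype=np.uint8)
    mask = np.zeros((rows * tile, 2 * tile), dtype=np.uint8)

    def paste(img, r, c):
        grid[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile] = resize_to(img, tile)

    for r, (cond, target) in enumerate(demo_pairs):
        paste(cond, r, 0)                # left column: task inputs
        paste(target, r, 1)              # right column: task outputs
    paste(query_image, rows - 1, 0)      # last row: query condition only
    mask[(rows - 1) * tile:, tile:] = 1  # mark the missing output cell for infilling

    return grid, mask

def resize_to(img, tile):
    """Nearest-neighbour resize to tile x tile (illustrative only)."""
    h, w = img.shape[:2]
    ys = np.arange(tile) * h // tile
    xs = np.arange(tile) * w // tile
    return img[ys][:, xs]
```

The grid and mask could then be handed to any pretrained infilling model, which is the sense in which in-context generation reuses generative priors without architectural changes.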