Text2Outfit: Controllable Outfit Generation with Multimodal Language Models

Authors: Yuanhao Zhai, Yen-Liang Lin, Minxu Peng, Larry S. Davis, Ashwin Chandramouli, Junsong Yuan, David Doermann

Existing outfit recommendation frameworks focus on outfit compatibility prediction and complementary item retrieval. We present Text2Outfit, a text-driven outfit generation framework that generates outfits controlled by text prompts. Our framework supports two forms of outfit recommendation: 1) text-to-outfit generation, where the prompt specifies each outfit item (e.g., product features), and the model retrieves items that match the prompt and are stylistically compatible; and 2) seed-to-outfit generation, where the prompt specifies a seed item, and the model both predicts which product types the outfit should include (referred to as composition generation) and retrieves the remaining items to build the outfit. We develop a large language model (LLM) framework that learns the cross-modal mapping between text and image sets and predicts a set of embeddings and compositions to retrieve outfit items. We devise an attention masking mechanism in the LLM to handle the alignment between text descriptions and image tokens from different categories. We conduct experiments on the Polyvore dataset and evaluate the quality of the generated outfits from two perspectives: 1) feature matching for outfit items, and 2) outfit visual compatibility. The results demonstrate that our approach significantly outperforms baseline methods in text-to-outfit generation.
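The abstract only sketches the attention masking mechanism. Below is a minimal, hypothetical illustration of one plausible form it could take: a category-aware mask in which the tokens of each product category (its text description and its image tokens) attend only to one another and to the shared prompt. This is a sketch under stated assumptions, not the authors' implementation; the function name, tensor layout, and category encoding are all assumptions.

```python
# Hypothetical sketch of a category-aware attention mask (not the authors' code).
import torch

def category_attention_mask(token_categories: torch.Tensor) -> torch.Tensor:
    """token_categories: (seq_len,) long tensor.
    0 marks shared prompt tokens; k > 0 marks tokens (text span or image tokens)
    belonging to product category k.
    Returns a (seq_len, seq_len) bool mask where True means attention is allowed."""
    q = token_categories.unsqueeze(1)            # (seq_len, 1) query categories
    k = token_categories.unsqueeze(0)            # (1, seq_len) key categories
    same_category = q == k                       # tokens only mix within their own category
    key_is_shared = (k == 0).expand_as(same_category)  # every token may read the shared prompt
    return same_category | key_is_shared

# Example: shared prompt (0, 0), "top" tokens (1, 1, 1), "shoes" tokens (2, 2)
mask = category_attention_mask(torch.tensor([0, 0, 1, 1, 1, 2, 2]))
```

In practice such a mask would be combined with the model's causal mask and converted to additive form (0 / -inf) before being applied inside the attention layers.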

Subject: ICCV.2025 - Poster