Segment Anyword: Mask Prompt Inversion for Open-Set Grounded Segmentation

Authors: Zhihua Liu, Amrutha Saseendran, Lei Tong, Xilin He, Fariba Yousefi, Nikolay Burlutskiy, Dino Oglic, Tom Diethe, Philip Teare, Huiyu Zhou, Chen Jin

Open-set image segmentation poses a significant challenge because existing methods often demand extensive training or fine-tuning and generally struggle to segment unified objects consistently across diverse text reference expressions. Motivated by this, we propose Segment Anyword, a novel training-free visual concept prompt learning approach for open-set language-grounded segmentation that relies on token-level cross-attention maps from a frozen diffusion model to produce segmentation surrogates, or *mask prompts*, which are then refined into targeted object masks. Initial prompts typically lack coherence and consistency as the complexity of the image-text pair increases, resulting in suboptimal mask fragments. To tackle this issue, we further introduce a novel linguistic-guided visual prompt regularization that binds and clusters visual prompts based on sentence dependency and syntactic structural information, enabling the extraction of robust, noise-tolerant mask prompts and yielding significant improvements in segmentation accuracy. The proposed approach is effective, generalizes across different open-set segmentation tasks, and achieves state-of-the-art results of 52.5 (+6.8 relative) mIoU on Pascal Context 59, 67.73 (+25.73 relative) cIoU on gRefCOCO, and 67.4 (+1.1 relative to fine-tuned methods) mIoU on GranDf, the most complex open-set grounded segmentation task in the field.
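The step from token-level cross-attention maps to mask prompts can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_prompts_from_attention`, the mean-based merging of tokens, the min-max normalization, and the fixed threshold are all assumptions made for illustration. The token groups are supplied by hand here, whereas the abstract describes deriving them from sentence dependency and syntactic structure.

```python
import numpy as np

def mask_prompts_from_attention(attn_maps, token_groups, threshold=0.5):
    """Turn per-token attention maps into coarse binary mask prompts.

    attn_maps:    array of shape (num_tokens, H, W), one map per text token
                  (assumed to come from a frozen diffusion model).
    token_groups: dict mapping a phrase to the indices of its tokens
                  (hypothetical stand-in for dependency-based binding).
    threshold:    cut-off applied after min-max normalization (assumed value).
    """
    prompts = {}
    for phrase, token_ids in token_groups.items():
        # Bind tokens of the same phrase by averaging their maps,
        # which damps noise that appears in only one token's map.
        merged = attn_maps[token_ids].mean(axis=0)
        # Rescale to [0, 1] so one threshold works across phrases.
        lo, hi = merged.min(), merged.max()
        normalized = (merged - lo) / (hi - lo + 1e-8)
        prompts[phrase] = normalized >= threshold
    return prompts
```

In this sketch the resulting boolean arrays play the role of coarse mask prompts; a separate refinement stage, not shown, would be needed to turn them into final object masks.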

Subject: ICML.2025 - Poster

9bzgpYtQZn@OpenReview
