33091@AAAI

Total: 1

#1 Visual Perturbation for Text-Based Person Search

Authors: Pengcheng Zhang, Xiaohan Yu, Xiao Bai, Jin Zheng

Text-based person search (TBPS) aims to locate a person described in natural language within uncropped scene images. Recent TBPS works focus mainly on aligning multi-granularity vision and language representations, neglecting a key discrepancy between training and inference: during training, the model learns to unify vision and language features where the visual side covers all clues described by the language, whereas at inference it must match image-text pairs in which the images may capture only part of the described clues due to perturbations such as occlusions, background clutter, and misaligned boundaries. To alleviate this issue, we present ViPer, a Visual Perturbation network that learns to match language descriptions with perturbed visual clues. On top of a CLIP-driven baseline, we design three visual perturbation modules: (1) Spatial ViPer, which varies person proposals to produce visual features with misaligned boundaries; (2) Attentive ViPer, which estimates visual attention on the fly and manipulates attentive visual tokens within a proposal to produce global features under visual perturbations; and (3) Fine-grained ViPer, which learns to recover masked visual clues from detailed language descriptions, encouraging the model to match language features with perturbed visual features at fine granularity. The overall framework thus simulates real-world scenarios at training time, minimizing the train-inference discrepancy and improving the generalization ability of the model. Experimental results demonstrate that the proposed method clearly surpasses previous TBPS methods on the PRW-TBPS and CUHK-SYSU-TBPS datasets.
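
The following is a minimal sketch of the three perturbation ideas the abstract describes, not the paper's actual implementation. All function names, shapes, and ratios here are illustrative assumptions; the real ViPer modules operate inside a CLIP-driven TBPS pipeline with learned components.

```python
# Illustrative sketch only: hypothetical helpers mimicking the three
# perturbation strategies (spatial, attentive, fine-grained). Shapes and
# hyperparameters are assumptions, not values from the paper.
import torch

def spatial_perturb(box, jitter=0.1):
    """Spatial ViPer idea: jitter a person proposal (x1, y1, x2, y2)
    so that cropped visual features have misaligned boundaries."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    noise = (torch.rand(4) * 2 - 1) * jitter  # uniform in [-jitter, jitter]
    return torch.tensor([x1 + noise[0] * w, y1 + noise[1] * h,
                         x2 + noise[2] * w, y2 + noise[3] * h])

def attentive_perturb(tokens, attn, drop_ratio=0.3):
    """Attentive ViPer idea: suppress the most attended visual tokens
    within a proposal, simulating occlusion of salient clues.
    tokens: (N, D) visual tokens; attn: (N,) attention scores."""
    k = max(1, int(drop_ratio * tokens.size(0)))
    top = attn.topk(k).indices        # most informative tokens
    perturbed = tokens.clone()
    perturbed[top] = 0.0              # zero them out (occlusion-like)
    return perturbed

def fine_grained_mask(tokens, mask_ratio=0.25):
    """Fine-grained ViPer idea: randomly mask visual tokens; a decoder
    would then be trained to recover them from the paired description."""
    masked = torch.rand(tokens.size(0)) < mask_ratio
    out = tokens.clone()
    out[masked] = 0.0
    return out, masked

if __name__ == "__main__":
    tokens = torch.randn(49, 512)     # e.g. a 7x7 grid of CLIP-like tokens
    attn = torch.rand(49)
    box = torch.tensor([10.0, 20.0, 110.0, 220.0])
    print(spatial_perturb(box))
    print(attentive_perturb(tokens, attn).shape)
    out, mask = fine_grained_mask(tokens)
    print(out.shape, int(mask.sum()), "tokens masked")
```

In this reading, all three helpers inject train-time perturbations so the text encoder's features are matched against incomplete visual evidence, which is the train-inference gap the abstract targets.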

Subject: AAAI.2025 - Computer Vision