VOVTrack: Exploring the Potentiality in Raw Videos for Open-Vocabulary Multi-Object Tracking (ICCV 2025, CVF)

Total: 1

#1 VOVTrack: Exploring the Potentiality in Raw Videos for Open-Vocabulary Multi-Object Tracking [PDF]

Authors: Zekun Qian, Ruize Han, Junhui Hou, Linqi Song, Wei Feng

Open-vocabulary multi-object tracking (OVMOT) is a new and challenging task that involves detecting and tracking objects of diverse categories in videos, covering both seen categories (base classes) and unseen categories (novel classes). It combines the difficulties of open-vocabulary object detection (OVD) and multi-object tracking (MOT). Existing OVMOT approaches often combine OVD and MOT methods as separate modules, which does not fully exploit the information available in video. In this work, we propose VOVTrack, a novel method that addresses this challenge from a video analysis standpoint by integrating object states relevant to MOT with video-centric training. First, we consider the tracking-related states of objects during tracking and propose a new prompt-guided attention mechanism for more accurate detection (localization and classification) of time-varying objects. Second, we leverage raw video data without annotations for training by formulating a self-supervised object similarity learning technique that facilitates temporal object tracking (association). Experimental results show that VOVTrack is a state-of-the-art solution for the open-vocabulary tracking task.
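The abstract does not spell out how the self-supervised object similarity learning on raw videos is formulated. A common way to realize such an objective is a cross-frame contrastive loss that pulls embeddings of the same object in nearby frames together and pushes apart embeddings of different objects. The PyTorch sketch below illustrates that idea only: the function name, the row-wise pseudo-pair convention (e.g., pairs obtained by high-confidence matching between frames), the symmetric InfoNCE form, and the temperature value are all illustrative assumptions, not VOVTrack's actual formulation.

```python
import torch
import torch.nn.functional as F

def self_supervised_similarity_loss(emb_t: torch.Tensor,
                                    emb_tk: torch.Tensor,
                                    temperature: float = 0.1) -> torch.Tensor:
    """Cross-frame contrastive association loss (illustrative sketch).

    emb_t:  (N, D) appearance embeddings of N detections at frame t.
    emb_tk: (N, D) embeddings at a later frame t+k, where row i is
            assumed to be the pseudo-positive pair of row i in emb_t
            (e.g., from high-confidence matching; this pairing rule is
            an assumption, not taken from the paper).
    """
    z_t = F.normalize(emb_t, dim=1)        # unit-norm embeddings
    z_tk = F.normalize(emb_tk, dim=1)
    logits = z_t @ z_tk.T / temperature    # (N, N) pairwise similarities
    targets = torch.arange(z_t.size(0), device=z_t.device)
    # Symmetric InfoNCE: each detection should be most similar to its
    # own pseudo-pair in the other frame, in both matching directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

# Toy usage: 8 detections with 256-dim embeddings; the "later frame"
# embeddings are a slightly perturbed copy, so the loss should be small.
emb_t = torch.randn(8, 256)
emb_tk = emb_t + 0.05 * torch.randn(8, 256)
print(self_supervised_similarity_loss(emb_t, emb_tk))
```

A loss of this kind needs only detections from the same raw video plus a pairing heuristic, so it requires no category or track annotations, which is consistent with the abstract's claim of training the association component on unlabeled video.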

Subject: ICCV.2025 - Poster