Multi-Object Tracking Retrieval with LLaVA-Video: A Training-Free Solution to MOT25-StAG Challenge

2511.03332

Total: 1

#1 Multi-Object Tracking Retrieval with LLaVA-Video: A Training-Free Solution to MOT25-StAG Challenge [PDF³] [Copy] [Kimi¹] [REL]

Authors: Yi Yang, Yiming Xu, Timo Kaiser, Hao Cheng, Bodo Rosenhahn, Michael Ying Yang

In this report, we present our solution to the MOT25-Spatiotemporal Action Grounding (MOT25-StAG) Challenge. The aim of this challenge is to accurately localize and track multiple objects that match specific and free-form language queries, using video data of complex real-world scenes as input. We model the underlying task as a video retrieval problem and present a two-stage, zero-shot approach, combining the advantages of the SOTA tracking model FastTracker and Multi-modal Large Language Model LLaVA-Video. On the MOT25-StAG test set, our method achieves m-HIoU and HOTA scores of 20.68 and 10.73 respectively, which won second place in the challenge.

Subject: Computer Vision and Pattern Recognition

Publish: 2025-11-05 10:01:31 UTC

2511.03332

#1 Multi-Object Tracking Retrieval with LLaVA-Video: A Training-Free Solution to MOT25-StAG Challenge [PDF3] [Copy] [Kimi1] [REL]

#1 Multi-Object Tracking Retrieval with LLaVA-Video: A Training-Free Solution to MOT25-StAG Challenge [PDF³] [Copy] [Kimi¹] [REL]