DetAny4D: Detect Anything 4D Temporally in a Streaming RGB Video

Authors: Jiawei Hou, Shenghao Zhang, Can Wang, Zheng Gu, Yonggen Ling, Taiping Zeng, Xiangyang Xue, Jingbo Zhang

Reliable 4D object detection, which refers to 3D object detection in streaming video, is crucial for perceiving and understanding the real world. Existing open-set 4D object detection methods typically make predictions on a frame-by-frame basis without modeling temporal consistency, or rely on complex multi-stage pipelines that are prone to error propagation across cascaded stages. Progress in this area has been hindered by the lack of large-scale datasets with continuous, reliable 3D bounding box (b-box) annotations. To overcome these challenges, we first introduce DA4D, a large-scale 4D detection dataset containing over 280k sequences with high-quality b-box annotations collected under diverse conditions. Building on DA4D, we propose DetAny4D, an open-set end-to-end framework that predicts 3D b-boxes directly from sequential inputs. DetAny4D fuses multi-modal features from pre-trained foundation models and uses a geometry-aware spatiotemporal decoder to capture both spatial and temporal dynamics. Furthermore, it adopts a multi-task learning architecture coupled with a dedicated training strategy to maintain global consistency across sequences of varying lengths. Extensive experiments show that DetAny4D achieves competitive detection accuracy and significantly improves temporal stability, addressing long-standing issues of jitter and inconsistency in 4D object detection. Data and code will be released upon acceptance.

Subject: Computer Vision and Pattern Recognition

Published: 2025-11-24 06:42:17 UTC

2511.18814
