2606.31421


Temporal Preservation over Processing: Diagnosing and Designing Spatiotemporal Single-Stage Video Detectors

Authors: Karam Tomotaki-Dawoud, Anna Hilsmann, Peter Eisert, Sebastian Bosse

Single-stage video object detectors are increasingly deployed in time-critical applications, yet it remains unclear whether these models genuinely reason over temporal context or merely exploit a single informative frame. Standard metrics hide this gap because they reward correct predictions regardless of how they are reached. We address the question from two complementary directions. First, we propose TemporalLens, a model-agnostic diagnostic framework probing temporal dependence through controlled perturbations, structured occlusions, temporal shuffling, redundancy injection, and resolution degradation, revealing whether a detector actually uses information across time. Applied to stacked-frame 2D detectors and our YOLO-3D architecture, it exposes behavioural differences invisible to mAP: stacked 2D models collapse when the target frame is removed, while spatiotemporal models recover predictions from earlier frames, a signature of real temporal reliance. Second, we detail YOLO-3D, a modular real-time spatiotemporal detector built on YOLOv8, and show that simply preserving temporal depth through the backbone is the dominant performance driver (+3.7 pp mAP@50 at 32 frames, averaged across scales). Together, the diagnostics and architecture turn "does this detector reason over time?" into a measurable, actionable question.
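
The kind of probe the abstract describes can be illustrated with a minimal sketch. It is not the TemporalLens implementation: the function names, the `(T, H, W, C)` clip layout, the zero-fill choice for a removed frame, and the two toy scoring functions are all assumptions made for illustration. The idea is to apply a perturbation to a clip and compare how much a detector's score drops.

```python
import numpy as np

# Assumed clip layout: (T, H, W, C), with the target frame last.

def drop_target_frame(clip, target_idx=-1, fill=0.0):
    """Blank out one frame (assumption: removal is modelled as a constant fill)."""
    out = clip.copy()
    out[target_idx] = fill
    return out

def shuffle_frames(clip, rng):
    """Randomly permute the temporal order of the frames."""
    return clip[rng.permutation(clip.shape[0])]

def inject_redundancy(clip, source_idx=-1):
    """Replace every frame with a copy of one frame, removing temporal variation."""
    return np.repeat(clip[source_idx][None], clip.shape[0], axis=0)

def degrade_resolution(clip, factor=2):
    """Nearest-neighbour downsample then upsample, keeping the original shape."""
    low = clip[:, ::factor, ::factor]
    up = np.repeat(np.repeat(low, factor, axis=1), factor, axis=2)
    return up[:, :clip.shape[1], :clip.shape[2]]

def relative_drop(score_fn, clip, perturbed):
    """Fraction of the baseline score lost under the perturbation."""
    base = score_fn(clip)
    return (base - score_fn(perturbed)) / base

# Hypothetical stand-ins for detectors: each maps a clip to a scalar score.
def last_frame_score(clip):
    return float(clip[-1].mean())   # looks only at the target frame

def temporal_score(clip):
    return float(clip.mean())       # pools evidence over all frames

if __name__ == "__main__":
    clip = np.ones((4, 8, 8, 3))
    dropped = drop_target_frame(clip)
    print("last-frame scorer drop:", relative_drop(last_frame_score, clip, dropped))
    print("temporal scorer drop:  ", relative_drop(temporal_score, clip, dropped))
```

With a four-frame clip of ones, the last-frame scorer loses its entire score when the target frame is blanked, while the averaging scorer loses only a quarter. This mirrors, in toy form, the contrast the abstract reports between stacked 2D models and spatiotemporal models.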

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2026-06-30 09:44:21 UTC