Recent advancements in multimodal large language models (MLLMs) have opened new avenues for video understanding. However, achieving high performance on zero-shot video tasks remains challenging. Traditional video processing methods rely heavily on fine-tuning to capture nuanced spatio-temporal details, which incurs significant data and computation costs. In contrast, training-free approaches, though efficient, often fail to robustly preserve context-rich features across complex video content. To address these limitations, we propose DyTo, a novel dynamic token merging framework for zero-shot video understanding that adaptively optimizes token efficiency while preserving crucial scene details. DyTo integrates hierarchical frame selection with a bipartite token merging strategy to dynamically cluster key frames and selectively compress token sequences, striking a balance between computational efficiency and semantic richness. Extensive experiments across multiple benchmarks demonstrate the effectiveness of DyTo: our method not only sets a new state of the art for zero-shot video understanding when applied to image-trained MLLMs, but also further boosts the performance of models already fine-tuned on video data. Code is available at https://github.com/Jam1ezhang/DYTO.
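To make the merging step concrete, the sketch below shows a generic bipartite soft-matching merge in PyTorch, in the spirit the abstract describes: tokens are split into two disjoint sets, each token in one set proposes its most similar partner in the other, and the strongest-scoring pairs are averaged together. The function name `bipartite_token_merge`, the alternating split, and the mean-based merge rule are illustrative assumptions on our part, not DyTo's released implementation, which lives in the linked repository.

```python
import torch


def bipartite_token_merge(tokens: torch.Tensor, r: int) -> torch.Tensor:
    """Merge away r tokens via bipartite soft matching.

    Illustrative sketch only; DyTo's actual selection/merging logic
    may differ. tokens: (N, D) per-frame sequence; returns (N - r, D).
    """
    # Split tokens alternately into two disjoint sets A and B
    # (clone B so in-place merging never mutates the input tensor).
    a, b = tokens[::2], tokens[1::2].clone()

    # Cosine similarity between every token in A and every token in B.
    a_n = a / a.norm(dim=-1, keepdim=True)
    b_n = b / b.norm(dim=-1, keepdim=True)
    scores = a_n @ b_n.T                        # (|A|, |B|)

    # Each A-token proposes its most similar partner in B; keep the
    # r highest-scoring edges and merge those A-tokens away.
    best_val, best_idx = scores.max(dim=-1)
    order = best_val.argsort(descending=True)
    src, keep = order[:r], order[r:]            # A-tokens: merged vs. kept
    dst = best_idx[src]                         # their B partners

    # Average each merged A-token into its matched B-token; a running
    # count handles several A-tokens landing on the same B-token.
    counts = torch.ones(b.size(0), device=tokens.device)
    counts.index_add_(0, dst, torch.ones(r, device=tokens.device))
    b.index_add_(0, dst, a[src])
    b = b / counts.unsqueeze(-1)

    return torch.cat([a[keep], b], dim=0)       # (N - r, D)


# Example: compress 576 vision tokens for one frame down to 376.
tokens = torch.randn(576, 1024)
merged = bipartite_token_merge(tokens, r=200)
print(merged.shape)  # torch.Size([376, 1024])
```

Because merging is a similarity-driven reduction rather than fixed-stride pooling, the retained token budget can be varied per frame, which is what lets a dynamic scheme spend more tokens on information-dense key frames and fewer on redundant ones.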