305@2024@ECCV

Total: 1

#1 Efficient Few-Shot Action Recognition via Multi-Level Post-Reasoning [PDF1] [Copy] [Kimi1] [REL]

Authors: Cong Wu, Xiao-Jun Wu, Linze Li, Tianyang Xu, Zhenhua Feng, Josef Kittler

The integration with CLIP (Contrastive Vision-Language Pre-training) has significantly refreshed the accuracy leaderboard of FSAR (Few-shot Action Recognition). However, the trainable overhead of ensuring that the domain alignment of CLIP and FSAR is often unbearable. To mitigate this issue, we present an Efficient Multi-Level Post-Reasoning Network, namely EMP-Net. By design, a post-reasoning mechanism is been proposed for domain adaptation, which avoids most gradient backpropagation, improving the efficiency; meanwhile, a multi-level representation is utilised during the reasoning and matching processes to improve the discriminability, ensuring effectiveness. Specifically, the proposed EMP-Net starts with a skip-fusion involving cached multi-stage features extracted by CLIP. After that, current feature are decoupled into multi-level representations, including global-level, patch-level, and frame-level. The ensuing spatiotemporal reasoning module operates on multi-level representations to generate discriminative features. As for matching, the multi-level contrasts between text-visual and support-query are integrated to provide a comprehensive guidance. The experimental results demonstrate that EMP-Net can unlock the potential performance of CLIP in a more efficient manner. Please find our code in supplementary materials.

Subject: ECCV.2024 - Poster