2506.13458

Leveraging Vision-Language Pre-training for Human Activity Recognition in Still Images

Authors: Cristina Mahanta, Gagan Bhatia

Recognising human activity in a single photo enables indexing, safety, and assistive applications, yet a still image carries no motion cues. On 285 MSCOCO images labelled as walking, running, sitting, or standing, CNNs trained from scratch reached 41% accuracy. Fine-tuning the multimodal CLIP model raised this to 76%, demonstrating that contrastive vision-language pre-training decisively improves still-image action recognition in real-world deployments.
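
The abstract does not describe the classification head, so the following is only a minimal sketch of how a CLIP-style model typically scores an image against class prompts: cosine similarity between an image embedding and one text embedding per class, scaled and passed through a softmax. The four class names come from the abstract; the function names, the toy 4-dimensional embeddings, and the `logit_scale` value are hypothetical stand-ins for real encoder outputs and are not taken from the paper.

```python
import math

# Class names come from the abstract; everything else here is illustrative.
CLASSES = ["walking", "running", "sitting", "standing"]


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(image_emb, text_embs, logit_scale=100.0):
    """Score one image embedding against one text embedding per class.

    logit_scale is an assumed temperature; a real model learns this value.
    """
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs


# Hypothetical toy embeddings standing in for text-encoder outputs,
# one per class prompt, in the same order as CLASSES.
toy_text_embs = [
    [1.0, 0.1, 0.0, 0.0],
    [0.1, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.1],
    [0.0, 0.0, 0.1, 1.0],
]
# Hypothetical image embedding that lies closest to the second class prompt.
toy_image_emb = [0.2, 0.9, 0.1, 0.0]

label, probs = classify(toy_image_emb, toy_text_embs)
print(label, [round(p, 3) for p in probs])
```

Fine-tuning would additionally update the encoder weights on the labelled images; this sketch covers only the scoring step applied at inference time.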

Subjects: Computer Vision and Pattern Recognition, Computation and Language

Publish: 2025-06-16 13:15:02 UTC