QuIIL at T3 challenge: Towards Automation in Life-Saving Intervention Procedures from First-Person View

Authors: Trinh T. L. Vuong, Doanh C. Bui, Jin Tae Kwak

We present our solutions to the automation tasks for life-saving intervention procedures in the Trauma THOMPSON (T3) Challenge: action recognition, action anticipation, and Visual Question Answering (VQA). For action recognition and anticipation, we propose a pre-processing strategy that samples multiple inputs and stitches them into a single image, and we incorporate momentum- and attention-based knowledge distillation to improve performance on both tasks. For training, we present an action dictionary-guided design, which consistently yields the best results across our experiments. For VQA, we leverage object-level features and deploy co-attention networks that learn from both object and question features. Notably, we introduce a novel frame-question cross-attention mechanism at the network's core for enhanced performance. Our solutions rank 2nd in the action recognition and anticipation tasks and 1st in the VQA task.
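
The sample-and-stitch pre-processing described in the abstract might look roughly like the sketch below. The abstract does not specify how frames are sampled or arranged, so the uniform sampling, the row-major grid layout, and the names `sample_indices` and `sample_and_stitch` are all assumptions made for illustration, not the authors' implementation. Frames are represented as plain nested lists to keep the example dependency-free.

```python
from typing import List

# Hypothetical stand-in for an image: an H x W grid of pixel values.
Frame = List[List[int]]


def sample_indices(num_frames: int, k: int) -> List[int]:
    """Pick k evenly spaced frame indices (assumption: uniform sampling)."""
    if k == 1:
        return [0]
    return [round(i * (num_frames - 1) / (k - 1)) for i in range(k)]


def sample_and_stitch(frames: List[Frame], rows: int, cols: int) -> Frame:
    """Sample rows*cols frames from a clip and tile them into one image.

    Tiles are placed in row-major order, so time runs left to right and
    then top to bottom (an assumed layout, not taken from the paper).
    """
    picked = [frames[i] for i in sample_indices(len(frames), rows * cols)]
    height = len(picked[0])
    stitched: Frame = []
    for r in range(rows):
        tiles = picked[r * cols:(r + 1) * cols]
        for y in range(height):
            line: List[int] = []
            for tile in tiles:
                # Concatenate the y-th pixel row of each tile in this grid row.
                line.extend(tile[y])
            stitched.append(line)
    return stitched


# Toy clip: 16 frames of size 2 x 3, each filled with its own frame index.
clip = [[[i] * 3 for _ in range(2)] for i in range(16)]
image = sample_and_stitch(clip, rows=2, cols=2)
```

The stitched image has `rows` times the frame height and `cols` times the frame width, so a single-image backbone can see several time steps at once.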

Subject: Computer Vision and Pattern Recognition

Published: 2024-07-18 06:55:26 UTC

arXiv: 2407.13216
