Remote physiological measurement (RPM) based on video and radar has made significant progress in recent years. However, unimodal methods that rely solely on a video or radar sensor have inherent limitations rooted in their measurement principles, and multimodal RPM that combines the two modalities has emerged as a promising direction. Despite this potential, the scarcity of large-scale multimodal data and the significant modality gap between video and radar pose substantial challenges to building robust video-radar RPM models. To address these problems, we propose leveraging unimodal pre-training and present the Spatial Alignment and Temporal Matching (SATM) Adapter to effectively fine-tune pre-trained unimodal backbones into a multimodal RPM model. Given the distinct measurement principles of video- and radar-based methods, we propose Spatial Alignment to align the spatial distributions of their features. Furthermore, Temporal Matching is applied to mitigate waveform discrepancies between video and radar signals. By integrating these two modules into adapters, the unimodal backbones retain their modality-specific knowledge while effectively extracting complementary features from each other. Extensive experiments across challenging scenarios, including low-light conditions and head motion, demonstrate that our approach significantly surpasses state-of-the-art methods.