Is visual information alone sufficient for visual speech recognition (VSR) in challenging real-world scenarios? Humans do not rely solely on visual information for lip-reading; they also incorporate additional cues, such as speech-related context and prior knowledge about the task. However, existing automatic VSR systems have largely overlooked such external information. To systematically explore its role in VSR, we introduce the concept of Peripheral Information and categorize it into three types based on its relevance to the spoken content: (1) Contextual Guidance (e.g., the topic or a description of the speech), (2) Task Expertise (e.g., human prior experience in lip-reading), and (3) Linguistic Perturbation (irrelevant signals processed alongside meaningful information). Since peripheral information supplies additional clues of varying significance while visual input remains the most direct source for VSR, we propose a framework with a hierarchical processing strategy for handling the different modalities. Through visual-specific adaptation and a dynamic routing mechanism for multi-modal information, our approach effectively mitigates modality conflicts and selectively utilizes peripheral information of varying relevance. Leveraging readily available peripheral information, our model achieves a WER of 22.03% on LRS3. Further experiments on AVSpeech demonstrate its generalization to real-world scenarios.
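To make the routing idea concrete, the following is a minimal sketch (not the paper's actual implementation) of how peripheral cues might be gated by their estimated relevance to the visual stream while visual features remain primary; the module and parameter names (VisualAdapter, PeripheralRouter, bottleneck size) are hypothetical illustrations of visual-specific adaptation and dynamic routing.

```python
import torch
import torch.nn as nn


class VisualAdapter(nn.Module):
    """Lightweight residual bottleneck for visual-specific adaptation (hypothetical)."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        # Adapt visual features without overwriting the original representation.
        return x + self.up(self.act(self.down(x)))


class PeripheralRouter(nn.Module):
    """Dynamic routing: weight each peripheral cue by its relevance to the visual stream."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.Tanh(), nn.Linear(dim, 1)
        )

    def forward(self, visual, peripherals):
        # visual:      (B, T, D) frame-level visual features
        # peripherals: (B, K, D) one embedding per peripheral cue (context, expertise, ...)
        v = visual.mean(dim=1, keepdim=True)                  # (B, 1, D) pooled visual summary
        v = v.expand(-1, peripherals.size(1), -1)             # (B, K, D)
        logits = self.score(torch.cat([v, peripherals], -1))  # (B, K, 1) relevance scores
        weights = torch.sigmoid(logits)                       # irrelevant cues get weights near 0
        fused_peripheral = (weights * peripherals).sum(dim=1) # (B, D)
        # Hierarchical fusion: visual features stay primary; peripheral info is additive.
        return visual + fused_peripheral.unsqueeze(1)


if __name__ == "__main__":
    B, T, K, D = 2, 50, 3, 256
    visual = VisualAdapter(D)(torch.randn(B, T, D))
    peripherals = torch.randn(B, K, D)  # e.g. contextual, expertise, and perturbation embeddings
    fused = PeripheralRouter(D)(visual, peripherals)
    print(fused.shape)  # torch.Size([2, 50, 256])
```

The sketch only illustrates the hierarchy described above: the visual pathway is adapted and preserved, while each peripheral embedding contributes in proportion to a learned relevance score, so low-relevance signals such as linguistic perturbation can be suppressed.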