Learning discriminative state representations of agents, encompassing the spatial layout and the temporal pose trajectory, is essential for effective navigation decisions. However, existing approaches often rely on plain networks to fuse navigation information and overlook the complex long-range dependencies across spatio-temporal cues, which leads to suboptimal state perception and potential decision failures. In this paper, we introduce NaviFormer, an encoder-decoder navigation transformer that aggregates discriminative spatio-temporal context information for object navigation. Our navigation encoder not only encodes spatial layouts and temporal agent poses but also constructs and encodes a passable frontier map, enriching the original state encoding with cues about potential exploration regions. Furthermore, our navigation decoder employs spatio-temporal self-attention and cross-attention mechanisms to model the dependencies among the spatial layout encoding, the temporal pose encoding, and the passable frontier encoding, thereby facilitating comprehensive contextual state feature aggregation. Finally, we use these learned spatio-temporal contextual state representations for PPO-based navigation decisions. Extensive experiments on the Gibson, Habitat-Matterport3D (HM3D) and Matterport3D (MP3D) datasets demonstrate the superiority of our approach.
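
To make the notion of a passable frontier map more concrete, the following minimal sketch marks cells that are explored, obstacle-free, and adjacent to unexplored space. It is an illustration under stated assumptions, not the construction used in NaviFormer: the function name `passable_frontier_map`, the boolean grid inputs `explored` and `obstacle`, and the 4-connected neighbourhood are all hypothetical choices made for this example.

```python
import numpy as np


def passable_frontier_map(explored, obstacle):
    """Return a boolean grid of explored, obstacle-free cells bordering unexplored cells.

    Hypothetical helper for illustration only; the abstract does not specify
    how the passable frontier map is actually built.
    """
    explored = np.asarray(explored, dtype=bool)
    obstacle = np.asarray(obstacle, dtype=bool)

    free = explored & ~obstacle   # cells the agent could stand on
    unexplored = ~explored        # cells not yet observed

    # Pad with False so that the map border is not treated as unexplored space.
    padded = np.pad(unexplored, 1, mode="constant", constant_values=False)

    # A cell is near unknown space if any 4-connected neighbour is unexplored.
    near_unknown = (
        padded[:-2, 1:-1]    # neighbour above
        | padded[2:, 1:-1]   # neighbour below
        | padded[1:-1, :-2]  # neighbour to the left
        | padded[1:-1, 2:]   # neighbour to the right
    )
    return free & near_unknown


# Toy 5x5 map: the three left columns are explored, with one obstacle on the boundary.
explored = np.zeros((5, 5), dtype=bool)
explored[:, :3] = True
obstacle = np.zeros((5, 5), dtype=bool)
obstacle[2, 2] = True

frontier = passable_frontier_map(explored, obstacle)
```

In this toy map the frontier lies along the third column, where explored space meets unexplored space, and skips the obstacle cell. A grid of this kind could then be encoded alongside the spatial layout and pose inputs, although the encoding itself is outside the scope of this sketch.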