Cameras as Relative Positional Encoding | Cool Papers

#1 Cameras as Relative Positional Encoding

Authors: Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, Angjoo Kanazawa

Transformers are increasingly prevalent for multi-view computer vision tasks, where geometric relationships between viewpoints are critical for 3D perception. To leverage these relationships, multi-view transformers must use camera geometry to ground visual tokens in 3D space. In this work, we compare techniques for conditioning transformers on cameras: token-level raymap encodings, attention-level relative pose encodings, and a new relative encoding we propose -- Projective Positional Encoding (PRoPE) -- that captures complete camera frustums, both intrinsics and extrinsics, as a relative positional encoding. Our experiments begin by showing how relative camera conditioning improves performance in feedforward novel view synthesis, with further gains from PRoPE. This holds across settings: scenes with both shared and varying intrinsics, when combining token- and attention-level conditioning, and for generalization to inputs with out-of-distribution sequence lengths and camera intrinsics. We then verify that these benefits persist for different tasks, stereo depth estimation and discriminative spatial cognition, as well as larger model sizes.
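To make the idea of a "relative" camera encoding concrete, here is a minimal NumPy sketch. It is an illustration under assumptions the abstract does not state, not the paper's PRoPE implementation: each camera is assumed to be summarized by a 4x4 projective matrix built from pinhole intrinsics and a world-to-camera pose, and the helper names `projection_matrix`, `relative_projection`, and `random_rigid` are hypothetical. The sketch only checks one property: the pairwise transform between two cameras does not change when the world coordinate frame is changed, even when the two cameras have different intrinsics.

```python
import numpy as np

def projection_matrix(K, world_to_cam):
    """Lift 3x3 intrinsics K to 4x4 and compose with a 4x4 world-to-camera pose."""
    K4 = np.eye(4)
    K4[:3, :3] = K
    return K4 @ world_to_cam

def relative_projection(P_i, P_j):
    """Pairwise transform from camera j's projective frame to camera i's."""
    return P_i @ np.linalg.inv(P_j)

def random_rigid(rng):
    """Random 4x4 rigid transform: rotation from a QR factorization plus a translation."""
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1  # flip one axis so Q is a proper rotation
    T = np.eye(4)
    T[:3, :3] = Q
    T[:3, 3] = rng.normal(size=3)
    return T

rng = np.random.default_rng(0)

# Two cameras with different (assumed) focal lengths, i.e. varying intrinsics.
K_i = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
K_j = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
P_i = projection_matrix(K_i, random_rigid(rng))
P_j = projection_matrix(K_j, random_rigid(rng))

# Re-express both cameras in an arbitrary new world frame G.
G = random_rigid(rng)
rel_before = relative_projection(P_i, P_j)
rel_after = relative_projection(P_i @ G, P_j @ G)

# The pairwise quantity is unchanged, although each absolute matrix changed.
invariant = np.allclose(rel_before, rel_after)
print("relative transform invariant to world frame:", invariant)
```

The invariance follows from (P_i G)(P_j G)^-1 = P_i G G^-1 P_j^-1 = P_i P_j^-1. An absolute, token-level encoding of each camera would instead depend on the choice of world frame.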

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-07-14 17:22:45 UTC

2507.10496
