Language as Prior, Vision as Calibration: Metric Scale Recovery for Monocular Depth Estimation

Authors: Mingxing Zhan, Li Zhang, Beibei Wang, Yingjie Wang, Zenglin Shi

Relative-depth foundation models transfer well, yet monocular metric depth remains ill-posed because global scale is unidentifiable and sensitivity to domain shift is heightened. Under a frozen-backbone calibration setting, we recover metric depth via an image-specific affine transform in inverse depth, training only lightweight calibration heads while keeping the relative-depth backbone and the CLIP text encoder fixed. Because captions provide coarse, noisy scale cues that vary with phrasing and with omitted objects, we use language to predict an uncertainty-aware envelope that bounds the feasible calibration parameters in an unconstrained space, rather than committing to a text-only point estimate. We then use pooled multi-scale frozen visual features to select an image-specific calibration within this envelope. During training, a closed-form least-squares oracle in inverse depth provides per-image supervision for learning both the envelope and the selected calibration. On NYUv2 and KITTI, the method improves in-domain accuracy, and zero-shot transfer to SUN-RGBD and DDAD shows improved robustness over strong language-only baselines.
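
The closed-form least-squares oracle in inverse depth mentioned in the abstract can be illustrated with a minimal sketch. It assumes the affine model takes the common form s * d + t ≈ 1 / z, where d is the backbone's relative inverse-depth prediction, z is ground-truth metric depth, and (s, t) are the per-image scale and shift. The function names, the validity mask, and the eps clamp are hypothetical choices made for illustration; the abstract does not specify the paper's exact formulation, parameterization, or loss.

```python
import numpy as np


def fit_affine_inverse_depth(rel_inv_depth, metric_depth, valid_mask=None, eps=1e-6):
    """Solve min over (s, t) of sum (s * d + t - 1/z)^2 on valid pixels.

    Hypothetical helper: d is relative inverse depth, z is metric depth.
    """
    d = np.asarray(rel_inv_depth, dtype=np.float64).ravel()
    z = np.asarray(metric_depth, dtype=np.float64).ravel()

    # Assumption: pixels with non-positive ground-truth depth are invalid.
    mask = z > eps
    if valid_mask is not None:
        mask &= np.asarray(valid_mask, dtype=bool).ravel()

    x = d[mask]
    y = 1.0 / z[mask]  # regression target: metric inverse depth

    # Design matrix [d, 1]; lstsq returns the closed-form least-squares solution.
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(s), float(t)


def apply_calibration(rel_inv_depth, s, t, eps=1e-6):
    """Map relative inverse depth to metric depth with the affine transform."""
    inv = s * np.asarray(rel_inv_depth, dtype=np.float64) + t
    # Clamp to keep the reciprocal finite and positive (illustrative safeguard).
    return 1.0 / np.clip(inv, eps, None)


# Synthetic check: build relative inverse depth that is exactly affine in 1/z.
rng = np.random.default_rng(0)
metric = rng.uniform(1.0, 10.0, size=(4, 5))
rel = (1.0 / metric - 0.05) / 2.0  # so that 1/metric = 2.0 * rel + 0.05

s, t = fit_affine_inverse_depth(rel, metric)
recovered = apply_calibration(rel, s, t)
```

In this synthetic case the fit recovers the scale and shift used to construct the data, so the calibrated depth matches the metric depth. In the setting the abstract describes, such per-image oracle parameters would serve only as training targets; at test time the calibration is predicted from language and visual features rather than fitted to ground truth.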

Subject: Computer Vision and Pattern Recognition

Publish: 2026-01-04 09:59:43 UTC

2601.01457
