Total: 1
Single-image reflection removal is a highly ill-posed problem, where existing methods struggle to reason about the composition of corrupted regions, causing them to fail at recovery and generalization in the wild. This work reframes an editing-purpose latent diffusion model to effectively perceive and process highly ambiguous, layered image inputs, yielding high-quality outputs. We argue that the challenge of this conversion stems from a critical yet overlooked issue, i.e., the latent space of semantic encoders lacks the inherent structure to interpret a composite image as a linear superposition of its constituent layers. Our approach is built on three synergistic components, including a reflection-equivariant VAE that aligns the latent space with the linear physics of reflection formation, a learnable task-specific text embedding for precise guidance that bypasses ambiguous language, and a depth-guided early-branching sampling strategy to harness generative stochasticity for promising results. Extensive experiments reveal that our model achieves new SOTA performance on multiple benchmarks and generalizes well to challenging real-world cases.