Human-object interaction (HOI) detection relies on fine-grained visual understanding to distinguish complex relationships between humans and objects. While recent generative diffusion models have demonstrated remarkable capabilities in learning detailed visual concepts through pixel-level generation, their potential for interaction-level relationship modeling remains largely unexplored. To bridge this gap, we propose a Visual Relation Diffusion model (VRDiff), which introduces dense visual relation conditions to guide interaction understanding. Specifically, we encode interaction-aware condition representations that capture both the spatial responsiveness and contextual semantics of human-object pairs, conditioning the diffusion process purely on visual features rather than text-based inputs. Furthermore, we refine these relation representations through generative feedback from the diffusion model, enhancing HOI detection without requiring image synthesis. Extensive experiments on the HICO-DET benchmark demonstrate that VRDiff achieves competitive results under both standard and zero-shot HOI detection settings.
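To make the conditioning and feedback ideas above concrete, the following is a minimal, assumption-based sketch (not the authors' released code): it encodes a human-object pair into a visual relation condition, feeds that condition to a diffusion denoiser over the relation latent, and recovers a refined latent from the predicted noise without synthesizing any image. Module names (e.g., `RelationConditionEncoder`), feature dimensions, and the timestep embedding are illustrative placeholders.

```python
import torch
import torch.nn as nn


class RelationConditionEncoder(nn.Module):
    """Encodes a human-object pair into an interaction-aware condition vector."""

    def __init__(self, feat_dim=256, cond_dim=256):
        super().__init__()
        # Fuse human appearance, object appearance, and an 8-d spatial layout feature.
        self.fuse = nn.Sequential(
            nn.Linear(feat_dim * 2 + 8, cond_dim),
            nn.ReLU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, human_feat, object_feat, spatial_feat):
        return self.fuse(torch.cat([human_feat, object_feat, spatial_feat], dim=-1))


class ConditionalDenoiser(nn.Module):
    """Predicts the noise added to a relation latent, given the visual condition."""

    def __init__(self, latent_dim=256, cond_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, 512),
            nn.ReLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, noisy_latent, cond, t):
        t_emb = t.float().unsqueeze(-1) / 1000.0  # simplistic timestep embedding
        return self.net(torch.cat([noisy_latent, cond, t_emb], dim=-1))


def refine_relation(latent, cond, denoiser, t, alpha_bar):
    """One generative-feedback step: denoise the relation latent, no pixel synthesis."""
    noise = torch.randn_like(latent)
    # Forward-diffuse the relation latent to timestep t.
    noisy = alpha_bar.sqrt() * latent + (1 - alpha_bar).sqrt() * noise
    pred_noise = denoiser(noisy, cond, t)
    # Recover a refined latent estimate from the predicted noise.
    return (noisy - (1 - alpha_bar).sqrt() * pred_noise) / alpha_bar.sqrt()


if __name__ == "__main__":
    B, D = 4, 256
    encoder, denoiser = RelationConditionEncoder(D, D), ConditionalDenoiser(D, D)
    cond = encoder(torch.randn(B, D), torch.randn(B, D), torch.randn(B, 8))
    refined = refine_relation(
        torch.randn(B, D), cond, denoiser, torch.full((B,), 500), torch.tensor(0.5)
    )
    print(refined.shape)  # torch.Size([4, 256])
```

In this sketch the refined latent would be passed to an HOI classification head; the design point it illustrates is that the diffusion model acts as a feature refiner conditioned on visual relation cues, rather than as an image generator.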