Model-free reinforcement learning (RL) combined with diffusion models has achieved significant progress in addressing complex continuous control tasks. However, a persistent challenge in RL remains the accurate estimation of Q-values, which critically governs the efficacy of policy optimization. Although recent advances employ parametric distributions to model value distributions for improved estimation accuracy, current methodologies predominantly rely on unimodal Gaussian assumptions or quantile representations. These constraints introduce distributional bias between the learned and true value distributions, particularly in tasks with nonstationary policies, ultimately degrading performance. To address these limitations, we propose value diffusion reinforcement learning (VDRL), a novel model-free online RL method that exploits the generative capacity of diffusion models to represent multimodal value distributions. The core innovation of VDRL lies in using the variational loss of the diffusion-based value distribution, which is theoretically proven to be a tight lower bound on the optimization objective under the KL divergence. Furthermore, we introduce double value diffusion learning with sample selection to enhance training stability and further improve value estimation accuracy. Extensive experiments on the MuJoCo benchmark demonstrate that VDRL significantly outperforms several state-of-the-art model-free online RL baselines, showcasing its effectiveness and robustness.
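To make the two main ingredients concrete, the sketch below illustrates the general idea of a conditional diffusion model over scalar Q-values trained with a standard denoising (variational) objective, together with a double-critic target that applies a simple sample-selection rule. This is a minimal illustrative sketch, not the authors' implementation: the network architecture, noise schedule, timestep embedding, and the min-based selection rule (`DiffusionValueNet`, `denoising_loss`, `double_diffusion_target`) are all assumptions for exposition.

```python
# Hedged sketch (assumed design, not VDRL's actual code): a diffusion model over
# scalar Q-values conditioned on (state, action), trained with the simplified
# denoising/variational loss, plus a double-critic target with sample selection.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 50                                   # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 2e-2, T)    # noise schedule (assumed)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)


class DiffusionValueNet(nn.Module):
    """Predicts the noise added to a scalar Q-value, conditioned on (s, a, t)."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 2, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, q_t, s, a, t):
        t_emb = t.float().unsqueeze(-1) / T          # simple timestep embedding
        return self.net(torch.cat([q_t, s, a, t_emb], dim=-1))

    @torch.no_grad()
    def sample(self, s, a):
        """Reverse diffusion: draw one Q-value sample per (s, a) pair."""
        q = torch.randn(s.shape[0], 1)
        for i in reversed(range(T)):
            t = torch.full((s.shape[0],), i, dtype=torch.long)
            eps = self.forward(q, s, a, t)
            a_bar = alphas_bar[i]
            q0_hat = (q - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()
            if i > 0:
                a_bar_prev = alphas_bar[i - 1]
                noise = torch.randn_like(q)
                q = a_bar_prev.sqrt() * q0_hat + (1 - a_bar_prev).sqrt() * noise
            else:
                q = q0_hat
        return q


def denoising_loss(model, q_target, s, a):
    """Simplified variational (denoising) loss on a scalar value target."""
    t = torch.randint(0, T, (q_target.shape[0],))
    noise = torch.randn_like(q_target)
    a_bar = alphas_bar[t].unsqueeze(-1)
    q_t = a_bar.sqrt() * q_target + (1 - a_bar).sqrt() * noise
    return F.mse_loss(model(q_t, s, a, t), noise)


def double_diffusion_target(critic1, critic2, s_next, a_next, r, gamma=0.99):
    """Double value diffusion with a min-based sample-selection rule (assumed)."""
    q1 = critic1.sample(s_next, a_next)
    q2 = critic2.sample(s_next, a_next)
    return r + gamma * torch.minimum(q1, q2)         # keep the smaller sample
```

In this sketch, each critic learns a full (potentially multimodal) distribution over Q-values by sampling through the reverse diffusion chain, and the double-critic target discourages overestimation by keeping the smaller of the two sampled values; VDRL's actual sample-selection mechanism and theoretical bound are given in the paper itself.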