Consistent Emphatic Temporal-Difference Learning

Authors: Jiamin He, Fengdi Che, Yi Wan, A. Rupam Mahmood

We propose the first practical consistent off-policy TD algorithm and show its competitive performance.

Off-policy policy evaluation has been a critical and challenging problem in reinforcement learning, and Temporal-Difference (TD) learning is one of the most important approaches for addressing it. Notably, Full Importance-Sampling TD is the only existing off-policy TD method guaranteed to find the on-policy TD fixed point in the linear function approximation setting, but it unfortunately suffers from high variance and is scarcely practical. This notorious high-variance issue motivated the introduction of Emphatic TD, which tames the variance but has a biased fixed point. Inspired by these two methods, we propose a new consistent algorithm with a transient bias, which strikes a balance between bias and variance. Further, we unify the new algorithm with several existing algorithms to obtain a new family of consistent algorithms called \emph{Consistent Emphatic TD} (CETD($\lambda$, $\beta$, $\nu$)), which can smoothly control the bias-variance trade-off by varying the speed at which the transient bias fades. Through theoretical analysis and experiments on a didactic example, we validate the consistency of CETD($\lambda$, $\beta$, $\nu$). Moreover, we show that CETD($\lambda$, $\beta$, $\nu$) converges faster to the lowest error in a complex, high-variance task.
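The abstract does not spell out the CETD($\lambda$, $\beta$, $\nu$) update itself, but the family it describes builds on off-policy linear TD with importance-sampling ratios and emphatic weightings. As orientation only, the following is a minimal Python sketch of a standard Emphatic TD($\lambda$) update with linear function approximation on a small made-up MDP; the environment, policies, features, and step size are all hypothetical, and the follow-on/emphasis recursion shown is the classic ETD one, not the authors' CETD recursion.

import numpy as np

# Sketch of off-policy linear Emphatic TD(lambda) on a hypothetical 5-state MDP.
# Everything below (features, rewards, transitions, policies, step size) is made up
# for illustration; this is NOT the CETD(lambda, beta, nu) update from the paper.

rng = np.random.default_rng(0)

n_states, n_features = 5, 3
gamma, lam, alpha = 0.9, 0.5, 0.01

X = rng.normal(size=(n_states, n_features))   # feature vector x(s) for each state
R = rng.normal(size=n_states)                 # expected reward, a function of state only
P = {a: rng.dirichlet(np.ones(n_states), size=n_states) for a in (0, 1)}  # P[a][s] = next-state dist.

mu = np.array([0.5, 0.5])     # behavior policy (uniform over 2 actions)
pi = np.array([0.6, 0.4])     # target policy

w = np.zeros(n_features)      # value-function weights
e = np.zeros(n_features)      # eligibility trace
F = 0.0                       # follow-on trace
rho_prev = 0.0                # rho_{t-1}; zero so that the first F equals the interest i(S_0)
s = 0

for t in range(10_000):
    a = rng.choice(2, p=mu)
    s_next = rng.choice(n_states, p=P[a][s])
    r = R[s] + rng.normal(scale=0.1)
    rho = pi[a] / mu[a]                            # importance-sampling ratio

    # Emphatic weighting, with interest i(s) = 1 everywhere:
    F = gamma * rho_prev * F + 1.0                 # follow-on trace
    M = lam * 1.0 + (1.0 - lam) * F                # emphasis
    e = rho * (gamma * lam * e + M * X[s])         # emphasis-weighted eligibility trace

    delta = r + gamma * X[s_next] @ w - X[s] @ w   # TD error
    w = w + alpha * delta * e                      # semi-gradient update

    rho_prev = rho
    s = s_next

# Compare the learned linear approximation against the true target-policy values.
P_pi = pi[0] * P[0] + pi[1] * P[1]
v_true = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
print("approx values:", X @ w)
print("true values:  ", v_true)

Per the abstract, CETD's additional parameters govern how quickly its transient bias fades, giving a smoother bias-variance trade-off than the fixed recursion sketched above.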

Subject: UAI.2023 - Accept