28877@AAAI

Total: 1

#1 Can You Rely on Synthetic Labellers in Preference-Based Reinforcement Learning? It’s Complicated

Authors: Katherine Metcalf; Miguel Sarabia; Masha Fedzechkina; Barry-John Theobald

Preference-based Reinforcement Learning (PbRL) enables non-experts to train Reinforcement Learning models using preference feedback. However, the effort required to collect preference labels from real humans means that PbRL research relies primarily on synthetic labellers. We validate the most common synthetic labelling strategy by comparing its labels against labels collected from a crowd of humans on three DeepMind Control (DMC) suite tasks: stand, walk, and run. We find that: (1) synthetic labels are a good proxy for real humans under some circumstances, (2) strong agreement between human and synthetic preference labels is not necessary for similar policy performance, (3) policies trained on human feedback perform better early in training, whereas policies trained on synthetic feedback perform better by the end of training, and (4) training only on examples with high inter-annotator agreement does not meaningfully improve policy performance. Our results justify the use of synthetic labellers to develop and ablate PbRL methods, and provide insight into how human labelling changes over the course of policy training.
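
For context, the synthetic labelling strategy most commonly used in the PbRL literature is an oracle that labels a pair of behaviour segments by comparing their ground-truth environment returns, optionally with Boltzmann-rational noise so that near-tied segments get noisier labels. The sketch below illustrates that general convention only; the function name, arguments, and the `rationality` parameter are illustrative assumptions and are not taken from this paper.

```python
import numpy as np

def synthetic_preference(segment_a_rewards, segment_b_rewards,
                         rationality=None, rng=None):
    """Label a segment pair: 0 -> prefer A, 1 -> prefer B, 0.5 -> tie.

    rationality=None gives a deterministic oracle that prefers the segment
    with the higher ground-truth return; a finite positive value gives a
    Boltzmann-rational labeller (Bradley-Terry model over returns).
    Names and signature are illustrative, not from the paper.
    """
    ret_a = float(np.sum(segment_a_rewards))
    ret_b = float(np.sum(segment_b_rewards))

    if rationality is None:
        # Perfectly rational oracle: pick the higher-return segment.
        if ret_a > ret_b:
            return 0
        if ret_b > ret_a:
            return 1
        return 0.5

    # Noisy oracle: sample a preference from a Bradley-Terry model,
    # P(prefer B) = sigmoid(rationality * (ret_b - ret_a)).
    rng = rng or np.random.default_rng()
    p_prefer_b = 1.0 / (1.0 + np.exp(rationality * (ret_a - ret_b)))
    return int(rng.random() < p_prefer_b)
```

This is a minimal sketch of the standard oracle convention under the stated assumptions, not the labelling code used in the paper; it is included only to make concrete what "synthetic labeller" typically means in this setting.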