We consider the problem of offline reinforcement learning from human feedback (RLHF) with pairwise comparisons, where the implicit reward is a linear function of an unknown parameter. Given an offline dataset, our objective is to identify the optimal action for each state, with the ultimate goal of minimizing the simple regret. We propose an algorithm, Reinforcement Learning with Locally Optimal Weights (RL-LOW), whose simple regret decays exponentially in the ratio of the number of data samples to an instance-dependent hardness parameter; this hardness parameter depends explicitly on the suboptimality gap of each action. Furthermore, we derive the first instance-dependent lower bound for offline RLHF with pairwise comparisons. Interestingly, the exponents of the lower and upper bounds on the simple regret match order-wise, demonstrating that RL-LOW is order-wise optimal. Motivated by privacy considerations in practical applications, we further extend RL-LOW to the setting of differential privacy and show, somewhat surprisingly, that the hardness parameter remains unchanged in the asymptotic regime as the number of data samples tends to infinity; this highlights the inherent efficiency of RL-LOW in preserving the privacy of the observed rewards. By establishing instance-dependent bounds with exponential convergence rates, our work fills an important gap in the existing literature, which has primarily focused on worst-case regret bounds with inverse-polynomial convergence rates for offline RLHF with pairwise comparisons.
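For concreteness, the following is a minimal sketch of the setting and of the shape of the guarantee described above. The Bradley–Terry comparison model, the feature map $\phi$, the exact form of the hardness parameter $H$, and the constants $C, c$ are illustrative assumptions on our part, not statements taken verbatim from this abstract.

% Hedged sketch: the Bradley--Terry link \sigma and the precise form of H
% below are standard modeling assumptions supplied for illustration only.
\[
  r(s,a) \;=\; \langle \phi(s,a),\, \theta^{*} \rangle,
  \qquad
  \Pr\bigl[a \succ a' \mid s\bigr]
  \;=\; \sigma\bigl(r(s,a) - r(s,a')\bigr),
  \qquad
  \sigma(x) \;=\; \frac{1}{1 + e^{-x}},
\]
\[
  \mathbb{E}\bigl[\mathrm{SimpleRegret}(n)\bigr]
  \;\le\; C \exp\!\Bigl(-\,c\,\frac{n}{H}\Bigr),
\]
where $n$ is the number of pairwise comparisons in the offline dataset, $H$ is an instance-dependent hardness parameter driven by the suboptimality gaps $\Delta(s,a) = \max_{a'} r(s,a') - r(s,a)$, and $C, c > 0$ are constants. Under this reading, an instance with larger gaps yields a smaller $H$ and hence a faster exponential decay, whereas worst-case analyses in prior work give only inverse-polynomial rates in $n$.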