40987@AAAI

Total: 1

#1 Offline Multi-Objective Bandits: From Logged Data to Pareto-Optimal Policies

Authors: Ji Cheng, Song Lai, Shunyu Yao, Bo Xue

Offline policy learning from logged data is a critical paradigm for enabling effective decision-making without costly online exploration. However, its application has been largely confined to single-objective problems, in stark contrast to real-world scenarios where decision-making inherently involves navigating multiple, often conflicting objectives. This paper introduces a comprehensive framework for Offline Multi-Objective Bandits (OffMOB), providing a principled solution to the fundamental challenge of learning Pareto-optimal policies from a static dataset. Our core contribution is a novel algorithm that integrates the pessimism principle with multi-objective optimization to learn safely from off-policy data. Crucially, our approach overcomes the primary limitation of scalarization techniques, which are restricted to finding a single policy for a pre-defined preference. Instead, OffMOB directly approximates the entire Pareto front, learning a single, flexible policy model capable of generating an optimal action for any desired trade-off. To rigorously evaluate performance, we introduce the Tchebycheff sub-optimality metric and establish the first finite-sample generalization bounds for this problem class, proving that our algorithm converges to the true Pareto front under practical data coverage assumptions. Extensive experiments on complex benchmarks demonstrate that OffMOB significantly outperforms existing methods, identifying the complete set of optimal trade-offs where naive extensions fail.
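To make the two ingredients named in the abstract concrete, the sketch below illustrates how a pessimism-style estimator (lower confidence bounds on logged rewards) can be combined with standard Tchebycheff scalarization to pick an action for any preference vector. This is a minimal illustration under assumed simplifications, not the paper's OffMOB algorithm: the functions `pessimistic_estimates` and `tchebycheff_policy`, the Hoeffding-style bonus, and the toy dataset are all hypothetical stand-ins.

```python
import numpy as np

def pessimistic_estimates(rewards_by_action, delta=0.05):
    """Per-action, per-objective lower confidence bounds (pessimism principle).

    rewards_by_action: dict mapping action -> array of shape (n_samples, n_objectives)
    Returns: dict mapping action -> LCB vector of shape (n_objectives,).
    Uses a simple Hoeffding-style bonus for rewards in [0, 1]; the paper's
    exact estimator may differ.
    """
    lcbs = {}
    for a, r in rewards_by_action.items():
        n = max(len(r), 1)
        mean = r.mean(axis=0)
        bonus = np.sqrt(np.log(2.0 / delta) / (2.0 * n))
        lcbs[a] = mean - bonus
    return lcbs

def tchebycheff_policy(lcbs, preference, ideal_point):
    """Pick the action whose pessimistic value minimizes the weighted
    Tchebycheff distance to an ideal point, for one preference vector."""
    preference = np.asarray(preference, dtype=float)
    best_action, best_value = None, np.inf
    for a, lcb in lcbs.items():
        value = np.max(preference * (ideal_point - lcb))
        if value < best_value:
            best_action, best_value = a, value
    return best_action

# Toy usage: 3 actions, 2 objectives, logged rewards in [0, 1].
rng = np.random.default_rng(0)
logged = {
    0: rng.uniform([0.7, 0.2], [0.9, 0.4], size=(50, 2)),  # good on objective 1
    1: rng.uniform([0.2, 0.7], [0.4, 0.9], size=(30, 2)),  # good on objective 2
    2: rng.uniform([0.4, 0.4], [0.6, 0.6], size=(40, 2)),  # balanced
}
lcbs = pessimistic_estimates(logged)
ideal = np.max(np.stack(list(lcbs.values())), axis=0)  # empirical ideal point
for pref in ([0.8, 0.2], [0.5, 0.5], [0.2, 0.8]):
    print(pref, "->", tchebycheff_policy(lcbs, pref, ideal))
```

Sweeping the preference vector in this way traces out a set of preference-dependent actions, which is the intuition behind approximating the whole Pareto front with one flexible policy rather than committing to a single scalarized objective.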

Subject: AAAI.2026 - Reasoning under Uncertainty