| Total: 15
Partially observable Markov decision processes (POMDPs) rely on the key assumption that probability distributions are precisely known. Robust POMDPs (RPOMDPs) alleviate this concern by defining imprecise probabilities, referred to as uncertainty sets. While robust MDPs have been studied extensively, work on RPOMDPs is limited and primarily focuses on algorithmic solution methods. We expand the theoretical understanding of RPOMDPs by showing that 1) different assumptions on the uncertainty sets affect optimal policies and values; 2) RPOMDPs have a partially observable stochastic game (POSG) semantic; and 3) the same RPOMDP with different assumptions leads to semantically different POSGs and, thus, different policies and values. These novel semantics for RPOMDPs give access to results for POSGs, studied in game theory; concretely, we show the existence of a Nash equilibrium. Finally, we classify the existing RPOMDP literature using our semantics, clarifying under which uncertainty assumptions these existing works operate.
Markov decision processes (MDPs) provide a standard framework for sequential decision making under uncertainty. However, MDPs do not take uncertainty in transition probabilities into account. Robust Markov decision processes (RMDPs) address this shortcoming of MDPs by assigning to each transition an uncertainty set rather than a single probability value. In this work, we consider polytopic RMDPs in which all uncertainty sets are polytopes and study the problem of solving long-run average reward polytopic RMDPs. We present a novel perspective on this problem and show that it can be reduced to solving long-run average reward turn-based stochastic games with finite state and action spaces. This reduction allows us to derive several important consequences that were hitherto not known to hold for polytopic RMDPs. First, we derive new computational complexity bounds for solving long-run average reward polytopic RMDPs, showing for the first time that the threshold decision problem for them is in NP and coNP and that they admit a randomized algorithm with sub-exponential expected runtime. Second, we present Robust Polytopic Policy Iteration (RPPI), a novel policy iteration algorithm for solving long-run average reward polytopic RMDPs. Our experimental evaluation shows that RPPI is much more efficient in solving long-run average reward polytopic RMDPs compared to state-of-the-art methods based on value iteration.
We consider the problem of computing high-quality Euclidean shortest paths amidst obstacles on a large scale. By transferring and adapting speed-up techniques from the road network setting, we are able to compute source target paths for problem instances with several million obstacle vertices within few milliseconds after moderate preprocessing. We show experimentally that for small instances where optimal solutions are easily available on average our computed paths are less than 0.3% longer than the optimum. For large instances a new lower-bounding technique shows that on average our computed paths are less than 2% longer than the optimum paths. We compare our approach with the current state-of-the-art on problem instances derived from the OpenStreetMap project.
We propose a new approach based on ranking to learn to guide Greedy Best-First Search (GBFS). As previous ranking approaches, ours is based on the observation that directly learning a heuristic function is overly restrictive, and that GBFS is capable of efficiently finding good plans for a much more flexible class of total quasi-orders over states. In order to learn an optimal ranking function, we introduce a new ranking framework capable of leveraging any neural network regression model and efficiently handling the training data through batching. Compared with previous ranking approaches for planning, ours does not require complex loss functions and allows training on states outside the optimal plan with minimal overhead. Our experiments on the domains of the latest planning competition learning track show that our approach substantially improves the coverage of the underlying neural network models without degrading plan quality.
General policies represent reactive strategies for solving large families of planning problems like the infinite collection of solvable instances from a given domain. Methods for learning such policies from a collection of small training instances have been developed successfully for classical domains. In this work, we extend the formulations and the resulting combinatorial methods for learning general policies over fully observable, non-deterministic (FOND) domains. We also evaluate the resulting approach experimentally over a number of benchmark domains in FOND planning, present the general policies that result in some of these domains, and prove their correctness. The method for learning general policies for FOND planning can actually be seen as an alternative planning FOND planning method that searches for solutions not in the given state space but in an abstract state space defined by features that must be learned as well.
We present an A*-based algorithm to compute policies for finite-horizon Dec-POMDPs. Our goal is to sacrifice optimality in favor of scalability for larger horizons. The main ingredients of our approach are (1) using clustered sliding window memory, (2) pruning the A* search tree, and (3) using novel A* heuristics. Our experiments show competitive performance to the state-of-the-art. Moreover, for multiple benchmarks, we achieve superior performance. In addition, we provide an A* algorithm that finds upper bounds for the optimum, tailored towards problems with long horizons. The main ingredient is a new heuristic that periodically reveals the state, thereby limiting the number of reachable beliefs. Our experiments demonstrate the efficacy and scalability of the approach.
To plan safely in uncertain environments, agents must balance utility with safety constraints. Safe planning problems can be modeled as a chance-constrained partially observable Markov decision process (CC-POMDP) and solutions often use expensive rollouts or heuristics to estimate the optimal value and action-selection policy. This work introduces the ConstrainedZero policy iteration algorithm that solves CC-POMDPs in belief space by learning neural network approximations of the optimal value and policy with an additional network head that estimates the failure probability given a belief. This failure probability guides safe action selection during online Monte Carlo tree search (MCTS). To avoid overemphasizing search based on the failure estimates, we introduce Δ-MCTS, which uses adaptive conformal inference to update the failure threshold during planning. The approach is tested on a safety-critical POMDP benchmark, an aircraft collision avoidance system, and the sustainability problem of safe CO₂ storage. Results show that by separating safety constraints from the objective we can achieve a target level of safety without optimizing the balance between rewards and costs.
In recent years, there has been growing interest in utilizing modern machine learning techniques to learn heuristic functions for forward search algorithms. Despite this, there has been little theoretical understanding of what they should learn, how to train them, and why we do so. This lack of understanding has resulted in the adoption of diverse training targets (suboptimal vs optimal costs vs admissible heuristics) and loss functions (e.g., square vs absolute errors) in the literature. In this work, we focus on how to effectively utilize the information provided by admissible heuristics in heuristic learning. We argue that learning from poly-time admissible heuristics by minimizing mean square errors (MSE) is not the correct approach, since its result is merely a noisy, inadmissible copy of an efficiently computable heuristic. Instead, we propose to model the learned heuristic as a truncated gaussian, where admissible heuristics are used not as training targets but as lower bounds of this distribution. This results in a different loss function from the MSE commonly employed in the literature, which implicitly models the learned heuristic as a gaussian distribution. We conduct experiments where both MSE and our novel loss function are applied to learning a heuristic from optimal plan costs. Results show that our proposed method converges faster during training and yields better heuristics.
We consider a problem of placing generators of rewards to be collected by randomly moving agents in a network. In many settings, the precise mobility pattern may be one of several possible, based on parameters outside our control, such as weather conditions. The placement should be robust to this uncertainty, to gain a competent total reward across possible networks. To study such scenarios, we introduce the Robust Reward Placement problem (RRP). Agents move randomly by a Markovian Mobility Model with a predetermined set of locations whose connectivity is chosen adversarially from a known set Π of candidates. We aim to select a set of reward states within a budget that maximizes the minimum ratio, among all candidates in Π, of the collected total reward over the optimal collectable reward under the same candidate. We prove that RRP is NP-hard and inapproximable, and develop Ψ-Saturate, a pseudo-polynomial time algorithm that achieves an ϵ-additive approximation by exceeding the budget constraint by a factor that scales as O(ln |Π|/ϵ). In addition, we present several heuristics, most prominently one inspired by a dynamic programming algorithm for the max–min 0–1 KNAPSACK problem. We corroborate our theoretical analysis with an experimental evaluation on synthetic and real data.
The paper studies information-theoretic opacity, an information-flow privacy property, in a setting involving two agents: A planning agent who controls a stochastic system and an observer who partially observes the system states. The goal of the observer is to infer some secret, represented by a random variable, from its partial observations, while the goal of the planning agent is to make the secret maximally opaque to the observer while achieving a satisfactory total return. Modeling the stochastic system using a Markov decision process, two classes of opacity properties are considered---Last-state opacity is to ensure that the observer is uncertain if the last state is in a specific set and initial-state opacity is to ensure that the observer is unsure of the realization of the initial state. As the measure of opacity, we employ the Shannon conditional entropy capturing the information about the secret revealed by the observable. Then, we develop primal-dual policy gradient methods for opacity-enforcement planning subject to constraints on total returns. We propose novel algorithms to compute the policy gradient of entropy for each observation, leveraging message passing within the hidden Markov models. This gradient computation enables us to have stable and fast convergence. We demonstrate our solution of opacity-enforcement control through a grid world example.
Hub Labeling and A* are two well-established algorithms for shortest path computation in large graphs. Hub Labeling offers excellent query times for distance computation, but at the cost of a high space consumption for label storage. Landmark-based A* search requires less space but answers queries much slower. Recently, Landmark Hub Labeling (LHL) has been proposed, which combines both concepts and achieves a smaller space consumption than Hub Labeling and also much better query times than A*. However, the known algorithms for computing a LHL do not scale to large graphs, limiting its applicability. In this paper, we devise novel algorithms for LHL construction that work on graphs with millions of edges. We also further improve the LHL query answering algorithm and investigate how to reduce the space consumption of labeling techniques by performing bounded suboptimal pathfinding. In an extensive experimental study, we demonstrate the effectiveness of our methods and illuminate that sensible trade-offs between space consumption, query time, and path quality can be achieved with LHL.
Building upon recent advancements in formalising Fully Observable Non-Deterministic (FOND) Hierarchical Task Network (HTN) planning, we present the first approach to find strong solutions for HTN problems with uncertainty in action outcomes. We present a search algorithm, along with a compilation that relaxes a FOND HTN problem to a deterministic one. This allows the utilisation of existing grounders and heuristics from the deterministic HTN planning literature.
The Capacitated Location Routing Problem is an important planning and routing problem in logistics, which generalizes the capacitated vehicle routing problem and the uncapacitated facility location problem. In this problem, we are given a set of depots and a set of customers where each depot has an opening cost and each customer has a demand, and we need to use minimum cost to open some depots and route capacitated vehicles in the opened depots to satisfy all customers' demand. In this paper, we propose a 4.169-approximation algorithm for this problem, improving the best-known 4.38-approximation ratio (Transportation Science 2013). Moreover, if the demand of each customer is allowed to be delivered by multiple tours, we propose a more refined 4.092-approximation algorithm. Experimental study on benchmark instances shows that the quality of our computed solutions is better than that of previous algorithms and is also much closer to optimality than the provable approximation factor.
The bipartite traveling tournament problem (BTTP) was initially introduced by Hoshino and Kawarabayashi (AAAI 2011) to address inter-league sports scheduling, which aims to design a feasible bipartite tournament between two n-team leagues under some constraints such that the total traveling distance of all participating teams is minimized. Since its introduction, several heuristic methods have been developed to design feasible schedules for NBA, NPB and so on. In terms of solution quality with a theoretical guarantee, previously only a (2+ε) approximation is known for the case that n≡0 (mod 3). Whether there are similar results for the cases that n≡1 (mod 3) and n≡2 (mod 3) was asked in the literature. In this paper, we answer this question positively by proposing a (3/2+ε)-approximation algorithm for any n and any constant ε>0, which also improves the previous ratio for the case that n≡0 (mod 3).
Any-angle path planning is an extension of traditional path-planning algorithms that aims to generate smoother and shorter paths in graphs by allowing any-angle moves between vertices, rather than being restricted by edges. Many any-angle path-planning algorithms have been proposed, such as Theta*, Block A* and Anya, but most of them are designed only for static environments, which is not applicable when dynamic obstacles are present. Time-Optimal Any-Angle Safe-Interval Path Planning (TO-AA-SIPP) was developed to fill this gap, which can find an optimal collision-free any-angle path that minimizes the traversal time. However, as indicated by its authors, TO-AA-SIPP may not be efficient enough to be used in multi-agent pathfinding (MAPF). Therefore, this paper presents a new algorithm Zeta*-SIPP to improve TO-AA-SIPP by means of 1) reducing useless search nodes that have no contribution to finding optimal solutions, and 2) introducing Field of View (FoV) instead of Line of Sight (LoS) to speed up visibility checks with static obstacles. Benchmark experiments showed that Zeta*-SIPP reduced the computation time of TO-AA-SIPP by around 70%-90% on average.