Total: 14

We propose a new class of computationally fast algorithms to find close to optimal policy for Markov Decision Processes (MDP) with large finite horizon T.The main idea is that instead of planning until the time horizon T, we plan only up to a truncated horizon H << T and use an estimate of the true optimal value function as the terminal value. Our approach of finding the terminal value function is to learn a mapping from an MDP to its value function by solving many similar MDPs during a training phase and fit a regression estimator. We analyze the method by providing an error propagation theorem that shows the effect of various sources of errors to the quality of the solution. We also empirically validate this approach in a real-world application of designing an energy management system for Hybrid Electric Vehicles with promising results.

We consider optimal route planning when the objective function is a general nonlinear and non-monotonic function. Such an objective models user behavior more accurately, for example, when a user is risk-averse, or the utility function needs to capture a penalty for early arrival. It is known that as non-linearity arises, the problem can become NP-hard and little is known on computing optimal solutions when in addition there is no monotonicity guarantee. We show that an approximately optimal non-simple path can be efficiently computed under some natural constraints. In particular, we provide a fully polynomial approximation scheme under hop constraints. Our approximation algorithm can extend to run in pseudo-polynomial time under an additional linear constraint that sometimes is useful. As a by-product, we show that our algorithm can be applied to the problem of finding a path that is most likely to be on time for a given deadline.

The global growth in urbanisation increases the demand for services including road transport infrastructure, presenting challenges in terms of mobility. In this scenario, optimising the exploitation of urban road networks is a pivotal challenge. Existing urban traffic control approaches, based on complex mathematical models, can effectively deal with planned-ahead events, but are not able to cope with unexpected situations --such as roads blocked due to car accidents or weather-related events-- because of their huge computational requirements. Therefore, such unexpected situations are mainly dealt with manually, or by exploiting pre-computed policies. Our goal is to show the feasibility of using mixed discrete-continuous planning to deal with unexpected circumstances in urban traffic control. We present a PDDL+ formulation of urban traffic control, where continuous processes are used to model flows of cars, and show how planning can be used to efficiently reduce congestion of specified roads by controlling traffic light green phases. We present simulation results on two networks (one of them considers Manchester city centre) that demonstrate the effectiveness of the approach, compared with fixed-time and reactive techniques.

We cast the Proactive Learning (PAL) problem—Active Learning (AL) with multiple reluctant, fallible, cost-varying oracles—as a Partially Observable Markov Decision Process (POMDP). The agent selects an oracle at each time step to label a data point, while it maintains a belief over the true underlying correctness of its current dataset’s labels. The goal is to minimize labeling costs while considering the value of obtaining correct labels, thus maximizing final resultant classifier accuracy. We prove three properties that show our particular formulation leads to a structured and bounded-size set of belief points, enabling strong performance of point-based methods to solve the POMDP. Our method is compared with the original three algorithms proposed by Donmez and Carbonell and a simple baseline. We demonstrate that our approach matches or improves upon the original approach within five different oracle scenarios, each on two datasets. Finally, our algorithm provides a general, well-defined mathematical foundation to build upon.

The Temporal Network with Uncertainty (TNU) modeling framework is used to represent temporal knowledge in presence of qualitative temporal uncertainty. Dynamic Controllability (DC) is the problem of deciding the existence of a strategy for scheduling the controllable time points of the network observing past happenings only. In this paper, we address the DC problem for a very general class of TNU, namely Disjunctive Temporal Network with Uncertainty. We make the following contributions. First, we define strategies in the form of an executable language; second, we propose the first decision procedure to check whether a given strategy is a solution for the DC problem; third we present an efficient algorithm for strategy synthesis based on techniques derived from Timed Games and Satisfiability Modulo Theory. The experimental evaluation shows that the approach is superior to the state-of-the-art.

Partially Observable Markov Decision Processes (POMDPs) are often used to model planning problems under uncertainty. The goal in Risk-Sensitive POMDPs (RS-POMDPs) is to find a policy that maximizes the probability that the cumulative cost is within some user-defined cost threshold. In this paper, unlike existing POMDP literature, we distinguish between the two cases of whether costs can or cannot be observed and show the empirical impact of cost observations. We also introduce a new search-based algorithm to solve RS-POMDPs and show that it is faster and more scalable than existing approaches in two synthetic domains and a taxi domain generated with real-world data.

Goal recognition design involves the offline analysis of goal recognition models by formulating measures that assess the ability to perform goal recognition within a model and finding efficient ways to compute and optimize them. In this work we relax the full observability assumption of earlier work by offering a new generalized model for goal recognition design with non-observable actions. A model with partial observability is relevant to goal recognition applications such as assisted cognition and security, which suffer from reduced observability due to sensor malfunction or lack of sufficient budget. In particular we define a worst case distinctiveness (wcd) measure that represents the maximal number of steps an agent can take in a system before the observed portion of his trajectory reveals his objective. We present a method for calculating wcd based on a novel compilation to classical planning and propose a method to improve the design using sensor placement. Our empirical evaluation shows that the proposed solutions effectively compute and improve wcd.

Uncertainty in activity durations is a key characteristic of many real world scheduling problems in manufacturing, logistics and project management. RCPSP/max with durational uncertainty is a general model that can be used to represent durational uncertainty in a wide variety of scheduling problems where there exist resource constraints. However, computing schedules or execution strategies for RCPSP/max with durational uncertainty is NP-hard and hence we focus on providing approximation methods in this paper. We provide a principled approximation approach based on Sample Average Approximation (SAA) to compute proactive schedules for RCPSP/max with durational uncertainty. We further contribute an extension to SAA for improving scalability significantly without sacrificing on solution quality. Not only is our approach able to compute schedules at comparable runtimes as existing approaches, it also provides lower α-quantile makespan (also referred to as α-robust makespan) values than the best known approach on benchmark problems from the literature.

In cooperative multi-agent sequential decision making under uncertainty, agents must coordinate to find an optimal joint policy that maximises joint value. Typical algorithms exploit additive structure in the value function, but in the fully-observable multi-agent MDP (MMDP) setting such structure is not present. We propose a new optimal solver for transition-independent MMDPs, in which agents can only affect their own state but their reward depends on joint transitions. We represent these dependencies compactly in conditional return graphs (CRGs). Using CRGs the value of a joint policy and the bounds on partially specified joint policies can be efficiently computed. We propose CoRe, a novel branch-and-bound policy search algorithm building on CRGs. CoRe typically requires less runtime than available alternatives and finds solutions to previously unsolvable problems.

In contingent planning under partial observability with sensing actions, agents actively use sensing to discover meaningful facts about the world. For this class of problems the solution can be represented as a plan tree, branching on various possible observations. Recent successful approaches translate the partially observable contingent problem into a non-deterministic fully observable problem, and then use a planner for non-deterministic planning. While this approach has been successful in many domains, the translation may become very large, encumbering the task of the non-deterministic planner. In this paper we suggest a different approach - using an online contingent solver repeatedly to construct a plan tree. We execute the plan returned by the online solver until the next observation action, and then branch on the possible observed values, and replan for every branch independently. In many cases a plan tree can be exponential in the number of state variables, but still, the tree has a structure that allows us to compactly represent it using a directed graph. We suggest a mechanism for tailoring such a graph that reduces both the computational effort and the storage space. Furthermore, unlike recent state of the art offline planners, our approach is not bounded to a specific class of contingent problems, such as limited problem width, or simple contingent problems. We present a set of experiments, showing our approach to scale better than state of the art offline planners.

Goal Recognition Design involves identifying the best ways to modify an underlying environment that agents operate in, typically by making asubset of feasible actions infeasible, so that agents are forced to reveal their goals as early as possible. Thus far, existing work has focused exclusively on imperative classical planning. In this paper, we address the same problem with a different paradigm, namely, declarative approaches based on Answer Set Programming (ASP). Our experimental results show that one of our ASP encodings is more scalable and is significantly faster by up to three orders of magnitude than thecurrent state of the art.

Policy Iteration (PI) (Howard 1960) is a classical method for computing an optimal policy for a finite Markov Decision Problem (MDP). The method is conceptually simple: starting from some initial policy, “policy improvement” is repeatedly performed to obtain progressively dominating policies, until eventually, an optimal policy is reached. Being remarkably efficient in practice, PI is often favoured over alternative approaches such as Value Iteration and Linear Programming. Unfortunately, even after several decades of study, theoretical bounds on the complexity of PI remain unsatisfactory. For an MDP with n states and k actions, Mansour and Singh (1999) bound the number of iterations taken by Howard’s PI, the canonical variant of the method, by O(kn / n). This bound merely improves upon the trivial bound of kn by a linear factor. However, a randomised variant of PI introduced by Mansour and Singh (1999) does yield an exponential improvement, with its expected number of iterations bounded by O(((1 + 2/log2(k)) k / 2)n).With the objective of furnishing improved upper bounds for PI, we introduce two randomised procedures in this paper. Our first contribution is a routine to find a good initial policy for PI. After evaluating a number of randomly generated policies, this procedure applies a novel criterion to pick one to initialise PI. When PI is subsequently applied, we show that the expected number of policy evaluations—including both the initialisation and the improvement stages—remains bounded in expectation by O(kn/2). The key construction employed in this routine is a total order on the set of policies. Our second contribution is a randomised action-switching rule for PI, which admits a bound of O((2 + ln(k – 1))n) on the expected number of iterations. To the best of our knowledge, this is the tightest complexity bound known for PI when k >= 3.

We study transportation problems where robots have to deliver packages and can transfer the packages among each other. Specifically, we study the package-exchange robot-routing problem (PERR), where each robot carries one package, any two robots in adjacent locations can exchange their packages, and each package needs to be delivered to a given destination. We prove that exchange operations make all PERR instances solvable. Yet, we also show that PERR is NP-hard to approximate within any factor less than 4/3 for makespan minimization and is NP-hard to solve for flowtime minimization, even when there are only two types of packages. Our proof techniques also generate new insights into other transportation problems, for example, into the hardness of approximating optimal solutions to the standard multi-agent path-finding problem (MAPF). Finally, we present optimal and suboptimal PERR solvers that are inspired by MAPF solvers, namely a flow-based ILP formulation and an adaptation of conflict-based search. Our empirical results demonstrate that these solvers scale well and that PERR instances often have smaller makespans and flowtimes than the corresponding MAPF instances.

We consider recently-derived error bounds that can be used to bound the quality of solutions found by heuristic search algorithms for stochastic shortest path problems. In their original form, the bounds can only be used for problems with positive action costs. We show how to generalize the bounds so that they can be used in solving any stochastic shortest path problem, regardless of cost structure. In addition, we introduce a simple new heuristic search algorithm that performs as well or better than previous algorithms for this class of problems, while being easier to implement and analyze.