We devise a method to exactly compute the length of the longest simple path in factored state spaces, such as those encountered in classical planning. Although this problem is NEXP-hard, we show that our method can be used to compute practically useful upper bounds on lengths of plans. We show that the computed upper bounds are significantly better than bounds produced by state-of-the-art bounding techniques and that they can be used to improve SAT-based planning.

We consider the problem of designing policies for Markov decision processes (MDPs) with dynamic coherent risk objectives and constraints. We begin by formulating the problem in a Lagrangian framework. Under the assumption that the risk objectives and constraints can be represented by a Markov risk transition mapping, we propose an optimization-based method to synthesize Markovian policies that lower-bound the constrained risk-averse problem. We demonstrate that the formulated optimization problems are in the form of difference convex programs (DCPs) and can be solved by the disciplined convex-concave programming (DCCP) framework. We show that these results generalize linear programs for constrained MDPs with total discounted expected costs and constraints. Finally, we illustrate the effectiveness of the proposed method with numerical experiments on a rover navigation problem involving conditional-value-at-risk (CVaR) and entropic-value-at-risk (EVaR) coherent risk measures.

Contract scheduling is a general technique for designing a system with interruptible capabilities, given an algorithm that is not necessarily interruptible. Previous work on this topic has largely assumed that the interruption is a worst-case deadline that is unknown to the scheduler. In this work, we study the setting in which there is a potentially erroneous prediction concerning the interruption. Specifically, we consider the setting in which the prediction describes the time at which the interruption occurs, as well as the setting in which the prediction is obtained as a response to a single or multiple binary queries. For both settings, we investigate tradeoffs between the robustness (i.e., the worst-case performance assuming adversarial prediction) and the consistency (i.e., the performance assuming that the prediction is error-free), in terms of both positive and negative results.
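
As a minimal sketch of the classical setting without predictions (not the method of this paper), the code below measures the acceleration ratio of the well-known doubling schedule, in which contract lengths are 1, 2, 4, and so on. The function name `acceleration_ratio` and the list-based schedule encoding are assumptions made for illustration.

```python
def acceleration_ratio(contracts):
    """Worst-case ratio between the interruption time and the longest
    contract completed by then, for a finite prefix of a schedule.

    The worst interruptions occur just before a contract finishes, so it
    suffices to compare each finish time with the longest contract that
    was completed earlier.
    """
    worst = 0.0
    finish = 0.0        # completion time of the current contract
    longest_done = 0.0  # longest contract completed before the current one
    for length in contracts:
        finish += length
        if longest_done > 0:
            worst = max(worst, finish / longest_done)
        longest_done = max(longest_done, length)
    return worst


# Doubling schedule: contract i has length 2**i.
doubling = [2.0 ** i for i in range(20)]
ratio = acceleration_ratio(doubling)
```

For the doubling schedule the ratio approaches 4 from below as the prefix grows, which is the worst-case guarantee that prediction-based schedules trade off against.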

We consider the problem of responsibility attribution in the setting of parametric Markov chains. Given a family of Markov chains over a set of parameters, and a property, responsibility attribution asks how the difference in the value of the property should be attributed to the parameters when they change from one point in the parameter space to another. We formalize responsibility as path-based attribution schemes studied in cooperative game theory. An attribution scheme in a game determines how a value (a surplus or a cost) is distributed among a set of participants. Path-based attribution schemes include the well-studied Aumann-Shapley and the Shapley-Shubik schemes. In our context, an attribution scheme measures the responsibility of each parameter on the value function of the parametric Markov chain. We study the decision problem for path-based attribution schemes. Our main technical result is an algorithm for deciding if a path-based attribution scheme for a rational (ratios of polynomials) cost function is over a rational threshold. In particular, it is decidable if the Aumann-Shapley value for a player is at least a given rational number. As a consequence, we show that responsibility attribution is decidable for parametric Markov chains and for a general class of properties that include expectation and variance of discounted sum and long-run average rewards, as well as specifications in temporal logic.
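
To make the notion of a path-based attribution scheme concrete, the following sketch computes Shapley-Shubik shares for a small polynomial value function by averaging marginal changes over all orders in which the parameters move from one point to the other. It is a brute-force illustration and not the decision procedure of the paper; the function name `shapley_shubik` and the example function `f` are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def shapley_shubik(f, start, end):
    """Attribute f(end) - f(start) to the individual parameters.

    For every ordering of the parameters, switch them one at a time from
    their start value to their end value and credit each parameter with
    the change in f that its switch causes; then average over orderings.
    """
    n = len(start)
    shares = [Fraction(0)] * n
    orders = list(permutations(range(n)))
    for order in orders:
        point = list(start)
        for i in order:
            before = f(point)
            point[i] = end[i]
            shares[i] += f(point) - before
    return [s / len(orders) for s in shares]


# Hypothetical value function over two parameters: f(x, y) = x*y + x.
f = lambda p: p[0] * p[1] + p[0]
shares = shapley_shubik(f, (Fraction(0), Fraction(0)), (Fraction(1), Fraction(1)))
```

The shares always sum to the total change in the value function, which is the efficiency property that makes such schemes suitable for attributing responsibility.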

Symbolic search has proven to be a useful approach to optimal classical planning. In Hierarchical Task Network (HTN) planning, however, there is little work on optimal planning. One reason for this is that in HTN planning, most algorithms are based on heuristic search, and admissible heuristics have to incorporate the structure of the task network in order to be informative. In this paper, we present a novel approach to optimal (totally-ordered) HTN planning, which is based on symbolic search. An empirical analysis shows that our symbolic approach outperforms the current state of the art for optimal totally-ordered HTN planning.

The NP-hard Material Consumption Scheduling Problem and closely related problems have been thoroughly studied since the 1980s. Roughly speaking, the problem deals with minimizing the makespan when scheduling jobs that consume non-renewable resources. We focus on the single-machine case without preemption: from time to time, the resources of the machine are (partially) replenished, thus meeting a necessary precondition for processing further jobs, each of which has individual resource demands. We initiate a systematic exploration of the parameterized computational complexity landscape of the problem, providing parameterized tractability as well as intractability results. In doing so, we mainly investigate how parameters related to the resource supplies influence the computational complexity. We thereby gain a deeper understanding of this fundamental scheduling problem.

It has been observed that in many of the benchmark planning domains, atomic goals can be reached with a simple polynomial exploration procedure, called IW, that runs in time exponential in the problem width. Such problems indeed have a bounded width: a width that does not grow with the number of problem variables and is often no greater than two. Yet, while the notion of width has become part of state-of-the-art planning algorithms like BFWS, there is still no good explanation for why so many benchmark domains have bounded width. In this work, we address this question by relating bounded width and serialized width to ideas of generalized planning, where general policies aim to solve multiple instances of a planning problem all at once. We show that bounded width is a property of planning domains that admit optimal general policies in terms of features that are explicitly or implicitly represented in the domain encoding. The results are extended to the larger class of domains with bounded serialized width where the general policies do not have to be optimal. The study also leads to a new simple, meaningful, and expressive language for specifying domain serializations in the form of policy sketches which can be used for encoding domain control knowledge by hand or for learning it from traces. The use of sketches and the meaning of the theoretical results are all illustrated through a number of examples.
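
For readers unfamiliar with IW, the sketch below shows IW(1), the width-1 instance of the procedure: a breadth-first search that prunes every generated state that does not make at least one atom true for the first time. The state encoding (frozensets of atoms) and the toy corridor domain are assumptions chosen for illustration, not taken from the paper.

```python
from collections import deque


def iw1(initial, successors, is_goal):
    """Breadth-first search with novelty-1 pruning.

    A generated state is kept only if it contains an atom that has not
    appeared in any previously generated state.
    """
    seen_atoms = set(initial)
    queue = deque([(initial, [])])
    while queue:
        state, plan = queue.popleft()
        if is_goal(state):
            return plan
        for action, nxt in successors(state):
            new_atoms = set(nxt) - seen_atoms
            if not new_atoms:
                continue  # no new atom: prune
            seen_atoms |= new_atoms
            queue.append((nxt, plan + [action]))
    return None


# Toy domain (hypothetical): an agent walking along cells 0..5 of a corridor.
def corridor_successors(state):
    (_, pos), = state
    if pos > 0:
        yield "left", frozenset({("at", pos - 1)})
    if pos < 5:
        yield "right", frozenset({("at", pos + 1)})


plan = iw1(
    frozenset({("at", 0)}),
    corridor_successors,
    lambda s: ("at", 4) in s,
)
```

Because each kept state must contribute a new atom, the number of expanded states is bounded by the number of atoms, which is why IW(1) runs in time linear in that number.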

Successor-style representations have many advantages for reinforcement learning: for example, they can help an agent generalize from past experience to new goals, and they have been proposed as explanations of behavioral and neural data from human and animal learners. They also form a natural bridge between model-based and model-free RL methods: like the former they make predictions about future experiences, and like the latter they allow efficient prediction of total discounted rewards. However, successor-style representations are not optimized to generalize across policies: typically, we maintain a limited-length list of policies, and share information among them by representation learning or GPI. Successor-style representations also typically make no provision for gathering information or reasoning about latent variables. To address these limitations, we bring together ideas from predictive state representations, belief space value iteration, successor features, and convex analysis: we develop a new, general successor-style representation, together with a Bellman equation that connects multiple sources of information within this representation, including different latent states, policies, and reward functions. The new representation is highly expressive: for example, it lets us efficiently read off an optimal policy for a new reward function, or a policy that imitates a new demonstration. For this paper, we focus on exact computation of the new representation in small, known environments, since even this restricted setting offers plenty of interesting questions. Our implementation does not scale to large, unknown environments --- nor would we expect it to, since it generalizes POMDP value iteration, which is difficult to scale. However, we believe that future work will allow us to extend our ideas to approximate reasoning in large, unknown environments. We conduct experiments to explore which of the potential barriers to scaling are most pressing.

We address the problem of efficient exploration for transition model learning in the relational model-based reinforcement learning setting without extrinsic goals or rewards. Inspired by human curiosity, we propose goal-literal babbling (GLIB), a simple and general method for exploration in such problems. GLIB samples relational conjunctive goals that can be understood as specific, targeted effects that the agent would like to achieve in the world, and plans to achieve these goals using the transition model being learned. We provide theoretical guarantees showing that exploration with GLIB will converge almost surely to the ground truth model. Experimentally, we find GLIB to strongly outperform existing methods in both prediction and planning on a range of tasks, encompassing standard PDDL and PPDDL planning benchmarks and a robotic manipulation task implemented in the PyBullet physics simulator. Video: https://youtu.be/F6lmrPT6TOY Code: https://git.io/JIsTB

Uncertain partially observable Markov decision processes (uPOMDPs) allow the probabilistic transition and observation functions of standard POMDPs to belong to a so-called uncertainty set. Such uncertainty, referred to as epistemic uncertainty, captures uncountable sets of probability distributions caused by, for instance, a lack of data available. We develop an algorithm to compute finite-memory policies for uPOMDPs that robustly satisfy specifications against any admissible distribution. In general, computing such policies is theoretically and practically intractable. We provide an efficient solution to this problem in four steps. (1) We state the underlying problem as a nonconvex optimization problem with infinitely many constraints. (2) A dedicated dualization scheme yields a dual problem that is still nonconvex but has finitely many constraints. (3) We linearize this dual problem and (4) solve the resulting finite linear program to obtain locally optimal solutions to the original problem. The resulting problem formulation is exponentially smaller than those resulting from existing methods. We demonstrate the applicability of our algorithm using large instances of an aircraft collision-avoidance scenario and a novel spacecraft motion planning case study.

Generalized planning is concerned with the computation of general policies that solve multiple instances of a planning domain all at once. It has been recently shown that these policies can be computed in two steps: first, a suitable abstraction in the form of a qualitative numerical planning problem (QNP) is learned from sample plans, then the general policies are obtained from the learned QNP using a planner. In this work, we introduce an alternative approach for computing more expressive general policies which does not require sample plans or a QNP planner. The new formulation is very simple and can be cast in terms that are more standard in machine learning: a large but finite pool of features is defined from the predicates in the planning examples using a general grammar, and a small subset of features is sought for separating “good” from “bad” state transitions, and goals from non-goals. The problems of finding such a “separating surface” while labeling the transitions as “good” or “bad” are jointly addressed as a single combinatorial optimization problem expressed as a Weighted Max-SAT problem. The advantage of looking for the simplest policy in the given feature space that solves the given examples, possibly non-optimally, is that many domains have no general, compact policies that are optimal. The approach yields general policies for a number of benchmark domains.

In classical planning as search, duplicate state pruning is a standard method to avoid unnecessarily handling the same state multiple times. In decoupled search, similar to symbolic search approaches, search nodes, called decoupled states, do not correspond to individual states, but to sets of states. Therefore, duplicate state pruning is less effective in decoupled search, and dominance pruning is employed, taking into account the state sets. We observe that the time required for dominance checking dominates the overall runtime, and propose two ways to tackle this issue. Our main contribution is a stronger variant of dominance checking for optimal planning, where efficiency and pruning power are most crucial. The new variant greatly improves the latter, without incurring a computational overhead. Moreover, we develop three methods that make the dominance check more efficient: exact duplicate checking, which, albeit resulting in weaker pruning, can pay off due to the use of hashing; avoiding the dominance check in non-optimal planning if leaf state spaces are invertible; and exploiting the transitivity of the dominance relation to only check against the relevant subset of visited decoupled states. We show empirically that all our improvements are indeed beneficial in many standard benchmarks.

We introduce a natural but seemingly unstudied generalization of the problem of scheduling jobs on a single machine so as to minimize the number of tardy jobs. Our generalization lies in considering several instances of the problem at once. In particular, we have n clients over a period of m days, where each client has a single job with its own processing time and deadline per day. Our goal is to provide a schedule for each of the m days, so that each client is guaranteed to have their job meet its deadline on at least k ≤ m days. This corresponds to an equitable schedule where each client is guaranteed a minimal level of service throughout the period of m days. We provide a thorough analysis of the computational complexity of three main variants of this problem, identifying both efficient algorithms and worst-case intractability results.
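
The single-day base problem that this work generalizes can be solved with the classical Moore-Hodgson algorithm. The sketch below implements that base case only, not the multi-day equitable variant; the function name and the `(processing_time, deadline)` job encoding are assumptions for illustration.

```python
import heapq


def max_on_time_jobs(jobs):
    """Moore-Hodgson: maximum number of jobs that can meet their deadline
    on a single machine. `jobs` is a list of (processing_time, deadline).
    """
    scheduled = []  # max-heap of processing times (stored negated)
    total = 0       # total processing time of the jobs kept so far
    for proc, deadline in sorted(jobs, key=lambda job: job[1]):
        heapq.heappush(scheduled, -proc)
        total += proc
        if total > deadline:
            # Drop the longest job kept so far to restore feasibility.
            total += heapq.heappop(scheduled)
    return len(scheduled)


on_time = max_on_time_jobs([(2, 2), (2, 3), (1, 3)])
```

Minimizing the number of tardy jobs is equivalent to maximizing the number returned here, since every job is either on time or tardy.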

Landmarks (LMs) are state features that need to be made true or tasks that need to be contained in every solution of a planning problem. They are a valuable source of information in planning and can be exploited in various ways. LMs have been used both in classical and hierarchical planning, but while there is much work in classical planning, the techniques in hierarchical planning are less evolved. We introduce a novel LM generation method for Hierarchical Task Network (HTN) planning and show that it is sound and incomplete. We show that every complete approach is as hard as the co-class of the underlying HTN problem, i.e., coNP-hard for our setting (while our approach is in P). On a widely used benchmark set, our approach finds more than twice as many landmarks as the approach from the literature. Though our focus is on LM generation, we show that the newly discovered landmarks bear information beneficial for solvers.

Detecting redundant operators that can be safely removed from a planning task is an essential technique that can greatly improve the performance of planners. In this paper, we employ structure-preserving maps on labeled transition systems (LTSs), namely endomorphisms well known from model theory, in order to detect redundancy. Computing endomorphisms of an LTS induced by a planning task is typically infeasible, so we show how to compute some of them on concise representations of planning tasks such as finite domain representations and factored LTSs. We formulate the computation of endomorphisms as a constraint satisfaction problem (CSP) that can be solved by an off-the-shelf CSP solver. Finally, we experimentally verify that the proposed method can find a sizeable number of redundant operators on the standard benchmark set.

Motivated by the Bike Angels Program in New York's Citi Bike and Boston's Blue Bikes, we study the use of (registered) volunteers to re-position empty bikes for riders in a bike sharing system. We propose a method that can be used to deploy the volunteers in the system, based on the real time distribution of the bikes in different stations. To account for (random) route demand in the network, we solve a related transshipment network design model and construct a sparse structure to restrict the re-balancing activities of the volunteers (concentrating re-balancing activities on essential routes). We also develop a comprehensive simulation model using a threshold-based policy to deploy the volunteers in real time, to test the effect of choice restriction on volunteers (suitably deployed) to re-position bikes. We use the Hubway system in Boston (with 60 stations) to demonstrate that using a sparse structure to concentrate the re-balancing activities of the volunteers, instead of allowing all admissible flows in the system (as in current practice), can reduce the number of re-balancing moves by a huge amount, losing only a small proportion of demand satisfied.

This paper presents a Branch and Price approach for a real-life Bus Driver Scheduling problem with a complex set of break constraints. The column generation uses a set partitioning model as master problem and a resource constrained shortest path problem as subproblem. Due to the complex constraints, the branch and price algorithm adopts several novel ideas to improve the column generation in the presence of a high-dimensional subproblem, including exponential arc throttling and a dedicated two-stage dominance algorithm. Evaluation on a publicly available set of benchmark instances shows that the approach provides the first provably optimal solutions for small instances, improving best-known solutions or proving them optimal for 48 out of 50 instances, and yielding an optimality gap of less than 1% for more than half the instances.

We propose an approach to learn an extensional representation of a discrete deterministic planning domain from observations in a continuous space navigated by the agent's actions. This is achieved through the use of a perception function providing the likelihood of a real-valued observation being in a given state of the planning domain after executing an action. The agent learns an extensional representation of the domain (the set of states, the transitions from states to states caused by actions) and the perception function online, while it acts to accomplish its task. In order to provide a practical approach that can scale up to large state spaces, a “draft” intensional (PDDL-based) model of the planning domain is used to guide the exploration of the environment and learn the states and state transitions. The proposed approach uses a novel algorithm to (i) construct the extensional representation of the domain by interleaving symbolic planning in the PDDL intensional representation and search in the state transition graph of the extensional representation; (ii) incrementally refine the intensional representation taking into account information about the actions that the agent cannot execute. An experimental analysis shows that the novel approach can scale up to large state spaces, thus overcoming the limits in scalability of the previous work.

Probabilistic planning subject to multi-objective probabilistic temporal logic (PLTL) constraints models the problem of computing safe and robust behaviours for agents in stochastic environments. We present novel admissible heuristics to guide the search for cost-optimal policies for these problems. These heuristics project and decompose LTL formulae obtained by progression to estimate the probability that an extension of a partial policy satisfies the constraints. Their computation with linear programming is integrated with the recent PLTL-dual heuristic search algorithm, enabling more aggressive pruning of regions violating the constraints. Our experiments show that they further widen the scalability gap between heuristic search and verification approaches to these planning problems.

Online solvers for partially observable Markov decision processes have difficulty scaling to problems with large action spaces. Monte Carlo tree search with progressive widening attempts to improve scaling by sampling from the action space to construct a policy search tree. The performance of progressive widening search is dependent upon the action sampling policy, often requiring problem-specific samplers. In this work, we present a general method for efficient action sampling based on Bayesian optimization. The proposed method uses a Gaussian process to model a belief over the action-value function and selects the action that will maximize the expected improvement in the optimal action value. We implement the proposed approach in a new online tree search algorithm called Bayesian Optimized Monte Carlo Planning (BOMCP). Several experiments show that BOMCP is better able to scale to large action space POMDPs than existing state-of-the-art tree search solvers.

Online solvers for partially observable Markov decision processes have difficulty scaling to problems with large action spaces. This paper proposes a method called PA-POMCPOW to sample a subset of the action space that provides varying mixtures of exploitation and exploration for inclusion in a search tree. The proposed method first evaluates the action space according to a score function that is a linear combination of expected reward and expected information gain. The actions with the highest score are then added to the search tree during tree expansion. Experiments show that PA-POMCPOW is able to outperform existing state-of-the-art solvers on problems with large discrete action spaces.
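
The action-scoring step described above can be sketched as follows. The exact weighting used by PA-POMCPOW is not given here, so the convex combination with a single `weight` parameter, the function name `select_actions`, and the dictionary inputs are all assumptions made for illustration.

```python
def select_actions(actions, expected_reward, expected_info_gain, weight, k):
    """Return the k actions with the highest combined score.

    The score is a linear combination of expected reward (exploitation)
    and expected information gain (exploration); `weight` in [0, 1]
    controls the mixture (an assumed parameterization).
    """
    def score(action):
        return ((1.0 - weight) * expected_reward[action]
                + weight * expected_info_gain[action])

    return sorted(actions, key=score, reverse=True)[:k]


# Hypothetical numbers for three candidate actions.
rewards = {"a": 1.0, "b": 0.2, "c": 0.5}
info_gain = {"a": 0.0, "b": 1.0, "c": 0.4}
greedy = select_actions(["a", "b", "c"], rewards, info_gain, weight=0.0, k=2)
curious = select_actions(["a", "b", "c"], rewards, info_gain, weight=1.0, k=2)
```

Varying `weight` shifts the selected subset between reward-seeking and information-seeking actions, which is the mixture of exploitation and exploration the abstract refers to.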

Automated temporal planning is the problem of synthesizing, starting from a model of a system, a course of actions to achieve a desired goal when temporal constraints, such as deadlines, are present in the problem. Despite considerable successes in the literature, scalability is still a severe limitation for existing planners, especially when confronted with real-world, industrial scenarios. In this paper, we aim to exploit recent advances in reinforcement learning for the synthesis of heuristics for temporal planning. Starting from a set of problems of interest for a specific domain, we use a customized reinforcement learning algorithm to construct a value function that is able to estimate the expected reward for as many problems as possible. We use a reward schema that captures the semantics of the temporal planning problem and we show how the value function can be transformed into a planning heuristic for a semi-symbolic heuristic search exploration of the planning model. We show on two case studies how this method can widen the reach of current temporal planners with encouraging results.

In Hierarchical Task Network (HTN) planning, compound tasks need to be refined into executable (primitive) action sequences. In contrast to their primitive counterparts, compound tasks do not specify preconditions or effects. Thus, their implications on the states in which they are applied are not explicitly known: they are "hidden" in and depend on the decomposition structure. We formalize several kinds of preconditions and effects that can be inferred for compound tasks in totally ordered HTN domains. As a relevant special case, we introduce a problem relaxation which admits reasoning about preconditions and effects in polynomial time. We provide procedures for doing so, thereby extending previous work, which could only deal with acyclic models. We prove our procedures to be correct and complete for any totally ordered input domain. These results are embedded into an encompassing complexity analysis of the inference of preconditions and effects of compound tasks, an investigation that has not been made so far.

In this paper we give a structural characterization and extend the tractability frontier of the Simple Temporal Problem (STP) by defining the class of the Extended Simple Temporal Problem (ESTP), which augments STP with strict inequalities and monotone Boolean formulae on inequations (i.e., formulae involving the operations of conjunction, disjunction and parenthesization). A polynomial-time algorithm is provided to solve ESTP, faster than previous state-of-the-art algorithms for other extensions of STP that had been considered in the literature, all encompassed by ESTP. We show the practical competitiveness of our approach through a proof-of-concept implementation and an experimental evaluation involving also state-of-the-art SMT solvers.
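
As background for the base problem, a plain STP (without the strict inequalities and Boolean formulae that ESTP adds) is consistent exactly when its distance graph has no negative cycle, which Bellman-Ford can check. The sketch below covers only that standard check and is not the ESTP algorithm of the paper; the function name and the `(i, j, w)` constraint encoding are assumptions.

```python
def stp_consistent(num_vars, constraints):
    """Check a Simple Temporal Problem for consistency.

    `constraints` is a list of (i, j, w) triples meaning x_j - x_i <= w.
    Returns (True, assignment) if consistent, else (False, None).
    """
    # Starting every distance at 0 simulates a virtual source connected
    # to all variables by zero-weight edges.
    dist = [0.0] * num_vars
    for _ in range(num_vars + 1):
        changed = False
        for i, j, w in constraints:
            if dist[i] + w < dist[j]:
                dist[j] = dist[i] + w
                changed = True
        if not changed:
            return True, dist  # fixpoint reached: dist satisfies all constraints
    return False, None  # still relaxing: the graph has a negative cycle


# 5 <= x1 - x0 <= 10 is consistent; 5 <= x1 - x0 <= 3 is not.
ok, solution = stp_consistent(2, [(0, 1, 10), (1, 0, -5)])
bad, _ = stp_consistent(2, [(0, 1, 3), (1, 0, -5)])
```

When the problem is consistent, the returned distances are themselves a feasible assignment of time points.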

In wearable-sensor-based activity recognition, it is often assumed that the training and test samples follow the same data distribution. This assumption neglects practical scenarios where the activity patterns inevitably vary from person to person. To solve this problem, transfer learning and domain adaptation approaches are often leveraged to reduce the gaps between different participants. Nevertheless, these approaches require additional information (i.e., labeled or unlabeled data, meta-information) from the target domain during the training stage. In this paper, we introduce a novel method named Generalizable Independent Latent Excitation (GILE) for human activity recognition, which greatly enhances the cross-person generalization capability of the model. Our proposed method is superior to existing methods in the sense that it does not require any access to the target domain information. Besides, this novel model can be directly applied to various target domains without re-training or fine-tuning. Specifically, the proposed model learns to automatically disentangle domain-agnostic and domain-specific features, the former of which are expected to be invariant across various persons. To further remove correlations between the two types of features, a novel Independent Excitation mechanism is incorporated in the latent feature space. Comprehensive experimental evaluations are conducted on three benchmark datasets to demonstrate the superiority of the proposed method over the state-of-the-art solutions.