| Total: 32

Intelligent robots frequently need to explore the objects in their working environments. Modern sensors have enabled robots to learn object properties via perception of multiple modalities. However, object exploration in the real world poses a challenging trade-off between information gains and exploration action costs. Mixed observability Markov decision process (MOMDP) is a framework for planning under uncertainty, while accounting for both fully and partially observable components of the state. Robot perception frequently has to face such mixed observability. This work enables a robot equipped with an arm to dynamically construct query-oriented MOMDPs for multi-modal predicate identification (MPI) of objects. The robot's behavioral policy is learned from two datasets collected using real robots. Our approach enables a robot to explore object properties in a way that is significantly faster while improving accuracies in comparison to existing methods that rely on hand-coded exploration strategies.

We consider a single machine, a set of unit-time jobs, and a set of unit-time errors. We assume that the time-slot at which each error will occur is not known in advance but, for every error, there exists an uncertainty area during which the error will take place. In order to find out whether an error occurs in a specific time-slot, it is necessary to issue a query to that slot. In this work, we study two problems: (i) the error-query scheduling problem, whose aim is to reveal enough error-free slots with the minimum number of queries, and (ii) the lexicographic error-query scheduling problem, where we seek the earliest error-free slots with the minimum number of queries. We consider both the off-line and the on-line versions of the above problems. In the former, the whole instance and its characteristics are known in advance and we give a polynomial-time algorithm for the error-query scheduling problem. In the latter, the adversary has the power to decide, in an on-line way, the time-slot of appearance for each error. We then propose both lower bounds and algorithms whose competitive ratios asymptotically match these lower bounds.

We introduce two novel structural parameters for acyclic planning (planning restricted to instances with acyclic causal graphs): up-depth and down-depth. We show that cost-optimal acyclic planning restricted to instances with bounded domain size and bounded up- or down-depth can be solved in polynomial time. For example, many of the tractable subclasses based on polytrees are covered by our result. We analyze the parameterized complexity of planning with bounded up- and down-depth: in a certain sense, down-depth has better computational properties than up-depth. Finally, we show that computing up- and down-depth are fixed-parameter tractable problems, just as many other structural parameters that are used in computer science. We view our results as a natural step towards understanding the complexity of acyclic planning with bounded treewidth and other parameters.

In temporal planning, agents must schedule a set of events satisfying a set of predetermined constraints. These scheduling problems become more difficult when the durations of certain actions are outside the agent's control. Delay controllability is the generalized notion of whether a schedule can be constructed in the face of uncertainty if the agent eventually learns when events occur. Our work introduces the substantially more complex setting of determining variable-delay controllability, where an agent learns about events after some unknown but bounded amount of time has passed. We provide an efficient O(n^3) variable-delay controllability checker and show how to create an execution strategy for variable-delay controllability problems. To our knowledge, these essential capabilities are absent from existing controllability checking algorithms. We conclude by providing empirical evaluations of the quality of variable-delay controllability results as compared to approximations that use fixed delays to model the same problems.

Generalized planning is concerned with the characterization and computation of plans that solve many instances at once. In the standard formulation, a generalized plan is a mapping from feature or observation histories into actions, assuming that the instances share a common pool of features and actions. This assumption, however, excludes the standard relational planning domains where actions and objects change across instances. In this work, we extend the standard formulation of generalized planning to such domains. This is achieved by projecting the actions over the features, resulting in a common set of abstract actions which can be tested for soundness and completeness, and which can be used for generating general policies such as “if the gripper is empty, pick the clear block above x and place it on the table” that achieve the goal clear(x) in any Blocksworld instance. In this policy, “pick the clear block above x” is an abstract action that may represent the action Unstack(a, b) in one situation and the action Unstack(b, c) in another. Transformations are also introduced for computing such policies by means of fully observable non-deterministic (FOND) planners. The value of generalized representations for learning general policies is also discussed.

In many practical uses of reinforcement learning (RL) the set of actions available at a given state is a random variable, with realizations governed by an exogenous stochastic process. Somewhat surprisingly, the foundations for such sequential decision processes have been unaddressed. In this work, we formalize and investigate MDPs with stochastic action sets (SAS-MDPs) to provide these foundations. We show that optimal policies and value functions in this model have a structure that admits a compact representation. From an RL perspective, we show that Q-learning with sampled action sets is sound. In model-based settings, we consider two important special cases: when individual actions are available with independent probabilities, and a sampling-based model for unknown distributions. We develop polynomial-time value and policy iteration methods for both cases, and provide a polynomial-time linear programming solution for the first case.
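
The idea of Q-learning with sampled action sets can be illustrated with a minimal tabular sketch. It assumes that the learner observes the realized action set at the next state and takes the maximum only over those actions; the function name `sas_q_update`, the state and action labels, and the step-size and discount values are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def sas_q_update(q, state, action, reward, next_state, next_available,
                 alpha=0.1, gamma=0.9):
    """One tabular Q-learning step for a stochastic action set.

    Assumption: `next_available` is the action set that was actually
    realized at `next_state`, so the max ranges over that set only.
    """
    # Value of the best action among those currently available (0.0 if none).
    best_next = max((q[(next_state, a)] for a in next_available), default=0.0)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)
q[("s1", "left")] = 1.0
q[("s1", "right")] = 5.0

# Only "left" happens to be available at s1 on this step, so the higher
# value of "right" does not enter the update target.
updated = sas_q_update(q, "s0", "go", reward=0.0, next_state="s1",
                       next_available=["left"], alpha=0.5, gamma=1.0)
print(updated)
```

Restricting the maximum to the realized set is the only change relative to ordinary tabular Q-learning in this sketch.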

In this paper, we address the problem of LTL realizability and synthesis. State of the art techniques rely on so-called bounded synthesis methods, which reduce the problem to a safety game. Realizability is determined by solving synthesis in a dual game. We provide a unified view of duality, and introduce novel bounded realizability methods via reductions to reachability games. Further, we introduce algorithms, based on AI automated planning, to solve these safety and reachability games. This is the first complete approach to LTL realizability and synthesis via automated planning. Experiments illustrate that reductions to reachability games are an alternative to reductions to safety games, and show that planning can be a competitive approach to LTL realizability and synthesis.

Partially-observable Markov decision processes (POMDPs) with discounted-sum payoff are a standard framework to model a wide range of problems related to decision making under uncertainty. Traditionally, the goal has been to obtain policies that optimize the expectation of the discounted-sum payoff. A key drawback of the expectation measure is that even low probability events with extreme payoff can significantly affect the expectation, and thus the obtained policies are not necessarily risk averse. An alternate approach is to optimize the probability that the payoff is above a certain threshold, which yields risk-averse policies but ignores optimization of the expectation. We consider the expectation optimization with probabilistic guarantee (EOPG) problem where the goal is to optimize the expectation while ensuring that the payoff is above a given threshold with at least a specified probability. We present several results on the EOPG problem, including the first algorithm to solve it.
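
One way to write the EOPG objective described above as a constrained optimization is sketched below. The notation is an assumption made for illustration and is not taken from the paper: \sigma ranges over policies, Disc_\gamma is the discounted-sum payoff with discount factor \gamma, t is the threshold, and 1 - \alpha is the required probability. The abstract does not say whether "above" is strict, so the non-strict inequality is also an assumption.

```latex
\[
  \max_{\sigma} \; \mathbb{E}^{\sigma}\!\left[ \mathrm{Disc}_{\gamma} \right]
  \quad \text{subject to} \quad
  \Pr\nolimits^{\sigma}\!\left[ \mathrm{Disc}_{\gamma} \ge t \right] \;\ge\; 1 - \alpha
\]
```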

We consider the stochastic shortest path (SSP) problem for succinct Markov decision processes (MDPs), where the MDP consists of a set of variables, and a set of nondeterministic rules that update the variables. First, we show that several examples from the AI literature can be modeled as succinct MDPs. Then we present computational approaches for upper and lower bounds for the SSP problem: (a) for computing upper bounds, our method is polynomial-time in the implicit description of the MDP; (b) for lower bounds, we present a polynomial-time (in the size of the implicit description) reduction to quadratic programming. Our approach is applicable even to infinite-state MDPs. Finally, we present experimental results to demonstrate the effectiveness of our approach on several classical examples from the AI literature.

Problem difficulty for greedy best first search (GBFS) is not entirely understood, though existing work points to deep local minima and poor correlation between the h-values and the distance to goal as factors that have a significant negative effect on the search effort. In this work, we show that there is a very strong exponential correlation between the depth of the single deepest local minimum encountered in a search and the overall search effort. Furthermore, we find that the distribution of local minima depth changes dramatically based on the constrainedness of problems, suggesting an explanation for the previously observed heavy-tailed behavior in GBFS. In combinatorial search, a similar result led to the use of randomized restarts to escape deep subtrees with no solution and corresponding significant speed-ups. We adapt this method and propose a randomized restarting GBFS variant that improves GBFS performance by escaping deep local minima, and does so even in the presence of other, randomization-based, search enhancements.

For a given state space and admissible heuristic function h there is always a tie-breaking strategy for which A* expands the minimum number of states [Dechter and Pearl, 1985]. We say that these strategies have optimal expansion. Although such a strategy always exists, it may depend on the instance, and we currently do not know a tie-breaker that always guarantees optimal expansion. In this paper, we study tie-breaking strategies for A*. We analyze common strategies from the literature and prove that they do not have optimal expansion. We propose a novel tie-breaking strategy using cost adaptation that always has optimal expansion. We experimentally analyze the performance of A* using several tie-breaking strategies on domains from the IPC and zero-cost domains. Our best strategy solves significantly more instances than the standard method in the literature and more than the previous state-of-the-art strategy. Our analysis improves the understanding of how to develop effective tie-breaking strategies and our results also improve the state-of-the-art of tie-breaking strategies for A*.
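
The effect of tie-breaking on the number of expansions can be seen on a toy instance. The sketch below is a plain A* with a pluggable tie-breaker among states of equal f-value; it compares two simple h-based rules and does not implement the cost-adaptation strategy proposed in the paper. The graph, heuristic values, and function names are made up for illustration.

```python
import heapq
from itertools import count


def astar(graph, h, start, goal, tie_break):
    """A* over an explicit graph; tie_break(g, h) orders states with equal f."""
    tick = count()  # FIFO order as the last resort, keeps heap entries comparable
    open_heap = [(h[start], tie_break(0, h[start]), next(tick), 0, start)]
    best_g = {start: 0}
    expanded = []
    while open_heap:
        _, _, _, g, state = heapq.heappop(open_heap)
        if g > best_g[state]:
            continue  # stale entry, a cheaper path to this state was found
        if state == goal:
            return g, expanded
        expanded.append(state)
        for succ, cost in graph.get(state, []):
            new_g = g + cost
            if new_g < best_g.get(succ, float("inf")):
                best_g[succ] = new_g
                entry = (new_g + h[succ], tie_break(new_g, h[succ]),
                         next(tick), new_g, succ)
                heapq.heappush(open_heap, entry)
    return None, expanded


# Toy instance: both S-A-G and S-B-G cost 2, and A and B both have f = 2.
graph = {"S": [("A", 1), ("B", 2)], "A": [("G", 1)], "B": [("G", 0)]}
h = {"S": 2, "A": 1, "B": 0, "G": 0}  # consistent on this graph

cost_low, expanded_low = astar(graph, h, "S", "G", lambda g, hv: hv)
cost_high, expanded_high = astar(graph, h, "S", "G", lambda g, hv: -hv)
print(cost_low, expanded_low)
print(cost_high, expanded_high)
```

Both runs return the same optimal cost, but preferring low h expands fewer states here than preferring high h, which is the kind of difference the abstract refers to.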

This paper poses the planning problem faced by the dispatcher responding to urban emergencies as a Hybrid (Discrete and Continuous) State and Action Markov Decision Process (HSA-MDP). We evaluate the performance of three online planning algorithms based on hindsight optimization for HSA-MDPs on real-world emergency data in the city of Corvallis, USA. The approach takes into account and respects the policy constraints imposed by the emergency department. We show that our algorithms outperform a heuristic policy commonly used by dispatchers by significantly reducing the average response time as well as lowering the fraction of unanswered calls. Our results give new insights into the problem such as withholding of resources for future emergencies in some situations.

We study planning for LTLf and LDLf temporally extended goals in nondeterministic fully observable domains (FOND). We consider both strong and strong cyclic plans, and develop foundational automata-based techniques to deal with both cases. Using these techniques we provide the computational characterization of both problems, separating the complexity in the size of the domain specification from that in the size of the formula. Specifically we establish them to be EXPTIME-complete and 2EXPTIME-complete, respectively, for both problems. In doing so, we also show 2EXPTIME-hardness for strong cyclic plans, which was open.

The problem of optimally scheduling the charging demand of electric vehicles within the constraints of the electricity infrastructure is called the charge scheduling problem. The models of the charging speed, horizon, and charging demand determine the computational complexity of the charge scheduling problem. We show that for about 20 variants the problem is either in P or weakly NP-hard and dynamic programs exist to compute optimal solutions. About 10 other variants of the problem are strongly NP-hard, presenting a potentially significant obstacle to their use in practical situations of scale. An experimental study establishes up to what parameter values the dynamic programs can determine optimal solutions in a couple of minutes.

We study the intersection signalling control problem for cars with heterogeneous valuations of time (VoT). We are interested in a control algorithm that has some desirable properties: (1) it induces cars to report their VoT truthfully, (2) it minimizes the value of time lost for cars waiting at the intersection, and (3) it is computationally efficient. We obtain three main results: (1) We describe a computationally efficient heuristic forward search approach to solve the static problem. Simulation results show that this method is significantly faster than the dynamic-programming approach to solve the static problem (which is by itself polynomial time). We therefore believe that our algorithm can be commercially implemented. (2) We extend the solution of the static problem to the dynamic case. We couple our algorithm with a carefully designed payment scheme which yields an incentive compatible mechanism. In other words, it is in the best interest of each car to truthfully report its VoT. (3) We describe simulation results that compare the social welfare obtained by our scheduling algorithm, as measured by the total value of waiting time, to the social welfare obtained by other intersection signalling control methods.

Red-black relaxation in classical planning allows interpolation between delete-relaxed and real planning. Yet the traditional use of relaxations to generate heuristics restricts relaxation usage to tractable fragments. How to actually tap into the red-black relaxation's interpolation power? Prior work has devised red-black state space search (RBS) for intractable red-black planning, and has explored two uses: proving unsolvability and generating seed plans for plan repair. Here, we explore the generation of plans directly through RBS. We design two enhancements to this end: (A) use a known tractable fragment where possible, use RBS for the intractable parts; (B) check RBS state transitions for realizability, spawn relaxation refinements where the check fails. We show the potential merits of both techniques on IPC benchmarks.

In this work we study the model checking problem for probabilistic multiagent systems with respect to the probabilistic epistemic logic PETL, which can specify both temporal and epistemic properties. We show that under the realistic assumption of uniform schedulers, i.e., the choice of every agent depends only on its observation history, PETL model checking is undecidable. By restricting the class of schedulers to be memoryless schedulers, we show that the problem becomes decidable. More importantly, we design a novel algorithm which reduces the model checking problem into a mixed integer non-linear programming problem, which can then be solved by using an SMT solver. The algorithm has been implemented in an existing model checker and experiments are conducted on examples from the IPPC competitions.

Partially observable Markov decision processes (POMDPs) are the standard models for planning under uncertainty with both finite and infinite horizon. Besides the well-known discounted-sum objective, the indefinite-horizon objective (aka Goal-POMDPs) is another classical objective for POMDPs. In this case, given a set of target states and a positive cost for each transition, the optimization objective is to minimize the expected total cost until a target state is reached. In the literature, RTDP-Bel or heuristic search value iteration (HSVI) have been used for solving Goal-POMDPs. Neither of these algorithms has theoretical convergence guarantees, and HSVI may even fail to terminate its trials. We make the following contributions: (1) We discuss the challenges introduced in Goal-POMDPs and illustrate how they prevent the original HSVI from converging. (2) We present a novel algorithm inspired by HSVI, termed Goal-HSVI, and show that our algorithm has convergence guarantees. (3) We show that Goal-HSVI outperforms RTDP-Bel on a set of well-known examples.

We envision an intelligent agent that analyzes conversations during human team meetings in order to infer the team’s plan, with the purpose of providing decision support to strengthen that plan. We present a novel learning technique to infer teams' final plans directly from a processed form of their planning conversation. Our method employs reinforcement learning to train a model that maps features of the discussed plan and patterns of dialogue exchange among participants to a final, agreed-upon plan. We employ planning domain models to efficiently search the large space of possible plans, and the costs of candidate plans serve as the reinforcement signal. We demonstrate that our technique successfully infers plans within a variety of challenging domains, with higher accuracy than prior art. With our domain-independent feature set, we empirically demonstrate that our model trained on one planning domain can be applied to successfully infer team plans within a novel planning domain.

Epistemic planning extends classical planning with knowledge and is based on dynamic epistemic logic (DEL). The epistemic planning problem is undecidable in general. We exhibit a small undecidable subclass of epistemic planning over 2-agent S5 models with a fixed repertoire of one action, 6 propositions and a fixed goal. We furthermore consider a variant of the epistemic planning problem where the initial knowledge state is an automatic structure, hence possibly infinite. In that case, we show the epistemic planning problem with 1 public action and 2 propositions to be undecidable, while it is known to be decidable with public actions over finite models. Our results are obtained by reducing the reachability problem over small universal cellular automata. While our reductions yield a goal formula that displays the common knowledge operator, we show, for each of our considered epistemic problems, a reduction into an epistemic planning problem for a common-knowledge-operator-free goal formula by using 2 additional actions.

This paper studies an effect-abstraction based relaxation for reasoning about linear numeric planning problems. The effect-abstraction decomposes non-constant linear numeric effects into actions with conditional effects over additive constant numeric effects. With little effort, on this compiled version, it is possible to use known subgoaling based relaxations and relative heuristics. The combination of these two steps leads to a novel relaxation based heuristic. Theoretically, the relaxation is proved to be tighter than the previous interval based relaxation and to lead to safe-pruning heuristics. Empirically, a heuristic developed on this relaxation leads to substantial improvements for a class of problems that are currently out of the reach of state-of-the-art numeric planners.

Model-based strategies for control are critical to obtain sample efficient learning. Dyna is a planning paradigm that naturally interleaves learning and planning, by simulating one-step experience to update the action-value function. This elegant planning strategy has been mostly explored in the tabular setting. The aim of this paper is to revisit sample-based planning in stochastic and continuous domains with learned models. We first highlight the flexibility afforded by a model over Experience Replay (ER). Replay-based methods can be seen as stochastic planning methods that repeatedly sample from a buffer of recent agent-environment interactions and perform updates to improve data efficiency. We show that a model, as opposed to a replay buffer, is particularly useful for specifying which states to sample from during planning, such as predecessor states that propagate information in reverse from a state more quickly. We introduce a semi-parametric model learning approach, called Reweighted Experience Models (REMs), that makes it simple to sample next states or predecessors. We demonstrate that REM-Dyna exhibits similar advantages over replay-based methods when learning in continuous-state problems, and that the performance gap grows when moving to stochastic domains of increasing size.
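
The one-step simulated updates that Dyna performs can be sketched in the tabular, deterministic setting that the abstract mentions as the usual one. This is not REM-Dyna: the model here is a plain lookup table, and the function name `dyna_q_planning`, the state and action labels, and the parameter values are hypothetical.

```python
import random


def dyna_q_planning(q, model, actions, n_updates, alpha=0.5, gamma=0.9, rng=None):
    """Run `n_updates` simulated one-step Q-updates from a learned model.

    Assumption: `model` maps (state, action) to a single (reward, next_state)
    pair, i.e. a deterministic tabular model.
    """
    rng = rng or random.Random(0)
    seen = list(model)  # (state, action) pairs observed so far
    for _ in range(n_updates):
        state, action = rng.choice(seen)            # sample a remembered pair
        reward, next_state = model[(state, action)]  # simulate one step
        best_next = max(q.get((next_state, a), 0.0) for a in actions)
        old = q.get((state, action), 0.0)
        q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q


# A single remembered transition: taking "go" in s0 yields reward 1 and ends.
model = {("s0", "go"): (1.0, "end")}
q = dyna_q_planning({}, model, actions=["go"], n_updates=2)
print(q[("s0", "go")])
```

Each planning update reuses the stored transition instead of a new environment step, which is where the sample efficiency of Dyna-style methods comes from.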

We propose a novel Stackelberg game model of MDP interdiction in which the defender modifies the initial state of the planner, who then responds by computing an optimal policy starting with that state. We first develop a novel approach for MDP interdiction in factored state space that allows the defender to modify the initial state. The resulting approach can be computationally expensive for large factored MDPs. To address this, we develop several interdiction algorithms that leverage variations of reinforcement learning using both linear and non-linear function approximation. Finally, we extend the interdiction framework to consider a Bayesian interdiction problem in which the interdictor is uncertain about some of the planner's initial state features. Extensive experiments demonstrate the effectiveness of our approaches.

In non-cooperative multi-agent systems, agents might want to prevent their opponents from achieving their goals. One way to solve this task is to use counterplanning to generate a plan that allows an agent to block others from reaching their goals. In this paper, we introduce a fully automated domain-independent approach for counterplanning. It combines goal recognition to infer an opponent's goal; landmark computation to identify subgoals that can be used to block the achievement of the opponent's goals; and classical automated planning to generate plans that prevent the opponent from achieving its goals. Experimental results in several domains show the benefits of our novel approach.

In this paper, we leverage the efficiency of Binarized Neural Networks (BNNs) to learn complex state transition models of planning domains with discretized factored state and action spaces. In order to directly exploit this transition structure for planning, we present two novel compilations of the learned factored planning problem with BNNs based on reductions to Boolean Satisfiability (FD-SAT-Plan) as well as Binary Linear Programming (FD-BLP-Plan). Experimentally, we show the effectiveness of learning complex transition models with BNNs, and test the runtime efficiency of both encodings on the learned factored planning problem. After this initial investigation, we present an incremental constraint generation algorithm based on generalized landmark constraints to improve the planning accuracy of our encodings. Finally, we show how to extend the best performing encoding (FD-BLP-Plan+) beyond goals to handle factored planning problems with rewards.