| Total: 25

The difficulty of deterministic planning increases exponentially with search-tree depth. Black-box planning presents an even greater challenge, since planners must operate without an explicit model of the domain. Heuristics can make search more efficient, but goal-aware heuristics for black-box planning usually rely on goal counting, which is often quite uninformative. In this work, we show how to overcome this limitation by discovering macro-actions that make the goal-count heuristic more accurate. Our approach searches for macro-actions with focused effects (i.e. macros that modify only a small number of state variables), which align well with the assumptions made by the goal-count heuristic. Focused macros dramatically improve black-box planning efficiency across a wide range of planning domains, sometimes beating even state-of-the-art planners with access to a full domain model.

This paper addresses the challenge of online generalization in tree search. We propose Multiple Estimator Monte Carlo Tree Search (ME-MCTS), with a two-fold contribution: first, we introduce a formalization of online generalization that can represent existing techniques such as "history heuristics", "RAVE", or "OMA" -- contextual action value estimators or abstractors that generalize across specific contexts. Second, we incorporate recent advances in estimator averaging that enable guiding search by combining the online action value estimates of any number of such abstractors or similar types of action value estimators. Unlike previous work, which usually proposed a single abstractor for either the selection or the rollout phase of MCTS simulations, our approach focuses on the combination of multiple estimators and applies them to all move choices in MCTS simulations. As the MCTS tree itself is just another value estimator -- unbiased, but without abstraction -- this blurs the traditional distinction between action choices inside and outside of the MCTS tree. Experiments with three abstractors in four board games show significant improvements of ME-MCTS over MCTS using only a single abstractor, both for MCTS with random rollouts as well as for MCTS with static evaluation functions. While we used deterministic, fully observable games, ME-MCTS naturally extends to more challenging settings.

In many public health settings, it is important for patients to adhere to health programs, such as taking medications and periodic health checks. Unfortunately, beneficiaries may gradually disengage from such programs, which is detrimental to their health. A concrete example of gradual disengagement has been observed by an organization that carries out a free automated call-based program for spreading preventive care information among pregnant women. Many women stop picking up calls after being enrolled for a few months. To avoid such disengagements, it is important to provide timely interventions. Such interventions are often expensive and can be provided to only a small fraction of the beneficiaries. We model this scenario as a restless multi-armed bandit (RMAB) problem, where each beneficiary is assumed to transition from one state to another depending on the intervention. Moreover, since the transition probabilities are unknown a priori, we propose a Whittle index based Q-Learning mechanism and show that it converges to the optimal solution. Our method improves over existing learning-based methods for RMABs on multiple benchmarks from literature and also on the maternal healthcare dataset.

Previous work on satisficing planning using greedy best-first search (GBFS) has shown that non-greedy, randomized exploration can help escape uninformative heuristic regions and solve hard problems faster. Despite their success when used with GBFS, such exploration techniques cannot be directly applied to bounded suboptimal algorithms like Weighted A* (WA*) without losing the solution-quality guarantees. In this work, we present Type-WA*, a novel bounded suboptimal planning algorithm that augments WA* with type-based exploration while still satisfying WA*'s theoretical solution-quality guarantee. Our empirical analysis shows that Type-WA* significantly increases the number of solved problems, when used in conjunction with each of three popular heuristics. Our analysis also provides insight into the runtime vs. solution cost trade-off.

Classical planning tasks are commonly described in PDDL, while most planning systems operate on a grounded finite-domain representation (FDR). The translation of PDDL into FDR is complex and has a lot of choice points---it involves identifying so called mutex groups---but most systems rely on the translator that comes with Fast Downward. Yet the translation choice points can strongly impact performance. Prior work has considered optimizing FDR encodings in terms of the number of variables produced. Here we go one step further by proposing to custom-design FDR encodings, optimizing the encoding to suit particular planning techniques. We develop such a custom design here for red-black planning, a partial delete relaxation technique. The FDR encoding affects the causal graph and the domain transition graph structures, which govern the tractable fragment of red-black planning and hence affects the respective heuristic function. We develop integer linear programming techniques optimizing the scope of that fragment in the resulting FDR encoding. We empirically show that the performance of red-black planning can be improved through such FDR custom design.

In Goal Recognition Design (GRD), the objective is to modify a domain to facilitate early detection of the goal of a subject agent. Most previous work studies this problem in the offline setting, in which the observing agent performs its interventions before the subject begins acting. In this paper, we generalize GRD to the online setting in which time passes and the observer's actions are interleaved with those of the subject. We illustrate weaknesses of existing metrics for GRD and propose an alternative better suited to online settings. We provide a formal definition of this Active GRD (AGRD) problem and study an algorithm for solving it. AGRD occupies an interesting middle ground between passive goal recognition and strategic two-player game settings.

We consider a selection problem with stochastic probing. There is a set of items whose values are drawn from independent distributions. The distributions are known in advance. Each item can be \emph{tested} repeatedly. Each test reduces the uncertainty about the realization of its value. We study a testing model, where the first test reveals if the realized value is smaller or larger than the median of the underlying distribution. Subsequent tests allow to further narrow down the interval in which the realization is located. There is a limited number of possible tests, and our goal is to design near-optimal testing strategies that allow to maximize the expected value of the chosen item. We study both identical and non-identical distributions and develop polynomial-time algorithms with constant approximation factors in both scenarios.

Key to the effectiveness of schedule-driven approaches to real-time traffic control is an ability to accurately predict when sensed vehicles will arrive at and pass through the intersection. Prior work in schedule-driven traffic control has assumed a static vehicle arrival model. However, this static predictive model ignores the fact that the queue count and the incurred delay should vary as different partial signal timing schedules (i.e., different possible futures) are explored during the online planning process. In this paper, we propose an alternative arrival time model that incorporates queueing dynamics into this forward search process for a signal timing schedule, to more accurately capture how the intersection’s queues vary over time. As each search state is generated, an incremental queueing delay is dynamically projected for each vehicle. The resulting total queueing delay is then considered in addition to the cumulative delay caused by signal operations. We demonstrate the potential of this approach through microscopic traffic simulation of a real-world road network, showing a 10-15% reduction in average wait times over the schedule-driven traffic signal control system in heavy traffic scenarios.

Recent advances in symbolic dynamic programming (SDP) have significantly broadened the class of MDPs for which exact closed-form value functions can be derived. However, no existing solution methods can solve complex discrete and continuous state MDPs where a linear program determines state transitions --- transitions that are often required in problems with underlying constrained flow dynamics arising in problems ranging from traffic signal control to telecommunications bandwidth planning. In this paper, we present a novel SDP solution method for MDPs with LP transitions and continuous piecewise linear dynamics by introducing a novel, fully symbolic argmax operator. On three diverse domains, we show the first automated exact closed-form SDP solution to these challenging problems and the significant advantages of our SDP approach over discretized approximations.

We investigate the computational complexity of finding temporally disjoint paths or walks in temporal graphs. There, the edge set changes over discrete time steps and a temporal path (resp. walk) uses edges that appear at monotonically increasing time steps. Two paths (or walks) are temporally disjoint if they never use the same vertex at the same time; otherwise, they interfere. This reflects applications in robotics, traffic routing, or finding safe pathways in dynamically changing networks. On the one extreme, we show that on general graphs the problem is computationally hard. The "walk version" is W[1]-hard when parameterized by the number of routes. However, it is polynomial-time solvable for any constant number of walks. The "path version" remains NP-hard even if we want to find only two temporally disjoint paths. On the other extreme, restricting the input temporal graph to have a path as underlying graph, quite counterintuitively, we find NP-hardness in general but also identify natural tractable cases.

The General Data Protection Regulations (GDPR) entitle individuals to explanations for automated decisions. The form, comprehensibility, and even existence of such explanations remain open problems, investigated as part of explainable AI. We adopt the approach of counterfactual explanations and apply it to decisions made by declarative optimization models. We argue that inverse combinatorial optimization is particularly suited for counterfactual explanations but that the computational difficulties and relatively nascent literature make its application a challenge. To make progress, we address the case of counterfactual explanations that isolate the minimal differences for an individual. We show that under two common optimization functions, full inverse optimization is unnecessary. In particular, we show that for functions of the form of the sum of weighted binary variables, which includes frameworks such as weighted MaxSAT, a solution can be found by solving a slightly modified version of the original optimization model. In contrast, the sum of weighted integer variables can be solved with a binary search over a series of modifications to the original model.

Decision-making policies for agents are often synthesized with the constraint that a formal specification of behaviour is satisfied. Here we focus on infinite-horizon properties. On the one hand, Linear Temporal Logic (LTL) is a popular example of a formalism for qualitative specifications. On the other hand, Steady-State Policy Synthesis (SSPS) has recently received considerable attention as it provides a more quantitative and more behavioural perspective on specifications, in terms of the frequency with which states are visited. Finally, rewards provide a classic framework for quantitative properties. In this paper, we study Markov decision processes (MDP) with the specification combining all these three types. The derived policy maximizes the reward among all policies ensuring the LTL specification with the given probability and adhering to the steady-state constraints. To this end, we provide a unified solution reducing the multi-type specification to a multi-dimensional long-run average reward. This is enabled by Limit-Deterministic Büchi Automata (LDBA), recently studied in the context of LTL model checking on MDP, and allows for an elegant solution through a simple linear programme. The algorithm also extends to the general omega-regular properties and runs in time polynomial in the sizes of the MDP as well as the LDBA.

The automated learning of action models is widely recognised as a key and compelling challenge to address the difficulties of the manual specification of planning domains. Most state-of-the-art methods perform this learning offline from an input set of plan traces generated by the execution of (successful) plans. However, how to generate informative plan traces for learning action models is still an open issue. Moreover, plan traces might not be available for a new environment. In this paper, we propose an algorithm for learning action models online, incrementally during the execution of plans. Such plans are generated to achieve goals that the algorithm decides online in order to obtain informative plan traces and reach states from which useful information can be learned. We show some fundamental theoretical properties of the algorithm, and we experimentally evaluate the online learning of the action models over a large set of IPC domains.

Polynomial-time heuristic functions for planning are commonplace since 20 years. But polynomial-time in which input? Almost all existing approaches are based on a grounded task representation, not on the actual PDDL input which is exponentially smaller. This limits practical applicability to cases where the grounded representation is "small enough". Previous attempts to tackle this problem for the delete relaxation leveraged symmetries to reduce the blow-up. Here we take a more radical approach, applying an additional relaxation to obtain a heuristic function that runs in time polynomial in the size of the PDDL input. Our relaxation splits the predicates into smaller predicates of fixed arity K. We show that computing a relaxed plan is still NP-hard (in PDDL input size) for K>=2, but is polynomial-time for K=1. We implement a heuristic function for K=1 and show that it can improve the state of the art on benchmarks whose grounded representation is large.

Multi-Agent Path Finding (MAPF) is the challenging problem of computing collision-free paths for multiple agents. Algorithms for solving MAPF can be categorized on a spectrum. At one end are (bounded-sub)optimal algorithms that can find high-quality solutions for small problems. At the other end are unbounded-suboptimal algorithms that can solve large problems but usually find low-quality solutions. In this paper, we consider a third approach that combines the best of both worlds: anytime algorithms that quickly find an initial solution using efficient MAPF algorithms from the literature, even for large problems, and that subsequently improve the solution quality to near-optimal as time progresses by replanning subgroups of agents using Large Neighborhood Search. We compare our algorithm MAPF-LNS against a range of existing work and report significant gains in scalability, runtime to the initial solution, and speed of improving the solution.

Influenced by the era of the sharing economy and mobile payment, Dockless Bike-Sharing System (Dockless BSS) is expanding in many major cities. The mobility of users constantly leads to supply and demand imbalance, which seriously affects the total profit and customer satisfaction. In this paper, we propose the Spatio-Temporal Mixed Integer Program (STMIP) with Flow-graphed Community Discovery (FCD) approach to rebalancing the system. Different from existing studies that ignore the route of trucks and adopt a centralized rebalancing, our approach considers the spatio-temporal information of trucks and discovers station communities for truck-based rebalancing. First, we propose the FCD algorithm to detect station communities. Significantly, rebalancing communities decomposes the centralized system into a distributed multi-communities system. Then, by considering the routing and velocity of trucks, we design the STMIP model with the objective of maximizing total profit, to find a repositioning policy for each station community. We design a simulator built on real-world data from DiDi Chuxing to test the algorithm performance. The extensive experimental results demonstrate that our approach outperforms in terms of service level, profit, and complexity compared with the state-of-the-art approach.

We consider the problem of synthesizing good-enough (GE)-strategies for linear temporal logic (LTL) over finite traces or LTLf for short. The problem of synthesizing GE-strategies for an LTL formula φ over infinite traces reduces to the problem of synthesizing winning strategies for the formula (∃Oφ)⇒φ where O is the set of propositions controlled by the system. We first prove that this reduction does not work for LTLf formulas. Then we show how to synthesize GE-strategies for LTLf formulas via the Good-Enough (GE)-synthesis of LTL formulas. Unfortunately, this requires to construct deterministic parity automata on infinite words, which is computationally expensive. We then show how to synthesize GE-strategies for LTLf formulas by a reduction to solving games played on deterministic Büchi automata, based on an easier construction of deterministic automata on finite words. We show empirically that our specialized synthesis algorithm for GE-strategies outperforms the algorithms going through GE-synthesis of LTL formulas by orders of magnitude.

Incorporating humans into AI planning is an important feature of flexible planning technology. Such human integration allows to incorporate previously unknown constraints, and is also an integral part of automated modeling assistance. As a foundation for integrating user requests, we study the computational complexity of determining the existence of changes to an existing model, such that the resulting model allows for specific user-provided solutions. We are provided with a planning problem modeled either in the classical (non-hierarchical) or hierarchical task network (HTN) planning formalism, as well as with a supposed-to-be solution plan, which is actually not a solution for the current model. Considering changing decomposition methods as well as preconditions and effects of actions, we show that most change requests are NP-complete though some turn out to be tractable.

Temporal plan preferences are natural and important in a variety of applications. Yet users often find it difficult to formalize their preferences. Here we explore the possibility to learn preferences from example plans. Focusing on one preference at a time, the user is asked to annotate examples as good/bad. We leverage prior work on LTL formula learning to extract a preference from these examples. We conduct an empirical study of this approach in an oversubscription planning context, using hidden target formulas to emulate the user preferences. We explore four different methods for generating example plans, and evaluate performance as a function of domain and formula size. Overall, we find that reasonable-size target formulas can often be learned effectively.

Stubborn sets are a pruning technique for state-space search which is well established in optimal classical planning. In this paper, we show that weak stubborn sets introduced in recent work in planning are actually not weak stubborn sets in Valmari's original sense. Based on this finding, we introduce weak stubborn sets in the original sense for planning by providing a generalized definition analogously to generalized strong stubborn sets in previous work. We discuss the relationship of strong, weak and the previously called weak stubborn sets, thus providing a further step in getting an overall picture of the stubborn set approach in planning.

Recent work in classical planning has introduced dedicated techniques for detecting unsolvable states, i.e., states from which no goal state can be reached. We approach the problem from a generalized planning perspective and learn first-order-like formulas that characterize unsolvability for entire planning domains. We show how to cast the problem as a self-supervised classification task. Our training data is automatically generated and labeled by exhaustive exploration of small instances of each domain, and candidate features are automatically computed from the predicates used to define the domain. We investigate three learning algorithms with different properties and compare them to heuristics from the literature. Our empirical results show that our approach often captures important classes of unsolvable states with high classification accuracy. Additionally, the logical form of our heuristics makes them easy to interpret and reason about, and can be used to show that the characterizations learned in some domains capture exactly all unsolvable states of the domain.

We study the two-player zero-sum extension of the partially observable stochastic shortest-path problem where one agent has only partial information about the environment. We formulate this problem as a partially observable stochastic game (POSG): given a set of target states and negative rewards for each transition, the player with imperfect information maximizes the expected undiscounted total reward until a target state is reached. The second player with the perfect information aims for the opposite. We base our formalism on POSGs with one-sided observability (OS-POSGs) and give the following contributions: (1) we introduce a novel heuristic search value iteration algorithm that iteratively solves depth-limited variants of the game, (2) we derive the bound on the depth guaranteeing an arbitrary precision, (3) we propose a novel upper-bound estimation that allows early terminations, and (4) we experimentally evaluate the algorithm on a pursuit-evasion game.

Heuristic search is among the best performing approaches to classical satisficing planning, with its performance heavily relying on informative and fast heuristics, as well as search-boosting and pruning techniques. While both heuristics and pruning techniques have gained much attention recently, search-boosting techniques in general, and preferred operators in particular have received less attention in the last decade. Our work aims at bringing the light back to preferred operators research, with the introduction of preferred operators pruning technique, based on the concept of novelty. Continuing the research on novelty with respect to an underlying heuristic, we present the definition of preferred operators for such novelty heuristics. For that, we extend the previously defined concepts to operators, allowing us to reason about the novelty of the preferred operators. Our experimental evaluation shows the practical benefit of our suggested approach, compared to the currently used methods.

Robots assisting us in factories or homes must learn to make use of objects as tools to perform tasks, e.g., a tray for carrying objects. We consider the problem of learning commonsense knowledge of when a tool may be useful and how its use may be composed with other tools to accomplish a high-level task instructed by a human. We introduce TANGO, a novel neural model for predicting task-specific tool interactions. TANGO is trained using demonstrations obtained from human teachers instructing a virtual robot in a physics simulator. TANGO encodes the world state consisting of objects and symbolic relationships between them using a graph neural network. The model learns to attend over the scene using knowledge of the goal and the action history, finally decoding the symbolic action to execute. Crucially, we address generalization to unseen environments where some known tools are missing, but alternative unseen tools are present. We show that by augmenting the representation of the environment with pre-trained embeddings derived from a knowledge-base, the model can generalize effectively to novel environments. Experimental results show a 60.5-78.9% improvement over the baseline in predicting successful symbolic plans in unseen settings for a simulated mobile manipulator.

The Traveling Tournament Problem is a well-known benchmark problem in tournament timetabling, which asks us to design a schedule of home/away games of n teams (n is even) under some feasibility requirements such that the total traveling distance of all the n teams is minimized. In this paper, we study TTP-2, the traveling tournament problem where at most two consecutive home games or away games are allowed, and give an effective algorithm for n/2 being odd. Experiments on the well-known benchmark sets show that we can beat previously known solutions for all instances with n/2 being odd by an average improvement of 2.66%. Furthermore, we improve the theoretical approximation ratio from 3/2+O(1/n) to 1+O(1/n) for n/2 being odd, answering a challenging open problem in this area.