
In hedonic games, a set of n agents, having preferences over all possible coalition structures, needs to agree on a stable outcome. In this work, we initiate the study of hedonic games with fixed-size coalitions, where the set of possible coalition structures is restricted as follows: there are k coalitions, each coalition has a fixed size, and the sum of the sizes of all coalitions equals n. We focus on the basic model of additively separable hedonic games with symmetric preferences, where an agent's preference is captured by a utility function which sums up the contributions of all other agents in the same coalition. In this setting, an outcome is stable if no pair of agents can exchange coalitions and improve their utilities. Depending on the definition of improvement, three stability notions arise: swap stability under transferable utilities, which requires improving the sum of the utilities of both agents; swap stability, which requires improving the utility of one agent without decreasing the utility of the other; and strict swap stability, which requires improving the utilities of both agents simultaneously. We analyse the fundamental questions of existence, complexity and efficiency of stable outcomes, and that of complexity of a social optimum.
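
The three stability notions in this abstract can be illustrated with a small brute-force check. The sketch below is only an illustration under stated assumptions: the function names, the representation of a coalition structure as a list of sets, and the symmetric weight matrix `w` are all hypothetical and not taken from the paper.

```python
from itertools import combinations


def utility(agent, coalition, w):
    """Additively separable utility: sum of w[agent][j] over coalition mates j."""
    return sum(w[agent][j] for j in coalition if j != agent)


def find_improving_swap(partition, w, notion="swap"):
    """Return a pair (i, j) from different coalitions whose exchange is an
    improvement under the chosen notion, or None if no such pair exists.

    notion: "transferable" (sum of both utilities strictly increases),
            "swap"         (one strictly gains, the other does not lose),
            "strict"       (both strictly gain).
    """
    for a, b in combinations(range(len(partition)), 2):
        for i in partition[a]:
            for j in partition[b]:
                # Coalitions after i and j exchange places; sizes are preserved.
                new_a = (partition[a] - {i}) | {j}
                new_b = (partition[b] - {j}) | {i}
                di = utility(i, new_b, w) - utility(i, partition[a], w)
                dj = utility(j, new_a, w) - utility(j, partition[b], w)
                if notion == "transferable":
                    improving = di + dj > 0
                elif notion == "swap":
                    improving = min(di, dj) >= 0 and max(di, dj) > 0
                elif notion == "strict":
                    improving = di > 0 and dj > 0
                else:
                    raise ValueError(f"unknown notion: {notion}")
                if improving:
                    return i, j
    return None


# Hypothetical instance: agents 0 and 2 like each other, as do agents 1 and 3.
w = [
    [0, 0, 5, 0],
    [0, 0, 0, 5],
    [5, 0, 0, 0],
    [0, 5, 0, 0],
]
print(find_improving_swap([{0, 1}, {2, 3}], w, "strict"))
print(find_improving_swap([{0, 2}, {1, 3}], w, "transferable"))
```

Because every swap keeps the coalition sizes unchanged, the check never leaves the restricted set of fixed-size coalition structures.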

When humans collaborate with each other, they often make decisions by observing others and considering the consequences that their actions may have on the entire team, instead of greedily doing what is best for just themselves. We would like our AI agents to effectively collaborate in a similar way by capturing a model of their partners. In this work, we propose and analyze a decentralized Multi-Armed Bandit (MAB) problem with coupled rewards as an abstraction of more general multi-agent collaboration. We demonstrate that naive extensions of single-agent optimal MAB algorithms fail when applied to decentralized bandit teams. Instead, we propose a Partner-Aware strategy for joint sequential decision-making that extends the well-known single-agent Upper Confidence Bound algorithm. We analytically show that our proposed strategy achieves logarithmic regret, and provide extensive experiments involving human-AI and human-robot collaboration to validate our theoretical findings. Our results show that the proposed partner-aware strategy outperforms other known methods, and our human subject studies suggest humans prefer to collaborate with AI agents implementing our partner-aware strategy.
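
For readers unfamiliar with the single-agent baseline mentioned here, the following is a minimal sketch of the standard UCB1 rule. It is not the Partner-Aware strategy, whose details are not given in the abstract; the function name `ucb1` and the `pull` callback are hypothetical names chosen for illustration.

```python
import math


def ucb1(pull, n_arms, horizon):
    """Run standard single-agent UCB1 for `horizon` rounds.

    pull(arm) must return a reward in [0, 1]. Returns the per-arm pull
    counts and empirical mean rewards.
    """
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # play each arm once to initialise the estimates
        else:
            # Optimism in the face of uncertainty: empirical mean plus a bonus
            # that shrinks as the arm is pulled more often.
            arm = max(
                range(n_arms),
                key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
    return counts, means


# Hypothetical deterministic rewards, so that the run is reproducible.
counts, means = ucb1(lambda arm: [0.1, 0.9][arm], n_arms=2, horizon=500)
print(counts, means)
```

In a team setting with coupled rewards, each agent running this rule independently ignores its partner's choices, which is the kind of naive extension the abstract reports as failing.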

The Moran process is a classic stochastic process that models invasion dynamics on graphs. A single mutant (e.g., a new opinion, strain, social trait, etc.) invades a population of residents spread over the nodes of a graph. The mutant fitness advantage δ ≥ 0 determines how aggressively mutants propagate to their neighbors. The quantity of interest is the fixation probability, i.e., the probability that the initial mutant eventually takes over the whole population. However, in realistic settings, the invading mutant has an advantage only in certain locations. For example, the ability to metabolize a certain sugar is an advantageous trait to bacteria only when the sugar is actually present in their surroundings. In this paper we introduce the positional Moran process, a natural generalization in which the mutant fitness advantage is only realized on specific nodes called active nodes, and study the problem of fixation maximization: given a budget k, choose a set of k active nodes that maximizes the fixation probability of the invading mutant. We show that the problem is NP-hard, while the optimization function is not submodular, thus indicating strong computational hardness. We focus on two natural limits. In the limit of δ → ∞ (strong selection), although the problem remains NP-hard, the optimization function becomes submodular and thus admits a constant-factor approximation using a simple greedy algorithm. In the limit of δ → 0 (weak selection), we show that we can obtain a tight approximation in O(n^{2ω}) time, where ω is the matrix-multiplication exponent. An experimental evaluation of the new algorithms along with some proposed heuristics corroborates our results.

Multi-Agent Path Finding (MAPF) is the problem of finding collision-free paths for multiple agents that minimize the sum of path costs. EECBS is a leading two-level algorithm that solves MAPF bounded-suboptimally, that is, within some factor w of the minimum sum of path costs C*. It uses focal search to find bounded-suboptimal paths on the low level and Explicit Estimation Search (EES) to resolve collisions on the high level. EES keeps track of a lower bound LB on C* to find paths whose sum of path costs is at most w LB in order to solve MAPF bounded-suboptimally. However, the costs of many paths are often much smaller than w times their minimum path costs, meaning that the sum of path costs is much smaller than w C*. In this paper, we therefore propose Flexible EECBS (FEECBS), which uses a flex(ible) distribution of the path costs (that relaxes the requirement to find bounded-suboptimal paths on the low level) in order to reduce the number of collisions that need to be resolved on the high level while still guaranteeing to solve MAPF bounded suboptimally. We address the drawbacks of flex distribution via techniques such as restrictions on the flex distribution, restarts of the high-level search with EECBS, and low-level focal-A* search. Our empirical evaluation shows that FEECBS substantially improves the efficiency of EECBS on MAPF instances with large maps and large numbers of agents.

Participatory budgeting (PB) is a democratic process where citizens jointly decide on how to allocate public funds to indivisible projects. In this work, we focus on PB processes where citizens may provide additional money to projects they want to see funded. We introduce a formal framework for this kind of PB with donations. Our framework also allows for diversity constraints, meaning that each project belongs to one or more types, and there are lower and upper bounds on the number of projects of the same type that can be funded. We propose three general classes of methods for aggregating the citizens’ preferences in the presence of donations and analyze their axiomatic properties. Furthermore, we investigate the computational complexity of determining the outcome of a PB process with donations and of finding a citizen’s optimal donation strategy.

Distributed Constraint Optimization Problems (DCOPs) are an important subclass of combinatorial optimization problems, where information and controls are distributed among multiple autonomous agents. Previously, Machine Learning (ML) has been widely applied to solve combinatorial optimization problems by learning effective heuristics. However, existing ML-based heuristic methods are often not generalizable to different search algorithms. Most importantly, these methods usually require full knowledge about the problems to be solved, which is not suitable for distributed settings where centralization is not realistic due to geographical limitations or privacy concerns. To address the generality issue, we propose a novel directed acyclic graph representation schema for DCOPs and leverage the Graph Attention Networks (GATs) to embed graph representations. Our model, GAT-PCM, is then pretrained with optimally labelled data in an offline manner, so as to construct effective heuristics to boost a broad range of DCOP algorithms where evaluating the quality of a partial assignment is critical, such as local search or backtracking search. Furthermore, to enable decentralized model inference, we propose a distributed embedding schema of GAT-PCM where each agent exchanges only embedded vectors, and show its soundness and complexity. Finally, we demonstrate the effectiveness of our model by combining it with a local search or a backtracking search algorithm. Extensive empirical evaluations indicate that the GAT-PCM-boosted algorithms significantly outperform the state-of-the-art methods in various benchmarks.

When dealing with a series of imminent issues, humans can naturally concentrate on a subset of these issues by prioritizing them according to their contributions to motivational indices, e.g., the probability of winning a game. This idea of concentration offers insights into reinforcement learning of sophisticated Large-scale Multi-Agent Systems (LMAS) involving hundreds of agents. In such an LMAS, each agent receives a long series of entity observations at each step, which can overwhelm existing aggregation networks such as graph attention networks and cause inefficiency. In this paper, we propose a concentration network called ConcNet. First, ConcNet scores the observed entities considering several motivational indices, e.g., expected survival time and state value of the agents, and then ranks, prunes, and aggregates the encodings of observed entities to extract features. Second, distinct from the well-known attention mechanism, ConcNet has a unique motivational subnetwork to explicitly consider the motivational indices when scoring the observed entities. Furthermore, we present a concentration policy gradient architecture that can learn effective policies in LMAS from scratch. Extensive experiments demonstrate that the presented architecture has excellent scalability and flexibility, and significantly outperforms existing methods on LMAS benchmarks.

We study fairness through the lens of cooperative multi-agent learning. Our work is motivated by empirical evidence that naive maximization of team reward yields unfair outcomes for individual team members. To address fairness in multi-agent contexts, we introduce team fairness, a group-based fairness measure for multi-agent learning. We then prove that it is possible to enforce team fairness during policy optimization by transforming the team's joint policy into an equivariant map. We refer to our multi-agent learning strategy as Fairness through Equivariance (Fair-E) and demonstrate its effectiveness empirically. We then introduce Fairness through Equivariance Regularization (Fair-ER) as a soft-constraint version of Fair-E and show that it reaches higher levels of utility than Fair-E and fairer outcomes than non-equivariant policies. Finally, we present novel findings regarding the fairness-utility trade-off in multi-agent settings, showing that the magnitude of the trade-off depends on agent skill.

Active Directory is the default security management system for Windows domain networks. We study the shortest path edge interdiction problem for defending Active Directory style attack graphs. The problem is formulated as a Stackelberg game between one defender and one attacker. The attack graph contains one destination node and multiple entry nodes. The attacker's entry node is chosen by nature. The defender chooses to block a set of edges limited by his budget. The attacker then picks the shortest unblocked attack path. The defender aims to maximize the expected shortest path length for the attacker, where the expectation is taken over entry nodes. We observe that practical Active Directory attack graphs have small maximum attack path length and are structurally close to trees. We first show that even if the maximum attack path length is a constant, the problem is still W[1]-hard with respect to the defender's budget. Having a small maximum attack path length and a small budget is not enough to design fixed-parameter algorithms. If we further assume that the number of entry nodes is small, then we derive a fixed-parameter tractable algorithm. We then propose two other fixed-parameter algorithms by exploiting the tree-like features. One is based on tree decomposition and requires a small tree width. The other assumes a small number of splitting nodes (nodes with multiple outgoing edges). Finally, the last algorithm is converted into a graph convolutional neural network based heuristic, which scales to larger graphs with more splitting nodes.
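
The game described in this abstract can be made concrete with an exhaustive search on a toy graph. This is a sketch under stated assumptions, not one of the paper's fixed-parameter algorithms: edges have unit length, an entry node with no remaining path is scored with a caller-supplied `no_path_value`, and all names are hypothetical.

```python
from collections import deque
from itertools import combinations


def shortest_len(edges, src, dst):
    """Breadth-first search: number of edges on a shortest src-to-dst path,
    or None when dst is unreachable."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None


def best_interdiction(edges, entry_probs, dst, budget, no_path_value):
    """Try every set of at most `budget` blocked edges and keep the one that
    maximises the attacker's expected shortest path length."""
    best_value, best_block = float("-inf"), ()
    for k in range(budget + 1):
        for block in combinations(edges, k):
            remaining = [e for e in edges if e not in block]
            value = 0.0
            for entry, prob in entry_probs.items():
                d = shortest_len(remaining, entry, dst)
                value += prob * (no_path_value if d is None else d)
            if value > best_value:
                best_value, best_block = value, block
    return best_value, best_block


# Hypothetical attack graph: two entry nodes "a" and "b", destination "t".
edges = [("a", "t"), ("a", "m"), ("m", "t"), ("b", "m")]
value, block = best_interdiction(
    edges, {"a": 0.5, "b": 0.5}, dst="t", budget=1, no_path_value=10
)
print(value, block)
```

The search enumerates all edge subsets up to the budget, so its running time grows exponentially with the budget; the hardness result quoted above is about whether that dependence can be confined to a small parameter.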

Multi-Agent Path Finding (MAPF) is the problem of finding a set of collision-free paths for a team of agents in a common environment. MAPF is NP-hard to solve optimally and, in some cases, also bounded-suboptimally. It is thus time-consuming for (bounded-sub)optimal solvers to solve large MAPF instances. Anytime algorithms find solutions quickly for large instances and then improve them to close-to-optimal ones over time. In this paper, we improve the current state-of-the-art anytime solver MAPF-LNS, which first finds an initial solution fast and then repeatedly replans the paths of subsets of agents via Large Neighborhood Search (LNS). It generates the subsets of agents for replanning by randomized destroy heuristics, but not all of them increase the solution quality substantially. We propose to use machine learning to learn how to select a subset of agents from a collection of subsets, such that replanning increases the solution quality more. We show experimentally that our solver, MAPF-ML-LNS, significantly outperforms MAPF-LNS on the standard MAPF benchmark set in terms of both the speed of improving the solution and the final solution quality.

We propose a novel policy gradient method for multi-agent reinforcement learning, which leverages two different variance-reduction techniques and does not require large batches over iterations. Specifically, we propose a momentum-based decentralized policy gradient tracking (MDPGT) where a new momentum-based variance reduction technique is used to approximate the local policy gradient surrogate with importance sampling, and an intermediate parameter is adopted to track two consecutive policy gradient surrogates. MDPGT provably achieves the best available sample complexity of O(N^{-1} ε^{-3}) for converging to an ε-stationary point of the global average of N local performance functions (possibly nonconcave). This outperforms the state-of-the-art sample complexity in decentralized model-free reinforcement learning and when initialized with a single trajectory, the sample complexity matches those obtained by the existing decentralized policy gradient methods. We further validate the theoretical claim for the Gaussian policy function. When the required error tolerance ε is small enough, MDPGT leads to a linear speed up, which has been previously established in decentralized stochastic optimization, but not for reinforcement learning. Lastly, we provide empirical results on a multi-agent reinforcement learning benchmark environment to support our theoretical findings.

Modern multi-agent robotic systems increasingly require scalable, robust and persistent Multi-Agent Path Finding (MAPF) with performance guarantees. While many MAPF solvers that provide some of these properties exist, none provides them all. To fill this need, we propose a new MAPF framework, the shard system. A shard system partitions the workspace into geographic regions, called shards, linked by a novel system of buffers. Agents are routed optimally within a shard by a local controller to local goals set by a global controller. The buffer system allows shards to plan with perfect parallelism, providing scalability. A novel global controller algorithm can rapidly generate an inter-shard routing plan for thousands of agents while minimizing the traffic routed through any shard. A novel workspace partitioning algorithm produces shards small enough to replan rapidly. These innovations allow a shard system to adjust its routing plan in real time if an agent is delayed or assigned a new goal, enabling robust, persistent MAPF. A shard system's local optimality and optimized inter-shard routing bring the sum-of-costs of its solutions to single-shot MAPF problems to < 20-60% of optimal on a diversity of workspaces. Its scalability allows it to plan paths for thousands of agents in seconds. If any of the agents' goals changes or a move action fails, a shard system can replan in under a second.

Centralized Training for Decentralized Execution, where training is done in a centralized offline fashion, has become a popular solution paradigm in Multi-Agent Reinforcement Learning. Many such methods take the form of actor-critic with state-based critics, since centralized training allows access to the true system state, which can be useful during training despite not being available at execution time. State-based critics have become a common empirical choice, albeit one which has had limited theoretical justification or analysis. In this paper, we show that state-based critics can introduce bias in the policy gradient estimates, potentially undermining the asymptotic guarantees of the algorithm. We also show that, even if the state-based critics do not introduce any bias, they can still result in a larger gradient variance, contrary to the common intuition. Finally, we show the effects of the theories in practice by comparing different forms of centralized critics on a wide range of common benchmarks, and detail how various environmental properties are related to the effectiveness of different types of critics.

This paper studies defender patrol deception in general Stackelberg security games (SSGs), where a defender attempts to alter the attacker's perception of the defender's patrolling intensity so as to influence the attacker's decision making. We are interested in understanding the complexity and effectiveness of optimal defender deception under different attacker behavior models. Specifically, we consider three different attacker strategies of response (to the defender's deception) with increasing sophistication, and design efficient polynomial-time algorithms to compute the equilibrium for each. Moreover, we prove formal separation for the effectiveness of patrol deception when facing an attacker of increasing sophistication, until it even becomes harmful to the defender when facing the most intelligent attacker we consider. Our results shed light on when and how deception should be used in SSGs. We conduct extensive experiments to illustrate our theoretical results in various game settings.

Mean Field Games (MFGs) can potentially scale multi-agent systems to extremely large populations of agents. Yet, most of the literature assumes a single initial distribution for the agents, which limits the practical applications of MFGs. Machine Learning has the potential to solve a wider diversity of MFG problems thanks to its generalization capacities. We study how to leverage these generalization properties to learn policies enabling a typical agent to behave optimally against any population distribution. In reference to the Master equation in MFGs, we coin the term “Master policies” to describe them and we prove that a single Master policy provides a Nash equilibrium, whatever the initial distribution. We propose a method to learn such Master policies. Our approach relies on three ingredients: adding the current population distribution as part of the observation, approximating Master policies with neural networks, and training via Reinforcement Learning and Fictitious Play. We illustrate on numerical examples not only the efficiency of the learned Master policy but also its generalization capabilities beyond the distributions used for training.

Networked discrete dynamical systems are often used to model the spread of contagions and decision-making by agents in coordination games. Fixed points of such dynamical systems represent configurations to which the system converges. In the dissemination of undesirable contagions (such as rumors and misinformation), convergence to fixed points with a small number of affected nodes is a desirable goal. Motivated by such considerations, we formulate a novel optimization problem of finding a nontrivial fixed point of the system with the minimum number of affected nodes. We establish that, unless P = NP, there is no polynomial-time algorithm for approximating a solution to this problem to within the factor n^(1 - epsilon) for any constant epsilon > 0. To cope with this computational intractability, we identify several special cases for which the problem can be solved efficiently. Further, we introduce an integer linear program to address the problem for networks of reasonable sizes. For solving the problem on larger networks, we propose a general heuristic framework along with greedy selection methods. Extensive experimental results on real-world networks demonstrate the effectiveness of the proposed heuristics. A full version of the manuscript, source code and data are available at: https://github.com/bridgelessqiu/NMIN-FPE

Aggregating opinions of a collection of agents is a question of interest to a broad array of researchers, ranging from ensemble-learning theorists to political scientists designing democratic institutions. This work investigates the optimal number of agents needed to decide on a binary issue under majority rule. We take an epistemic view where the issue at hand has a ground truth "correct" outcome and each one of n voters votes correctly with a fixed probability, known as their competence level or competence. These competencies come from a fixed distribution D. Observing the competencies, we must choose a specific group that will represent the population. Finally, voters sample a decision (either correct or not), and the group is correct as long as more than half the chosen representatives voted correctly. Assuming that we can identify the best experts, i.e., those with the highest competence, to form an epistemic congress, we find that the optimal congress size should be linear in the population size. This result is striking because it holds even when allowing the top representatives to become arbitrarily accurate, choosing the correct outcome with probabilities approaching 1. We then analyze real-world data, observing that the actual sizes of representative bodies are much smaller than the optimal ones our theoretical results suggest. We conclude by examining under what conditions congresses of sub-optimal sizes would still outperform direct democracy, in which all voters vote. We find that a small congress would beat direct democracy if the societal bias towards the ground truth decreases with the population size fast enough, and we quantify the speed needed for constant and polynomial congress sizes.
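
The majority-rule model in this abstract lends itself to a small exact computation. The sketch below assumes independent votes and treats the competences as given numbers; the function names are hypothetical, and the brute-force search over congress sizes is only an illustration, not the paper's asymptotic analysis.

```python
def majority_correct_prob(competences):
    """Probability that strictly more than half of the voters are correct,
    assuming each voter is independently correct with its own competence."""
    # dist[c] = probability that exactly c of the voters seen so far are correct
    dist = [1.0]
    for p in competences:
        new = [0.0] * (len(dist) + 1)
        for c, q in enumerate(dist):
            new[c] += q * (1 - p)      # this voter is wrong
            new[c + 1] += q * p        # this voter is correct
        dist = new
    n = len(competences)
    return sum(q for c, q in enumerate(dist) if 2 * c > n)


def best_congress_size(competences):
    """Congress of the k most competent voters: return the k that maximises
    the probability of a correct majority (smallest k on ties)."""
    ranked = sorted(competences, reverse=True)
    sizes = range(1, len(ranked) + 1)
    return max(sizes, key=lambda k: majority_correct_prob(ranked[:k]))


# Hypothetical populations.
print(best_congress_size([0.9, 0.55, 0.55]))  # one strong expert
print(best_congress_size([0.6, 0.6, 0.6]))    # three equal voters
```

With one strong expert and two weak voters the expert alone does best, while three equally competent voters do better together than any one of them; the abstract's result concerns how this trade-off behaves as the population grows.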

Multiagent reinforcement learning algorithms have not been widely adopted in large scale environments with many agents as they often scale poorly with the number of agents. Using mean field theory to aggregate agents has been proposed as a solution to this problem. However, almost all previous methods in this area make a strong assumption of a centralized system where all the agents in the environment learn the same policy and are effectively indistinguishable from each other. In this paper, we relax this assumption about indistinguishable agents and propose a new mean field system known as Decentralized Mean Field Games, where each agent can be quite different from others. All agents learn independent policies in a decentralized fashion, based on their local observations. We define a theoretical solution concept for this system and provide a fixed point guarantee for a Q-learning based algorithm in this system. A practical consequence of our approach is that we can address a 'chicken-and-egg' problem in empirical mean field reinforcement learning algorithms. Further, we provide Q-learning and actor-critic algorithms that use the decentralized mean field learning approach and give stronger performance compared to common baselines in this area. In our setting, agents do not need to be clones of each other and learn in a fully decentralized fashion. Hence, for the first time, we show the application of mean field learning methods in fully competitive environments, large-scale continuous action space environments, and other environments with heterogeneous agents. Importantly, we also apply the mean field method in a ride-sharing problem using a real-world dataset. We propose a decentralized solution to this problem, which is more practical than existing centralized training methods.

This paper presents a novel collaborative generative modeling (CGM) framework that incentivizes collaboration among self-interested parties to contribute data to a pool for training a generative model (e.g., GAN), from which synthetic data are drawn and distributed to the parties as rewards commensurate to their contributions. Distributing synthetic data as rewards (instead of trained models or money) offers task- and model-agnostic benefits for downstream learning tasks and is less likely to violate data privacy regulation. To realize the framework, we first propose a data valuation function using maximum mean discrepancy (MMD) that values data based on its quantity and quality in terms of its closeness to the true data distribution, and provide theoretical results guiding the kernel choice in our MMD-based data valuation function. Then, we formulate the reward scheme as a linear optimization problem that, when solved, guarantees certain incentives such as fairness in the CGM framework. We devise a weighted sampling algorithm for generating synthetic data to be distributed to each party as reward such that the value of its data and the synthetic data combined matches the reward value assigned to it by the reward scheme. We empirically show using simulated and real-world datasets that the parties' synthetic data rewards are commensurate to their contributions.
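
Maximum mean discrepancy, the quantity underlying the valuation function mentioned above, can be estimated directly from two samples. The sketch below shows the standard biased estimate of squared MMD with a Gaussian kernel. It is not the paper's valuation function, whose exact form and kernel choice are not given in the abstract; the names and the bandwidth default are assumptions.

```python
import math


def rbf(x, y, bandwidth):
    """Gaussian (RBF) kernel between two points given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples xs, ys."""
    def mean_kernel(a, b):
        return sum(rbf(u, v, bandwidth) for u in a for v in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


# Hypothetical data: a reference sample, a nearby sample and a distant one.
reference = [(0.0, 0.0), (1.0, 0.0)]
near = [(0.1, 0.0), (1.1, 0.0)]
far = [(5.0, 5.0), (6.0, 5.0)]
print(mmd_squared(reference, near), mmd_squared(reference, far))
```

A sample that lies closer to the reference distribution yields a smaller discrepancy, which is the sense in which an MMD-based score can reflect data quality.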

We propose a new problem setting to study the sequential interactions between a recommender system and a user. Instead of assuming the user is omniscient, static, and explicit, as the classical practice does, we sketch a more realistic user behavior model, under which the user: 1) rejects recommendations if they are clearly worse than others; 2) updates her utility estimation based on rewards from her accepted recommendations; 3) withholds realized rewards from the system. We formulate the interactions between the system and such an explorative user in a K-armed bandit framework and study the problem of learning the optimal recommendation on the system side. We show that efficient system learning is still possible but is more difficult. In particular, the system can identify the best arm with probability at least 1-delta within O(1/delta) interactions, and we prove this is tight. Our finding contrasts the result for the problem of best arm identification with fixed confidence, in which the best arm can be identified with probability 1-delta within O(log(1/delta)) interactions. This gap illustrates the inevitable cost the system has to pay when it learns from an explorative user's revealed preferences on its recommendations rather than from the realized rewards.

Effective communication can improve coordination in cooperative multi-agent reinforcement learning (MARL). One popular communication scheme is exchanging agents' local observations or latent embeddings and using them to augment individual local policy input. Such a communication paradigm can reduce uncertainty for local decision-making and induce implicit coordination. However, it enlarges agents' local policy spaces and increases learning complexity, leading to poor coordination in complex settings. To handle this limitation, this paper proposes a novel framework named Multi-Agent Incentive Communication (MAIC) that allows each agent to learn to generate incentive messages and bias other agents' value functions directly, resulting in effective explicit coordination. Our method firstly learns targeted teammate models, with which each agent can anticipate the teammate's action selection and generate tailored messages to specific agents. We further introduce a novel regularization to leverage interaction sparsity and improve communication efficiency. MAIC is agnostic to specific MARL algorithms and can be flexibly integrated with different value function factorization methods. Empirical results demonstrate that our method significantly outperforms baselines and achieves excellent performance on multiple cooperative MARL tasks.

The cost efficiency of model inference is critical to real-world machine learning (ML) applications, especially for delay-sensitive tasks and resource-limited devices. A typical dilemma is: in order to provide complex intelligent services (e.g. smart city), we need inference results of multiple ML models, but the cost budget (e.g. GPU memory) is not enough to run all of them. In this work, we study underlying relationships among black-box ML models and propose a novel learning task: model linking. Model linking aims to bridge the knowledge of different black-box models by learning mappings (dubbed model links) between their output spaces. Based on model links, we developed a scheduling algorithm, named MLink. Through collaborative multi-model inference enabled by model links, MLink can improve the accuracy of obtained inference results under the cost budget. We evaluated MLink on a multi-modal dataset with seven different ML models and two real-world video analytics systems with six ML models and 3,264 hours of video. Experimental results show that our proposed model links can be effectively built among various black-box models. Under the budget of GPU memory, MLink can save 66.7% inference computations while preserving 94% inference accuracy, which outperforms multi-task learning, deep reinforcement learning-based scheduler and frame filtering baselines.

We extend the classic regret minimization framework for approximating equilibria in normal-form games by greedily weighing iterates based on regrets observed at runtime. Theoretically, our method retains all previous convergence rate guarantees. Empirically, experiments on large randomly generated games and normal-form subgames of the AI benchmark Diplomacy show that greedy weights outperforms previous methods whenever sampling is used, sometimes by several orders of magnitude.
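
As background for the classic framework this abstract builds on, the following is a minimal sketch of regret matching in self-play on a two-player zero-sum normal-form game, with iterates averaged uniformly. The greedy weighting scheme itself is not reproduced, since the abstract does not specify it; all function names and the payoff matrix are hypothetical.

```python
def regret_matching_strategy(regrets):
    """Play actions in proportion to positive cumulative regret, or uniformly
    when no action has positive regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    return [1.0 / len(regrets)] * len(regrets)


def solve_zero_sum(payoff, iterations):
    """Self-play regret matching. payoff[i][j] is the row player's payoff and
    the column player receives its negative. Returns the uniformly averaged
    strategies of both players."""
    n, m = len(payoff), len(payoff[0])
    row_regret, col_regret = [0.0] * n, [0.0] * m
    row_sum, col_sum = [0.0] * n, [0.0] * m
    for _ in range(iterations):
        x = regret_matching_strategy(row_regret)
        y = regret_matching_strategy(col_regret)
        # Expected payoff of each pure action against the opponent's current mix.
        row_u = [sum(payoff[i][j] * y[j] for j in range(m)) for i in range(n)]
        col_u = [-sum(payoff[i][j] * x[i] for i in range(n)) for j in range(m)]
        row_ev = sum(x[i] * row_u[i] for i in range(n))
        col_ev = sum(y[j] * col_u[j] for j in range(m))
        for i in range(n):
            row_regret[i] += row_u[i] - row_ev
            row_sum[i] += x[i]  # uniform weight on every iterate
        for j in range(m):
            col_regret[j] += col_u[j] - col_ev
            col_sum[j] += y[j]
    return [s / iterations for s in row_sum], [s / iterations for s in col_sum]


def exploitability(payoff, x, y):
    """Sum of both players' best-response payoffs against the given profile;
    it is zero exactly at a Nash equilibrium of a zero-sum game."""
    n, m = len(payoff), len(payoff[0])
    best_row = max(sum(payoff[i][j] * y[j] for j in range(m)) for i in range(n))
    best_col = max(-sum(payoff[i][j] * x[i] for i in range(n)) for j in range(m))
    return best_row + best_col


# Hypothetical 2x2 game.
game = [[2.0, -1.0], [-1.0, 1.0]]
x, y = solve_zero_sum(game, 10000)
print(x, y, exploitability(game, x, y))
```

The line that adds `x[i]` to `row_sum[i]` is where each iterate receives weight one; a scheme that weighs iterates non-uniformly would change the weights applied at that step.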