| Total: 19

In real-world phenomena which involve mutual influence or causal effects between interconnected units, equilibrium states are typically represented with cycles in graphical models. An expressive class of graphical models, relational causal models, can represent and reason about complex dynamic systems exhibiting such cycles or feedback loops. Existing cyclic causal discovery algorithms for learning causal models from observational data assume that the data instances are independent and identically distributed which makes them unsuitable for relational causal models. At the same time, causal discovery algorithms for relational causal models assume acyclicity. In this work, we examine the necessary and sufficient conditions under which a constraint-based relational causal discovery algorithm is sound and complete for cyclic relational causal models. We introduce relational acyclification, an operation specifically designed for relational models that enables reasoning about the identifiability of cyclic relational causal models. We show that under the assumptions of relational acyclification and sigma-faithfulness, the relational causal discovery algorithm RCD is sound and complete for cyclic relational models. We present experimental results to support our claim.

Reasoning about the effect of interventions and counterfactuals is a fundamental task found throughout the data sciences. A collection of principles, algorithms, and tools has been developed for performing such tasks in the last decades. One of the pervasive requirements found throughout this literature is the articulation of assumptions, which commonly appear in the form of causal diagrams. Despite the power of this approach, there are significant settings where the knowledge necessary to specify a causal diagram over all variables is not available, particularly in complex, high-dimensional domains. In this paper, we introduce a new graphical modeling tool called cluster DAGs (for short, C-DAGs) that allows for the partial specification of relationships among variables based on limited prior knowledge, alleviating the stringent requirement of specifying a full causal diagram. A C-DAG specifies relationships between clusters of variables, while the relationships between the variables within a cluster are left unspecified, and can be seen as a graphical representation of an equivalence class of causal diagrams that share the relationships among the clusters. We develop the foundations and machinery for valid inferences over C-DAGs about the clusters of variables at each layer of Pearl's Causal Hierarchy - L1 (probabilistic), L2 (interventional), and L3 (counterfactual). In particular, we prove the soundness and completeness of d-separation for probabilistic inference in C-DAGs. Further, we demonstrate the validity of Pearl's do-calculus rules over C-DAGs and show that the standard ID identification algorithm is sound and complete to systematically compute causal effects from observational data given a C-DAG. Finally, we show that C-DAGs are valid for performing counterfactual inferences about clusters of variables.

Conditional independence (CI) tests underlie many approaches to model testing and structure learning in causal inference. Most existing CI tests for categorical and ordinal data stratify the sample by the conditioning variables, perform simple independence tests in each stratum, and combine the results. Unfortunately, the statistical power of this approach degrades rapidly as the number of conditioning variables increases. Here we propose a simple unified CI test for ordinal and categorical data that maintains reasonable calibration and power in high dimensions. We show that our test outperforms existing baselines in model testing and structure learning for dense directed graphical models while being comparable for sparse models. Our approach could be attractive for causal model testing because it is easy to implement, can be used with non-parametric or parametric probability models, has the symmetry property, and has reasonable computational requirements.

Graphical event models (GEMs) are representations of temporal point process dynamics between different event types. Many real-world applications however involve limited event stream data, making it challenging to learn GEMs from data alone. In this paper, we introduce approaches that can work together in a score-based learning paradigm, to augment data with potentially different types of background knowledge. We propose novel scores for learning an important parametric class of GEMs; in particular, we propose a Bayesian score for leveraging prior information as well as a more practical simplification that involves fewer parameters, analogous to Bayesian networks. We also introduce a framework for incorporating easily assessed qualitative background knowledge from domain experts, in the form of statements such as `event X depends on event Y' or `event Y makes event X more likely'. The proposed framework has Bayesian interpretations and can be deployed by any score-based learner. Through an extensive empirical investigation, we demonstrate the practical benefits of background knowledge augmentation while learning GEMs for applications in the low-data regime.

Entropy regularization is known to improve exploration in sequential decision-making problems. We show that this same mechanism can also lead to nearly unbiased and lower-variance estimates of the mean reward in the optimize-and-estimate structured bandit setting. Mean reward estimation (i.e., population estimation) tasks have recently been shown to be essential for public policy settings where legal constraints often require precise estimates of population metrics. We show that leveraging entropy and KL divergence can yield a better trade-off between reward and estimator variance than existing baselines, all while remaining nearly unbiased. These properties of entropy regularization illustrate an exciting potential for bringing together the optimal exploration and estimation literature.

Structure learning is a core problem in AI central to the fields of neuro-symbolic AI and statistical relational learning. It consists in automatically learning a logical theory from data. The basis for structure learning is mining repeating patterns in the data, known as structural motifs. Finding these patterns reduces the exponential search space and therefore guides the learning of formulas. Despite the importance of motif learning, it is still not well understood. We present the first principled approach for mining structural motifs in lifted graphical models, languages that blend first-order logic with probabilistic models, which uses a stochastic process to measure the similarity of entities in the data. Our first contribution is an algorithm, which depends on two intuitive hyperparameters: one controlling the uncertainty in the entity similarity measure, and one controlling the softness of the resulting rules. Our second contribution is a preprocessing step where we perform hierarchical clustering on the data to reduce the search space to the most relevant data. Our third contribution is to introduce an O(n ln(n)) (in the size of the entities in the data) algorithm for clustering structurally-related data. We evaluate our approach using standard benchmarks and show that we outperform state-of-the-art structure learning approaches by up to 6% in terms of accuracy and up to 80% in terms of runtime.

The permanent of a matrix has numerous applications but is notoriously hard to compute. While nonnegative matrices admit polynomial approximation schemes based on rapidly mixing Markov chains, the known practical estimators of the permanent rely on importance or rejection sampling. We advance the rejection sampling approach, which provides probabilistic accuracy guarantees, unlike importance sampling. Specifically, we give a novel class of nesting upper bounds and a simple preprocessing method that, in comparison to previous works, enable faster sampling with better acceptance rate; we demonstrate order-of-magnitude improvements with both theoretical and empirical analyses. In addition, we display instances on which our approximation scheme is competitive against state-of-the-art importance sampling based estimators.

Normalizing flows have been successfully modeling a complex probability distribution as an invertible transformation of a simple base distribution. However, there are often applications that require more than invertibility. For instance, the computation of energies and forces in physics requires the second derivatives of the transformation to be well-defined and continuous. Smooth normalizing flows employ infinitely differentiable transformation, but with the price of slow non-analytic inverse transforms. In this work, we propose diffeomorphic non-uniform B-spline flows that are at least twice continuously differentiable while bi-Lipschitz continuous, enabling efficient parametrization while retaining analytic inverse transforms based on a sufficient condition for diffeomorphism. Firstly, we investigate the sufficient condition for C(k-2)-diffeomorphic non-uniform kth-order B-spline transformations. Then, we derive an analytic inverse transformation of the non-uniform cubic B-spline transformation for neural diffeomorphic non-uniform B-spline flows. Lastly, we performed experiments on solving the force matching problem in Boltzmann generators, demonstrating that our C2-diffeomorphic non-uniform B-spline flows yielded solutions better than previous spline flows and faster than smooth normalizing flows. Our source code is publicly available at https://github.com/smhongok/Non-uniform-B-spline-Flow.

We propose novel identification conditions and a statistical estimation method for the probabilities of potential outcome types using covariate information in randomized trials in which the treatment assignment is randomized but subject compliance is not perfect. Different from existing studies, the proposed identification conditions do not require strict assumptions such as the assumption of monotonicity. When the probabilities of potential outcome types are identifiable through the proposed conditions, the problem of estimating the probabilities of potential outcome types is reduced to that of singular models. Thus, the probabilities cannot be evaluated using standard statistical likelihood-based estimation methods. Rather, the proposed identification conditions show that we can derive consistent estimators of the probabilities of potential outcome types via the method of moments, which leads to the asymptotic normality of the proposed estimators through the delta method under regular conditions. We also propose a new statistical estimation method based on the bounded constrained augmented Lagrangian method to derive more efficient estimators than can be derived through the method of moments.

There are many applications that benefit from computing the exact divergence between 2 discrete probability measures, including machine learning. Unfortunately, in the absence of any assumptions on the structure or independencies within these distributions, computing the divergence between them is an intractable problem in high dimensions. We show that we are able to compute a wide family of functionals and divergences, such as the alpha-beta divergence, between two decomposable models, i.e. chordal Markov networks, in time exponential to the treewidth of these models. The alpha-beta divergence is a family of divergences that include popular divergences such as the Kullback-Leibler divergence, the Hellinger distance, and the chi-squared divergence. Thus, we can accurately compute the exact values of any of this broad class of divergences to the extent to which we can accurately model the two distributions using decomposable models.

This paper develops a novel methodology to simultaneously learn a neural network and extract generalized logic rules. Different from prior neural-symbolic methods that require background knowledge and candidate logical rules to be provided, we aim to induce task semantics with minimal priors. This is achieved by a two-step learning framework that iterates between optimizing neural predictions of task labels and searching for a more accurate representation of the hidden task semantics. Notably, supervision works in both directions: (partially) induced task semantics guide the learning of the neural network and induced neural predictions admit an improved semantic representation. We demonstrate that our proposed framework is capable of achieving superior out-of-distribution generalization performance on two tasks: (i) learning multi-digit addition, where it is trained on short sequences of digits and tested on long sequences of digits; (ii) predicting the optimal action in the Tower of Hanoi, where the model is challenged to discover a policy independent of the number of disks in the puzzle.

We propose ordering-based approaches for learning the maximal ancestral graph (MAG) of a structural equation model (SEM) up to its Markov equivalence class (MEC) in the presence of unobserved variables. Existing ordering-based methods in the literature recover a graph through learning a causal order (c-order). We advocate for a novel order called removable order (r-order) as they are advantageous over c-orders for structure learning. This is because r-orders are the minimizers of an appropriately defined optimization problem that could be either solved exactly (using a reinforcement learning approach) or approximately (using a hill-climbing search). Moreover, the r-orders (unlike c-orders) are invariant among all the graphs in a MEC and include c-orders as a subset. Given that set of r-orders is often significantly larger than the set of c-orders, it is easier for the optimization problem to find an r-order instead of a c-order. We evaluate the performance and the scalability of our proposed approaches on both real-world and randomly generated networks.

The Voter model is a well-studied stochastic process that models the invasion of a novel trait A (e.g., a new opinion, social meme, genetic mutation, magnetic spin) in a network of individuals (agents, people, genes, particles) carrying an existing resident trait B. Individuals change traits by occasionally sampling the trait of a neighbor, while an invasion bias δ ≥ 0 expresses the stochastic preference to adopt the novel trait A over the resident trait B. The strength of an invasion is measured by the probability that eventually the whole population adopts trait A, i.e., the fixation probability. In more realistic settings, however, the invasion bias is not ubiquitous, but rather manifested only in parts of the network. For instance, when modeling the spread of a social trait, the invasion bias represents localized incentives. In this paper, we generalize the standard biased Voter model to the positional Voter model, in which the invasion bias is effectuated only on an arbitrary subset of the network nodes, called biased nodes. We study the ensuing optimization problem, which is, given a budget k, to choose k biased nodes so as to maximize the fixation probability of a randomly occurring invasion. We show that the problem is NP-hard both for finite δ and when δ → ∞ (strong bias), while the objective function is not submodular in either setting, indicating strong computational hardness. On the other hand, we show that, when δ → 0 (weak bias), we can obtain a tight approximation in O(n^2ω ) time, where ω is the matrix-multiplication exponent. We complement our theoretical results with an experimental evaluation of some proposed heuristics.

With the increased use of machine learning systems for decision making, questions about the fairness properties of such systems start to take center stage. Most existing work on algorithmic fairness assume complete observation of features at prediction time, as is the case for popular notions like statistical parity and equal opportunity. However, this is not sufficient for models that can make predictions with partial observation as we could miss patterns of bias and incorrectly certify a model to be fair. To address this, a recently introduced notion of fairness asks whether the model exhibits any discrimination pattern, in which an individual—characterized by (partial) feature observations—receives vastly different decisions merely by disclosing one or more sensitive attributes such as gender and race. By explicitly accounting for partial observations, this provides a much more fine-grained notion of fairness. In this paper, we propose an algorithm to search for discrimination patterns in a general class of probabilistic models, namely probabilistic circuits. Previously, such algorithms were limited to naive Bayes classifiers which make strong independence assumptions; by contrast, probabilistic circuits provide a unifying framework for a wide range of tractable probabilistic models and can even be compiled from certain classes of Bayesian networks and probabilistic programs, making our method much more broadly applicable. Furthermore, for an unfair model, it may be useful to quickly find discrimination patterns and distill them for better interpretability. As such, we also propose a sampling-based approach to more efficiently mine discrimination patterns, and introduce new classes of patterns such as minimal, maximal, and Pareto optimal patterns that can effectively summarize exponentially many discrimination patterns.

The concept of potential outcome types is one of the fundamental components of causal inference. However, even in randomized experiments, assumptions on the data generating process, such as monotonicity, are required to evaluate the probabilities of the potential outcome types. To solve the problem without such assumptions in experimental studies, a novel identification condition based on proxy covariate information is proposed in this paper. In addition, the estimation problem of the probabilities of the potential outcome types reduces to that of singular models when they are identifiable through the proposed condition. Thus, they cannot be evaluated by standard statistical estimation methods. To overcome this difficulty, new plug-in estimators of these probabilities are presented, and the asymptotic normality of the proposed estimators is shown.

We consider the task of weighted first-order model counting (WFOMC) used for probabilistic inference in the area of statistical relational learning. Given a formula φ, domain size n and a pair of weight functions, what is the weighted sum of all models of φ over a domain of size n? It was shown that computing WFOMC of any logical sentence with at most two logical variables can be done in time polynomial in n. However, it was also shown that the task is #P1-complete once we add the third variable, which inspired the search for extensions of the two-variable fragment that would still permit a running time polynomial in n. One of such extension is the two-variable fragment with counting quantifiers. In this paper, we prove that adding a linear order axiom (which forces one of the predicates in φ to introduce a linear ordering of the domain elements in each model of φ) on top of the counting quantifiers still permits a computation time polynomial in the domain size. We present a new dynamic programming-based algorithm which can compute WFOMC with linear order in time polynomial in n, thus proving our primary claim.

Methods to identify cause-effect relationships currently mostly assume the variables to be scalar random variables. However, in many fields the objects of interest are vectors or groups of scalar variables. We present a new constraint-based non-parametric approach for inferring the causal relationship between two vector-valued random variables from observational data. Our method employs sparsity estimates of directed and undirected graphs and is based on two new principles for groupwise causal reasoning that we justify theoretically in Pearl's graphical model-based causality framework. Our theoretical considerations are complemented by two new causal discovery algorithms for causal interactions between two random vectors which find the correct causal direction reliably in simulations even if interactions are nonlinear. We evaluate our methods empirically and compare them to other state-of-the-art techniques.

Enumerating the directed acyclic graphs (DAGs) of a Markov equivalence class (MEC) is an important primitive in causal analysis. The central resource from the perspective of computational complexity is the delay, that is, the time an algorithm that lists all members of the class requires between two consecutive outputs. Commonly used algorithms for this task utilize the rules proposed by Meek (1995) or the transformational characterization by Chickering (1995), both resulting in superlinear delay. In this paper, we present the first linear-time delay algorithm. On the theoretical side, we show that our algorithm can be generalized to enumerate DAGs represented by models that incorporate background knowledge, such as MPDAGs; on the practical side, we provide an efficient implementation and evaluate it in a series of experiments. Complementary to the linear-time delay algorithm, we also provide intriguing insights into Markov equivalence itself: All members of an MEC can be enumerated such that two successive DAGs have structural Hamming distance at most three.

Recently, several methods such as private ANM, EM-PC and Priv-PC have been proposed to perform differentially private causal discovery in various scenarios including bivariate, multivariate Gaussian and categorical cases. However, there is little effort on how to conduct private nonlinear causal discovery from numerical data. This work tries to challenge this problem. To this end, we propose a method to infer nonlinear causal relations from observed numerical data by using regression-based conditional independence test (RCIT) that consists of kernel ridge regression (KRR) and Hilbert-Schmidt independence criterion (HSIC) with permutation approximation. Sensitivity analysis for RCIT is given and a private constraint-based causal discovery framework with differential privacy guarantee is developed. Extensive simulations and real-world experiments for both conditional independence test and causal discovery are conducted, which show that our method is effective in handling nonlinear numerical cases and easy to implement. The source code of our method and data are available at https://github.com/Causality-Inference/PCD.