Point processes often have a natural interpretation with respect to a continuous process. We propose a point process construction that describes arrival time observations in terms of the state of a latent diffusion process. In this framework, we relate the return times of the diffusion in a continuous path space to new arrivals of the point process. These models arise in many disciplines, such as finance, where actions in a market are driven by a hidden continuous price, or neuroscience, where a latent stimulus generates spike trains. Building on Itô's excursion theory, we describe computational methods for inferring and sampling from the point process derived from the diffusion process. We illustrate the approach with numerical examples on both simulated and real data. The proposed methods and framework provide a basis for interpreting point processes through the lens of a diffusion.
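As a loose illustration of the construction (not the paper's excursion-theoretic machinery), the sketch below simulates a latent Ornstein-Uhlenbeck-type diffusion with Euler-Maruyama and records grid-level returns to the origin as point-process arrivals; all parameter values are illustrative, and the comment notes why real constructions must work with excursions rather than raw crossings.

```python
import numpy as np

def return_time_arrivals(T=10.0, dt=1e-3, theta=0.5, sigma=1.0, x0=1.0, seed=0):
    """Simulate a latent OU-type diffusion dX = -theta*X dt + sigma dW and
    record the times at which the path returns to the origin as arrivals.
    This is only a grid-level proxy: in continuous time a diffusion returns
    to a point infinitely often, which is why the paper works at the level
    of excursions rather than raw crossings."""
    rng = np.random.default_rng(seed)
    x, t, arrivals = x0, 0.0, []
    for _ in range(int(T / dt)):
        x_new = x - theta * x * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        t += dt
        if x_new * x < 0:              # sign change: the path crossed zero
            arrivals.append(t)
        x = x_new
    return np.array(arrivals)

print(return_time_arrivals()[:5])      # first few simulated arrival times
```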
Generative Flow Networks (or GFlowNets for short) are a family of probabilistic agents that learn to sample complex combinatorial structures through the lens of ``inference as control''. They have shown great potential in generating high-quality and diverse candidates from a given energy landscape. However, existing GFlowNets can be applied only to deterministic environments, and fail in more general tasks with stochastic dynamics, which can limit their applicability. To overcome this challenge, this paper introduces Stochastic GFlowNets, a new algorithm that extends GFlowNets to stochastic environments. By decomposing state transitions into two steps, Stochastic GFlowNets isolate environmental stochasticity and learn a dynamics model to capture it. Extensive experimental results demonstrate that Stochastic GFlowNets offer significant advantages over standard GFlowNets as well as MCMC- and RL-based approaches, on a variety of standard benchmarks with stochastic dynamics.
Identifying the causal variables of an environment and how to intervene on them is of core value in applications such as robotics and embodied AI. While an agent can commonly interact with the environment and may implicitly perturb the behavior of some of these causal variables, often the targets it affects remain unknown. In this paper, we show that causal variables can still be identified for many common setups, e.g., additive Gaussian noise models, if the agent's interactions with a causal variable can be described by an unknown binary variable. This happens when each causal variable has two different mechanisms, e.g., an observational and an interventional one. Using this identifiability result, we propose BISCUIT, a method for simultaneously learning causal variables and their corresponding binary interaction variables. On three robotics-inspired datasets, BISCUIT accurately identifies causal variables and can even be scaled to complex, realistic environments for embodied AI.
Bayesian methods are a popular choice for statistical inference in small-data regimes due to the regularization effect induced by the prior. In the context of density estimation, the standard nonparametric Bayesian approach is to target the posterior predictive of the Dirichlet process mixture model. In general, direct estimation of the posterior predictive is intractable and so methods typically resort to approximating the posterior distribution as an intermediate step. The recent development of quasi-Bayesian predictive copula updates, however, has made it possible to perform tractable predictive density estimation without the need for posterior approximation. Although these estimators are computationally appealing, they tend to struggle on non-smooth data distributions. This is due to the comparatively restrictive form of the likelihood models from which the proposed copula updates were derived. To address this shortcoming, we consider a Bayesian nonparametric model with an autoregressive likelihood decomposition and a Gaussian process prior. While the predictive update of such a model is typically intractable, we derive a quasi-Bayesian predictive update that achieves state-of-the-art results on moderate-sized examples.
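For context, here is a minimal sketch of the kind of Gaussian-copula predictive recursion this line of work builds on, i.e., the comparatively restrictive smooth update the abstract refers to, not the paper's GP-based improvement. The weight sequence and correlation parameter rho are common but, here, illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def gaussian_copula_density(u, v, rho):
    zu, zv = norm.ppf(u), norm.ppf(v)
    return np.exp(-(rho**2 * (zu**2 + zv**2) - 2 * rho * zu * zv)
                  / (2 * (1 - rho**2))) / np.sqrt(1 - rho**2)

def conditional_copula_cdf(u, v, rho):
    zu, zv = norm.ppf(u), norm.ppf(v)
    return norm.cdf((zu - rho * zv) / np.sqrt(1 - rho**2))

def copula_predictive(grid, data, rho=0.8):
    """Recursive quasi-Bayesian predictive density on a grid, starting from
    a standard normal and applying one Gaussian-copula update per datum."""
    p, P = norm.pdf(grid), norm.cdf(grid)
    for i, y in enumerate(data, start=1):
        alpha = (2 - 1 / i) / (i + 1)          # one common weight sequence
        v = np.clip(np.interp(y, grid, P), 1e-10, 1 - 1e-10)
        u = np.clip(P, 1e-10, 1 - 1e-10)
        p = (1 - alpha) * p + alpha * gaussian_copula_density(u, v, rho) * p
        P = (1 - alpha) * P + alpha * conditional_copula_cdf(u, v, rho)
    return p

grid = np.linspace(-5, 5, 501)
data = np.random.default_rng(0).normal(1.0, 0.5, size=200)
density = copula_predictive(grid, data)
print(grid[np.argmax(density)])                # mode concentrates near 1.0
```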
Marked temporal point processes (MTPPs) are a general class of stochastic models for the evolution of events of different types (``marks'') in continuous time. These models have broad applications in areas such as medical data monitoring, financial prediction, user modeling, and communication networks. Of significant practical interest in such problems is the issue of missing or censored data over time. In this paper, we focus on the specific problem of inference for a trained MTPP model when events of certain types are not observed over a period of time during prediction. We introduce the concept of mark-censored sub-processes and use this framework to develop a novel marginalization technique for inference in the presence of censored marks. The approach is model-agnostic and applicable to any MTPP model with a well-defined intensity function. We illustrate the flexibility and utility of the method in the context of both parametric and neural MTPP models, with results across a range of datasets including data from simulated Hawkes processes, self-correcting processes, and multiple real-world event datasets.
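To make the ``well-defined intensity function'' interface concrete, here is a minimal multivariate Hawkes intensity with an exponential kernel; the naive comparison at the end illustrates the bias that motivates the paper's marginalization (which is not implemented here). All parameter values are illustrative.

```python
import numpy as np

def hawkes_intensity(t, history, mu, A, beta):
    """Per-mark intensity of an exponential-kernel multivariate Hawkes
    process: lambda_k(t) = mu_k + sum_{t_i < t} A[k, k_i] * beta * exp(-beta (t - t_i)).
    `history` is a list of (time, mark) pairs with time < t."""
    lam = mu.copy()
    for t_i, k_i in history:
        lam += A[:, k_i] * beta * np.exp(-beta * (t - t_i))
    return lam

mu = np.array([0.2, 0.1]); A = np.array([[0.5, 0.2], [0.1, 0.6]]); beta = 1.0
full = [(0.5, 0), (1.2, 1), (2.0, 0)]
censored = [(t, k) for (t, k) in full if k != 1]       # events of mark 1 unobserved
print(hawkes_intensity(2.5, full, mu, A, beta))        # intensity under full history
print(hawkes_intensity(2.5, censored, mu, A, beta))    # biased without marginalization
```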
Identifying the causes of an event, also termed causal attribution, is a common task in many application problems. Available methods, mostly from the Bayesian and causal inference literature, suffer from two main drawbacks: (1) they cannot perform attribution for individuals, and (2) they attribute only a single cause at a time and cannot handle interaction effects among multiple causes. In this paper, based on our newly proposed measure, the conditional counterfactual causality effect (CCCE), we introduce an individual causal attribution method that is able to use the individual observation as evidence and to consider the common influence and interaction effects of multiple causes simultaneously. We discuss the identifiability of the CCCE and give identification formulas under proper assumptions. Finally, we conduct experiments on simulated and real data to illustrate the effectiveness of the CCCE; the results show that our proposed method significantly outperforms state-of-the-art methods.
Virtually all state-of-the-art methods for training supervised machine learning models are variants of SGD, enhanced with a number of additional tricks, such as minibatching, momentum, and adaptive stepsizes. However, one of the most basic questions in the design of successful SGD methods, one that is orthogonal to the aforementioned tricks, is the choice of the next training data point to learn from. Standard variants of SGD employ a sampling-with-replacement strategy, which means that the next training data point is sampled from the entire data set, often independently of all previous samples. While standard SGD is well understood theoretically, virtually all widely used machine learning software is based on sampling without replacement, as this is often empirically superior. That is, the training data is randomly shuffled/permuted, either only once at the beginning, a strategy known as random shuffling (RS), or before every epoch, a strategy known as random reshuffling (RR), and training proceeds in the data order dictated by the shuffling. RS and RR strategies have for a long time remained beyond the reach of theoretical analysis that would satisfactorily explain their success. However, very recently, Mishchenko et al. [2020] provided tight sublinear convergence rates through a novel analysis, and showed that these strategies can improve upon standard SGD in certain regimes. Inspired by these results, we seek to further improve the rates of shuffling-based methods. In particular, we show that it is possible to enhance them with a variance reduction mechanism, obtaining linear convergence rates. To the best of our knowledge, our linear convergence rates are the best for any method based on sampling without replacement.
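A minimal sketch of the RR loop described above, on a toy quadratic; the paper's variance-reduction mechanism is deliberately omitted, and the step size and epoch count are illustrative.

```python
import numpy as np

def sgd_rr(grad_i, x0, n, lr=0.05, epochs=20, seed=0):
    """SGD with random reshuffling (RR): draw a fresh permutation of the n
    data points every epoch, visiting each point exactly once per epoch.
    Random shuffling (RS) would instead permute once, before training."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(epochs):
        for i in rng.permutation(n):
            x -= lr * grad_i(x, i)
    return x

# toy problem: f_i(x) = 0.5 * ||x - a_i||^2, whose minimizer is mean(a_i)
a = np.random.default_rng(1).normal(size=(100, 3))
x = sgd_rr(lambda x, i: x - a[i], np.zeros(3), n=100)
print(np.linalg.norm(x - a.mean(axis=0)))       # small residual error
```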
Recently, neural heuristics based on deep learning have reported encouraging results for solving vehicle routing problems (VRPs), especially on independent and identically distributed (i.i.d.) instances, e.g., uniformly generated ones. However, in the presence of a distribution shift in the test instances, their performance becomes considerably inferior. In this paper, we propose a multi-view graph contrastive learning (MVGCL) approach to enhance generalization across different distributions, which exploits two graph pattern learners in a self-supervised fashion to facilitate a neural heuristic equipped with an active search scheme. Specifically, we first propose two augmentation methods specially designed for routing problems, and MVGCL leverages graph contrastive learning to extract transferable patterns from VRP graphs to attain generalizable multi-view (i.e., node and graph) representations. It then uses the learnt node and graph embeddings to assist the neural heuristic and the active search (during inference) for route construction, respectively. Extensive experiments on randomly generated VRP instances from various distributions, as well as instances from TSPLib and CVRPLib, show that MVGCL is superior to the baselines in boosting cross-distribution generalization performance.
Modern reinforcement learning systems produce many high-quality policies throughout the learning process. However, to choose which policy to actually deploy in the real world, they must be tested under an intractably large number of environmental conditions. We introduce RPOSST, an algorithm to select a small set of test cases from a larger pool based on a relatively small number of sample evaluations. RPOSST treats the test case selection problem as a 2-player game and optimizes a solution with provable k-of-N robustness, bounding the error relative to a test that uses all the test cases in the pool. Empirical results demonstrate that RPOSST finds a small set of test cases that identify high-quality policies in a toy one-shot game, poker datasets, and a high-fidelity racing simulator.
By recursively summing node features over entire neighborhoods, spatial graph convolution operators have been heralded as key to the success of Graph Neural Networks (GNNs). Yet, despite the proliferation of GNN methods across tasks and applications, the effect of this aggregation operation has yet to be analyzed. In fact, while most recent efforts in the GNN community have focused on optimizing the architecture of the neural network, fewer works have attempted to characterize (a) the different classes of spatial convolution operators, (b) their impact on the geometry of the embedding space, and (c) how the choice of a particular convolution should relate to properties of the data. In this paper, we propose to answer all three questions by dividing existing operators into two main classes (symmetrized vs. row-normalized spatial convolutions), and show how these correspond to different implicit biases on the data. Finally, we show that this convolution operator is in fact tunable, and we explicate regimes in which certain choices of convolutions --- and therefore, embedding geometries --- might be more appropriate.
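The two operator classes have standard closed forms, shown in this small numpy sketch (the symmetrized GCN-style operator D^{-1/2} A D^{-1/2} versus the row-normalized random-walk operator D^{-1} A, both with self-loops added); the toy graph is illustrative.

```python
import numpy as np

def graph_convolutions(A):
    """Return the two aggregation operators for adjacency matrix A with
    self-loops: symmetrized D^{-1/2} A D^{-1/2} vs. row-normalized D^{-1} A
    (the latter is plain mean aggregation over the neighborhood)."""
    A = A + np.eye(A.shape[0])           # add self-loops
    d = A.sum(axis=1)
    sym = A / np.sqrt(np.outer(d, d))    # D^{-1/2} A D^{-1/2}
    rw = A / d[:, None]                  # D^{-1} A
    return sym, rw

A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
sym, rw = graph_convolutions(A)
X = np.random.default_rng(0).normal(size=(3, 2))
print(sym @ X)   # embeddings under the symmetrized operator
print(rw @ X)    # embeddings under the row-normalized operator
```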
Assessing the validity of a real-world system with respect to given quality criteria is a common yet costly task in industrial applications due to the vast number of required real-world tests. Validating such systems by means of simulation offers a promising and less expensive alternative, but requires an assessment of the simulation accuracy and therefore end-to-end measurements. Additionally, covariate shifts between simulations and actual usage can cause difficulties for estimating the reliability of such systems. In this work, we present a validation method that propagates bounds on distributional discrepancy measures through a composite system, thereby allowing us to derive an upper bound on the failure probability of the real system from potentially inaccurate simulations. Each propagation step entails an optimization problem, where -- for measures such as maximum mean discrepancy (MMD) -- we develop tight convex relaxations based on semidefinite programs. We demonstrate that our propagation method yields valid and useful bounds for composite systems exhibiting a variety of realistic effects. In particular, we show that the proposed method can successfully account for data shifts within the experimental design as well as model inaccuracies within the simulation.
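As a point of reference for the discrepancy measures involved, here is the standard unbiased estimator of squared MMD with an RBF kernel between ``simulation'' and ``real'' samples; the paper's contribution of propagating bounds on such quantities through a composite system via SDP relaxations is not reproduced here, and the bandwidth and data are illustrative.

```python
import numpy as np

def mmd_squared(X, Y, gamma=1.0):
    """Unbiased estimate of squared MMD with the RBF kernel
    k(x, y) = exp(-gamma * ||x - y||^2) between samples X and Y."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2 * Kxy.mean())

rng = np.random.default_rng(0)
sim = rng.normal(0.0, 1.0, size=(500, 2))        # "simulation" samples
real = rng.normal(0.3, 1.0, size=(500, 2))       # shifted "real" samples
print(mmd_squared(sim, real))                    # positive under the shift
```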
Physically adversarial attacks can mislead detectors in the real world and have attracted increasing attention. However, most existing works directly manipulate the model's final outputs as attack objectives while ignoring the inherent characteristics of objects, such as multi-scale features; as a result, they are easily trapped in model-specific local optima and suffer degraded transferability. To address this issue, we propose the Multi-scale Feature-aware Attack (MFA) to generate adversarial camouflages with strong attacking ability and transferability by disrupting multi-scale object-aware critical features. Specifically, we adopt the location and category information of the detector outputs to assign attribution scores to different-scale feature layers. Then, we weight each feature according to its attribution result and design a pixel-level loss function in the direction opposite to the object detection objective to generate adversarial camouflages. We conduct extensive experiments in both the digital and physical world on ten detection models (e.g., the recent YOLOv7) and demonstrate the significantly superior performance of the proposed MFA. Our code will be available at: https://github.com/ChenWen1997/MFA.
Diffusion auctions are an emerging paradigm of online marketplace in which an auctioneer utilises a social network to attract potential buyers. Diffusion auctions pose significant privacy risks: from the auction outcome, it is possible to infer hidden, and potentially sensitive, preferences of buyers. To mitigate such risks, we initiate the study of differential privacy (DP) in diffusion auction mechanisms. DP is a well-established notion of privacy that protects a system against inference attacks. Achieving DP in diffusion auctions is non-trivial, as the auction rules must be carefully designed to incentivise the buyers to truthfully report their neighbourhood. We study the single-unit case and design two differentially private diffusion mechanisms (DPDMs): recursive DPDM and layered DPDM. We prove that these mechanisms guarantee differential privacy, incentive compatibility and individual rationality for both valuations and neighbourhood. We then empirically compare their performance on real and synthetic datasets.
This work proposes "jointly amortized neural approximation" (JANA) of intractable likelihood functions and posterior densities arising in Bayesian surrogate modeling and simulation-based inference. We train three complementary networks in an end-to-end fashion: 1) a summary network to compress individual data points, sets, or time series into informative embedding vectors; 2) a posterior network to learn an amortized approximate posterior; and 3) a likelihood network to learn an amortized approximate likelihood. Their interaction opens a new route to amortized marginal likelihood and posterior predictive estimation -- two important ingredients of Bayesian workflows that are often too expensive for standard methods. We benchmark the fidelity of JANA on a variety of simulation models against state-of-the-art Bayesian methods and propose a powerful and interpretable diagnostic for joint calibration. In addition, we investigate the ability of recurrent likelihood networks to emulate complex time series models without resorting to hand-crafted summary statistics.
Neural networks are universal approximators and are studied for their use in solving differential equations. However, a major criticism is the lack of error bounds for obtained solutions. This paper proposes a technique to rigorously evaluate the error bound of Physics-Informed Neural Networks (PINNs) on most linear ordinary differential equations (ODEs), certain nonlinear ODEs, and first-order linear partial differential equations (PDEs). The error bound is based purely on equation structure and residual information and does not depend on assumptions of how well the networks are trained. We propose algorithms that bound the error efficiently. Some proposed algorithms provide tighter bounds than others at the cost of longer run time.
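To convey the flavor of a purely residual-based bound in the simplest setting, the sketch below bounds the error of an imperfect approximate solution to the scalar linear ODE y' = -lam*y via variation of constants; the "network" is a hand-made stand-in, and the paper's algorithms cover far broader equation classes.

```python
import numpy as np

lam = 1.0
t = np.linspace(0.0, 5.0, 2001)
y_true = np.exp(-lam * t)                        # exact solution of y' = -lam*y, y(0) = 1
y_hat = y_true + 0.01 * np.sin(3 * t)            # stand-in for an imperfect PINN output

# residual of the equation: r(t) = y_hat'(t) + lam * y_hat(t)
r = np.gradient(y_hat, t) + lam * y_hat

# variation of constants gives a bound that uses only the residual:
# |e(t)| <= |e(0)| e^{-lam t} + int_0^t e^{-lam (t - s)} |r(s)| ds
integrand = np.exp(lam * t) * np.abs(r)
cum = np.concatenate([[0.0], np.cumsum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t))])
bound = np.abs(y_hat[0] - 1.0) * np.exp(-lam * t) + np.exp(-lam * t) * cum

print(np.all(np.abs(y_hat - y_true) <= bound + 1e-6))   # True: the bound holds pointwise
```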
We consider missingness in the context of causal inference when the outcome of interest may be missing. If the outcome directly affects its own missingness status, i.e., it is "self-censoring", this may lead to severely biased causal effect estimates. Miao et al. [2015] proposed the shadow variable method to correct for bias due to self-censoring; however, verifying the required model assumptions can be difficult. Here, we propose a test based on a randomized incentive variable offered to encourage reporting of the outcome that can be used to verify identification assumptions that are sufficient to correct for both self-censoring and confounding bias. Concretely, the test confirms whether a given set of pre-treatment covariates is sufficient to block all backdoor paths between the treatment and outcome as well as all paths between the treatment and missingness indicator after conditioning on the outcome. We show that under these conditions, the causal effect is identified by using the treatment as a shadow variable, and it leads to an intuitive inverse probability weighting estimator that uses a product of the treatment and response weights. We evaluate the efficacy of our test and downstream estimator via simulations.
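Here is a minimal sketch of an inverse probability weighting estimator with a product of treatment and response weights, under the simplifying assumption that the response mechanism can be fit directly (missingness at random given covariates and treatment); under self-censoring, the response weights would instead have to be identified via the paper's treatment-as-shadow-variable argument. Model choices and the simulated data are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_product(X, T, R, Y_filled):
    """Hajek-style IPW with a product of treatment and response weights.
    Assumes (for this sketch) that R is missing-at-random given (X, T),
    so P(R=1 | X, T) can be fit by plain logistic regression."""
    pt = LogisticRegression().fit(X, T).predict_proba(X)              # P(T | X)
    pr = LogisticRegression().fit(np.c_[X, T], R).predict_proba(np.c_[X, T])[:, 1]
    means = []
    for t in (0, 1):
        w = ((T == t) & (R == 1)) / (pt[:, t] * pr)                   # product weights
        means.append((w * Y_filled).sum() / w.sum())
    return means[1] - means[0]                                        # ATE estimate

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 1))
T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = X[:, 0] + T + rng.normal(size=n)                                  # true effect = 1
R = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * X[:, 0] + 0.5 * T))))     # outcome observed?
print(ipw_product(X, T, R, np.where(R == 1, Y, 0.0)))                 # close to 1.0
```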
Probabilistic "if A then B" rules are typically formalized as Bayesian conditionals P(B|A), as many (e.g., Pearl) have argued that Bayesian conditionals are the correct way to think about such rules. However, there are challenges with standard inferences such as modus ponens and modus tollens that might make probabilistic material implication a better candidate at times for rule-based systems employing forward-chaining; and arguably material implication is still suitable when information about prior or conditional probabilities is not available at all. We investigate a generalization of probabilistic material implication and Bayesian conditionals that combines the advantages of both formalisms in a systematic way and prove basic properties of the generalized rule, in particular, for inference chains in graphs.
Text classifiers built on Pre-trained Language Models (PLMs) have achieved remarkable progress in various tasks including sentiment analysis, natural language inference, and question-answering. However, these text classifiers sometimes make uncertain predictions, which challenges their trustworthiness when deployed in practical applications. Much effort has been devoted to designing probes in order to understand what PLMs capture, but few works have explored the factors influencing PLM-based classifiers' predictive uncertainty. In this paper, we propose CUE, a novel framework for interpreting the uncertainty of PLM-based models' predictions. In particular, we first map PLM-encoded representations to a latent space via a variational auto-encoder. We then generate text representations by perturbing the latent space, which causes fluctuations in predictive uncertainty. By comparing the predictive uncertainty of the perturbed and original text representations, we are able to identify the latent dimensions that cause uncertainty and thus trace back to the input features that lead to it. Our extensive experiments on four benchmark datasets for linguistic acceptability classification, emotion classification, and natural language inference show the feasibility of our proposed framework.
We introduce the Kernel Calibration Conditional Stein Discrepancy test (KCCSD test), a nonparametric, kernel-based test for assessing the calibration of probabilistic models with well-defined scores. In contrast to previous methods, our test avoids the need for possibly expensive expectation approximations while providing control over its type-I error. We achieve these improvements by using a new family of kernels for score-based probabilities that can be estimated without probability density samples, and by using a Conditional Goodness of Fit criterion for the KCCSD test's U-statistic. The tractability of the KCCSD test widens the surface area of calibration measures to new promising use-cases, such as regularization during model training. We demonstrate the properties of our test on various synthetic settings.
Most existing approaches to differentially private (DP) machine learning focus on private training. Despite its many advantages, private training lacks the flexibility in adapting to incremental changes to the training dataset such as deletion requests from exercising GDPR's right to be forgotten. We revisit a long-forgotten alternative, known as private prediction, and propose a new algorithm named Individual Kernelized Nearest Neighbor (Ind-KNN). Ind-KNN is easily updatable over dataset changes and it allows precise control of the Rényi DP at an individual user level --- a user's privacy loss is measured by the exact amount of her contribution to predictions; and a user is removed if her prescribed privacy budget runs out. Our results show that Ind-KNN consistently improves the accuracy over existing private prediction methods for a wide range of ϵ on four vision and language tasks. We also illustrate several cases under which Ind-KNN is preferable over private training with NoisySGD.
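A heavily simplified sketch of the private-prediction pattern described above, with a flat illustrative per-query privacy charge standing in for real Rényi-DP accounting; all names and parameters are assumptions, not the paper's implementation.

```python
import numpy as np

def ind_knn_predict(query, X, y, budgets, n_classes, k=20, sigma=1.0, cost=0.05, seed=0):
    """Sketch: only users with remaining budget participate; the k most
    similar of them cast clipped votes; Gaussian noise privatizes the vote
    histogram; only the contributing users are charged."""
    rng = np.random.default_rng(seed)
    active = np.flatnonzero(budgets > 0)              # exhausted users are excluded
    dists = np.linalg.norm(X[active] - query, axis=1)
    nbrs = active[np.argsort(dists)[:k]]              # k nearest active users
    votes = np.zeros(n_classes)
    for i in nbrs:
        votes[y[i]] += 1.0                            # each user's vote clipped to 1
        budgets[i] -= cost                            # charge contributors only
    votes += rng.normal(0.0, sigma, size=n_classes)   # noisy, privatized histogram
    return int(np.argmax(votes))

rng = np.random.default_rng(1)
X, y = rng.normal(size=(200, 5)), rng.integers(0, 3, size=200)
budgets = np.full(200, 1.0)
print(ind_knn_predict(np.zeros(5), X, y, budgets, n_classes=3))
```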
We develop a novel approach to determine the optimal policy in entropy-regularized reinforcement learning (RL) with stochastic dynamics. For deterministic dynamics, the optimal policy can be derived using Bayesian inference in the control-as-inference framework; however, for stochastic dynamics, the direct use of this approach leads to risk-taking, optimistic policies. To address this issue, current approaches in entropy-regularized RL involve a constrained optimization procedure that fixes the system dynamics to the original dynamics; however, this approach is not consistent with the unconstrained Bayesian inference framework. In this work we resolve this inconsistency by developing an exact mapping from the constrained optimization problem in entropy-regularized RL to a different optimization problem which can be solved using the unconstrained Bayesian inference approach. We show that the optimal policies are the same for both problems; thus, our results lead to the exact solution for the optimal policy in entropy-regularized RL with stochastic dynamics through Bayesian inference.
Supervised learning typically focuses on learning transferable representations from training examples annotated by humans. While rich annotations (like soft labels) carry more information than sparse annotations (like hard labels), they are also more expensive to collect. We use information theory to compare how a number of commonly used supervision signals contribute to representation-learning performance, as well as how their capacity is affected by factors such as the number of labels, classes, dimensions, and noise. Our framework provides theoretical justification for using hard labels in the big-data regime, but richer supervision signals for few-shot learning and out-of-distribution generalization. We validate these results empirically in a series of experiments with over 1 million crowdsourced image annotations and conduct a cost-benefit analysis to establish a tradeoff curve that enables users to optimize the cost of supervising representation learning on their own datasets.
Do common assumptions about the way that crowd workers make mistakes in microtask (labeling) applications manifest in real crowdsourcing data? Prior work only addresses this question indirectly. Instead, it primarily focuses on designing new label aggregation algorithms, seeming to imply that better performance justifies any additional assumptions. However, empirical evidence in past instances has raised significant challenges to common assumptions. We continue this line of work, using crowdsourcing data itself as directly as possible to interrogate several basic assumptions about workers and tasks. We find strong evidence that the assumption that workers respond correctly to each task with a constant probability, which is common in theoretical work, is implausible in real data. We also illustrate how heterogeneity among tasks and workers can take different forms, which have different implications for the design and evaluation of label aggregation algorithms.
Causal discovery from interventional data is an important problem, where the task is to design an interventional strategy that learns the hidden ground-truth causal graph G(V,E) on |V| = n nodes while minimizing the number of performed interventions. Most prior interventional strategies broadly fall into two categories: non-adaptive and adaptive. Non-adaptive strategies decide on a single fixed set of interventions to be performed, while adaptive strategies can decide which nodes to intervene on sequentially based on past interventions. While adaptive algorithms may use exponentially fewer interventions than their non-adaptive counterparts, there are practical concerns that constrain the amount of adaptivity allowed. Motivated by this trade-off, we study the problem of r-adaptivity, where the algorithm designer recovers the causal graph in a total of r sequential rounds whilst trying to minimize the total number of interventions. For this problem, we provide an r-adaptive algorithm that achieves an O(min{r, log n} · n^{1/min{r, log n}}) approximation with respect to the verification number, a well-known lower bound for adaptive algorithms. Furthermore, for every r, we show that our approximation is tight. Our definition of r-adaptivity interpolates nicely between the non-adaptive (r = 1) and fully adaptive (r = n) settings, where our approximation simplifies to O(n) and O(log n) respectively, matching the best-known approximation guarantees for both extremes. Our results also extend naturally to bounded-size interventions.
Probabilistic, hierarchically coherent forecasting is a key problem in many practical forecasting applications -- the goal is to obtain coherent probabilistic predictions for a large number of time series arranged in a pre-specified tree hierarchy. In this paper, we present an end-to-end deep probabilistic model for hierarchical forecasting that is motivated by a classical top-down strategy. It jointly learns the distribution of the root time series and the (Dirichlet) proportions according to which each parent time series is split among its children at any point in time. The resulting forecasts are naturally coherent and provide probabilistic predictions over all time series in the hierarchy. We experiment on several public datasets and demonstrate significant improvements of up to 26% on most datasets compared to state-of-the-art baselines. Finally, we also provide theoretical justification for the superiority of our top-down approach compared to more traditional bottom-up modeling.
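A small sketch of the top-down sampling step described above, with fixed illustrative Dirichlet concentrations standing in for the learned, time-varying proportions; coherence holds by construction.

```python
import numpy as np

def top_down_sample(root_samples, children, alphas, rng):
    """Sample coherent forecasts top-down: given samples of a parent's
    total, split it among children by Dirichlet-distributed proportions.
    `children[node]` lists a node's children; `alphas[node]` holds the
    Dirichlet concentration used to split that node."""
    forecasts = {"root": root_samples}
    stack = ["root"]
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        if not kids:
            continue
        props = rng.dirichlet(alphas[node], size=len(forecasts[node]))
        for j, kid in enumerate(kids):
            forecasts[kid] = forecasts[node] * props[:, j]
            stack.append(kid)
    return forecasts

rng = np.random.default_rng(0)
children = {"root": ["A", "B"], "A": ["A1", "A2"]}
alphas = {"root": np.array([2.0, 1.0]), "A": np.array([1.0, 1.0])}
f = top_down_sample(rng.lognormal(3.0, 0.2, size=1000), children, alphas, rng)
print(np.allclose(f["A1"] + f["A2"], f["A"]))   # children sum to parent: coherent
```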