This paper studies a cooperative multi-agent bandit scenario in which the rewards observed by agents are heterogeneous---one agent's meat can be another agent's poison. Specifically, the total reward observed by each agent is the sum of two values: an arm-specific reward, capturing the intrinsic value of the arm, and a privately-known agent-specific reward, which captures the personal preferences/limitations of the agent. This heterogeneity in total reward leads to different locally optimal arms for different agents but creates an opportunity for *free exploration* in a cooperative setting---an agent can freely explore its locally optimal arm with no regret and share this free observation with other agents who would incur regret if they pulled this arm, since the arm is not optimal for them. We first characterize a regret lower bound that captures free exploration, i.e., arms that can be freely explored contribute nothing to the regret lower bound. Then, we present a cooperative bandit algorithm that takes advantage of free exploration and achieves a near-optimal regret upper bound that matches the regret lower bound up to a constant factor. Lastly, we run numerical simulations to compare our algorithm with various baselines that lack free exploration.
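As a concrete illustration, here is a minimal numpy sketch of the reward model and of how a free observation can be shared; all names and constants are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 4, 3                          # arms, agents
theta = rng.normal(size=K)           # arm-specific mean rewards (shared)
beta = rng.normal(size=(M, K))       # privately-known agent-specific rewards

def pull(agent, arm):
    """Total reward = arm-specific reward + agent-specific reward + noise."""
    return theta[arm] + beta[agent, arm] + rng.normal(scale=0.1)

# Each agent's local optimum depends on theta + beta, so optima differ:
local_opt = np.argmax(theta + beta, axis=1)

# Free exploration: agent i pulls its own optimal arm (zero regret for i),
# subtracts its privately-known offset, and shares the result, giving every
# other agent an unbiased sample of theta[arm] without pulling the arm.
i = 0
arm = int(local_opt[i])
shared = pull(i, arm) - beta[i, arm]   # usable by any other agent
```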
The vision transformer has achieved state-of-the-art performance in various vision tasks; however, its memory consumption is larger than that of previous convolutional neural network (CNN) based models because of the O(N^2) time and memory complexity of general self-attention. Many approaches aim to reduce this complexity to O(N); however, to compensate for the resulting performance degradation, they stack deep convolutional layers to retain locality or complicate the architecture, as seen in window attention. To solve these problems, we propose the ViBid algorithm, which resolves the O(N^2) complexity problem by replacing Softmax with bidirectional normalization (BiNorm). In addition, it has a much simpler architecture than existing transformer models with O(N) complexity. Owing to this simple architecture, we were able to train at larger resolutions and obtained a lighter model with superior GPU throughput and competitive performance. Because of BiNorm, ViBid can be used with any transformer method that uses queries, keys, and values (QKV), and its simple architectural structure makes it broadly applicable.
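The exact BiNorm operation is not specified in the abstract; the hedged sketch below only illustrates the underlying associativity trick, namely how replacing the row-wise Softmax with separate normalizations of Q and K allows attention to be computed in O(N) without materializing an N x N matrix.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    # Normalize queries along the feature axis and keys along the token axis
    # (a stand-in for a "bidirectional" normalization; ViBid's may differ).
    Qn = Q / (np.abs(Q).sum(axis=-1, keepdims=True) + eps)   # (N, d)
    Kn = K / (np.abs(K).sum(axis=0, keepdims=True) + eps)    # (N, d)
    # softmax(Q K^T) V is O(N^2); reassociating gives O(N d^2):
    return Qn @ (Kn.T @ V)                                   # (N, d)

N, d = 1024, 64
Q, K, V = (np.random.randn(N, d) for _ in range(3))
out = linear_attention(Q, K, V)
```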
Deep Reinforcement Learning (DRL) has demonstrated great potential in solving sequential decision-making problems in many applications. Despite its promising performance, practical gaps exist when deploying DRL in real-world scenarios. One main barrier is the over-fitting issue, which leads to poor generalizability of the policy learned by DRL. In particular, for offline DRL with observational data, model selection is a challenging task, as there is no ground truth available for performance demonstration, in contrast with the online setting and its simulated environments. In this work, we propose a pessimistic model selection (PMS) approach for offline DRL with a theoretical guarantee, which features a tuning-free framework for finding the best policy among a set of candidate models. Two refined approaches are also proposed to address the potential bias of DRL models in identifying the optimal policy. Numerical studies demonstrate the superior performance of our approach over existing methods.
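A minimal sketch of the pessimistic selection principle, under the assumption that each candidate comes with a value estimate and an uncertainty measure; the function name and uncertainty model are illustrative, not the paper's exact construction.

```python
import numpy as np

def pessimistic_select(value_estimates, std_errors, alpha=1.0):
    """Pick the candidate policy maximizing a lower confidence bound
    (estimate minus alpha * uncertainty) rather than the raw estimate."""
    lcb = np.asarray(value_estimates) - alpha * np.asarray(std_errors)
    return int(np.argmax(lcb))

# Candidate 1 has the highest point estimate but is very uncertain;
# pessimism prefers the better-calibrated candidate 0.
print(pessimistic_select([0.80, 0.85, 0.70], [0.02, 0.30, 0.05]))  # -> 0
```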
Dataset condensation aims to condense the original training dataset into a small synthetic dataset for data-efficient learning. Recently proposed dataset condensation techniques allow model trainers with limited resources to learn acceptable deep learning models on a small amount of synthetic data. However, in an adversarial environment, when the original dataset is poisoned, dataset condensation may encode the poisoning information into the condensed synthetic dataset. To explore the vulnerability of dataset condensation to data poisoning, we revisit the state-of-the-art targeted data poisoning method and customize a targeted data poisoning algorithm for dataset condensation. By executing the two poisoning methods, we demonstrate that, when the synthetic dataset is condensed from a poisoned dataset, models trained on the synthetic dataset may predict the targeted sample as the attacker-chosen label. To defend against data poisoning, we introduce the concept of poisoned deviation to quantify the poisoning effect. We further propose a poisoning-resilient dataset condensation algorithm with a calibration method that reduces poisoned deviation. Extensive evaluations demonstrate that our proposed algorithm can protect the synthetic dataset from data poisoning with only a minor performance drop.
The similarity matrix measures the pairwise similarities between a set of data points. It is an essential concept in data processing and is routinely used in practical applications. Obtaining the similarity matrix is usually trivial when the data points are completely observed. However, obtaining a high-quality similarity matrix becomes hard under incomplete observations, and harder still on sequential data streams. To address this challenge, we propose matrix correction algorithms that leverage the positive semi-definiteness of the similarity matrix to provide improved similarity estimation in both offline and online scenarios. Our approaches have a solid theoretical guarantee of performance and excellent potential for parallel execution on large-scale data. They also exhibit high effectiveness and efficiency in empirical evaluations, with significantly improved results over classical imputation-based methods, benefiting downstream applications with superior performance.
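A minimal sketch of one standard way to exploit positive semi-definiteness: project a noisy or imputed similarity estimate onto the PSD cone by clipping negative eigenvalues. The paper's correction algorithms may be more refined; this only illustrates the constraint being leveraged.

```python
import numpy as np

def nearest_psd(S):
    S = (S + S.T) / 2                       # symmetrize first
    w, V = np.linalg.eigh(S)
    w = np.clip(w, 0.0, None)               # drop the negative spectrum
    return (V * w) @ V.T

S_noisy = np.array([[1.0, 0.9, 0.2],
                    [0.9, 1.0, -0.7],
                    [0.2, -0.7, 1.0]])      # indefinite, e.g., after imputation
S_fixed = nearest_psd(S_noisy)
assert np.all(np.linalg.eigvalsh(S_fixed) >= -1e-10)
```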
Off-policy policy evaluation has been a critical and challenging problem in reinforcement learning, and Temporal-Difference (TD) learning is one of the most important approaches for addressing it. Notably, Full Importance-Sampling TD is the only existing off-policy TD method guaranteed to find the on-policy TD fixed point in the linear function approximation setting but, unfortunately, suffers from high variance and is hardly practical. This notorious high-variance issue motivated the introduction of Emphatic TD, which tames the variance but has a biased fixed point. Inspired by these two methods, we propose a new consistent algorithm with a transient bias that strikes a balance between bias and variance. Further, we unify the new algorithm with several existing algorithms into a new family of consistent algorithms called \emph{Consistent Emphatic TD} (CETD(λ, β, ν)), which controls a smooth bias-variance trade-off by varying the speed at which the transient bias fades. Through theoretical analysis and experiments on a didactic example, we validate the consistency of CETD(λ, β, ν). Moreover, we show that CETD(λ, β, ν) converges faster to the lowest error in a complex, high-variance task.
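An illustrative sketch of the trade-off at play, not the paper's exact updates: a follow-on-style trace accumulates importance-sampling ratios, and a decay knob (here called beta, an assumption on our part) controls how fast old ratios, and thus the transient bias, fade away.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, beta = 0.95, 0.7
F = 1.0
for t in range(100):
    rho = np.exp(rng.normal(scale=0.3))     # a noisy importance ratio
    F = beta * gamma * rho * F + 1.0        # fading accumulation of ratios
# With beta < 1 the trace stays bounded (lower variance, transient bias);
# with beta = 1 it keeps full products of ratios, whose variance can blow up.
```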
Neural Processes (NPs) are appealing due to their ability to perform fast adaptation based on a context set. This set is encoded by a latent variable, which is often assumed to follow a simple distribution. However, in real-world settings, the context set may be drawn from richer distributions with multiple modes, heavy tails, etc. In this work, we provide a framework that allows NPs’ latent variable to be given a rich prior defined by a graphical model. These distributional assumptions directly translate into an appropriate aggregation strategy for the context set. Moreover, we describe a message-passing procedure that still allows for end-to-end optimization with stochastic gradients. We demonstrate the generality of our framework by using mixture and Student-t assumptions that yield improvements in function modelling and test-time robustness.
Reinforcement learning algorithms have achieved impressive success in learning behaviors from pixels. However, their application to real-world tasks remains challenging because of their sensitivity to visual distractions (e.g., changes in viewpoint and light). A major reason is that the learned representations often overfit to task-irrelevant information. By comparing several representation learning methods, we find that the key to robust representation learning is the choice of prediction targets. Therefore, we propose a novel representation learning approach---namely, Reward Sequence Prediction (RSP)---that uses reward sequences or their transforms (e.g., the discrete-time Fourier transform) as prediction targets. RSP can learn robust representations efficiently because reward sequences rarely contain task-irrelevant information while providing sufficient supervisory signal to accelerate representation learning. An appealing feature is that RSP makes no assumption about the type of distraction and thus can improve performance even when multiple types of distractions exist. We evaluate our approach on the Distracting Control Suite. Experiments show that our method achieves state-of-the-art sample efficiency and generalization ability in tasks with distractions.
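A hedged sketch of the prediction-target idea: given a rollout's reward sequence, use the sequence itself or a transform of it as the regression target for the representation encoder. The function name, coefficient count, and choice of magnitude spectrum are illustrative assumptions.

```python
import numpy as np

def rsp_targets(rewards, n_coeffs=8):
    """Return the first few DFT magnitudes of the reward sequence as a
    fixed-size, distraction-free prediction target for the encoder."""
    spectrum = np.fft.rfft(np.asarray(rewards, dtype=float))
    return np.abs(spectrum)[:n_coeffs]

rollout_rewards = np.sin(np.linspace(0, 6 * np.pi, 64)) + 1.0
target = rsp_targets(rollout_rewards)   # what the encoder learns to predict
```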
Information-theoretic clustering is one of the most promising and principled approaches to finding clusters with minimal a priori assumptions. The key criterion therein is to maximize the mutual information between the data points and their cluster labels. We instead propose to maximize the Kullback–Leibler (KL) divergence between the underlying distributions associated with clusters (referred to as cluster distributions). We show this to be equivalent to optimizing the mutual information criterion while simultaneously maximizing the cross entropy between the cluster distributions. For practical efficiency, we propose to empirically estimate the KL divergence between clusters in its dual form, leveraging deep neural nets as a dual function approximator. Remarkably, our theoretical analysis establishes that estimating the divergence measure in its dual form reduces the clustering problem to that of optimally finding k − 1 cut points for k clusters in the 1-D dual functional space. Overall, our approach enables linear-time clustering algorithms with theoretical guarantees of near-optimality, owing to the submodularity of the objective. We show the empirical superiority of our approach w.r.t. current state-of-the-art methods on the challenging task of clustering noisy time series, as observed in domains such as neuroscience, healthcare, financial markets, and spatio-temporal environmental dynamics.
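The standard dual form of the KL divergence is the Donsker-Varadhan representation, KL(P||Q) = sup_f E_P[f(X)] − log E_Q[exp(f(X))], with f parameterized by a neural net; the paper's exact estimator may differ, but a minimal PyTorch sketch of this dual estimation looks as follows.

```python
import math
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(f.parameters(), lr=1e-3)
p = torch.randn(512, 1) + 1.0   # samples from one cluster distribution P
q = torch.randn(512, 1) - 1.0   # samples from the other cluster distribution Q

for _ in range(500):
    # Donsker-Varadhan bound: KL(P||Q) >= E_P[f] - log E_Q[exp(f)]
    dv = f(p).mean() - (torch.logsumexp(f(q).squeeze(-1), dim=0) - math.log(len(q)))
    opt.zero_grad()
    (-dv).backward()            # ascend the lower bound
    opt.step()

kl_estimate = dv.item()
```

Note that f maps every point to a scalar, which is the 1-D dual functional space in which the abstract's k − 1 cut points live.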
Detecting out-of-distribution (OOD) samples is a problem of practical importance for the reliable use of deep neural networks (DNNs) in production settings. The corollary to this problem is the detection of in-distribution (ID) samples, which is applicable in domain adaptation scenarios for augmenting a train set with ID samples from other datasets, or in continual learning for replay from the past. For both ID and OOD detection, we propose a principled yet simple approach of (empirically) estimating KL-Divergence, in its dual form, for a given test set w.r.t. a known set of ID samples, in order to quantify the contribution of each test sample individually towards the divergence measure and accordingly detect it as OOD or ID. Our approach is compute-efficient and enjoys strong theoretical guarantees. For WideResnet101 and ViT-L-16, with the ImageNet-1k dataset as the ID benchmark, we evaluate the proposed OOD detector on 51 test (OOD) datasets and observe drastically and consistently lower false positive rates than all competitive methods. Moreover, the proposed ID detector is evaluated, using ECG and stock price datasets, on the task of data augmentation in domain adaptation and continual learning settings, and we observe higher efficacy compared to relevant baselines.
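A schematic sketch of how per-sample scores fall out of a dual-form estimate: once a dual function has been fit between the test set and the ID set, each sample's dual-function value is its individual contribution to the divergence. The architecture and thresholding rule are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Assume f has been fit to (approximately) maximize the dual objective
# E_test[f] - log E_ID[exp(f)] between a test batch and the ID samples.
f = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

@torch.no_grad()
def ood_flags(x_test, threshold):
    # Larger contribution to the divergence estimate -> more likely OOD.
    scores = f(x_test).squeeze(-1)
    return scores, scores > threshold

scores, is_ood = ood_flags(torch.randn(16, 128), threshold=0.0)
```

In practice the threshold would be calibrated on held-out ID scores, e.g., at a chosen ID percentile.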
We address the problem of identifiability of an arbitrary conditional causal effect given both the causal graph and a set of any observational and/or interventional distributions of the form Q[S]:=P(S|do(V∖S)), where V denotes the set of all observed variables and S⊆V. We call this problem conditional generalized identifiability (c-gID in short) and prove the completeness of Pearl's do-calculus for the c-gID problem by providing a sound and complete algorithm for it. This work revisits the c-gID problem of Lee et al. [2020] and Correa et al. [2021] by explicitly adding the positivity assumption, which is crucial for identifiability. It extends the results of [Lee et al., 2019, Kivva et al., 2022] on general identifiability (gID), which studied the problem for unconditional causal effects, and of Shpitser and Pearl [2006b] on the identifiability of conditional causal effects given merely the observational distribution P(V), as our algorithm generalizes the algorithms proposed in [Kivva et al., 2022] and [Shpitser and Pearl, 2006b].
Item-to-Item (I2I) recommendation is an important function in most recommendation systems; it generates replacement or complement suggestions for a particular item based on its semantic similarities to other cataloged items. Given that subsets of items in a recommendation system might be interacted with by the same set of customers, graph-based models, such as graph neural networks (GNNs), provide a natural framework to combine, ingest and extract valuable insights from such high-order relational interactions between cataloged items, as well as their metadata features, as has been shown in many recent studies. However, learning GNNs effectively for I2I requires ingesting a large amount of relational data, which might not always be available, especially in new, emerging market segments. To mitigate this data bottleneck, we postulate that recommendation patterns learned from existing mature market segments (with private data) can be adapted to build effective warm-start models for emerging ones. To achieve this, we propose and investigate a personalized federated modeling framework based on GNNs to summarize, assemble and adapt recommendation patterns across market segments with heterogeneous customer behaviors into effective local models. Our key contribution is a personalized graph adaptation model that bridges the gap between the recent literature on federated GNNs and (non-graph) personalized federated learning, which either does not optimize for the adaptability of the federated model or is restricted to local models with homogeneous parameterization, excluding GNNs with heterogeneous local graphs. The effectiveness of our framework is demonstrated on a real-world dataset with multiple item categories spanning multiple market segments.
Unsupervised anomaly detection is a binary classification task that detects anomalies in unseen samples given only unlabeled normal data. Reconstruction-based approaches are widely used: they minimize reconstruction error on training data to learn normal patterns and quantify the degree of anomaly by the reconstruction error on testing data. However, this approach tends to miss anomalies when the normal data contains multiple patterns, because the model generalizes unrestrictedly beyond the normal patterns, even to anomalous ones. In this paper, we propose a memory mechanism that memorizes typical normal patterns in a capacity-controlled, external, differentiable matrix, so that the model's generalization to anomalies is limited by retrieval from the matrix. We achieve state-of-the-art performance on several public benchmarks.
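A hedged sketch of such a capacity-controlled external memory, in the spirit of memory-augmented autoencoders; the addressing scheme and dimensions here are assumptions, not necessarily the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Memory(nn.Module):
    def __init__(self, n_slots=50, dim=128):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(n_slots, dim))  # learned prototypes

    def forward(self, z):
        # Address the memory by cosine similarity and rebuild the latent code
        # as a convex combination of stored normal prototypes, so anomalous
        # codes cannot be expressed and yield large reconstruction errors.
        attn = F.softmax(F.normalize(z, dim=-1) @ F.normalize(self.slots, dim=-1).T,
                         dim=-1)
        return attn @ self.slots

mem = Memory()
z = torch.randn(8, 128)            # encoder output
z_hat = mem(z)                     # decoder input, restricted to normal patterns
```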
We describe a simple two-party protocol in which each party contributes a set as input. The output of the protocol is an estimate of the cardinality of the intersection of the two input sets. We show that our protocol is efficient and secure. In particular, we show that the space complexity and communication complexity are constant, the time complexity for each party is proportional to the size of their input set, and that our protocol is differentially private. We also analyze the distribution of the output of the protocol, deriving both its asymptotic distribution and finite-sample bounds on its tail probabilities. These analyses show that, when the input sets are large, our protocol produces accurate set intersection cardinality estimates. As such, we claim that our protocol is an attractive alternative to traditional private set intersection cardinality (PSI-CA) protocols when the input sets are large, exact precision is not required, and differential privacy on its own can provide sufficient protection to the underlying sensitive data.
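The abstract does not spell out the protocol's construction, so the sketch below only illustrates the general recipe such protocols build on: constant-size mergeable summaries plus calibrated noise. Here MinHash signatures (an assumption on our part, not necessarily the paper's summary) estimate the Jaccard similarity J, the identity |A∩B| = J(|A|+|B|)/(1+J) converts it to an intersection cardinality, and Laplace noise is added in the spirit of differential privacy.

```python
import numpy as np

def minhash(items, n_hashes=128, seed=7):
    """Constant-size signature; both parties must share the same seed."""
    rng = np.random.default_rng(seed)
    p = (1 << 61) - 1                                  # a Mersenne prime
    a = [int(v) for v in rng.integers(1, p, size=n_hashes)]
    b = [int(v) for v in rng.integers(0, p, size=n_hashes)]
    return [min((a[i] * hash(x) + b[i]) % p for x in items)
            for i in range(n_hashes)]

A, B = set(range(0, 8000)), set(range(4000, 12000))    # true overlap: 4000
sa, sb = minhash(A), minhash(B)                        # the only messages exchanged
jaccard = sum(u == v for u, v in zip(sa, sb)) / len(sa)
estimate = jaccard / (1.0 + jaccard) * (len(A) + len(B))
estimate += np.random.default_rng(0).laplace(scale=50.0)  # DP noise; scale illustrative
```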
For the stable optimization of deep neural networks, regularization methods such as dropout and batch normalization have been used in various tasks. Nevertheless, the correct position at which to apply dropout has rarely been discussed, and different positions have been employed depending on the practitioner. In this study, we investigate the correct position at which to apply dropout. We demonstrate that, for a residual network with batch normalization, applying dropout at certain positions increases the performance, whereas applying it at other positions decreases the performance. Based on theoretical analysis, we provide the following guideline for the correct position: apply one dropout after the last batch normalization but before the last weight layer in the residual branch. We provide detailed theoretical explanations to support this claim and demonstrate them through module tests. In addition, we investigate the correct position of dropout in the head that produces the final prediction. Although the current consensus is to apply dropout after global average pooling, we prove that applying dropout before global average pooling leads to a more stable output. The proposed guidelines are validated through experiments with different datasets and models.
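The two guidelines rendered as a PyTorch sketch; the pre-activation block layout and all dimensions are our assumptions, the dropout placements follow the text above.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch, p=0.1):
        super().__init__()
        self.branch = nn.Sequential(
            nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Dropout(p),                 # after the last BN, before the last conv
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return x + self.branch(x)

def head(ch, n_classes, p=0.1):
    # Dropout placed before global average pooling, per the second guideline.
    return nn.Sequential(nn.Dropout(p), nn.AdaptiveAvgPool2d(1),
                         nn.Flatten(), nn.Linear(ch, n_classes))
```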
This paper presents a novel approach for detecting soldering defects in Printed Circuit Boards (PCBs) composed mainly of Surface Mount Technology (SMT) components, using advanced computer vision and deep learning techniques. The main challenge addressed is the detection of soldering defects in new components for which only examples of good soldering are available at the model training phase. To meet industrial quality standards, the leakage rate (i.e., missed-detection rate) must be kept low. To address this, we design the system to be "unknown-aware" with a low unknown rate and utilize the knowledge gained from the soldering examples of old components to detect soldering defects in new components. We evaluated the method on a real-world dataset from an electronics company. Compared to the baseline approach, it significantly reduces the leakage rate from 1.827\% ± 3.063\% to 0.063\% ± 0.075\%, with an unknown rate of 3.706\% ± 2.270\%.
We study a matrix recovery problem with unknown correspondence: given the observation matrix $M_o = [A, \tilde{P}B]$, where $\tilde{P}$ is an unknown permutation matrix, we aim to recover the underlying matrix $M = [A, B]$. Such a problem commonly arises in applications where heterogeneous data are combined and the correspondence between them is unknown, e.g., due to data mishandling or privacy concerns. We show that, in some applications, it is possible to recover M by solving a nuclear norm minimization problem. Moreover, under a proper low-rank condition on M, we derive a non-asymptotic error bound for the recovery of M. We propose an algorithm, M3O (Matrix recovery via Min-Max Optimization), which recasts this combinatorial problem as a continuous minimax optimization problem and solves it by proximal gradient with a Max-Oracle. M3O can also be applied to a more general scenario where entries of $M_o$ are missing and there are multiple groups of data with distinct unknown correspondences. Experiments on simulated data, the MovieLens 100K dataset and the Yale B database show that M3O achieves state-of-the-art performance over several baselines and can recover the ground-truth correspondence with high accuracy.
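A toy rendition of the two ingredients named above, a max-oracle plus a proximal step; the objective, oracle, and step sizes are simplified assumptions and differ from the paper's exact formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def svt(X, tau):
    """Proximal operator of the nuclear norm: singular-value thresholding."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def recover(A, B_shuffled, tau=1.0, iters=50, step=0.5):
    M = np.hstack([A, B_shuffled])                  # initial low-rank guess
    n, dA = A.shape
    for _ in range(iters):
        # Max-oracle: best correspondence between shuffled rows and the
        # current estimate's B-part, via linear assignment.
        cost = -B_shuffled @ M[:, dA:].T
        rows, cols = linear_sum_assignment(cost)
        B_aligned = np.empty_like(B_shuffled)
        B_aligned[cols] = B_shuffled[rows]
        target = np.hstack([A, B_aligned])
        # Proximal gradient step on 0.5 * ||M - target||^2 + tau * ||M||_*
        M = svt(M - step * (M - target), step * tau)
    return M
```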
Motivated by the problem of online canonical correlation analysis, we propose the \emph{Stochastic Scaled-Gradient Descent} (SSGD) algorithm for minimizing the expectation of a stochastic function over a generic Riemannian manifold. SSGD generalizes the idea of projected stochastic gradient descent and allows the use of scaled stochastic gradients in place of stochastic gradients. In the special case of a spherical constraint, which arises in generalized eigenvector problems, we establish a nonasymptotic finite-sample bound of order 1/√T, and show that this rate is minimax optimal, up to a polylogarithmic factor in the relevant parameters. On the asymptotic side, a novel trajectory-averaging argument allows us to achieve local asymptotic normality with a rate that matches that of Ruppert–Polyak–Juditsky averaging. We bring these ideas together in an application to online canonical correlation analysis, deriving, for the first time in the literature, an optimal one-time-scale algorithm with an explicit rate of local asymptotic convergence to normality. Numerical studies of canonical correlation analysis on synthetic data are also provided.
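A minimal sketch of a scaled stochastic gradient step under the spherical constraint named above: move along the scaled gradient, then retract back onto the unit sphere by normalization. The scaling matrix D, objective, and step sizes are illustrative; D = I recovers plain projected SGD.

```python
import numpy as np

def ssgd_step(x, grad, D, eta):
    y = x - eta * (D @ grad)          # scaled stochastic gradient step
    return y / np.linalg.norm(y)      # retraction onto the unit sphere

# Toy instance: maximize x^T A x subject to ||x|| = 1 (an eigenvector problem).
rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5)); A = A + A.T
x = rng.normal(size=5); x /= np.linalg.norm(x)
for t in range(1, 2001):
    g = -2 * A @ x + rng.normal(scale=0.1, size=5)  # stochastic grad of -x^T A x
    x = ssgd_step(x, g, np.eye(5), eta=1.0 / t)
# x now approximates the eigenvector of A with the largest eigenvalue.
```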
One of the major challenges in multi-person pose estimation is instance-aware keypoint estimation. Previous methods address this problem by leveraging an off-the-shelf detector, a heuristic post-grouping process, or an explicit instance identification process, hindering further improvements in inference speed, which is an important factor for practical applications. From a statistical point of view, those additional processes for identifying instances are necessary to bypass learning the high-dimensional joint distribution of human keypoints, which is a critical factor for another major challenge, the occlusion scenario. In this work, we propose a novel framework for single-stage instance-aware pose estimation that models the joint distribution of human keypoints with a mixture density model, termed MDPose. MDPose estimates the distribution of human keypoints' coordinates using a mixture density model with an instance-aware keypoint head consisting of only 8 convolutional layers. It is trained by minimizing the negative log-likelihood of the ground-truth keypoints. We also propose a simple yet effective training strategy, Random Keypoint Grouping (RKG), which significantly alleviates the underflow problem, enabling successful learning of the relations between keypoints. On the OCHuman dataset, which consists of images with highly occluded people, MDPose achieves state-of-the-art performance by successfully learning the high-dimensional joint distribution of human keypoints. Furthermore, MDPose shows significantly improved inference speed with competitive accuracy on MS COCO, a widely used human keypoint dataset, thanks to its much simpler single-stage pipeline.
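A sketch of the training criterion: the negative log-likelihood of ground-truth keypoints under a mixture density, computed with log-sum-exp for numerical stability (the underflow the abstract refers to arises exactly in such products of many per-keypoint densities). The isotropic Gaussian mixture family and tensor shapes are illustrative assumptions.

```python
import torch

def mdn_nll(pi_logits, mu, log_sigma, y):
    """pi_logits: (B, C); mu, log_sigma: (B, C, K*2); y: (B, K*2)."""
    log_pi = torch.log_softmax(pi_logits, dim=-1)                   # (B, C)
    comp = torch.distributions.Normal(mu, log_sigma.exp())
    log_prob = comp.log_prob(y.unsqueeze(1)).sum(dim=-1)            # (B, C)
    return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()       # underflow-safe

B, C, K = 4, 5, 17                     # batch, mixture components, keypoints
nll = mdn_nll(torch.randn(B, C), torch.randn(B, C, K * 2),
              torch.randn(B, C, K * 2), torch.rand(B, K * 2))
```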
Bayesian probabilistic numerical methods for numerical integration offer significant advantages over their non-Bayesian counterparts: they can encode prior information about the integrand, and can quantify uncertainty over estimates of an integral. However, the most popular algorithm in this class, Bayesian quadrature, is based on Gaussian process models and is therefore associated with a high computational cost. To improve scalability, we propose an alternative approach based on Bayesian neural networks which we call Bayesian Stein networks. The key ingredients are a neural network architecture based on Stein operators, and an approximation of the Bayesian posterior based on the Laplace approximation. We show that this leads to orders-of-magnitude speed-ups on the popular Genz functions benchmark, on challenging problems arising in the Bayesian analysis of dynamical systems, and on the prediction of energy production for a large-scale wind farm.
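A minimal 1-D sketch of the Stein-operator idea behind such architectures: for a density p with score s(x) = d/dx log p(x), the operator (Lu)(x) = u'(x) + u(x)s(x) has mean zero under p, so fitting f(x) ≈ c + (Lu)(x) makes the bias term c an estimate of the integral of f against p. Here u is a polynomial and the fit is plain least squares rather than a Laplace-approximated Bayesian network, so this is only an illustration of the operator, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4000)                 # p = N(0, 1), so score s(x) = -x
f = x**2                                  # integrand; true integral E[x^2] = 1

# Stein features: for u(x) = x^j, (Lu)(x) = j x^(j-1) - x^(j+1), mean zero under p.
Phi = np.stack([np.ones_like(x)] + [j * x**(j - 1) - x**(j + 1) for j in range(1, 5)],
               axis=1)
coef, *_ = np.linalg.lstsq(Phi, f, rcond=None)
print(coef[0])                            # the integral estimate, close to 1.0
```

Richer integrands would replace the polynomial u with a neural network, which is where the Stein-network architecture comes in.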
Client selection schemes are widely adopted to handle communication-efficiency problems in recent studies of Federated Learning (FL). However, the large variance of the model updates aggregated from randomly selected, unrepresentative subsets directly slows FL convergence. We present a novel clustering-based client selection scheme that accelerates FL convergence through variance reduction. Simple yet effective schemes are designed to improve the clustering effect and control its fluctuation, thereby generating client subsets with a certain representativeness of sampling. Theoretically, we demonstrate the improvement of the proposed scheme in variance reduction, and we present a tighter convergence guarantee for the proposed method thanks to the variance reduction. Experimental results confirm the superior efficiency of our scheme compared to alternatives.
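A hedged sketch of the clustering-then-sampling idea: group clients by the similarity of their recent model updates and sample one representative per cluster, which lowers the variance of the aggregated update relative to uniform sampling. The features, clustering method, and per-cluster sampling rule are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_clients(client_updates, n_select, seed=0):
    km = KMeans(n_clusters=n_select, n_init=10, random_state=seed)
    labels = km.fit_predict(client_updates)
    rng = np.random.default_rng(seed)
    # One client per cluster -> a subset representative of the population.
    return [int(rng.choice(np.flatnonzero(labels == c))) for c in range(n_select)]

updates = np.random.default_rng(0).normal(size=(100, 32))  # flattened updates
chosen = select_clients(updates, n_select=10)
```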
Much research in deep learning is devoted to developing new models and training procedures. Training objectives, on the other hand, have received much less attention and are often restricted to combinations of standard losses. When the objective aligns well with the evaluation metric, this is not a major issue. However, when dealing with complex structured outputs, the ideal objective can be hard to optimize, and the efficacy of the usual objectives as a proxy for the true objective can be questionable. In this work, we argue that the existing inference-network-based structured prediction methods~\citep{tu-18, tu2020improving} are indirectly learning to optimize a dynamic loss objective parameterized by the energy model. We then explore using an implicit-gradient-based technique to learn the corresponding dynamic objectives. Our experiments show that implicitly learning a dynamic loss landscape is an effective method for improving model performance in structured prediction.
Explanations for the outputs of deep neural network classifiers are essential for promoting trust and comprehension among users. Conventional methods often offer explanations for only a single class in the output and neglect other classes with high probabilities, resulting in a limited view of the model's behavior. In this paper, we propose a holistic explanation method for image classification. It not only facilitates an overall understanding of model behavior, but also provides a framework in which one can examine the evidence for discriminating between competing classes, thereby yielding contrastive explanations. We demonstrate the advantages of the new method over baselines in terms of both faithfulness to the model and interpretability to users. The source code will be made available to the public upon publication of the paper.
Thompson sampling (TS) is a popular heuristic for action selection, but it requires sampling from a posterior distribution. Unfortunately, this can become computationally intractable in complex environments, such as those modeled using neural networks. Approximate posterior samples can produce effective actions, but only if they reasonably approximate joint predictive distributions of outputs across inputs. Notably, accuracy of marginal predictive distributions does not suffice. Epistemic neural networks (ENNs) are designed to produce accurate joint predictive distributions. We compare a range of ENNs through computational experiments that assess their performance in approximating TS across bandit and reinforcement learning environments. The results indicate that ENNs serve this purpose well and illustrate how the quality of joint predictive distributions drives performance. Further, we demonstrate that the \textit{epinet} --- a small additive network that estimates uncertainty --- matches the performance of large ensembles at orders of magnitude lower computational cost. This enables effective application of TS with computation that scales gracefully to complex environments.
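A hedged sketch of the epinet idea described above: a small additive network that takes the (stopped-gradient) base features together with a random epistemic index z, so that each draw of z yields one coherent joint sample over all inputs, which is exactly what Thompson sampling needs. All layer sizes and the index distribution are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EpinetSketch(nn.Module):
    def __init__(self, d_in=16, d_feat=32, d_index=8, n_out=1):
        super().__init__()
        self.base = nn.Sequential(nn.Linear(d_in, d_feat), nn.ReLU())
        self.head = nn.Linear(d_feat, n_out)
        self.epinet = nn.Sequential(nn.Linear(d_feat + d_index, 32),
                                    nn.ReLU(), nn.Linear(32, n_out))

    def forward(self, x, z):
        phi = self.base(x)
        mean = self.head(phi)
        # Stop-gradient into the epinet, which perturbs the prediction as a
        # function of the index; different z give posterior-like samples.
        zin = torch.cat([phi.detach(), z.expand(phi.shape[0], -1)], dim=-1)
        return mean + self.epinet(zin)

model = EpinetSketch()
x = torch.randn(5, 16)                # five candidate actions' features
z = torch.randn(1, 8)                 # one index -> one joint "posterior sample"
q = model(x, z)                       # Thompson sampling: act greedily w.r.t. q
```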