Node classification on graphs can be formulated as the Dirichlet problem on graphs, where the signal is given at the labeled nodes and the harmonic extension is computed on the unlabeled nodes. This paper considers a time-dependent version of the Dirichlet problem on graphs and shows how to improve its solution by learning a proper initialization vector on the unlabeled nodes. Further, we show that the improved solution is on par with state-of-the-art methods used for node classification. Finally, we conclude the paper by discussing the importance of the parameter t, the advantages of the approach, and future directions.
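For readers unfamiliar with the classical (time-independent) Dirichlet problem the abstract builds on, the following is a minimal sketch of harmonic extension on a graph, in the style of Zhu-type label propagation; it does not implement the paper's time-dependent variant or its learned initialization, and the dense-matrix formulation is chosen purely for clarity.

```python
import numpy as np

def harmonic_extension(W, labeled_idx, y_labeled):
    """Classical harmonic extension: labeled nodes act as boundary values,
    unlabeled nodes solve the discrete Laplace equation.

    W           : (n, n) symmetric adjacency/weight matrix
    labeled_idx : indices of labeled nodes
    y_labeled   : (n_l, c) one-hot labels on the labeled nodes
    """
    n = W.shape[0]
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # combinatorial graph Laplacian
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    L_ul = L[np.ix_(unlabeled_idx, labeled_idx)]
    f = np.zeros((n, y_labeled.shape[1]))
    f[labeled_idx] = y_labeled
    # Harmonic condition on unlabeled rows: L_uu f_u + L_ul f_l = 0.
    f[unlabeled_idx] = np.linalg.solve(L_uu, -L_ul @ y_labeled)
    return f
```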
Best-arm identification (BAI) in a fixed-budget setting is a bandit problem where the learning agent maximizes the probability of identifying the optimal (best) arm after a fixed number of observations. Most works on this topic study unstructured problems with a small number of arms, which limits their applicability. We propose a general tractable algorithm that incorporates the structure, by successively eliminating suboptimal arms based on their mean reward estimates from a joint generalization model. We analyze our algorithm in linear and generalized linear models (GLMs), and propose a practical implementation based on a G-optimal design. In linear models, our algorithm has error guarantees competitive with prior works and performs at least as well empirically. In GLMs, this is the first practical algorithm with an analysis for fixed-budget BAI.
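As a rough illustration of fixed-budget elimination in a linear model, here is a hedged sketch: the budget is split into phases, a shared parameter vector is estimated by least squares, and the worse half of the surviving arms is eliminated each phase. The paper's algorithm allocates pulls via a G-optimal design; uniform sampling over the surviving arms here is a simplification, and the function and argument names (`pull`, `budget`) are our own.

```python
import numpy as np

def linear_successive_elimination(arms, pull, budget, rng=None):
    """arms : (K, d) feature vectors; pull(i) returns a noisy reward for arm i."""
    rng = rng or np.random.default_rng()
    active = list(range(len(arms)))
    n_phases = max(1, int(np.ceil(np.log2(len(arms)))))
    per_phase = max(1, budget // n_phases)
    for _ in range(n_phases):
        if len(active) == 1:
            break
        X, y = [], []
        for _ in range(per_phase):
            i = int(rng.choice(active))            # uniform allocation (simplified)
            X.append(arms[i]); y.append(pull(i))
        theta, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
        scores = arms[active] @ theta              # estimated mean rewards
        order = np.argsort(scores)[::-1]
        active = [active[j] for j in order[:max(1, len(active) // 2)]]
    return active[0]
```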
Semi-weakly supervised semantic segmentation (SWSSS) aims to train a model to identify objects in images based on a small number of images with pixel-level labels, and many more images with only image-level labels. Most existing SWSSS algorithms extract pixel-level pseudo-labels from an image classifier - a very difficult task to do well, hence requiring complicated architectures and extensive hyperparameter tuning on fully-supervised validation sets. We propose a method called prediction filtering, which, instead of extracting pseudo-labels, simply uses the classifier as a classifier: it ignores any segmentation predictions from classes which the classifier is confident are not present. Adding this simple post-processing method to baselines gives results competitive with or better than prior SWSSS algorithms. Moreover, it is compatible with pseudo-label methods: adding prediction filtering to existing SWSSS algorithms further improves segmentation performance.
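A minimal sketch of the kind of post-processing described above, assuming a per-image classifier probability vector and a segmentation logit map; the confidence threshold and function name are illustrative, not taken from the paper.

```python
import torch

def prediction_filtering(seg_logits, cls_probs, threshold=0.1):
    """Suppress segmentation predictions for classes the classifier deems absent.

    seg_logits : (C+1, H, W) segmentation logits, channel 0 = background
    cls_probs  : (C,) image-level class probabilities from the classifier
    threshold  : classes with probability below this are treated as absent (assumed value)
    """
    keep = cls_probs >= threshold                 # (C,) boolean mask of "present" classes
    filtered = seg_logits.clone()
    # Push logits of absent classes to -inf so the argmax can never select them.
    filtered[1:][~keep] = float("-inf")
    return filtered.argmax(dim=0)                 # (H, W) predicted label map
```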
When a person solves a multiple-choice problem, she considers not only what the answer is but also what it is not. Knowing which choices are not the answer and exploiting the relationships between choices, she can improve her prediction accuracy. Inspired by this human reasoning process, we propose a new training strategy to fully utilize inter-class relationships, namely LogitMix. Our strategy is combined with recent data augmentation techniques, e.g., Mixup, Manifold Mixup, CutMix, and PuzzleMix. We then suggest using a mixed logit, i.e., a mixture of two logits, as an auxiliary training objective. Since the logit preserves both positive and negative inter-class relationships, it can guide a network to learn the probability of wrong answers correctly. Our extensive experimental results on image- and language-based tasks demonstrate that LogitMix achieves state-of-the-art performance among recent data augmentation techniques in terms of calibration error and prediction accuracy.
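The following is a hedged sketch of what a mixed-logit auxiliary objective could look like on top of vanilla Mixup; the exact loss form, detaching of the target logits, and the weight `aux_weight` are our assumptions, not the paper's specification.

```python
import torch
import torch.nn.functional as F

def logitmix_loss(model, x1, y1, x2, y2, lam, aux_weight=0.1):
    """Mixup classification loss plus an auxiliary term that matches the logits
    of the mixed input to a mixture of the two clean-input logits."""
    x_mix = lam * x1 + (1.0 - lam) * x2            # standard Mixup input
    logits_mix = model(x_mix)
    # Usual Mixup cross-entropy on the two hard labels.
    ce = lam * F.cross_entropy(logits_mix, y1) + (1.0 - lam) * F.cross_entropy(logits_mix, y2)
    with torch.no_grad():                          # target: mixture of the two clean logits
        target_logit = lam * model(x1) + (1.0 - lam) * model(x2)
    aux = F.mse_loss(logits_mix, target_logit)     # logits also carry negative-class structure
    return ce + aux_weight * aux
```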
We propose a novel black-box approach for performing adversarial attacks against knowledge graph embedding models. An adversarial attack is a small perturbation of the data at training time to cause model failure at test time. We make use of an efficient rule learning approach and use abductive reasoning to identify triples which are logical explanations for a particular prediction. The proposed attack is then based on the simple idea of suppressing or modifying one of the triples in the most confident explanation. Although our attack scheme is model independent and only needs access to the training data, we report results on par with state-of-the-art white-box attack methods that additionally require full access to the model architecture, the learned embeddings, and the loss functions. This is a surprising result which indicates that knowledge graph embedding models can partly be explained post hoc with the help of symbolic methods.
In AI evaluation, performance is often calculated by averaging across various instances. But to fully understand the capabilities of an AI system, we need to understand the factors that cause its pattern of success and failure. In this paper, we present a new methodology to identify and build informative instance features that can provide explanatory and predictive power to analyse the behaviour of AI systems more robustly. The methodology builds on relevant features that should relate monotonically with success, and represents patterns of performance in a new type of plot known as ‘agent characteristic grids’. We illustrate this methodology with the Animal-AI competition as a representative example of how we can revisit existing competitions and benchmarks in AI—even when evaluation data is sparse. Agents with the same average performance can show very different patterns of performance at the instance level. With this methodology, these patterns can be visualised, explained and predicted, progressing towards a capability-oriented evaluation rather than relying on a less informative average performance score.
Positive-unlabeled (PU) learning deals with the circumstances where only a portion of the positive instances are labeled, while the rest, together with all negative instances, are unlabeled; as a result, the class prior is not directly available. Existing PU learning methods usually estimate the class prior by training a nontraditional probabilistic classifier, which is prone to overestimation. Moreover, these methods learn the decision boundary by optimizing the minimum margin, which is not suitable in PU learning due to its sensitivity to label noise. In this paper, we enhance PU learning methods in both of these respects. More specifically, we first explicitly learn a transformation from unlabeled data to positive data via entropy-regularized optimal transport to achieve a much more precise estimate of the class prior. Then we switch to optimizing the margin distribution, rather than the minimum margin, to obtain a classifier that is insensitive to label noise. Extensive empirical studies on both synthetic and real-world data sets demonstrate the superiority of our proposed method.
We introduce Neural Contextual Anomaly Detection (NCAD), a framework for anomaly detection on time series that scales seamlessly from the unsupervised to the supervised setting, and is applicable to both univariate and multivariate time series. This is achieved by combining recent developments in representation learning for multivariate time series with techniques for deep anomaly detection originally developed for computer vision that we tailor to the time series setting. Our window-based approach facilitates learning the boundary between normal and anomalous classes by injecting generic synthetic anomalies into the available data. NCAD can effectively take advantage of domain knowledge and of any available training labels. We demonstrate empirically on standard benchmark datasets that our approach achieves state-of-the-art performance in the supervised, semi-supervised, and unsupervised settings.
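To make the idea of "injecting generic synthetic anomalies" concrete, here is a hedged sketch for a univariate window; the injection rate, spike magnitude, and function name are illustrative choices rather than the paper's recipe.

```python
import numpy as np

def inject_point_outliers(window, rate=0.05, scale=3.0, rng=None):
    """Add large spikes at random positions in a 1-D window and return the
    contaminated window together with a binary anomaly label."""
    rng = rng or np.random.default_rng()
    window = window.copy()
    mask = rng.random(window.shape[0]) < rate
    window[mask] += scale * window.std() * rng.choice([-1.0, 1.0], size=mask.sum())
    return window, int(mask.any())                 # label 1 if any anomaly was injected
```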
Graph Contrastive Learning (GCL) has proven highly effective in promoting the performance of Semi-Supervised Node Classification (SSNC). However, existing GCL methods are generally transferred from other fields like CV or NLP, and their underlying working mechanism remains underexplored. In this work, we first deeply probe the working mechanism of GCL in SSNC, and find that the promotion brought by GCL is severely unevenly distributed: the improvement mainly comes from subgraphs with less annotated information, which is fundamentally different from contrastive learning in other fields. However, existing GCL methods generally ignore this uneven distribution of annotated information and apply GCL evenly to the whole graph. To remedy this issue and further improve GCL in SSNC, we propose the Topology InFormation gain-Aware Graph Contrastive Learning (TIFA-GCL) framework, which considers the distribution of annotated information across the graph in GCL. Extensive experiments on six benchmark graph datasets, including the enormous OGB-Products graph, show that TIFA-GCL can bring a larger improvement than existing GCL methods in both transductive and inductive settings. Further experiments demonstrate the generalizability and interpretability of TIFA-GCL.
Mimicking the sampling mechanism of the primate fovea, a retina-inspired vision sensor named spiking camera has been developed, which has shown great potential for capturing high-speed dynamic scenes with a sampling rate of 40,000 Hz. Unlike conventional digital cameras, the spiking camera continuously captures photons and outputs asynchronous binary spikes with various inter-spike intervals to record dynamic scenes. However, how to reconstruct dynamic scenes from asynchronous spike streams remains challenging. In this work, we propose a novel pretext task to build a self-supervised reconstruction framework for spiking cameras. Specifically, we utilize the blind-spot network commonly used in self-supervised denoising tasks as our backbone, and perform self-supervised learning by constructing proper pseudo-labels. In addition, in view of the poor scalability and insufficient information utilization of the blind-spot network, we present a mutual learning framework to improve the overall performance of the network through mutual distillation between a non-blind-spot network and a blind-spot network. This also enables the network to bypass the constraints of the blind-spot network, allowing state-of-the-art modules to be used to further improve performance. The experimental results demonstrate that our methods clearly outperform previous unsupervised spiking camera reconstruction methods and achieve competitive results compared with supervised methods.
Despite their outstanding performance in a broad spectrum of real-world tasks, deep artificial neural networks are sensitive to input noises, particularly adversarial perturbations. On the contrary, human and animal brains are much less vulnerable. In contrast to the one-shot inference performed by most deep neural networks, the brain often solves decision-making with an evidence accumulation mechanism that may trade time for accuracy when facing noisy inputs. The mechanism is well described by the Drift-Diffusion Model (DDM). In the DDM, decision-making is modeled as a process in which noisy evidence is accumulated toward a threshold. Drawing inspiration from the DDM, we propose the Dropout-based Drift-Diffusion Model (DDDM), which combines test-phase dropout with the DDM to improve the robustness of arbitrary neural networks. The dropouts create temporally uncorrelated noise in the network that counters perturbations, while the evidence accumulation mechanism guarantees a reasonable decision accuracy. Neural networks enhanced with the DDDM tested in image, speech, and text classification tasks all significantly outperform their native counterparts, demonstrating the DDDM as a task-agnostic defense against adversarial attacks.
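A hedged sketch of the test-phase-dropout-plus-accumulation idea described above; the evidence definition (accumulated softmax probabilities), the threshold value, and the step cap are our illustrative assumptions.

```python
import torch

@torch.no_grad()
def dddm_predict(model, x, threshold=5.0, max_steps=50):
    """Keep dropout active at test time and accumulate per-class evidence
    across stochastic forward passes until every item crosses a threshold."""
    model.train()                  # keeps dropout layers stochastic during inference
    evidence = None
    for _ in range(max_steps):
        probs = torch.softmax(model(x), dim=-1)    # one noisy "sample" of evidence
        evidence = probs if evidence is None else evidence + probs
        if evidence.max(dim=-1).values.min() >= threshold:
            break                                  # all items reached the decision threshold
    model.eval()
    return evidence.argmax(dim=-1)
```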
In few-shot learning, methods are constrained by the scarce labeled data, resulting in suboptimal embeddings. Recent studies learn the embedding network from other large-scale labeled data; however, the trained network may produce distorted embeddings of the target data. We argue that a promising solution requires two ingredients, which we call Better Embedding and More Shots (BEMS), applied to the embeddings extracted from the embedding network. BE maximizes the extraction of general representations and prevents over-fitting; for this purpose, we introduce a topological relation for global reconstruction, avoiding excessive memorization. MS maximizes the relevance between the reconstructed embedding and the target class space; in this respect, increasing the number of shots is a pivotal but intractable strategy. As a creative alternative, we derive a bound on an information-theory-based loss function and implicitly achieve infinite shots at negligible cost. A substantial experimental analysis demonstrates state-of-the-art performance: compared to the baseline, our method improves accuracy by more than 10%. We also show that BEMS is suitable for both standard pre-trained and meta-learning embedding networks.
Federated learning (FL) enables edge-devices to collaboratively learn a model without disclosing their private data to a central aggregating server. Most existing FL algorithms require models of identical architecture to be deployed across the clients and server, making it infeasible to train large models due to clients' limited system resources. In this work, we propose a novel ensemble knowledge transfer method named Fed-ET in which small models (different in architecture) are trained on clients, and used to train a larger model at the server. Unlike in conventional ensemble learning, in FL the ensemble can be trained on clients' highly heterogeneous data. Cognizant of this property, Fed-ET uses a weighted consensus distillation scheme with diversity regularization that efficiently extracts reliable consensus from the ensemble while improving generalization by exploiting the diversity within the ensemble. We show a generalization bound for the ensemble of weighted models trained on heterogeneous datasets that supports the intuition behind Fed-ET. Our experiments on image and language tasks show that Fed-ET significantly outperforms other state-of-the-art FL algorithms with fewer communicated parameters, and is also robust against high data-heterogeneity.
Even though Generative Adversarial Networks (GANs) have shown a remarkable ability to generate high-quality images, they do not always guarantee photorealistic results. Occasionally, they generate images containing defective or unnatural objects, which are referred to as 'artifacts'. Research into why these artifacts emerge and how they can be detected and removed has yet to be carried out sufficiently. To analyze this, we first hypothesize that rarely activated neurons and frequently activated neurons have different purposes and responsibilities in the image generation process. In this study, by analyzing the statistics and roles of these neurons, we empirically show that rarely activated neurons are related to failures in producing diverse objects and to the induction of artifacts. In addition, we suggest a correction method, called 'Sequential Ablation', to repair the defective parts of generated images without high computational cost or manual effort.
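As a hedged sketch of the activation statistic this analysis relies on, the snippet below marks rarely activated generator units from collected activations; the frequency threshold is an illustrative value, and the Sequential Ablation repair procedure itself is not reproduced here.

```python
import torch

def rare_unit_mask(activations, freq_threshold=0.05):
    """Identify rarely activated units from activation statistics.

    activations : (N, C) feature activations collected over N generated samples
    Returns a boolean mask over the C channels marking rarely activated units.
    """
    freq = (activations > 0).float().mean(dim=0)   # activation frequency per unit
    return freq < freq_threshold
```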
Transformers have become the methods of choice in many applications thanks to their ability to represent complex interactions between elements. However, extending the Transformer architecture to non-sequential data such as molecules, and enabling its training on small datasets, remains a challenge. In this work, we introduce a Transformer-based architecture for molecule property prediction that is able to capture the geometry of the molecule. We modify the classical positional encoder with an initial encoding of the molecule geometry, as well as a learned gated self-attention mechanism. We further suggest an augmentation scheme for molecular data capable of avoiding the overfitting induced by the overparameterized architecture. The proposed framework outperforms state-of-the-art methods while being based solely on machine learning, i.e., the method does not incorporate domain knowledge from quantum chemistry and does not use extended geometric inputs beyond pairwise atomic distances.
We propose a new method for unsupervised generative continual learning through realignment of a Variational Autoencoder's latent space. Deep generative models suffer from catastrophic forgetting in the same way as other neural structures. Recent generative continual learning works approach this problem and try to learn from new data without forgetting previous knowledge. However, those methods usually focus on artificial scenarios where examples share almost no similarity between subsequent portions of data - an assumption that is not realistic in real-life applications of continual learning. In this work, we identify this limitation and posit the goal of generative continual learning as a knowledge accumulation task. We solve it by continuously aligning latent representations of new data, which we call bands, in an additional latent space where examples are encoded independently of their source task. In addition, we introduce a method for controlled forgetting of past data that simplifies this process. On top of the standard continual learning benchmarks, we propose a novel, challenging knowledge consolidation scenario and show that the proposed approach outperforms the state of the art by up to twofold across all experiments and an additional real-life evaluation. To our knowledge, Multiband VAE is the first method to show forward and backward knowledge transfer in generative continual learning.
Reinforcement learning (RL) is a powerful framework for learning complex behaviors, but lacks adoption in many settings due to sample size requirements. We introduce a framework for increasing the sample efficiency of RL algorithms. Our approach focuses on optimizing environment rewards with high-level instructions. These are modeled as a high-level controller over temporally extended actions known as options. These options can be looped, interleaved and partially ordered with a rich language for high-level instructions. Crucially, the instructions may be underspecified in the sense that following them does not guarantee high reward in the environment. We present an algorithm for control with these so-called option machines (OMs), discuss option selection for the partially ordered case and describe an algorithm for learning with OMs. We compare our approach in zero-shot, single-task and multi-task settings in an environment with fully specified and underspecified instructions. We find that OMs perform significantly better than, or comparably to, the state of the art in all environments and learning settings.
Long-range forecasts are the starting point of many decision support systems that need to draw inference from high-level aggregate patterns on forecasted values. State-of-the-art time-series forecasting methods are either subject to concept drift on long-horizon forecasts, or fail to accurately predict coherent and accurate high-level aggregates. In this work, we present a novel probabilistic forecasting method that produces forecasts that are coherent in terms of base-level and predicted aggregate statistics. We achieve this coherency between predicted base-level and aggregate statistics using a novel inference method based on KL divergence that can be solved efficiently in closed form. We show that our method improves forecast performance across both the base level and unseen aggregates post inference on real datasets spanning three diverse domains. (Project URL)
Neural ordinary differential equations (NODEs) -- parametrizations of differential equations using neural networks -- have shown tremendous promise in learning models of unknown continuous-time dynamical systems from data. However, every forward evaluation of a NODE requires numerical integration of the neural network used to capture the system dynamics, making their training prohibitively expensive. Existing works rely on off-the-shelf adaptive step-size numerical integration schemes, which often require an excessive number of evaluations of the underlying dynamics network to obtain sufficient accuracy for training. By contrast, we accelerate the evaluation and the training of NODEs by proposing a data-driven approach to their numerical integration. The proposed Taylor-Lagrange NODEs (TL-NODEs) use a fixed-order Taylor expansion for numerical integration, while also learning to estimate the expansion's approximation error. As a result, the proposed approach achieves the same accuracy as adaptive step-size schemes while employing only low-order Taylor expansions, thus greatly reducing the computational cost necessary to integrate the NODE. A suite of numerical experiments, including modeling dynamical systems, image classification, and density estimation, demonstrate that TL-NODEs can be trained more than an order of magnitude faster than state-of-the-art approaches, without any loss in performance.
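To illustrate the flavor of a fixed-order Taylor step with a learned remainder, here is a heavily hedged sketch: it uses only a first-order expansion, and the correction network (`remainder_net`, a name we introduce) stands in for the learned estimate of the truncation error; the actual TL-NODE construction and training objective may differ.

```python
import torch
import torch.nn as nn

class TaylorLagrangeStep(nn.Module):
    """One integration step: truncated Taylor expansion plus a learned remainder."""
    def __init__(self, dynamics, dim, hidden=64):
        super().__init__()
        self.dynamics = dynamics                    # dx/dt = f(x, t), a callable we assume
        self.remainder_net = nn.Sequential(         # learned Lagrange-remainder estimate
            nn.Linear(dim + 1, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, x, t, dt):
        f = self.dynamics(x, t)
        taylor = x + dt * f                         # first-order (truncated) Taylor step
        t_col = torch.full((x.shape[0], 1), float(t), dtype=x.dtype, device=x.device)
        correction = self.remainder_net(torch.cat([x, t_col], dim=-1))
        return taylor + (dt ** 2) * correction      # remainder scales with the step size squared
```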
This paper is concerned with contrastive learning (CL) for low-level image restoration and enhancement tasks. We propose a new label-efficient learning paradigm based on residuals, residual contrastive learning (RCL), and derive an unsupervised visual representation learning framework, suitable for low-level vision tasks with noisy inputs. While supervised image reconstruction aims to minimize residual terms directly, RCL alternatively builds a connection between residuals and CL by defining a novel instance discrimination pretext task, using residuals as the discriminative feature. Our formulation mitigates the severe task misalignment between instance discrimination pretext tasks and downstream image reconstruction tasks, present in existing CL frameworks. Experimentally, we find that RCL can learn robust and transferable representations that improve the performance of various downstream tasks, such as denoising and super resolution, in comparison with recent self-supervised methods designed specifically for noisy inputs. Additionally, our unsupervised pre-training can significantly reduce annotation costs whilst maintaining performance competitive with fully-supervised image reconstruction.
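Below is a hedged sketch of how residuals could serve as the discriminative feature in a standard InfoNCE instance-discrimination loss; pairing two noisy views of the same image, the MSE-free residual definition, and the temperature are our assumptions rather than the paper's exact pretext construction.

```python
import torch
import torch.nn.functional as F

def residual_infonce(encoder, denoiser, noisy_a, noisy_b, temperature=0.2):
    """InfoNCE over residuals: two residual views of the same instance form a
    positive pair; residuals of other instances in the batch are negatives."""
    res_a = noisy_a - denoiser(noisy_a)             # residual view 1
    res_b = noisy_b - denoiser(noisy_b)             # residual view 2 (positive pair)
    z_a = F.normalize(encoder(res_a), dim=-1)
    z_b = F.normalize(encoder(res_b), dim=-1)
    logits = z_a @ z_b.t() / temperature            # (N, N) similarity matrix
    targets = torch.arange(z_a.shape[0], device=z_a.device)
    return F.cross_entropy(logits, targets)         # diagonal entries are the positives
```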
Relation classification aims to identify semantic relations between two entities in a given text. While existing models perform well at classifying inverse relations with large datasets, their performance is significantly reduced in few-shot learning. In this paper, we propose a function-words adaptively enhanced attention framework (FAEA) for few-shot inverse relation classification, in which a hybrid attention model is designed to attend to class-related function words based on meta-learning. As the involvement of function words brings in significant intra-class redundancy, an adaptive message passing mechanism is introduced to capture and transfer inter-class differences. We mathematically analyze the negative impact of function words from the perspective of dot-product measurement, which explains why the message passing mechanism effectively reduces this impact. Our experimental results show that FAEA outperforms strong baselines; in particular, inverse relation accuracy is improved by 14.33% under the 1-shot setting on FewRel 1.0.
A network can effectively depict close relationships among its nodes, with labels in a taxonomy describing the nodes' rich attributes. Network embedding aims at learning a representation vector for each node and label to preserve their proximity, while most existing methods suffer from serious underfitting when dealing with datasets with dense node-label links. For instance, a node could have dozens of labels describing its diverse properties, leaving the single node vector overloaded and unable to fit all the labels. We propose HIerarchical Multi-vector Embedding (HIME), which solves the underfitting problem by adaptively learning multiple 'branch vectors' for each node to dynamically fit separate sets of labels in a hierarchy-aware embedding space. Moreover, a 'root vector' is learned for each node based on its branch vectors to better predict the sparse but valuable node-node links with the knowledge of its labels. Experiments reveal HIME's comprehensive advantages over existing methods on tasks such as proximity search, link prediction and hierarchical classification.
Existing unsupervised domain adaptation (UDA) studies focus on transferring knowledge in an offline manner. However, many tasks involve online requirements, especially in real-time systems. In this paper, we discuss Online UDA (OUDA), which assumes that the target samples arrive sequentially in small batches. OUDA tasks are challenging for prior UDA methods since online training suffers from catastrophic forgetting, which leads to poor generalization. Intuitively, a good memory is a crucial factor in the success of OUDA. We formalize this intuition theoretically with a generalization bound, where the OUDA target error can be bounded by the source error, the domain discrepancy distance, and a novel metric on forgetting in continuous online learning. Our theory illustrates the tradeoffs inherent in learning and remembering representations for OUDA. To minimize the proposed forgetting metric, we propose a novel source feature distillation (SFD) method which utilizes the source-only model as a teacher to guide the online training. In experiments, we adapt three UDA algorithms, i.e., DANN, CDAN, and MCC, and evaluate their performance on OUDA tasks with real-world datasets. By applying SFD, the performance of all baselines is significantly improved.
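A hedged sketch of one online update with a source-feature-distillation term: the frozen source-only model acts as a teacher whose features anchor the online model against forgetting. The attribute names (`backbone`), the MSE distance, the weight `lam`, and the `adapt_loss_fn` hook are our assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def sfd_step(online_model, source_model, x_target, adapt_loss_fn, lam=1.0):
    """One OUDA update: adaptation loss plus distillation toward the frozen
    source-only teacher's features on the current target batch."""
    feats_online = online_model.backbone(x_target)
    with torch.no_grad():
        feats_teacher = source_model.backbone(x_target)   # frozen teacher features
    distill = F.mse_loss(feats_online, feats_teacher)
    return adapt_loss_fn(online_model, x_target) + lam * distill
```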
Deep learning has recently achieved remarkable performance in image classification tasks, which depends heavily on massive annotation. However, the classification mechanism of existing deep learning models seems to contrast with humans' recognition mechanism. With only a glance at an image of an object, even of an unknown type, humans can quickly and precisely find other objects of the same category among massive numbers of images, a skill that benefits from the daily recognition of various objects. In this paper, we attempt to build a generalizable framework that emulates humans' recognition mechanism in the image classification task, hoping to improve the classification performance on unseen categories with the support of annotations of other categories. Specifically, we investigate a new task termed Comparison Knowledge Translation (CKT). Given a set of fully labeled categories, CKT aims to translate the comparison knowledge learned from the labeled categories to a set of novel categories. To this end, we put forward a Comparison Classification Translation Network (CCT-Net), which comprises a comparison classifier and a matching discriminator. The comparison classifier is devised to classify whether two images belong to the same category or not, while the matching discriminator works together in an adversarial manner to verify whether the classification results match the truth. Exhaustive experiments show that CCT-Net achieves surprising generalization ability on unseen categories and SOTA performance on target categories.
Over the past decades in the field of machine teaching, several restrictions have been introduced to avoid ‘cheating’, such as collusion-free or non-clashing teaching. However, these restrictions forbid several teaching situations that we intuitively consider natural and fair, especially those ‘changes of mind’ of the learner as more evidence is given, affecting the likelihood of concepts and ultimately their posteriors. Under a new generalised probabilistic teaching framework, not only do these non-cheating constraints look too narrow, but we also show that the most relevant machine teaching models are particular cases of this framework: the consistency graph between concepts and elements simply becomes a joint probability distribution. We show a simple procedure that builds the witness joint distribution from the ground joint distribution. We prove a chain of relations, also with a theoretical lower bound, on the teaching dimension of the old and new models. Overall, this new setting is more general than the traditional machine teaching models, while at the same time capturing, more intuitively, a less abrupt notion of non-cheating teaching.