IJCAI.2021 - Machine Learning

Total: 190

#1 The Surprising Power of Graph Neural Networks with Random Node Initialization

Authors: Ralph Abboud, İsmail İlkan Ceylan, Martin Grohe, Thomas Lukasiewicz

Graph neural networks (GNNs) are effective models for representation learning on relational data. However, standard GNNs are limited in their expressive power, as they cannot distinguish graphs beyond the capability of the Weisfeiler-Leman graph isomorphism heuristic. In order to break this expressiveness barrier, GNNs have been enhanced with random node initialization (RNI), where the idea is to train and run the models with randomized initial node features. In this work, we analyze the expressive power of GNNs with RNI, and prove that these models are universal, a first such result for GNNs not relying on computationally demanding higher-order properties. This universality result holds even with partially randomized initial node features, and preserves the invariance properties of GNNs in expectation. We then empirically analyze the effect of RNI on GNNs, based on carefully constructed datasets. Our empirical findings support the superior performance of GNNs with RNI over standard GNNs.
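
The RNI mechanism is easy to prototype. Below is a minimal sketch (assuming PyTorch and dense node-feature tensors; not the authors' code) of appending partially randomized initial node features, re-sampled on every forward pass:

```python
import torch

def with_random_node_init(x: torch.Tensor, num_random_dims: int = 8) -> torch.Tensor:
    """Append freshly sampled random features to every node's input features.

    x: [num_nodes, num_features] original (partially informative) node features.
    The random part is re-sampled on every forward pass, at training and test time,
    so invariance is preserved in expectation.
    """
    noise = torch.randn(x.size(0), num_random_dims, device=x.device, dtype=x.dtype)
    return torch.cat([x, noise], dim=-1)

# Hypothetical usage with any message-passing GNN:
#   logits = gnn(with_random_node_init(node_features), edge_index)
```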


#2 Likelihood-free Out-of-Distribution Detection with Invertible Generative Models

Authors: Amirhossein Ahmadian, Fredrik Lindsten

The likelihood of a generative model has traditionally been used as a score for detecting atypical (Out-of-Distribution, OOD) inputs. However, several recent studies have found this approach to be highly unreliable, even with invertible generative models, where computing the likelihood is feasible. In this paper, we present a different framework for generative model-based OOD detection that employs the model to construct a new representation space, instead of using it directly to compute typicality scores; we emphasize that the score function should be interpretable as the similarity between the input and the training data in the new space. In practice, with a focus on invertible models, we propose to extract low-dimensional features (statistics) based on the model encoder and the complexity of input images, and then use a One-Class SVM to score the data. Contrary to recently proposed OOD detection methods for generative models, our method does not require computing likelihood values. Consequently, it is much faster when using invertible models with iteratively approximated likelihood (e.g., iResNet), while its performance remains competitive with other related methods.
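
As a rough illustration of the scoring pipeline described above (placeholder features, not the authors' exact statistics), one could fit a One-Class SVM on low-dimensional statistics of training images and score new inputs by their similarity to the training data:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical low-dimensional statistics per image, e.g. derived from an invertible
# model's encoder plus an image-complexity estimate (random placeholder data here).
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(1000, 4))   # features of in-distribution training images
test_feats = rng.normal(size=(10, 4))      # features of images to be scored

detector = OneClassSVM(kernel="rbf", nu=0.1).fit(train_feats)
scores = detector.decision_function(test_feats)   # higher = more similar to training data
is_ood = scores < 0                               # negative score = flagged as OOD
```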


#3 Simulation of Electron-Proton Scattering Events by a Feature-Augmented and Transformed Generative Adversarial Network (FAT-GAN)

Authors: Yasir Alanazi, Nobuo Sato, Tianbo Liu, Wally Melnitchouk, Pawel Ambrozewicz, Florian Hauenstein, Michelle P. Kuchera, Evan Pritchard, Michael Robertson, Ryan Strauss, Luisa Velasco, Yaohang Li

We apply generative adversarial network (GAN) technology to build an event generator that simulates particle production in electron-proton scattering that is free of theoretical assumptions about underlying particle dynamics. The difficulty of efficiently training a GAN event simulator lies in learning the complicated patterns of the distributions of the particles' physical properties. We develop a GAN that selects a set of transformed features from particle momenta that can be generated easily by the generator, and uses these to produce a set of augmented features that improve the sensitivity of the discriminator. The new Feature-Augmented and Transformed GAN (FAT-GAN) is able to faithfully reproduce the distribution of final state electron momenta in inclusive electron scattering, without the need for input derived from domain-based theoretical assumptions. The developed technology can play a significant role in boosting the science of existing and future accelerator facilities, such as the Electron-Ion Collider.


#4 Deep Reinforcement Learning for Navigation in AAA Video Games

Authors: Eloi Alonso, Maxim Peter, David Goumard, Joshua Romoff

In video games, non-player characters (NPCs) are used to enhance the players' experience in a variety of ways, e.g., as enemies, allies, or innocent bystanders. A crucial component of NPCs is navigation, which allows them to move from one point to another on the map. The most popular approach for NPC navigation in the video game industry is to use a navigation mesh (NavMesh), which is a graph representation of the map, with nodes and edges indicating traversable areas. Unfortunately, complex navigation abilities that extend the character's capacity for movement, e.g., grappling hooks, jetpacks, teleportation, or double-jumps, increase the complexity of the NavMesh, making it intractable in many practical scenarios. Game designers are thus constrained to only add abilities that can be handled by a NavMesh. As an alternative to the NavMesh, we propose to use Deep Reinforcement Learning (Deep RL) to learn how to navigate 3D maps in video games using any navigation ability. We test our approach on complex 3D environments that are notably an order of magnitude larger than maps typically used in the Deep RL literature. One of these environments is from a recently released AAA video game called Hyper Scape. We find that our approach performs surprisingly well, achieving at least 90% success rate in a variety of scenarios using complex navigation abilities.


#5 Conditional Self-Supervised Learning for Few-Shot Classification

Authors: Yuexuan An, Hui Xue, Xingyu Zhao, Lu Zhang

How to learn a transferable feature representation from limited examples is a key challenge for few-shot classification. Self-supervision as an auxiliary task to the main supervised few-shot task is considered a promising way to address this problem, since self-supervision can provide additional structural information that the main task easily ignores. However, learning a good representation with traditional self-supervised methods usually depends on large numbers of training samples. In few-shot scenarios, due to the lack of sufficient samples, these self-supervised methods may learn a biased representation, which is more likely to mislead the main task and ultimately degrade performance. In this paper, we propose conditional self-supervised learning (CSS), which uses auxiliary information to guide the representation learning of self-supervised tasks. Specifically, CSS leverages supervised information as prior knowledge to shape and improve the feature manifold learned by self-supervision without auxiliary unlabeled data, so as to reduce representation bias and mine more effective semantic information. Moreover, CSS exploits more meaningful information through supervised learning and the improved self-supervised learning respectively, and integrates the information into a unified distribution, which further enriches and broadens the original representation. Extensive experiments demonstrate that our proposed method, without any fine-tuning, achieves a significant accuracy improvement in few-shot classification scenarios compared to state-of-the-art few-shot learning methods.


#6 DEHB: Evolutionary Hyperband for Scalable, Robust and Efficient Hyperparameter Optimization

Authors: Noor Awad, Neeratyoy Mallik, Frank Hutter

Modern machine learning algorithms crucially rely on several design decisions to achieve strong performance, making the problem of Hyperparameter Optimization (HPO) more important than ever. Here, we combine the advantages of the popular bandit-based HPO method Hyperband (HB) and the evolutionary search approach of Differential Evolution (DE) to yield a new HPO method which we call DEHB. Comprehensive results on a very broad range of HPO problems, as well as a wide range of tabular benchmarks from neural architecture search, demonstrate that DEHB achieves strong performance far more robustly than all previous HPO methods we are aware of, especially for high-dimensional problems with discrete input dimensions. For example, DEHB is up to 1000x faster than random search. It is also efficient in computational time, conceptually simple and easy to implement, positioning it well to become a new default HPO method.
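
To make the combination concrete, here is a heavily simplified sketch (not the authors' implementation) of differential-evolution mutation applied to configurations evaluated under a Hyperband-style schedule of growing budgets and shrinking populations:

```python
import numpy as np

def de_mutation(pop: np.ndarray, scale: float = 0.5) -> np.ndarray:
    """Simplified rand/1 DE mutation: combine three randomly chosen population members."""
    n = len(pop)
    idx = np.array([np.random.choice(n, 3, replace=False) for _ in range(n)])
    a, b, c = pop[idx[:, 0]], pop[idx[:, 1]], pop[idx[:, 2]]
    return np.clip(a + scale * (b - c), 0.0, 1.0)

def evaluate(config: np.ndarray, budget: float) -> float:
    """Placeholder objective: train the model defined by `config` for `budget` and return its loss."""
    return float(np.sum((config - 0.3) ** 2)) / budget

pop = np.random.rand(27, 5)                  # 27 random configs, 5 hyperparameters in [0, 1]
for budget, keep in [(1, 9), (3, 3), (9, 1)]:  # larger budgets, smaller populations
    trial = de_mutation(pop)
    union = np.vstack([pop, trial])
    losses = np.array([evaluate(c, budget) for c in union])
    pop = union[np.argsort(losses)[:keep]]   # survivors seed the next, larger budget
best_config = pop[0]
```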


#7 Verifying Reinforcement Learning up to Infinity

Authors: Edoardo Bacci, Mirco Giacobbe, David Parker

Formally verifying that reinforcement learning systems act safely is increasingly important, but existing methods only verify over finite time. This is of limited use for dynamical systems that run indefinitely. We introduce the first method for verifying the time-unbounded safety of neural networks controlling dynamical systems. We develop a novel abstract interpretation method which, by constructing adaptable template-based polyhedra using MILP and interval arithmetic, yields sound (safe and invariant) overapproximations of the reach set. This provides stronger safety guarantees than previous time-bounded methods and shows whether the agent has generalised beyond the length of its training episodes. Our method supports ReLU activation functions and systems with linear, piecewise linear and non-linear dynamics defined with polynomial and transcendental functions. We demonstrate its efficacy on a range of benchmark control problems.


#8 Robustly Learning Composable Options in Deep Reinforcement Learning

Authors: Akhil Bagaria, Jason Senthil, Matthew Slivinski, George Konidaris

Hierarchical reinforcement learning (HRL) is only effective for long-horizon problems when high-level skills can be reliably sequentially executed. Unfortunately, learning reliably composable skills is difficult, because all the components of every skill are constantly changing during learning. We propose three methods for improving the composability of learned skills: representing skill initiation regions using a combination of pessimistic and optimistic classifiers; learning re-targetable policies that are robust to non-stationary subgoal regions; and learning robust option policies using model-based RL. We test these improvements on four sparse-reward maze navigation tasks involving a simulated quadrupedal robot. Each method successively improves the robustness of a baseline skill discovery method, substantially outperforming state-of-the-art flat and hierarchical methods.


#9 Reconciling Rewards with Predictive State Representations

Authors: Andrea Baisero, Christopher Amato

Predictive state representations (PSRs) are models of controlled non-Markov observation sequences which exhibit the same generative process governing POMDP observations without relying on an underlying latent state. In that respect, a PSR is indistinguishable from the corresponding POMDP. However, PSRs notoriously ignore the notion of rewards, which undermines the general utility of PSR models for control, planning, or reinforcement learning. Therefore, we describe a sufficient and necessary accuracy condition which determines whether a PSR is able to accurately model POMDP rewards, we show that rewards can be approximated even when the accuracy condition is not satisfied, and we find that a non-trivial number of POMDPs taken from a well-known third-party repository do not satisfy the accuracy condition. We propose reward-predictive state representations (R-PSRs), a generalization of PSRs which accurately models both observations and rewards, and develop value iteration for R-PSRs. We show that there is a mismatch between optimal POMDP policies and the optimal PSR policies derived from approximate rewards. On the other hand, optimal R-PSR policies perfectly match optimal POMDP policies, reconfirming R-PSRs as accurate state-less generative models of observations and rewards.


#10 Optimal Algorithms for Range Searching over Multi-Armed Bandits

Authors: Siddharth Barman, Ramakrishnan Krishnamurthy, Saladi Rahul

This paper studies a multi-armed bandit (MAB) version of the range-searching problem. In its basic form, range searching considers as input a set of points (on the real line) and a collection of (real) intervals. Here, with each specified point, we have an associated weight, and the problem objective is to find a maximum-weight point within every given interval. The current work addresses range searching with stochastic weights: each point corresponds to an arm (that admits sample access) and the point's weight is the (unknown) mean of the underlying distribution. In this MAB setup, we develop sample-efficient algorithms that find, with high probability, near-optimal arms within the given intervals, i.e., we obtain PAC (probably approximately correct) guarantees. We also provide an algorithm for a generalization wherein the weight of each point is a multi-dimensional vector. The sample complexities of our algorithms depend, in particular, on the size of the optimal hitting set of the given intervals. Finally, we establish lower bounds proving that the obtained sample complexities are essentially tight. Our results highlight the significance of geometric constructs (specifically, hitting sets) in our MAB setting.


#11 Efficient Neural Network Verification via Layer-based Semidefinite Relaxations and Linear Cuts

Authors: Ben Batten, Panagiotis Kouvaros, Alessio Lomuscio, Yang Zheng

We introduce an efficient and tight layer-based semidefinite relaxation for verifying the local robustness of neural networks. The improved tightness results from combining semidefinite relaxations with linear cuts. We obtain a computationally efficient method by decomposing the semidefinite formulation into layerwise constraints. By leveraging chordal graph decompositions, we show that the presented formulation is provably tighter than current approaches. Experiments on a set of benchmark networks show that the proposed approach enables the verification of more instances compared to other relaxation methods. The results also demonstrate that the proposed SDP relaxation is one order of magnitude faster than previous SDP methods.


#12 Fast Pareto Optimization for Subset Selection with Dynamic Cost Constraints

Authors: Chao Bian, Chao Qian, Frank Neumann, Yang Yu

Subset selection with cost constraints is a fundamental problem with various applications such as influence maximization and sensor placement. The goal is to select a subset from a ground set to maximize a monotone objective function such that a monotone cost function is upper bounded by a budget. Previous algorithms with bounded approximation guarantees include the generalized greedy algorithm, POMC and EAMC, all of which can achieve the best known approximation guarantee. In real-world scenarios, the resources often vary, i.e., the budget often changes over time, requiring the algorithms to adapt the solutions quickly. However, when the budget changes dynamically, all these three algorithms either achieve arbitrarily bad approximation guarantees, or require a long running time. In this paper, we propose a new algorithm FPOMC by combining the merits of the generalized greedy algorithm and POMC. That is, FPOMC introduces a greedy selection strategy into POMC. We prove that FPOMC can maintain the best known approximation guarantee efficiently.


#13 Partial Multi-Label Optimal Margin Distribution Machine

Authors: Nan Cao, Teng Zhang, Hai Jin

Partial multi-label learning deals with the circumstance in which the ground-truth labels are not directly available but hidden in a candidate label set. Due to the presence of other irrelevant labels, vanilla multi-label learning methods are prone to be misled and fail to generalize well on unseen data; thus, enabling them to get rid of the noisy labels is the core problem of partial multi-label learning. In this paper, we propose the Partial Multi-Label Optimal margin Distribution Machine (PML-ODM), which distinguishes the noisy labels by explicitly optimizing the distribution of the ranking margin, and exhibits better generalization performance than minimum-margin-based counterparts. In addition, we propose a novel feature prototype representation to further enhance the disambiguation ability, and non-linear kernels can also be applied to promote generalization performance on linearly inseparable data. Extensive experiments on real-world data sets validate the superiority of our proposed method.


#14 Towards Understanding the Spectral Bias of Deep Learning

Authors: Yuan Cao, Zhiying Fang, Yue Wu, Ding-Xuan Zhou, Quanquan Gu

An intriguing phenomenon observed during the training of neural networks is spectral bias, which states that neural networks are biased towards learning less complex functions. The priority of learning functions with low complexity might be at the core of explaining the generalization ability of neural networks, and certain efforts have been made to provide a theoretical explanation for spectral bias. However, there is still no satisfying theoretical result justifying the underlying mechanism of spectral bias. In this paper, we give a comprehensive and rigorous explanation for spectral bias and relate it to the neural tangent kernel function proposed in recent work. We prove that the training process of neural networks can be decomposed along different directions defined by the eigenfunctions of the neural tangent kernel, where each direction has its own convergence rate determined by the corresponding eigenvalue. We then provide a case study when the input data is uniformly distributed over the unit sphere, and show that lower-degree spherical harmonics are easier for over-parameterized neural networks to learn. Finally, we provide numerical experiments to demonstrate the correctness of our theory. Our experimental results also show that our theory can tolerate certain model misspecification in terms of the input data distribution.
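
For intuition only (notation assumed here, not taken from the paper): in the neural tangent kernel regime, if the kernel Gram matrix on the training inputs has eigenpairs $(\lambda_i, \mathbf{v}_i)$ and gradient descent uses step size $\eta$, the training residual shrinks along each eigendirection roughly as

$$\mathbf{v}_i^\top (f_t - y) \;\approx\; (1 - \eta \lambda_i)^t \, \mathbf{v}_i^\top (f_0 - y),$$

so components aligned with large-eigenvalue (low-complexity) directions are fitted first, which is the spectral-bias behavior this paper formalizes.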


#15 Thompson Sampling for Bandits with Clustered Arms

Authors: Emil Carlsson, Devdatt Dubhashi, Fredrik D. Johansson

We propose algorithms based on a multi-level Thompson sampling scheme, for the stochastic multi-armed bandit and its contextual variant with linear expected rewards, in the setting where arms are clustered. We show, both theoretically and empirically, how exploiting a given cluster structure can significantly improve the regret and computational cost compared to using standard Thompson sampling. In the case of the stochastic multi-armed bandit we give upper bounds on the expected cumulative regret showing how it depends on the quality of the clustering. Finally, we perform an empirical evaluation showing that our algorithms perform well compared to previously proposed algorithms for bandits with clustered arms.
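
A minimal two-level sketch of the idea (Gaussian rewards with known unit variance assumed; not the authors' exact algorithm): sample within each cluster to pick a cluster, then run Thompson sampling restricted to that cluster.

```python
import numpy as np

def clustered_thompson_step(mu, n, clusters, rng):
    """One round of a two-level Thompson sampling scheme.

    mu, n:     per-arm posterior means and pull counts (arrays of length num_arms).
    clusters:  list of integer index arrays, one per cluster of arms.
    Returns the index of the arm to pull this round.
    """
    # Level 1: for each cluster, draw posterior samples and keep its best-looking arm.
    cluster_draws = []
    for arms in clusters:
        draws = rng.normal(mu[arms], 1.0 / np.sqrt(n[arms] + 1))
        cluster_draws.append(draws.max())
    chosen = int(np.argmax(cluster_draws))
    # Level 2: Thompson sampling among the arms of the chosen cluster only.
    arms = clusters[chosen]
    draws = rng.normal(mu[arms], 1.0 / np.sqrt(n[arms] + 1))
    return int(arms[np.argmax(draws)])
```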


#16 Reinforcement Learning for Sparse-Reward Object-Interaction Tasks in a First-person Simulated 3D Environment

Authors: Wilka Carvalho, Anthony Liang, Kimin Lee, Sungryull Sohn, Honglak Lee, Richard Lewis, Satinder Singh

Learning how to execute complex tasks involving multiple objects in a 3D world is challenging when there is no ground-truth information about the objects or any demonstration to learn from. When an agent only receives a signal from task-completion, this makes it challenging to learn the object-representations which support learning the correct object-interactions needed to complete the task. In this work, we formulate learning an attentive object dynamics model as a classification problem, using random object-images to define incorrect labels for our object-dynamics model. We show empirically that this enables object-representation learning that captures an object's category (is it a toaster?), its properties (is it on?), and object-relations (is something inside of it?). With this, our core learner (a relational RL agent) receives the dense training signal it needs to rapidly learn object-interaction tasks. We demonstrate results in the 3D AI2Thor simulated kitchen environment with a range of challenging food preparation tasks. We compare our method's performance to several related approaches and against the performance of an oracle: an agent that is supplied with ground-truth information about objects in the scene. We find that our agent achieves performance closest to the oracle in terms of both learning speed and maximum success rate.


#17 Generative Adversarial Neural Architecture Search

Authors: Seyed Saeed Changiz Rezaei, Fred X. Han, Di Niu, Mohammad Salameh, Keith Mills, Shuo Lian, Wei Lu, Shangling Jui

Despite the empirical success of neural architecture search (NAS) in deep learning applications, the optimality, reproducibility and cost of NAS schemes remain hard to assess. In this paper, we propose Generative Adversarial NAS (GA-NAS) with theoretically provable convergence guarantees, promoting stability and reproducibility in neural architecture search. Inspired by importance sampling, GA-NAS iteratively fits a generator to previously discovered top architectures, thus increasingly focusing on important parts of a large search space. Furthermore, we propose an efficient adversarial learning approach, where the generator is trained by reinforcement learning based on rewards provided by a discriminator, thus being able to explore the search space without evaluating a large number of architectures. Extensive experiments show that GA-NAS beats the best published results in several cases on three public NAS benchmarks. Moreover, GA-NAS can handle ad-hoc search constraints and search spaces. We show that GA-NAS can be used to improve already optimized baselines found by other NAS methods, including EfficientNet and ProxylessNAS, in terms of ImageNet accuracy or the number of parameters, in their original search spaces.


#18 AMA-GCN: Adaptive Multi-layer Aggregation Graph Convolutional Network for Disease Prediction

Authors: Hao Chen, Fuzhen Zhuang, Li Xiao, Ling Ma, Haiyan Liu, Ruifang Zhang, Huiqin Jiang, Qing He

Recently, Graph Convolutional Networks (GCNs) have proven to be a powerful means for Computer Aided Diagnosis (CADx). This approach requires building a population graph to aggregate structural information, where the graph adjacency matrix represents the relationships between nodes. Until now, this adjacency matrix has usually been defined manually based on phenotypic information. In this paper, we propose an encoder that automatically selects the appropriate phenotypic measures according to their spatial distribution, and uses a text-similarity awareness mechanism to calculate the edge weights between nodes. The encoder can automatically construct the population graph using phenotypic measures which have a positive impact on the final results, and further realizes the fusion of multimodal information. In addition, a novel graph convolutional network architecture using a multi-layer aggregation mechanism is proposed. The structure can obtain deep structural information while suppressing over-smoothing, and increases the similarity between nodes of the same type. Experimental results on two databases show that our method can significantly improve the diagnostic accuracy for Autism spectrum disorder and breast cancer, indicating its universality in leveraging multimodal data for disease prediction.


#19 Learning Attributed Graph Representation with Communicative Message Passing Transformer

Authors: Jianwen Chen, Shuangjia Zheng, Ying Song, Jiahua Rao, Yuedong Yang

Constructing appropriate representations of molecules lies at the core of numerous tasks such as material science, chemistry, and drug design. Recent research abstracts molecules as attributed graphs and employs graph neural networks (GNNs) for molecular representation learning, which has made remarkable achievements in molecular graph modeling. Albeit powerful, current models either are based on local aggregation operations and thus miss higher-order graph properties, or focus only on node information without fully using the edge information. To this end, we propose a Communicative Message Passing Transformer (CoMPT) neural network to improve molecular graph representation by reinforcing message interactions between nodes and edges based on the Transformer architecture. Unlike previous transformer-style GNNs that treat a molecule as a fully connected graph, we introduce a message diffusion mechanism to leverage the graph connectivity inductive bias and reduce message enrichment explosion. Extensive experiments demonstrate that the proposed model obtains superior performance (around 4% improvement on average) against state-of-the-art baselines on seven chemical property datasets (graph-level tasks) and two chemical shift datasets (node-level tasks). Further visualization studies also indicate a better representation capacity achieved by our model.


#20 Understanding Structural Vulnerability in Graph Convolutional Networks

Authors: Liang Chen, Jintang Li, Qibiao Peng, Yang Liu, Zibin Zheng, Carl Yang

Recent studies have shown that Graph Convolutional Networks (GCNs) are vulnerable to adversarial attacks on the graph structure. Although multiple works have been proposed to improve their robustness against such structural adversarial attacks, the reasons for the success of the attacks remain unclear. In this work, we theoretically and empirically demonstrate that structural adversarial examples can be attributed to the non-robust aggregation scheme (i.e., the weighted mean) of GCNs. Specifically, our analysis takes advantage of the breakdown point which can quantitatively measure the robustness of aggregation schemes. The key insight is that weighted mean, as the basic design of GCNs, has a low breakdown point and its output can be dramatically changed by injecting a single edge. We show that adopting the aggregation scheme with a high breakdown point (e.g., median or trimmed mean) could significantly enhance the robustness of GCNs against structural attacks. Extensive experiments on four real-world datasets demonstrate that such a simple but effective method achieves the best robustness performance compared to state-of-the-art models.
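
The robust-aggregation idea is easy to state in code. Below is a hedged sketch (dense adjacency and PyTorch assumed; not the paper's implementation) that replaces the weighted mean with an elementwise median over each node's neighborhood:

```python
import torch

def median_aggregate(x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
    """For each node, take the elementwise median of its neighbors' features.

    x:   [num_nodes, num_features] node features.
    adj: [num_nodes, num_nodes] binary adjacency matrix (1 where an edge exists).
    A single injected edge shifts the median far less than it shifts a weighted mean,
    which is the high-breakdown-point property discussed above.
    """
    out = torch.empty_like(x)
    for i in range(adj.size(0)):
        neighbors = adj[i].nonzero(as_tuple=True)[0]
        feats = torch.cat([x[neighbors], x[i : i + 1]], dim=0)  # include a self-loop
        out[i] = feats.median(dim=0).values
    return out
```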


#21 Monte Carlo Filtering Objectives

Authors: Shuangshuang Chen, Sihao Ding, Yiannis Karayiannidis, Mårten Björkman

Learning generative models and inferring latent trajectories have proven challenging for time series due to the intractable marginal likelihoods of flexible generative models. This can be addressed with surrogate objectives for optimization. We propose Monte Carlo filtering objectives (MCFOs), a family of variational objectives for jointly learning parametric generative models and amortized adaptive importance proposals for time series. MCFOs extend the choices of likelihood estimators beyond Sequential Monte Carlo in state-of-the-art objectives, possess important properties revealing the factors governing the tightness of objectives, and allow for gradient estimates with lower bias and variance. We demonstrate that the proposed MCFOs and gradient estimations lead to efficient and stable model learning: the learned generative models explain the data well, and the importance proposals are more sample-efficient on various kinds of time series data.


#22 Dependent Multi-Task Learning with Causal Intervention for Image Captioning

Authors: Wenqing Chen, Jidong Tian, Caoyun Fan, Hao He, Yaohui Jin

Recent work on image captioning has mainly followed an extract-then-generate paradigm, pre-extracting a sequence of object-based features and then formulating image captioning as a single sequence-to-sequence task. Although promising, we observed two problems in the generated captions: 1) content inconsistency, where models generate contradicting facts; 2) lack of informativeness, where models miss parts of important information. From a causal perspective, the reason is that models have captured spurious statistical correlations between visual features and certain expressions (e.g., visual features of "long hair" and "woman"). In this paper, we propose a dependent multi-task learning framework with causal intervention (DMTCI). Firstly, we involve an intermediate task, bag-of-categories generation, before the final task, image captioning. The intermediate task helps the model better understand the visual features and thus alleviates the content inconsistency problem. Secondly, we apply Pearl's do-calculus on the model, cutting off the link between the visual features and possible confounders and thus letting models focus on the causal visual features. Specifically, the high-frequency concept set is considered as the proxy confounders, where the real confounders are inferred in the continuous space. Finally, we use a multi-agent reinforcement learning (MARL) strategy to enable end-to-end training and reduce inter-task error accumulation. Extensive experiments show that our model outperforms the baseline models and achieves competitive performance with state-of-the-art models.


#23 Few-Shot Learning with Part Discovery and Augmentation from Unlabeled Images

Authors: Wentao Chen, Chenyang Si, Wei Wang, Liang Wang, Zilei Wang, Tieniu Tan

Few-shot learning is a challenging task since only a few instances are given for recognizing an unseen class. One way to alleviate this problem is to acquire a strong inductive bias via meta-learning on similar tasks. In this paper, we show that such an inductive bias can be learned from a flat collection of unlabeled images, and instantiated as transferable representations among seen and unseen classes. Specifically, we propose a novel part-based self-supervised representation learning scheme to learn transferable representations by maximizing the similarity of an image to its discriminative part. To mitigate the overfitting in few-shot classification caused by data scarcity, we further propose a part augmentation strategy by retrieving extra images from a base dataset. We conduct systematic studies on the miniImageNet and tieredImageNet benchmarks. Remarkably, our method yields impressive results, outperforming the previous best unsupervised methods by 7.74% and 9.24% under the 5-way 1-shot and 5-way 5-shot settings, which is comparable with state-of-the-art supervised methods.


#24 On Self-Distilling Graph Neural Network

Authors: Yuzhao Chen, Yatao Bian, Xi Xiao, Yu Rong, Tingyang Xu, Junzhou Huang

Recently, the teacher-student knowledge distillation framework has demonstrated its potential in training Graph Neural Networks (GNNs). However, due to the difficulty of training over-parameterized GNN models, one may not easily obtain a satisfactory teacher model for distillation. Furthermore, the inefficient training process of teacher-student knowledge distillation also impedes its application to GNN models. In this paper, we propose the first teacher-free knowledge distillation method for GNNs, termed GNN Self-Distillation (GNN-SD), which serves as a drop-in replacement for the standard training process. The method is built upon the proposed neighborhood discrepancy rate (NDR), which quantifies the non-smoothness of the embedded graph in an efficient way. Based on this metric, we propose the adaptive discrepancy retaining (ADR) regularizer to empower the transferability of knowledge that maintains a high neighborhood discrepancy across GNN layers. We also summarize a generic GNN-SD framework that could be exploited to induce other distillation strategies. Experiments further prove the effectiveness and generalization of our approach, as it brings: 1) state-of-the-art GNN distillation performance with less training cost, and 2) consistent and considerable performance enhancement for various popular backbones.


#25 Time-Aware Multi-Scale RNNs for Time Series Modeling

Authors: Zipeng Chen, Qianli Ma, Zhenxi Lin

Multi-scale information is crucial for modeling time series. Although most existing methods consider multiple scales in time-series data, they assume that all scales are equally important for each sample, making them unable to capture the dynamic temporal patterns of time series. To this end, we propose Time-Aware Multi-Scale Recurrent Neural Networks (TAMS-RNNs), which disentangle representations of different scales and adaptively select the most important scale for each sample at each time step. First, the hidden state of the RNN is disentangled into multiple independently updated small hidden states, which use different update frequencies to model multi-scale information in the time series. Then, at each time step, the temporal context information is used to modulate the features of different scales, selecting the most important time-series scale. Therefore, the proposed model can adaptively capture the multi-scale information of each time series at each time step. Extensive experiments demonstrate that the model outperforms state-of-the-art methods on multivariate time series classification and human motion prediction tasks. Furthermore, visualized analysis on music genre recognition verifies the effectiveness of the model.
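
A rough sketch of the multi-rate idea (GRU cells with fixed update periods assumed, without the paper's learned time-aware modulation or adaptive scale selection): several small hidden states, each updated at its own frequency.

```python
import torch
import torch.nn as nn

class MultiScaleRNN(nn.Module):
    """Several small GRU cells; the k-th cell is updated only every periods[k] steps."""

    def __init__(self, input_size: int, hidden_size: int = 16, periods=(1, 2, 4)):
        super().__init__()
        self.periods = periods
        self.cells = nn.ModuleList(nn.GRUCell(input_size, hidden_size) for _ in periods)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, time, input_size]; returns the concatenated final states of all scales.
        states = [x.new_zeros(x.size(0), cell.hidden_size) for cell in self.cells]
        for t in range(x.size(1)):
            for k, (cell, period) in enumerate(zip(self.cells, self.periods)):
                if t % period == 0:            # slower scales skip most time steps
                    states[k] = cell(x[:, t], states[k])
        return torch.cat(states, dim=-1)
```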