| Total: 181

In many applications that involve processing high-dimensional data, it is important to identify a small set of entities that account for a significant fraction of detections. Rather than formalize this as a clustering problem, in which all detections must be grouped into hard or soft categories, we formalize it as an instance of the frequent items or heavy hitters problem, which finds groups of tightly clustered objects that have a high density in the feature space. We show that the heavy hitters formulation generates solutions that are more accurate and effective than the clustering formulation. In addition, we present a novel online algorithm for heavy hitters, called HAC, which addresses problems in continuous space, and demonstrate its effectiveness on real video and household domains.

The latent feature relational model (LFRM) for graphs represents each node as having binary memberships in one or more communities. The community memberships can be represented in form of a binary vector and LFRM defines the link probability between any pair of nodes as a bilinear function of their community membership vectors. Moreover, using nonparametric Bayesian prior - Indian Buffet Process - on the community membership matrix enables learning the number of communities automatically from the data. However, despite its modeling flexibility, strong link predictive performance, and nice interpretability of binary embeddings, inference in LFRM remains a challenge and is typically done via MCMC or variational methods. These methods can be slow and may take a long time to converge. In this work, we apply the small variance asymptotics idea to the non-parametric Bayesian LFRM, utilizing the connection between exponential families and Bregman divergence. This leads to an overlapping k-means like objective function for the nonparametric Bayesian LFRM, which can be optimized using generic or specialized solvers. We also propose an iterative greedy algorithm to optimize the objective function and compare our approach with other inference methods on several benchmark datasets. Our results demonstrate that our inference algorithm is competitive to methods such as MCMC while being much faster.

Convolutional Neural Network (CNN) achieved satisfying performance in click-through rate (CTR) prediction in recent studies. Since features used in CTR prediction have no meaningful sequence in nature, the features can be arranged in any order. As CNN learns the local information of a sample, the feature sequence may influence its performance significantly. However, this problem has not been fully investigated. This paper firstly investigates whether and how the feature sequence affects the performance of the CNN-based CTR prediction method. As the data distribution of CTR prediction changes with time, the best current sequence may not be suitable for future data. Two multi-sequence models are proposed to learn the information provided by different sequences. The first model learns all sequences using a single feature learning module, while each sequence is learnt individually by a feature learning module in the second one. Moreover, a method of generating a set of embedding sequences which aims to consider the combined influence of all feature pairs on feature learning is also introduced. The experiments are conducted to demonstrate the effectiveness and stability of our proposed models in the offline and online environment on both the benchmark Avazu dataset and a real commercial dataset.

Deep neural networks have witnessed great successes in various real applications, but it requires a large number of labeled data for training. In this paper, we propose tri-net, a deep neural network which is able to use massive unlabeled data to help learning with limited labeled data. We consider model initialization, diversity augmentation and pseudo-label editing simultaneously. In our work, we utilize output smearing to initialize modules, use fine-tuning on labeled data to augment diversity and eliminate unstable pseudo-labels to alleviate the influence of suspicious pseudo-labeled data. Experiments show that our method achieves the best performance in comparison with state-of-the-art semi-supervised deep learning methods. In particular, it achieves 8.30% error rate on CIFAR-10 by using only 4000 labeled examples.

In the past decades, intensive efforts have been put to design various loss functions and metric forms for metric learning problem. These improvements have shown promising results when the test data is similar to the training data. However, the trained models often fail to produce reliable distances on the ambiguous test pairs due to the different samplings between training set and test set. To address this problem, the Adversarial Metric Learning (AML) is proposed in this paper, which automatically generates adversarial pairs to remedy the sampling bias and facilitate robust metric learning. Specifically, AML consists of two adversarial stages, i.e. confusion and distinguishment. In confusion stage, the ambiguous but critical adversarial data pairs are adaptively generated to mislead the learned metric. In distinguishment stage, a metric is exhaustively learned to try its best to distinguish both adversarial pairs and original training pairs. Thanks to the challenges posed by the confusion stage in such competing process, the AML model is able to grasp plentiful difficult knowledge that has not been contained by the original training pairs, so the discriminability of AML can be significantly improved. The entire model is formulated into optimization framework, of which the global convergence is theoretically proved. The experimental results on toy data and practical datasets clearly demonstrate the superiority of AML to representative state-of-the-art metric learning models.

Distributed primal-dual optimization has received many focuses in the past few years. In this framework, training samples are stored in multiple machines. At each round, all the machines conduct a sequence of updates based on their local data, and then the local updates are synchronized and merged to obtain the update to the global model. All the previous approaches merge the local updates by averaging all of them with a uniform weight. However, in many real world applications data are not uniformly distributed on each machine, so the uniform weight is inadequate to capture the heterogeneity of local updates. To resolve this issue, we propose a better way to merge local updates in the primal-dual optimization framework. Instead of using a single weight for all the local updates, we develop a computational efficient algorithm to automatically choose the optimal weights for each machine. Furthermore, we propose an efficient way to estimate the duality gap of the merged update by exploiting the structure of the objective function, and this leads to an efficient line search algorithm based on the reduction of duality gap. Combining these two ideas, our algorithm is much faster and more scalable than existing methods on real world problems.

Frank-Wolfe methods (FW) have gained significant interest in the machine learning community due to their ability to efficiently solve large problems that admit a sparse structure (e.g. sparse vectors and low-rank matrices). However the performance of the existing FW method hinges on the quality of the linear approximation. This typically restricts FW to smooth functions for which the approximation quality, indicated by a global curvature measure, is reasonably good. In this paper, we propose a modified FW algorithm amenable to nonsmooth functions, subject to a separability assumption, by optimizing for approximation quality over all affine functions, given a neighborhood of interest. We analyze theoretical properties of the proposed algorithm and demonstrate that it overcomes many issues associated with existing methods in the context of nonsmooth low-rank matrix estimation.

Causal inference in time series is an important problem in many fields. Traditional methods use regression models for this problem. The inference accuracies of these methods depend greatly on whether or not the model can be well fitted to the data, and therefore we are required to select an appropriate regression model, which is difficult in practice. This paper proposes a supervised learning framework that utilizes a classifier instead of regression models. We present a feature representation that employs the distance between the conditional distributions given past variable values and show experimentally that the feature representation provides sufficiently different feature vectors for time series with different causal relationships. Furthermore, we extend our framework to multivariate time series and present experimental results where our method outperformed the model-based methods and the supervised learning method for i.i.d. data.

We propose a novel method to merge convolutional neural-nets for the inference stage. Given two well-trained networks that may have different architectures that handle different tasks, our method aligns the layers of the original networks and merges them into a unified model by sharing the representative codes of weights. The shared weights are further re-trained to fine-tune the performance of the merged model. The proposed method effectively produces a compact model that may run original tasks simultaneously on resource-limited devices. As it preserves the general architectures and leverages the co-used weights of well-trained networks, a substantial training overhead can be reduced to shorten the system development time. Experimental results demonstrate a satisfactory performance and validate the effectiveness of the method.

It has been observed that a particular form of analogical inference, based on analogical proportions, yields competitive results in classification tasks. Using the algebraic normal form of Boolean functions, it has been shown that analogical prediction is always exact iff the labeling function is affine. We point out that affine functions are also meaningful when using another view of analogy. We address the accuracy of analogical inference for arbitrary Boolean functions and show that if a function is epsilon-close to an affine function, then the probability of making a wrong prediction is upper bounded by 4 epsilon. This result is confirmed by an empirical study showing that the upper bound is tight. It highlights the specificity of analogical inference, also characterized in terms of the Hamming distance.

In this paper, we investigate the research problem of unsupervised multi-view feature selection. Conventional solutions first simply combine multiple pre-constructed view-specific similarity structures into a collaborative similarity structure, and then perform the subsequent feature selection. These two processes are separate and independent. The collaborative similarity structure remains fixed during feature selection. Further, the simple undirected view combination may adversely reduce the reliability of the ultimate similarity structure for feature selection, as the view-specific similarity structures generally involve noises and outlying entries. To alleviate these problems, we propose an adaptive collaborative similarity learning (ACSL) for multi-view feature selection. We propose to dynamically learn the collaborative similarity structure, and further integrate it with the ultimate feature selection into a unified framework. Moreover, a reasonable rank constraint is devised to adaptively learn an ideal collaborative similarity structure with proper similarity combination weights and desirable neighbor assignment, both of which could positively facilitate the feature selection. An effective solution guaranteed with the proved convergence is derived to iteratively tackle the formulated optimization problem. Experiments demonstrate the superiority of the proposed approach.

We present a novel framework for augmenting data sets for machine learning based on counterexamples. Counterexamples are misclassified examples that have important properties for retraining and improving the model. Key components of our framework include a \textit{counterexample generator}, which produces data items that are misclassified by the model and error tables, a novel data structure that stores information pertaining to misclassifications. Error tables can be used to explain the model's vulnerabilities and are used to efficiently generate counterexamples for augmentation. We show the efficacy of the proposed framework by comparing it to classical augmentation techniques on a case study of object detection in autonomous driving based on deep neural networks.

Network embedding is a method of learning a low-dimensional vector representation of network vertices under the condition of preserving different types of network properties. Previous studies mainly focus on preserving structural information of vertices at a particular scale, like neighbor information or community information, but cannot preserve the hierarchical community structure, which would enable the network to be easily analyzed at various scales. Inspired by the hierarchical structure of galaxies, we propose the Galaxy Network Embedding (GNE) model, which formulates an optimization problem with spherical constraints to describe the hierarchical community structure preserving network embedding. More specifically, we present an approach of embedding communities into a low dimensional spherical surface, the center of which represents the parent community they belong to. Our experiments reveal that the representations from GNE preserve the hierarchical community structure and show advantages in several applications such as vertex multi-class classification and network visualization. The source code of GNE is available online.

Network embedding, as an approach to learn low-dimensional representations of vertices, has been proved extremely useful in many applications. Lots of state-of-the-art network embedding methods based on Skip-gram framework are efficient and effective. However, these methods mainly focus on the static network embedding and cannot naturally generalize to the dynamic environment. In this paper, we propose a stable dynamic embedding framework with high efficiency. It is an extension for the Skip-gram based network embedding methods, which can keep the optimality of the objective in the Skip-gram based methods in theory. Our model can not only generalize to the new vertex representation, but also update the most affected original vertex representations during the evolvement of the network. Multi-class classification on three real-world networks demonstrates that, our model can update the vertex representations efficiently and achieve the performance of retraining simultaneously. Besides, the visualization experimental result illustrates that, our model is capable of avoiding the embedding space drifting.

It is NP-complete to find non-negative factors W and H with fixed rank r from a non-negative matrix X by minimizing ||X-WH^Τ ||^2. Although the separability assumption (all data points are in the conical hull of the extreme rows) enables polynomial-time algorithms, the computational cost is not affordable for big data. This paper investigates how the power of quantum computation can be capitalized to solve the non-negative matrix factorization with the separability assumption (SNMF) by devising a quantum algorithm based on the divide-and-conquer anchoring (DCA) scheme [Zhou et al., 2013]. The design of quantum DCA (QDCA) is challenging. In the divide step, the random projections in DCA is completed by a quantum algorithm for linear operations, which achieves the exponential speedup. We then devise a heuristic post-selection procedure which extracts the information of anchors stored in the quantum states efficiently. Under a plausible assumption, QDCA performs efficiently, achieves the quantum speedup, and is beneficial for high dimensional problems.

Class imbalance refers to the scenario where certain classes are highly under-represented compared to other classes in terms of the availability of training data. This situation hinders the applicability of conventional machine learning algorithms to most of the classification problems where class imbalance is prominent. Most existing methods addressing class imbalance either rely on sampling techniques or cost-sensitive learning methods; thus inheriting their shortcomings. In this paper, we introduce a novel approach that is different from sampling or cost-sensitive learning based techniques, to address the class imbalance problem, where two samples are simultaneously considered to train the classifier. Further, we propose a mechanism to use a single base classifier, instead of an ensemble of classifiers, to obtain the output label of the test sample using majority voting method. Experimental results on several benchmark datasets clearly indicate the usefulness of the proposed approach over the existing state-of-the-art techniques.

In partial label learning, each training example is assigned a set of candidate labels, only one of which is the ground-truth label. Existing partial label learning frameworks either assume each candidate label of equal confidence or consider the ground-truth label as a latent variable hidden in the indiscriminate candidate label set, while the different labeling confidence levels of the candidate labels are regrettably ignored. In this paper, we formalize the different labeling confidence levels as the latent label distributions, and propose a novel unified framework to estimate the latent label distributions while training the model simultaneously. Specifically, we present a biconvex formulation with constrained local consistency and adopt an alternating method to solve this optimization problem. The process of alternating optimization exactly facilitates the mutual adaption of the model training and the constrained label propagation. Extensive experimental results on controlled UCI datasets as well as real-world datasets clearly show the effectiveness of the proposed approach.

Building multiple hash tables has been proven a successful technique for indexing massive databases, which can guarantee a desired level of overall performance. However, existing hash based multi-indexing methods suffer from the heavy redundancy, without strong table complementarity and effective hash code learning. To address the problems, this paper proposes a complementary binary quantization (CBQ) method to jointly learning multiple hash tables. It exploits the power of incomplete binary coding based on prototypes to align the original space and the Hamming space, and further utilizes the nature of multi-indexing search to jointly reduce the quantization loss based on the prototype based hash function. Our alternating optimization adaptively discovers the complementary prototype sets and the corresponding code sets of a varying size in an efficient way, which together robustly approximate the data relations. Our method can be naturally generalized to the product space for long hash codes. Extensive experiments carried out on two popular large-scale tasks including Euclidean and semantic nearest neighbor search demonstrate that the proposed CBQ method enjoys the strong table complementarity and significantly outperforms the state-of-the-art, with up to 57.76\% performance gains relatively.

Generative Moment-Matching Network (GMMN) is a deep generative model, which employs maximum mean discrepancy as the objective to learn model parameters. However, this model can only generate samples, failing to infer the latent code from samples for downstream tasks. In this paper, we propose a novel Joint Generative Moment-Matching Network (JGMMN), which learns the structural latent code for unsupervised inference. Specifically, JGMMN has a generation network for the generation task and an inference network for the inference task. We first reformulate this model as the two joint distributions matching problem. To solve this problem, we propose to use the Joint Maximum Mean Discrepancy (JMMD) as the objective to learn these two networks simultaneously. Furthermore, to enforce the consistency between the sample distribution and the inferred latent code distribution, we propose a novel multi-modal regularization to enforce this consistency. At last, extensive experiments on both synthetic and real-world datasets have verified the effectiveness and correctness of our proposed JGMMN.

Sparse learning models have shown promising performance in the high dimensional machine learning applications. The main challenge of sparse learning models is how to optimize it efficiently. Most existing methods solve this problem by relaxing it as a convex problem, incurring large estimation bias. Thus, the sparse learning model with nonconvex constraint has attracted much attention due to its better performance. But it is difficult to optimize due to the non-convexity. In this paper, we propose a linearly convergent stochastic second-order method to optimize this nonconvex problem for large-scale datasets. The proposed method incorporates second-order information to improve the convergence speed. Theoretical analysis shows that our proposed method enjoys linear convergence rate and guarantees to converge to the underlying true model parameter. Experimental results have verified the efficiency and correctness of our proposed method.

Feature hashing is widely used to process large scale sparse features for learning of predictive models. Collisions inherently happen in the hashing process and hurt the model performance. In this paper, we develop a feature hashing scheme called Cuckoo Feature Hashing(CCFH) based on the principle behind Cuckoo hashing, a hashing scheme designed to resolve collisions. By providing multiple possible hash locations for each feature, CCFH prevents the collisions between predictive features by dynamically hashing them into alternative locations during model training. Experimental results on prediction tasks with hundred-millions of features demonstrate that CCFH can achieve the same level of performance by using only 15%-25% parameters compared with conventional feature hashing.

Most of current network representation models are learned in unsupervised fashions, which usually lack the capability of discrimination when applied to network analysis tasks, such as node classification. It is worth noting that label information is valuable for learning the discriminative network representations. However, labels of all training nodes are always difficult or expensive to obtain and manually labeling all nodes for training is inapplicable. Different sets of labeled nodes for model learning lead to different network representation results. In this paper, we propose a novel method, termed as ANRMAB, to learn the active discriminative network representations with a multi-armed bandit mechanism in active learning setting. Specifically, based on the networking data and the learned network representations, we design three active learning query strategies. By deriving an effective reward scheme that is closely related to the estimated performance measure of interest, ANRMAB uses a multi-armed bandit mechanism for adaptive decision making to select the most informative nodes for labeling. The updated labeled nodes are then used for further discriminative network representation learning. Experiments are conducted on three public data sets to verify the effectiveness of ANRMAB.

We study the problem of learning first-order rules from large Knowledge Graphs (KGs). With recent advancement in information extraction, vast data repositories in the KG format have been obtained such as Freebase and YAGO. However, traditional techniques for rule learning are not scalable for KGs. This paper presents a new approach RLvLR to learning rules from KGs by using the technique of embedding in representation learning together with a new sampling method. Experimental results show that our system outperforms some state-of-the-art systems. Specifically, for massive KGs with hundreds of predicates and over 10M facts, RLvLR is much faster and can learn much more quality rules than major systems for rule learning in KGs such as AMIE+. We also used the RLvLR-mined rules in an inference module to carry out the link prediction task. In this task, RLvLR outperformed Neural LP, a state-of-the-art link prediction system, in both runtime and accuracy.

Semi-Supervised Learning (SSL) is able to build reliable classifier with very scarce labeled examples by properly utilizing the abundant unlabeled examples. However, existing SSL algorithms often yield unsatisfactory performance due to the lack of supervision information. To address this issue, this paper formulates SSL as a Generalized Distillation (GD) problem, which treats existing SSL algorithm as a learner and introduces a teacher to guide the learner?s training process. Specifically, the intelligent teacher holds the privileged knowledge that ?explains? the training data but remains unknown to the learner, and the teacher should convey its rich knowledge to the imperfect learner through a specific teaching function. After that, the learner gains knowledge by ?imitating? the output of the teaching function under an optimization framework. Therefore, the learner in our algorithm learns from both the teacher and the training data, so its output can be substantially distilled and enhanced. By deriving the Rademacher complexity and error bounds of the proposed algorithm, the usefulness of the introduced teacher is theoretically demonstrated. The superiority of our algorithm to the related state-of-the-art methods has also been empirically demonstrated by the experiments on different datasets with various sources of privileged knowledge.

Structured-sparsity regularization is popular for sparse learning because of its flexibility of encoding the feature structures. This paper considers a generalized version of structured-sparsity regularization (especially for $l_1/l_{\infty}$ norm) with arbitrary group overlap. Due to the group overlap, it is time-consuming to solve the associated proximal operator. Although Mairal~\shortcite{mairal2010network} have proposed a network-flow algorithm to solve the proximal operator, it is still time-consuming especially in the high-dimensional setting. To address this challenge, in this paper, we have developed a more efficient solution for $l_1/l_{\infty}$ group lasso with arbitrary group overlap using an Inexact Proximal-Gradient method. In each iteration, our algorithm only requires to calculate an inexact solution to the proximal sub-problem, which can be done efficiently. On the theoretic side, the proposed algorithm enjoys the same global convergence rate as the exact proximal methods. Experiments demonstrate that our algorithm is much more efficient than network-flow algorithm, while retaining the similar generalization performance.