| Total: 182

This paper shows that simply prescribing "none of the above" labels to unlabeled data has a beneficial regularization effect to supervised learning. We call it universum prescription by the fact that the prescribed labels cannot be one of the supervised labels. In spite of its simplicity, universum prescription obtained competitive results in training deep convolutional networks for CIFAR-10, CIFAR-100, STL-10 and ImageNet datasets. A qualitative justification of these approaches using Rademacher complexity is presented. The effect of a regularization parameter — probability of sampling from unlabeled data — is also studied empirically.

Multi-label classification is a practical yet challenging task in machine learning related fields, since it requires the prediction of more than one label category for each input instance. We propose a novel deep neural networks (DNN) based model, Canonical Correlated AutoEncoder (C2AE), for solving this task. Aiming at better relating feature and label domain data for improved classification, we uniquely perform joint feature and label embedding by deriving a deep latent space, followed by the introduction of label-correlation sensitive loss function for recovering the predicted label outputs. Our C2AE is achieved by integrating the DNN architectures of canonical correlation analysis and autoencoder, which allows end-to-end learning and prediction with the ability to exploit label dependency. Moreover, our C2AE can be easily extended to address the learning problem with missing labels. Our experiments on multiple datasets with different scales confirm the effectiveness and robustness of our proposed method, which is shown to perform favorably against state-of-the-art methods for multi-label classification.

Feature selection aims to select a small subset from the high-dimensional features which can lead to better learning performance, lower computational complexity, and better model readability. The class imbalance problem has been neglected by traditional feature selection methods, therefore the selected features will be biased towards the majority classes. Because of the superiority of F-measure to accuracy for imbalanced data, we propose to use F-measure as the performance measure for feature selection algorithms. As a pseudo-linear function, the optimization of F-measure can be achieved by minimizing the total costs. In this paper, we present a novel cost-sensitive feature selection (CSFS) method which optimizes F-measure instead of accuracy to take class imbalance issue into account. The features will be selected according to optimal F-measure classifier after solving a series of cost-sensitive feature selection sub-problems. The features selected by our method will fully represent the characteristics of not only majority classes, but also minority classes. Extensive experimental results conducted on synthetic, multi-class and multi-label datasets validate the efficiency and significance of our feature selection method.

Kernel learning is a fundamental problem both in recent research and application of kernel methods. Existing kernel learning methods commonly use some measures of generalization errors to learn the optimal kernel in a convex (or conic) combination of prescribed basic kernels. However, the generalization bounds derived by these measures usually have slow convergence rates, and the basic kernels are finite and should be specified in advance. In this paper, we propose a new kernel learning method based on a novel measure of generalization error, called principal eigenvalue proportion (PEP), which can learn the optimal kernel with sharp generalization bounds over the convex hull of a possibly infinite set of basic kernels. We first derive sharp generalization bounds based on the PEP measure. Then we design two kernel learning algorithms for finite kernels and infinite kernels respectively, in which the derived sharp generalization bounds are exploited to guarantee faster convergence rates, moreover, basic kernels can be learned automatically for infinite kernel learning instead of being prescribed in advance. Theoretical analysis and empirical results demonstrate that the proposed kernel learning method outperforms the state-of-the-art kernel learning methods.

While much research effort has been dedicated to scaling up sparse Gaussian process (GP) models based on inducing variables for big data, little attention is afforded to the other less explored class of low-rank GP approximations that exploit the sparse spectral representation of a GP kernel. This paper presents such an effort to advance the state of the art of sparse spectrum GP models to achieve competitive predictive performance for massive datasets. Our generalized framework of stochastic variational Bayesian sparse spectrum GP (sVBSSGP) models addresses their shortcomings by adopting a Bayesian treatment of the spectral frequencies to avoid overfitting, modeling these frequencies jointly in its variational distribution to enable their interaction a posteriori, and exploiting local data for boosting the predictive performance. However, such structural improvements result in a variational lower bound that is intractable to be optimized. To resolve this, we exploit a variational parameterization trick to make it amenable to stochastic optimization. Interestingly, the resulting stochastic gradient has a linearly decomposable structure that can be exploited to refine our stochastic optimization method to incur constant time per iteration while preserving its property of being an unbiased estimator of the exact gradient of the variational lower bound. Empirical evaluation on real-world datasets shows that sVBSSGP outperforms state-of-the-art stochastic implementations of sparse GP models.

We are interested in exploring the possibility and benefits of structure learning for deep models. As the first step, this paper investigates the matter for Restricted Boltzmann Machines (RBMs). We conduct the study with Replicated Softmax, a variant of RBMs for unsupervised text analysis. We present a method for learning what we call Sparse Boltzmann Machines, where each hidden unit is connected to a subset of the visible units instead of all of them. Empirical results show that the method yields models with significantly improved model fit and interpretability as compared with RBMs where each hidden unit is connected to all visible units.

Curriculum Learning (CL) mimics the cognitive process ofhumans and favors a learning algorithm to follow the logical learning sequence from simple examples to more difficult ones. Recent studies show that selecting the simplest curriculum examples from different modalities for graph-based label propagation can yield better performance than simply leveraging single modality. However, they forcibly requirethe curriculums generated by all modalities to be identical to a common curriculum, which discard the individuality ofevery modality and produce the inaccurate curriculum for the subsequent learning. Therefore, this paper proposes a novel multi-modal CL algorithm by comprehensively investigating both the individuality and commonality of different modalities. By considering the curriculums of multiple modalities altogether, their common preference on selecting the simplestexamples can be explored by a row-sparse matrix, and their distinct opinions are captured by a sparse noise matrix. As a consequence, a "soft" fusion of multiple curriculums from different modalities is achieved and the propagation quality can thus be improved. Comprehensive empirical studies reveal that our method can generate higher accuracy than the state-of-the-art multi-modal CL approach and label propagation algorithms on various image classification tasks.

Partial label learning aims to induce a multi-class classifier from training examples where each of them is associated with a set of candidate labels, among which only one label is valid. The common discriminative solution to learn from partial label examples assumes one parametric model for each class label, whose predictions are aggregated to optimize specific objectives such as likelihood or margin over the training examples. Nonetheless, existing discriminative approaches treat the predictions from all parametric models in an equal manner, where the confidence of each candidate label being the ground-truth label is not differentiated. In this paper, a boosting-style partial label learning approach is proposed to enabling confidence-rated discriminative modeling. Specifically, the ground-truth confidence of each candidate label is maintained in each boosting round and utilized to train the base classifier. Extensive experiments on artificial as well as real-world partial label data sets validate the effectiveness of the confidence-rated discriminative modeling.

With the advance of acquisition techniques, plentiful higherorder tensor data sets are built up in a great variety of fields such as computer vision, neuroscience, remote sensing and recommender systems. The real-world tensors often contain missing values, which makes tensor completion become a prerequisite to utilize them. Previous studies have shown that imposing a low-rank constraint on tensor completion produces impressive performances. In this paper, we argue that low-rank constraint, albeit useful, is not effective enough to exploit the local smooth and piecewise priors of visual data. We propose integrating total variation into low-rank tensor completion (LRTC) to address the drawback. As LRTC can be formulated by both tensor unfolding and tensor decomposition, we develop correspondingly two methods, namely LRTC-TV-I and LRTC-TVII, and their iterative solvers. Extensive experimental results on color image and medical image inpainting tasks show the effectiveness and superiority of the two methods against state-of-the-art competitors.

Denoising autoencoders (DAE) are trained to reconstruct their clean inputs with noise injected at the input level, while variational autoencoders (VAE) are trained with noise injected in their stochastic hidden layer, with a regularizer that encourages this noise injection. In this paper, we show that injecting noise both in input and in the stochastic hidden layer can be advantageous and we propose a modified variational lower bound as an improved objective function in this setup. When input is corrupted, then the standard VAE lower bound involves marginalizing the encoder conditional distribution over the input noise, which makes the training criterion intractable. Instead, we propose a modified training criterion which corresponds to a tractable bound when input is corrupted. Experimentally, we find that the proposed denoising variational autoencoder (DVAE) yields better average log-likelihood than the VAE and the importance weighted autoencoder on the MNIST and Frey Face datasets.

Correlation measures are a key element of statistics and machine learning, and essential for a wide range of data analysis tasks. Most existing correlation measures are for pairwise relationships, but real-world data can also exhibit complex multivariate correlations, involving three or more variables. We argue that multivariate correlation measures should be comparable, interpretable, scalable and unbiased. However, no existing measures satisfy all these requirements. In this paper, we propose an unbiased multivariate correlation measure, called UMC, which satisfies all the above criteria. UMC is a cumulative entropy based non-parametric multivariate correlation measure, which can capture both linear and non-linear correlations for groups of three or more variables. It employs a correction for chance using a statistical model of independence to address the issue of bias. UMC has high interpretability and we empirically show it outperforms state-of-the-art multivariate correlation measures in terms of statistical power, as well as for use in both subspace clustering and outlier detection tasks.

Gaussian state space models have been used for decades as generative models of sequential data. They admit an intuitive probabilistic interpretation, have a simple functional form, and enjoy widespread adoption. We introduce a unified algorithm to efficiently learn a broad class of linear and non-linear state space models, including variants where the emission and transition distributions are modeled by deep neural networks. Our learning algorithm simultaneously learns a compiled inference network and the generative model, leveraging a structured variational approximation parameterized by recurrent neural networks to mimic the posterior distribution. We apply the learning algorithm to both synthetic and real-world datasets, demonstrating its scalability and versatility. We find that using the structured approximation to the posterior results in models with significantly higher held-out likelihood.

This paper proposes a one-step spectral clustering method by learning an intrinsic affinity matrix (i.e., the clustering result) from the low-dimensional space (i.e., intrinsic subspace) of original data. Specifically, the intrinsic affinitymatrix is learnt by: 1) the alignment of the initial affinity matrix learnt from original data; 2) the adjustment of the transformation matrix, which transfers the original feature space into its intrinsic subspace by simultaneously conducting feature selection and subspace learning; and 3) the clustering result constraint, i.e., the graph constructed by the intrinsic affinity matrix has exact c connected components where c is the number of clusters. In this way, two affinity matrices and a transformation matrix are iteratively updated until achieving their individual optimum, so that these two affinity matrices are consistent and the intrinsic subspace is learnt via the transformation matrix. Experimental results on both synthetic and benchmark datasets verified that our proposed method outputted more effective clustering result than the previous clustering methods.

We consider a class of non-convex learning problems that can be formulated as jointly optimizing regularized hinge loss and a set of auxiliary variables. Such problems encompass but are not limited to various versions of semi-supervised learning,learning with hidden structures, robust learning, etc. Existing methods either suffer from local minima or have to invoke anon-scalable combinatorial search. In this paper, we propose a novel learning procedure, namely Parametric Dual Maximization(PDM), that can approach global optimality efficiently with user specified approximation levels. The building blocks of PDM are two new results: (1) The equivalent convex maximization reformulation derived by parametric analysis.(2) The improvement of local solutions based on a necessary and sufficient condition for global optimality. Experimental results on two representative applications demonstrate the effectiveness of PDM compared to other approaches.

Subspace clustering has been widely applied to detect meaningful clusters in high-dimensional data spaces. A main challenge in subspace clustering is to quickly calculate a "good" affinity matrix. ℓ0, ℓ1, ℓ2 or nuclear norm regularization is used to construct the affinity matrix in many subspace clustering methods because of their theoretical guarantees and empirical success. However, they suffer from the following problems: (1) ℓ2 and nuclear norm regularization require very strong assumptions to guarantee a subspace-preserving affinity; (2) although ℓ1 regularization can be guaranteed to give a subspace-preserving affinity under certain conditions, it needs more time to solve a large-scale convex optimization problem; (3) ℓ0 regularization can yield a tradeoff between computationally efficient and subspace-preserving affinity by using the orthogonal matching pursuit (OMP) algorithm, but this still takes more time to search the solution in OMP when the number of data points is large. In order to overcome these problems, we first propose a learned OMP (LOMP) algorithm to learn a single hidden neural network (SHNN) to fast approximate the ℓ0code. We then exploit a sparse subspace clustering method based on ℓ0 code which is fast computed by SHNN. Two sufficient conditions are presented to guarantee that our method can give a subspace-preserving affinity. Experiments on handwritten digit and face clustering show that our method not only quickly computes the ℓ0 code, but also outperforms the relevant subspace clustering methods in clustering results. In particular, our method achieves the state-of-the-art clustering accuracy (94.32%) on MNIST.

In this paper, we consider the problem of actively learning a linear classifier through query synthesis where the learner can construct artificial queries in order to estimate the true decision boundaries. This problem has recently gained a lot of interest in automated science and adversarial reverse engineering for which only heuristic algorithms are known. In such applications, queries can be constructed de novo to elicit information (e.g., automated science) or to evade detection with minimal cost (e.g., adversarial reverse engineering). We develop a general framework, called dimension coupling (DC), that 1) reduces a d-dimensional learning problem to d-1 low dimensional sub-problems, 2) solves each sub-problem efficiently, 3) appropriately aggregates the results and outputs a linear classifier, and 4) provides a theoretical guarantee for all possible schemes of aggregation. The proposed method is proved resilient to noise. We show that the DC framework avoids the curse of dimensionality: its computational complexity scales linearly with the dimension. Moreover, we show that the query complexity of DC is near optimal (within a constant factor of the optimum algorithm). To further support our theoretical analysis, we compare the performance of DC with the existing work. We observe that DC consistently outperforms the prior arts in terms of query complexity while often running orders of magnitude faster.

The study on generalization performance of ranking algorithms is one of the fundamental issues in ranking learning theory. Although several generalization bounds have been proposed based on different measures, the convergence rates of the existing bounds are usually at most O(√1/n), where n is the size of data set. In this paper, we derive novel generalization bounds for the regularized ranking in reproducing kernel Hilbert space via integral operator of kernel function. We prove that the rates of our bounds are much faster than (√1/n). Specifically, we first introduce a notion of local Rademacher complexity for ranking, called local ranking Rademacher complexity, which is used to measure the complexity of the space of loss functions of the ranking. Then, we use the local ranking Rademacher complexity to obtain a basic generalization bound. Finally, we establish the relationship between the local Rademacher complexity and the eigenvalues of integral operator, and further derive sharp generalization bounds of faster convergence rate.

The k-fold cross-validation is commonly used to evaluate the effectiveness of SVMs with the selected hyper-parameters. It is known that the SVM k-fold cross-validation is expensive, since it requires training k SVMs. However, little work has explored reusing the h-th SVM for training the (h+1)-th SVM for improving the efficiency of k-fold cross-validation. In this paper, we propose three algorithms that reuse the h-th SVM for improving the efficiency of training the (h+1)-th SVM. Our key idea is to efficiently identify the support vectors and to accurately estimate their associated weights (also called alpha values) of the next SVM by using the previous SVM. Our experimental results show that our algorithms are several times faster than the k-fold cross-validation which does not make use of the previously trained SVM. Moreover, our algorithms produce the same results (hence same accuracy) as the k-fold cross-validation which does not make use of the previously trained SVM.

Visual versus near infrared (VIS-NIR) face recognition is still a challenging heterogeneous task due to large appearance difference between VIS and NIR modalities. This paper presents a deep convolutional network approach that uses only one network to map both NIR and VIS images to a compact Euclidean space. The low-level layers of this network are trained only on large-scale VIS data. Each convolutional layer is implemented by the simplest case of maxout operator. The high-level layer is divided into two orthogonal subspaces that contain modality-invariant identity information and modality-variant spectrum information respectively. Our joint formulation leads to an alternating minimization approach for deep representation at the training time and an efficient computation for heterogeneous data at the testing time. Experimental evaluations show that our method achieves 94% verification rate at FAR=0.1% on the challenging CASIA NIR-VIS 2.0 face recognition dataset. Compared with state-of-the-art methods, it reduces the error rate by 58% only with a compact 64-D representation.

Unsupervised neural networks, such as restricted Boltzmann machines (RBMs) and deep belief networks (DBNs), are powerful tools for feature selection and pattern recognition tasks. We demonstrate that overfitting occurs in such models just as in deep feedforward neural networks, and discuss possible regularization methods to reduce overfitting. We also propose a "partial" approach to improve the efficiency of Dropout/DropConnect in this scenario, and discuss the theoretical justification of these methods from model convergence and likelihood bounds. Finally, we compare the performance of these methods based on their likelihood and classification error rates for various pattern recognition data sets.

Matrix completion as a common problem in many application domains has received increasing attention in the machine learning community. Previous matrix completion methods have mostly focused on exploiting the matrix low-rank property to recover missing entries. Recently, it has been noticed that side information that describes the matrix items can help to improve the matrix completion performance. In this paper, we propose a novel matrix completion approach that exploits side information within a principled co-embedding framework. This framework integrates a low-rank matrix factorization model and a label embedding based prediction model together to derive a convex co-embedding formulation with nuclear norm regularization. We develop a fast proximal gradient descent algorithm to solve this co-embedding problem. The effectiveness of the proposed approach is demonstrated on two types of real world application problems.

Feature selection is an important technique in machine learning research. An effective and robust feature selection method is desired to simultaneously identify the informative features and eliminate the noisy ones of data. In this paper, we consider the unsupervised feature selection problem which is particularly difficult as there is not any class labels that would guide the search for relevant features. To solve this, we propose a novel algorithmic framework which performs unsupervised feature selection. Firstly, the proposed framework implements structure learning, where the data structures (including intrinsic distribution structure and the data segment) are found via a combination of the alternative optimization and clustering. Then, both the intrinsic data structure and data segmentation are formulated as regularization terms for discriminant feature selection. The results of the feature selection also affect the structure learning step in the following iterations. By leveraging the interactions between structure learning and feature selection, we are able to capture more accurate structure of data and select more informative features. Clustering and classification experiments on real world image data sets demonstrate the effectiveness of our method.

Discovering robust low-rank data representations is important in many real-world problems. Traditional robust principal component analysis (RPCA) assumes that the observed data are corrupted by some sparse noise (e.g., Laplacian noise) and utilizes the l1-norm to separate out the noisy compo- nent. Nevertheless, as well as simple Gaussian or Laplacian noise, noise in real-world data is often more complex, and thus the l1 and l2-norms are insufficient for noise charac- terization. This paper presents a more flexible approach to modeling complex noise by investigating their properties in the frequency domain. Although elements of a noise matrix are chaotic in the spatial domain, the absolute values of its alternative coefficients in the frequency domain are constant w.r.t. their variance. Based on this observation, a new robust PCA algorithm is formulated by simultaneously discovering the low-rank and noisy components. Extensive experiments on synthetic data and video background subtraction demon- strate that FRPCA is effective for handles complex noise.

Clustering tasks often requires multiple views rather than a singleview to correctly reflect diverse characteristics of the cluster boundaries. The cluster boundaries estimated using a single view are incorrect in general, and those incorrect estimation should be compensated by helps of other views. If each viewis independent to other views, incorrect estimations will be mostly revised as the number of views grow. However, as the number of views grow, it is almost impossibleto avoid dependencies among views, and such dependencies often delude correct estimations. Thus, dependencies among views should be carefully considered in multi-view clustering. This paper proposes a new spectral clustering method to deal with multi-view data and dependencies among views. The proposed method is motivated by the brainstorming process. In the brainstorming process, an instance is regarded as an agenda to be discussed, while each view is considered as a brainstormer. Through the discussion step in the brainstorming process, a brainstormer iteratively suggests their opinions and accepts others’ different opinions. To compensate the biases caused by information sharing between brainstormers with dependent opinions, those having independent opinions are more encouraged to discuss together than those with dependent opinions. The conclusion step makes a compromise by merging or concatenating all opinions. The clustering is finally done after the conclusion. Experimental results in three tasks show the effectiveness of the proposed method comparing with ordinary single and multi-view spectral clusterings.

Learning a smooth skeleton in a low-dimensional space from noisy data becomes important in computer vision and computational biology. Existing methods assume that the manifold constructed from the data is smooth, but they lack the ability to model skeleton structures from noisy data. To overcome this issue, we propose a novel probabilistic structured learning model to learn the density of latent embedding given high-dimensional data and its neighborhood graph. The embedded points that form a smooth skeleton structure are obtained by maximum a posteriori (MAP) estimation. Our analysis shows that the resulting similarity matrix is sparse and unique, and its associated kernel has eigenvalues that follow a power law distribution, which leads to the embeddings of a smooth skeleton. The model is extended to learn a sparse similarity matrix when the graph structure is unknown. Extensive experiments demonstrate the effectiveness of the proposed methods on various datasets by comparing them with existing methods.