| Total: 137

In computer vision, a complex entity such as an image or video is often represented as a set of instance vectors, which are extracted from different parts of that entity. Thus, it is essential to design a representation to encode information in a set of instances robustly. Existing methods such as FV and VLAD are designed based on a generative perspective, and their performances fluctuate when difference types of instance vectors are used (i.e., they are not robust). The proposed D3 method effectively compares two sets as two distributions, and proposes a directional total variation distance (DTVD) to measure their dissimilarity. Furthermore, a robust classifier-based method is proposed to estimate DTVD robustly, and to efficiently represent these sets. D3 is evaluated in action and image recognition tasks. It achieves excellent robustness, accuracy and speed.

We propose a new input perturbation mechanism for publishing a covariance matrix to achieve (epsilon,0)-differential privacy. Our mechanism uses a Wishart distribution to generate matrix noise. In particular, we apply this mechanism to principal component analysis (PCA). Our mechanism is able to keep the positive semi-definiteness of the published covariance matrix. Thus, our approach gives rise to a general publishing framework for input perturbation of a symmetric positive semidefinite matrix. Moreover, compared with the classic Laplace mechanism, our method has better utility guarantee. To the best of our knowledge, the Wishart mechanism is the best input perturbation approach for (epsilon,0)-differentially private PCA. We also compare our work with previous exponential mechanism algorithms in the literature and provide near optimal bound while having more flexibility and less computational intractability.

In multi-label learning, there are two main challenges: missing labels and class imbalance (CIB). The former assumes that only a partial set of labels are provided for each training instance while other labels are missing. CIB is observed from two perspectives: first, the number of negative labels of each instance is much larger than its positive labels; second, the rate of positive instances (i.e. the number of positive instances divided by the total number of instances) of different classes are significantly different. Both missing labels and CIB lead to significant performance degradation. In this work, we propose a new method to handle these two challenges simultaneously. We formulate the problem as a constrained submodular minimization that is composed of a submodular objective function that encourages label consistency and smoothness, as well as, class cardinality bound constraints to handle class imbalance. We further present a convex approximation based on the Lovasz extension of submodular functions, leading to a linear program, which can be efficiently solved by the alternative direction method of multipliers (ADMM). Experimental results on several benchmark datasets demonstrate the improved performance of our method over several state-of-the-art methods.

This paper investigates algorithms to automatically adapt the learning rate of neural networks (NNs). Starting with stochastic gradient descent, a large variety of learning methods has been proposed for the NN setting. However, these methods are usually sensitive to the initial learning rate which has to be chosen by the experimenter. We investigate several features and show how an adaptive controller can adjust the learning rate without prior knowledge of the learning problem at hand.

We consider convex-concave saddle point problems with a separable structure and non-strongly convex functions. We propose an efficient stochastic block coordinate descent method using adaptive primal-dual updates, which enables flexible parallel optimization for large-scale problems. Our method shares the efficiency and flexibility of block coordinate descent methods with the simplicity of primal-dual methods and utilizing the structure of the separable convex-concave saddle point problem. It is capable of solving a wide range of machine learning applications, including robust principal component analysis, Lasso, and feature selection by group Lasso, etc. Theoretically and empirically, we demonstrate significantly better performance than state-of-the-art methods in all these applications.

This paper proposes a set of new error criteria and a learning approach, called Adaptive Normalized Risk-Averting Training (ANRAT) to attack the non-convex optimization problem in training deep neural networks without pretraining. Theoretically, we demonstrate its effectiveness based on the expansion of the convexity region. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why our learning method using gradient descent works. In practice, we show how this training method is successfully applied for improved training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Using simple experimental settings without pretraining and other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptron and Denoised Auto-encoder is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other common tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization strategy in training DNNs.

Brown clustering is an established technique, used in hundreds of computational linguistics papers each year, to group word types that have similar distributional information. It is unsupervised and can be used to create powerful word representations for machine learning. Despite its improbable success relative to more complex methods, few have investigated whether Brown clustering has really been applied optimally. In this paper, we present a subtle but profound generalisation of Brown clustering to improve the overall quality by decoupling the number of output classes from the computational active set size. Moreover, the generalisation permits a novel approach to feature selection from Brown clusters: We show that the standard approach of shearing the Brown clustering output tree at arbitrary bitlengths is lossy and that features should be chosen insead by rolling up Generalised Brown hierarchies. The generalisation and corresponding feature generation is more principled, challenging the way Brown clustering is currently understood and applied.

Multi-label propagation aims to transmit the multi-label information from labeled examples to unlabeled examples based on a weighted graph. Existing methods ignore the specific propagation difficulty of different unlabeled examples and conduct the propagationin an imperfect sequence, leading to the error-prone classification of some difficult examples with uncertain labels. To address this problem, this paper associates each possible label with a "teacher", and proposesa "Multi-Label Teaching-to-Learn and Learning-to-Teach" (ML-TLLT) algorithm, so that the entire propagationprocess is guided by the teachers and manipulated from simple examples to more difficult ones. In the teaching-to-learn step, the teachers select the simplest examples for the current propagation by investigating both the definitiveness of each possible label of the unlabeled examples, and the dependencies between labels revealed by the labeled examples. In the learning-to-teach step, the teachers reversely learn from the learner’s feedback to properly select the simplest examples for the next propagation. Thorough empirical studies show that due to the optimized propagation sequence designed by the teachers, ML-TLLT yields generally better performance than seven state-of-the-art methods on the typical multi-label benchmark datasets.

We propose a probabilistic model for non-exhaustive and overlapping (NEO) bi-clustering. Our goal is to extract a few sub-matrices from the given data matrix, where entries of a sub-matrix are characterized by a specific distribution or parameters. Existing NEO biclustering methods typically require the number of sub-matrices to be extracted, which is essentially difficult to fix a priori. In this paper, we extend the plaid model, known as one of the best NEO bi-clustering algorithms, to allow infinite bi-clustering; NEO bi-clustering without specifying the number of sub-matrices. Our model can represent infinite sub-matrices formally. We develop a MCMC inference without the finite truncation, which potentially addresses all possible numbers of sub-matrices. Experiments quantitatively and qualitatively verify the usefulness of the proposed model. The results reveal that our model can offer more precise and in-depth analysis of sub-matrices.

Matrix completion is to recover missing/unobserved values of a data matrix from very limited observations. Due to widely potential applications, it has received growing interests in fields from machine learning, data mining, to collaborative filtering and computer vision. To ensure the successful recovery of missing values, most existing matrix completion algorithms utilise the low-rank assumption, i.e., the fully observed data matrix has a low rank, or equivalently the columns of the matrix can be linearly represented by a few numbers of basis vectors. Although such low-rank assumption applies generally in practice, real-world data can process much richer structural information. In this paper, we present a new model for matrix completion, motivated by the separability assumption of nonnegative matrices from the recent literature of matrix factorisations: there exists a set of columns of the matrix such that the resting columns can be represented by their convex combinations. Given the separability property, which holds reasonably for many applications, our model provides a more accurate matrix completion than the low-rank based algorithms. Further, we derives a scalable algorithm to solve our matrix completion model, which utilises a randomised method to select the basis columns under the separability assumption and a coordinate gradient based method to automatically deal with the structural constraints in optimisation. Compared to the state-of-the-art algorithms, the proposed matrix completion model achieves competitive results on both synthetic and real datasets.

Various sparse regularizers have been applied to machine learning problems, among which structured sparsity has been proposed for a better adaption to structured data. In this paper, motivated by effectively classifying linked data (e.g. Web pages, tweets, articles with references, and biological network data) where a group structure exists over the whole dataset and links exist between specific samples, we propose a joint sparse representation model that combines group sparsity and graph sparsity, to select a small number of connected components from the graph of linked samples, meanwhile promoting the sparsity of edges that link samples from different groups in each connected component. Consequently, linked samples are selected from a few sparsely-connected groups. Both theoretical analysis and experimental results on four benchmark datasets show that the joint sparsity model outperforms traditional group sparsity model and graph sparsity model, as well as the latest group-graph sparsity model.

General zero-shot learning (ZSL) approaches exploit transfer learning via semantic knowledge space. In this paper, we reveal a novel relational knowledge transfer (RKT) mechanism for ZSL, which is simple, generic and effective. RKT resolves the inherent semantic shift problem existing in ZSL through restoring the missing manifold structure of unseen categories via optimizing semantic mapping. It extracts the relational knowledge from data manifold structure in semantic knowledge space based on sparse coding theory. The extracted knowledge is then transferred backwards to generate virtual data for unseen categories in the feature space. On the one hand, the generalizing ability of the semantic mapping function can be enhanced with the added data. On the other hand, the mapping function for unseen categories can be learned directly from only these generated data, achieving inspiring performance. Incorporated with RKT, even simple baseline methods can achieve good results. Extensive experiments on three challenging datasets show prominent performance obtained by RKT, and we obtain 82.43% accuracy on the Animals with Attributes dataset.

Hierarchical latent tree analysis (HLTA) is recently proposed as a new method for topic detection. It differs fundamentally from the LDA-based methods in terms of topic definition, topic-document relationship, and learning method. It has been shown to discover significantly more coherent topics and better topic hierarchies. However, HLTA relies on the Expectation-Maximization (EM) algorithm for parameter estimation and hence is not efficient enough to deal with large datasets. In this paper, we propose a method to drastically speed up HLTA using a technique inspired by the advances in the method of moments. Empirical experiments show that our method greatly improves the efficiency of HLTA. It is as efficient as the state-of-the-art LDA-based method for hierarchical topic detection and finds substantially better topics and topic hierarchies.

The Dirichlet process mixtures (DPM) can automatically infer the model complexity from data. Hence it has attracted significant attention recently, and is widely used for model selection and clustering. As a generative model, it generally requires prior base distribution to learn component parameters by maximizing posterior probability. In contrast, discriminative classifiers model the conditional probability directly, and have yielded better results than generative classifiers.In this paper, we propose a maximum margin Dirichlet process mixture for clustering, which is different from the traditional DPM for parameter modeling. Our model takes a discriminative clustering approach, by maximizing a conditional likelihood to estimate parameters. In particular, we take a EM-like algorithm by leveraging Gibbs sampling algorithm for inference, which in turn can be perfectly embedded in the online maximum margin learning procedure to update model parameters. We test our model and show comparative results over the traditional DPM and other nonparametric clustering approaches.

Despite its nonconvex nature, ℓ0 sparse approximation is desirable in many theoretical and application cases. We study the ℓ0 sparse approximation problem with the tool of deep learning, by proposing Deep ℓ0 Encoders. Two typical forms, the ℓ0 regularized problem and the M-sparse problem, are investigated. Based on solid iterative algorithms, we model them as feed-forward neural networks, through introducing novel neurons and pooling functions. Enforcing such structural priors acts as an effective network regularization. The deep encoders also enjoy faster inference, larger learning capacity, and better scalability compared to conventional sparse coding solutions. Furthermore, under task-driven losses, the models can be conveniently optimized from end to end. Numerical results demonstrate the impressive performances of the proposed encoders.

Learning in deep models using Bayesian methods has generated significant attention recently. This is largely because of the feasibility of modern Bayesian methods to yield scalable learning and inference, while maintaining a measure of uncertainty in the model parameters. Stochastic gradient MCMC algorithms (SG-MCMC) are a family of diffusion-based sampling methods for large-scale Bayesian learning. In SG-MCMC, multivariate stochastic gradient thermostats (mSGNHT) augment each parameter of interest, with a momentum and a thermostat variable to maintain stationary distributions as target posterior distributions. As the number of variables in a continuous-time diffusion increases, its numerical approximation error becomes a practical bottleneck, so better use of a numerical integrator is desirable. To this end, we propose use of an efficient symmetric splitting integrator in mSGNHT, instead of the traditional Euler integrator. We demonstrate that the proposed scheme is more accurate, robust, and converges faster. These properties are demonstrated to be desirable in Bayesian deep learning. Extensive experiments on two canonical models and their deep extensions demonstrate that the proposed scheme improves general Bayesian posterior sampling, particularly for deep models.

Effective training of deep neural networks suffers from two main issues. The first is that the parameter space of these models exhibit pathological curvature. Recent methods address this problem by using adaptive preconditioning for Stochastic Gradient Descent (SGD). These methods improve convergence by adapting to the local geometry of parameter space. A second issue is overfitting, which is typically addressed by early stopping. However, recent work has demonstrated that Bayesian model averaging mitigates this problem. The posterior can be sampled by using Stochastic Gradient Langevin Dynamics (SGLD). However, the rapidly changing curvature renders default SGLD methods inefficient. Here, we propose combining adaptive preconditioners with SGLD. In support of this idea, we give theoretical properties on asymptotic convergence and predictive risk. We also provide empirical results for Logistic Regression, Feedforward Neural Nets, and Convolutional Neural Nets, demonstrating that our preconditioned SGLD method gives state-of-the-art performance on these models.

Bayesian models provide a framework for probabilistic modelling of complex datasets. Many such models are computationally demanding, especially in the presence of large datasets. In sensor network applications, statistical (Bayesian) parameter estimation usually relies on decentralized algorithms, in which both data and computation are distributed across the nodes of the network. In this paper we propose a framework for decentralized Bayesian learning using Bregman Alternating Direction Method of Multipliers (B-ADMM). We demonstrate the utility of our framework, with Mean Field Variational Bayes (MFVB) as the primitive for distributed affine structure from motion (SfM).

Recent years have witnessed the success of deep neural networks in dealing with a plenty of practical problems. The invention of effective training techniques largely contributes to this success. The so-called "Dropout" training scheme is one of the most powerful tool to reduce over-fitting. From the statistic point of view, Dropout works by implicitly imposing an L2 regularizer on the weights. In this paper, we present a new training scheme: Shakeout. Instead of randomly discarding units as Dropout does at the training stage, our method randomly chooses to enhance or inverse the contributions of each unit to the next layer. We show that our scheme leads to a combination of L1 regularization and L2 regularization imposed on the weights, which has been proved effective by the Elastic Net models in practice.We have empirically evaluated the Shakeout scheme and demonstrated that sparse network weights are obtained via Shakeout training. Our classification experiments on real-life image datasets MNIST and CIFAR-10 show that Shakeout deals with over-fitting effectively.

We introduce a broad family of decision trees, Composite Trees, whose leaf classifiers are selected out of a hypothesis set composed of p subfamilies with different complexities. We prove new data-dependent learning guarantees for this family in the multi-class setting. These learning bounds provide a quantitative guidance for the choice of the hypotheses at each leaf. Remarkably, they depend on the Rademacher complexities of the sub-families of the predictors and the fraction of sample points correctly classified at each leaf. We further introduce random composite trees and derive learning guarantees for random composite trees which also apply to Random Forests. Using our theoretical analysis, we devise a new algorithm, RANDOMCOMPOSITEFORESTS (RCF), that is based on forming an ensemble of random composite trees. We report the results of experiments demonstrating that RCF yields significant performance improvements over both Random Forests and a variant of RCF in several tasks.

Many learning problems in real world applications involve rich datasets comprising multiple information modalities. In this work, we study co-regularized PLSA(coPLSA) as an efficient solution to probabilistic topic analysis of multi-modal data. In coPLSA, similarities between topic compositions of a data entity across different data modalities are measured with divergences between discrete probabilities, which are incorporated as a co-regularizer to augment individual PLSA models over each data modality. We derive efficient iterative learning algorithms for coPLSA with symmetric KL, L2 and L1 divergences as co-regularizers, in each case the essential optimization problem affords simple numerical solutions that entail only matrix arithmetic operations and numerical solution of 1D nonlinear equations. We evaluate the performance of the coPLSA algorithms on text/image cross-modal retrieval tasks, on which they show competitive performance with state-of-the-art methods.

Cluster validation constitutes one of the most challenging problems in unsupervised cluster analysis. For example, identifying the true number of clusters present in a dataset has been investigated for decades, and is still puzzling researchers today. The difficulty stems from the high variety of the dataset characteristics. Some datasets exhibit a strong structure with a few well-separated and normally distributed clusters, but most often real-world datasets contain possibly many overlapping non-gaussian clusters with heterogeneous variances and shapes. This calls for the design of robust clustering algorithms that could adapt to the structure of the data and in particular accurately guess the true number of clusters. They have recently been interesting attempts to design such algorithms, e.g. based on involved non-parametric statistical inference techniques. In this paper, we develop Viral Clustering (VC), a simple algorithm that jointly estimates the number of clusters and outputs clusters. The VC algorithm relies on two antagonist and interacting components. The first component tends to regroup neighbouring samples together, while the second component tends to spread samples in various clusters. This spreading component is performed using an analogy with the way virus spread over networks. We present extensive numerical experiments illustrating the robustness of the VC algorithm, and its superiority compared to existing algorithms.

We present a simple noise-robust margin-based active learn-ing algorithm to find homogeneous (passing the origin) linearseparators and analyze its error convergence when labels arecorrupted by noise. We show that when the imposed noisesatisfies the Tsybakov low noise condition (Mammen, Tsy-bakov, and others 1999; Tsybakov 2004) the algorithm is ableto adapt to unknown level of noise and achieves optimal sta-tistical rate up to polylogarithmic factors. We also derive lower bounds for margin based active learningalgorithms under Tsybakov noise conditions (TNC) for themembership query synthesis scenario (Angluin 1988). Ourresult implies lower bounds for the stream based selectivesampling scenario (Cohn 1990) under TNC for some fairlysimple data distributions. Quite surprisingly, we show that thesample complexity cannot be improved even if the underly-ing data distribution is as simple as the uniform distributionon the unit ball. Our proof involves the construction of a well-separated hypothesis set on the d-dimensional unit ball alongwith carefully designed label distributions for the Tsybakovnoise condition. Our analysis might provide insights for otherforms of lower bounds as well.

We address the problem of maximizing an unknown submodular function that can only be accessed via noisy evaluations. Our work is motivated by the task of summarizing content, e.g., image collections, by leveraging users' feedback in form of clicks or ratings. For summarization tasks with the goal of maximizing coverage and diversity, submodular set functions are a natural choice. When the underlying submodular function is unknown, users' feedback can provide noisy evaluations of the function that we seek to maximize. We provide a generic algorithm — ExpGreedy — for maximizing an unknown submodular function under cardinality constraints. This algorithm makes use of a novel exploration module— TopX — that proposes good elements based on adaptively sampling noisy function evaluations. TopX is able to accommodate different kinds of observation models such as value queries and pairwise comparisons. We provide PAC-style guarantees on the quality and sampling cost of the solution obtained by ExpGreedy. We demonstrate the effectiveness of our approach in an interactive, crowdsourced image collection summarization application.

We propose probabilistic models for predicting future classifiers given labeled data with timestamps collected until the current time. In some applications, the decision boundary changes over time. For example, in spam mail classification, spammers continuously create new spam mails to overcome spam filters, and therefore, the decision boundary that classifies spam or non-spam can vary. Existing methods require additional labeled and/or unlabeled data to learn a time-evolving decision boundary. However, collecting these data can be expensive or impossible. By incorporating time-series models to capture the dynamics of a decision boundary, the proposed model can predict future classifiers without additional data. We developed two learning algorithms for the proposed model on the basis of variational Bayesian inference. The effectiveness of the proposed method is demonstrated with experiments using synthetic and real-world data sets.