Total: 316

Many stochastic search algorithms are designed to optimize a fixed objective function to learn a task; if the objective function changes slightly, for example due to a change in the situation or context of the task, relearning is required to adapt to the new context. For instance, if we want to learn a kicking movement for a soccer robot, we have to relearn the movement for different ball locations. Such relearning is undesirable, as it is highly inefficient and many applications require fast adaptation to a new context or situation. Therefore, we investigate contextual stochastic search algorithms that can learn multiple, similar tasks simultaneously. Current contextual stochastic search methods are based on policy search algorithms and suffer from premature convergence and the need for parameter tuning. In this paper, we extend the well-known CMA-ES algorithm to the contextual setting and illustrate its performance on several contextual tasks. Our new algorithm, called contextual CMA-ES, benefits from contextual learning while preserving all the features of standard CMA-ES, such as stability, avoidance of premature convergence, step size control and a minimal amount of parameter tuning.

Hashing is an important tool in large-scale machine learning. Unfortunately, current data-dependent hashing algorithms are not robust to small perturbations of the data points, which degrades the performance of nearest neighbor (NN) search. The culprit is the minimization of the L_2-norm, average distortion among pairs of points to find the hash function. Inspired by recent progress in robust optimization, we develop a novel hashing algorithm, dubbed RHash, that instead minimizes the L_∞-norm, worst-case distortion among pairs of points. We develop practical and efficient implementations of RHash that couple the alternating direction method of multipliers (ADMM) framework with column generation to scale well to large datasets. A range of experimental evaluations demonstrates the superiority of RHash over ten state-of-the-art binary hashing schemes. In particular, we show that RHash achieves the same retrieval performance as the state-of-the-art algorithms in terms of average precision while using up to 60% fewer bits.
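
The difference between an average (L_2-style) and a worst-case (L_∞-style) distortion measure can be illustrated with a minimal sketch. The toy points, the hand-picked binary codes, the `scale` factor, and the helper names are all assumptions made for illustration; this is not the RHash objective or its ADMM solver, only a comparison of two ways to summarize per-pair distortion.

```python
import itertools
import math

def hamming(a, b):
    """Number of positions where two binary codes differ."""
    return sum(x != y for x, y in zip(a, b))

def pairwise_gaps(points, codes, scale):
    """Per-pair gap |scale * Hamming(code_i, code_j) - Euclidean(x_i, x_j)|."""
    gaps = []
    for i, j in itertools.combinations(range(len(points)), 2):
        d_true = math.dist(points[i], points[j])
        d_code = scale * hamming(codes[i], codes[j])
        gaps.append(abs(d_code - d_true))
    return gaps

def average_distortion(gaps):
    """Root-mean-square gap: an average-case (L_2-style) summary."""
    return math.sqrt(sum(g * g for g in gaps) / len(gaps))

def worst_case_distortion(gaps):
    """Largest gap over all pairs: a worst-case (L_inf-style) summary."""
    return max(gaps)

# Hypothetical toy data: four 2-D points and hand-picked 2-bit codes.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 3.0)]
codes = [(0, 0), (1, 0), (0, 1), (1, 1)]
gaps = pairwise_gaps(points, codes, scale=1.0)
avg = average_distortion(gaps)
worst = worst_case_distortion(gaps)
```

On this toy input the worst-case gap is sqrt(13) - 1 (about 2.61), driven by the outlying point, while the root-mean-square gap is smaller because the well-preserved pairs pull the average down.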

With the recent proliferation of human-oriented robotic applications in domestic and industrial scenarios, it is vital for robots to continually learn about their environments and about the humans they share their environments with. In this paper, we present a novel, online, incremental framework for unsupervised symbol grounding in real-world, human environments for autonomous robots. We demonstrate the flexibility of the framework by learning about colours, people's names, usable objects and simple human activities, integrating state-of-the-art object segmentation, pose estimation and activity analysis, along with a number of sensory input encodings, into a continual learning framework. Natural language is grounded to the learned concepts, enabling the robot to communicate in a human-understandable way. We show, using a challenging real-world dataset of human activities as perceived by a mobile robot, that our framework is able to extract useful concepts, ground natural language descriptions to them, and, as a proof-of-concept, generate simple sentences from templates to describe people and the activities they are engaged in.

Many state-of-the-art reinforcement learning (RL) algorithms typically assume that the environment is an ergodic Markov Decision Process (MDP). In contrast, the field of universal reinforcement learning (URL) is concerned with algorithms that make as few assumptions as possible about the environment. The universal Bayesian agent AIXI and a family of related URL algorithms have been developed in this setting. While numerous theoretical optimality results have been proven for these agents, there has been no empirical investigation of their behavior to date. We present a short and accessible survey of these URL algorithms under a unified notation and framework, along with results of some experiments that qualitatively illustrate some properties of the resulting policies, and their relative performance on partially-observable gridworld environments. We also present an open-source reference implementation of the algorithms which we hope will facilitate further understanding of, and experimentation with, these ideas.

A key problem in crowdsourcing is the aggregation of judgments of proportions. For example, workers might be presented with a news article or an image, and be asked to identify the proportion of each topic, sentiment, object, or colour present in it. These varying judgments then need to be aggregated to form a consensus view of the document’s or image’s contents. Often, however, these judgments are skewed by workers who provide judgments randomly. Such spammers make acquiring judgments more expensive and degrade the accuracy of the aggregation. For such cases, we provide a new Bayesian framework for aggregating these responses (expressed in the form of categorical distributions) that for the first time accounts for spammers. We elicit 796 judgments about proportions of objects and colours in images. Experimental results show that, when 60% of the workers are spammers, our approach achieves aggregation accuracy comparable to that of other state-of-the-art approaches when there are no spammers.

In the context of hierarchical reinforcement learning, the idea of hierarchies of abstract machines (HAMs) is to write a partial policy as a set of hierarchical finite state machines with unspecified choice states, and use reinforcement learning to learn an optimal completion of this partial policy. Given a HAM with potentially deep hierarchical structure, there often exist many internal transitions where a machine calls another machine with the environment state unchanged. In this paper, we propose a new hierarchical reinforcement learning algorithm that discovers such internal transitions automatically, and short-circuits them recursively in the computation of Q values. The resulting HAMQ-INT algorithm outperforms the state of the art significantly on the benchmark Taxi domain and a much more complex RoboCup Keepaway domain.

Pattern mining is an important task in AI for eliciting hypotheses from the data. When it comes to spatial data, the geo-coordinates are often considered independently as two different attributes. Consequently, rectangular patterns are searched for. Such an arbitrary form is not able to capture interesting regions in general. We thus introduce convex polygons, a good trade-off for capturing high density areas in any pattern mining task. Our contribution is threefold: (i) we formally introduce such patterns in Formal Concept Analysis (FCA), (ii) we give all the basic building blocks for mining polygons with exhaustive search and pattern sampling, and (iii) we design several algorithms that we compare experimentally.

We critically examine and point out weaknesses of the existing considerations in Boolean matrix factorization (BMF) regarding noise and the algorithms' ability to deal with noise. We argue that the current understanding is underdeveloped and that the current approaches are missing an important aspect. We provide a new, quantitative way to assess the ability of an algorithm to handle noise. Our approach is based on a common-sense definition of robustness requiring that the computed factorizations should not be affected much by varying the noise in data. We present an experimental evaluation of several existing algorithms and compare the results to the observations available in the literature. In addition to providing justification of some properties claimed in the literature without proper justification, our experiments reveal properties which were not reported as well as properties which counter certain claims made in the literature. Importantly, our approach reveals a line separating robust-to-noise from sensitive-to-noise algorithms, which has not been revealed by the previous approaches.

In our research, we consider transfer learning scenarios where a target learner does not have access to the source data, but instead to hypotheses or models induced from it. This is called the Hypothesis Transfer Learning (HTL) problem. Previous approaches concentrated on transferring source hypotheses as a whole. We introduce a novel method for selectively transferring elements from previous hypotheses learned with Support Vector Machines. The representation of an SVM hypothesis as a set of support vectors allows us to treat this information as privileged to aid learning during a new task. Given a possibly large number of source hypotheses, our approach selects the source support vectors that most closely resemble the target data, and transfers their learned coefficients as constraints on the coefficients to be learned. This strategy increases the importance of relevant target data points based on their similarity to source support vectors, while learning from the target data. Our method shows substantial improvements in the convergence rate on three classification datasets of varying sizes, decreasing the number of iterations by up to 56% on average compared to learning with no transfer and up to 92% compared to regular HTL, while maintaining similar accuracy levels.

The popularity of wearable cameras in life logging, law enforcement, assistive vision and other similar applications is leading to an explosion in the generation of egocentric video content. First person action recognition is an important aspect of automatic analysis of such videos. Annotating such videos is hard, not only because of obvious scalability constraints, but also because of privacy issues often associated with egocentric videos. This motivates the use of unsupervised methods for egocentric video analysis. In this work, we propose a robust and generic unsupervised approach for first person action clustering. Unlike contemporary approaches, our technique is neither limited to any particular class of actions nor requires priors such as pre-training, fine-tuning, etc. We learn time sequenced visual and flow features from an array of weak feature extractors based on convolutional and LSTM autoencoder networks. We demonstrate that clustering of such features leads to the discovery of semantically meaningful actions present in the video. We validate our approach on four disparate public egocentric action datasets amounting to approximately 50 hours of videos. We show that our approach surpasses the supervised state-of-the-art accuracies without using the action labels.

We propose a general approach to modeling semi-supervised learning (SSL) algorithms. Specifically, we present a declarative language for modeling both traditional supervised classification tasks and many SSL heuristics, including both well-known heuristics such as co-training and novel domain-specific heuristics. In addition to representing individual SSL heuristics, we show that multiple heuristics can be automatically combined using Bayesian optimization methods. We experiment with two classes of tasks, link-based text classification and relation extraction. We show modest improvements on well-studied link-based classification benchmarks, and state-of-the-art results on relation-extraction tasks for two realistic domains.

Human decision makers in many domains can make use of predictions made by machine learning models in their decision making process, but the usability of these predictions is limited if the human is unable to justify his or her trust in the prediction. We propose a novel approach to producing justifications that is geared towards users without machine learning expertise, focusing on domain knowledge and on human reasoning, and utilizing natural language generation. Through a task-based experiment, we show that our approach significantly helps humans to correctly decide whether or not predictions are accurate, and significantly increases their satisfaction with the justification.

We consider a novel formulation of the multi-armed bandit model, which we call the contextual bandit with restricted context, where only a limited number of features can be accessed by the learner at every iteration. This novel formulation is motivated by different online problems arising in clinical trials, recommender systems and attention modeling. Herein, we adapt the standard multi-armed bandit algorithm known as Thompson Sampling to take advantage of our restricted context setting, and propose two novel algorithms, called the Thompson Sampling with Restricted Context (TSRC) and the Windows Thompson Sampling with Restricted Context (WTSRC), for handling stationary and nonstationary environments, respectively. Our empirical results demonstrate advantages of the proposed approaches on several real-life datasets.
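
For readers unfamiliar with the base algorithm, the following is a minimal sketch of standard Beta-Bernoulli Thompson Sampling for a context-free bandit. It is not TSRC or WTSRC: the restricted-context feature selection and the windowing for nonstationarity are omitted. The arm probabilities, the uniform Beta(1, 1) prior, and the function name are assumptions chosen for illustration.

```python
import random

def thompson_sampling(true_probs, rounds, seed=0):
    """Run Beta-Bernoulli Thompson Sampling and return per-arm pull counts.

    true_probs are the (hypothetical) success probabilities of each arm,
    unknown to the learner and used only to simulate rewards.
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(rounds):
        # Sample a plausible mean for each arm from its Beta posterior
        # (uniform Beta(1, 1) prior), then play the arm with the largest sample.
        samples = [rng.betavariate(successes[a] + 1, failures[a] + 1)
                   for a in range(n_arms)]
        arm = max(range(n_arms), key=samples.__getitem__)
        reward = 1 if rng.random() < true_probs[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_sampling([0.1, 0.5, 0.9], rounds=2000)
```

Because arms are chosen by sampling from the posterior rather than by its mean, poorly explored arms still get played occasionally, while most pulls concentrate on the arm that looks best.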

Dealing with sparse, long-tailed datasets, and cold-start problems is always a challenge for recommender systems. These issues can partly be dealt with by making predictions not in isolation, but by leveraging information from related events; such information could include signals from social relationships or from the sequence of recent activities. Both types of additional information can be used to improve the performance of state-of-the-art matrix factorization-based techniques. In this paper, we propose new methods to combine both social and sequential information simultaneously, in order to further improve recommendation performance. We show these techniques to be particularly effective when dealing with sparsity and cold-start issues in several large, real-world datasets.

Recurrent neural networks (RNNs) have enjoyed great success in speech recognition, natural language processing, etc. Many variants of RNNs have been proposed, including vanilla RNNs, LSTMs, and GRUs. However, current architectures are not particularly adept at dealing with tasks involving multi-faceted contents. In this work, we solve this problem by proposing Multiple-Weight RNNs and LSTMs, which rely on multiple weight matrices in an attempt to mimic the human ability of switching between contexts. We present a framework for adapting RNN-based models and analyze the properties of this approach. Our detailed experimental results show that our model outperforms previous work across a range of different tasks and datasets.

Episodic memory enables a cognitive system to improve its performance by reflecting upon past events. In this paper, we propose a computational model called STEM for encoding and recall of episodic events together with the associated contextual information in real time. Based on a class of self-organizing neural networks, STEM is designed to learn memory chunks or cognitive nodes, each encoding a set of co-occurring multi-modal activity patterns across multiple pattern channels. We present algorithms for recall of events based on partial and inexact input patterns. Our empirical results based on a public domain data set show that STEM displays a high level of efficiency and robustness in encoding and retrieval with both partial and noisy search cues when compared with a state-of-the-art associative memory model.

Large-scale kernel approximation is an important problem in machine learning research. Approaches using random Fourier features have become increasingly popular \cite{Rahimi_NIPS_07}, where kernel approximation is treated as empirical mean estimation via Monte Carlo (MC) or Quasi-Monte Carlo (QMC) integration \cite{Yang_ICML_14}. A limitation of the current approaches is that all the features receive an equal weight summing to 1. In this paper, we propose a novel shrinkage estimator based on the "Stein effect", which provides a data-driven weighting strategy for random features and enjoys theoretical justifications in terms of lowering the empirical risk. We further present an efficient randomized algorithm for large-scale applications of the proposed method. Our empirical results on six benchmark data sets demonstrate the advantageous performance of this approach over representative baselines in both kernel approximation and supervised learning tasks.
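
The equal-weight baseline mentioned above can be sketched as follows: a plain Monte Carlo random Fourier feature approximation of a Gaussian kernel, where each of the D features contributes with weight 1/D. The kernel parameterization exp(-gamma * ||x - y||^2), the function names, and the test points are assumptions for illustration; the shrinkage estimator proposed in the abstract is not implemented here.

```python
import math
import random

def sample_features(dim, n_features, gamma, seed=0):
    """Draw (w, b) pairs for the Gaussian kernel exp(-gamma * ||x - y||^2).

    Frequencies w follow N(0, 2 * gamma * I); phases b are uniform on [0, 2*pi).
    """
    rng = random.Random(seed)
    sigma = math.sqrt(2.0 * gamma)
    return [([rng.gauss(0.0, sigma) for _ in range(dim)],
             rng.uniform(0.0, 2.0 * math.pi))
            for _ in range(n_features)]

def feature_map(x, params):
    """Unweighted random features sqrt(2) * cos(w . x + b)."""
    return [math.sqrt(2.0) * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for w, b in params]

def approx_kernel(x, y, params):
    """Monte Carlo estimate in which every feature gets the same weight 1/D."""
    zx, zy = feature_map(x, params), feature_map(y, params)
    weight = 1.0 / len(params)  # equal weights summing to 1
    return sum(weight * a * b for a, b in zip(zx, zy))

def exact_kernel(x, y, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

x, y, gamma = (0.2, -0.1, 0.4), (0.5, 0.3, -0.2), 0.5
params = sample_features(dim=3, n_features=4000, gamma=gamma)
estimate = approx_kernel(x, y, params)
exact = exact_kernel(x, y, gamma)
```

A data-driven weighting scheme would replace the constant `weight` with per-feature weights; how those are chosen is the subject of the paper and is not reproduced here.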

Semantic Segmentation (SS) partitions an image into several coherent semantically meaningful parts, and classifies each part into one of the pre-determined classes. In this paper, we argue that existing SS methods cannot be reliably applied to autonomous driving systems, as they ignore the different importance levels of distinct classes for safe driving. For example, pedestrians in the scene are much more important than sky when driving a car, so their segmentations should be as accurate as possible. To incorporate the importance information possessed by various object classes, this paper designs an "Importance-Aware Loss" (IAL) that specifically emphasizes the critical objects for autonomous driving. IAL operates under a hierarchical structure, and the classes with different importance are located in different levels so that they are assigned distinct weights. Furthermore, we derive the forward and backward propagation rules for IAL and apply them to deep neural networks for realizing SS in intelligent driving systems. The experiments on the CamVid and Cityscapes datasets reveal that by employing the proposed loss function, existing deep learning models including FCN, SegNet and ENet are able to consistently obtain improved segmentation results on the pre-defined important classes for safe driving.
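
The general idea of giving important classes a larger say in the loss can be sketched with an ordinary class-weighted cross-entropy. This is a deliberately simpler stand-in, not the hierarchical IAL described above; the class names, the weights, and the predicted probabilities are hypothetical.

```python
import math

def weighted_cross_entropy(probs, labels, class_weights):
    """Weighted mean of -log p(true class), with one weight per class.

    probs[i][c] is the predicted probability of class c for pixel i.
    """
    total = 0.0
    weight_sum = 0.0
    for p, y in zip(probs, labels):
        w = class_weights[y]
        total += -w * math.log(p[y])
        weight_sum += w
    return total / weight_sum

# Hypothetical two-class example: class 0 = sky, class 1 = pedestrian.
probs = [[0.9, 0.1], [0.4, 0.6]]
labels = [0, 1]
uniform = weighted_cross_entropy(probs, labels, {0: 1.0, 1: 1.0})
importance = weighted_cross_entropy(probs, labels, {0: 1.0, 1: 5.0})
```

With the pedestrian class weighted five times higher, the uncertain pedestrian prediction dominates the loss, so `importance` is larger than `uniform` and gradient updates would focus on that class.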

Many recent works have demonstrated the benefits of knowledge graph embeddings in completing monolingual knowledge graphs. Inasmuch as related knowledge bases are built in several different languages, achieving cross-lingual knowledge alignment will help people in constructing a coherent knowledge base, and assist machines in dealing with different expressions of entity relationships across diverse human languages. Unfortunately, achieving this highly desirable cross-lingual alignment by human labor is very costly and error-prone. Thus, we propose MTransE, a translation-based model for multilingual knowledge graph embeddings, to provide a simple and automated solution. By encoding entities and relations of each language in a separated embedding space, MTransE provides transitions for each embedding vector to its cross-lingual counterparts in other spaces, while preserving the functionalities of monolingual embeddings. We deploy three different techniques to represent cross-lingual transitions, namely axis calibration, translation vectors, and linear transformations, and derive five variants for MTransE using different loss functions. Our models can be trained on partially aligned graphs, where just a small portion of triples are aligned with their cross-lingual counterparts. The experiments on cross-lingual entity matching and triple-wise alignment verification show promising results, with some variants consistently outperforming others on different tasks. We also explore how MTransE preserves the key properties of its monolingual counterpart.

Many spectral clustering algorithms have been proposed and successfully applied to many high-dimensional applications. However, there are still two problems that need to be solved: 1) existing methods for obtaining the final clustering assignments may deviate from the true discrete solution, and 2) most of these methods have very high computational complexity. In this paper, we propose a Scalable Normalized Cut method for clustering large-scale data. In the new method, an efficient procedure is used to construct a small representation matrix, and clustering is then performed on the representation matrix. In the clustering process, an improved spectral rotation method is proposed to obtain the final clustering assignments. A series of experiments was conducted on 14 benchmark data sets, and the experimental results show the superior performance of the new method.

With the rapid increase of complex and high-dimensional sparse data, demands for new methods to select features by exploiting both labeled and unlabeled data have increased. Least square regression based feature selection methods usually learn a projection matrix and evaluate the importance of features using the projection matrix, an approach that lacks a theoretical explanation. Moreover, these methods cannot find a solution of the projection matrix that is both global and sparse. In this paper, we propose a novel semi-supervised feature selection method which can learn a solution of the projection matrix that is both global and sparse. The new method extends the least square regression model by rescaling the regression coefficients with a set of scale factors, which are used for ranking the features. It is shown that the new model can learn a global and sparse solution. Moreover, the introduction of scale factors provides a theoretical explanation for why we can use the projection matrix to rank the features. A simple yet effective algorithm with proved convergence is proposed to optimize the new model. Experimental results on eight real-life data sets show the superiority of the method.

Learning rich and diverse representations is critical for the performance of deep convolutional neural networks (CNNs). In this paper, we consider how to use privileged information to promote the inherent diversity of a single CNN model such that the model can learn better representations and offer stronger generalization ability. To this end, we propose a novel group orthogonal convolutional neural network (GoCNN) that learns untangled representations within each layer by exploiting provided privileged information and enhances representation diversity effectively. We take image classification as an example where image segmentation annotations are used as privileged information during the training process. Experiments on two benchmark datasets (ImageNet and PASCAL VOC) clearly demonstrate the strong generalization ability of our proposed GoCNN model. On the ImageNet dataset, GoCNN improves the performance of the state-of-the-art ResNet-152 model by an absolute 1.2% while using privileged information for only 10% of the training images, confirming the effectiveness of GoCNN in utilizing available privileged knowledge to train better CNNs.

The Frank-Wolfe (FW) algorithm has been widely used in solving nuclear norm constrained problems, since it does not require projections. However, FW often yields high rank intermediate iterates, which can be very expensive in time and space costs for large problems. To address this issue, we propose a rank-drop method for nuclear norm constrained problems. The goal is to generate descent steps that lead to rank decreases, maintaining low-rank solutions throughout the algorithm. Moreover, the optimization problems are constrained to ensure that the rank-drop step is also feasible and can be readily incorporated into a projection-free minimization method, e.g., Frank-Wolfe. We demonstrate that by incorporating rank-drop steps into the Frank-Wolfe algorithm, the rank of the solution is greatly reduced compared to the original Frank-Wolfe or its common variants.
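
To make the rank growth concrete, here is a minimal sketch of plain Frank-Wolfe on a nuclear norm ball, without any rank-drop step. The objective 0.5 * ||X - M||_F^2, the small target matrix `M`, the radius `tau`, the exact line search, and the power-iteration helper are all assumptions chosen for illustration; the rank-drop method itself is not implemented.

```python
import math
import random

def matvec(A, v):
    """Return A @ v for a list-of-rows matrix A."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def rmatvec(A, u):
    """Return A^T @ u."""
    return [sum(A[i][j] * u[i] for i in range(len(A))) for j in range(len(A[0]))]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0.0 else list(v)

def top_singular_pair(A, iters=100, seed=0):
    """Approximate the leading singular vectors (u, v) of A by power iteration."""
    rng = random.Random(seed)
    v = normalize([rng.gauss(0.0, 1.0) for _ in A[0]])
    for _ in range(iters):
        u = normalize(matvec(A, v))
        v = normalize(rmatvec(A, u))
    return normalize(matvec(A, v)), v

def frank_wolfe(M, tau, steps=30):
    """Minimize 0.5 * ||X - M||_F^2 subject to ||X||_* <= tau, starting at X = 0."""
    rows, cols = len(M), len(M[0])
    X = [[0.0] * cols for _ in range(rows)]
    history = []
    for _ in range(steps):
        # Gradient of the objective at X.
        G = [[X[i][j] - M[i][j] for j in range(cols)] for i in range(rows)]
        history.append(0.5 * sum(g * g for row in G for g in row))
        # Linear minimization oracle: the rank-one atom S = -tau * u v^T.
        u, v = top_singular_pair(G)
        D = [[-tau * u[i] * v[j] - X[i][j] for j in range(cols)]
             for i in range(rows)]  # direction S - X
        denom = sum(d * d for row in D for d in row)
        if denom == 0.0:
            break
        # Exact line search for the quadratic objective, clipped to [0, 1].
        inner = sum(G[i][j] * D[i][j] for i in range(rows) for j in range(cols))
        step = min(1.0, max(0.0, -inner / denom))
        X = [[X[i][j] + step * D[i][j] for j in range(cols)] for i in range(rows)]
    return X, history

M = [[3.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
X, history = frank_wolfe(M, tau=6.0)
```

No projection onto the nuclear norm ball is ever computed; each iteration only needs one leading singular pair. The cost is that every step mixes in a new rank-one atom, so the iterate's rank can keep growing, which is the behavior a rank-drop step is meant to counteract.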

Detecting buffer overruns from source code is one of the most common and yet challenging tasks in program analysis. Current approaches based on rigid rules and handcrafted features are limited in terms of flexible applicability and robustness due to the diverse bug patterns and characteristics existing in sophisticated real-world software programs. In this paper, we propose a novel, data-driven approach that is completely end-to-end without requiring any hand-crafted features, and is thus free from any programming-language-specific structural limitations. In particular, our approach leverages a recently proposed neural network model called memory networks, which have shown state-of-the-art performance mainly in question-answering tasks. Our experimental results using source code samples demonstrate that our proposed model is capable of accurately detecting different types of buffer overruns. We also present in-depth analyses of how a memory network can learn to understand the semantics of programming languages solely from raw source code, such as tracing variables of interest, identifying numerical values, and performing their quantitative comparisons.

In many applications, one can define a large set of features to support the classification task at hand. At test time, however, these become prohibitively expensive to evaluate, and only a small subset of features is used, often selected for their information-theoretic value. For threshold-based, Naive Bayes classifiers, recent work has suggested selecting features that maximize the expected robustness of the classifier, that is, the expected probability it maintains its decision after seeing more features. We propose the first algorithm to compute this expected same-decision probability for general Bayesian network classifiers, based on compiling the network into a tractable circuit representation. Moreover, we develop a search algorithm for optimal feature selection that utilizes efficient incremental circuit modifications. Experiments on Naive Bayes, as well as more general networks, show the efficacy and distinct behavior of this decision-making approach.
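
The same-decision probability (SDP) that underlies this robustness criterion can be illustrated by brute-force enumeration on a tiny Naive Bayes classifier. This sketch assumes binary features and a binary class, and it enumerates all hidden-feature assignments, which is exponential in their number; it is not the circuit-compilation algorithm proposed in the abstract. The prior, likelihoods, and threshold are hypothetical.

```python
import itertools

def joint(prior, likelihoods, c, assignment):
    """P(C = c, features = assignment) under Naive Bayes.

    likelihoods[i][c] is P(F_i = 1 | C = c); assignment maps feature index -> 0/1.
    """
    p = prior if c == 1 else 1.0 - prior
    for i, value in assignment.items():
        p1 = likelihoods[i][c]
        p *= p1 if value == 1 else 1.0 - p1
    return p

def mass(prior, likelihoods, assignment):
    """Marginal probability of a (partial) feature assignment."""
    return sum(joint(prior, likelihoods, c, assignment) for c in (0, 1))

def posterior(prior, likelihoods, assignment):
    """P(C = 1 | assignment)."""
    return joint(prior, likelihoods, 1, assignment) / mass(prior, likelihoods, assignment)

def same_decision_probability(prior, likelihoods, evidence, hidden, threshold):
    """Probability that observing the hidden features leaves the decision unchanged."""
    current = posterior(prior, likelihoods, evidence) >= threshold
    evidence_mass = mass(prior, likelihoods, evidence)
    sdp = 0.0
    for values in itertools.product([0, 1], repeat=len(hidden)):
        full = dict(evidence)
        full.update(zip(hidden, values))
        if (posterior(prior, likelihoods, full) >= threshold) == current:
            sdp += mass(prior, likelihoods, full) / evidence_mass
    return sdp

# Hypothetical classifier: three binary features, feature 0 observed as 1.
likelihoods = [(0.2, 0.8), (0.3, 0.7), (0.4, 0.6)]
sdp = same_decision_probability(prior=0.5, likelihoods=likelihoods,
                                evidence={0: 1}, hidden=[1, 2], threshold=0.6)
```

Here the current posterior is 0.8, and only the outcome where both hidden features are 0 (conditional probability 0.18) would drop it below the threshold, so the SDP is 0.82.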