ECCV.2018 - Accept

Total: 773

#1 Modeling Varying Camera-IMU Time Offset in Optimization-Based Visual-Inertial Odometry [PDF] [Copy] [Kimi]

Authors: Yonggen Ling ; Linchao Bao ; Zequn Jie ; Fengming Zhu ; Ziyang Li ; Shanmin Tang ; Yongsheng Liu ; Wei Liu ; Tong Zhang

Combining cameras and inertial measurement units (IMUs) has been proven effective in motion tracking, as these two sensing modalities offer complementary characteristics that are suitable for fusion. While most works focus on global-shutter cameras and synchronized sensor measurements, consumer-grade devices are mostly equipped with rolling-shutter cameras and suffer from imperfect sensor synchronization. In this work, we propose a nonlinear optimization-based monocular visual inertial odometry (VIO) with varying camera-IMU time offset modeled as an unknown variable. Our approach is able to handle the rolling-shutter effects and imperfect sensor synchronization in a unified way. Additionally, we introduce an efficient algorithm based on dynamic programming and red-black tree to speed up IMU integration over variable-length time intervals during the optimization. An uncertainty-aware initialization is also presented to launch the VIO robustly. Comparisons with state-of-the-art methods on the Euroc dataset and mobile phone data are shown to validate the effectiveness of our approach.

#2 Pose Partition Networks for Multi-Person Pose Estimation [PDF] [Copy] [Kimi]

Authors: Xuecheng Nie ; Jiashi Feng ; Junliang Xing ; Shuicheng Yan

This paper proposes a novel Pose Partition Network (PPN) to address the challenging multi-person pose estimation problem. The proposed PPN is favorably featured by low complexity and high accuracy of joint detection and partition. In particular, PPN performs dense regressions from global joint candidates within a specific embedding space, which is parameterized by centroids of persons, to efficiently generate robust person detection and joint partitions. Then, PPN infers joint configurations of person poses through conducting graph partition for each person detection locally, utilizing reliable global affinity cues. In this way, PPN reduces computation complexity and improves multi-person pose estimation significantly. We implement PPN with the Hourglass architecture as the backbone network to simultaneously learn joint detector and dense regressor. Extensive experiments on benchmarks MPII Human Pose Multi-Person, extended PASCAL-Person-Part, and WAF, show the efficiency of PPN with new state-of-the-art performance.

#3 Consensus-Driven Propagation in Massive Unlabeled Data for Face Recognition [PDF] [Copy] [Kimi]

Authors: Xiaohang Zhan ; Ziwei Liu ; Junjie Yan ; Dahua Lin ; Chen Change Loy

Face recognition has witnessed great progresses in recent years, mainly attributed to the high-capacity model designed and the abundant labeled data collected. However, it becomes more and more prohibitive to scale up the current million-level identity annotations. In this work, we show that unlabeled face data can be as effective as the labeled ones. Here, we consider a setting closely mimicking the real-world scenario, where the unlabeled data are collected from unconstrained environment and their identities are exclusive from the labeled ones. Our main insight is that although the class information is not available, we can still faithfully approximate these semantic relationship by constructing a relational graph in a bottom-up manner. We propose Consensus-Driven Propagation (CDP) to tackle this challenging problem with two well-designed modules, the "committee" and the "mediator", which select positive face pairs robustly by carefully aggregating multi-view information. Extensive experiments validate the effectiveness of both modules to discard outliers and mine hard positives. With CDP, we achieve a compelling 78.18% on MegaFace identification challenge by using only 9% of the labels, comparing to 61.78% when no unlabeled data are used and 78.52% when all the labels are employed.

#4 Open-World Stereo Video Matching with Deep RNN [PDF] [Copy] [Kimi]

Authors: Yiran Zhong ; Hongdong Li ; Yuchao Dai

In this paper, we propose a novel deep Recurrent Neural network (RNN) that takes a continuous (possibly previously unseen) stereo video as input, and directly predict a depth-map without of any pre-training process. The quality and accuracy of the obtained depth-map improves over time as new stereo frames being fed in. Thanks to the recurrent nature (based on two convolutional LSTM blocks) the network is able to memorize and learn from its past experience and gradually adapts its parameters (interconnection weights) to achieve better stereo matching result on the current stereo input. In this sense, our new method is an unsupervised network which does not rely on any labeled ground-truth depthmap, and it is able to work in a previously unseen or unfamiliar environments, suggesting a remarkable generaliability: it is applicable in an {em open-world} setting, by adapting its network parameters to generic stereo video inputs to be robust to changes in scene content, statistics, and lighting and season etc. Through extensive experiments, we demonstrate the method is able to seamlessly adapt between different scenarios. Also importantly, in terms of absolute stereo matching performance, it even outperforms the state of the art stereo algorithms on several standard benchmark datasets such as KITTI and Middlebury stereo.

#5 Deep Cross-Modal Projection Learning for Image-Text Matching [PDF] [Copy] [Kimi]

Authors: Ying Zhang ; Huchuan Lu

The key point of image-text matching is how to accurately measure the similarity between visual and textual inputs. Despite the great progress of associating the deep cross-modal embeddings with the bi-directional ranking loss, developing the strategies for mining useful triplets and selecting appropriate margins remains a challenge in real applications. In this paper, we propose a cross-modal projection matching (CMPM) loss and a cross-modal projection classification (CMPC) loss for learning discriminative image-text embeddings. The CMPM loss minimizes the KL divergence between the projection compatibility distributions and the normalized matching distributions defined with all the positive and negative samples in a mini-batch. The CMPC loss attempts to categorize the vector projection of representations from one modality onto another with the improved norm-softmax loss, for further enhancing the feature compactness of each class. Extensive analysis and experiments on multiple datasets demonstrate the superiority of the proposed approach.

#6 Gray-box Adversarial Training [PDF] [Copy] [Kimi]

Authors: B. S. Vivek ; Konda Reddy Mopuri ; R. Venkatesh Babu

Adversarial samples are perturbed inputs crafted to mislead the machine learning systems. A training mechanism, called adversarial training, which presents adversarial samples along with clean samples has been introduced to learn robust models. In order to scale adversarial training for large datasets, these perturbations can only be crafted using fast and simple methods (e.g., using gradient ascent). However, it is shown that adversarial training converges to a degenerate minimum, where the model appears to be robust by generating weaker adversaries. As a result, the models are vulnerable to simple black-box attacks. In this paper we, (i) demonstrate the drawbacks of existing evaluation policy, (ii) Introduce novel variants of white-box and black-box attacks, dubbed "gray-box adversarial attacks" based on which we propose novel evaluation method to assess the robustness of the learned models, and (iii) propose a novel variant of adversarial training, named "Graybox Adversarial Training" that uses intermediate versions of the models to seed the adversaries. Experimental evaluation demonstrates that the models trained using our method exhibit better robustness compared to both undefended and adversarially trained models.

#7 Multi-Class Model Fitting by Energy Minimization and Mode-Seeking [PDF] [Copy] [Kimi]

Authors: Daniel Barath ; Jiri Matas

We propose a general formulation, called Multi-X, for multi-class multi-instance model fitting - the problem of interpreting the input data as a mixture of noisy observations originating from multiple instances of multiple classes. We extend the commonly used alpha-expansion-based technique with a new move in the label space. The move replaces a set of labels with the corresponding density mode in the model parameter domain, thus achieving fast and robust optimization. Key optimization parameters like the bandwidth of the mode seeking are set automatically within the algorithm. Considering that a group of outliers may form spatially coherent structures in the data, we propose a cross-validation-based technique removing statistically insignificant instances. Multi-X outperforms significantly the state-of-the-art on publicly available datasets for diverse problems: multiple plane and rigid motion detection; motion segmentation; simultaneous plane and cylinder fitting; circle and line fitting.

#8 MRF Optimization with Separable Convex Prior on Partially Ordered Labels [PDF] [Copy] [Kimi]

Authors: Csaba Domokos ; Frank R. Schmidt ; Daniel Cremers

Solving a multi-labeling problem with a convex penalty can be achieved in polynomial time if the label set is totally ordered. In this paper we propose a generalization to partially ordered sets. To this end, we assume that the label set is the Cartesian product of totally ordered sets and the convex prior is separable. For this setting we introduce a general combinatorial optimization framework that provides an approximate solution. More specifically, we first construct a graph whose minimal cut provides a lower bound to our energy. The result of this relaxation is then used to get a feasible solution via classical move-making cuts. To speed up the optimization, we propose an efficient coarse-to-fine approach over the label space. We demonstrate the proposed framework through extensive experiments for optical flow estimation.

#9 VQA-E: Explaining, Elaborating, and Enhancing Your Answers for Visual Questions [PDF] [Copy] [Kimi]

Authors: Qing Li ; Qingyi Tao ; Shafiq Joty ; Jianfei Cai ; Jiebo Luo

Most existing works in visual question answering (VQA) are dedicated to improving the accuracy of predicted answers, while disregarding the explanations. We argue that the explanation for an answer is of the same or even more importance compared with the answer itself, since it makes the question and answering process more understandable and traceable. To this end, we propose a new task of VQA-E (VQA with Explanation), where the computational models are required to generate an explanation with the predicted answer. We first construct a new dataset, and then frame the VQA-E problem in a multi-task learning architecture. Our VQA-E dataset is automatically derived from the VQA v2 dataset by intelligently exploiting the available captions. We have conducted a user study to validate the quality of explanations synthesized by our method. We quantitatively show that the additional supervision from explanations can not only produce insightful textual sentences to justify the answers, but also improve the performance of answer prediction. Our model outperforms the state-of-the-art methods by a clear margin on the VQA v2 dataset.

#10 Context Refinement for Object Detection [PDF] [Copy] [Kimi]

Authors: Zhe Chen ; Shaoli Huang ; Dacheng Tao

Current two-stage object detectors, which consists of a region proposal stage and a refinement stage, may produce unreliable results due to ill-localized proposed regions. To address this problem, we propose a context refinement algorithm that explores rich contextual information to better refine each proposed region. In particular, we first identify neighboring regions that may contain useful contexts and then perform refinement based on the extracted and unified contextual information. In practice, our method effectively improves the quality of the final detection results as well as region proposals. Empirical studies show that context refinement yields substantial and consistent improvements over different baseline detectors. Moreover, the proposed algorithm brings around 3% performance gain on PASCAL VOC benchmark and around 6% gain on MS COCO benchmark respectively.

#11 Depth Estimation via Affinity Learned with Convolutional Spatial Propagation Network [PDF] [Copy] [Kimi]

Authors: Xinjing Cheng ; Peng Wang ; Ruigang Yang

Depth estimation from a single image is a fundamental problem in computer vision. In this paper, we propose a simple yet effective convolutional spatial propagation network (CSPN) to learn the affinity matrix for depth prediction. Specifically, we adopt an efficient linear propagation model, where the propagation is performed with a manner of recurrent convolutional operation, and the affinity among neighboring pixels is learned through a deep convolutional neural network (CNN). We apply the designed CSPN to two depth estimation tasks given a single image: (1) To refine the depth output from state-of-the-art (SOTA) existing methods; and (2) to convert sparse depth samples to a dense depth map by embedding the depth samples within the propagation procedure. The second task is inspired by the availability of LIDARs that provides sparse but accurate depth measurements. We experimented the proposed CSPN over two popular benchmarks for depth estimation, ie NYU v2 and KITTI, where we show that our proposed approach improves in not only quality (e.g., 30% more reduction in depth error), but also speed (e.g., 2 to 5$ imes$ faster) than prior SOTA methods.

#12 Zero-Annotation Object Detection with Web Knowledge Transfer [PDF] [Copy] [Kimi]

Authors: Qingyi Tao ; Hao Yang ; Jianfei Cai

Object detection is one of the major problems in computer vision, and has been extensively studied. Most of the existing detection works rely on labor-intensive supervision, such as ground truth bounding boxes of objects or at least image-level annotations. On the contrary, we propose an object detection method that does not require any form of human annotation on target tasks, by exploiting freely available web images. In order to facilitate effective knowledge transfer from web images, we introduce a multi-instance multi-label domain adaption learning framework with two key innovations. First of all, we propose an instance-level adversarial domain adaptation network with attention on foreground objects to transfer the object appearances from web domain to target domain. Second, to preserve the class-specific semantic structure of transferred object features, we propose a simultaneous transfer mechanism to transfer the supervision across domains through pseudo strong label generation. With our end-to-end framework that simultaneously learns a weakly supervised detector and transfers knowledge across domains, we achieved significant improvements over baseline methods on the benchmark datasets.

#13 Fast Light Field Reconstruction With Deep Coarse-To-Fine Modeling of Spatial-Angular Clues [PDF] [Copy] [Kimi]

Authors: Henry Wing Fung Yeung ; Junhui Hou ; Jie Chen ; Yuk Ying Chung ; Xiaoming Chen

Densely-sampled light fields (LFs) are beneficial to many applications such as depth inference and post-capture refocusing. However, it is costly and challenging to capture them. In this paper, we propose a learning based algorithm to reconstruct a densely-sampled LF fast and accurately from a sparsely-sampled LF in one forward pass. Our method uses computationally efficient convolutions to deeply characterize the high dimensional spatial-angular clues in a coarse-tofine manner. Specifically, our end-to-end model first synthesizes a set of intermediate novel sub-aperture images (SAIs) by exploring the coarse characteristics of the sparsely-sampled LF input with spatial-angular alternating convolutions. Then, the synthesized intermediate novel SAIs are efficiently refined by further recovering the fine relations from all SAIs via guided residual learning and stride-2 4-D convolutions. Experimental results on extensive real-world and synthetic LF images show that our model can provide more than 3 dB advantage in reconstruction quality in average than the state-of-the-art methods while being computationally faster by a factor of 30. Besides, more accurate depth can be inferred from the reconstructed densely-sampled LFs by our method.

#14 AGIL: Learning Attention from Human for Visuomotor Tasks [PDF] [Copy] [Kimi]

Authors: Ruohan Zhang ; Zhuode Liu ; Luxin Zhang ; Jake A. Whritner ; Karl S. Muller ; Mary M. Hayhoe ; Dana H. Ballard

When intelligent agents learn visuomotor behaviors from human demonstrations, they may benefit from knowing where the human is allocating visual attention, which can be inferred from their gaze. A wealth of information regarding intelligent decision making is conveyed by human gaze allocation; hence, exploiting such information has the potential to improve the agents' performance. With this motivation, we propose the AGIL (Attention Guided Imitation Learning) framework. We collect high-quality human action and gaze data while playing Atari games in a carefully controlled experimental setting. Using these data, we first train a deep neural network that can predict human gaze positions and visual attention with high accuracy (the gaze network) and then train another network to predict human actions (the policy network). Incorporating the learned attention model from the gaze network into the policy network significantly improves the action prediction accuracy and task performance.

#15 Physical Primitive Decomposition [PDF1] [Copy] [Kimi]

Authors: Zhijian Liu ; William T. Freeman ; Joshua B. Tenenbaum ; Jiajun Wu

Objects are made of parts, each with distinct geometry, physics, functionality, and affordances. Developing such a distributed, physical, interpretable representation of objects will facilitate intelligent agents to better explore and interact with the world. In this paper, we study physical primitive decomposition---understanding an object through its components, each with physical and geometric attributes. As annotated data for object parts and physics are rare, we propose a novel formulation that learns physical primitives by explaining both an object's appearance and its behaviors in physical events. Our model performs well on block towers and tools in both synthetic and real scenarios; we also demonstrate that visual and physical observations often provide complementary signals. We further present ablation and behavioral studies to better understand our model and contrast it with human performance.

#16 Deep Expander Networks: Efficient Deep Networks from Graph Theory [PDF] [Copy] [Kimi]

Authors: Ameya Prabhu ; Girish Varma ; Anoop Namboodiri

Efficient CNN designs like ResNets and DenseNet were proposed to improve accuracy vs efficiency trade-offs. They essentially increased the connectivity, allowing efficient information flow across layers. Inspired by these techniques, we propose to model connections between filters of a CNN using graphs which are simultaneously sparse and well connected. Sparsity results in efficiency while well connectedness can preserve the expressive power of the CNNs. We use a well-studied class of graphs from theoretical computer science that satisfies these properties known as Expander graphs. Expander graphs are used to model connections between filters in CNNs to design networks called X-Nets. We present two guarantees on the connectivity of X-Nets: Each node influences every node in a layer in logarithmic steps, and the number of paths between two sets of nodes is proportional to the product of their sizes. We also propose efficient training and inference algorithms, making it possible to train deeper and wider X-Nets effectively. Expander based models give a 4% improvement in accuracy on MobileNet over grouped convolutions, a popular technique, which has the same sparsity but worse connectivity. X-Nets give better performance trade-offs than the original ResNet and DenseNet-BC architectures. We achieve model sizes comparable to state-of-the-art pruning techniques using our simple architecture design, without any pruning. We hope that this work motivates other approaches to utilize results from graph theory to develop efficient network architectures.

#17 Real-Time MDNet [PDF] [Copy] [Kimi]

Authors: Ilchae Jung ; Jeany Son ; Mooyeol Baek ; Bohyung Han

We present a fast and accurate visual tracking algorithm based on the multi-domain convolutional neural network (MDNet). The proposed approach accelerates feature extraction procedure and learns more discriminative models for instance classification; it enhances representation quality of target and background by maintaining a high resolution feature map with a large receptive field per activation. We also introduce a novel loss term to differentiate foreground instances across multiple domains and learn a more discriminative embedding of target objects with similar semantics. The proposed techniques are integrated into the pipeline of a well known CNN-based visual tracking algorithm, MDNet. We accomplish approximately 25 times speed-up with almost identical accuracy compared to MDNet. Our algorithm is evaluated in multiple popular tracking benchmark datasets including OTB2015, UAV123, and TempleColor, and outperforms the state-of-the-art real-time tracking methods consistently even without dataset-specific parameter tuning.

#18 The Mutex Watershed: Efficient, Parameter-Free Image Partitioning [PDF] [Copy] [Kimi]

Authors: Steffen Wolf ; Constantin Pape ; Alberto Bailoni ; Nasim Rahaman ; Anna Kreshuk ; Ullrich Kothe ; FredA. Hamprecht

Image partitioning, or segmentation without semantics, is the task of decomposing an image into distinct segments; or equivalently, the task of detecting closed contours in an image. Most prior work either requires seeds, one per segment; or a threshold; or formulates the task as an NP-hard signed graph partitioning problem. Here, we propose an algorithm with empirically linearithmic complexity. Unlike seeded watershed, the algorithm can accommodate not only attractive but also repulsive cues, allowing it to find a previously unspecified number of segments without the need for explicit seeds or a tunable threshold. The algorithm itself, which we dub “mutex watershed”, is closely related to a minimal spanning tree computation. It is deterministic and easy to implement. When presented with short-range attractive and long-range repulsive cues from a deep neural network, the mutex watershed gives results that currently define the state-of-the-art in the competitive ISBI 2012 EM segmentation benchmark. These results are also better than those obtained from other recently proposed clustering strategies operating on the very same network outputs.

#19 MVSNet: Depth Inference for Unstructured Multi-view Stereo [PDF] [Copy] [Kimi]

Authors: Yao Yao ; Zixin Luo ; Shiwei Li ; Tian Fang ; Long Quan

We present an end-to-end deep learning architecture for depth map inference from multi-view images. In the network, we first extract deep visual image features, and then build the 3D cost volume upon the reference camera frustum via the differentiable homography warping. Next, we apply 3D convolutions to regularize and regress the initial depth map, which is then refined with the reference image to generate the final output. Our framework flexibly adapts arbitrary N-view inputs using a variance-based cost metric that maps multiple features into one cost feature. The proposed method is demonstrated on the large-scale DTU dataset. With simple post-processing, MVSNet not only significantly outperforms previous state-of-the-arts, but also is several times faster in runtime. In the end, we also show the generalization power of MVSNet on the complex outdoor Tanks and Temples dataset, which has not been used to train the network.

#20 Audio-Visual Event Localization in Unconstrained Videos [PDF] [Copy] [Kimi]

Authors: Yapeng Tian ; Jing Shi ; Bochen Li ; Zhiyao Duan ; Chenliang Xu

In this paper, we introduce a novel problem of audio-visual event localization in unconstrained videos. We define an audio-visual event as an event that is both visible and audible in a video segment. We collect an Audio-Visual Event (AVE) dataset to systemically investigate three temporal localization tasks: supervised and weakly-supervised audio-visual event localization, and cross-modality localization. We develop an audio-guided visual attention mechanism to explore audio-visual correlations, propose a dual multimodal residual network (DMRN) to fuse information over the two modalities, and introduce an audio-visual distance learning network to handle the cross-modality localization. Our experiments support the following findings: joint modeling of auditory and visual modalities outperforms independent modeling, the learned attention can capture semantics of sounding objects, temporal alignment is important for audio-visual fusion, the proposed DMRN is effective in fusing audio-visual features, and strong correlations between the two modalities enable cross-modality localization.

#21 Attend and Rectify: a gated attention mechanism for fine-grained recovery [PDF] [Copy] [Kimi]

Authors: Pau Rodriguez ; Josep M. Gonfaus ; Guillem Cucurull ; F. XavierRoca ; Jordi Gonzalez

We propose a novel attention mechanism to enhance Convolutional Neural Networks for fine-grained recognition. It learns to attend to lower-level feature activations without requiring part annotations and uses these activations to update and rectify the output likelihood distribution. In contrast to other approaches, the proposed mechanism is modular, architecture-independent and efficient both in terms of parameters and computation required. Experiments show that networks augmented with our approach systematically improve their classification accuracy and become more robust to clutter. As a result, Wide Residual Networks augmented with our proposal surpasses the state of the art classification accuracies in CIFAR-10, the Adience gender recognition task, Stanford dogs, and UEC Food-100.

#22 PyramidBox: A Context-assisted Single Shot Face Detector [PDF] [Copy] [Kimi]

Authors: Xu Tang ; Daniel K. Du ; Zeqiang He ; Jingtuo Liu

Face detection has been well studied for many years and one of remaining challenges is to detect small, blurred and partially occluded faces in uncontrolled environment. This paper proposes a novel context-assisted single shot face detector, named emph{PyramidBox} to handle the hard face detection problem. Observing the importance of the context, we improve the utilization of contextual information in the following three aspects. First, we design a novel context anchor to supervise high-level contextual feature learning by a semi-supervised method, which we call it PyramidAnchors. Second, we propose the Low-level Feature Pyramid Network to combine adequate high-level context semantic feature and Low-level facial feature together, which also allows the PyramidBox to predict faces of all scales in a single shot. Third, we introduce a context-sensitive structure to increase the capacity of prediction network to improve the final accuracy of output. In addition, we use the method of Data-anchor-sampling to augment the training samples across different scales, which increases the diversity of training data for smaller faces. By exploiting the value of context, PyramidBox achieves superior performance among the state-of-the-art over the two common face detection benchmarks, FDDB and WIDER FACE. Our code is available in PaddlePaddle: href{}{url{}}.

#23 RT-GENE: Real-Time Eye Gaze Estimation in Natural Environments [PDF] [Copy] [Kimi]

Authors: Tobias Fischer ; Hyung Jin Chang ; Yiannis Demiris

In this work, we consider the problem of robust gaze estimation in natural environments. Large camera-to-subject distances and high variations in head pose and eye gaze angles are common in such environments. This leads to two main shortfalls in state-of-the-art methods for gaze estimation: hindered ground truth gaze annotation and diminished gaze estimation accuracy as image resolution decreases with distance. We first record a novel dataset of varied gaze and head pose images in a natural environment, addressing the issue of ground truth annotation by measuring head pose using a motion capture system and eye gaze using mobile eyetracking glasses. We apply semantic image inpainting to the area covered by the glasses to bridge the gap between training and testing images by removing the obtrusiveness of the glasses. We also present a new real-time algorithm involving appearance-based deep convolutional neural networks with increased capacity to cope with the diverse images in the new dataset. Experiments with this network architecture are conducted on a number of diverse eye-gaze datasets including our own, and in cross dataset evaluations. We demonstrate state-of-the-art performance in terms of estimation accuracy in all experiments, and the architecture performs well even on lower resolution images.

#24 Contemplating Visual Emotions: Understanding and Overcoming Dataset Bias [PDF] [Copy] [Kimi]

Authors: Rameswar Panda ; Jianming Zhang ; Haoxiang Li ; Joon-Young Lee ; Xin Lu ; Amit K. Roy-Chowdhury

While machine learning approaches to visual emotion recognition offer great promise, current methods consider training and testing models on small scale datasets covering limited visual emotion concepts. Our analysis identifies an important but long overlooked issue of existing visual emotion benchmarks in the form of dataset biases. We design a series of tests to show and measure how such dataset biases obstruct learning a generalizable emotion recognition model. Based on our analysis, we propose a webly supervised approach by leveraging a large quantity of stock image data. Our approach uses a simple yet effective curriculum guided training strategy for learning discriminative emotion features. We discover that the models learned using our large scale stock image dataset exhibit significantly better generalization ability than the existing datasets without the manual collection of even a single label. Moreover, visual representation learned using our approach holds a lot of promise across a variety of tasks on different image and video datasets.

#25 Highly-Economized Multi-View Binary Compression for Scalable Image Clustering [PDF] [Copy] [Kimi]

Authors: Zheng Zhang ; Li Liu ; Jie Qin ; Fan Zhu ; Fumin Shen ; Yong Xu ; Ling Shao ; Heng Tao Shen

How to economically cluster large-scale multi-view images is a long-standing problem in computer vision. To tackle this challenge, this paper introduces a novel approach named Highly-economized Scalable Image Clustering (HSIC) that radically surpasses conventional image clustering methods via binary compression. We intuitively unify the binary representation learning and efficient binary cluster structure learning into a joint framework. In particular, common binary representations are learned by exploiting both sharable and individual information across multiple views in order to capture their underlying correlations. Meanwhile, cluster assignment with robust binary centroids is also performed via effective discrete optimization under L21-norm constraint. By this means, heavy continuous-valued Euclidean distance computations can be successfully reduced by efficient binary XOR operations during the clustering procedure. To our best knowledge, HSIC is the first binary clustering work specifically designed for scalable multi-view image clustering. Extensive experimental results on four large-scale image datasets show that HSIC consistently outperforms the state-of-the-art approaches, whilst significantly reducing computational time and memory footprint.