IJCAI.2018 - Computer Vision

| Total: 90

#1 MEnet: A Metric Expression Network for Salient Object Segmentation

Authors: Shulian Cai ; Jiabin Huang ; Delu Zeng ; Xinghao Ding ; John Paisley

Recent CNN-based saliency models have achieved excellent performance on public datasets, but most are sensitive to distortions from noise or compression. In this paper, we propose an end-to-end generic salient object segmentation model called Metric Expression Network (MEnet) to overcome this drawback. We construct a topological metric space whose implicit metric is determined by a deep network. In this latent space, we can semantically group the pixels of an observed image into two regions, according to whether they lie in a salient or a non-salient region of the image. We carry out all feature extractions at the pixel level, which makes the output boundaries of the salient object finely grained. Experimental results show that the proposed metric can generate robust saliency maps that allow for object segmentation. Tests on several public benchmarks show that MEnet achieves excellent performance. We also demonstrate that the proposed method outperforms previous CNN-based methods on distorted images.

#2 Show, Observe and Tell: Attribute-driven Attention Model for Image Captioning

Authors: Hui Chen ; Guiguang Ding ; Zijia Lin ; Sicheng Zhao ; Jungong Han

Despite the fact that attribute-based and attention-based approaches have been proven effective in image captioning, most attribute-based approaches simply predict attributes independently without taking the co-occurrence dependencies among attributes into account. Besides, most attention-based captioning models directly leverage the feature map extracted from a CNN, in which many features may be redundant in relation to the image content. In this paper, we focus on training a good attribute-inference model via a recurrent neural network (RNN) for image captioning, where the co-occurrence dependencies among attributes can be maintained. The uniqueness of our inference model lies in the use of an RNN with a visual attention mechanism to "observe" the image before generating captions. Additionally, we note that compact, attribute-driven features are more useful for attention-based captioning models. To this end, we extract a context feature for each attribute and guide the captioning model to adaptively attend to these context features. We verify the effectiveness and superiority of the proposed approach over other captioning approaches through extensive experiments and comparisons on the MS COCO image captioning dataset.

#3 Learning Deep Unsupervised Binary Codes for Image Retrieval

Authors: Junjie Chen ; William K. Cheung ; Anran Wang

Hashing is an efficient approximate nearest neighbor search method and has been widely adopted for large-scale multimedia retrieval. While supervised learning is more popular for data-dependent hashing, deep unsupervised hashing methods have recently been developed to learn non-linear transformations for converting multimedia inputs to binary codes. Most existing deep unsupervised hashing methods use a quadratic constraint to minimize the difference between the compact representations and the target binary codes, which inevitably causes severe information loss. In this paper, we propose a novel deep unsupervised hashing method called DeepQuan. The DeepQuan model utilizes a deep autoencoder network, where the encoder is used to learn compact representations and the decoder is for manifold preservation. In contrast to existing unsupervised methods, DeepQuan learns the binary codes by minimizing the quantization error through product quantization. Furthermore, a weighted triplet loss is proposed to avoid trivial solutions and poor generalization. Extensive experimental results on standard datasets show that the proposed DeepQuan model outperforms state-of-the-art unsupervised hashing methods for image retrieval tasks.
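
To make the quantization idea concrete, here is a minimal numpy sketch of product quantization and the quantization error a DeepQuan-style model would minimize; it is purely illustrative, not the authors' code, and all dimensions, codebook sizes, and the random encoder outputs are assumptions.

```python
# Illustrative sketch (not the authors' code): product quantization of an
# embedding and the resulting quantization error to be minimized.
import numpy as np

rng = np.random.default_rng(0)

D, M, K = 64, 8, 16          # embedding dim, number of sub-spaces, centroids per sub-space
sub_dim = D // M

# A batch of (hypothetical) encoder outputs and randomly initialised sub-codebooks.
z = rng.normal(size=(32, D))                 # compact representations from the encoder
codebooks = rng.normal(size=(M, K, sub_dim)) # one codebook per sub-space

def product_quantize(z, codebooks):
    """Assign each sub-vector to its nearest centroid and return the
    reconstructed (quantized) embeddings plus the mean quantization error."""
    B = z.shape[0]
    z_sub = z.reshape(B, M, sub_dim)                       # (B, M, sub_dim)
    # squared distances to every centroid in every sub-space: (B, M, K)
    dists = ((z_sub[:, :, None, :] - codebooks[None]) ** 2).sum(-1)
    codes = dists.argmin(-1)                               # (B, M) integer codes
    z_hat = codebooks[np.arange(M)[None, :], codes]        # (B, M, sub_dim)
    quant_error = ((z_sub - z_hat) ** 2).sum(-1).mean()    # loss term to minimise
    return z_hat.reshape(B, D), codes, quant_error

z_hat, codes, err = product_quantize(z, codebooks)
print(codes.shape, round(float(err), 3))
```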

#4 Deep View-Aware Metric Learning for Person Re-Identification

Authors: Pu Chen ; Xinyi Xu ; Cheng Deng

Person re-identification remains a challenging problem due to the dramatic changes in visual appearance caused by variations in camera view, human pose, and background clutter. In this paper, we propose a deep view-aware metric learning (DVAML) model, where image pairs with similar and dissimilar views are projected into different feature subspaces, which can discover the intrinsic relevance between image pairs from different aspects. Additionally, we employ multiple metrics to jointly learn feature subspaces on which the relevance between image pairs is explicitly captured, thus greatly improving retrieval accuracy. Extensive experimental results on the CUHK01, CUHK03, and PRID2011 datasets demonstrate the superiority of our method over state-of-the-art approaches.

#5 Knowledge-Embedded Representation Learning for Fine-Grained Image Recognition

Authors: Tianshui Chen ; Liang Lin ; Riquan Chen ; Yang Wu ; Xiaonan Luo

Humans can naturally understand an image in depth with the aid of rich knowledge accumulated from daily life or professions. For example, achieving fine-grained image recognition (e.g., categorizing hundreds of subordinate categories of birds) usually requires a comprehensive visual concept organization including category labels and part-level attributes. In this work, we investigate how to unify rich professional knowledge with deep neural network architectures and propose a Knowledge-Embedded Representation Learning (KERL) framework for fine-grained image recognition. Specifically, we organize the rich visual concepts in the form of a knowledge graph and employ a Gated Graph Neural Network to propagate node messages through the graph to generate the knowledge representation. By introducing a novel gating mechanism, our KERL framework incorporates this knowledge representation into discriminative image feature learning, i.e., implicitly associating the specific attributes with the feature maps. Compared with existing fine-grained image classification methods, our KERL framework has several appealing properties: i) the embedded high-level knowledge enhances the feature representation, thus facilitating discrimination of the subtle differences among subordinate categories; ii) our framework can learn feature maps with a meaningful configuration in which the highlighted regions accord well with the nodes (specific attributes) of the knowledge graph. Extensive experiments on the widely used Caltech-UCSD Birds dataset demonstrate the superiority of our KERL framework over existing state-of-the-art methods.

#6 Sharing Residual Units Through Collective Tensor Factorization To Improve Deep Neural Networks

Authors: Yunpeng Chen ; Xiaojie Jin ; Bingyi Kang ; Jiashi Feng ; Shuicheng Yan

The residual unit and its variations are widely used in building very deep neural networks for alleviating optimization difficulty. In this work, we revisit the standard residual function as well as several of its successful variants, and propose a unified framework based on tensor Block Term Decomposition (BTD) to explain these apparently different residual functions from the tensor decomposition view. With the BTD framework, we further propose a novel basic network architecture, named the Collective Residual Unit (CRU). CRU further enhances the parameter efficiency of deep residual neural networks by sharing core factors derived from collective tensor factorization over the involved residual units. It enables efficient knowledge sharing across multiple residual units, reduces the number of model parameters, lowers the risk of over-fitting, and provides better generalization ability. Extensive experimental results show that our proposed CRU network brings outstanding parameter efficiency: it achieves classification performance comparable to ResNet-200 with a model size as small as that of ResNet-50 on the ImageNet-1k and Places365-Standard benchmark datasets.
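
A much-simplified sketch of the parameter-sharing idea is given below: several residual units reuse one "core" grouped convolution while keeping their own 1x1 projections. This is only an assumption-laden illustration of sharing a core factor across units, not the paper's Block Term Decomposition; all layer sizes are invented.

```python
# Minimal, simplified sketch of sharing a core factor across residual units
# (not the published CRU/BTD design). Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SharedCoreResidualUnit(nn.Module):
    def __init__(self, channels, mid, shared_core):
        super().__init__()
        self.reduce = nn.Conv2d(channels, mid, 1, bias=False)   # unit-specific
        self.core = shared_core                                  # shared across units
        self.expand = nn.Conv2d(mid, channels, 1, bias=False)    # unit-specific
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.expand(self.core(self.reduce(x)))
        return self.relu(self.bn(out) + x)

channels, mid = 64, 32
shared_core = nn.Conv2d(mid, mid, 3, padding=1, groups=4, bias=False)
units = nn.Sequential(*[SharedCoreResidualUnit(channels, mid, shared_core)
                        for _ in range(3)])

x = torch.randn(2, channels, 16, 16)
print(units(x).shape)  # torch.Size([2, 64, 16, 16])
```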

#7 Scanpath Prediction for Visual Attention using IOR-ROI LSTM

Authors: Zhenzhong Chen ; Wanjie Sun

Predicting the scanpath when a certain stimulus is presented plays an important role in modeling visual attention and search. This paper presents a model that integrates a convolutional neural network and long short-term memory (LSTM) to generate realistic scanpaths. The core part of the proposed model is a dual LSTM unit, i.e., an inhibition-of-return LSTM (IOR-LSTM) and a region-of-interest LSTM (ROI-LSTM), capturing IOR dynamics and gaze shift behavior simultaneously. The IOR-LSTM simulates visual working memory to adaptively integrate and forget scene information. The ROI-LSTM is responsible for predicting the next ROI given the inhibited image features. Experimental results indicate that the proposed architecture achieves superior performance in predicting scanpaths.
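
The following toy sketch shows one plausible way such a dual-LSTM loop could be wired: one cell accumulates an inhibition-of-return state, the other predicts the next ROI from inhibited features. All interfaces, shapes, and the gating choice are assumptions for illustration, not the authors' model.

```python
# Illustrative dual-LSTM loop in the spirit of IOR-ROI (assumed interfaces).
import torch
import torch.nn as nn

feat_dim, hidden, n_regions = 128, 256, 49   # e.g. a 7x7 grid of candidate ROIs

ior_cell = nn.LSTMCell(feat_dim, hidden)      # "IOR-LSTM": visual working memory
roi_cell = nn.LSTMCell(feat_dim, hidden)      # "ROI-LSTM": gaze-shift prediction
inhibit = nn.Linear(hidden, feat_dim)         # turns the IOR state into a soft gate
roi_head = nn.Linear(hidden, n_regions)       # scores the candidate regions

def predict_scanpath(region_feats, steps=5):
    """region_feats: (n_regions, feat_dim) CNN features of candidate regions."""
    h1 = c1 = h2 = c2 = torch.zeros(1, hidden)
    fix = region_feats.mean(0, keepdim=True)          # start from a neutral fixation
    path = []
    for _ in range(steps):
        h1, c1 = ior_cell(fix, (h1, c1))              # update inhibition memory
        gate = torch.sigmoid(inhibit(h1))             # features already "used up"
        h2, c2 = roi_cell(fix * (1 - gate), (h2, c2)) # predict from inhibited input
        next_roi = roi_head(h2).argmax(-1).item()
        path.append(next_roi)
        fix = region_feats[next_roi:next_roi + 1]     # move fixation to the new ROI
    return path

print(predict_scanpath(torch.randn(n_regions, feat_dim)))
```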

#8 Multi-scale and Discriminative Part Detectors Based Features for Multi-label Image Classification

Authors: Gong Cheng ; Decheng Gao ; Yang Liu ; Junwei Han

Convolutional neural networks (CNNs) have shown great promise for the image classification task. However, global CNN features still lack geometric invariance for addressing the problem of intra-class variations, and so are not optimal for multi-label image classification. This paper proposes a new and effective framework built upon CNNs to learn Multi-scale and Discriminative Part Detectors (MsDPD)-based feature representations for multi-label image classification. Specifically, at each scale level, we (i) first present an entropy-rank based scheme to generate and select a set of discriminative part detectors (DPD), and then (ii) obtain a number of DPD-based convolutional feature maps, with each feature map representing the occurrence probability of a particular part detector, and learn DPD-based features by using a task-driven pooling scheme. The two steps are formulated into a unified framework by developing a new objective function, which jointly trains part detectors incrementally and integrates the learning of feature representations into the classification task. Finally, the multi-scale features are fused to produce the predictions. Experimental results on the PASCAL VOC 2007 and VOC 2012 datasets demonstrate that the proposed method achieves better accuracy than existing state-of-the-art multi-label classification methods.

#9 Anonymizing k Facial Attributes via Adversarial Perturbations

Authors: Saheb Chhabra ; Richa Singh ; Mayank Vatsa ; Gaurav Gupta

A face image not only provides details about the identity of a subject but also reveals several attributes such as gender, race, sexual orientation, and age. Advancements in machine learning algorithms and the popularity of sharing images on the World Wide Web, including social media websites, have increased the scope of data analytics and information profiling from photo collections. This poses a serious privacy threat for individuals who do not want to be profiled. This research presents a novel algorithm for anonymizing selective attributes which an individual does not want to share, without affecting the visual quality of the image. Using the proposed algorithm, a user can select single or multiple attributes to be suppressed while preserving identity information and visual content. The proposed adversarial perturbation based algorithm embeds imperceptible noise in an image such that the attribute prediction algorithm for the selected attribute yields incorrect classification results, thereby preserving the information according to the user's choice. Experiments on three popular databases, i.e., MUCT, LFWcrop, and CelebA, show that the proposed algorithm not only anonymizes k attributes, but also preserves image quality and identity information.
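
As a hedged sketch of the general idea (not the paper's specific algorithm), the snippet below runs a projected-gradient-style optimization of an additive noise so that a chosen attribute classifier prefers the wrong label, while an explicit budget keeps the perturbation small. The classifier, step sizes, and budget are stand-ins, not values from the paper.

```python
# Hedged sketch: bounded adversarial perturbation that flips an attribute
# classifier. The "attr_net" below is an untrained stand-in, not a real model.
import torch
import torch.nn as nn
import torch.nn.functional as F

attr_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 2))  # toy attribute head

def anonymize_attribute(image, target_wrong_label, steps=20, step_size=1e-2, budget=0.03):
    """Return image + delta such that attr_net prefers `target_wrong_label`."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        logits = attr_net(image + delta)
        loss = F.cross_entropy(logits, target_wrong_label)   # drive towards the wrong class
        loss.backward()
        with torch.no_grad():
            delta -= step_size * delta.grad.sign()            # gradient step on the noise only
            delta.clamp_(-budget, budget)                     # keep the perturbation imperceptible
        delta.grad.zero_()
    return (image + delta).detach()

img = torch.rand(1, 3, 64, 64)
anon = anonymize_attribute(img, torch.tensor([1]))
print((anon - img).abs().max().item())   # bounded by the budget
```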

#10 Dual Adversarial Networks for Zero-shot Cross-media Retrieval

Authors: Jingze Chi ; Yuxin Peng

Existing cross-media retrieval methods usually require that testing categories remain the same as training categories, which cannot support the retrieval of ever-increasing new categories. Inspired by zero-shot learning, this paper proposes zero-shot cross-media retrieval to address the above problem, which aims to retrieve data of new categories across different media types. It is challenging in that zero-shot cross-media retrieval has to handle not only the inconsistent semantics across new and known categories, but also the heterogeneous distributions across different media types. To address these challenges, this paper proposes Dual Adversarial Networks for Zero-shot Cross-media Retrieval (DANZCR), which, to the best of our knowledge, is the first approach to address zero-shot cross-media retrieval. Our DANZCR approach consists of two GANs in a dual structure for common representation generation and original representation reconstruction respectively, which capture the underlying data structures as well as strengthen relations between input data and the semantic space to generalize across seen and unseen categories. Our DANZCR approach exploits word embeddings to learn common representations in the semantic space via an adversarial learning method, which preserves the inherent cross-media correlation and enhances knowledge transfer to new categories. Experiments on three widely used cross-media retrieval datasets show the effectiveness of our approach.

#11 Siamese CNN-BiLSTM Architecture for 3D Shape Representation Learning

Authors: Guoxian Dai ; Jin Xie ; Yi Fang

Learning a 3D shape representation from a collection of its rendered 2D images has been extensively studied. However, existing view-based techniques have not yet fully exploited the information among all the views of projections. In this paper, by employing a recurrent neural network to efficiently capture features across different views, we propose a siamese CNN-BiLSTM network for 3D shape representation learning. The proposed method minimizes a discriminative loss function to learn a deep nonlinear transformation, mapping 3D shapes from the original space into a nonlinear feature space. In the transformed space, the distance between 3D shapes with the same label is minimized, while otherwise it is maximized to a large margin. Specifically, the 3D shapes are first projected into a group of 2D images from different views. Then a convolutional neural network (CNN) is adopted to extract features from the different view images, followed by a bidirectional long short-term memory (BiLSTM) network to aggregate information across different views. Finally, we construct the whole CNN-BiLSTM network in a siamese structure with a contrastive loss function. Our proposed method is evaluated on two benchmarks, ModelNet40 and SHREC 2014, demonstrating superiority over state-of-the-art methods.
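
A minimal sketch of the two ingredients the abstract names, under assumed shapes and layer sizes (not the authors' architecture): per-view features aggregated by a bidirectional LSTM, and a margin-based contrastive loss over pairs of shapes.

```python
# Minimal sketch: BiLSTM view aggregation + contrastive loss (assumed sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiViewEncoder(nn.Module):
    def __init__(self, view_feat_dim=512, hidden=256, embed=128):
        super().__init__()
        self.bilstm = nn.LSTM(view_feat_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, embed)

    def forward(self, view_feats):            # (batch, n_views, view_feat_dim)
        out, _ = self.bilstm(view_feats)      # aggregate information across views
        return self.fc(out.mean(dim=1))       # (batch, embed)

def contrastive_loss(z1, z2, same_label, margin=1.0):
    d = F.pairwise_distance(z1, z2)
    pos = same_label * d.pow(2)                          # pull same-class shapes together
    neg = (1 - same_label) * F.relu(margin - d).pow(2)   # push different classes apart
    return (pos + neg).mean()

enc = MultiViewEncoder()
a, b = torch.randn(4, 12, 512), torch.randn(4, 12, 512)  # 12 rendered views per shape
loss = contrastive_loss(enc(a), enc(b), same_label=torch.tensor([1., 0., 1., 0.]))
print(loss.item())
```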

#12 Cross-Modality Person Re-Identification with Generative Adversarial Training

Authors: Pingyang Dai ; Rongrong Ji ; Haibin Wang ; Qiong Wu ; Yuyu Huang

Person re-identification (Re-ID) is an important task in video surveillance, which automatically searches for and identifies people across different cameras. Despite the extensive Re-ID progress with RGB cameras, few works have studied Re-ID between infrared and RGB images, which is essentially a cross-modality problem and widely encountered in real-world scenarios. The key challenge is two-fold, i.e., the lack of discriminative information to re-identify the same person between RGB and infrared modalities, and the difficulty of learning a robust metric for such large-scale cross-modality retrieval. In this paper, we tackle the above two challenges by proposing a novel cross-modality generative adversarial network (termed cmGAN). To handle the issue of insufficient discriminative information, we leverage cutting-edge generative adversarial training to design our own discriminator to learn discriminative feature representations from different modalities. To handle the issue of large-scale cross-modality metric learning, we integrate both an identification loss and a cross-modality triplet loss, which minimize inter-class ambiguity while maximizing cross-modality similarity among instances. The entire cmGAN can be trained in an end-to-end manner using a standard deep neural network framework. We have quantified the performance of our work on the newly released SYSU RGB-IR Re-ID benchmark, and report superior performance in terms of the Cumulative Match Characteristic (CMC) curve and Mean Average Precision (MAP) over the state-of-the-art work [Wu et al., 2017].
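
The snippet below sketches the kind of joint objective the abstract mentions: an identity classification loss plus a cross-modality triplet loss. The feature dimensions, margin, negative-mining choice, and the linear identity head are placeholders, not the cmGAN design.

```python
# Sketch of an identification loss + cross-modality triplet loss (placeholders).
import torch
import torch.nn as nn
import torch.nn.functional as F

n_ids, feat_dim = 100, 256
id_head = nn.Linear(feat_dim, n_ids)

def cm_loss(rgb_feat, ir_feat, labels, margin=0.5, w=1.0):
    """rgb_feat, ir_feat: (B, feat_dim) embeddings of the same identities (row-aligned)."""
    # identification loss on both modalities
    id_loss = F.cross_entropy(id_head(rgb_feat), labels) + F.cross_entropy(id_head(ir_feat), labels)
    # cross-modality triplet: RGB anchor, IR positive of the same id, IR negative of another id
    neg = ir_feat.roll(shifts=1, dims=0)                 # a simple (assumed) negative choice
    trip = F.triplet_margin_loss(rgb_feat, ir_feat, neg, margin=margin)
    return id_loss + w * trip

rgb = torch.randn(8, feat_dim); ir = torch.randn(8, feat_dim)
labels = torch.arange(8) % n_ids
print(cm_loss(rgb, ir, labels).item())
```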

#13 R³Net: Recurrent Residual Refinement Network for Saliency Detection

Authors: Zijun Deng ; Xiaowei Hu ; Lei Zhu ; Xuemiao Xu ; Jing Qin ; Guoqiang Han ; Pheng-Ann Heng

Saliency detection is a fundamental yet challenging task in computer vision, aiming at highlighting the most visually distinctive objects in an image. We propose a novel recurrent residual refinement network (R^3Net) equipped with residual refinement blocks (RRBs) to more accurately detect salient regions of an input image. Our RRBs learn the residual between the intermediate saliency prediction and the ground truth by alternately leveraging the low-level integrated features and the high-level integrated features of a fully convolutional network (FCN). While the low-level integrated features are capable of capturing more saliency details, the high-level integrated features can reduce non-salient regions in the intermediate prediction. Furthermore, the RRBs can obtain complementary saliency information for the intermediate prediction and add the residual into the intermediate prediction to refine the saliency maps. We evaluate the proposed R^3Net on five widely used saliency detection benchmarks by comparing it with 16 state-of-the-art saliency detectors. Experimental results show that our network outperforms the competitors on all the benchmark datasets.
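
A compact sketch of the refinement loop described above, with assumed shapes and layer sizes rather than the published R^3Net configuration: each block predicts a residual from the current saliency map plus a feature map, adds it back, and the blocks alternate between high-level and low-level features.

```python
# Sketch of recurrent residual refinement of a saliency map (assumed sizes).
import torch
import torch.nn as nn

class ResidualRefineBlock(nn.Module):
    def __init__(self, feat_channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(feat_channels + 1, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1))                # outputs a residual map

    def forward(self, saliency, feats):
        residual = self.conv(torch.cat([saliency, feats], dim=1))
        return saliency + residual                         # refined prediction

low_feats = torch.randn(1, 64, 80, 80)     # low-level integrated features (detail)
high_feats = torch.randn(1, 64, 80, 80)    # high-level integrated features (semantics)
saliency = torch.sigmoid(torch.randn(1, 1, 80, 80))        # initial prediction

blocks = [ResidualRefineBlock(64) for _ in range(4)]
for i, block in enumerate(blocks):
    # alternate between high-level and low-level features, as the abstract describes
    saliency = block(saliency, high_feats if i % 2 == 0 else low_feats)
print(saliency.shape)
```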

#14 Unsupervised Cross-Modality Domain Adaptation of ConvNets for Biomedical Image Segmentations with Adversarial Loss

Authors: Qi Dou ; Cheng Ouyang ; Cheng Chen ; Hao Chen ; Pheng-Ann Heng

Convolutional networks (ConvNets) have achieved great success in various challenging vision tasks. However, the performance of ConvNets degrades when encountering domain shift. Domain adaptation is both more important and more challenging in the field of biomedical image analysis, where cross-modality data have largely different distributions. Given that annotating medical data is especially expensive, supervised transfer learning approaches are not optimal. In this paper, we propose an unsupervised domain adaptation framework with adversarial learning for cross-modality biomedical image segmentation. Specifically, our model is based on a dilated fully convolutional network for pixel-wise prediction. Moreover, we build a plug-and-play domain adaptation module (DAM) to map the target input to features aligned with the source-domain feature space. A domain critic module (DCM) is set up to discriminate between the feature spaces of the two domains. We optimize the DAM and DCM via an adversarial loss without using any target-domain labels. Our proposed method is validated by adapting a ConvNet trained on MRI images to unpaired CT data for cardiac structure segmentation, achieving very promising results.

#15 Enhanced-alignment Measure for Binary Foreground Map Evaluation

Authors: Deng-Ping Fan ; Cheng Gong ; Yang Cao ; Bo Ren ; Ming-Ming Cheng ; Ali Borji

Existing binary foreground map (FM) measures address various types of errors in either pixel-wise or structural ways. These measures consider pixel-level matches or image-level information independently, while cognitive vision studies have shown that human vision is highly sensitive to both global information and local details in scenes. In this paper, we take a detailed look at current binary FM evaluation measures and propose a novel and effective E-measure (Enhanced-alignment measure). Our measure combines local pixel values with the image-level mean value in a single term, jointly capturing image-level statistics and local pixel-matching information. We demonstrate the superiority of our measure over the available measures on 4 popular datasets via 5 meta-measures, including ranking models for applications, demoting generic maps, demoting random Gaussian noise maps, ground-truth switching, and human judgments. We find large improvements in almost all the meta-measures. For instance, in terms of application ranking, we observe improvements ranging from 9.08% to 19.65% compared with other popular measures.
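
The short numpy sketch below illustrates an enhanced-alignment style computation as it is commonly implemented for this measure: each map is demeaned (image-level statistic), the two demeaned maps are compared pixel by pixel (local matching), and the result is averaged. Treat it as an illustration, not the reference implementation.

```python
# Illustrative enhanced-alignment style measure (not the reference code).
import numpy as np

def e_measure(pred, gt, eps=1e-8):
    """pred, gt: binary foreground maps of the same shape (values in {0, 1})."""
    pred = pred.astype(np.float64)
    gt = gt.astype(np.float64)
    # image-level statistics: subtract each map's global mean
    d_pred = pred - pred.mean()
    d_gt = gt - gt.mean()
    # local pixel-wise alignment between the two demeaned maps
    align = 2.0 * d_gt * d_pred / (d_gt ** 2 + d_pred ** 2 + eps)
    enhanced = (align + 1.0) ** 2 / 4.0
    return enhanced.mean()

gt = np.zeros((64, 64), dtype=np.uint8); gt[16:48, 16:48] = 1
good = gt.copy()
bad = 1 - gt
print(round(e_measure(good, gt), 3), round(e_measure(bad, gt), 3))  # high vs. low score
```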

#16 Watching a Small Portion could be as Good as Watching All: Towards Efficient Video Classification

Authors: Hehe Fan ; Zhongwen Xu ; Linchao Zhu ; Chenggang Yan ; Jianjun Ge ; Yi Yang

We aim to significantly reduce the computational cost of classifying temporally untrimmed videos while retaining similar accuracy. Existing video classification methods sample frames with a predefined frequency over the entire video. In contrast, we propose an end-to-end deep reinforcement learning approach which enables an agent to classify videos by watching only a very small portion of the frames, much as humans do. We make two main contributions. First, information is not equally distributed across video frames over time. An agent needs to watch more carefully when a clip is informative and skip frames that are redundant or irrelevant. The proposed approach enables the agent to adapt its sampling rate to the video content and skip most of the frames without loss of information. Second, in order to reach a confident decision, the number of frames that should be watched by an agent varies greatly from one video to another. We incorporate an adaptive stop network to measure a confidence score and generate a timely trigger to stop the agent from watching further frames, which improves efficiency without loss of accuracy. Our approach reduces the computational cost significantly on the large-scale YouTube-8M dataset, while the accuracy remains the same.

#17 Age Estimation Using Expectation of Label Distribution Learning

Authors: Bin-Bin Gao ; Hong-Yu Zhou ; Jianxin Wu ; Xin Geng

Age estimation performance has been greatly improved by using convolutional neural networks. However, existing methods have an inconsistency between the training objectives and the evaluation metric, so they may be suboptimal. In addition, these methods always adopt image classification or face recognition models with a large number of parameters, which incur expensive computation cost and storage overhead. To alleviate these issues, we design a lightweight network architecture and propose a unified framework which can jointly learn the age distribution and regress the age. The effectiveness of our approach has been demonstrated on apparent and real age estimation tasks. Our method achieves new state-of-the-art results using a single model with 36× fewer parameters and a 2.6× reduction in inference time. Moreover, our method achieves results comparable to the state of the art even when the model parameters are further reduced to 0.9M (3.8MB disk storage). We also show through analysis that ranking methods implicitly learn label distributions.
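
A hedged sketch of the kind of joint objective the abstract outlines: learn an age distribution (here a KL divergence against a Gaussian-shaped target around the true age) and regress the distribution's expectation. The support, sigma, and loss weight are guesses for illustration, not the paper's settings.

```python
# Sketch: label distribution learning + expectation regression for age.
import torch
import torch.nn.functional as F

ages = torch.arange(0, 101, dtype=torch.float32)          # support: 0..100 years

def target_distribution(true_age, sigma=2.0):
    d = torch.exp(-(ages - true_age) ** 2 / (2 * sigma ** 2))
    return d / d.sum()

def ldl_expectation_loss(logits, true_age, lam=1.0):
    log_p = F.log_softmax(logits, dim=-1)
    p = log_p.exp()
    kl = F.kl_div(log_p, target_distribution(true_age), reduction='sum')
    expected_age = (p * ages).sum()                        # expectation of the distribution
    return kl + lam * (expected_age - true_age).abs(), expected_age

logits = torch.randn(101, requires_grad=True)
loss, pred_age = ldl_expectation_loss(logits, true_age=torch.tensor(31.0))
loss.backward()
print(round(pred_age.item(), 2), round(loss.item(), 3))
```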

#18 Coarse-to-fine Image Co-segmentation with Intra and Inter Rank Constraints

Authors: Lianli Gao ; Jingkuan Song ; Dongxiang Zhang ; Heng Tao Shen

Image co-segmentation is the problem of automatically discovering the common objects co-occurring in a set of relevant images and simultaneously segmenting them as foreground. Although many approaches have been proposed to address this problem, most of them still suffer from certain limitations, e.g., supervised feature learning and complex models, which hinder their applicability in real-world scenarios. To alleviate these limitations, we propose a novel coarse-to-fine co-segmentation (CFC) framework, which utilizes coarse foreground and background proposals to learn a robust similarity measure of the features in an unsupervised way, and then devises a simple objective function based on the definition of image co-segmentation. Specifically, we first generate superpixels for all the images and extract their features. Instead of using existing distance metrics, we utilize object proposal methods to generate coarse foreground and background regions, from which we learn a similarity measure of superpixels to construct a robust feature similarity graph. Then we design an intuitive objective function to learn a segmentation similarity graph, which should be consistent with the feature similarity graph and also able to co-segment the superpixels in the images into either foreground or background. This objective function can be further reformulated as a graph learning problem with intra and inter rank constraints. Experiments on two commonly used image datasets (iCoseg and MSRC) demonstrate that CFC outperforms other state-of-the-art methods. Notably, this performance is achieved using only the HSV feature.

#19 View-Volume Network for Semantic Scene Completion from a Single Depth Image

Authors: Yuxiao Guo ; Xin Tong

We introduce a View-Volume convolutional neural network (VVNet) for inferring the occupancy and semantic labels of a volumetric 3D scene from a single depth image. Our method extracts detailed geometric features from the input depth image with a 2D view CNN and then projects the features into a 3D volume according to the input depth map via a projection layer. After that, we learn the 3D context information of the scene with a 3D volume CNN to compute the resulting volumetric occupancy and semantic labels. By combining 2D and 3D representations, VVNet efficiently reduces the computational cost, enables feature extraction from multi-channel high-resolution inputs, and thus significantly improves result accuracy. We validate our method and demonstrate its efficiency and effectiveness on both the synthetic SUNCG and the real NYU datasets.

#20 Harnessing Synthesized Abstraction Images to Improve Facial Attribute Recognition

Authors: Keke He ; Yanwei Fu ; Wuhao Zhang ; Chengjie Wang ; Yu-Gang Jiang ; Feiyue Huang ; Xiangyang Xue

Facial attribute recognition is an important yet challenging research topic. Different from most previous approaches, which predict attributes based only on whole images, this paper leverages facial part locations for better attribute prediction. A facial abstraction image, which contains both local facial parts and facial texture information, is introduced. This abstraction image is generated by a Generative Adversarial Network (GAN). We then build a dual-path facial attribute recognition network to utilize features from both the original face images and the facial abstraction images. Empirically, the features of facial abstraction images are complementary to the features of the original face images. With the facial parts localized by the abstraction images, our method improves facial attribute recognition, especially for attributes located on small face regions. Extensive evaluations conducted on the CelebA and LFWA benchmark datasets show that state-of-the-art performance is achieved.

#21 StackDRL: Stacked Deep Reinforcement Learning for Fine-grained Visual Categorization

Authors: Xiangteng He ; Yuxin Peng ; Junjie Zhao

Fine-grained visual categorization (FGVC) is the discrimination of similar subcategories, whose main challenge is to localize the quite subtle visual distinctions between similar subcategories. There are two pivotal problems: discovering which region is discriminative and representative, and determining how many discriminative regions are necessary to achieve the best performance. Existing methods generally solve these two problems by relying on prior knowledge or experimental validation, which severely restricts the usability and scalability of FGVC. To address the "which" and "how many" problems adaptively and intelligently, this paper proposes a stacked deep reinforcement learning approach (StackDRL). It adopts a two-stage learning architecture, which is driven by a semantic reward function. The two-stage learning localizes the object and its parts in sequence ("which") and determines the number of discriminative regions adaptively ("how many"), which is quite appealing for FGVC. The semantic reward function drives StackDRL to fully learn discriminative and conceptual visual information by jointly combining an attention-based reward and a category-based reward. Furthermore, unsupervised discriminative localization avoids heavy labeling effort and greatly strengthens the usability and scalability of our StackDRL approach. Compared with ten state-of-the-art methods on the CUB-200-2011 dataset, our StackDRL approach achieves the best categorization accuracy.

#22 Co-attention CNNs for Unsupervised Object Co-segmentation

Authors: Kuang-Jui Hsu ; Yen-Yu Lin ; Yung-Yu Chuang

Object co-segmentation aims to segment the common objects in a set of images. This paper presents a CNN-based method that is unsupervised and end-to-end trainable to better solve this task. Our method is unsupervised in the sense that it does not require any training data in the form of object masks but merely a set of images jointly covering objects of a specific class. Our method comprises two collaborative CNN modules, a feature extractor and a co-attention map generator. The former module extracts the features of the estimated objects and backgrounds, and is derived based on the proposed co-attention loss, which minimizes inter-image object discrepancy while maximizing intra-image figure-ground separation. The latter module is learned to generate co-attention maps by which the estimated figure-ground segmentation better fits the former module. Besides the co-attention loss, a mask loss is developed to retain whole objects and remove noise. Experiments show that our method achieves superior results, even outperforming state-of-the-art supervised methods.
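
A loose sketch of the co-attention idea, under assumed shapes and with a simplified loss (not the published formulation): features weighted by a co-attention map give per-image foreground and background descriptors; the loss pulls foreground descriptors of different images together and pushes each image's foreground away from its background.

```python
# Sketch of a co-attention style loss (assumed shapes, simplified formulation).
import torch
import torch.nn.functional as F

def co_attention_loss(feats, attn, eps=1e-6):
    """feats: (N, C, H, W) image features; attn: (N, 1, H, W) maps in [0, 1]."""
    fg = (feats * attn).sum(dim=(2, 3)) / (attn.sum(dim=(2, 3)) + eps)              # (N, C)
    bg = (feats * (1 - attn)).sum(dim=(2, 3)) / ((1 - attn).sum(dim=(2, 3)) + eps)
    fg = F.normalize(fg, dim=1)
    bg = F.normalize(bg, dim=1)
    inter = torch.pdist(fg).mean()              # object descriptors should agree across images
    intra = (fg - bg).norm(dim=1).mean()        # figure and ground should separate within an image
    return inter - intra

feats = torch.randn(4, 128, 32, 32)
attn = torch.sigmoid(torch.randn(4, 1, 32, 32))
print(co_attention_loss(feats, attn).item())
```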

#23 Human Motion Generation via Cross-Space Constrained Sampling

Authors: Zhongyue Huang ; Jingwei Xu ; Bingbing Ni

We aim to automatically generate a human motion sequence from a single input person image, given a specific action label. To this end, we propose a cross-space human motion video generation network which features two paths: a forward path that first samples/generates a sequence of low-dimensional motion vectors based on a Gaussian Process (GP), which is paired with the input person image to form a moving human figure sequence; and a backward path, based on the predicted human images, that re-extracts the corresponding latent motion representations. In the absence of supervision, the reconstructed latent motion representations are expected to be as close as possible to the GP-sampled ones, yielding a cyclic objective function for cross-space (i.e., motion and appearance) mutually constrained generation. We further propose an alternating sampling/generation algorithm with respect to constraints from both spaces. Extensive experimental results show that the proposed framework successfully generates novel human motion sequences with reasonable visual quality.

#24 Semantic Locality-Aware Deformable Network for Clothing Segmentation

Authors: Wei Ji ; Xi Li ; Yueting Zhuang ; Omar El Farouk Bourahla ; Yixin Ji ; Shihao Li ; Jiabao Cui

Clothing segmentation is a challenging vision problem typically addressed within a fine-grained semantic segmentation framework. Different from conventional segmentation, clothing segmentation has some domain-specific properties such as texture richness, diverse appearance variations, non-rigid geometric deformations, and small-sample learning. To deal with these issues, we propose a semantic locality-aware segmentation model, which adaptively pairs an original clothing image with a semantically similar (e.g., in appearance or pose) auxiliary exemplar retrieved by search. By considering the interactions between the clothing image and its exemplar, more intrinsic knowledge about the locality manifold structure of clothing images is discovered, making the learning process under the small-sample problem more stable and tractable. Furthermore, we present a CNN model based on deformable convolutions to extract non-rigid geometry-aware features for clothing images. Experimental results demonstrate the effectiveness of the proposed model against state-of-the-art approaches.

#25 Deep CNN Denoiser and Multi-layer Neighbor Component Embedding for Face Hallucination

Authors: Junjun Jiang ; Yi Yu ; Jinhui Hu ; Suhua Tang ; Jiayi Ma

Most current face hallucination methods, whether shallow learning-based or deep learning-based, try to learn a relationship model between the Low-Resolution (LR) and High-Resolution (HR) spaces with the help of a training set. They mainly focus on modeling an image prior through either model-based optimization or discriminative inference learning. However, when the input LR face is tiny, the learned prior knowledge is no longer effective and their performance drops sharply. To solve this problem, in this paper we propose a general face hallucination method that can integrate model-based optimization and discriminative inference. In particular, to exploit the model-based prior, a deep Convolutional Neural Network (CNN) denoiser prior is plugged into the super-resolution optimization model with the aid of image-adaptive Laplacian regularization. Additionally, we further develop a high-frequency detail compensation method by dividing the face image into facial components and performing face hallucination in a multi-layer neighbor embedding manner. Experiments demonstrate that the proposed method achieves promising super-resolution results for tiny input LR faces.
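
To illustrate the general plug-and-play pattern of alternating a data-fidelity step with a learned denoiser, here is a very reduced sketch; the "denoiser" is an untrained stand-in CNN, the degradation is plain average-pool downsampling, and the step sizes are arbitrary, so this is only an illustration of the optimization loop, not the paper's method.

```python
# Very reduced plug-and-play super-resolution loop (illustrative stand-ins only).
import torch
import torch.nn as nn
import torch.nn.functional as F

scale = 4
denoiser = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 1, 3, padding=1))          # stand-in CNN denoiser prior

def degrade(x):                               # assumed degradation: simple downsampling
    return F.avg_pool2d(x, scale)

def hallucinate(lr, iters=10, step=1.0):
    hr = F.interpolate(lr, scale_factor=scale, mode='bilinear', align_corners=False)
    for _ in range(iters):
        # data-fidelity step: pull the estimate towards consistency with the LR input
        residual = degrade(hr) - lr
        hr = hr - step * F.interpolate(residual, scale_factor=scale, mode='nearest')
        # prior step: nudge the estimate towards the denoiser's output
        with torch.no_grad():
            hr = hr + 0.1 * (denoiser(hr) - hr)
    return hr

lr_face = torch.rand(1, 1, 16, 16)            # a tiny low-resolution input
print(hallucinate(lr_face).shape)             # torch.Size([1, 1, 64, 64])
```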