AAAI.2018 - Vision

| Total: 122

#1 Tracking Occluded Objects and Recovering Incomplete Trajectories by Reasoning About Containment Relations and Human Actions [PDF] [Copy] [Kimi] [REL]

Authors: Wei Liang, Yixin Zhu, Song-Chun Zhu

This paper studies the challenging problem of tracking severely occluded objects in long video sequences. The proposed method reasons about containment relations and human actions, and thus infers and recovers the identities of occluded objects while they are contained in or blocked by other objects. Two conditions lead to incomplete trajectories: i) Contained: the occlusion is caused by a containment relation formed between two objects, e.g., an unobserved laptop inside a backpack forms a containment relation between the laptop and the backpack. ii) Blocked: the occlusion is caused by other objects blocking the view from certain locations, during which the containment relation does not change. By explicitly distinguishing these two causes of occlusion, the proposed algorithm formulates tracking as a network flow representation that encodes containment relations and their changes. By assuming that occlusions do not happen spontaneously but are triggered only by human actions, a MAP inference is applied to jointly interpret the trajectory of an object by detections in space and human actions in time. To quantitatively evaluate our algorithm, we collect a new occluded-object dataset captured by a Kinect sensor, including a set of RGB-D videos and human skeletons with multiple actors, various objects, and different changes of containment relations. In the experiments, we show that the proposed method performs better at tracking occluded objects than baseline methods.


#2 Learning Adversarial 3D Model Generation With 2D Image Enhancer [PDF] [Copy] [Kimi] [REL]

Authors: Jing Zhu, Jin Xie, Yi Fang

Recent advancements in generative adversarial nets (GANs) and volumetric convolutional neural networks (CNNs) enable generating 3D models from a probabilistic space. In this paper, we develop a novel GAN-based deep neural network to obtain a better latent space for the generation of 3D models. In the proposed method, an enhancer neural network is introduced to extract information from another corresponding domain (e.g., images) to improve both the performance of the 3D model generator and the discriminative power of the unsupervised shape features learned from the 3D model discriminator. Specifically, we train the 3D generative adversarial network on 3D volumetric models, while the enhancer network learns image features from rendered images. Different from the traditional GAN architecture that uses uninformative random vectors as inputs, we feed the high-level image features learned by the enhancer into the 3D model generator for better training. Evaluations on two large-scale 3D model datasets, ShapeNet and ModelNet, demonstrate that our proposed method not only generates high-quality 3D models, but also learns discriminative shape representations for classification and retrieval without supervision.
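
A minimal PyTorch sketch (not the authors' implementation) of the core idea: a 2D "enhancer" encodes a rendered image into a latent vector that drives a volumetric 3D generator, replacing the uninformative random input of a vanilla GAN. Layer sizes, the 64x64 image size, and the 32^3 voxel resolution are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Enhancer(nn.Module):                      # 2D rendered image -> latent code
    def __init__(self, latent_dim=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, latent_dim))
    def forward(self, img):
        return self.net(img)

class Generator3D(nn.Module):                   # latent code -> 32^3 voxel grid
    def __init__(self, latent_dim=200):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 256 * 4 * 4 * 4)
        self.net = nn.Sequential(
            nn.ConvTranspose3d(256, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose3d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose3d(64, 1, 4, 2, 1), nn.Sigmoid())
    def forward(self, z):
        return self.net(self.fc(z).view(-1, 256, 4, 4, 4))

enhancer, generator = Enhancer(), Generator3D()
rendered = torch.randn(8, 3, 64, 64)            # rendered views of training shapes
fake_voxels = generator(enhancer(rendered))     # image features drive the 3D generator
print(fake_voxels.shape)                        # torch.Size([8, 1, 32, 32, 32])
```

A 3D discriminator and the adversarial losses (omitted here) would then be trained against real volumetric models as in a standard GAN.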


#3 Face Sketch Synthesis From Coarse to Fine [PDF] [Copy] [Kimi] [REL]

Authors: Mingjin Zhang, Nannan Wang, Yunsong Li, Ruxin Wang, Xinbo Gao

Synthesizing fine face sketches from photos is a valuable yet challenging problem in digital entertainment. Face sketches synthesized by conventional methods usually exhibit only the coarse structures of faces, whereas fine details are lost, especially on critical facial components. In this paper, by imitating the coarse-to-fine drawing process of artists, we propose a novel face sketch synthesis framework consisting of a coarse stage and a fine stage. In the coarse stage, a mapping from face photos to sketches is learned via a convolutional neural network, which ensures that the synthesized sketches preserve the coarse structures of faces. Given the test photo and the coarse synthesized sketch, a probabilistic graphical model is then designed to synthesize a delicate face sketch with fine, critical details. Experimental results on public face sketch databases illustrate that our proposed framework outperforms state-of-the-art methods in both quantitative and visual comparisons.


#4 Multi-Channel Pyramid Person Matching Network for Person Re-Identification [PDF] [Copy] [Kimi] [REL]

Authors: Chaojie Mao, Yingming Li, Yaqing Zhang, Zhongfei Zhang, Xi Li

In this work, we present a Multi-Channel deep convolutional Pyramid Person Matching Network (MC-PPMN) based on the combination of semantic components and color-texture distributions to address the problem of person re-identification. In particular, we learn separate deep representations for semantic components and color-texture distributions from two person images and then employ a pyramid person matching network (PPMN) to obtain correspondence representations. These correspondence representations are fused to perform the re-identification task. Further, the proposed framework is optimized via a unified end-to-end deep learning scheme. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our approach against state-of-the-art methods, especially in terms of the rank-1 recognition rate.


#5 Asymmetric Joint Learning for Heterogeneous Face Recognition [PDF] [Copy] [Kimi] [REL]

Authors: Bing Cao, Nannan Wang, Xinbo Gao, Jie Li

Heterogeneous face recognition (HFR) refers to matching a probe face image taken from one modality against face images acquired from another modality. It plays an important role in security scenarios. However, HFR remains a challenging problem due to the great discrepancies between cross-modality images. This paper proposes an asymmetric joint learning (AJL) approach to handle this issue. The proposed method mutually transforms the cross-modality differences by incorporating synthesized images into the learning process, which provides more discriminative information. Although the aggregated data augments the intra-class scale, it also reduces the inter-class diversity (i.e., discriminative information). We therefore develop the AJL model to balance this dilemma. Finally, the similarity score between two heterogeneous face images is obtained through the log-likelihood ratio. Extensive experiments on viewed sketch, forensic sketch, and near-infrared image databases illustrate that the proposed AJL-HFR method achieves superior performance in comparison to state-of-the-art methods.


#6 Domain-Shared Group-Sparse Dictionary Learning for Unsupervised Domain Adaptation [PDF] [Copy] [Kimi] [REL]

Authors: Baoyao Yang, Andy Ma, Pong Yuen

Unsupervised domain adaptation has proven to be a promising approach to solving the problem of dataset bias. To employ source labels in the target domain, the joint distributions of source and target data must be aligned; the key research problem is to align the conditional distributions across domains without target labels. In this paper, we propose a new criterion of domain-shared group-sparsity, which is an equivalent condition for conditional distribution alignment. To solve the joint distribution alignment problem, a domain-shared group-sparse dictionary learning method is developed that jointly aligns the conditional and marginal distributions. A classifier for the target domain is trained using the domain-shared group-sparse coefficients and the target-specific information from the target data. Experimental results on cross-domain face and object recognition show that the proposed method outperforms eight state-of-the-art unsupervised domain adaptation algorithms.
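
To illustrate the group-sparsity ingredient named above, here is a toy sketch (my own, under assumed notation) of a proximal block soft-thresholding step, which zeroes whole groups of coding coefficients and thereby encourages source and target samples of the same class to use the same dictionary atoms; the group partition and the regularization weight lam are made up for illustration.

```python
import torch

def group_soft_threshold(codes, groups, lam):
    # codes: (K,) dictionary coefficients; groups: index tensors partitioning the atoms
    out = codes.clone()
    for g in groups:
        norm = codes[g].norm()
        scale = torch.clamp(1.0 - lam / (norm + 1e-12), min=0.0)  # shrink whole group
        out[g] = codes[g] * scale
    return out

codes = torch.randn(12)
groups = [torch.arange(0, 4), torch.arange(4, 8), torch.arange(8, 12)]
sparse_codes = group_soft_threshold(codes, groups, lam=1.0)       # some groups become all-zero
```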


#7 Cooperative Training of Deep Aggregation Networks for RGB-D Action Recognition [PDF] [Copy] [Kimi] [REL]

Authors: Pichao Wang, Wanqing Li, Jun Wan, Philip Ogunbona, Xinwang Liu

A novel deep neural network training paradigm that exploits the conjoint information in multiple heterogeneous sources is proposed. Specifically, for RGB-D based action recognition, it cooperatively trains a single convolutional neural network (named c-ConvNet) on both RGB visual features and depth features, and deeply aggregates the two kinds of features for action recognition. Different from a conventional ConvNet that learns deep separable features for homogeneous modality-based classification with only a softmax loss, the c-ConvNet enhances the discriminative power of the deeply learned features and weakens the undesired modality discrepancy by jointly optimizing a ranking loss and a softmax loss for both homogeneous and heterogeneous modalities. The ranking loss consists of intra-modality and cross-modality triplet losses, and it reduces both intra-modality and cross-modality feature variations. Furthermore, the correlations between RGB and depth data are embedded in the c-ConvNet, can be retrieved by either modality, and contribute to recognition even when only one of the modalities is available. The proposed method was extensively evaluated on two large RGB-D action recognition datasets, ChaLearn LAP IsoGD and NTU RGB+D, and one small dataset, SYSU 3D HOI, and achieved state-of-the-art results.
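
A minimal sketch of the kind of joint objective described above (softmax plus intra- and cross-modality triplet ranking), assuming embeddings and logits have already been produced for the RGB and depth streams and that triplets have been mined elsewhere; the shared convolutional backbone and the margin value are assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()
triplet = nn.TripletMarginLoss(margin=0.5)

def joint_loss(logits_rgb, logits_dep, labels,
               anchor, positive, negative,         # triplet mined within one modality
               x_anchor, x_positive, x_negative,   # triplet mined across modalities
               lam=0.5):
    ranking = triplet(anchor, positive, negative) + triplet(x_anchor, x_positive, x_negative)
    softmax = ce(logits_rgb, labels) + ce(logits_dep, labels)
    return softmax + lam * ranking

N, D, K = 16, 128, 10
feats = lambda: torch.randn(N, D)
loss = joint_loss(torch.randn(N, K), torch.randn(N, K), torch.randint(0, K, (N,)),
                  feats(), feats(), feats(), feats(), feats(), feats())
```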


#8 SAP: Self-Adaptive Proposal Model for Temporal Action Detection Based on Reinforcement Learning [PDF] [Copy] [Kimi] [REL]

Authors: Jingjia Huang, Nannan Li, Tao Zhang, Ge Li, Tiejun Huang, Wen Gao

Existing action detection algorithms usually generate action proposals through an extensive search over the video at multiple temporal scales, which incurs huge computational overhead and deviates from the human perception procedure. We argue that the process of detecting actions should naturally be one of observation and refinement: observe the current window and refine the span of the attended window to cover true action regions. In this paper, we propose a Self-Adaptive Proposal (SAP) model that learns to find actions by continuously adjusting the temporal bounds in a self-adaptive way. The whole process can be deemed an agent that is first placed at the beginning of the video and traverses the whole video by applying a sequence of transformations to the currently attended region to discover actions according to a learned policy. We utilize reinforcement learning, specifically the deep Q-learning algorithm, to learn the agent's decision policy. In addition, we use a temporal pooling operation to extract a more effective feature representation for long temporal windows, and design a regression network to adjust the position offsets between predicted results and the ground truth. Experimental results on THUMOS’14 validate the effectiveness of SAP, which achieves performance competitive with current action detection algorithms while using far fewer proposals.
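
A hedged sketch of the self-adaptive proposal idea: a small Q-network scores a fixed set of window transformations from the pooled feature of the currently attended window, and the chosen action moves or resizes the window. The action set, feature dimension, step size, and network shape are illustrative assumptions rather than the paper's configuration.

```python
import random
import torch
import torch.nn as nn

ACTIONS = ["shift_left", "shift_right", "expand", "shrink", "stop"]
q_net = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, len(ACTIONS)))

def apply_action(window, action, step_frames=16, video_len=3000):
    s, e = window
    if action == "shift_left":    s, e = s - step_frames, e - step_frames
    elif action == "shift_right": s, e = s + step_frames, e + step_frames
    elif action == "expand":      s, e = s - step_frames, e + step_frames
    elif action == "shrink":      s, e = s + step_frames // 2, e - step_frames // 2
    s = max(0, s)
    e = min(video_len, max(e, s + step_frames))   # keep a minimum window length
    return s, e

def select_action(window_feature, eps=0.1):
    if random.random() < eps:                     # epsilon-greedy exploration
        return random.randrange(len(ACTIONS))
    with torch.no_grad():
        return q_net(window_feature).argmax().item()

window = (0, 64)
feat = torch.randn(512)                           # pooled feature of the attended window
window = apply_action(window, ACTIONS[select_action(feat)])
```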


#9 Order-Free RNN With Visual Attention for Multi-Label Classification [PDF] [Copy] [Kimi] [REL]

Authors: Shang-Fu Chen, Yi-Chen Chen, Chih-Kuan Yeh, Yu-Chiang Wang

We propose a recurrent neural network (RNN) based model for image multi-label classification. Our model uniquely integrates the learning of visual attention and Long Short-Term Memory (LSTM) layers, which jointly learn the labels of interest and their co-occurrences while the associated image regions are visually attended. Different from existing approaches that utilize either component alone in their network architectures, training our model does not require pre-defined label orders. Moreover, a robust inference process is introduced so that prediction errors do not propagate and thus degrade performance. Our experiments on the NUS-WIDE and MS-COCO datasets confirm the design of our network and its effectiveness in solving multi-label classification problems.


#10 Unsupervised Part-Based Weighting Aggregation of Deep Convolutional Features for Image Retrieval [PDF] [Copy] [Kimi] [REL]

Authors: Jian Xu, Cunzhao Shi, Chengzuo Qi, Chunheng Wang, Baihua Xiao

In this paper, we propose a simple but effective semantic part-based weighting aggregation (PWA) for image retrieval. The proposed PWA utilizes the discriminative filters of deep convolutional layers as part detectors. Moreover, we propose an effective unsupervised strategy to select part detectors and generate "probabilistic proposals," which highlight discriminative parts of objects and suppress background noise. The final global PWA representation is then acquired by aggregating the regional representations weighted by the selected "probabilistic proposals" corresponding to various semantic content. We conduct comprehensive experiments on four standard datasets and show that our unsupervised PWA outperforms state-of-the-art unsupervised and supervised aggregation methods.
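
An illustrative sketch of part-based weighted aggregation over a convolutional feature map: selected channels act as part detectors whose normalized activation maps weight a sum-pooling of the full map. The channel indices, the power normalization exponent, and the map size are assumptions; the paper selects detectors with its unsupervised strategy rather than by hand.

```python
import torch
import torch.nn.functional as F

def pwa(feature_map, part_channels, alpha=0.5):
    # feature_map: (C, H, W) conv activations; part_channels: channels used as detectors
    descriptors = []
    for c in part_channels:
        w = feature_map[c].clamp(min=0) ** alpha        # spatial "probabilistic proposal"
        w = w / (w.sum() + 1e-8)
        descriptors.append((feature_map * w).sum(dim=(1, 2)))   # weighted sum-pooled (C,)
    d = torch.cat(descriptors)                          # concatenated regional descriptors
    return F.normalize(d, dim=0)                        # L2-normalized global descriptor

fmap = torch.randn(512, 14, 14).clamp(min=0)            # e.g., ReLU'd conv5 activations
desc = pwa(fmap, part_channels=[3, 42, 77, 190])
```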


#11 Action Recognition With Coarse-to-Fine Deep Feature Integration and Asynchronous Fusion [PDF] [Copy] [Kimi] [REL]

Authors: Weiyao Lin, Chongyang Zhang, Ke Lu, Bin Sheng, Jianxin Wu, Bingbing Ni, Xin Liu, Hongkai Xiong

Action recognition is an important yet challenging task in computer vision. In this paper, we propose a novel deep learning based framework for action recognition, which improves recognition accuracy by: 1) deriving more precise features for representing actions, and 2) reducing the asynchrony between different information streams. We first introduce a coarse-to-fine network which extracts shared deep features at different action-class granularities and progressively integrates them to obtain a more accurate feature representation for input actions. We further introduce an asynchronous fusion network, which fuses information from different streams by asynchronously integrating stream-wise features at different time points, hence better leveraging the complementary information in the streams. Experimental results on action recognition benchmarks demonstrate that our approach achieves state-of-the-art performance.


#12 Video Generation From Text [PDF] [Copy] [Kimi] [REL]

Authors: Yitong Li, Martin Min, Dinghan Shen, David Carlson, Lawrence Carin

Generating videos from text has proven to be a significant challenge for existing generative models. We tackle this problem by training a conditional generative model to extract both static and dynamic information from text. This is manifested in a hybrid framework, employing a Variational Autoencoder (VAE) and a Generative Adversarial Network (GAN). The static features, called "gist," are used to sketch text-conditioned background color and object layout structure. Dynamic features are considered by transforming input text into an image filter. To obtain a large amount of data for training the deep-learning model, we develop a method to automatically create a matched text-video corpus from publicly available online videos. Experimental results show that the proposed framework generates plausible and diverse short-duration smooth videos, while accurately reflecting the input text information. It significantly outperforms baseline models that directly adapt text-to-image generation procedures to produce videos. Performance is evaluated both visually and by adapting the inception score used to evaluate image generation in GANs.


#13 Exploring Temporal Preservation Networks for Precise Temporal Action Localization [PDF] [Copy] [Kimi] [REL]

Authors: Ke Yang, Peng Qiao, Dongsheng Li, Shaohe Lv, Yong Dou

Temporal action localization is an important task in computer vision. Though a variety of methods have been proposed, it remains an open question how to predict the temporal boundaries of action segments precisely. Most works use segment-level classifiers to select video segments pre-determined by action proposals or dense sliding windows. However, to achieve more precise action boundaries, a temporal localization system should make dense predictions at a fine granularity. A recently proposed work exploits Convolutional-Deconvolutional-Convolutional (CDC) filters to upsample the predictions of 3D ConvNets, making it possible to perform per-frame action predictions and achieving promising temporal action localization performance. However, the CDC network partially loses temporal information due to its temporal downsampling operation. In this paper, we propose an elegant and powerful Temporal Preservation Convolutional (TPC) network that equips 3D ConvNets with TPC filters. The TPC network fully preserves the temporal resolution while downsampling the spatial resolution, enabling frame-level granularity action localization with minimal loss of temporal information, and it can be trained in an end-to-end manner. Experimental results on public datasets show that the TPC network achieves significant improvement in both per-frame action prediction and segment-level temporal action localization.
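
A small sketch of the temporal-preservation idea, assuming generic channel counts and clip sizes: 3D convolutions that stride only spatially, so the temporal resolution of the input clip is kept for frame-level prediction.

```python
import torch
import torch.nn as nn

tpc_block = nn.Sequential(
    nn.Conv3d(64, 128, kernel_size=3, stride=(1, 2, 2), padding=1),  # temporal stride 1
    nn.ReLU(),
    nn.Conv3d(128, 128, kernel_size=3, stride=(1, 2, 2), padding=1),
    nn.ReLU(),
)

clip = torch.randn(1, 64, 32, 56, 56)        # (N, C, T, H, W): a 32-frame clip
out = tpc_block(clip)
print(out.shape)                             # torch.Size([1, 128, 32, 14, 14]); T is preserved
```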


#14 Hierarchical LSTM for Sign Language Translation [PDF] [Copy] [Kimi] [REL]

Authors: Dan Guo, Wengang Zhou, Houqiang Li, Meng Wang

Continuous Sign Language Translation (SLT) is a challenging task due to its specific linguistics under sequential gesture variation without word alignment. Current hybrid HMM- and CTC (connectionist temporal classification)-based models are proposed to solve frame- or word-level alignment, but they may fail on sentences whose word order does not match the order of the visual content. To address this issue, this paper proposes a hierarchical-LSTM (HLSTM) encoder-decoder model with visual content and word embeddings for SLT. It tackles different granularities by conveying spatio-temporal transitions among frames, clips, and viseme units. It first explores spatio-temporal cues of video clips with a 3D CNN and packs appropriate visemes by online key-clip mining with adaptive variable length. After pooling the recurrent outputs of the top layer of the HLSTM, a temporal attention-aware weighting mechanism is proposed to balance the intrinsic relationship among viseme source positions. Finally, two further LSTM layers are used to separately recurse over viseme vectors and translate the semantics. By preserving the original visual content through the 3D CNN and the top layer of the HLSTM, the model shortens the encoding time steps of the bottom two LSTM layers with less computational complexity while attaining more nonlinearity. Our proposed model exhibits promising performance on signer-independent tests with seen sentences and also outperforms the comparison algorithms on unseen sentences.
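
A rough sketch of a hierarchical encoder in the spirit described above: a bottom LSTM runs over per-clip 3D-CNN features, its outputs are pooled into shorter viseme-level vectors, and a top LSTM encodes that abstracted sequence. The sizes and the fixed pooling window are illustrative assumptions; the paper mines key clips online with adaptive length.

```python
import torch
import torch.nn as nn

clip_dim, hid = 512, 256
bottom = nn.LSTM(clip_dim, hid, batch_first=True)
top    = nn.LSTM(hid, hid, batch_first=True)

clips = torch.randn(1, 40, clip_dim)          # 40 clip features from a 3D CNN
low, _ = bottom(clips)                        # (1, 40, hid)
viseme = low.view(1, 10, 4, hid).mean(dim=2)  # pool every 4 states into a viseme vector
high, _ = top(viseme)                         # shorter, more abstract encoding (1, 10, hid)
```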


#15 Cross-View Person Identification by Matching Human Poses Estimated With Confidence on Each Body Joint [PDF] [Copy] [Kimi] [REL]

Authors: Guoqiang Liang, Xuguang Lan, Kang Zheng, Song Wang, Nanning Zheng

Cross-view person identification (CVPI) from multiple temporally synchronized videos taken by multiple wearable cameras from different, varying views is a very challenging but important problem which has attracted increasing interest recently. The current state-of-the-art CVPI performance is achieved by matching appearance and motion features across videos, while matching pose features does not work effectively given the high inaccuracy of 3D human pose estimation on videos/images collected in the wild. In this paper, we introduce a new confidence metric for 3D human pose estimation and show that the combination of the inaccurately estimated human pose and the inferred confidence can be used to boost CVPI performance: the estimated pose information can be integrated with the appearance and motion features to achieve new state-of-the-art CVPI performance. More specifically, the estimated confidence is measured at each human-body joint, and joints with higher confidence are weighted more in the pose matching for CVPI. In the experiments, we validate the proposed method on three wearable-camera video datasets and compare its performance against several existing CVPI methods.
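
A toy sketch of confidence-weighted pose matching, under the assumption that per-joint confidences are available for both views: joints estimated with higher confidence contribute more to the distance between two poses. The 14-joint layout and the product weighting are illustrative choices, not the paper's exact formulation.

```python
import torch

def pose_distance(pose_a, pose_b, conf_a, conf_b):
    # pose_*: (J, 3) estimated 3D joint positions; conf_*: (J,) per-joint confidences in [0, 1]
    w = conf_a * conf_b                        # a joint counts only if both views are reliable
    w = w / (w.sum() + 1e-8)
    return (w * (pose_a - pose_b).norm(dim=1)).sum()

pa, pb = torch.randn(14, 3), torch.randn(14, 3)
ca, cb = torch.rand(14), torch.rand(14)
print(pose_distance(pa, pb, ca, cb))
```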


#16 Learning Coarse-to-Fine Structured Feature Embedding for Vehicle Re-Identification [PDF] [Copy] [Kimi] [REL]

Authors: Haiyun Guo, Chaoyang Zhao, Zhiwei Liu, Jinqiao Wang, Hanqing Lu

Vehicle re-identification (re-ID) aims to identify the same vehicle across different cameras. It is a significant but challenging topic which has received little attention due to the complex intra-class and inter-class variation of vehicle images and the lack of large-scale vehicle re-ID datasets. Previous methods focus on pulling images from different vehicles apart but neglect the discrimination between vehicles of different vehicle models, which is actually quite important for obtaining a correct ranking order in vehicle re-ID. In this paper, we learn a structured feature embedding for vehicle re-ID with a novel coarse-to-fine ranking loss that pulls images of the same vehicle as close as possible while discriminating between images of different vehicles as well as between vehicles of different vehicle models. In the learnt feature space, both intra-class compactness and inter-class distinction are well guaranteed, and the Euclidean distance between features directly reflects the semantic similarity of vehicle images. Furthermore, we build the largest vehicle re-ID dataset to date, "Vehicle-1M," which contains nearly 1 million images captured in various surveillance scenarios. Experimental results on "Vehicle-1M" and "VehicleID" demonstrate the superiority of our proposed approach.
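
A hedged sketch of a coarse-to-fine ranking constraint of the kind described above: images of the same vehicle should lie closer than images of a different vehicle of the same model, which in turn should lie closer than vehicles of a different model. The margins, embedding size, and sampling are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def coarse_to_fine_ranking(anchor, same_vehicle, same_model, other_model,
                           m_fine=0.3, m_coarse=0.6):
    d_pos   = F.pairwise_distance(anchor, same_vehicle)
    d_model = F.pairwise_distance(anchor, same_model)    # same model, different vehicle
    d_other = F.pairwise_distance(anchor, other_model)   # different vehicle model
    fine   = F.relu(d_pos + m_fine - d_model).mean()     # identity-level ranking
    coarse = F.relu(d_model + m_coarse - d_other).mean() # model-level ranking
    return fine + coarse

embed = lambda: F.normalize(torch.randn(32, 256), dim=1)  # stand-in embeddings
loss = coarse_to_fine_ranking(embed(), embed(), embed(), embed())
```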


#17 Lateral Inhibition-Inspired Convolutional Neural Network for Visual Attention and Saliency Detection [PDF] [Copy] [Kimi] [REL]

Authors: Chunshui Cao, Yongzhen Huang, Zilei Wang, Liang Wang, Ninglong Xu, Tieniu Tan

Lateral inhibition in top-down feedback widely exists in visual neurobiology, but this important mechanism has not yet been well explored in computer vision. In our recent research, we find that modeling lateral inhibition in a convolutional neural network (LICNN) is very useful for visual attention and saliency detection. In this paper, we propose to formulate lateral inhibition inspired by related studies in neurobiology, and embed it into the top-down gradient computation of a general CNN for classification, i.e., only category-level information is used. After this operation (conducted only once), the network has the ability to generate accurate category-specific attention maps. Further, we apply LICNN to weakly supervised salient object detection. Extensive experimental studies on a set of databases, e.g., ECSSD, HKU-IS, PASCAL-S and DUT-OMRON, demonstrate the great advantage of LICNN, which achieves state-of-the-art performance. It is especially impressive that LICNN with only category-level supervision even outperforms some recent methods trained with segmentation-level supervision.


#18 Acquiring Common Sense Spatial Knowledge Through Implicit Spatial Templates [PDF] [Copy] [Kimi] [REL]

Authors: Guillem Collell, Luc Van Gool, Marie-Francine Moens

Spatial understanding is a fundamental problem with wide-reaching real-world applications. The representation of spatial knowledge is often modeled with spatial templates, i.e., regions of acceptability of two objects under an explicit spatial relationship (e.g., "on," "below," etc.). In contrast with prior work that restricts spatial templates to explicit spatial prepositions (e.g., "glass on table"), here we extend this concept to implicit spatial language, i.e., those relationships (generally actions) for which the spatial arrangement of the objects is only implicitly implied (e.g., "man riding horse"). In contrast with explicit relationships, predicting spatial arrangements from implicit spatial language requires significant common sense spatial understanding. Here, we introduce the task of predicting spatial templates for two objects under a relationship, which can be seen as a spatial question-answering task with a (2D) continuous output ("where is the man w.r.t. a horse when the man is walking the horse?"). We present two simple neural-based models that leverage annotated images and structured text to learn this task. The good performance of these models reveals that spatial locations are to a large extent predictable from implicit spatial language. Crucially, the models attain similar performance in a challenging generalized setting, where the object-relation-object combinations (e.g., "man walking dog") have never been seen before. Next, we go one step further by presenting the models with unseen objects (e.g., "dog"). In this scenario, we show that leveraging word embeddings enables the models to output accurate spatial predictions, proving that the models acquire solid common sense spatial knowledge allowing for such generalization.
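
A minimal sketch of the task setup, under assumed sizes: predict the 2D location of one object relative to another from word embeddings of the (object, relation, object) triplet. The 300-dimensional embeddings, the hidden size, and the single-offset output are illustrative assumptions; the paper predicts full spatial templates.

```python
import torch
import torch.nn as nn

emb_dim = 300                                   # e.g., GloVe/word2vec-sized embeddings
model = nn.Sequential(
    nn.Linear(3 * emb_dim, 256), nn.ReLU(),
    nn.Linear(256, 2),                          # (dx, dy): object position w.r.t. subject
)

subj, rel, obj = torch.randn(1, emb_dim), torch.randn(1, emb_dim), torch.randn(1, emb_dim)
offset = model(torch.cat([subj, rel, obj], dim=1))   # "man", "riding", "horse" -> where?
```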


#19 Co-Attending Free-Form Regions and Detections With Multi-Modal Multiplicative Feature Embedding for Visual Question Answering [PDF] [Copy] [Kimi] [REL]

Authors: Pan Lu, Hongsheng Li, Wei Zhang, Jianyong Wang, Xiaogang Wang

Recently, the Visual Question Answering (VQA) task has gained increasing attention in artificial intelligence. Existing VQA methods mainly adopt the visual attention mechanism to associate the input question with corresponding image regions for effective question answering. The free-form region based and the detection-based visual attention mechanisms are mostly investigated, with the former ones attending free-form image regions and the latter ones attending pre-specified detection-box regions. We argue that the two attention mechanisms are able to provide complementary information and should be effectively integrated to better solve the VQA problem. In this paper, we propose a novel deep neural network for VQA that integrates both attention mechanisms. Our proposed framework effectively fuses features from free-form image regions, detection boxes, and question representations via a multi-modal multiplicative feature embedding scheme to jointly attend question-related free-form image regions and detection boxes for more accurate question answering. The proposed method is extensively evaluated on two publicly available datasets, COCO-QA and VQA, and outperforms state-of-the-art approaches. Source code is available at https://github.com/lupantech/dual-mfa-vqa.
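
A sketch of multiplicative feature embedding for one attention branch, with illustrative dimensions: question and region features are projected to a common space, fused by element-wise product, scored, and the resulting attention weights pool the regions. The paper applies this jointly to free-form regions and detection boxes and then fuses the two branches, which is not shown here.

```python
import torch
import torch.nn as nn

d_q, d_v, d_h, n_regions = 1024, 2048, 512, 36
proj_q = nn.Linear(d_q, d_h)
proj_v = nn.Linear(d_v, d_h)
score  = nn.Linear(d_h, 1)

def attend(question_feat, region_feats):
    # question_feat: (N, d_q); region_feats: (N, R, d_v)
    q = proj_q(question_feat).unsqueeze(1)                   # (N, 1, d_h)
    v = proj_v(region_feats)                                 # (N, R, d_h)
    fused = torch.tanh(q * v)                                # multiplicative embedding
    attn = torch.softmax(score(fused).squeeze(-1), dim=1)    # (N, R) attention weights
    return (attn.unsqueeze(-1) * region_feats).sum(dim=1)    # attended visual feature

pooled = attend(torch.randn(2, d_q), torch.randn(2, n_regions, d_v))
```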


#20 Graph Correspondence Transfer for Person Re-Identification [PDF] [Copy] [Kimi] [REL]

Authors: Qin Zhou, Heng Fan, Shibao Zheng, Hang Su, Xinzhe Li, Shuang Wu, Haibin Ling

In this paper, we propose a graph correspondence transfer (GCT) approach for person re-identification. Unlike existing methods, the GCT model formulates person re-identification as an off-line graph matching and on-line correspondence transfer problem. Specifically, during training, the GCT model learns off-line a set of correspondence templates from positive training pairs with various pose-pair configurations via patch-wise graph matching. During testing, for each pair of test samples, we select a few training pairs with the most similar pose-pair configurations as references, and transfer the correspondences of these references to the test pair for feature distance calculation. The matching score is derived by aggregating distances from different references. For each probe image, the gallery image with the highest matching score is returned as the re-identification result. Compared to existing algorithms, GCT can handle spatial misalignment caused by large variations in view angles and human poses owing to the benefits of patch-wise graph matching. Extensive experiments on five benchmarks, including VIPeR, Road, PRID450S, 3DPES and CUHK01, evidence the superior performance of the GCT model over other state-of-the-art methods.
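
A toy sketch of the test-time correspondence transfer step, assuming the off-line graph matching has already produced patch correspondences for each reference pair: the correspondences are reused to compare a probe/gallery pair patch by patch, and the scores from several references are aggregated. Patch counts, feature size, and averaging as the aggregation rule are illustrative assumptions.

```python
import torch

def transfer_score(probe_patches, gallery_patches, reference_correspondences):
    # *_patches: (P, D) patch features; each reference is a list of (i, j) patch index pairs
    scores = []
    for corr in reference_correspondences:
        d = torch.stack([(probe_patches[i] - gallery_patches[j]).norm() for i, j in corr])
        scores.append(-d.mean())                 # higher score = better match
    return torch.stack(scores).mean()            # aggregate over references

probe, gallery = torch.randn(24, 128), torch.randn(24, 128)
refs = [[(0, 0), (1, 2), (5, 4)], [(0, 1), (3, 3), (7, 6)]]
print(transfer_score(probe, gallery, refs))
```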


#21 SEE: Towards Semi-Supervised End-to-End Scene Text Recognition [PDF] [Copy] [Kimi] [REL]

Authors: Christian Bartz, Haojin Yang, Christoph Meinel

Detecting and recognizing text in natural scene images is a challenging, yet not completely solved, task. In recent years several new systems that try to solve at least one of the two sub-tasks (text detection and text recognition) have been proposed. In this paper we present SEE, a step towards semi-supervised neural networks for scene text detection and recognition that can be optimized end-to-end. Most existing works consist of multiple deep neural networks and several pre-processing steps. In contrast, we propose to use a single deep neural network that learns to detect and recognize text from natural images in a semi-supervised way. SEE integrates and jointly learns a spatial transformer network, which learns to detect text regions in an image, and a text recognition network that takes the identified text regions and recognizes their textual content. We introduce the idea behind our novel approach and show its feasibility by performing a range of experiments on standard benchmark datasets, where we achieve competitive results.
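
A short sketch of the spatial-transformer sampling step such a pipeline relies on: a predicted affine transform samples a text region out of the input image, which a recognition network then reads. The localization network that would predict theta is omitted here and replaced by a fixed, hypothetical centre crop; image and region sizes are assumptions.

```python
import torch
import torch.nn.functional as F

image = torch.randn(1, 3, 128, 256)                 # (N, C, H, W) scene image
# Affine parameters (N, 2, 3); in SEE-style models these come from a localization network.
theta = torch.tensor([[[0.5, 0.0, 0.0],
                       [0.0, 0.5, 0.0]]])
grid = F.affine_grid(theta, size=(1, 3, 32, 100), align_corners=False)
text_region = F.grid_sample(image, grid, align_corners=False)   # (1, 3, 32, 100) crop
# text_region would then be fed to the recognition sub-network.
```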


#22 Asking Friendly Strangers: Non-Semantic Attribute Transfer [PDF] [Copy] [Kimi] [REL]

Authors: Nils Murrugarra-Llerena, Adriana Kovashka

Attributes can be used to recognize unseen objects from a textual description. Learning them is oftentimes accomplished with a large amount of annotations, e.g., around 160k-180k, but what happens if, for a given attribute, we do not have many annotations? The standard approach would be to perform transfer learning, where we use source models trained on other attributes to learn a separate target attribute. However, existing approaches only consider transfer from attributes in the same domain, i.e., they perform semantic transfer between attributes that have related meanings. Instead, we propose to perform non-semantic transfer from attributes that may be in different domains and hence have no semantic relation to the target attributes. We develop an attention-guided transfer architecture that learns how to weigh the available source attribute classifiers and applies them to image features of the target attribute of interest to make predictions for that attribute. We validate our approach on 272 attributes from five domains: animals, objects, scenes, shoes and textures. We show that semantically unrelated attributes provide knowledge that helps improve the accuracy of the target attribute of interest, more so than only allowing transfer from semantically related attributes.
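
A hedged sketch of attention-guided transfer in this spirit: scores from source attribute classifiers are combined with image-conditioned attention weights to predict a target attribute that has few labels. The number of sources, the feature size, and the concatenation with the raw feature are assumptions made for illustration.

```python
import torch
import torch.nn as nn

n_sources, feat_dim = 50, 512
source_heads = nn.Linear(feat_dim, n_sources)   # one score per source attribute classifier
attention    = nn.Linear(feat_dim, n_sources)   # which "strangers" to listen to, per image
target_head  = nn.Linear(n_sources + feat_dim, 1)

def predict_target(x):
    src_scores = source_heads(x)                        # (N, n_sources)
    w = torch.softmax(attention(x), dim=1)              # attention over source classifiers
    weighted = w * src_scores
    return torch.sigmoid(target_head(torch.cat([weighted, x], dim=1)))

p = predict_target(torch.randn(4, feat_dim))            # target attribute probabilities
```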


#23 Deep Semantic Structural Constraints for Zero-Shot Learning [PDF] [Copy] [Kimi] [REL]

Authors: Yan Li, Zhen Jia, Junge Zhang, Kaiqi Huang, Tieniu Tan

Zero-shot learning aims to classify unseen image categories by learning a visual-semantic embedding space. In most cases, traditional methods adopt a separate two-step pipeline in which image features are first extracted and then utilized to learn the embedding space, which leaves the image features lacking structural semantic information specific to the zero-shot learning task. In this paper, we propose an end-to-end trainable Deep Semantic Structural Constraints model to address this issue. The proposed model contains an Image Feature Structure constraint and a Semantic Embedding Structure constraint, which aim to learn structure-preserving image features and to endow the learned embedding space with stronger generalization ability, respectively. With the assistance of this semantic structural information, the model gains more auxiliary clues for zero-shot learning. The state-of-the-art performance certifies the effectiveness of our proposed method.


#24 Adaptive Feature Abstraction for Translating Video to Text [PDF] [Copy] [Kimi] [REL]

Authors: Yunchen Pu, Martin Min, Zhe Gan, Lawrence Carin

Previous models for video captioning often use the output from a specific layer of a Convolutional Neural Network (CNN) as video features. However, the variable context-dependent semantics in a video may make it more appropriate to adaptively select features from multiple CNN layers. We propose a new approach for generating adaptive spatiotemporal representations of videos for the captioning task. A novel attention mechanism is developed that adaptively and sequentially focuses on different layers of CNN features (levels of feature "abstraction"), as well as on local spatiotemporal regions of the feature maps at each layer. The proposed approach is evaluated on three benchmark datasets: YouTube2Text, M-VAD and MSR-VTT. Along with visualizing the results and how the model works, these experiments quantitatively demonstrate the effectiveness of the proposed adaptive spatiotemporal feature abstraction for translating videos to sentences with rich semantics.
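
A sketch of attending over feature "abstraction" levels, assuming pooled features from several CNN layers and an LSTM decoder state: each layer's feature is projected to a shared size and mixed with decoder-conditioned attention weights at each captioning step. Layer dimensions and the scoring function are illustrative choices; the spatiotemporal region attention described above is omitted.

```python
import torch
import torch.nn as nn

layer_dims, d = [256, 512, 512], 512
projs = nn.ModuleList([nn.Linear(c, d) for c in layer_dims])
score = nn.Linear(2 * d, 1)

def attend_layers(layer_feats, h):
    # layer_feats: list of (N, C_l) pooled features, one per CNN layer; h: (N, d) LSTM state
    cands = torch.stack([p(f) for p, f in zip(projs, layer_feats)], dim=1)   # (N, L, d)
    e = score(torch.cat([cands, h.unsqueeze(1).expand_as(cands)], dim=-1))   # (N, L, 1)
    a = torch.softmax(e, dim=1)                                              # layer weights
    return (a * cands).sum(dim=1)            # context vector for the word decoder

ctx = attend_layers([torch.randn(2, c) for c in layer_dims], torch.randn(2, d))
```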


#25 Char-Net: A Character-Aware Neural Network for Distorted Scene Text Recognition [PDF] [Copy] [Kimi] [REL]

Authors: Wei Liu, Chaofeng Chen, Kwan-Yee Wong

In this paper, we present a Character-Aware Neural Network (Char-Net) for recognizing distorted scene text. Our Char-Net is composed of a word-level encoder, a character-level encoder, and an LSTM-based decoder. Unlike previous work which employed a global spatial transformer network to rectify the entire distorted text image, we take the approach of detecting and rectifying individual characters. To this end, we introduce a novel hierarchical attention mechanism (HAM) which consists of a recurrent RoIWarp layer and a character-level attention layer. The recurrent RoIWarp layer sequentially extracts a feature region corresponding to a character from the feature map produced by the word-level encoder, and feeds it to the character-level encoder, which removes the distortion of the character through a simple spatial transformer and further encodes the character region. The character-level attention layer then attends to the most relevant features of the feature map produced by the character-level encoder and composes a context vector, which is finally fed to the LSTM-based decoder for decoding. This approach of adopting a simple local transformation to model the distortion of individual characters not only results in improved efficiency, but can also handle different types of distortion that are hard, if not impossible, to model with a single global transformation. Experiments have been conducted on six public benchmark datasets. Our results show that Char-Net achieves state-of-the-art performance on all the benchmarks, especially on IC-IST, which contains scene text with large distortions. Code will be made available.