AAAI.2017 - Vision

| Total: 53

#1 Detection and Recognition of Text Embedded in Online Images via Neural Context Models

Authors: Chulmoo Kang, Gunhee Kim, Suk Yoo

We address the problem of detecting and recognizing text embedded in online images that are circulated over the Web. Our idea is to leverage context information for both text detection and recognition. For detection, we use local image context around the text region, based on the observation that text often appears sequentially in online images. For recognition, we exploit the metadata associated with the input online image, including tags, comments, and the title, which serve as a topic prior for the word candidates in the image. To fuse these two sets of context information, we propose a contextual text spotting network (CTSN). We perform a comparative evaluation with five state-of-the-art text spotting methods on newly collected Instagram and Flickr datasets, and show that our context-aware approach is more successful for text spotting in online images.


#2 Cross-View People Tracking by Scene-Centered Spatio-Temporal Parsing

Authors: Yuanlu Xu, Xiaobai Liu, Lei Qin, Song-Chun Zhu

In this paper, we propose a Spatio-temporal Attributed Parse Graph (ST-APG) to integrate semantic attributes with trajectories for cross-view people tracking. Given videos from multiple cameras with overlapping fields of view (FOV), our goal is to parse the videos and organize the trajectories of all targets into a scene-centered representation. Beyond the appearance and geometry features commonly used in the literature, we leverage rich semantic attributes of humans, e.g., facing directions, postures and actions, to enhance cross-view tracklet associations. In particular, the facing direction of a human in 3D, once detected, often coincides with his/her moving direction or trajectory. Similarly, the actions of humans, once recognized, provide strong cues for distinguishing one subject from the others. The inference is solved by iteratively grouping tracklets with cluster sampling and estimating people's semantic attributes by dynamic programming. In experiments, we validate our method on one public dataset and create a new dataset that records people's daily life in public spaces, e.g., a food court, an office reception and a plaza, each covered by 3-4 cameras. We evaluate the proposed method on these challenging videos and achieve promising multi-view tracking results.


#3 Zero-Shot Recognition via Direct Classifier Learning with Transferred Samples and Pseudo Labels

Authors: Yuchen Guo, Guiguang Ding, Jungong Han, Yue Gao

As an interesting and emerging topic, zero-shot recognition (ZSR) makes it possible to train a recognition model by specifying a category's attributes when no labeled exemplars are available. The fundamental idea of ZSR is to transfer knowledge from the abundant labeled data in different but related source classes via the class attributes. Conventional ZSR approaches adopt a two-step strategy at test time: samples are first projected into the attribute space, and recognition is then carried out by considering the relationship between samples and classes in that space. Due to this intermediate transformation, information loss is unavoidable, degrading the performance of the overall system. Rather than following this two-step strategy, in this paper we propose a novel one-step approach that performs ZSR in the original feature space using directly trained classifiers. To tackle the problem that no labeled samples of the target classes are available, we propose to assign pseudo labels to samples based on reliability and diversity, and use these labels to train the classifiers. Moreover, we adopt a robust SVM that accounts for the unreliability of pseudo labels. Extensive experiments on four datasets demonstrate consistent performance gains of our approach over state-of-the-art two-step ZSR approaches.


#4 Efficient Object Instance Search Using Fuzzy Objects Matching

Authors: Tan Yu, Yuwei Wu, Sreyasee Bhattacharjee, Junsong Yuan

Recently, global features aggregated from the local convolutional features of a convolutional neural network have been shown to be much more effective than hand-crafted features for image retrieval. However, a global feature may not effectively capture the relevance between the query object and reference images in the object instance search task, especially when the query object is relatively small and the reference images contain multiple types of objects. Moreover, object instance search requires localizing the object in the reference image, which is hard to achieve with global representations. In this paper, we propose a Fuzzy Objects Matching (FOM) framework to effectively and efficiently capture the relevance between the query object and the reference images in the dataset. In the proposed FOM scheme, object proposals are utilized to detect the potential regions of the query object in reference images. To achieve high search efficiency, we factorize the feature matrix of all the object proposals from one reference image into the product of a set of fuzzy objects and sparse codes. In addition, we refine the features of the generated fuzzy objects according to their neighborhoods in the feature space to generate more robust representations. Experimental results demonstrate that the proposed FOM framework significantly outperforms state-of-the-art methods in precision with less memory and computational cost on three public datasets.
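
A quick way to see the efficiency gain of the factorization: once the proposal feature matrix is expressed as fuzzy objects times codes, matching a query needs only a handful of dot products in the feature dimension. The sketch below uses randomly picked fuzzy objects and least-squares (rather than sparse) codes purely for illustration; all shapes and names are assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 256, 300, 20           # feature dim, #proposals, #fuzzy objects

F = rng.standard_normal((d, n))  # proposal features of one reference image

# Factorize F ~= D @ C: fuzzy objects D and (here, dense) codes C.
idx = rng.choice(n, m, replace=False)
D = F[:, idx]                              # fuzzy objects (d x m)
C, *_ = np.linalg.lstsq(D, F, rcond=None)  # codes (m x n)

q = rng.standard_normal(d)                 # query object feature

exact = q @ F                 # n dot products in dimension d
fast = (q @ D) @ C            # only m such dot products, then a cheap product

print("max approximation error:", np.abs(exact - fast).max())
```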


#5 Weakly-Supervised Deep Nonnegative Low-Rank Model for Social Image Tag Refinement and Assignment

Authors: Zechao Li, Jinhui Tang

It is well known that the user-provided tags of social images are imperfect: there exist noisy, irrelevant or incomplete tags, which heavily degrades the performance of many multimedia tasks. To alleviate this problem, we propose a Weakly-supervised Deep Nonnegative Low-rank model (WDNL) to improve the quality of tags by integrating a low-rank model with deep feature learning. A nonnegative low-rank model is introduced to uncover the intrinsic relationships between images and tags by simultaneously removing noisy or irrelevant tags and complementing missing tags. The deep architecture is leveraged to seamlessly connect the visual content with the semantic tags, so the proposed model scales well by assigning tags to new images. Extensive experiments conducted on two real-world datasets demonstrate the effectiveness of the proposed method compared with state-of-the-art methods.


#6 Online Multi-Target Tracking Using Recurrent Neural Networks

Authors: Anton Milan, S. Hamid Rezatofighi, Anthony Dick, Ian Reid, Konrad Schindler

We present a novel approach to online multi-target tracking based on recurrent neural networks (RNNs). Tracking multiple objects in real-world scenes involves many challenges, including a) an a priori unknown and time-varying number of targets, b) continuous state estimation of all present targets, and c) a discrete combinatorial problem of data association. Most previous methods involve complex models that require tedious tuning of parameters. Here, we propose, for the first time, an end-to-end learning approach for online multi-target tracking. Existing deep learning methods are not designed for the above challenges and cannot be trivially applied to the task. Our solution addresses all of the above points in a principled way. Experiments on both synthetic and real data show promising results obtained at ~300 Hz on a standard CPU, and pave the way for future research in this direction.
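
The end-to-end flavor of the method can be sketched with a single recurrent cell that maps the targets' previous states and the current detections to updated states plus per-target existence probabilities. The GRU cell, the fixed maximum number of targets, and all dimensions below are simplifying assumptions; the data-association component is omitted.

```python
import torch
import torch.nn as nn

MAX_TARGETS, STATE_DIM, HIDDEN = 5, 4, 64     # state = (x, y, w, h), assumed

class TrackRNN(nn.Module):
    def __init__(self):
        super().__init__()
        in_dim = MAX_TARGETS * STATE_DIM * 2  # previous states + detections
        self.cell = nn.GRUCell(in_dim, HIDDEN)
        self.state_head = nn.Linear(HIDDEN, MAX_TARGETS * STATE_DIM)
        self.exist_head = nn.Linear(HIDDEN, MAX_TARGETS)

    def forward(self, prev_states, detections, h):
        h = self.cell(torch.cat([prev_states, detections], dim=-1), h)
        states = self.state_head(h).view(-1, MAX_TARGETS, STATE_DIM)
        exist = torch.sigmoid(self.exist_head(h))  # birth/death probabilities
        return states, exist, h

model = TrackRNN()
h = torch.zeros(1, HIDDEN)
prev = torch.zeros(1, MAX_TARGETS * STATE_DIM)
dets = torch.rand(1, MAX_TARGETS * STATE_DIM)
states, exist, h = model(prev, dets, h)
print(states.shape, exist.shape)              # (1, 5, 4) (1, 5)
```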


#7 Non-Rigid Point Set Registration with Robust Transformation Estimation under Manifold Regularization

Authors: Jiayi Ma, Ji Zhao, Junjun Jiang, Huabing Zhou

In this paper, we propose a robust transformation estimation method based on manifold regularization for non-rigid point set registration. The method iteratively recovers the point correspondence and estimates the spatial transformation between two point sets. The correspondence is established with existing local feature descriptors, which typically yields a number of outliers. To achieve an accurate estimate of the transformation from such putative point correspondences, we formulate the registration problem as a mixture model with a set of latent variables introduced to identify outliers, and a prior involving manifold regularization is imposed on the transformation to capture the underlying intrinsic geometry of the input data. The non-rigid transformation is specified in a reproducing kernel Hilbert space, and a sparse approximation is adopted to achieve a fast implementation. Extensive experiments on both 2D and 3D data demonstrate that our method yields superior results compared to other state-of-the-art methods, especially in the case of badly degraded data.
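
To illustrate the RKHS ingredient, the toy sketch below fits a smooth Gaussian-kernel displacement field to clean correspondences: the field is v(x) = sum_i K(x, x_i) w_i and the coefficients solve a regularized linear system. The paper's latent-variable outlier handling and manifold-regularization prior are omitted, and lambda/beta are assumed hyper-parameters.

```python
import numpy as np

def gaussian_kernel(A, B, beta=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-beta * d2)

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (50, 2))                  # source points
Y = X + 0.1 * np.sin(3 * X)                      # deformed target points

K = gaussian_kernel(X, X)
lam = 1e-2                                       # smoothness regularizer
W = np.linalg.solve(K + lam * np.eye(len(X)), Y - X)   # RKHS coefficients

X_warped = X + K @ W                             # transformed source points
print("residual:", np.abs(X_warped - Y).max())
```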


#8 TextBoxes: A Fast Text Detector with a Single Deep Neural Network

Authors: Minghui Liao, Baoguang Shi, Xiang Bai, Xinggang Wang, Wenyu Liu

This paper presents an end-to-end trainable fast scene text detector, named TextBoxes, which detects scene text with both high accuracy and efficiency in a single network forward pass, involving no post-processing except standard non-maximum suppression. TextBoxes outperforms competing methods in terms of text localization accuracy and is much faster, taking only 0.09s per image in a fast implementation. Furthermore, combined with a text recognizer, TextBoxes significantly outperforms state-of-the-art approaches on word spotting and end-to-end text recognition tasks.
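
Since the abstract emphasizes that standard non-maximum suppression is the only post-processing step, a minimal reference NMS is sketched below (boxes as [x1, y1, x2, y2]; the 0.45 IoU threshold is an assumption, not the paper's setting).

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.45):
    """Keep the highest-scoring boxes, dropping heavy overlaps."""
    order = scores.argsort()[::-1]
    keep = []
    area = lambda b: (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # intersection of the top box with the remaining ones
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (area(boxes[i:i + 1])[0] + area(boxes[rest]) - inter)
        order = rest[iou <= iou_thr]        # discard suppressed boxes
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
print(nms(boxes, np.array([0.9, 0.8, 0.7])))    # -> [0, 2]
```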


#9 Attention Correctness in Neural Image Captioning

Authors: Chenxi Liu, Junhua Mao, Fei Sha, Alan Yuille

Attention mechanisms have recently been introduced in deep learning for various tasks in natural language processing and computer vision. But despite their popularity, the "correctness" of the implicitly-learned attention maps has only been assessed qualitatively, by visualizing a handful of examples. In this paper we focus on evaluating and improving the correctness of attention in neural image captioning models. Specifically, we propose a quantitative evaluation metric for the consistency between the generated attention maps and human annotations, using recently released datasets with alignment between regions in images and entities in captions. We then propose novel models with different levels of explicit supervision for learning attention maps during training. The supervision can be strong when alignment between regions and caption entities is available, or weak when only object segments and categories are provided. We show on the popular Flickr30k and COCO datasets that introducing supervision of attention maps during training solidly improves both attention correctness and caption quality, showing the promise of making machine perception more human-like.
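
One natural instantiation of such a consistency metric is the fraction of the attention map's mass that falls inside the human-annotated region for the aligned entity. The sketch below illustrates that idea; the paper's exact metric may differ in detail.

```python
import numpy as np

def attention_correctness(attn, region_mask):
    """attn: (H, W) non-negative attention map; region_mask: (H, W) binary."""
    attn = attn / attn.sum()             # normalize to a distribution
    return float((attn * region_mask).sum())

attn = np.random.rand(14, 14)            # e.g. a 14x14 soft attention map
mask = np.zeros((14, 14))
mask[3:8, 4:9] = 1                       # human-annotated entity region
print(attention_correctness(attn, mask)) # 1.0 would mean perfect alignment
```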


#10 Reference Based LSTM for Image Captioning

Authors: Minghai Chen, Guiguang Ding, Sicheng Zhao, Hui Chen, Qiang Liu, Jungong Han

Image captioning is an important problem in artificial intelligence, related to both computer vision and natural language processing. Existing methods suffer from two main problems: in the training phase, it is difficult to determine which parts of the captions are most essential to the image; in the caption generation phase, objects or scenes are sometimes misrecognized. In this paper, we treat the training images as references and propose a Reference-based Long Short-Term Memory (R-LSTM) model that addresses both problems in a unified way. When training the model, we assign different weights to different words, which enables the network to better learn the key information of the captions. When generating a caption, a consensus score is used to exploit the reference information of neighboring images, which can fix misrecognitions and make the descriptions more natural-sounding. The proposed R-LSTM model outperforms state-of-the-art approaches on the MS COCO benchmark and ranks in the top two on 11 of the 14 metrics on the online test server.
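
A toy illustration of consensus-based re-scoring at generation time: a candidate caption is rewarded for agreeing with the captions of the input image's nearest-neighbor training images. The word-overlap similarity below is only a stand-in for the actual consensus score, and all names are illustrative.

```python
def similarity(a, b):
    """Jaccard word overlap: a crude stand-in for a caption similarity."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

def consensus_score(candidate, neighbor_captions):
    """Average agreement with captions of visually similar images."""
    return sum(similarity(candidate, c) for c in neighbor_captions) / len(neighbor_captions)

neighbors = ["a man riding a wave on a surfboard",
             "a surfer rides a large wave",
             "a man on a surfboard in the ocean"]
for cand in ["a man surfing a wave", "a plate of food on a table"]:
    print(cand, "->", round(consensus_score(cand, neighbors), 3))
```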


#11 Multi-Path Feedback Recurrent Neural Networks for Scene Parsing

Authors: Xiaojie Jin, Yunpeng Chen, Zequn Jie, Jiashi Feng, Shuicheng Yan

In this paper, we consider the scene parsing problem and propose a novel Multi-Path Feedback recurrent neural network (MPF-RNN) for parsing scene images. MPF-RNN enhances the capability of RNNs to model long-range context information at multiple levels and to better distinguish pixels that are easy to confuse. Different from feedforward CNNs and RNNs with only a single feedback, MPF-RNN propagates the contextual features learned at the top layer through multiple weighted recurrent connections to learn bottom features. To train MPF-RNN more effectively, we propose a new strategy that accumulates the loss over multiple recurrent steps, which improves MPF-RNN's performance on parsing small objects. With these two novel components, MPF-RNN achieves significant improvements over strong baselines (VGG16 and Res101) on five challenging scene parsing benchmarks, including the traditional SiftFlow, Barcelona, CamVid and Stanford Background datasets as well as the recently released large-scale ADE20K.
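
A schematic sketch of the two components — feedback paths from the top layer back to a bottom layer, and a loss accumulated over recurrent steps — is given below. The layer sizes, the number of steps, and the exact feedback wiring are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMPF(nn.Module):
    def __init__(self, ch=16, n_cls=5, steps=3):
        super().__init__()
        self.steps = steps
        self.bottom = nn.Conv2d(3, ch, 3, padding=1)
        self.top = nn.Conv2d(ch, ch, 3, padding=1)
        # one learnable feedback path per recurrent step
        self.feedback = nn.ModuleList([nn.Conv2d(ch, ch, 1) for _ in range(steps)])
        self.classify = nn.Conv2d(ch, n_cls, 1)

    def forward(self, x, target=None):
        feat = F.relu(self.bottom(x))
        loss, logits = 0.0, None
        for t in range(self.steps):
            top = F.relu(self.top(feat))
            logits = self.classify(top)
            if target is not None:               # accumulate loss over steps
                loss = loss + F.cross_entropy(logits, target)
            # feed top-layer context back into the bottom features
            feat = F.relu(self.bottom(x) + self.feedback[t](top))
        return logits, loss

model = TinyMPF()
img = torch.rand(2, 3, 32, 32)
labels = torch.randint(0, 5, (2, 32, 32))
logits, loss = model(img, labels)
print(logits.shape, float(loss))
```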


#12 Learning Patch-Based Dynamic Graph for Visual Tracking

Authors: Chenglong Li, Liang Lin, Wangmeng Zuo, Jin Tang

Existing visual tracking methods usually localize the object with a bounding box, in which case the foreground trackers/detectors are often disturbed by the background information the box introduces. To handle this problem, we aim to learn a more robust object representation for visual tracking. In particular, the tracked object is represented with a graph structure (i.e., a set of non-overlapping image patches), in which the weight of each node (patch) indicates how likely it is to belong to the foreground, and edges are weighted to indicate the appearance compatibility of two neighboring nodes. This graph is learned dynamically (i.e., the nodes and edges receive weights) and applied in object tracking and model updating. We constrain the graph learning from two aspects: i) the global low-rank structure over all nodes, and ii) the local sparseness of node neighbors. During tracking, our method performs the following steps at each frame. First, the graph is initialized by assigning weights of either 1 or 0 to some image patches according to the predicted bounding box. Second, the graph is optimized with a newly designed ALM (Augmented Lagrange Multiplier) based algorithm. Third, the object feature representation is updated by imposing the patch weights on the extracted image features. The object location is finally predicted by adopting the Struck tracker. Extensive experiments show that our approach outperforms state-of-the-art tracking methods on two standard benchmarks, i.e., OTB100 and NUS-PRO.


#13 A Multi-Task Deep Network for Person Re-Identification

Authors: Weihua Chen, Xiaotang Chen, Jianguo Zhang, Kaiqi Huang

Person re-identification (ReID) focuses on identifying people across different scenes in video surveillance, and is usually formulated as either a binary classification task or a ranking task in current person ReID approaches. In this paper, we take both tasks into account and propose a multi-task deep network (MTDnet) that makes use of their respective advantages and jointly optimizes the two tasks for person ReID. To the best of our knowledge, we are the first to integrate both tasks in one network for person ReID. We show that our proposed architecture significantly boosts performance. Furthermore, deep architectures in general require sufficient data for training, which is usually not available in person ReID. To cope with this situation, we further extend MTDnet and propose a cross-domain architecture that is capable of using an auxiliary set to assist training on small target sets. In the experiments, our approach outperforms most existing person ReID algorithms on representative datasets including CUHK03, CUHK01, VIPeR, iLIDS and PRID2011, which clearly demonstrates the effectiveness of the proposed approach.
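
A minimal sketch of the joint objective — a same/different pair-classification loss plus a triplet ranking loss computed on a shared embedding — is shown below. The tiny backbone, the margin, and the equal loss weights are illustrative assumptions, not the MTDnet architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# shared embedding network (placeholder for a deep backbone)
embed = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 32, 128))
pair_cls = nn.Linear(256, 2)       # same/different person from a feature pair

anchor = torch.rand(8, 3, 64, 32)
pos = torch.rand(8, 3, 64, 32)     # same identity as anchor
neg = torch.rand(8, 3, 64, 32)     # different identity

fa, fp, fneg = embed(anchor), embed(pos), embed(neg)

# ranking task: anchor should be closer to positive than to negative
rank_loss = F.triplet_margin_loss(fa, fp, fneg, margin=0.3)

# classification task: is the (anchor, positive) pair the same identity?
logits = pair_cls(torch.cat([fa, fp], dim=1))
cls_loss = F.cross_entropy(logits, torch.ones(8, dtype=torch.long))

loss = cls_loss + rank_loss        # jointly optimized objective
print(float(loss))
```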


#14 Localizing by Describing: Attribute-Guided Attention Localization for Fine-Grained Recognition

Authors: Xiao Liu, Jiang Wang, Shilei Wen, Errui Ding, Yuanqing Lin

A key challenge in fine-grained recognition is how to find and represent discriminative local regions. Recent attention models can learn discriminative region localizers from category labels alone with reinforcement learning. However, without any explicit part information, they are not able to accurately find multiple distinctive regions. In this work, we introduce an attribute-guided attention localization scheme where the local region localizers are learned under the guidance of part attribute descriptions. By designing a novel reward strategy, we learn to locate regions that are spatially and semantically distinctive with a reinforcement learning algorithm. The attribute labeling required by the scheme is far less demanding than the accurate part-location annotation required by traditional part-based fine-grained recognition methods. Experimental results on the CUB-200-2011 dataset demonstrate the superiority of the proposed scheme on both fine-grained recognition and attribute recognition.


#15 Image Cosegmentation via Saliency-Guided Constrained Clustering with Cosine Similarity

Authors: Zhiqiang Tao, Hongfu Liu, Huazhu Fu, Yun Fu

Cosegmentation jointly segments the common objects from multiple images. In this paper, a novel clustering algorithm, called Saliency-Guided Constrained Clustering with Cosine similarity (SGC3), is proposed for the image cosegmentation task, where the common foregrounds are extracted via a one-step clustering process. In our method, an unsupervised saliency prior is utilized as partition-level side information to guide the clustering process. To guarantee robustness to noise and outliers in the given prior, instance-level and partition-level similarities are jointly computed for cosegmentation. Specifically, we employ the cosine distance to calculate the feature similarity between a data point and its cluster centroid, and introduce a cosine utility function to measure the similarity between the clustering result and the side information. Both parts are based on the cosine similarity, which captures the intrinsic structure of the data, especially for non-spherical cluster structures. Finally, a K-means-like optimization is designed to solve our objective function efficiently. Experimental results on two widely-used datasets demonstrate that our approach achieves competitive performance against state-of-the-art cosegmentation methods.
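
A compact sketch of a K-means-like loop under cosine similarity with partition-level side information follows: each point is assigned to the centroid that maximizes cosine similarity plus a bonus for agreeing with the saliency-derived prior. The additive weight alpha and the prior encoding are assumptions for illustration, not the paper's utility function.

```python
import numpy as np

def sgc3_like(X, prior, k=2, alpha=0.3, iters=20, seed=0):
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm features
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]        # initial centroids
    for _ in range(iters):
        sim = X @ C.T                                  # instance-level cosine term
        sim[np.arange(len(X)), prior] += alpha         # partition-level agreement
        labels = sim.argmax(1)
        for j in range(k):                             # re-estimate unit centroids
            if (labels == j).any():
                c = X[labels == j].mean(0)
                C[j] = c / np.linalg.norm(c)
    return labels

X = np.vstack([np.random.randn(30, 5) + 3, np.random.randn(30, 5) - 3])
prior = np.array([0] * 30 + [1] * 30)   # e.g. saliency: foreground vs. background
print(sgc3_like(X, prior))
```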


#16 Boosting Complementary Hash Tables for Fast Nearest Neighbor Search

Authors: Xianglong Liu, Cheng Deng, Yadong Mu, Zhujin Li

Hashing has been proven a promising technique for fast nearest neighbor search over massive databases. In many practical tasks one usually builds multiple hash tables to reach a desired level of recall. However, existing multi-table hashing methods suffer from heavy table redundancy, lacking strong table complementarity and effective hash code learning. To address this problem, this paper proposes a multi-table learning method which pursues a specified number of complementary and informative hash tables from the perspective of ensemble learning. By regarding each hash table as a neighbor prediction model, the multi-table search procedure boils down to a linear assembly of predictions stemming from multiple tables. Therefore, a sequential updating and learning framework is naturally established in a boosting mechanism, theoretically guaranteeing table complementarity and algorithmic convergence. Furthermore, each boosting round pursues discriminative hash functions for each table through a discrete optimization in the binary code space. Extensive experiments on two popular tasks, Euclidean and semantic nearest neighbor search, demonstrate that the proposed boosted complementary hash-table method enjoys strong table complementarity and significantly outperforms state-of-the-art methods.
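
The ensemble view can be sketched as follows: each table votes for the database items that share the query's bucket, and the votes are linearly assembled with per-table weights. Random LSH-style projections and equal weights below stand in for the learned tables and boosting weights of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 64))         # database features
q = rng.standard_normal(64)                 # query feature

n_tables, n_bits = 4, 12
votes = np.zeros(len(X))
weights = np.ones(n_tables)                 # learned by boosting in the paper

for t in range(n_tables):
    P = rng.standard_normal((64, n_bits))   # one table's hash functions
    codes = X @ P > 0                       # binary codes of the database
    q_code = q @ P > 0
    hits = (codes == q_code).all(1)         # items in the query's bucket
    votes += weights[t] * hits              # linear assembly of predictions

candidates = np.flatnonzero(votes > 0)
print(len(candidates), "candidates gathered from", n_tables, "tables")
```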


#17 Visual Object Tracking for Unmanned Aerial Vehicles: A Benchmark and New Motion Models

Authors: Siyi Li, Dit-Yan Yeung

Despite recent advances in the visual tracking community, most studies so far have focused on the observation model. The motion model, another important component of a tracking system, is much less explored, especially in extreme scenarios. In this paper, we consider one such scenario in which the camera is mounted on an unmanned aerial vehicle (UAV), or drone. We build a benchmark dataset of high diversity, consisting of 70 videos captured by drone cameras. To address the challenging issue of severe camera motion, we devise simple baselines that model the camera motion by geometric transformations based on background feature points. An extensive comparison of recent state-of-the-art trackers and their motion model variants on our drone tracking dataset validates both the necessity of the dataset and the effectiveness of the proposed methods. Our aim in this work is to lay the foundation for further research in the UAV tracking area.


#18 Face Hallucination with Tiny Unaligned Images by Transformative Discriminative Neural Networks

Authors: Xin Yu, Fatih Porikli

Conventional face hallucination methods rely heavily on accurate alignment of low-resolution (LR) faces before upsampling them. Misalignment often leads to deficient results and unnatural artifacts at large upscaling factors. However, due to the diverse range of poses and facial expressions, aligning an LR input image, in particular when it is tiny, is severely difficult. To overcome this challenge, we present an end-to-end transformative discriminative neural network (TDN) devised for super-resolving unaligned and very small face images with an extreme upscaling factor of 8. Our method employs an upsampling network in which we embed spatial transformation layers to allow local receptive fields to line up with similar spatial supports. Furthermore, we incorporate a class-specific loss into our objective through a successive discriminative network to improve the alignment and upsampling performance with semantic information. Extensive experiments on large face datasets show that the proposed method significantly outperforms the state-of-the-art.
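
A schematic sketch of embedding a spatial-transformation layer inside an upsampling network follows, using torch's affine_grid/grid_sample as the differentiable warp. The layer sizes and the 8x upsampling head are illustrative assumptions, and the successive discriminative network is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignAndUpsample(nn.Module):
    def __init__(self):
        super().__init__()
        # localization net predicting a 2x3 affine alignment
        self.loc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 16, 6))
        self.loc[1].weight.data.zero_()            # start at the identity warp
        self.loc[1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))
        self.up = nn.Sequential(                   # crude 8x upscaling head
            nn.Upsample(scale_factor=8, mode='bilinear', align_corners=False),
            nn.Conv2d(3, 3, 3, padding=1))

    def forward(self, x):                          # x: (B, 3, 16, 16) tiny face
        theta = self.loc(x).view(-1, 2, 3)
        grid = F.affine_grid(theta, list(x.size()), align_corners=False)
        x = F.grid_sample(x, grid, align_corners=False)   # differentiable warp
        return self.up(x)

out = AlignAndUpsample()(torch.rand(2, 3, 16, 16))
print(out.shape)                                   # (2, 3, 128, 128)
```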


#19 Robust Visual Tracking via Local-Global Correlation Filter

Authors: Heng Fan, Jinhai Xiang

Correlation filters have drawn increasing interest in visual tracking due to their high efficiency; however, they are sensitive to partial occlusion, which may result in tracking failure. To address this problem, we propose a novel local-global correlation filter (LGCF) for object tracking. Our LGCF model utilizes both local-based and global-based strategies, and effectively combines the two by exploiting the relationship of circular shifts among local object parts and the global target in their motion models, preserving the structure of the object. Specifically, our proposed model has two advantages: (1) Owing to the local-based mechanism, our method is robust to partial occlusion by leveraging the visible parts. (2) By taking into account the relationship between the motion models of the local parts and the global target, our LGCF model captures the inner structure of the object, which further improves its robustness to occlusion. In addition, to alleviate drifting away from the object, we incorporate temporal consistency of both local parts and the global target into our LGCF model. We also adopt an adaptive method to accurately estimate the scale of the object. Extensive experiments on OTB15 with 100 videos demonstrate that our tracking algorithm performs favorably against state-of-the-art methods.
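
The underlying building block is the standard correlation filter, which has a closed form in the Fourier domain; a single-filter (MOSSE-style) sketch is given below, whereas the paper combines such filters over local parts and the global target. The Gaussian label map and regularizer lambda are assumed settings.

```python
import numpy as np

def train_filter(patch, sigma=2.0, lam=1e-2):
    """Closed-form correlation filter for one grayscale patch."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # desired Gaussian response, peaked at the patch center
    y = np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2 * sigma ** 2))
    Y, X = np.fft.fft2(y), np.fft.fft2(patch)
    return np.conj(X) * Y / (np.conj(X) * X + lam)   # filter in Fourier domain

def respond(H, patch):
    """Correlation response of a search patch; its peak locates the target."""
    return np.real(np.fft.ifft2(H * np.fft.fft2(patch)))

target = np.random.rand(32, 32)
H = train_filter(target)
resp = respond(H, target)
print(np.unravel_index(resp.argmax(), resp.shape))   # peak near (16, 16)
```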


#20 A Multiview-Based Parameter Free Framework for Group Detection

Authors: Xuelong Li, Mulin Chen, Feiping Nie, Qi Wang

Group detection is fundamentally important for analyzing crowd behaviors, and has attracted plenty of attention in artificial intelligence. However, existing works mostly have limitations due to insufficient utilization of crowd properties and the arbitrary processing of individuals. In this paper, we propose the Multiview-based Parameter Free (MPF) approach to detect groups in crowd scenes. The main contributions of this study are threefold: (1) a new structural context descriptor is designed to characterize the structural property of individuals in crowd motions; (2) a self-weighted multiview clustering method is proposed to cluster feature points by incorporating their motion and context similarities; (3) a novel framework is introduced for group detection, which determines the number of groups automatically without any parameter or threshold to tune. Extensive experiments on various real-world datasets demonstrate the effectiveness of the proposed approach, and show its superiority over state-of-the-art group detection techniques.


#21 Quantifying and Detecting Collective Motion by Manifold Learning

Authors: Qi Wang, Mulin Chen, Xuelong Li

The analysis of collective motion has attracted many researchers in artificial intelligence. Though plenty of work has been done on this topic, the achieved performance is still unsatisfactory due to the complex nature of collective motions. By investigating the similarity of individuals, this paper proposes a novel framework for both quantifying and detecting collective motions. Our main contributions are threefold: (1) the time-varying dynamics of individuals are investigated in depth to better characterize individual motion; (2) a structure-based collectiveness measurement is designed to precisely quantify both individual-level and scene-level properties of collective motions; (3) a multi-stage clustering strategy is presented to provide a more comprehensive understanding of crowd scenes, covering both local and global collective motions. Extensive experimental results on real-world datasets show that our method is capable of handling crowd scenes with complicated structures and various dynamics, and demonstrate its superior performance against state-of-the-art competitors.


#22 Nonnegative Orthogonal Graph Matching

Authors: Bo Jiang, Jin Tang, Chris Ding, Bin Luo

The graph matching problem with pair-wise constraints can be formulated as a Quadratic Assignment Problem (QAP). The optimal solution of a QAP is discrete and combinatorial, which makes the problem NP-hard; thus, many algorithms have been proposed to find approximate solutions. In this paper, we propose a new algorithm, called Nonnegative Orthogonal Graph Matching (NOGM), for the QAP matching problem. NOGM is motivated by our new observation that the discrete mapping constraint of QAP can be equivalently encoded by a nonnegative orthogonal constraint, which is much easier to handle computationally. Based on this observation, we develop an effective multiplicative update algorithm to solve NOGM and thus find an effective approximate solution to the QAP. Compared with many traditional continuous methods, which usually obtain continuous solutions that must be further discretized, NOGM obtains a sparse solution and thus incorporates the desirable discrete constraint naturally in its optimization. Promising experimental results demonstrate the benefits of the NOGM algorithm.


#23 Deep Correlated Metric Learning for Sketch-based 3D Shape Retrieval

Authors: Guoxian Dai, Jin Xie, Fan Zhu, Yi Fang

The explosive growth of 3D models has led to a pressing demand for efficient search systems. Traditional model-based search is often inconvenient, since people don't always have a 3D model at hand. Sketch-based 3D shape retrieval is a promising alternative due to its simplicity and efficiency. The main challenge for sketch-based 3D shape retrieval is the discrepancy between the two domains. In this paper, we propose a novel deep correlated metric learning (DCML) method to mitigate the discrepancy between the sketch and 3D shape domains. The proposed DCML trains two distinct deep neural networks (one per domain) jointly with one loss, learning two deep nonlinear transformations that map features from both domains into a common nonlinear feature space. The proposed loss, comprising a discriminative loss and a correlation loss, aims to increase the discrimination of features within each domain as well as the correlation between the domains. In the transferred space, the discriminative loss minimizes the intra-class distance of the deep transformed features and maximizes the inter-class distance by at least a predefined margin within each domain, while the correlation loss focuses on minimizing the distribution discrepancy across the domains. Our proposed method is evaluated on the SHREC 2013 and 2014 benchmarks, and the experimental results demonstrate its superiority over state-of-the-art methods.
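
A sketch of the two loss ingredients named in the abstract — a margin-based discriminative term per domain and a cross-domain correlation term — follows. The tiny encoders, the margin, and the mean-feature distance standing in for the paper's correlation measure are all simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# one small encoder per domain (placeholders for the paper's deep nets)
sketch_net = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 32))
shape_net = nn.Sequential(nn.Linear(200, 64), nn.ReLU(), nn.Linear(64, 32))

def discriminative_loss(f, labels, margin=1.0):
    d = torch.cdist(f, f)
    same = (labels[:, None] == labels[None, :]) & ~torch.eye(len(f), dtype=torch.bool)
    intra = d[same].mean()                    # pull same-class pairs together
    inter = F.relu(margin - d[~same]).mean()  # push different classes apart by a margin
    return intra + inter

sketches, shapes = torch.rand(16, 100), torch.rand(16, 200)
labels = torch.randint(0, 4, (16,))

fs, fm = sketch_net(sketches), shape_net(shapes)
corr_loss = (fs.mean(0) - fm.mean(0)).pow(2).sum()   # cross-domain alignment
loss = discriminative_loss(fs, labels) + discriminative_loss(fm, labels) + corr_loss
print(float(loss))
```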


#24 An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data

Authors: Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, Jiaying Liu

Human action recognition is an important task in computer vision. Extracting discriminative spatial and temporal features to model the spatial and temporal evolution of different actions plays a key role in accomplishing this task. In this work, we propose an end-to-end spatial and temporal attention model for human action recognition from skeleton data. We build our model on top of Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM), which learns to selectively focus on discriminative joints of the skeleton within each frame of the input and pays different levels of attention to the outputs of different frames. Furthermore, to ensure effective training of the network, we propose a regularized cross-entropy loss to drive the model learning process and develop a joint training strategy accordingly. Experimental results demonstrate the effectiveness of the proposed model, both on the small SBU human action recognition dataset and on the currently largest NTU dataset.
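
The spatial-attention ingredient can be sketched as a softmax over per-joint scores that gates each joint's features before they enter the LSTM. The joint count, feature dimension, and single-linear scoring net below are assumptions for illustration; the temporal-attention branch is omitted.

```python
import torch
import torch.nn as nn

N_JOINTS, FEAT = 25, 3                   # e.g. NTU-style 25 joints, 3-D coords

score_net = nn.Linear(FEAT, 1)           # one attention score per joint
lstm = nn.LSTM(N_JOINTS * FEAT, 128, batch_first=True)

frames = torch.rand(2, 30, N_JOINTS, FEAT)       # (batch, time, joint, xyz)
scores = score_net(frames).squeeze(-1)           # (2, 30, 25)
gates = torch.softmax(scores, dim=-1)            # attention over joints per frame
attended = frames * gates.unsqueeze(-1)          # gate each joint's features

out, _ = lstm(attended.flatten(2))               # per-frame vectors into the LSTM
print(out.shape)                                 # (2, 30, 128)
```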


#25 Weakly Supervised Semantic Segmentation Using Superpixel Pooling Network

Authors: Suha Kwak, Seunghoon Hong, Bohyung Han

We propose a weakly supervised semantic segmentation algorithm based on deep neural networks, which relies on image-level class labels only. The proposed algorithm alternates between generating segmentation annotations and learning a semantic segmentation network from the generated annotations. A key determinant of success in this framework is the capability to construct reliable initial annotations given only image-level labels. To this end, we propose the Superpixel Pooling Network (SPN), which utilizes the superpixel segmentation of the input image as a pooling layout to reflect low-level image structure for learning and inferring semantic segmentation. The initial annotations generated by SPN are then used to learn another neural network that estimates pixel-wise semantic labels. The architecture of the segmentation network decouples the semantic segmentation task into classification and segmentation, so that the network learns a class-agnostic shape prior from the noisy annotations. It turns out that both networks are critical for improving semantic segmentation accuracy. The proposed algorithm achieves outstanding performance on the weakly supervised semantic segmentation task compared to existing techniques on the challenging PASCAL VOC 2012 segmentation benchmark.
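
The pooling operation itself is straightforward to sketch: average the CNN feature map over each superpixel so that subsequent predictions follow low-level image structure. The feature map and superpixel labeling below are random placeholders standing in for real network activations and a real oversegmentation.

```python
import numpy as np

feat = np.random.rand(8, 32, 32)                # (channels, H, W) CNN features
sp = np.random.randint(0, 50, (32, 32))         # superpixel id per pixel

pooled = np.zeros((sp.max() + 1, feat.shape[0]))
for s in range(sp.max() + 1):
    mask = sp == s
    if mask.any():
        pooled[s] = feat[:, mask].mean(axis=1)  # average features per superpixel

print(pooled.shape)                             # (n_superpixels, channels)
```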