IJCAI.2021 - Computer Vision

Total: 108

#1 GASP: Gated Attention for Saliency Prediction

Authors: Fares Abawi, Tom Weber, Stefan Wermter

Saliency prediction refers to the computational task of modeling overt attention. Social cues greatly influence our attention, consequently altering our eye movements and behavior. To emphasize the efficacy of such features, we present a neural model for integrating social cues and weighting their influences. Our model consists of two stages. In the first stage, we detect social cues by following gaze, estimating gaze direction, and recognizing affect. These features are then transformed into spatiotemporal maps through image processing operations. The transformed representations are propagated to the second stage (GASP), where we explore various late-fusion techniques for integrating social cues and introduce two sub-networks for directing attention to relevant stimuli. Our experiments indicate that fusion approaches achieve better results with static integration methods, whereas non-fusion approaches, for which the influence of each modality is unknown, yield better outcomes when coupled with recurrent models for dynamic saliency prediction. We show that gaze direction and affective representations improve the correspondence between predictions and ground truth by at least 5% compared to dynamic saliency models without social cues. Furthermore, affective representations improve GASP, supporting the necessity of considering affect-biased attention in predicting saliency.
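As an illustration of the gating idea in the second stage, here is a minimal late-fusion sketch in PyTorch. It is not the GASP architecture itself: the single 1x1 convolution, the sigmoid gates, and the map shapes are assumptions chosen only to show how per-cue, spatially varying gates can weight stacked social-cue maps before they are fused into one saliency map.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Toy gated late fusion of stacked cue maps (illustrative, not GASP itself)."""

    def __init__(self, num_modalities):
        super().__init__()
        # 1x1 conv predicts one gate per modality at every spatial location
        self.gate = nn.Conv2d(num_modalities, num_modalities, kernel_size=1)

    def forward(self, maps):
        # maps: (B, M, H, W) stacked spatiotemporal cue maps (saliency, gaze, affect, ...)
        gates = torch.sigmoid(self.gate(maps))           # how much each cue contributes where
        return (gates * maps).sum(dim=1, keepdim=True)   # fused saliency map, (B, 1, H, W)

fusion = GatedFusion(num_modalities=3)
print(fusion(torch.rand(2, 3, 60, 80)).shape)  # torch.Size([2, 1, 60, 80])
```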


#2 Explaining Self-Supervised Image Representations with Visual Probing

Authors: Dominika Basaj, Witold Oleszkiewicz, Igor Sieradzki, Michał Górszczak, Barbara Rychalska, Tomasz Trzcinski, Bartosz Zieliński

Recently introduced self-supervised methods for image representation learning provide results on par with or superior to those of their fully supervised competitors, yet the corresponding efforts to explain the self-supervised approaches lag behind. Motivated by this observation, we introduce a novel visual probing framework for explaining self-supervised models by leveraging probing tasks employed previously in natural language processing. The probing tasks require knowledge about semantic relationships between image parts. Hence, we propose a systematic approach to obtain analogs of natural language in vision, such as visual words, context, and taxonomy. We show the effectiveness and applicability of those analogs in the context of explaining self-supervised representations. Our key findings emphasize that relations between language and vision can serve as an effective yet intuitive tool for discovering how machine learning models work, independently of data modality. Our work opens a plethora of research pathways towards more explainable and transparent AI.


#3 Themis: A Fair Evaluation Platform for Computer Vision Competitions

Authors: Zinuo Cai, Jianyong Yuan, Yang Hua, Tao Song, Hao Wang, Zhengui Xue, Ningxin Hu, Jonathan Ding, Ruhui Ma, Mohammad Reza Haghighat, Haibing Guan

It has become increasingly thorny for computer vision competitions to preserve fairness when participants intentionally fine-tune their models against the test datasets to improve their performance. To mitigate such unfairness, competition organizers restrict the training and evaluation process of participants' models. However, such restrictions introduce massive computation overheads for organizers and potential intellectual property leakage for participants. Thus, we propose Themis, a framework that trains a noise generator jointly with organizers and participants to prevent intentional fine-tuning by protecting test datasets from surreptitious manual labeling. Specifically, with the carefully designed noise generator, Themis adds noise to perturb test sets without twisting the performance ranking of participants' models. We evaluate the validity of Themis with a wide spectrum of real-world models and datasets. Our experimental results show that Themis effectively enforces competition fairness by precluding manual labeling of test sets and preserving the performance ranking of participants' models.


#4 Novelty Detection via Contrastive Learning with Negative Data Augmentation

Authors: Chengwei Chen, Yuan Xie, Shaohui Lin, Ruizhi Qiao, Jian Zhou, Xin Tan, Yi Zhang, Lizhuang Ma

Novelty detection is the process of determining whether a query example differs from the learned training distribution. Previous generative adversarial network (GAN) based methods and self-supervised approaches suffer from training instability, mode dropping, and low discriminative ability. We overcome such problems by introducing a novel decoder-encoder framework. Firstly, a generative network (decoder) learns the representation by mapping the initialized latent vector to an image. In particular, this vector is initialized by considering the entire distribution of the training data to avoid the problem of mode dropping. Secondly, a contrastive network (encoder) aims to "learn to compare" through mutual information estimation, which directly helps the generative network obtain a more discriminative representation by using a negative data augmentation strategy. Extensive experiments show that our model has significant superiority over cutting-edge novelty detectors and achieves new state-of-the-art results on various novelty detection benchmarks, e.g., CIFAR10 and DCASE. Moreover, our model is more stable to train in a non-adversarial manner, compared to other adversarial novelty detection methods.


#5 Zero-Shot Chinese Character Recognition with Stroke-Level Decomposition

Authors: Jingye Chen, Bin Li, Xiangyang Xue

Chinese character recognition has attracted much research interest due to its wide applications. Although it has been studied for many years, some issues in this field have not been completely resolved yet, e.g., the zero-shot problem. Previous character-based and radical-based methods have not fundamentally addressed the zero-shot problem since some characters or radicals in test sets may not appear in training sets under a data-hungry condition. Inspired by the fact that humans can generalize to write characters they have never seen before once they have learned the stroke orders of other characters, we propose a stroke-based method that decomposes each character into a sequence of strokes, the most basic units of Chinese characters. However, we observe that there is a one-to-many relationship between stroke sequences and Chinese characters. To tackle this challenge, we employ a matching-based strategy to transform the predicted stroke sequence into a specific character. We evaluate the proposed method on handwritten characters, printed artistic characters, and scene characters. The experimental results validate that the proposed method outperforms existing methods on both character zero-shot and radical zero-shot tasks. Moreover, the proposed method can be easily generalized to other languages whose characters can be decomposed into strokes.
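A minimal sketch of the matching-based strategy described above: the predicted stroke sequence is mapped to the dictionary character whose canonical stroke sequence is closest under edit distance. The tiny stroke table and the single-letter stroke codes below are hypothetical stand-ins for illustration; the paper's actual matching procedure and confidence handling may differ.

```python
# Sketch (not the authors' code): map a predicted stroke sequence to the character
# whose canonical stroke sequence has the smallest edit distance to it.

def edit_distance(a, b):
    """Levenshtein distance between two stroke sequences."""
    dp = list(range(len(b) + 1))
    for i, sa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, sb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (sa != sb))  # substitution
    return dp[-1]

# Hypothetical stroke dictionary: character -> canonical stroke sequence,
# with toy codes such as 'h' = horizontal, 'v' = vertical, 'hz' = horizontal turn.
STROKE_TABLE = {
    "十": ["h", "v"],
    "口": ["v", "hz", "h"],
    "王": ["h", "h", "v", "h"],
}

def match_character(predicted_strokes):
    """Return the dictionary character whose strokes best match the prediction."""
    return min(STROKE_TABLE, key=lambda c: edit_distance(predicted_strokes, STROKE_TABLE[c]))

print(match_character(["h", "h", "v", "h"]))  # -> "王" (exact stroke match in this toy table)
```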


#6 Leveraging Human Attention in Novel Object Captioning

Authors: Xianyu Chen, Ming Jiang, Qi Zhao

Image captioning models depend on training with paired image-text corpora, which poses various challenges in describing images containing novel objects absent from the training data. While previous novel object captioning methods rely on external image taggers or object detectors to describe novel objects, we present the Attention-based Novel Object Captioner (ANOC), which complements novel object captioners with human attention features that characterize generally important information independent of tasks. It introduces a gating mechanism that adaptively combines human attention with self-learned machine attention, together with a Constrained Self-Critical Sequence Training method that addresses exposure bias while maintaining constraints on novel object descriptions. Extensive experiments conducted on the nocaps and Held-Out COCO datasets demonstrate that our method considerably outperforms state-of-the-art novel object captioners. Our source code is available at https://github.com/chenxy99/ANOC.


#7 Boundary Knowledge Translation based Reference Semantic Segmentation

Authors: Lechao Cheng, Zunlei Feng, Xinchao Wang, Ya Jie Liu, Jie Lei, Mingli Song

Given a reference object of an unknown type in an image, human observers can effortlessly find the objects of the same category in another image and precisely tell their visual boundaries. Such visual cognition capability of humans seems absent from the current research spectrum of computer vision. Existing segmentation networks, for example, rely on a humongous amount of labeled data, which is laborious and costly to collect and annotate; besides, the performance of segmentation networks tends to degrade as the number of categories increases. In this paper, we introduce a novel Reference semantic segmentation Network (Ref-Net) to conduct visual boundary knowledge translation. Ref-Net contains a Reference Segmentation Module (RSM) and a Boundary Knowledge Translation Module (BKTM). Inspired by the human recognition mechanism, RSM is devised to segment only objects of the same category as the reference objects, based on the reference objects' features. BKTM, on the other hand, introduces two boundary discriminator branches to conduct inner and outer boundary segmentation of the target object in an adversarial manner, and translates the annotated boundary knowledge of open-source datasets into the segmentation network. Exhaustive experiments demonstrate that, with tens of finely-grained annotated samples as guidance, Ref-Net achieves results on par with fully supervised methods on six datasets. Our code can be found in the supplementary material.


#8 Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering

Authors: Long Hoang Dang, Thao Minh Le, Vuong Le, Truyen Tran

Video Question Answering (Video QA) is a powerful testbed for developing new AI capabilities. This task necessitates learning to reason about objects, relations, and events across visual and linguistic domains in space-time. High-level reasoning demands lifting from associative visual pattern recognition to symbol-like manipulation over objects, their behaviors, and their interactions. Toward this goal, we propose an object-oriented reasoning approach in which a video is abstracted as a dynamic stream of interacting objects. At each stage of the video event flow, these objects interact with each other, and their interactions are reasoned about with respect to the query and under the overall context of the video. This mechanism is materialized into a family of general-purpose neural units and their multi-level architecture, called Hierarchical Object-oriented Spatio-Temporal Reasoning (HOSTR) networks. This neural model maintains the objects' consistent lifelines in the form of a hierarchically nested spatio-temporal graph. Within this graph, dynamic interactive object-oriented representations are built up along the video sequence, hierarchically abstracted in a bottom-up manner, and converge toward the key information for the correct answer. The method is evaluated on multiple major Video QA datasets and establishes new state-of-the-art results on these tasks. Analysis of the model's behavior indicates that object-oriented reasoning is a reliable, interpretable, and efficient approach to Video QA.


#9 Phonovisual Biases in Language: is the Lexicon Tied to the Visual World?

Authors: Andrea Gregor de Varda, Carlo Strapparava

The present paper addresses the study of cross-linguistic and cross-modal iconicity within a deep learning framework. An LSTM-based Recurrent Neural Network is trained to associate the phonetic representation of a concrete word, encoded as a sequence of feature vectors, to the visual representation of its referent, expressed as an HCNN-transformed image. The processing network is then tested, without further training, in a language that does not appear in the training set and belongs to a different language family. The performance of the model is evaluated through a comparison with a randomized baseline; we show that such an imaginative network is capable of extracting language-independent generalizations in the mapping from linguistic sounds to visual features, providing empirical support for the hypothesis of a universal sound-symbolic substrate underlying all languages.


#10 Direction-aware Feature-level Frequency Decomposition for Single Image Deraining

Authors: Sen Deng, Yidan Feng, Mingqiang Wei, Haoran Xie, Yiping Chen, Jonathan Li, Xiao-Ping Zhang, Jing Qin

We present a novel direction-aware feature-level frequency decomposition network for single image deraining. Compared with existing solutions, the proposed network has three compelling characteristics. First, unlike previous algorithms, we propose to perform frequency decomposition at the feature level instead of the image level, allowing both low-frequency maps containing structures and high-frequency maps containing details to be continuously refined during the training procedure. Second, we further establish communication channels between low-frequency maps and high-frequency maps to interactively capture structures from high-frequency maps and add them back to low-frequency maps and, simultaneously, extract details from low-frequency maps and send them back to high-frequency maps, thereby removing rain streaks while preserving more delicate features in the input image. Third, different from existing algorithms that use convolutional filters consistent in all directions, we propose a direction-aware filter to capture the direction of rain streaks in order to more effectively and thoroughly purge the input images of rain streaks. We extensively evaluate the proposed approach on three representative datasets, and the experimental results corroborate that our approach consistently outperforms state-of-the-art deraining algorithms.


#11 TCIC: Theme Concepts Learning Cross Language and Vision for Image Captioning

Authors: Zhihao Fan, Zhongyu Wei, Siyuan Wang, Ruize Wang, Zejun Li, Haijun Shan, Xuanjing Huang

Existing research on image captioning usually represents an image using a scene graph with low-level facts (objects and relations) and fails to capture the high-level semantics. In this paper, we propose a Theme Concepts extended Image Captioning (TCIC) framework that incorporates theme concepts to represent high-level cross-modality semantics. In practice, we model theme concepts as memory vectors and propose a Transformer with Theme Nodes (TTN) to incorporate those vectors for image captioning. Considering that theme concepts can be learned from both images and captions, we propose two settings for their representation learning based on TTN. On the vision side, TTN is configured to take both scene-graph-based features and theme concepts as input for visual representation learning. On the language side, TTN is configured to take both captions and theme concepts as input for text representation reconstruction. Both settings aim to generate target captions with the same transformer-based decoder. During training, we further align the representations of theme concepts learned from images and the corresponding captions to enforce cross-modality learning. Experimental results on MS COCO show the effectiveness of our approach compared to some state-of-the-art models.


#12 Chop Chop BERT: Visual Question Answering by Chopping VisualBERT’s Heads

Authors: Chenyu Gao, Qi Zhu, Peng Wang, Qi Wu

Vision-and-Language (VL) pre-training has shown great potential on many related downstream tasks, such as Visual Question Answering (VQA), one of the most popular problems in the VL field. All of these pre-trained models (such as VisualBERT, ViLBERT, LXMERT and UNITER) are built with Transformers, which extend the classical attention mechanism to multiple layers and heads. To investigate why and how these models work so well on VQA, in this paper we explore the roles of individual heads and layers in Transformer models when handling 12 different types of questions. Specifically, we manually remove (chop) one head (or layer) at a time from a pre-trained VisualBERT model and test it on different levels of questions to record its performance. As shown by the interesting echelon shape of the result matrices, the experiments reveal that different heads and layers are responsible for different question types, with higher-level layers activated by higher-level visual reasoning questions. Based on this observation, we design a dynamic chopping module that can automatically remove heads and layers of VisualBERT at the instance level when dealing with different questions. Our dynamic chopping module can effectively reduce the parameters of the original model by 50%, while degrading accuracy by less than 1% on the VQA task.
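A minimal sketch of the head-chopping idea on a generic multi-head self-attention layer (not VisualBERT's internals): zeroing the outputs of selected heads removes their contribution, so the accuracy change per question type can be measured. A dynamic version, as in the paper, would predict which heads to chop per instance.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Plain multi-head self-attention with an option to zero out ("chop") heads.

    Minimal sketch, not VisualBERT's implementation: it only shows how removing
    a subset of heads changes the layer output so their role can be probed.
    """

    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, chopped_heads=()):
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, tokens, head_dim)
        q, k, v = (t.view(b, n, self.h, self.d).transpose(1, 2) for t in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5, dim=-1)
        out = attn @ v                         # (b, h, n, d)
        if chopped_heads:                      # zero the chopped heads' outputs
            out[:, list(chopped_heads)] = 0.0
        return self.proj(out.transpose(1, 2).reshape(b, n, -1))

# Example: compare outputs of a toy layer with and without head 0.
layer = MultiHeadSelfAttention(dim=64, num_heads=8)
tokens = torch.randn(2, 10, 64)
full = layer(tokens)
chopped = layer(tokens, chopped_heads=[0])
print((full - chopped).abs().mean())
```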


#13 Feature Space Targeted Attacks by Statistic Alignment

Authors: Lianli Gao, Yaya Cheng, Qilong Zhang, Xing Xu, Jingkuan Song

By adding human-imperceptible perturbations to images, DNNs can be easily fooled. As one of the mainstream methods, feature space targeted attacks perturb images by modulating their intermediate feature maps so that the discrepancy between the intermediate source and target features is minimized. However, the current choice of pixel-wise Euclidean distance to measure this discrepancy is questionable, because it unreasonably imposes a spatial-consistency constraint on the source and target features. Intuitively, an image can be categorized as "cat" no matter whether the cat is on the left or the right of the image. To address this issue, we propose to measure the discrepancy using statistic alignment. Specifically, we design two novel approaches called Pair-wise Alignment Attack and Global-wise Alignment Attack, which attempt to measure similarities between feature maps by high-order statistics with translation invariance. Furthermore, we systematically analyze the layer-wise transferability with varied difficulties to obtain highly reliable attacks. Extensive experiments verify the effectiveness of our proposed method, and it outperforms the state-of-the-art algorithms by a large margin. Our code is publicly available at https://github.com/yaya-cheng/PAA-GAA.
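To make the statistic-alignment idea concrete, here is a sketch of an alignment loss based on channel Gram matrices, i.e., second-order feature statistics that do not depend on where the object sits in the image. This is an assumed formulation for illustration rather than the authors' exact Pair-wise/Global-wise Alignment Attack losses.

```python
import torch

def gram_matrix(feat):
    """Channel-wise Gram matrix of a feature map (B, C, H, W) -> (B, C, C).

    Summing over all spatial positions makes this statistic translation-invariant.
    """
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (h * w)

def pairwise_alignment_loss(source_feat, target_feat):
    """Sketch of statistic alignment: match channel co-activation statistics
    instead of forcing pixel-wise spatial correspondence."""
    return torch.nn.functional.mse_loss(gram_matrix(source_feat), gram_matrix(target_feat))

# Toy check: a circular spatial shift leaves the statistics unchanged,
# while the pixel-wise Euclidean distance heavily penalizes it.
feat = torch.randn(1, 16, 8, 8)
shifted = torch.roll(feat, shifts=3, dims=-1)
print(pairwise_alignment_loss(feat, shifted))        # ~0: channel statistics unchanged
print(torch.nn.functional.mse_loss(feat, shifted))   # large: spatial-consistency penalty
```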


#14 Multi-view Feature Augmentation with Adaptive Class Activation Mapping

Authors: Xiang Gao, Yingjie Tian, Zhiquan Qi

We propose an end-to-end-trainable feature augmentation module built for image classification that extracts and exploits multi-view local features to boost model performance. Different from using global average pooling (GAP) to extract vectorized features from only the global view, we propose to sample and ensemble diverse multi-view local features to improve model robustness. To sample class-representative local features, we incorporate a simple auxiliary classifier head (comprising only one 1x1 convolutional layer) which efficiently and adaptively attends to class-discriminative local regions of feature maps via our proposed AdaCAM (Adaptive Class Activation Mapping). Extensive experiments demonstrate consistent and noticeable performance gains achieved by our multi-view feature augmentation module.
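A sketch of how a 1x1-conv auxiliary head can produce class activation maps and sample class-discriminative local features alongside the GAP feature. The top-k sampling rule and all shapes are assumptions for illustration; the paper's AdaCAM and ensembling details are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAMSampler(nn.Module):
    """Illustrative sketch (not the paper's exact design): a 1x1-conv auxiliary
    classifier maps each spatial location to class logits, and the activation map
    of the true class selects local feature vectors to complement the GAP feature."""

    def __init__(self, channels, num_classes, num_views=4):
        super().__init__()
        self.aux_head = nn.Conv2d(channels, num_classes, kernel_size=1)
        self.num_views = num_views

    def forward(self, feat, labels):
        # feat: (B, C, H, W); labels: (B,) ground-truth class indices
        b, c, h, w = feat.shape
        cam = self.aux_head(feat)                                   # (B, K, H, W)
        cam = cam[torch.arange(b), labels]                          # true-class CAM, (B, H, W)
        topk = cam.flatten(1).topk(self.num_views, dim=1).indices   # most activated locations
        local = feat.flatten(2).transpose(1, 2)                     # (B, H*W, C)
        views = torch.stack([local[i, topk[i]] for i in range(b)])  # (B, num_views, C)
        global_feat = F.adaptive_avg_pool2d(feat, 1).flatten(1)     # standard GAP feature
        return global_feat, views

sampler = CAMSampler(channels=32, num_classes=10)
g, v = sampler(torch.randn(2, 32, 7, 7), torch.tensor([3, 7]))
print(g.shape, v.shape)  # torch.Size([2, 32]) torch.Size([2, 4, 32])
```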


#15 Learning Spectral Dictionary for Local Representation of Mesh

Authors: Zhongpai Gao, Junchi Yan, Guangtao Zhai, Xiaokang Yang

For meshes, sharing the topology of a template is a common and practical setting in face-, hand-, and body-related applications. Meshes are irregular since each vertex's neighbors are unordered and their orientations are inconsistent with those of other vertices. Previous methods overcome this irregularity by using isotropic filters, predefined local coordinate systems, or learned weighting matrices for each vertex of the template. Learning a weighting matrix for each vertex to soft-permute the vertex's neighbors into an implicit canonical order is an effective way to capture the local structure of each vertex. However, learning weighting matrices for each vertex increases the parameter size linearly with the number of vertices, and large numbers of parameters are required for high-resolution 3D shapes. In this paper, we learn a spectral dictionary (i.e., bases) for the weighting matrices such that the parameter size is independent of the resolution of the 3D shapes. The coefficients of the weighting-matrix bases for each vertex are learned from the spectral features of the template's vertex and its neighbors in a weight-sharing manner. Comprehensive experiments demonstrate that our model produces state-of-the-art results with a much smaller model size.
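The dictionary idea can be sketched as follows: a small set of shared K x K basis matrices plus a per-vertex coefficient vector predicted from spectral features replaces a full per-vertex weighting matrix, so the parameter count no longer grows with the number of vertices. The coefficient predictor (a single shared linear layer here) and all shapes are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class DictionaryWeighting(nn.Module):
    """Sketch of a spectral dictionary of weighting matrices (details assumed):
    each vertex's K x K soft-permutation matrix is a linear combination of shared
    basis matrices, with coefficients predicted from its spectral features."""

    def __init__(self, spectral_dim, num_bases, num_neighbors):
        super().__init__()
        self.bases = nn.Parameter(torch.randn(num_bases, num_neighbors, num_neighbors) * 0.01)
        self.coeff_mlp = nn.Linear(spectral_dim, num_bases)   # shared across all vertices

    def forward(self, spectral_feat, neighbor_feat):
        # spectral_feat: (V, S) per-vertex spectral features
        # neighbor_feat: (V, K, C) features of each vertex's K neighbors
        coeff = self.coeff_mlp(spectral_feat)                        # (V, B)
        weight = torch.einsum('vb,bkj->vkj', coeff, self.bases)      # per-vertex K x K matrix
        return torch.einsum('vkj,vjc->vkc', weight, neighbor_feat)   # soft-permuted neighbors

module = DictionaryWeighting(spectral_dim=16, num_bases=8, num_neighbors=6)
out = module(torch.randn(100, 16), torch.randn(100, 6, 32))
print(out.shape)  # torch.Size([100, 6, 32])
```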


#16 Self-Supervised Video Action Localization with Adversarial Temporal Transforms

Authors: Guoqiang Gong, Liangfeng Zheng, Wenhao Jiang, Yadong Mu

Weakly-supervised temporal action localization aims to locate intervals of action instances with only video-level action labels for training. However, the localization results generated from video classification networks are often not accurate due to the lack of temporal boundary annotations of actions. Our motivating insight is that the temporal boundary of an action should be stably predicted under various temporal transforms. This inspires a self-supervised equivariant transform consistency constraint. We design a set of temporal transform operations, ranging from naive temporal down-sampling to learnable attention-piloted time warping. In our model, a localization network aims to perform well under all transforms, while another policy network is designed to choose, at each iteration, a temporal transform that adversarially yields localization results inconsistent with those of the localization network. Additionally, we devise a self-refine module to enhance the completeness of action intervals by harnessing temporal and semantic contexts. Experimental results on THUMOS14 and ActivityNet demonstrate that our model consistently outperforms state-of-the-art weakly-supervised temporal action localization methods.


#17 EventDrop: Data Augmentation for Event-based Learning

Authors: Fuqiang Gu, Weicong Sng, Xuke Hu, Fangwen Yu

The advantages of event-sensing over conventional sensors (e.g., higher dynamic range, lower time latency, and lower power consumption) have spurred research into machine learning for event data. Unsurprisingly, deep learning has emerged as a competitive methodology for learning with event sensors; in typical setups, discrete and asynchronous events are first converted into frame-like tensors on which standard deep networks can be applied. However, over-fitting remains a challenge, particularly since event datasets remain small relative to conventional datasets (e.g., ImageNet). In this paper, we introduce EventDrop, a new method for augmenting asynchronous event data to improve the generalization of deep models. By dropping events selected with various strategies, we are able to increase the diversity of training data (e.g., to simulate various levels of occlusion). From a practical perspective, EventDrop is simple to implement and computationally low-cost. Experiments on two event datasets (N-Caltech101 and N-Cars) demonstrate that EventDrop can significantly improve the generalization performance across a variety of deep networks.
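A minimal sketch of EventDrop-style augmentation on a raw event array of (timestamp, x, y, polarity) rows: each strategy simply discards a subset of events before the stream is converted into frame-like tensors. The specific strategies and ratios below are illustrative assumptions rather than the paper's exact configurations.

```python
import numpy as np

def random_drop(events, ratio=0.25):
    """Drop a random fraction of events."""
    keep = np.random.rand(len(events)) >= ratio
    return events[keep]

def drop_by_time(events, ratio=0.25):
    """Drop all events inside a random temporal window (a time-local outage)."""
    t = events[:, 0]
    span = t.max() - t.min()
    t0 = np.random.uniform(t.min(), t.max() - ratio * span)
    return events[(t < t0) | (t > t0 + ratio * span)]

def drop_by_area(events, ratio=0.25, sensor_size=(128, 128)):
    """Drop events inside a random spatial box (simulates partial occlusion)."""
    w, h = sensor_size
    bw, bh = int(w * ratio), int(h * ratio)
    x0, y0 = np.random.randint(0, w - bw), np.random.randint(0, h - bh)
    x, y = events[:, 1], events[:, 2]
    inside = (x >= x0) & (x < x0 + bw) & (y >= y0) & (y < y0 + bh)
    return events[~inside]

# Toy usage on synthetic events: columns are (timestamp, x, y, polarity).
events = np.column_stack([np.sort(np.random.rand(1000)),
                          np.random.randint(0, 128, 1000),
                          np.random.randint(0, 128, 1000),
                          np.random.choice([-1, 1], 1000)])
print(len(random_drop(events)), len(drop_by_time(events)), len(drop_by_area(events)))
```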


#18 AdaVQA: Overcoming Language Priors with Adapted Margin Cosine Loss

Authors: Yangyang Guo, Liqiang Nie, Zhiyong Cheng, Feng Ji, Ji Zhang, Alberto Del Bimbo

A number of studies point out that current Visual Question Answering (VQA) models are severely affected by the language prior problem, which refers to blindly making predictions based on language shortcuts. Some efforts have been devoted to overcoming this issue with delicate models. However, no research has addressed it from the perspective of answer feature-space learning, despite the fact that existing VQA methods all cast VQA as a classification task. Inspired by this, in this work we attempt to tackle the language prior problem from the viewpoint of feature-space learning. An adapted margin cosine loss is designed to properly discriminate between the frequent and the sparse answer feature spaces under each question type. In this way, the limited patterns within the language modality can be largely reduced to eliminate the language priors. We apply this loss function to several baseline models and evaluate its effectiveness on two VQA-CP benchmarks. Experimental results demonstrate that our proposed adapted margin cosine loss can enhance the baseline models with an absolute performance gain of 15% on average, strongly verifying the potential of tackling the language prior problem in VQA from the angle of answer feature-space learning.
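The general form of a margin cosine loss with per-answer margins can be sketched as below. How the margins are adapted to the frequent and sparse answers under each question type is the paper's contribution and is not reproduced here; the margins are simply passed in, and all shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def margin_cosine_loss(features, weights, labels, margins, scale=16.0):
    """Sketch of a per-answer margin cosine loss (general form, not AdaVQA's exact
    margin schedule).

    features: (B, D) answer-space features, weights: (A, D) classifier weights,
    labels: (B,) ground-truth answer indices, margins: (A,) per-answer margins.
    """
    cos = F.normalize(features) @ F.normalize(weights).t()   # (B, A) cosine similarities
    m = margins[labels]                                      # margin of each sample's answer
    target_mask = F.one_hot(labels, num_classes=weights.size(0)).bool()
    # subtract the margin only on the ground-truth class before scaling
    logits = torch.where(target_mask, cos - m.unsqueeze(1), cos)
    return F.cross_entropy(scale * logits, labels)

# Toy usage with hypothetical shapes (3 answers, 8-dim features).
feats = torch.randn(4, 8)
w = torch.randn(3, 8)
y = torch.tensor([0, 2, 1, 0])
per_answer_margin = torch.tensor([0.4, 0.1, 0.25])
print(margin_cosine_loss(feats, w, y, per_answer_margin))
```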


#19 Disentangled Face Attribute Editing via Instance-Aware Latent Space Search

Authors: Yuxuan Han, Jiaolong Yang, Ying Fu

Recent works have shown that a rich set of semantic directions exists in the latent space of Generative Adversarial Networks (GANs), which enables various facial attribute editing applications. However, existing methods may suffer from poor attribute-variation disentanglement, leading to unwanted changes of other attributes when altering the desired one. The semantic directions used by existing methods are defined at the attribute level, which makes it difficult to model complex attribute correlations, especially in the presence of attribute distribution bias in the GAN's training set. In this paper, we propose a novel framework (IALS) that performs Instance-Aware Latent-Space Search to find semantic directions for disentangled attribute editing. The instance information is injected by leveraging the supervision from a set of attribute classifiers evaluated on the input images. We further propose a Disentanglement-Transformation (DT) metric to quantify the attribute transformation and disentanglement efficacy, and find the optimal control factor between attribute-level and instance-specific directions based on it. Experimental results on both GAN-generated and real-world images collectively show that our method outperforms recently proposed state-of-the-art methods by a wide margin. Code is available at https://github.com/yxuhan/IALS.


#20 DeepME: Deep Mixture Experts for Large-scale Image Classification

Authors: Ming He, Guangyi Lv, Weidong He, Jianping Fan, Guihua Zeng

Although deep learning has demonstrated outstanding performance on image classification, most well-known deep networks optimize both their structures and their node weights for recognizing a limited number of object classes (e.g., no more than 1000). It is therefore attractive to extend or mix such well-known deep networks to support large-scale image classification. To the best of our knowledge, how to adaptively and effectively fuse multiple CNNs for large-scale image classification is still under-explored. On this basis, a deep mixture algorithm is developed to support large-scale image classification in this paper. First, a soft spectral clustering method is developed to construct a two-layer ontology (a group layer and a category layer) by assigning large numbers of image categories into a set of groups according to their inter-category semantic correlations, where the semantically related image categories under neighbouring group nodes may share similar learning complexities. Then, this two-layer ontology is further used to generate the task groups, in which each task group contains a subset of image categories with similar learning complexities and one particular base deep network is learned. Finally, a gate network is learned to combine all the base deep networks, each with a smaller output space, into a mixture network with a larger output space. Our experimental results on ImageNet10K demonstrate that the proposed deep mixture algorithm can achieve very competitive results (top-1 accuracy: 32.13%) on large-scale image classification tasks.
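A sketch (with assumed details) of the final gating step: each base network scores only its own group of categories from the ontology, and a gate network weights the groups to produce scores over the full label space. The shared-feature input and linear experts are simplifications for illustration, not the paper's base networks.

```python
import torch
import torch.nn as nn

class GatedMixture(nn.Module):
    """Illustrative mixture of group experts: each expert covers one category
    group, and a gate weights the groups over the full label space."""

    def __init__(self, feat_dim, groups):
        super().__init__()
        # `groups` maps each expert to the global indices of its categories,
        # e.g. [[0, 1, 2], [3, 4], ...] derived from the two-layer ontology.
        self.groups = groups
        self.experts = nn.ModuleList(nn.Linear(feat_dim, len(g)) for g in groups)
        self.gate = nn.Linear(feat_dim, len(groups))
        self.num_classes = sum(len(g) for g in groups)

    def forward(self, feat):
        gate = torch.softmax(self.gate(feat), dim=1)               # (B, num_experts)
        scores = feat.new_zeros(feat.size(0), self.num_classes)    # full label space
        for e, (expert, idx) in enumerate(zip(self.experts, self.groups)):
            scores[:, idx] = gate[:, e:e + 1] * expert(feat)       # weighted group scores
        return scores

model = GatedMixture(feat_dim=64, groups=[[0, 1, 2], [3, 4], [5, 6, 7, 8]])
print(model(torch.randn(2, 64)).shape)  # torch.Size([2, 9])
```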


#21 Multi-Scale Selective Feedback Network with Dual Loss for Real Image Denoising

Authors: Xiaowan Hu, Yuanhao Cai, Zhihong Liu, Haoqian Wang, Yulun Zhang

The feedback mechanism in the human visual system extracts high-level semantics from noisy scenes and then guides low-level noise removal; this mechanism has not been fully explored in deep-learning-based image denoising networks. The commonly used fully-supervised network optimizes its parameters through paired training data. However, unpaired images without noise-free labels are ubiquitous in the real world. Therefore, we propose a multi-scale selective feedback network (MSFN) with a dual loss. We allow shallow layers to selectively access valuable contextual information from the following deep layers between two adjacent time steps. This iterative refinement mechanism removes complex noise in a coarse-to-fine manner. The dual regression is designed to reconstruct noisy images to establish closed-loop supervision that is training-friendly for unpaired data. We use the dual loss to optimize the primary clean-to-noisy task and the dual noisy-to-clean task simultaneously. Extensive experiments prove that our method achieves state-of-the-art results and shows better adaptability to real-world images than existing methods.
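One plausible reading of the closed-loop dual supervision, sketched with stand-in convolutional networks rather than MSFN: one branch denoises, the other maps the estimate back to the noisy domain, and reconstructing the input yields a loss that needs no clean label, so unpaired noisy images can also contribute to training.

```python
import torch
import torch.nn as nn

# Stand-in networks: tiny convolutional stacks, not the MSFN architecture.
denoiser = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 3, 3, padding=1))
renoiser = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 3, 3, padding=1))
l1 = nn.L1Loss()

def dual_loss(noisy, clean=None, dual_weight=0.1):
    """Closed-loop dual supervision sketch: the dual term reconstructs the noisy
    input and therefore works without a clean label."""
    denoised = denoiser(noisy)
    recon_noisy = renoiser(denoised)                  # back to the noisy domain
    loss = dual_weight * l1(recon_noisy, noisy)       # dual (closed-loop) term
    if clean is not None:                             # primary term only when a pair exists
        loss = loss + l1(denoised, clean)
    return loss

paired = dual_loss(torch.rand(1, 3, 32, 32), torch.rand(1, 3, 32, 32))
unpaired = dual_loss(torch.rand(1, 3, 32, 32))        # no clean label needed
print(paired.item(), unpaired.item())
```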


#22 Dynamic Inconsistency-aware DeepFake Video Detection

Authors: Ziheng Hu, Hongtao Xie, YuXin Wang, Jiahong Li, Zhongyuan Wang, Yongdong Zhang

The spread of DeepFake videos poses a serious threat to information security, calling for effective detection methods to distinguish them. However, the performance of recent frame-based detection methods is limited because they ignore the inter-frame inconsistency of fake videos. In this paper, we propose a novel Dynamic Inconsistency-aware Network to handle this inconsistency, which uses a Cross-Reference Module (CRM) to capture both global and local inter-frame inconsistencies. The CRM contains two parallel branches. The first branch takes faces from adjacent frames as input and calculates a structure similarity map for a global inconsistency representation. The second branch focuses only on the inter-frame variation of independent critical regions, which captures the local inconsistency. To the best of our knowledge, this is the first work to fully exploit inter-frame inconsistency information from both global and local perspectives. Compared with existing methods, our model provides more accurate and robust detection on the FaceForensics++, DFDC-preview and Celeb-DFv2 datasets.
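A sketch of the global inconsistency cue: a local structure-similarity map between aligned face crops from adjacent frames, computed here as the structure term of SSIM using average-pooling windows. This is an assumed formulation for illustration, not the exact computation inside the CRM.

```python
import torch
import torch.nn.functional as F

def structure_similarity_map(frame_a, frame_b, window=7, c=1e-4):
    """Local structure-similarity map between two aligned grayscale face crops
    (B, 1, H, W); low values highlight regions with temporal inconsistency.
    Illustrative sketch, not the paper's exact CRM computation."""
    pad = window // 2
    mu_a = F.avg_pool2d(frame_a, window, stride=1, padding=pad)
    mu_b = F.avg_pool2d(frame_b, window, stride=1, padding=pad)
    var_a = F.avg_pool2d(frame_a ** 2, window, stride=1, padding=pad) - mu_a ** 2
    var_b = F.avg_pool2d(frame_b ** 2, window, stride=1, padding=pad) - mu_b ** 2
    cov = F.avg_pool2d(frame_a * frame_b, window, stride=1, padding=pad) - mu_a * mu_b
    # structure term of SSIM: local covariance normalized by local standard deviations
    return (cov + c) / (var_a.clamp(min=0).sqrt() * var_b.clamp(min=0).sqrt() + c)

a, b = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
print(structure_similarity_map(a, b).shape)  # torch.Size([1, 1, 64, 64])
```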


#23 AgeFlow: Conditional Age Progression and Regression with Normalizing Flows

Authors: Zhizhong Huang, Shouzhen Chen, Junping Zhang, Hongming Shan

Age progression and regression aim to synthesize the photorealistic appearance of a given face image with aging and rejuvenation effects, respectively. Existing generative adversarial network (GAN) based methods suffer from the following three major issues: 1) unstable training, which introduces strong ghost artifacts in the generated faces; 2) unpaired training, which leads to unexpected changes in facial attributes such as gender and race; and 3) non-bijective age mappings, which increase the uncertainty of the face transformation. To overcome these issues, this paper proposes a novel framework, termed AgeFlow, to integrate the advantages of both flow-based models and GANs. The proposed AgeFlow contains three parts: an encoder that maps a given face to a latent space through an invertible neural network, a novel invertible conditional translation module (ICTM) that translates the source latent vector to the target one, and a decoder that reconstructs the generated face from the target latent vector using the same encoder network; all parts are invertible, achieving bijective age mappings. The novelties of ICTM are two-fold. First, we propose an attribute-aware knowledge distillation to learn the manipulation direction of age progression while keeping other unrelated attributes unchanged, alleviating unexpected changes in facial attributes. Second, we propose to use GANs in the latent space to ensure that the learned latent vectors are indistinguishable from real ones, which is much easier than the traditional use of GANs in the image domain. Experimental results demonstrate superior performance over existing GAN-based methods on two benchmark datasets. The source code is available at https://github.com/Hzzone/AgeFlow.


#24 Self-Supervised Video Representation Learning with Constrained Spatiotemporal Jigsaw

Authors: Yuqi Huo, Mingyu Ding, Haoyu Lu, Ziyuan Huang, Mingqian Tang, Zhiwu Lu, Tao Xiang

This paper proposes a novel pretext task for self-supervised video representation learning by exploiting spatiotemporal continuity in videos. It is motivated by the fact that videos are spatiotemporal by nature, so a representation learned by detecting spatiotemporal continuity/discontinuity is beneficial for downstream video content analysis tasks. A natural choice of such a pretext task is to construct spatiotemporal (3D) jigsaw puzzles and learn to solve them. However, as we demonstrate in the experiments, this task turns out to be intractable. We thus propose Constrained Spatiotemporal Jigsaw (CSJ), whereby the 3D jigsaws are formed in a constrained manner to ensure that large continuous spatiotemporal cuboids exist. This provides sufficient cues for the model to reason about the continuity. Instead of solving the jigsaws directly, which could still be extremely hard, we carefully design four surrogate tasks that are more solvable. The four tasks aim to learn representations sensitive to spatiotemporal continuity at both the local and global levels. Extensive experiments show that our CSJ achieves state-of-the-art results on various benchmarks.


#25 Perturb, Predict & Paraphrase: Semi-Supervised Learning using Noisy Student for Image Captioning

Authors: Arjit Jain, Pranay Reddy Samala, Preethi Jyothi, Deepak Mittal, Maneesh Singh

Recent semi-supervised learning (SSL) methods are predominantly focused on multi-class classification tasks. Classification tasks allow for easy mixing of class labels during augmentation, which does not trivially extend to structured outputs such as the word sequences that appear in tasks like image captioning. Noisy Student Training is a recent SSL paradigm proposed for image classification that is an extension of self-training and teacher-student learning. In this work, we provide an in-depth analysis of the noisy student SSL framework for the task of image captioning and derive state-of-the-art results. The original algorithm relies on computationally expensive data augmentation steps that involve perturbing the raw images and computing features for each perturbed image. We show that, even in the absence of raw image augmentation, the use of simple model and feature perturbations to the input images for the student model is beneficial to SSL training. We also show how a paraphrase generator can be effectively used for label augmentation to improve the quality of pseudo labels and significantly improve performance. Our final results in the limited labeled data setting (1% of the MS-COCO labeled data) outperform previous state-of-the-art approaches by 2.5 points on BLEU-4 and 11.5 points on CIDEr.