Deep image compression systems mainly contain four components: encoder, quantizer, entropy model, and decoder. To optimize these four components, a joint rate-distortion framework was proposed, and many deep neural network-based methods have achieved great success in image compression. However, almost all convolutional neural network-based methods treat channel-wise feature maps equally, which reduces their flexibility in handling different types of information. In this paper, we propose a channel-level variable quantization network that dynamically allocates more bitrate to significant channels and withdraws bitrate from negligible channels. Specifically, we propose a variable quantization controller consisting of two key components: a channel importance module, which dynamically learns the importance of channels during training, and a splitting-merging module, which allocates different bitrates to different channels. We also formulate the quantizer as a Gaussian mixture model. Quantitative and qualitative experiments verify the effectiveness of the proposed model and demonstrate that our method achieves superior performance and produces much better visual reconstructions.
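As a rough illustration of the channel-importance idea described above, the sketch below uses a squeeze-style gating network that scores each channel from its global statistics and splits the feature map into a group routed to a fine quantizer and a group routed to a coarse one. This is our own assumption of how such a module could look, not the authors' implementation.

```python
# Minimal sketch of a channel importance + splitting module (assumption: a
# squeeze-style gate; the paper's exact design may differ). Channels with the
# highest learned importance would be routed to a finer quantizer.
import torch
import torch.nn as nn

class ChannelImportance(nn.Module):
    def __init__(self, channels, top_k):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (N, C, H, W)
        score = self.gate(x.mean(dim=(2, 3)))  # (N, C) importance in [0, 1]
        # Split channels: top-k go to the fine quantizer, the rest to a coarse one.
        idx = score.mean(dim=0).topk(self.top_k).indices
        mask = torch.zeros(x.size(1), dtype=torch.bool, device=x.device)
        mask[idx] = True
        return x[:, mask], x[:, ~mask], score

x = torch.randn(2, 192, 16, 16)
fine, coarse, score = ChannelImportance(192, top_k=64)(x)
print(fine.shape, coarse.shape)  # torch.Size([2, 64, 16, 16]) torch.Size([2, 128, 16, 16])
```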
Vehicle Re-Identification (ReID) has attracted substantial research effort due to its great significance to public security. In vehicle ReID, we aim to learn features that are powerful in discriminating the subtle differences between visually similar vehicles and robust to different orientations of the same vehicle. However, these two characteristics are hard to encapsulate into a single feature representation simultaneously under unified supervision. Here we propose a Disentangled Feature Learning Network (DFLNet) to learn orientation-specific and common features concurrently, which are discriminative at details and invariant to orientations, respectively. Moreover, to effectively use these two types of features for ReID, we further design a feature metric alignment scheme to ensure the consistency of the metric scales. The experiments show the effectiveness of our method, which achieves state-of-the-art performance on three challenging datasets.
The problem of grounding language in vision is increasingly attracting scholarly efforts. As of now, however, most of the approaches have been limited to word embeddings, which are not capable of handling polysemous words. This is mainly due to the limited coverage of the available semantically-annotated datasets, hence forcing research to rely on alternative technologies (i.e., image search engines). To address this issue, we introduce EViLBERT, an approach which is able to perform image classification over an open set of concepts, both concrete and non-concrete. Our approach is based on the recently introduced Vision-Language Pretraining (VLP) model, and builds upon a manually-annotated dataset of concept-image pairs. We use our technique to clean up the image-to-concept mapping that is provided within a multilingual knowledge base, resulting in over 258,000 images associated with 42,500 concepts. We show that our VLP-based model can be used to create multimodal sense embeddings starting from our automatically-created dataset. In turn, we also show that these multimodal embeddings improve the performance of a Word Sense Disambiguation architecture over a strong unimodal baseline. We release code, dataset and embeddings at http://babelpic.org.
Scene perception and understanding tasks, including depth estimation, visual odometry (VO) and camera relocalization, are fundamental for applications such as autonomous driving, robots and drones. Driven by the power of deep learning, significant progress has been achieved on individual tasks, but the rich correlations among the three tasks are largely neglected. In previous studies, VO is generally accurate in local scope yet suffers from drift over long distances. By contrast, camera relocalization performs well in the global sense but lacks local precision. We argue that these two tasks should be strategically combined to leverage their complementary advantages, and further improved by exploiting the 3D geometric information from depth data, which is in turn beneficial for depth estimation. Therefore, we present a collaborative learning framework, consisting of DepthNet, LocalPoseNet and GlobalPoseNet with a joint optimization loss, to jointly estimate depth, VO and camera relocalization. Moreover, the Geometric Attention Guidance Model is introduced to exploit the geometric relevance among the three branches during learning. Extensive experiments demonstrate that the joint learning scheme is useful for all tasks and that our method outperforms current state-of-the-art techniques in depth estimation and camera relocalization, with highly competitive performance in VO.
We formulate object segmentation in video as a spectral graph clustering problem in space and time, in which nodes are pixels and their relations form local neighbourhoods. We claim that the strongest cluster in this pixel-level graph represents the salient object segmentation. We compute the main cluster using a novel and fast 3D filtering technique that finds the spectral clustering solution, namely the principal eigenvector of the graph's adjacency matrix, without building the matrix explicitly, which would be intractable. Our method is based on the power iteration, which we prove is equivalent to performing a specific set of 3D convolutions in the space-time feature volume. This allows us to avoid creating the matrix and to have a fast parallel implementation on GPU. We show that our method is much faster than classical power iteration applied directly to the adjacency matrix. Different from other works, ours is dedicated to preserving object consistency in space and time at the pixel level. In experiments, we obtain consistent improvement over the top state-of-the-art methods on the DAVIS-2016 dataset. We also achieve top results on the well-known SegTrackv2 dataset.
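For reference, the classical power iteration that this method reformulates as 3D convolutions looks like the sketch below; here the adjacency matrix is built explicitly on a small synthetic graph, which is exactly what the paper avoids doing for pixel-level graphs.

```python
# Classical power iteration: recovers the principal eigenvector of an
# adjacency matrix W. The paper's contribution is replacing the explicit
# matrix-vector products with equivalent 3D convolutions over the video volume.
import numpy as np

def power_iteration(W, num_iters=50):
    x = np.ones(W.shape[0]) / np.sqrt(W.shape[0])
    for _ in range(num_iters):
        x = W @ x
        x = x / np.linalg.norm(x)          # renormalise at every step
    return x                               # approximates the principal eigenvector

rng = np.random.default_rng(0)
A = rng.random((100, 100))
W = (A + A.T) / 2                          # symmetric, non-negative adjacency
v = power_iteration(W)
print(np.allclose(W @ v, (v @ W @ v) * v, atol=1e-4))  # eigenvector check -> True
```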
Face portrait editing has achieved great progress in recent years. However, previous methods either 1) operate on pre-defined face attributes, lacking the flexibility to control the shapes of high-level semantic facial components (e.g., eyes, nose, mouth), or 2) take a manually edited mask or sketch as an intermediate representation for observable changes, but such additional input usually requires extra effort to obtain. To break these limitations of the existing methods, we propose a novel framework termed r-FACE (Reference-guided FAce Component Editing) for diverse and controllable face component editing with geometric changes. Specifically, r-FACE takes an image inpainting model as the backbone, utilizing reference images as conditions for controlling the shape of face components. In order to encourage the framework to concentrate on the target face components, an example-guided attention module is designed to fuse attention features with the target face component features extracted from the reference image. Through extensive experimental validation and comparisons, we verify the effectiveness of the proposed framework.
Pedestrian detection at nighttime is a crucial and frontier problem in surveillance, but it has not been well explored by the computer vision and artificial intelligence communities. Most existing methods detect pedestrians under favorable lighting conditions (e.g., daytime) and achieve promising performance. In contrast, they often fail under unstable lighting conditions (e.g., nighttime). Night is a critical time for criminal suspects to act in the field of security. The existing nighttime pedestrian detection dataset is captured by a car camera, specially designed for autonomous driving scenarios; a dataset for the nighttime surveillance scenario is still vacant. There are vast differences between autonomous driving and surveillance, including viewpoint and illumination. In this paper, we build a novel pedestrian detection dataset from the nighttime surveillance perspective: NightSurveillance. As a benchmark dataset for pedestrian detection at nighttime, we compare the performances of state-of-the-art pedestrian detectors, and the results reveal that these methods cannot solve all the challenging problems of NightSurveillance. We believe that NightSurveillance can further advance research on pedestrian detection, especially in the field of surveillance security at nighttime.
Arbitrary shape text detection in natural scenes is an extremely challenging task. Unlike existing text detection approaches that only perceive texts based on limited feature representations, we propose a novel framework, namely TextFuseNet, that exploits richer fused features for text detection. More specifically, we propose to perceive texts from three levels of feature representations, i.e., character-, word- and global-level, and then introduce a novel text representation fusion technique to help achieve robust arbitrary text detection. The multi-level feature representation can adequately describe texts by dissecting them into individual characters while still maintaining their general semantics. TextFuseNet then collects and merges the texts' features from different levels using a multi-path fusion architecture, which can effectively align and fuse different representations. In practice, our proposed TextFuseNet can learn a more adequate description of arbitrarily-shaped texts, suppressing false positives and producing more accurate detection results. Our proposed framework can also be trained with weak supervision for datasets that lack character-level annotations. Experiments on several datasets show that the proposed TextFuseNet achieves state-of-the-art performance. Specifically, we achieve an F-measure of 94.3% on ICDAR2013, 92.1% on ICDAR2015, 87.1% on Total-Text and 86.6% on CTW-1500.
Convolutional neural networks (CNNs) have achieved remarkable success in image recognition. Although the internal patterns of the input images are effectively learned by CNNs, these patterns constitute only a small proportion of the useful patterns contained in the input images. This can be attributed to the fact that a CNN will stop learning if the learned patterns are sufficient to make a correct classification. Network regularization methods like dropout and SpatialDropout can ease this problem by randomly dropping features during training. These dropout methods, in essence, change the patterns learned by the networks and, in turn, force the networks to learn other patterns to make the correct classification. However, these methods have an important drawback: randomly dropping features is generally inefficient and can introduce unnecessary noise. To tackle this problem, we propose SelectScale. Instead of randomly dropping units, SelectScale selects the important features in networks and adjusts them during training. Using SelectScale, we improve the performance of CNNs on CIFAR and ImageNet.
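A hedged sketch of the select-and-scale idea follows (the paper's exact selection criterion and scaling rule may differ): rather than dropping random channels as dropout does, pick the most active channels and down-scale them during training, pushing the network to learn additional patterns.

```python
# Illustrative select-and-scale layer, not the authors' SelectScale implementation.
import torch
import torch.nn as nn

class SelectScaleLike(nn.Module):
    def __init__(self, ratio=0.25, scale=0.5):
        super().__init__()
        self.ratio, self.scale = ratio, scale

    def forward(self, x):                          # x: (N, C, H, W)
        if not self.training:
            return x
        act = x.abs().mean(dim=(2, 3))             # per-channel importance proxy
        k = max(1, int(self.ratio * x.size(1)))
        idx = act.topk(k, dim=1).indices           # most important channels
        weight = torch.ones_like(act)
        weight.scatter_(1, idx, self.scale)        # down-scale the selected channels
        return x * weight[:, :, None, None]

layer = SelectScaleLike()
layer.train()
print(layer(torch.randn(4, 64, 8, 8)).shape)       # torch.Size([4, 64, 8, 8])
```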
The popular tracking-by-detection paradigm for multi-object tracking (MOT) focuses on solving the data association problem, at the heart of which lies a robust similarity model. Most previous works strive to improve the feature representation of individual objects while leaving the relations among objects less explored, which may be problematic in some complex scenarios. In this paper, we focus on leveraging the relations among objects to improve the robustness of the similarity model. To this end, we propose a novel graph representation that takes both the features of individual objects and the relations among objects into consideration. Besides, a graph matching module is specially designed for the proposed graph representation to alleviate the impact of unreliable relations. With the help of the graph representation and the graph matching module, the proposed graph similarity model, named GSM, is more robust to occlusion and to targets sharing similar appearance. We conduct extensive experiments on challenging MOT benchmarks, and the experimental results demonstrate the effectiveness of the proposed method.
Recently, Convolutional Neural Network (CNN) based image super-resolution (SR) methods have shown significant success in the literature. However, these methods are implemented as a single-path stream that enriches feature maps from the input for the final prediction, failing to fully incorporate earlier low-level features into later high-level features. In this paper, to tackle this problem, we propose a deep interleaved network (DIN) that learns how information at different states should be combined for image SR, where shallow information guides the prediction of deep representative features. Our DIN follows a multi-branch pattern, allowing multiple interconnected branches to interleave and fuse at different states. Besides, an asymmetric co-attention (AsyCA) module is proposed and attached to the interleaved nodes to adaptively emphasize informative features from different states and improve the discriminative ability of the networks. Extensive experiments demonstrate the superiority of our proposed DIN in comparison with state-of-the-art SR methods.
Despite recent great progress on semantic segmentation, there still exist huge challenges in medical ultra-resolution image segmentation. Methods based on a multi-branch structure can strike a good balance between computational burden and segmentation accuracy. However, the fusion structure in these methods must be elaborately designed to achieve desirable results, which leads to model redundancy. In this paper, we propose the Meta Segmentation Network (MSN) to solve this challenging problem. With the help of meta-learning, the fusion module of MSN is quite simple but effective. MSN can quickly generate the weights of the fusion layers through a simple meta-learner, requiring only a few training samples and epochs to converge. In addition, to avoid learning all branches from scratch, we further introduce a particular weight sharing mechanism to realize fast knowledge adaptation and share the weights among multiple branches, resulting in performance improvement and a significant reduction in parameters. The experimental results on two challenging ultra-resolution medical datasets, BACH and ISIC, show that MSN achieves the best performance compared with state-of-the-art approaches.
Zero-Shot Learning (ZSL) handles the problem that some testing classes never appear in the training set. Existing ZSL methods are designed for learning from a fixed training set and lack the ability to capture and accumulate the knowledge of multiple training sets, making them infeasible for many real-world applications. In this paper, we propose a new ZSL setting, named Lifelong Zero-Shot Learning (LZSL), which aims to accumulate knowledge while learning from multiple datasets and to recognize unseen classes of all trained datasets. Besides, a novel method is proposed to realize LZSL, which effectively alleviates catastrophic forgetting during continual training. Specifically, considering that these datasets contain different semantic embeddings, we utilize a Variational Auto-Encoder to obtain unified semantic representations. Then, we leverage a selective retraining strategy to preserve the trained weights of previous tasks and avoid negative transfer when fine-tuning the entire model. Finally, knowledge distillation is employed to transfer knowledge from previous training stages to the current stage. We also design the LZSL evaluation protocol and the challenging benchmarks. Extensive experiments on these benchmarks indicate that our method tackles the LZSL problem effectively, while existing ZSL methods fail.
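The knowledge-distillation term mentioned above is a standard ingredient; a minimal sketch follows, with the temperature value and the use of logit-level KL divergence as illustrative assumptions rather than the paper's exact settings.

```python
# Minimal knowledge-distillation loss between a previous-stage (teacher) model
# and the current-stage (student) model. Temperature T is an assumed value.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL divergence between softened teacher and student distributions.
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

s = torch.randn(8, 50, requires_grad=True)   # student logits
t = torch.randn(8, 50)                       # teacher logits from the previous stage
print(distillation_loss(s, t).item())
```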
Recognizing sounds is a key aspect of computational audio scene analysis and machine perception. In this paper, we advocate that sound recognition is inherently a multi-modal audiovisual task, in that it is easier to differentiate sounds using both the audio and visual modalities than either one alone. We present an audiovisual fusion model that learns to recognize sounds from weakly labeled video recordings. The proposed fusion model utilizes an attention mechanism to dynamically combine the outputs of the individual audio and visual models. Experiments on the large-scale sound event dataset AudioSet demonstrate the efficacy of the proposed model, which outperforms single-modal models as well as state-of-the-art fusion and multi-modal models. We achieve a mean Average Precision (mAP) of 46.16 on AudioSet, outperforming the prior state of the art by approximately +4.35 mAP (relative: 10.4%).
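A plausible reading of the attention-based fusion is sketched below: each modality produces features and predictions, a small attention head scores the two modalities, and the final prediction is the attention-weighted sum of the per-modality outputs (the actual fusion architecture may differ).

```python
# Illustrative attention-weighted late fusion of audio and visual predictions.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, feat_dim):
        super().__init__()
        self.attn = nn.Linear(feat_dim, 1)          # one score per modality

    def forward(self, audio_feat, video_feat, audio_pred, video_pred):
        feats = torch.stack([audio_feat, video_feat], dim=1)   # (N, 2, D)
        preds = torch.stack([audio_pred, video_pred], dim=1)   # (N, 2, C)
        w = torch.softmax(self.attn(feats), dim=1)             # (N, 2, 1) modality weights
        return (w * preds).sum(dim=1)                          # fused prediction (N, C)

fusion = AttentionFusion(feat_dim=128)
a_f, v_f = torch.randn(4, 128), torch.randn(4, 128)
a_p, v_p = torch.sigmoid(torch.randn(4, 527)), torch.sigmoid(torch.randn(4, 527))  # 527 AudioSet classes
print(fusion(a_f, v_f, a_p, v_p).shape)                        # torch.Size([4, 527])
```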
Existing deep learning-based image de-blocking methods use only pixel-level loss functions to guide network training. The JPEG compression factor, which reflects the degree of degradation, has not been fully utilized. However, due to its non-differentiability, the compression factor cannot be directly utilized to train deep networks. To solve this problem, we propose compression quality ranker-guided networks for JPEG artifact removal. We first design a quality ranker to measure the compression degree, which is highly correlated with the JPEG quality. Based on this differentiable ranker, we then propose a quality-related loss and a feature matching loss to guide de-blocking and perceptual quality optimization. In addition, we utilize dilated convolutions to extract multi-scale features, which enables our single model to handle multiple compression quality factors. Our method can implicitly use the information contained in the compression factors to produce better results. Experiments demonstrate that our model achieves comparable or even better performance in both quantitative and qualitative measurements.
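The ranker-guided losses could look roughly like the following sketch, where `TinyRanker` is only a stand-in for the pretrained differentiable quality ranker and both loss forms are illustrative assumptions rather than the paper's exact formulations.

```python
# Hypothetical ranker-guided training signals: a quality-related loss that
# pushes the de-blocked output's predicted quality toward the ground truth's,
# and a feature matching loss on the ranker's intermediate features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRanker(nn.Module):
    """Stand-in for a pretrained quality ranker: returns a score and features."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, 3, padding=1)
        self.head = nn.Linear(16, 1)

    def forward(self, x):
        feat = F.relu(self.conv(x))
        score = self.head(feat.mean(dim=(2, 3)))
        return score, [feat]

def ranker_guided_losses(ranker, deblocked, ground_truth):
    score_out, feats_out = ranker(deblocked)
    score_gt, feats_gt = ranker(ground_truth)
    quality_loss = F.relu(score_gt - score_out).mean()   # output should rank no worse than GT
    feat_loss = sum(F.l1_loss(a, b) for a, b in zip(feats_out, feats_gt))
    return quality_loss, feat_loss

ranker = TinyRanker().eval()
ql, fl = ranker_guided_losses(ranker, torch.rand(2, 3, 32, 32), torch.rand(2, 3, 32, 32))
print(ql.item(), fl.item())
```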
Few-shot segmentation (FSS) methods perform image segmentation for a particular object class in a target (query) image, using a small set of (support) image-mask pairs. Recent deep neural network based FSS methods leverage high-dimensional feature similarity between the foreground features of the support images and the query image features. In this work, we demonstrate gaps in the utilization of this similarity information in existing methods, and present a framework - SimPropNet, to bridge those gaps. We propose to jointly predict the support and query masks to force the support features to share characteristics with the query features. We also propose to utilize similarities in the background regions of the query and support images using a novel foreground-background attentive fusion mechanism. Our method achieves state-of-the-art results for one-shot and five-shot segmentation on the PASCAL-5i dataset. The paper includes detailed analysis and ablation studies for the proposed improvements and quantitative comparisons with contemporary methods.
Brain functional connectivity analysis on fMRI data can improve the understanding of human brain function. However, due to inter-subject variability and heterogeneity across subjects, previous methods of functional connectivity analysis are often insufficient for capturing disease-related representations, thereby decreasing disease diagnosis performance. In this paper, we first propose a new multi-graph fusion framework to fine-tune the original representation derived from Pearson correlation analysis, and then employ an L1-SVM on the fine-tuned representations to conduct joint brain region selection and disease diagnosis, avoiding the curse of dimensionality on high-dimensional data. The multi-graph fusion framework automatically learns the connectivity number for every node (i.e., brain region) and integrates all subjects in a unified framework to output homogeneous and discriminative representations of all subjects. Experimental results on two real datasets, i.e., fronto-temporal dementia (FTD) and obsessive-compulsive disorder (OCD), verified the effectiveness of our proposed framework compared to state-of-the-art methods.
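The L1-SVM step is standard and available off the shelf; a minimal sketch on synthetic data is shown below, where the L1 penalty zeroes out coefficients of uninformative connectivity features and thus performs region selection and diagnosis jointly. The feature dimensions and the regularization strength C are illustrative choices.

```python
# L1-regularized linear SVM on (synthetic) connectivity representations:
# non-zero coefficients indicate the selected brain-region features.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 400))          # 80 subjects x 400 connectivity features
y = rng.integers(0, 2, 80)                  # patient vs. control labels (synthetic)

clf = LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=5000).fit(X, y)
selected = np.flatnonzero(np.abs(clf.coef_) > 1e-6)
print(f"{selected.size} features (regions) kept out of {X.shape[1]}")
```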
Despite the huge progress in scene graph generation in recent years, the long-tail distribution of object relationships remains a challenging and persistent issue. Existing methods largely rely on either external knowledge or statistical bias information to alleviate this problem. In this paper, we tackle this issue from two other aspects: (1) scene-object interaction, which aims at learning specific knowledge from a scene via an additive attention mechanism; and (2) long-tail knowledge transfer, which tries to transfer the rich knowledge learned from the head into the tail. Extensive experiments on three tasks on the benchmark dataset Visual Genome demonstrate that our method outperforms current state-of-the-art competitors. Our source code is available at https://github.com/htlsn/issg.
Image edge detection is considered a cornerstone task in computer vision. Due to the nature of the hierarchical representations learned in CNNs, it is intuitive to design side networks that utilize the richer convolutional features to improve edge detection. However, there is no consensus on how to integrate the hierarchical information. In this paper, we propose an effective end-to-end framework, named Bidirectional Additive Net (BAN), for image edge detection. In the proposed framework, we focus on two main problems: 1) how to design a universal network for incorporating hierarchical information sufficiently; and 2) how to achieve effective information flow between different stages and gradually improve the edge map stage by stage. To tackle these problems, we design a consecutive bottom-up and top-down architecture, where the bottom-up branch gradually removes detailed or sharp boundaries to enable accurate edge detection, and the top-down branch offers a chance for error correction by revisiting the low-level features that contain rich textural and spatial information. An attended additive module (AAM) is designed to cumulatively refine edges by selecting pivotal features in each stage. Experimental results show that our proposed methods improve edge detection performance to new records and achieve state-of-the-art results on two public benchmarks: BSDS500 and NYUDv2.
Besides local features, global information plays an essential role in semantic segmentation, yet recent works usually fail to explicitly extract meaningful global information and make full use of it. In this paper, we propose a SceneEncoder module that imposes scene-aware guidance to enhance the effect of global information. The module predicts a scene descriptor, which learns to represent the categories of objects existing in the scene and directly guides point-level semantic segmentation by filtering out categories not belonging to this scene. Additionally, to alleviate segmentation noise in local regions, we design a region similarity loss to propagate distinguishing features to their neighboring points with the same label, enhancing the distinguishing ability of point-wise features. We integrate our methods into several prevailing networks and conduct extensive experiments on the benchmark datasets ScanNet and ShapeNet. Results show that our methods greatly improve the performance of the baselines and achieve state-of-the-art performance.
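A simplified sketch of the scene-aware filtering idea follows (our own reading, not the authors' code): a global scene descriptor predicts which categories are present in the scene, and per-point logits of absent categories are softly suppressed before classification.

```python
# Hypothetical scene-aware filtering of point-level logits.
import torch
import torch.nn as nn

class SceneFilter(nn.Module):
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.scene_head = nn.Linear(feat_dim, num_classes)

    def forward(self, global_feat, point_logits):                   # (N, D), (N, P, C)
        scene_desc = torch.sigmoid(self.scene_head(global_feat))    # (N, C) presence in [0, 1]
        # Adding log-presence suppresses logits of categories absent from the scene.
        return point_logits + torch.log(scene_desc + 1e-6).unsqueeze(1)

f = SceneFilter(feat_dim=256, num_classes=21)
out = f(torch.randn(2, 256), torch.randn(2, 4096, 21))
print(out.shape)   # torch.Size([2, 4096, 21])
```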
The problem of characterizing brain functions such as memory, perception, and processing of stimuli has received significant attention in the neuroscience literature. These experiments rely on carefully calibrated, albeit complex, inputs to record brain response to signals. A major problem in analyzing brain response to common stimuli such as audio-visual input from videos (e.g., movies) or story narration through audio books is that observed neuronal responses are due to combinations of "pure" factors, many of which may be latent. In this paper, we present a novel methodological framework for deconvolving the brain's response to mixed stimuli into its constituent responses to underlying pure factors. This framework, based on archetypal analysis, is applied to the analysis of imaging data from an adult cohort watching the BBC show Sherlock. By focusing on visual stimulus, we show strong correlation between our deconvolved response and third-party textual video annotations, demonstrating the significant power of our analysis techniques. Building on these results, we show that our techniques can be used to predict neuronal responses in new subjects (how other individuals react to Sherlock), as well as to new visual content (how individuals react to other videos with known annotations). This paper reports on the first study that relates video features with neuronal responses in a rigorous algorithmic and statistical framework based on deconvolution of observed mixed imaging signals using archetypal analysis.
Spatiotemporal super-resolution (SR) aims to upscale both the spatial and temporal dimensions of input videos, producing videos with higher frame resolutions and frame rates. It involves two essential sub-tasks: spatial SR and temporal SR. We design a two-stream network for spatiotemporal SR in this work. One stream contains a temporal SR module followed by a spatial SR module, while the other stream has the same two modules in the reverse order. Based on the interchangeability of performing the two sub-tasks, the two network streams are supposed to produce consistent spatiotemporal SR results. Thus, we present a cross-stream consistency to enforce the similarity between the outputs of the two streams. In this way, the training of the two streams is correlated, which allows the two SR modules to share their supervisory signals and improve each other. In addition, the proposed cross-stream consistency does not consume labeled training data and can guide network training in an unsupervised manner. We leverage this property to carry out semi-supervised spatiotemporal SR. It turns out that our method makes the most of the training data and can derive an effective model with few high-resolution, high-frame-rate videos, achieving state-of-the-art performance. The source code of this work is available at https://hankweb.github.io/STSRwithCrossTask/.
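The cross-stream consistency term itself is simple; a minimal sketch under the assumption of an L1 penalty between the outputs of the two stream orderings is given below (the actual distance measure used in the paper may differ).

```python
# Unsupervised consistency between the two stream orderings: no ground-truth
# high-resolution, high-frame-rate video is needed to compute this term.
import torch
import torch.nn.functional as F

def cross_stream_consistency(out_ts, out_st):
    """out_ts: temporal-then-spatial SR result; out_st: spatial-then-temporal SR result."""
    return F.l1_loss(out_ts, out_st)

a = torch.rand(1, 3, 14, 256, 256)   # (batch, channels, frames, height, width)
b = torch.rand(1, 3, 14, 256, 256)
print(cross_stream_consistency(a, b).item())
```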
Generation of high-quality person images is challenging due to the sophisticated entanglements among image factors, e.g., appearance, pose, foreground, background, local details, and global structures. In this paper, we present a novel end-to-end framework to generate realistic person images based on given person poses and appearances. The core of our framework is a novel generator called the Appearance-aware Pose Stylizer (APS), which generates human images by progressively coupling the target pose with the conditioned person appearance. The framework is highly flexible and controllable, effectively decoupling various complex person image factors in the encoding phase and re-coupling them in the decoding phase. In addition, we present a new normalization method named adaptive patch normalization, which enables region-specific normalization and shows good performance when adopted in person image generation models. Experiments on two benchmark datasets show that our method is capable of generating visually appealing and realistic-looking results using arbitrary image and pose inputs.
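A rough sketch of region-specific normalization is given below; it only shows the per-patch statistics part, whereas the paper's adaptive patch normalization presumably also predicts modulation parameters. Patch size and the use of non-overlapping patches are assumptions for illustration.

```python
# Normalise feature maps within non-overlapping spatial patches instead of
# over the whole map, giving region-specific statistics.
import torch

def patch_normalize(x, patch=8, eps=1e-5):        # x: (N, C, H, W), H and W divisible by patch
    n, c, h, w = x.shape
    xp = x.view(n, c, h // patch, patch, w // patch, patch)
    mean = xp.mean(dim=(3, 5), keepdim=True)
    var = xp.var(dim=(3, 5), keepdim=True, unbiased=False)
    xp = (xp - mean) / torch.sqrt(var + eps)      # normalise within each patch
    return xp.view(n, c, h, w)

print(patch_normalize(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```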
In this paper, we focus on the problem of effectively applying the transformer structure to video captioning. The vanilla transformer was proposed for uni-modal language generation tasks such as machine translation. However, video captioning is a multimodal learning problem, and the video features have much redundancy between different time steps. Based on these concerns, we propose a novel method called sparse boundary-aware transformer (SBAT) to reduce the redundancy in video representation. SBAT employs a boundary-aware pooling operation on the scores from multi-head attention and selects diverse features from different scenarios. SBAT also includes a local correlation scheme to compensate for the local information loss brought by the sparse operation. Based on SBAT, we further propose an aligned cross-modal encoding scheme to boost the multimodal interaction. Experimental results on two benchmark datasets show that SBAT outperforms state-of-the-art methods on most of the metrics.
The R, G and B channels of a color image generally have different noise statistical properties or noise strengths. It is thus problematic to apply grayscale image denoising algorithms to color image denoising. In this paper, based on the non-local self-similarity of an image and the different noise strengths across channels, we propose a MultiChannel Weighted Schatten p-Norm Minimization (MCWSNM) model for RGB color image denoising. More specifically, considering a small local RGB patch in a noisy image, we first find its non-local similar cubic patches in a search window of appropriate size. These similar cubic patches are then vectorized and grouped to construct a noisy low-rank matrix, which can be recovered using the Schatten p-norm minimization framework. Moreover, a weight matrix is introduced to balance each channel's contribution to the final denoising result. The proposed MCWSNM can be solved via the alternating direction method of multipliers. The convergence property of the proposed method is also theoretically analyzed. Experiments conducted on both synthetic and real noisy color image datasets demonstrate highly competitive denoising performance, outperforming comparison algorithms, including several methods based on neural networks.
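The core low-rank recovery step can be illustrated with weighted singular value thresholding, shown here for the special case p = 1 (i.e., the weighted nuclear norm); the general Schatten p-norm case needs an inner generalized soft-thresholding loop, which is omitted. Matrix sizes and weights are illustrative, not the paper's settings.

```python
# Weighted singular value thresholding on a grouped-patch matrix (p = 1 case).
import numpy as np

def weighted_svt(Y, weights):
    """Shrink each singular value of Y by its corresponding weight."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - weights[: s.size], 0.0)
    return U @ np.diag(s_shrunk) @ Vt

rng = np.random.default_rng(0)
Y = rng.standard_normal((192, 60))              # vectorised noisy RGB patch group
w = np.full(60, 5.0)                            # larger weight = stronger shrinkage
X_hat = weighted_svt(Y, w)
print(np.linalg.matrix_rank(X_hat) <= np.linalg.matrix_rank(Y))  # rank can only drop
```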