IJCAI.2022 - Computer Vision

| Total: 135

#1 MotionMixer: MLP-based 3D Human Body Pose Forecasting [PDF] [Copy] [Kimi] [REL]

Authors: Arij Bouazizi ; Adrian Holzbock ; Ulrich Kressel ; Klaus Dietmayer ; Vasileios Belagiannis

In this work, we present MotionMixer, an efficient 3D human body pose forecasting model based solely on multi-layer perceptrons (MLPs). MotionMixer learns the spatial-temporal 3D body pose dependencies by sequentially mixing both modalities. Given a stacked sequence of 3D body poses, a spatial-MLP extracts fine-grained spatial dependencies of the body joints. The interaction of the body joints over time is then modelled by a temporal MLP. The spatial-temporal mixed features are finally aggregated and decoded to obtain the future motion. To calibrate the influence of each time step in the pose sequence, we make use of squeeze-and-excitation (SE) blocks. We evaluate our approach on the Human3.6M, AMASS, and 3DPW datasets using the standard evaluation protocols. For all evaluations, we demonstrate state-of-the-art performance while using a model with fewer parameters. Our code is available at: https://github.com/MotionMLP/MotionMixer.
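
Below is a minimal, illustrative sketch (not the authors' code) of how a mixer-style block with spatial and temporal MLPs plus squeeze-and-excitation reweighting of time steps could be written; the layer sizes, reduction ratio, and pooling choices are assumptions.

```python
# Illustrative MLP-Mixer-style pose forecasting block with SE over time steps (not the authors' code).
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Recalibrates the contribution of each time step (assumed reduction ratio)."""
    def __init__(self, num_frames, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(num_frames, num_frames // reduction), nn.ReLU(),
            nn.Linear(num_frames // reduction, num_frames), nn.Sigmoid())

    def forward(self, x):                      # x: (B, T, J) pose features
        w = self.fc(x.mean(dim=2))             # squeeze over the joint axis -> (B, T)
        return x * w.unsqueeze(-1)             # excite: reweight time steps

class MixerBlock(nn.Module):
    def __init__(self, num_frames, num_joints, hidden=256):
        super().__init__()
        self.spatial = nn.Sequential(nn.LayerNorm(num_joints),
                                     nn.Linear(num_joints, hidden), nn.GELU(),
                                     nn.Linear(hidden, num_joints))
        self.temporal = nn.Sequential(nn.LayerNorm(num_frames),
                                      nn.Linear(num_frames, hidden), nn.GELU(),
                                      nn.Linear(hidden, num_frames))
        self.se = SqueezeExcite(num_frames)

    def forward(self, x):                      # x: (B, T, J), J = 3 * number of joints
        x = x + self.spatial(x)                # mix the spatial (joint) dimension
        x = x + self.temporal(x.transpose(1, 2)).transpose(1, 2)  # mix the temporal dimension
        return self.se(x)

poses = torch.randn(8, 10, 66)                 # batch of 10-frame sequences, 22 joints x 3
print(MixerBlock(num_frames=10, num_joints=66)(poses).shape)  # torch.Size([8, 10, 66])
```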

#2 Event-driven Video Deblurring via Spatio-Temporal Relation-Aware Network [PDF] [Copy] [Kimi] [REL]

Authors: Chengzhi Cao ; Xueyang Fu ; Yurui Zhu ; Gege Shi ; Zheng-Jun Zha

Video deblurring with event information has attracted considerable attention. To help deblur each frame, existing methods usually compress a specific event sequence into a feature tensor with the same size as the corresponding video. However, this strategy neither considers the pixel-level spatial brightness changes nor the temporal correlation between events at each time step, resulting in insufficient use of spatio-temporal information. To address this issue, we propose a new Spatio-Temporal Relation-Attention network (STRA) for event-based video deblurring. Concretely, to utilize spatial consistency between the frame and events, we model the brightness changes as an extra prior to perceive blurring contexts in each frame; to record the temporal relationship among different events, we develop a temporal memory block that continuously restores long-range dependencies of event sequences. In this way, the complementary information contained in the events and frames, as well as the correlation of neighboring events, can be fully utilized to consistently recover spatial texture from events. Experiments show that our STRA significantly outperforms several competing methods, e.g., on the HQF dataset, our network achieves a gain of up to 1.3 dB in PSNR over the most advanced method. The code is available at https://github.com/Chengzhi-Cao/STRA.

#3 KPN-MFI: A Kernel Prediction Network with Multi-frame Interaction for Video Inverse Tone Mapping [PDF] [Copy] [Kimi] [REL]

Authors: Gaofeng Cao ; Fei Zhou ; Han Yan ; Anjie Wang ; Leidong Fan

Up to now, image-based inverse tone mapping (iTM) models have been widely investigated, while there is little research on video-based iTM methods. It would be interesting to make use of these existing image-based models in the video iTM task. However, directly transferring the image-based iTM models to video data without modeling spatial-temporal information remains nontrivial and challenging. Considering both the intra-frame quality and the inter-frame consistency of a video, this article presents a new video iTM method based on a kernel prediction network (KPN), which takes advantage of a multi-frame interaction (MFI) module to capture temporal-spatial information for video data. Specifically, a basic encoder-decoder KPN, essentially designed for image iTM, is trained to guarantee the mapping quality within each frame. More importantly, the MFI module is incorporated to capture temporal-spatial context information and preserve the inter-frame consistency by exploiting the correlation between adjacent frames. Notably, we can readily extend any existing image iTM model to a video iTM one by involving the proposed MFI module. Furthermore, we propose an inter-frame brightness consistency loss function based on the Gaussian pyramid to reduce the video temporal inconsistency. Extensive experiments demonstrate that our model outperforms state-of-the-art image- and video-based methods. The code is available at https://github.com/caogaofeng/KPNMFI.
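
As a rough illustration of the generic kernel-prediction idea the paper builds on, the snippet below applies per-pixel predicted kernels to a frame; the kernel size, softmax normalization, and tensor shapes are assumptions, not the paper's exact design.

```python
# Illustrative sketch of applying per-pixel predicted kernels (generic KPN idea).
import torch
import torch.nn.functional as F

def apply_predicted_kernels(frame, kernels, k=5):
    """frame: (B, C, H, W); kernels: (B, k*k, H, W) predicted by an encoder-decoder.
    Each output pixel is a weighted sum of its k x k neighbourhood."""
    B, C, H, W = frame.shape
    patches = F.unfold(frame, kernel_size=k, padding=k // 2)      # (B, C*k*k, H*W)
    patches = patches.view(B, C, k * k, H, W)
    weights = torch.softmax(kernels, dim=1).unsqueeze(1)          # (B, 1, k*k, H, W)
    return (patches * weights).sum(dim=2)                         # (B, C, H, W)

frame = torch.rand(2, 3, 64, 64)            # input SDR frame
kernels = torch.randn(2, 25, 64, 64)        # per-pixel 5x5 kernels from the prediction branch
print(apply_predicted_kernels(frame, kernels).shape)              # torch.Size([2, 3, 64, 64])
```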

#4 Zero-Shot Logit Adjustment [PDF] [Copy] [Kimi] [REL]

Authors: Dubing Chen ; Yuming Shen ; Haofeng Zhang ; Philip H.S. Torr

Semantic-descriptor-based Generalized Zero-Shot Learning (GZSL) poses challenges in recognizing novel classes in the test phase. The development of generative models enables current GZSL techniques to probe further into the semantic-visual link, culminating in a two-stage form that includes a generator and a classifier. However, existing generation-based methods focus on enhancing the generator's effect while neglecting the improvement of the classifier. In this paper, we first analyze two properties of the generated pseudo unseen samples: bias and homogeneity. Then, we perform variational Bayesian inference to back-derive the evaluation metrics, which reflect the balance of the seen and unseen classes. As a consequence of our derivation, the aforementioned two properties are incorporated into the classifier training as seen-unseen priors via logit adjustment. The Zero-Shot Logit Adjustment further puts semantic-based classifiers into effect in generation-based GZSL. Our experiments demonstrate that the proposed technique achieves state-of-the-art performance when combined with the basic generator, and it can improve various generative Zero-Shot Learning frameworks. Our code is available at https://github.com/cdb342/IJCAI-2022-ZLA.
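
The core logit-adjustment mechanism can be sketched in a few lines: class priors are added to the classifier logits in log space during training. The uniform prior below is only a placeholder; the paper derives its own seen-unseen prior terms.

```python
# Generic logit-adjustment sketch: add log class priors to the logits so the classifier
# compensates for the seen/unseen imbalance of real and generated samples.
import torch
import torch.nn.functional as F

def adjusted_cross_entropy(logits, labels, class_prior, tau=1.0):
    """logits: (B, num_classes); class_prior: (num_classes,) summing to 1."""
    adjusted = logits + tau * torch.log(class_prior + 1e-12)
    return F.cross_entropy(adjusted, labels)

logits = torch.randn(16, 50)                 # classifier logits for seen + generated unseen features
labels = torch.randint(0, 50, (16,))
prior = torch.full((50,), 1.0 / 50)          # placeholder prior, e.g. class frequency in the batch
print(adjusted_cross_entropy(logits, labels, prior))
```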

#5 Uncertainty-Aware Representation Learning for Action Segmentation [PDF] [Copy] [Kimi] [REL]

Authors: Lei Chen ; Muheng Li ; Yueqi Duan ; Jie Zhou ; Jiwen Lu

In this paper, we propose an Uncertainty-Aware Representation Learning (UARL) method for action segmentation. Most existing action segmentation methods exploit continuity information of the action period to predict frame-level labels, which ignores the temporal ambiguity of the transition region between two actions. Moreover, similar periods of different actions, e.g., the beginning of some actions, will confuse the network if they are annotated with different labels, which causes spatial ambiguity. To address this, we design UARL to exploit the transitional expression between two action periods by uncertainty learning. Specifically, we model every frame of actions with an active distribution that represents the probabilities of different actions, which captures the uncertainty of the action and exploits the tendency during the action. We evaluate our method on three popular action prediction datasets: Breakfast, Georgia Tech Egocentric Activities (GTEA), and 50Salads. The experimental results demonstrate that our method achieves state-of-the-art performance.

#6 AutoAlign: Pixel-Instance Feature Aggregation for Multi-Modal 3D Object Detection [PDF] [Copy] [Kimi] [REL]

Authors: Zehui Chen ; Zhenyu Li ; Shiquan Zhang ; Liangji Fang ; Qinhong Jiang ; Feng Zhao ; Bolei Zhou ; Hang Zhao

Object detection through either RGB images or LiDAR point clouds has been extensively explored in autonomous driving. However, it remains challenging to make these two data sources complementary and beneficial to each other. In this paper, we propose AutoAlign, an automatic feature fusion strategy for 3D object detection. Instead of establishing deterministic correspondence with the camera projection matrix, we model the mapping relationship between the image and point clouds with a learnable alignment map. This map enables our model to automate the alignment of non-homogeneous features in a dynamic and data-driven manner. Specifically, a cross-attention feature alignment module is devised to adaptively aggregate pixel-level image features for each voxel. To enhance the semantic consistency during feature alignment, we also design a self-supervised cross-modal feature interaction module, through which the model can learn feature aggregation with instance-level feature guidance. Extensive experimental results show that our approach can lead to 2.3 mAP and 7.0 mAP improvements on the KITTI and nuScenes datasets, respectively. Notably, our best model reaches 70.9 NDS on the nuScenes testing leaderboard, achieving competitive performance among various state-of-the-art methods.
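
A minimal sketch of the cross-attention alignment step, in which voxel features query pixel-level image features; the use of nn.MultiheadAttention, the feature dimension, and the residual fusion are assumptions for illustration.

```python
# Sketch of a cross-attention feature alignment step: voxel features attend to image pixel features.
import torch
import torch.nn as nn

class CrossModalAlign(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, voxel_feats, image_feats):
        # voxel_feats: (B, N_voxels, C) queries; image_feats: (B, H*W, C) keys/values.
        aligned, _ = self.attn(voxel_feats, image_feats, image_feats)
        return voxel_feats + aligned            # fuse aggregated pixel features into each voxel

voxels = torch.randn(2, 1024, 128)
pixels = torch.randn(2, 64 * 64, 128)
print(CrossModalAlign()(voxels, pixels).shape)  # torch.Size([2, 1024, 128])
```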

#7 Unsupervised Multi-Modal Medical Image Registration via Discriminator-Free Image-to-Image Translation [PDF] [Copy] [Kimi] [REL]

Authors: Zekang Chen ; Jia Wei ; Rui Li

In clinical practice, well-aligned multi-modal images, such as Magnetic Resonance (MR) and Computed Tomography (CT), together can provide complementary information for image-guided therapies. Multi-modal image registration is essential for the accurate alignment of these multi-modal images. However, it remains a very challenging task due to the complicated and unknown spatial correspondence between different modalities. In this paper, we propose a novel translation-based unsupervised deformable image registration approach to convert the multi-modal registration problem to a mono-modal one. Specifically, our approach incorporates a discriminator-free translation network to facilitate the training of the registration network and a patchwise contrastive loss to encourage the translation network to preserve object shapes. Furthermore, we propose to replace the adversarial loss, which is widely used in previous multi-modal image registration methods, with a pixel loss in order to integrate the output of translation into the target modality. This leads to an unsupervised method requiring no ground-truth deformation or pairs of aligned images for training. We evaluate four variants of our approach on the public Learn2Reg 2021 datasets. The experimental results demonstrate that the proposed architecture achieves state-of-the-art performance. Our code is available at https://github.com/heyblackC/DFMIR.

#8 SpanConv: A New Convolution via Spanning Kernel Space for Lightweight Pansharpening [PDF] [Copy] [Kimi] [REL]

Authors: Zhi-Xuan Chen ; Cheng Jin ; Tian-Jing Zhang ; Xiao Wu ; Liang-Jian Deng

Standard convolution operations can effectively perform feature extraction and representation but result in high computational cost, largely because a full convolution kernel is generated for the channel dimension of the feature map, which causes unnecessary redundancy. In this paper, we focus on kernel generation and present an interpretable span strategy, named SpanConv, for the effective construction of the kernel space. Specifically, we first learn two navigated kernels with a single channel as bases, then extend the two kernels by learnable coefficients, and finally span the two sets of kernels by their linear combination to construct the so-called SpanKernel. The proposed SpanConv is realized by replacing the plain convolution kernel with the SpanKernel. To verify the effectiveness of SpanConv, we design a simple network with SpanConv. Experiments demonstrate that the proposed network significantly reduces parameters compared with benchmark networks for remote sensing pansharpening, while achieving competitive performance and excellent generalization. Code is available at https://github.com/zhi-xuan-chen/IJCAI-2022 SpanConv.
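
A possible reading of the span strategy is sketched below: two single-channel base kernels are expanded by learnable coefficients and linearly combined into a full kernel. Initialization and shapes are illustrative assumptions.

```python
# Illustrative SpanConv-style kernel construction: two single-channel base kernels are
# expanded by learnable per-(out,in)-channel coefficients and linearly combined.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.base = nn.Parameter(torch.randn(2, 1, 1, k, k) * 0.1)      # two single-channel bases
        self.coef = nn.Parameter(torch.randn(2, out_ch, in_ch, 1, 1))   # learnable span coefficients
        self.k = k

    def forward(self, x):
        # Span the full kernel: (out_ch, in_ch, k, k) = sum_i coef_i * base_i
        kernel = (self.coef * self.base).sum(dim=0)
        return F.conv2d(x, kernel, padding=self.k // 2)

x = torch.randn(1, 32, 64, 64)
print(SpanConv2d(32, 64)(x).shape)   # torch.Size([1, 64, 64, 64])
```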

#9 Robust Single Image Dehazing Based on Consistent and Contrast-Assisted Reconstruction [PDF] [Copy] [Kimi] [REL]

Authors: De Cheng ; Yan Li ; Dingwen Zhang ; Nannan Wang ; Xinbo Gao ; Jiande Sun

Single image dehazing, as a fundamental low-level vision task, is essential for the development of robust intelligent surveillance systems. In this paper, we make an early effort to consider dehazing robustness under variational haze density, which is a realistic yet under-studied problem in the research field of single image dehazing. To properly address this problem, we propose a novel density-variational learning framework to improve the robustness of the image dehazing model, assisted by a variety of negative hazy images, to better deal with various complex hazy scenarios. Specifically, the dehazing network is optimized under the consistency-regularized framework with the proposed Contrast-Assisted Reconstruction Loss (CARL). The CARL can fully exploit the negative information to facilitate the traditional positive-oriented dehazing objective function, by squeezing the dehazed image to its clean target from different directions. Meanwhile, the consistency regularization keeps consistent outputs given multi-level hazy images, thus improving the model robustness. Extensive experimental results on two synthetic and three real-world datasets demonstrate that our method significantly surpasses the state-of-the-art approaches.
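
The following is a simplified, assumption-laden sketch of a contrast-assisted reconstruction loss that pulls the dehazed output toward its clean target while pushing it away from negative hazy images; the actual CARL formulation (feature space, weighting) differs in detail.

```python
# Simplified contrast-assisted reconstruction loss: positive pull toward the clean target,
# negative push away from hazy images of varied density. Illustrative formulation only.
import torch
import torch.nn.functional as F

def contrast_assisted_loss(dehazed, clean, negatives, weight=0.1):
    """dehazed, clean: (B, C, H, W); negatives: (K, B, C, H, W) hazy images of varied density."""
    pos = F.l1_loss(dehazed, clean)                                     # positive reconstruction term
    neg = torch.stack([F.l1_loss(dehazed, n) for n in negatives]).mean()
    return pos + weight * pos / (neg + 1e-6)                            # contrastive ratio term

dehazed = torch.rand(2, 3, 64, 64)
clean = torch.rand(2, 3, 64, 64)
negatives = torch.rand(4, 2, 3, 64, 64)
print(contrast_assisted_loss(dehazed, clean, negatives))
```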

#10 I²R-Net: Intra- and Inter-Human Relation Network for Multi-Person Pose Estimation [PDF] [Copy] [Kimi] [REL]

Authors: Yiwei Ding ; Wenjin Deng ; Yinglin Zheng ; Pengfei Liu ; Meihong Wang ; Xuan Cheng ; Jianmin Bao ; Dong Chen ; Ming Zeng

In this paper, we present the Intra- and Inter-Human Relation Network (I²R-Net) for Multi-Person Pose Estimation. It involves two basic modules. First, the Intra-Human Relation Module operates on a single person and aims to capture Intra-Human dependencies. Second, the Inter-Human Relation Module considers the relation between multiple instances and focuses on capturing Inter-Human interactions. The Inter-Human Relation Module can be designed to be very lightweight by reducing the resolution of the feature map, yet it learns useful relation information that significantly boosts the performance of the Intra-Human Relation Module. Even without bells and whistles, our method can compete with or outperform current competition winners. We conduct extensive experiments on the COCO, CrowdPose, and OCHuman datasets. The results demonstrate that the proposed model surpasses all the state-of-the-art methods. Concretely, the proposed method achieves 77.4% AP on the CrowdPose dataset and 67.8% AP on the OCHuman dataset, outperforming existing methods by a large margin. Additionally, the ablation study and visualization analysis also prove the effectiveness of our model.

#11 Region-Aware Metric Learning for Open World Semantic Segmentation via Meta-Channel Aggregation [PDF] [Copy] [Kimi] [REL]

Authors: Hexin Dong ; Zifan Chen ; Mingze Yuan ; Yutong Xie ; Jie Zhao ; Fei Yu ; Bin Dong ; Li Zhang

As one of the most challenging and practical segmentation tasks, open-world semantic segmentation requires the model to segment the anomaly regions in the images and incrementally learn to segment out-of-distribution (OOD) objects, especially under a few-shot condition. The current state-of-the-art (SOTA) method, Deep Metric Learning Network (DMLNet), relies on pixel-level metric learning, with which the identification of similar regions having different semantics is difficult. Therefore, we propose a method called region-aware metric learning (RAML), which first separates the regions of the images and generates region-aware features for further metric learning. RAML improves the integrity of the segmented anomaly regions. Moreover, we propose a novel meta-channel aggregation (MCA) module to further separate anomaly regions, forming high-quality sub-region candidates and thereby improving the model performance for OOD objects. To evaluate the proposed RAML, we have conducted extensive experiments and ablation studies on Lost And Found and Road Anomaly datasets for anomaly segmentation and the CityScapes dataset for incremental few-shot learning. The results show that the proposed RAML achieves SOTA performance in both stages of open world segmentation. Our code and appendix are available at https://github.com/czifan/RAML.

#12 MNet: Rethinking 2D/3D Networks for Anisotropic Medical Image Segmentation [PDF] [Copy] [Kimi] [REL]

Authors: Zhangfu Dong ; Yuting He ; Xiaoming Qi ; Yang Chen ; Huazhong Shu ; Jean-Louis Coatrieux ; Guanyu Yang ; Shuo Li

The nature of thick-slice scanning causes severe inter-slice discontinuities in 3D medical images, and vanilla 2D/3D convolutional neural networks (CNNs) fail to represent sparse inter-slice information and dense intra-slice information in a balanced way, leading to severe underfitting of inter-slice features (for vanilla 2D CNNs) and overfitting to noise from long-range slices (for vanilla 3D CNNs). In this work, a novel mesh network (MNet) is proposed to balance the spatial representation across axes via learning. 1) Our MNet latently fuses plenty of representation processes by embedding multi-dimensional convolutions deeply into basic modules, making the selection of representation processes flexible, thus adaptively balancing the representation of sparse inter-slice information and dense intra-slice information. 2) Our MNet latently fuses multi-dimensional features inside each basic module, simultaneously taking advantage of 2D (high segmentation accuracy of the easily recognized regions in the 2D view) and 3D (high smoothness of the 3D organ contour) representations, thus obtaining more accurate modeling of target regions. Comprehensive experiments are performed on four public datasets (CT & MR), and the results consistently demonstrate that the proposed MNet outperforms the other methods. The code and datasets are available at: https://github.com/zfdong-code/MNet
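
A toy sketch of the underlying idea of fusing intra-slice (2D-like) and inter-slice (3D) convolutions inside one basic module is given below; the kernel shapes, normalization, and simple summation fusion are assumptions rather than the MNet design.

```python
# Sketch of fusing intra-slice (2D-like) and inter-slice (3D) representations in one module.
import torch
import torch.nn as nn

class Fused2D3DBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # (1, 3, 3) kernel: dense intra-slice information; (3, 3, 3) kernel: sparse inter-slice information.
        self.conv2d_like = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.conv3d = nn.Conv3d(in_ch, out_ch, kernel_size=(3, 3, 3), padding=1)
        self.norm = nn.InstanceNorm3d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                 # x: (B, C, D, H, W), D = slice axis
        return self.act(self.norm(self.conv2d_like(x) + self.conv3d(x)))

vol = torch.randn(1, 1, 16, 128, 128)     # thick-slice CT/MR volume
print(Fused2D3DBlock(1, 32)(vol).shape)   # torch.Size([1, 32, 16, 128, 128])
```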

#13 ICGNet: Integration Context-based Reverse-Contour Guidance Network for Polyp Segmentation [PDF] [Copy] [Kimi] [REL]

Authors: Xiuquan Du ; Xuebin Xu ; Kunpeng Ma

Precise segmentation of polyps from colonoscopic images is extremely significant for the early diagnosis and treatment of colorectal cancer. However, it is still a challenging task because: (1) the blurred boundary between the polyp and the background makes delineation difficult; (2) the various sizes and shapes of polyps make feature representation difficult. In this paper, we propose an integration context-based reverse-contour guidance network (ICGNet) to solve these challenges. The ICGNet first utilizes a reverse-contour guidance module to aggregate low-level edge detail information and meanwhile constrain the reverse region. Then, a newly designed adaptive context module is used to adaptively extract local-global information of the current layer and complementary information of the previous layer to obtain larger and denser features. Lastly, an innovative hybrid pyramid pooling fusion module fuses the multi-level features generated by the decoder while emphasizing salient features and suppressing background. Our proposed approach is evaluated on the EndoScene, Kvasir-SEG and CVC-ColonDB datasets with eight evaluation metrics, and gives competitive results compared with other state-of-the-art methods in both learning ability and generalization capability.

#14 SVTR: Scene Text Recognition with a Single Visual Model [PDF] [Copy] [Kimi] [REL]

Authors: Yongkun Du ; Zhineng Chen ; Caiyan Jia ; Xiaoting Yin ; Tianlun Zheng ; Chenxia Li ; Yuning Du ; Yu-Gang Jiang

Dominant scene text recognition models commonly contain two building blocks, a visual model for feature extraction and a sequence model for text transcription. This hybrid architecture, although accurate, is complex and less efficient. In this study, we propose a Single Visual model for Scene Text recognition within the patch-wise image tokenization framework, which dispenses with sequential modeling entirely. The method, termed SVTR, first decomposes a text image into small patches named character components. Afterward, hierarchical stages are recurrently carried out by component-level mixing, merging and/or combining. Global and local mixing blocks are devised to perceive the inter-character and intra-character patterns, leading to a multi-grained character component perception. Thus, characters are recognized by a simple linear prediction. Experimental results on both English and Chinese scene text recognition tasks demonstrate the effectiveness of SVTR. SVTR-L (Large) achieves highly competitive accuracy in English and outperforms existing methods by a large margin in Chinese, while running faster. In addition, SVTR-T (Tiny) is an effective and much smaller model, which shows appealing inference speed. The code is publicly available at https://github.com/PaddlePaddle/PaddleOCR.
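
A heavily simplified sketch of the single-visual-model pipeline is shown below: patch tokenization, a stack of mixing blocks (a plain transformer encoder stands in for the global/local mixing), and a linear prediction head. Sizes and the vocabulary are assumptions.

```python
# Simplified single-visual-model sketch: tokenize the text image into patches, mix, predict characters.
import torch
import torch.nn as nn

class TinySVTR(nn.Module):
    def __init__(self, vocab=37, dim=96):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=4, stride=4)       # character components
        self.mixer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(dim, vocab)                                    # per-position character logits

    def forward(self, img):                       # img: (B, 3, 32, 128) text line image
        tokens = self.patch_embed(img)            # (B, dim, 8, 32)
        B, C, H, W = tokens.shape
        tokens = tokens.flatten(2).transpose(1, 2)        # (B, H*W, dim)
        tokens = self.mixer(tokens)
        tokens = tokens.view(B, H, W, C).mean(dim=1)      # pool over height -> (B, W, dim)
        return self.head(tokens)                          # (B, W, vocab), e.g. decoded with CTC

print(TinySVTR()(torch.randn(2, 3, 32, 128)).shape)       # torch.Size([2, 32, 37])
```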

#15 Learning Coated Adversarial Camouflages for Object Detectors [PDF] [Copy] [Kimi] [REL]

Authors: Yexin Duan ; Jialin Chen ; Xingyu Zhou ; Junhua Zou ; Zhengyun He ; Jin Zhang ; Wu Zhang ; Zhisong Pan

An adversary can fool deep neural network object detectors by generating adversarial noises. Most of the existing works focus on learning local visible noises in an adversarial "patch" fashion. However, a 2D patch attached to a 3D object tends to suffer from an inevitable reduction in attack performance as the viewpoint changes. To remedy this issue, this work proposes the Coated Adversarial Camouflage (CAC) to attack detectors from arbitrary viewpoints. Unlike patches trained in the 2D space, our camouflage is generated by a conceptually different training framework consisting of 3D rendering and a dense proposals attack. Specifically, we make the camouflage perform 3D spatial transformations according to the pose changes of the object. Based on the multi-view rendering results, the top-n proposals of the region proposal network are fixed, and all the classifications in the fixed dense proposals are attacked simultaneously to output errors. In addition, we build a virtual 3D scene to fairly and reproducibly evaluate different attacks. Extensive experiments demonstrate the superiority of CAC over the existing attacks, and it shows impressive performance both in the virtual scene and the real world. This poses a potential threat to security-critical computer vision systems.

#16 D-DPCC: Deep Dynamic Point Cloud Compression via 3D Motion Prediction [PDF] [Copy] [Kimi] [REL]

Authors: Tingyu Fan ; Linyao Gao ; Yiling Xu ; Zhu Li ; Dong Wang

The non-uniformly distributed nature of the 3D Dynamic Point Cloud (DPC) brings significant challenges to its highly efficient inter-frame compression. This paper proposes a novel 3D sparse convolution-based Deep Dynamic Point Cloud Compression (D-DPCC) network to compensate and compress the DPC geometry with 3D motion estimation and motion compensation in the feature space. In the proposed D-DPCC network, we design a Multi-scale Motion Fusion (MMF) module to accurately estimate the 3D optical flow between the feature representations of adjacent point cloud frames. Specifically, we utilize a 3D sparse convolution-based encoder to obtain the latent representation for motion estimation in the feature space and introduce the proposed MMF module for fused 3D motion embedding. Besides, for motion compensation, we propose a 3D Adaptively Weighted Interpolation (3DAWI) algorithm with a penalty coefficient to adaptively decrease the impact of distant neighbours. We compress the motion embedding and the residual with a lossy autoencoder-based network. To our knowledge, this paper is the first work proposing an end-to-end deep dynamic point cloud compression framework. Experimental results show that the proposed D-DPCC framework achieves an average 76% BD-Rate (Bjontegaard Delta Rate) gain against state-of-the-art Video-based Point Cloud Compression (V-PCC) v13 in inter mode.
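
The adaptively weighted interpolation idea can be illustrated as follows: each target point gathers features from its k nearest warped neighbours with inverse-distance weights damped by a penalty term; the exact penalty form and k are assumptions.

```python
# Illustrative adaptively weighted interpolation over k nearest neighbours in a point cloud.
import torch

def adaptively_weighted_interp(target_xyz, source_xyz, source_feat, k=3, penalty=1.0):
    """target_xyz: (N, 3); source_xyz: (M, 3); source_feat: (M, C)."""
    dist = torch.cdist(target_xyz, source_xyz)                  # (N, M) pairwise distances
    knn_dist, knn_idx = dist.topk(k, dim=1, largest=False)      # k nearest neighbours per target point
    weights = 1.0 / (knn_dist + penalty)                        # penalty shrinks far-neighbour influence
    weights = weights / weights.sum(dim=1, keepdim=True)
    return (source_feat[knn_idx] * weights.unsqueeze(-1)).sum(dim=1)   # (N, C)

tgt = torch.rand(1024, 3)
src = torch.rand(2048, 3)
feat = torch.randn(2048, 64)
print(adaptively_weighted_interp(tgt, src, feat).shape)          # torch.Size([1024, 64])
```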

#17 SparseTT: Visual Tracking with Sparse Transformers [PDF] [Copy] [Kimi] [REL]

Authors: Zhihong Fu ; Zehua Fu ; Qingjie Liu ; Wenrui Cai ; Yunhong Wang

Transformers have been successfully applied to the visual tracking task and significantly promote tracking performance. The self-attention mechanism designed to model long-range dependencies is the key to the success of Transformers. However, self-attention lacks a focus on the most relevant information in the search regions, making it easily distracted by the background. In this paper, we relieve this issue with a sparse attention mechanism that focuses on the most relevant information in the search regions, which enables much more accurate tracking. Furthermore, we introduce a double-head predictor to boost the accuracy of foreground-background classification and the regression of target bounding boxes, which further improves the tracking performance. Extensive experiments show that, without bells and whistles, our method significantly outperforms the state-of-the-art approaches on LaSOT, GOT-10k, TrackingNet, and UAV123, while running at 40 FPS. Notably, the training time of our method is reduced by 75% compared to that of TransT. The source code and models are available at https://github.com/fzh0917/SparseTT.
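
A compact sketch of top-k sparse attention, where each query keeps only its most relevant keys before the softmax; the value of k and the single-head formulation are assumptions.

```python
# Top-k sparse attention sketch: mask out all but the k most relevant keys per query.
import torch
import torch.nn.functional as F

def sparse_attention(q, k, v, topk=16):
    """q: (B, Nq, C); k, v: (B, Nk, C)."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5        # (B, Nq, Nk) scaled dot products
    kth = scores.topk(topk, dim=-1).values[..., -1:]             # k-th largest score per query
    scores = scores.masked_fill(scores < kth, float('-inf'))     # drop less relevant positions
    return F.softmax(scores, dim=-1) @ v

q = torch.randn(2, 64, 128)       # search-region queries
kv = torch.randn(2, 256, 128)     # keys/values
print(sparse_attention(q, kv, kv).shape)                         # torch.Size([2, 64, 128])
```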

#18 Lightweight Bimodal Network for Single-Image Super-Resolution via Symmetric CNN and Recursive Transformer [PDF] [Copy] [Kimi] [REL]

Authors: Guangwei Gao ; Zhengxue Wang ; Juncheng Li ; Wenjie Li ; Yi Yu ; Tieyong Zeng

Single-image super-resolution (SISR) has achieved significant breakthroughs with the development of deep learning. However, these methods are difficult to apply in real-world scenarios since they are inevitably accompanied by high computational and memory costs caused by complex operations. To solve this issue, we propose a Lightweight Bimodal Network (LBNet) for SISR. Specifically, an effective Symmetric CNN is designed for local feature extraction and coarse image reconstruction. Meanwhile, we propose a Recursive Transformer to fully learn the long-term dependence of images so that the global information can be fully used to further refine texture details. Studies show that the hybrid of CNN and Transformer can build a more efficient model. Extensive experiments have proved that our LBNet achieves more prominent performance than other state-of-the-art methods with relatively low computational cost and memory consumption. The code is available at https://github.com/IVIPLab/LBNet.

#19 Region-Aware Temporal Inconsistency Learning for DeepFake Video Detection [PDF] [Copy] [Kimi] [REL]

Authors: Zhihao Gu ; Taiping Yao ; Yang Chen ; Ran Yi ; Shouhong Ding ; Lizhuang Ma

The rapid development of face forgery techniques has drawn growing attention due to security concerns. Existing deepfake video detection methods always attempt to capture discriminative features by directly exploiting static temporal convolution to mine temporal inconsistency, without explicitly exploring the diverse temporal dynamics of different forged regions. To effectively and comprehensively capture the various inconsistencies, in this paper, we propose a novel Region-Aware Temporal Filter (RATF) module which automatically generates corresponding temporal filters for different spatial regions. Specifically, we decouple the dynamic temporal kernel into a set of region-agnostic basic filters and region-sensitive aggregation weights. Different weights guide the corresponding regions to adaptively learn temporal inconsistency, which greatly enhances the overall representational ability. Moreover, to cover long-term temporal dynamics, we divide the video into multiple snippets and propose a Cross-Snippet Attention (CSA) to promote cross-snippet information interaction. Extensive experiments and visualizations on several benchmarks demonstrate the effectiveness of our method against state-of-the-art competitors.
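
The decoupling of a dynamic temporal filter into region-agnostic basis filters and region-sensitive aggregation weights might look roughly like the sketch below; the number of bases, the weight predictor, and the filtering loop are illustrative assumptions.

```python
# Illustrative decomposition of a dynamic temporal filter into shared bases and per-region weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAwareTemporalFilter(nn.Module):
    def __init__(self, channels=64, num_bases=4, kernel_t=3):
        super().__init__()
        self.bases = nn.Parameter(torch.randn(num_bases, kernel_t) * 0.1)   # region-agnostic basis filters
        self.weight_pred = nn.Conv2d(channels, num_bases, kernel_size=1)    # region-sensitive weights
        self.kernel_t = kernel_t

    def forward(self, x):                       # x: (B, C, T, H, W) video features
        B, C, T, H, W = x.shape
        w = F.softmax(self.weight_pred(x.mean(dim=2)), dim=1)               # (B, num_bases, H, W)
        kernels = torch.einsum('bnhw,nt->bthw', w, self.bases)              # per-pixel temporal kernel
        pad = self.kernel_t // 2
        x_pad = F.pad(x, (0, 0, 0, 0, pad, pad))                            # pad along the time axis
        out = torch.zeros_like(x)
        for dt in range(self.kernel_t):                                     # apply the filter per offset
            out = out + x_pad[:, :, dt:dt + T] * kernels[:, dt].unsqueeze(1).unsqueeze(2)
        return out

print(RegionAwareTemporalFilter()(torch.randn(1, 64, 8, 14, 14)).shape)      # torch.Size([1, 64, 8, 14, 14])
```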

#20 Learning Target-aware Representation for Visual Tracking via Informative Interactions [PDF] [Copy] [Kimi] [REL]

Authors: Mingzhe Guo ; Zhipeng Zhang ; Heng Fan ; Liping Jing ; Yilin Lyu ; Bing Li ; Weiming Hu

We introduce a novel backbone architecture to improve the target-perception ability of feature representation for tracking. We observe that de facto frameworks perform feature matching simply using the backbone outputs for target localization, so there is no direct feedback from the matching module to the backbone network, especially its shallow layers. Concretely, only the matching module can directly access the target information, while the representation learning of the candidate frame is blind to the reference target. Therefore, the accumulated target-irrelevant interference in shallow stages may degrade the feature quality of deeper layers. In this paper, we approach the problem by conducting multiple branch-wise interactions inside the Siamese-like backbone networks (InBN). The core of InBN is a general interaction modeler (GIM) that injects the target information into different stages of the backbone network, leading to better target-perception of the candidate feature representation with negligible computation cost. The proposed GIM module and InBN mechanism are general and applicable to different backbone types, including CNN and Transformer, as evidenced on multiple benchmarks. In particular, the CNN version improves the baseline with absolute SUC gains of 3.2/6.9 on LaSOT/TNL2K. The Transformer version obtains SUC scores of 65.7/52.0 on LaSOT/TNL2K, which are on par with recent SOTAs.

#21 Exploring Fourier Prior for Single Image Rain Removal [PDF] [Copy] [Kimi] [REL]

Authors: Xin Guo ; Xueyang Fu ; Man Zhou ; Zhen Huang ; Jialun Peng ; Zheng-Jun Zha

Deep convolutional neural networks (CNNs) have become dominant in the task of single image rain removal. Most current CNN methods, however, suffer from overfitting on a single synthetic dataset as they neglect the intrinsic prior of the physical properties of rain streaks. To address this issue, we propose a simple but effective prior, the Fourier prior, to improve the generalization ability of an image rain removal model. The Fourier prior is a property of rainy images. It is based on our key observation that replacing the Fourier amplitude of rainy images with that of clean images greatly suppresses both synthetic and real-world rain streaks. This means that the amplitude contains most of the rain streak information while the phase preserves the structure of the background. It is therefore natural for single image rain removal to process the amplitude and phase information of the rainy images separately. In this paper, we develop a two-stage model where the first stage restores the amplitude of rainy images to remove rain streaks, and the second stage restores the phase information to refine fine-grained background structures. Extensive experiments on synthetic rainy data demonstrate the power of the Fourier prior. Moreover, when trained on synthetic data, the model also generalizes robustly to real-world images. The code will be publicly available at https://github.com/willinglucky/ExploringFourier-Prior-for-Single-Image-Rain-Removal.
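
The key observation behind the Fourier prior can be reproduced directly with a few lines of torch.fft: swap the amplitude of a rainy image with that of a clean image while keeping the rainy phase.

```python
# Amplitude/phase swap in the Fourier domain: clean amplitude + rainy phase suppresses rain streaks.
import torch

def swap_amplitude(rainy, clean):
    """rainy, clean: (B, C, H, W) images in [0, 1]."""
    rainy_fft = torch.fft.fft2(rainy)
    clean_fft = torch.fft.fft2(clean)
    amplitude = torch.abs(clean_fft)            # take the amplitude from the clean image
    phase = torch.angle(rainy_fft)              # keep the phase (background structure) from the rainy image
    recombined = amplitude * torch.exp(1j * phase)
    return torch.fft.ifft2(recombined).real.clamp(0, 1)

rainy = torch.rand(1, 3, 128, 128)
clean = torch.rand(1, 3, 128, 128)
print(swap_amplitude(rainy, clean).shape)       # torch.Size([1, 3, 128, 128])
```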

#22 Rethinking Image Aesthetics Assessment: Models, Datasets and Benchmarks [PDF] [Copy] [Kimi] [REL]

Authors: Shuai He ; Yongchang Zhang ; Rui Xie ; Dongxiang Jiang ; Anlong Ming

Challenges in image aesthetics assessment (IAA) arise from the fact that images of different themes correspond to different evaluation criteria, and learning aesthetics directly from images while ignoring the impact of theme variations on human visual perception inhibits the further development of IAA; however, existing IAA datasets and models overlook this problem. To address this issue, we show that a theme-oriented dataset and model design are effective for IAA. Specifically, 1) we elaborately build a novel dataset, called TAD66K, that contains 66K images covering 47 popular themes, where each image is densely annotated by more than 1,200 people with dedicated theme evaluation criteria. 2) We develop a baseline model, TANet, which can effectively extract theme information and adaptively establish perception rules to evaluate images with different themes. 3) We develop a large-scale benchmark (the most comprehensive thus far) by comparing 17 methods with TANet on three representative datasets: AVA, FLICKR-AES and the proposed TAD66K; TANet achieves state-of-the-art performance on all three datasets. Our work offers the community an opportunity to explore more challenging directions; the code, dataset and supplementary material are available at https://github.com/woshidandan/TANet.

#23 Self-supervised Semantic Segmentation Grounded in Visual Concepts [PDF] [Copy] [Kimi] [REL]

Authors: Wenbin He ; William Surmeier ; Arvind Kumar Shekar ; Liang Gou ; Liu Ren

Unsupervised semantic segmentation requires assigning a label to every pixel without any human annotations. Despite recent advances in self-supervised representation learning for individual images, unsupervised semantic segmentation with pixel-level representations is still a challenging task and remains underexplored. In this work, we propose a self-supervised pixel representation learning method for semantic segmentation by using visual concepts (i.e., groups of pixels with semantic meanings, such as parts, objects, and scenes) extracted from images. To guide self-supervised learning, we leverage three types of relationships between pixels and concepts, including the relationships between pixels and local concepts, local and global concepts, as well as the co-occurrence of concepts. We evaluate the learned pixel embeddings and visual concepts on three datasets, including PASCAL VOC 2012, COCO 2017, and DAVIS 2017. Our results show that the proposed method gains consistent and substantial improvements over recent unsupervised semantic segmentation approaches, and also demonstrate that visual concepts can reveal insights into image datasets.

#24 Semantic Compression Embedding for Generative Zero-Shot Learning [PDF] [Copy] [Kimi] [REL]

Authors: Ziming Hong ; Shiming Chen ; Guo-Sen Xie ; Wenhan Yang ; Jian Zhao ; Yuanjie Shao ; Qinmu Peng ; Xinge You

Generative methods have been successfully applied in zero-shot learning (ZSL) by learning an implicit mapping to alleviate the visual-semantic domain gaps and synthesizing unseen samples to handle the data imbalance between seen and unseen classes. However, existing generative methods simply use visual features extracted by the pre-trained CNN backbone. These visual features lack attribute-level semantic information. Consequently, seen classes are indistinguishable, and the knowledge transfer from seen to unseen classes is limited. To tackle this issue, we propose a novel Semantic Compression Embedding Guided Generation (SC-EGG) model, which cascades a semantic compression embedding network (SCEN) and an embedding guided generative network (EGGN). The SCEN extracts a group of attribute-level local features for each sample and further compresses them into a new low-dimensional visual feature. Thus, a dense-semantic visual space is obtained. The EGGN learns a mapping from the class-level semantic space to the dense-semantic visual space, thus improving the discriminability of the synthesized dense-semantic unseen visual features. Extensive experiments on three benchmark datasets, i.e., CUB, SUN and AWA2, demonstrate the significant performance gains of SC-EGG over current state-of-the-art methods and its baselines.

#25 ScaleFormer: Revisiting the Transformer-based Backbones from a Scale-wise Perspective for Medical Image Segmentation [PDF2] [Copy] [Kimi] [REL]

Authors: Huimin Huang ; Shiao Xie ; Lanfen Lin ; Yutaro Iwamoto ; Xian-Hua Han ; Yen-Wei Chen ; Ruofeng Tong

Recently, a variety of vision transformers have been developed, owing to their capability of modeling long-range dependencies. In current transformer-based backbones for medical image segmentation, convolutional layers are replaced with pure transformers, or transformers are added to the deepest encoder to learn the global context. However, there are mainly two challenges from a scale-wise perspective: (1) intra-scale problem: the existing methods are limited in extracting local-global cues in each scale, which may impact the signal propagation of small objects; (2) inter-scale problem: the existing methods fail to explore distinctive information from multiple scales, which may hinder representation learning for objects with widely variable size, shape and location. To address these limitations, we propose a novel backbone, namely ScaleFormer, with two appealing designs: (1) A scale-wise intra-scale transformer is designed to couple the CNN-based local features with the transformer-based global cues in each scale, where the row-wise and column-wise global dependencies can be extracted by a lightweight Dual-Axis MSA. (2) A simple and effective spatial-aware inter-scale transformer is designed to interact among consensual regions in multiple scales, which can highlight the cross-scale dependency and resolve complex scale variations. Experimental results on different benchmarks demonstrate that our ScaleFormer outperforms the current state-of-the-art methods. The code is publicly available at: https://github.com/ZJUGiveLab/ScaleFormer.
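
A minimal sketch of row-wise plus column-wise (dual-axis) attention, which attends along each row and then each column instead of over all H×W positions; dimensions and the sequential ordering are assumptions.

```python
# Lightweight dual-axis attention sketch: attention along rows, then along columns of a feature map.
import torch
import torch.nn as nn

class DualAxisAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                               # x: (B, C, H, W)
        B, C, H, W = x.shape
        rows = x.permute(0, 2, 3, 1).reshape(B * H, W, C)        # attend along each row
        rows, _ = self.row_attn(rows, rows, rows)
        x = rows.reshape(B, H, W, C)
        cols = x.permute(0, 2, 1, 3).reshape(B * W, H, C)        # attend along each column
        cols, _ = self.col_attn(cols, cols, cols)
        return cols.reshape(B, W, H, C).permute(0, 3, 2, 1)      # back to (B, C, H, W)

print(DualAxisAttention()(torch.randn(2, 64, 16, 16)).shape)     # torch.Size([2, 64, 16, 16])
```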