ECCV.2022 - Oral

Total: 155

#1 InfiniteNature-Zero: Learning Perpetual View Generation of Natural Scenes from Single Images

Authors: Zhengqi Li ; Qianqian Wang ; Noah Snavely ; Angjoo Kanazawa

We present a method for learning to generate unbounded flythrough videos of natural scenes starting from a single view. This capability is learned from a collection of single photographs, without requiring camera poses or even multiple views of each scene. To achieve this, we propose a novel self-supervised view generation training paradigm where we sample and render virtual camera trajectories, including cyclic camera paths, allowing our model to learn stable view generation from a collection of single views. At test time, despite never having seen a video, our approach can take a single image and generate long camera trajectories comprised of hundreds of new views with realistic and diverse content. We compare our approach with recent state-of-the-art supervised view generation methods that require posed multi-view videos and demonstrate superior performance and synthesis quality. Our project webpage, including video results, is at infinite-nature-zero.github.io.

#2 Organic Priors in Non-rigid Structure from Motion

Authors: Suryansh Kumar ; Luc Van Gool

This paper advocates the use of organic priors in classical non-rigid structure from motion (NRSfM). By organic priors, we mean invaluable intermediate prior information intrinsic to the NRSfM matrix factorization theory. It is shown that such priors reside in the factorized matrices, and quite surprisingly, existing methods generally disregard them. The paper’s main contribution is to put forward a simple, methodical, and practical method that can effectively exploit such organic priors to solve NRSfM. The proposed method does not make assumptions other than the popular one on the low-rank shape and offers a reliable solution to NRSfM under orthographic projection. Our work reveals that the accessibility of organic priors is independent of the camera motion and shape deformation type. Besides that, the paper provides insights into the NRSfM factorization---both in terms of shape and motion---and is the first approach to show the benefit of single rotation averaging for NRSfM. Furthermore, we outline how to effectively recover motion and non-rigid 3D shape using the proposed organic prior based approach and demonstrate results that outperform prior-free NRSfM performance by a significant margin. Finally, we present the benefits of our method via extensive experiments and evaluations on several benchmark datasets.

#3 FBNet: Feedback Network for Point Cloud Completion

Authors: Xuejun Yan ; Hongyu Yan ; Jingjing Wang ; Hang Du ; Zhihong Wu ; Di Xie ; Shiliang Pu ; Li Lu

The rapid development of point cloud learning has driven point cloud completion into a new era. However, the information flow in most existing completion methods is solely feedforward, and high-level information is rarely reused to improve low-level feature learning. To this end, we propose a novel Feedback Network (FBNet) for point cloud completion, in which present features are efficiently refined by rerouting subsequent fine-grained ones. Firstly, partial inputs are fed to a Hierarchical Graph-based Network (HGNet) to generate coarse shapes. Then, we cascade several Feedback-Aware Completion (FBAC) Blocks and unfold them recurrently across time. Feedback connections between two adjacent time steps exploit fine-grained features to improve the present shape generation. The main challenge of building feedback connections is the dimension mismatch between present and subsequent features. To address this, an elaborately designed point Cross Transformer extracts useful information from the feedback features via a cross-attention strategy and then refines the present features with the enhanced feedback features. Quantitative and qualitative experiments on several datasets demonstrate the superiority of the proposed FBNet over state-of-the-art methods on the point completion task. The source code and model are available at https://github.com/hikvision-research/3DVision/tree/main/PointCompletion/FBNet.
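
For illustration, the recurrent unfolding described above can be sketched as follows: the same completion block is applied for several time steps, and the refined features of step t-1 are fed back into step t. This is a hedged toy sketch with hypothetical interfaces (FeedbackCompletion, ToyBlock), not the released FBNet code.

```python
import torch
import torch.nn as nn

class FeedbackCompletion(nn.Module):
    """Unfold one completion block across time, feeding its features back to itself."""
    def __init__(self, block, steps=3):
        super().__init__()
        self.block = block      # maps (points, feedback_feats) -> (refined_points, feats)
        self.steps = steps

    def forward(self, coarse_points):
        points, feedback, outputs = coarse_points, None, []
        for _ in range(self.steps):
            points, feedback = self.block(points, feedback)   # reuse fine-grained features
            outputs.append(points)
        return outputs          # supervise every step; the last one is the final completion

class ToyBlock(nn.Module):
    """Stand-in for an FBAC-style block; the cross-attention fusion is replaced by addition."""
    def __init__(self, dim=64):
        super().__init__()
        self.enc, self.dec = nn.Linear(3, dim), nn.Linear(dim, 3)

    def forward(self, points, feedback):
        feats = self.enc(points)
        if feedback is not None:
            feats = feats + feedback                          # naive feedback fusion
        return points + 0.1 * self.dec(feats), feats

outs = FeedbackCompletion(ToyBlock())(torch.randn(2, 2048, 3))
print(len(outs), outs[-1].shape)                              # 3 torch.Size([2, 2048, 3])
```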

#4 Implicit Field Supervision for Robust Non-rigid Shape Matching

Authors: Ramana Sundararaman ; Gautam Pai ; Maks Ovsjanikov

Establishing a correspondence between two non-rigidly deforming shapes is one of the most fundamental problems in visual computing. Existing methods often show weak resilience when presented with challenges innate to real-world data, such as noise, outliers, and self-occlusion. On the other hand, auto-decoders have demonstrated strong expressive power in learning geometrically meaningful latent embeddings. However, their use in shape analysis has been limited. In this paper, we introduce an approach based on an auto-decoder framework that learns a continuous shape-wise deformation field over a fixed template. By supervising the deformation field for points on-surface and regularizing for points off-surface through a novel Signed Distance Regularization (SDR), we learn an alignment between the template and shape volumes. Trained on clean watertight meshes, without any data augmentation, we demonstrate compelling performance on compromised data and real-world scans.

#5 Shape-Pose Disentanglement Using SE(3)-Equivariant Vector Neurons

Authors: Oren Katzir ; Dani Lischinski ; Daniel Cohen-Or

We introduce an unsupervised technique for encoding point clouds into a canonical shape representation by disentangling shape and pose. Our encoder is stable and consistent, meaning that the shape encoding is purely pose-invariant, while the extracted rotation and translation are able to semantically align different input shapes of the same class to a common canonical pose. Specifically, we design an auto-encoder based on Vector Neuron Networks, a rotation-equivariant neural network, whose layers we extend to provide translation-equivariance in addition to rotation-equivariance. The resulting encoder produces a pose-invariant shape encoding by construction, enabling our approach to focus on learning a consistent canonical pose for a class of objects. Quantitative and qualitative experiments validate the superior stability and consistency of our approach.
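
The translation part of the disentanglement can be illustrated with a tiny sketch: centering the point cloud gives a translation-invariant input for a rotation-equivariant encoder, and the subtracted centroid is kept as the translation estimate. This is only the standard centering trick, offered as an assumption about the spirit of the construction; the paper's actual contribution is a layer-level extension of Vector Neurons.

```python
import numpy as np

def factor_out_translation(points):
    """Split a point cloud into a centered cloud (translation-invariant input
    for a rotation-equivariant encoder) and the translation that was removed."""
    translation = points.mean(axis=0)          # (3,) centroid
    return points - translation, translation

# translating the input changes only the recovered translation, not the centered cloud
pts = np.random.rand(1024, 3)
centered_a, t_a = factor_out_translation(pts)
centered_b, t_b = factor_out_translation(pts + np.array([5.0, -2.0, 0.3]))
print(np.allclose(centered_a, centered_b), t_b - t_a)   # True, ~[ 5. -2.  0.3]
```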

#6 Unsupervised Pose-Aware Part Decomposition for Man-Made Articulated Objects

Authors: Yuki Kawana ; Yusuke Mukuta ; Tatsuya Harada

Man-made articulated objects exist widely in the real world. However, previous methods for unsupervised part decomposition are unsuitable for such objects because they assume a spatially fixed part location, resulting in inconsistent part parsing. In this paper, we propose PPD (unsupervised Pose-aware Part Decomposition) to address a novel setting that explicitly targets man-made articulated objects with mechanical joints, considering the part poses in part parsing. As an analysis-by-synthesis approach, we show that category-common prior learning for both part shapes and poses facilitates the unsupervised learning of (1) part parsing with abstracted part shapes, and (2) part poses as joint parameters under single-frame shape supervision. We evaluate our method on synthetic and real datasets, and show that it outperforms previous works in consistent part parsing of articulated objects, with part pose estimation performance comparable to that of the supervised baseline.

#7 CMD: Self-Supervised 3D Action Representation Learning with Cross-Modal Mutual Distillation

Authors: Yunyao Mao ; Wengang Zhou ; Zhenbo Lu ; Jiajun Deng ; Houqiang Li

In 3D action recognition, there exists rich complementary information between skeleton modalities. Nevertheless, how to model and utilize this information remains a challenging problem for self-supervised 3D action representation learning. In this work, we formulate the cross-modal interaction as a bidirectional knowledge distillation problem. Different from classic distillation solutions that transfer the knowledge of a fixed and pre-trained teacher to the student, in this work, the knowledge is continuously updated and bidirectionally distilled between modalities. To this end, we propose a new Cross-modal Mutual Distillation (CMD) framework with the following designs. On the one hand, the neighboring similarity distribution is introduced to model the knowledge learned in each modality, where the relational information is naturally suitable for the contrastive frameworks. On the other hand, asymmetrical configurations are used for teacher and student to stabilize the distillation process and to transfer high-confidence information between modalities. By derivation, we find that the cross-modal positive mining in previous works can be regarded as a degenerated version of our CMD. We perform extensive experiments on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD II datasets. Our approach outperforms existing self-supervised methods and sets a series of new records. The code is available at https://github.com/maoyunyao/CMD
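
A hedged sketch of the mutual-distillation idea: each modality's embedding is compared against a bank of neighbor anchors, the resulting similarity distributions are softened with asymmetric temperatures, and KL terms transfer the relational knowledge in both directions. Function names, temperatures, and the bank construction below are illustrative assumptions, not the released CMD code.

```python
import torch
import torch.nn.functional as F

def neighbor_distribution(z, bank, tau):
    """Similarity of embeddings z (N, D) to neighbor anchors bank (K, D), softened by tau."""
    z = F.normalize(z, dim=1)
    bank = F.normalize(bank, dim=1)
    return F.softmax(z @ bank.t() / tau, dim=1)                       # (N, K)

def mutual_distillation_loss(z_a, z_b, bank, tau_teacher=0.05, tau_student=0.1):
    """Bidirectional KL between the neighboring-similarity distributions of two modalities.
    The asymmetric temperatures (sharper teacher) are an assumption in the spirit of the paper."""
    p_a_t = neighbor_distribution(z_a.detach(), bank, tau_teacher)    # modality A as teacher
    p_b_t = neighbor_distribution(z_b.detach(), bank, tau_teacher)    # modality B as teacher
    p_a_s = neighbor_distribution(z_a, bank, tau_student)             # modality A as student
    p_b_s = neighbor_distribution(z_b, bank, tau_student)             # modality B as student
    loss_ab = F.kl_div(p_b_s.log(), p_a_t, reduction="batchmean")     # distill A -> B
    loss_ba = F.kl_div(p_a_s.log(), p_b_t, reduction="batchmean")     # distill B -> A
    return loss_ab + loss_ba

# toy usage: joint-stream and motion-stream embeddings of the same clips, plus a neighbor bank
z_joint, z_motion = torch.randn(8, 128), torch.randn(8, 128)
bank = torch.randn(1024, 128)
print(mutual_distillation_loss(z_joint, z_motion, bank))
```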

#8 Expanding Language-Image Pretrained Models for General Video Recognition

Authors: Bolin Ni ; Houwen Peng ; Minghao Chen ; Songyang Zhang ; Gaofeng Meng ; Jianlong Fu ; Shiming Xiang ; Haibin Ling

Contrastive language-image pretraining has shown great success in learning visual-textual joint representations from web-scale data, demonstrating remarkable “zero-shot” generalization ability for various image tasks. However, how to effectively expand such new language-image pretraining methods to video domains is still an open problem. In this work, we present a simple yet effective approach that adapts the pretrained language-image models to video recognition directly, instead of pretraining a new model from scratch. More concretely, to capture the long-range dependencies of frames along the temporal dimension, we propose a cross-frame attention mechanism that explicitly exchanges information across frames. This module is lightweight and can be plugged into pretrained language-image models seamlessly. Moreover, we propose a video-specific prompting scheme, which leverages video content information for generating discriminative textual prompts. Extensive experiments demonstrate that our approach is effective and can be generalized to different video recognition scenarios. In particular, under fully-supervised settings, our approach achieves a top-1 accuracy of 87.1% on Kinetics-400, while using 12× fewer FLOPs compared with Swin-L and ViViT-H. In zero-shot experiments, our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols. In few-shot scenarios, our approach outperforms previous best methods by +32.1% and +23.1% when the labeled data is extremely limited. Code and models are available at https://github.com/microsoft/VideoX/tree/master/X-CLIP
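
A simplified sketch of the cross-frame attention idea: one summary token per frame attends to the tokens of all other frames through standard multi-head attention, with a residual connection so the pretrained per-frame features are only lightly adjusted. This is a hedged stand-in, not the exact module in the X-CLIP repository.

```python
import torch
import torch.nn as nn

class CrossFrameAttention(nn.Module):
    """Let per-frame summary tokens exchange information across the temporal axis."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens):
        # frame_tokens: (batch, num_frames, dim), one summary token per frame
        x = self.norm(frame_tokens)
        mixed, _ = self.attn(x, x, x)        # each frame attends to all other frames
        return frame_tokens + mixed          # residual keeps the pretrained features intact

tokens = torch.randn(2, 8, 512)              # 8 frame tokens from a pretrained image encoder
print(CrossFrameAttention()(tokens).shape)   # torch.Size([2, 8, 512])
```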

#9 Delving into Details: Synopsis-to-Detail Networks for Video Recognition

Authors: Shuxian Liang ; Xu Shen ; Jianqiang Huang ; Xian-Sheng Hua

In this paper, we explore the details in video recognition with the aim of improving accuracy. We observe that most failure cases in recent works are misclassifications among very similar actions (such as high kick vs. side kick) that require capturing fine-grained discriminative details. To solve this problem, we propose synopsis-to-detail networks for video action recognition. Firstly, a synopsis network is introduced to predict the top-k likely actions and generate the synopsis (location & scale of details and contextual features). Secondly, according to the synopsis, a detail network is applied to extract the discriminative details in the input and infer the final action prediction. The proposed synopsis-to-detail networks enable us to train models directly from scratch in an end-to-end manner and to investigate various architectures for synopsis/detail recognition. Extensive experiments on benchmark datasets, including Kinetics-400, Mini-Kinetics and Something-Something V1 & V2, show that our method is more effective and efficient than the competitive baselines. Code is available at: https://github.com/liang4sx/S2DNet.
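
A rough sketch of the two-stage inference flow with hypothetical module names (the conditioning of the detail network on the predicted location/scale and context is omitted here): a lightweight synopsis model shortlists the top-k candidate actions, and a detail model rescoring only those candidates produces the final prediction.

```python
import torch

@torch.no_grad()
def synopsis_to_detail_predict(synopsis_net, detail_net, video_lowres, video_highres, k=5):
    """Two-stage inference: coarse top-k shortlist, then fine-grained rescoring."""
    coarse_logits = synopsis_net(video_lowres)               # (B, num_classes)
    _, topk_idx = coarse_logits.topk(k, dim=1)               # shortlist of k candidate actions
    detail_logits = detail_net(video_highres)                # (B, num_classes), fine-grained cues
    shortlist_logits = detail_logits.gather(1, topk_idx)     # rescore only the shortlisted classes
    best = shortlist_logits.argmax(dim=1)
    return topk_idx.gather(1, best.unsqueeze(1)).squeeze(1)  # map back to original class ids

# toy usage with stand-in classifiers over 400 action classes
syn = lambda clip: torch.randn(clip.shape[0], 400)
det = lambda clip: torch.randn(clip.shape[0], 400)
print(synopsis_to_detail_predict(syn, det, torch.randn(2, 8, 3, 112, 112), torch.randn(2, 8, 3, 224, 224)))
```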

#10 PrivHAR: Recognizing Human Actions from Privacy-Preserving Lens

Authors: Carlos Hinojosa ; Miguel Marquez ; Henry Arguello ; Ehsan Adeli ; Li Fei-Fei ; Juan Carlos Niebles

The accelerated use of digital cameras prompts increasing concern about privacy and security, particularly in applications such as action recognition. In this paper, we propose an optimization framework to provide robust visual privacy protection along the human action recognition pipeline. Our framework parameterizes the camera lens to successfully degrade the quality of the videos to inhibit privacy attributes and protect against adversarial attacks while maintaining relevant features for activity recognition. We validate our approach with extensive simulations and hardware experiments.

#11 Frequency Domain Model Augmentation for Adversarial Attack

Authors: Yuyang Long ; Qilong Zhang ; Boheng Zeng ; Lianli Gao ; Xianglong Liu ; Jian Zhang ; Jingkuan Song

For black-box attacks, the gap between the substitute model and the victim model is usually large, which manifests as weak attack performance. Motivated by the observation that the transferability of adversarial examples can be improved by attacking diverse models simultaneously, model augmentation methods that simulate different models by using transformed images have been proposed. However, existing transformations in the spatial domain do not translate into significantly diverse augmented models. To tackle this issue, we propose a novel spectrum simulation attack to craft more transferable adversarial examples against both normally trained and defense models. Specifically, we apply a spectrum transformation to the input and thus perform the model augmentation in the frequency domain. We theoretically prove that the transformation derived from the frequency domain leads to a diverse spectrum saliency map, an indicator we propose to reflect the diversity of substitute models. Notably, our method can be generally combined with existing attacks. Extensive experiments on the ImageNet dataset demonstrate the effectiveness of our method, e.g., attacking nine state-of-the-art defense models with an average success rate of 95.4%. Our code is available at https://github.com/yuyang-long/SSA.
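
A minimal sketch of the frequency-domain augmentation step, assuming a float image array in [0, 1] and SciPy's DCT; the noise level and spectrum-scaling range are illustrative values, and the surrounding gradient-averaging attack loop is omitted (see the authors' repository for the actual method).

```python
import numpy as np
from scipy.fft import dctn, idctn

def spectrum_augment(image, sigma=16.0 / 255.0, rho=0.5, rng=np.random.default_rng(0)):
    """Simulate a different substitute model by perturbing the image in the DCT domain.
    sigma (additive noise) and rho (spectrum scaling range) are illustrative values."""
    noisy = image + rng.normal(0.0, sigma, size=image.shape)      # small spatial noise
    spectrum = dctn(noisy, axes=(0, 1), norm="ortho")             # to the frequency domain
    mask = rng.uniform(1.0 - rho, 1.0 + rho, size=image.shape)    # random per-frequency scaling
    return idctn(spectrum * mask, axes=(0, 1), norm="ortho")      # back to the image domain

# attack gradients would then be averaged over several such augmented copies
img = np.clip(np.random.rand(224, 224, 3), 0, 1)
print(spectrum_augment(img).shape)                                # (224, 224, 3)
```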

#12 Generative Multiplane Images: Making a 2D GAN 3D-Aware

Authors: Xiaoming Zhao ; Fangchang Ma ; David Güera ; Zhile Ren ; Alexander G. Schwing ; Alex Colburn

What is really needed to make an existing 2D GAN 3D-aware? To answer this question, we modify a classical GAN, i.e., StyleGANv2, as little as possible. We find that only two modifications are absolutely necessary: 1) a multiplane image style generator branch which produces a set of alpha maps conditioned on their depth; 2) a pose conditioned discriminator. We refer to the generated output as a ‘generative multiplane image’ (GMPI) and emphasize that its renderings are not only high-quality but also guaranteed to be view-consistent, which makes GMPIs different from many prior works. Importantly, the number of alpha maps can be dynamically adjusted and can differ between training and inference, alleviating memory concerns and enabling fast training of GMPIs in less than half a day at a resolution of 1024^2. Our findings are consistent across three challenging and common high-resolution datasets, including FFHQ, AFHQv2 and MetFaces.
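
The view-consistency claim rests on standard multiplane-image rendering: the generated planes are composited with the fixed "over" operator, so every rendered view uses the same geometry. A hedged numpy sketch of back-to-front compositing (not the GMPI renderer itself, which also warps each plane per view):

```python
import numpy as np

def composite_mpi(colors, alphas):
    """Back-to-front over-compositing of a multiplane image.
    colors: (L, H, W, 3) per-plane RGB, ordered nearest plane first.
    alphas: (L, H, W, 1) per-plane alpha in [0, 1]."""
    out = np.zeros(colors.shape[1:])
    for rgb, a in zip(colors[::-1], alphas[::-1]):   # start from the farthest plane
        out = a * rgb + (1.0 - a) * out              # standard "over" operator
    return out

layers, h, w = 32, 64, 64
rgb = np.random.rand(layers, h, w, 3)
alpha = np.random.rand(layers, h, w, 1)
print(composite_mpi(rgb, alpha).shape)               # (64, 64, 3)
```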

#13 UIA-ViT: Unsupervised Inconsistency-Aware Method Based on Vision Transformer for Face Forgery Detection

Authors: Wanyi Zhuang ; Qi Chu ; Zhentao Tan ; Qiankun Liu ; Haojie Yuan ; Changtao Miao ; Zixiang Luo ; Nenghai Yu

Intra-frame inconsistency has been proven to be effective for the generalization of face forgery detection. However, learning to focus on these inconsistencies requires extra pixel-level forged-location annotations. Acquiring such annotations is non-trivial. Some existing methods generate large-scale synthesized data with location annotations, which is time-consuming. Others generate forgery location labels by subtracting paired real and fake images, yet such paired data is difficult to collect and the generated labels are usually discontinuous. To overcome these limitations, we propose a novel Unsupervised Inconsistency-Aware method based on the Vision Transformer, called UIA-ViT. Thanks to the self-attention mechanism, the attention map among patch embeddings naturally represents the consistency relation, making the Vision Transformer suitable for consistency representation learning. Specifically, we propose two key components: Unsupervised Patch Consistency Learning (UPCL) and Progressive Consistency Weighted Assemble (PCWA). UPCL is designed for learning the consistency-related representation with progressively optimized pseudo annotations derived from multivariate Gaussian estimation. PCWA enhances the final classification embedding with previous patch embeddings optimized by UPCL to further improve the detection performance. Extensive experiments demonstrate the effectiveness of the proposed method.

#14 D&D: Learning Human Dynamics from Dynamic Camera

Authors: Jiefeng Li ; Siyuan Bian ; Chao Xu ; Gang Liu ; Gang Yu ; Cewu Lu

3D human pose estimation from a monocular video has recently seen significant improvements. However, most state-of-the-art methods are kinematics-based, which are prone to physically implausible motions with pronounced artifacts. Current dynamics-based methods can predict physically plausible motion but are restricted to simple scenarios with a static camera view. In this work, we present D&D (Learning Human Dynamics from Dynamic Camera), which leverages the laws of physics to reconstruct 3D human motion from in-the-wild videos with a moving camera. D&D introduces inertial force control (IFC) to explain the 3D human motion in the non-inertial local frame by considering the inertial forces of the dynamic camera. To learn the ground contact with limited annotations, we develop probabilistic contact torque (PCT), which is computed by differentiable sampling from contact probabilities and used to generate motions. The contact state can be weakly supervised by encouraging the model to generate correct motions. Furthermore, we propose an attentive PD controller that adjusts target pose states using temporal information to obtain smooth and accurate pose control. Our approach is entirely neural-based and runs without offline optimization or simulation in physics engines. Experiments on large-scale 3D human motion benchmarks demonstrate the effectiveness of D&D, where we exhibit superior performance against both state-of-the-art kinematics-based and dynamics-based methods. Code is available at https://github.com/Jeff-sjtu/DnD

#15 CLIFF: Carrying Location Information in Full Frames into Human Pose and Shape Estimation

Authors: Zhihao Li ; Jianzhuang Liu ; Zhensong Zhang ; Songcen Xu ; Youliang Yan

Top-down methods dominate the field of 3D human pose and shape estimation because they are decoupled from human detection and allow researchers to focus on the core problem. However, cropping, their first step, discards the location information from the very beginning, which makes them unable to accurately predict the global rotation in the original camera coordinate system. To address this problem, we propose to Carry Location Information in Full Frames (CLIFF) into this task. Specifically, we feed more holistic features to CLIFF by concatenating the cropped-image feature with its bounding box information. We calculate the 2D reprojection loss with a broader view of the full frame, using a projection process similar to the one that projects the person into the image. Fed and supervised by global-location-aware information, CLIFF directly predicts the global rotation along with more accurate articulated poses. Besides, we propose a pseudo-ground-truth annotator based on CLIFF, which provides high-quality 3D annotations for in-the-wild 2D datasets and offers crucial full supervision for regression-based methods. Extensive experiments on popular benchmarks show that CLIFF outperforms prior arts by a significant margin, and reaches the first place on the AGORA leaderboard (the SMPL-Algorithms track).
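
A hedged sketch of the conditioning step: the cropped-image feature is concatenated with a normalized encoding of the crop's location and size in the full frame before regression. The exact encoding (center offset and box size divided by focal length) and the output dimensionality below are assumptions in the spirit of the paper, not the released code.

```python
import torch
import torch.nn as nn

class BBoxConditionedHead(nn.Module):
    """Regress pose/shape/camera parameters from a crop feature plus full-frame box information."""
    def __init__(self, feat_dim=2048, out_dim=85):        # out_dim is an illustrative placeholder
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, 1024), nn.ReLU(),
            nn.Linear(1024, out_dim),
        )

    def forward(self, crop_feat, bbox_center, bbox_size, full_img_center, focal_length):
        # encode where the crop sits in the full frame, normalized by the focal length
        offset = (bbox_center - full_img_center) / focal_length.unsqueeze(1)   # (B, 2)
        scale = (bbox_size / focal_length).unsqueeze(1)                        # (B, 1)
        bbox_info = torch.cat([offset, scale], dim=1)                          # (B, 3)
        return self.mlp(torch.cat([crop_feat, bbox_info], dim=1))

head = BBoxConditionedHead()
out = head(torch.randn(4, 2048), torch.rand(4, 2) * 1000, torch.rand(4) * 300,
           torch.full((4, 2), 500.0), torch.full((4,), 1000.0))
print(out.shape)   # torch.Size([4, 85])
```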

#16 SimCC: A Simple Coordinate Classification Perspective for Human Pose Estimation

Authors: Yanjie Li ; Sen Yang ; Peidong Liu ; Shoukui Zhang ; Yunxiao Wang ; Zhicheng Wang ; Wankou Yang ; Shu-Tao Xia

The 2D heatmap-based approaches have dominated Human Pose Estimation (HPE) for years due to their high performance. However, the long-standing quantization error problem in 2D heatmap-based methods leads to several well-known drawbacks: 1) the performance for low-resolution inputs is limited; 2) multiple costly upsampling layers are required to improve the feature map resolution for higher localization precision; 3) extra post-processing is adopted to reduce the quantization error. To address these issues, we explore a brand new scheme, called SimCC, which reformulates HPE as two classification tasks for the horizontal and vertical coordinates. The proposed SimCC uniformly divides each pixel into several bins, thus achieving sub-pixel localization precision and low quantization error. Benefiting from that, SimCC can omit additional refinement post-processing and exclude upsampling layers under certain settings, resulting in a simpler and more effective pipeline for HPE. Extensive experiments conducted on the COCO, CrowdPose, and MPII datasets show that SimCC outperforms heatmap-based counterparts by a large margin, especially in low-resolution settings. Code is now publicly available at https://github.com/leeyegy/SimCC.
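
A minimal sketch of the coordinate-classification view with hypothetical layer names: each keypoint gets two classifiers, one over horizontal bins and one over vertical bins, where the number of bins exceeds the input resolution by a splitting factor so that the argmax bin already carries sub-pixel precision.

```python
import torch
import torch.nn as nn

class SimCCStyleHead(nn.Module):
    """Per-keypoint classification over x-bins and y-bins instead of a 2D heatmap."""
    def __init__(self, feat_dim, num_joints, img_w, img_h, split_ratio=2.0):
        super().__init__()
        self.split_ratio = split_ratio
        self.x_cls = nn.Linear(feat_dim, int(img_w * split_ratio))   # horizontal coordinate bins
        self.y_cls = nn.Linear(feat_dim, int(img_h * split_ratio))   # vertical coordinate bins
        self.num_joints = num_joints

    def forward(self, joint_feats):
        # joint_feats: (B, num_joints, feat_dim), one feature vector per keypoint
        x_logits = self.x_cls(joint_feats)                 # (B, J, W * split_ratio)
        y_logits = self.y_cls(joint_feats)                 # (B, J, H * split_ratio)
        # decoding: the argmax bin index divided by the split ratio gives sub-pixel coordinates
        x = x_logits.argmax(dim=-1).float() / self.split_ratio
        y = y_logits.argmax(dim=-1).float() / self.split_ratio
        return x_logits, y_logits, torch.stack([x, y], dim=-1)

head = SimCCStyleHead(feat_dim=256, num_joints=17, img_w=192, img_h=256)
x_logits, y_logits, coords = head(torch.randn(2, 17, 256))
print(coords.shape)   # torch.Size([2, 17, 2])
```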

#17 Grasp’D: Differentiable Contact-Rich Grasp Synthesis for Multi-Fingered Hands

Authors: Dylan Turpin ; Liquan Wang ; Eric Heiden ; Yun-Chun Chen ; Miles Macklin ; Stavros Tsogkas ; Sven Dickinson ; Animesh Garg

The study of hand-object interaction requires generating viable grasp poses for high-dimensional multi-finger models, often relying on analytic grasp synthesis, which tends to produce brittle and unnatural results. This paper presents Grasp’D, an approach to grasp synthesis by differentiable contact simulation that can work with both known models and visual inputs. We use gradient-based methods as an alternative to sampling-based grasp synthesis, which fails without simplifying assumptions such as pre-specified contact locations and eigengrasps. Such assumptions limit grasp discovery and, in particular, exclude high-contact power grasps. In contrast, our simulation-based approach allows for stable, efficient, physically realistic, high-contact grasp synthesis, even for gripper morphologies with high degrees of freedom. We identify and address challenges in making grasp simulation amenable to gradient-based optimization, such as non-smooth object surface geometry, contact sparsity, and a rugged optimization landscape. Grasp’D compares favorably to analytic grasp synthesis on human and robotic hand models, and the resulting grasps achieve over 4× denser contact, leading to significantly higher grasp stability. Video and code available at: graspd-eccv22.github.io.

#18 Deep Radial Embedding for Visual Sequence Learning

Authors: Yuecong Min ; Peiqi Jiao ; Yanan Li ; Xiaotao Wang ; Lei Lei ; Xiujuan Chai ; Xilin Chen

Connectionist Temporal Classification (CTC) is a popular objective function in sequence recognition, which provides supervision for unsegmented sequence data by iteratively aligning the sequence with its corresponding labeling. The blank class of CTC plays a crucial role in the alignment process and is often considered responsible for the peaky behavior of CTC. In this study, we propose an objective function named RadialCTC that constrains sequence features on a hypersphere while retaining the iterative alignment mechanism of CTC. The learned features of each non-blank class are distributed on a radial arc from the center of the blank class, which provides a clear geometric interpretation and makes the alignment process more efficient. Besides, RadialCTC can control the peaky behavior by simply modifying the logit of the blank class. Experimental results on recognition and localization demonstrate the effectiveness of RadialCTC on two sequence recognition applications.
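
A hedged sketch of the two ingredients highlighted above, wired into PyTorch's standard CTC loss: frame features are normalized onto a hypersphere and classified by cosine similarity to class centers, and an additive offset on the blank logit is exposed as the knob that trades peaky against smooth alignments. The scaling and offset values are illustrative, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def spherical_ctc_loss(features, class_centers, targets, input_lens, target_lens,
                       scale=16.0, blank_offset=0.0, blank=0):
    """CTC on hypersphere-normalized features; blank_offset nudges the blank logit
    to control how peaky the alignment becomes (illustrative knob)."""
    feats = F.normalize(features, dim=-1)                   # (T, N, D) on the unit sphere
    centers = F.normalize(class_centers, dim=-1)            # (C, D), one center per class
    logits = scale * feats @ centers.t()                    # cosine similarities as logits
    logits[..., blank] = logits[..., blank] + blank_offset  # explicit control of the blank class
    log_probs = F.log_softmax(logits, dim=-1)
    return F.ctc_loss(log_probs, targets, input_lens, target_lens, blank=blank)

T, N, D, C = 50, 4, 128, 20
loss = spherical_ctc_loss(
    torch.randn(T, N, D), torch.randn(C, D),
    targets=torch.randint(1, C, (N, 10)),                   # label ids exclude the blank class
    input_lens=torch.full((N,), T, dtype=torch.long),
    target_lens=torch.full((N,), 10, dtype=torch.long))
print(loss)
```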

#19 PressureVision: Estimating Hand Pressure from a Single RGB Image

Authors: Patrick Grady ; Chengcheng Tang ; Samarth Brahmbhatt ; Christopher D. Twigg ; Chengde Wan ; James Hays ; Charles C. Kemp

People often interact with their surroundings by applying pressure with their hands. While hand pressure can be measured by placing pressure sensors between the hand and the environment, doing so can alter contact mechanics, interfere with human tactile perception, require costly sensors, and scale poorly to large environments. We explore the possibility of using a conventional RGB camera to infer hand pressure, enabling machine perception of hand pressure from uninstrumented hands and surfaces. The central insight is that the application of pressure by a hand results in informative appearance changes. Hands share biomechanical properties that result in similar observable phenomena, such as soft-tissue deformation, blood distribution, hand pose, and cast shadows. We collected videos of 36 participants with diverse skin tones applying pressure to an instrumented planar surface. We then trained a deep model (PressureVisionNet) to infer a pressure image from a single RGB image. Our model infers pressure for participants outside of the training data and outperforms baselines. We also show that the output of our model depends on the appearance of the hand and cast shadows near contact regions. Overall, our results suggest the appearance of a previously unobserved human hand can be used to accurately infer applied pressure. Data, code, and models are available online.

#20 Pose for Everything: Towards Category-Agnostic Pose Estimation

Authors: Lumin Xu ; Sheng Jin ; Wang Zeng ; Wentao Liu ; Chen Qian ; Wanli Ouyang ; Ping Luo ; Xiaogang Wang

Existing works on 2D pose estimation mainly focus on a certain category, e.g. humans, animals, and vehicles. However, there are many application scenarios that require detecting the poses/keypoints of unseen classes of objects. In this paper, we introduce the task of Category-Agnostic Pose Estimation (CAPE), which aims to create a pose estimation model capable of detecting the pose of any class of object given only a few samples with keypoint definitions. To achieve this goal, we formulate the pose estimation problem as a keypoint matching problem and design a novel CAPE framework, termed POse Matching Network (POMNet). A transformer-based Keypoint Interaction Module (KIM) is proposed to capture both the interactions among different keypoints and the relationship between the support and query images. We also introduce the Multi-category Pose (MP-100) dataset, a 2D pose dataset of 100 object categories containing over 20K instances that is well-designed for developing CAPE algorithms. Experiments show that our method outperforms other baseline approaches by a large margin. Code and data are available at https://github.com/luminxu/Pose-for-Everything.

#21 Estimating Spatially-Varying Lighting in Urban Scenes with Disentangled Representation

Authors: Jiajun Tang ; Yongjie Zhu ; Haoyu Wang ; Jun Hoong Chan ; Si Li ; Boxin Shi

We present an end-to-end network for spatially-varying outdoor lighting estimation in urban scenes given a single limited field-of-view LDR image and any assigned 2D pixel position. We use three disentangled latent spaces learned by our network to represent sky light, sun light, and lighting-independent local contents respectively. At inference time, our lighting estimation network can run efficiently in an end-to-end manner by merging the global lighting and the local appearance rendered by the local appearance renderer with the predicted local silhouette. We enhance an existing synthetic dataset with more realistic material models and diverse lighting conditions for more effective training. We also capture the first real dataset with HDR labels for evaluating spatially-varying outdoor lighting estimation. Experiments on both synthetic and real datasets show that our method achieves state-of-the-art performance with more flexible editability.

#22 Practical and Scalable Desktop-Based High-Quality Facial Capture

Authors: Alexandros Lattas ; Yiming Lin ; Jayanth Kannan ; Ekin Ozturk ; Luca Filipi ; Giuseppe Claudio Guarnera ; Gaurav Chawla ; Abhijeet Ghosh

We present a novel desktop-based system for high-quality facial capture including geometry and facial appearance. The proposed acquisition system is highly practical and scalable, consisting purely of commodity components. The setup consists of a set of displays for controlled illumination for reflectance capture, in conjunction with multiview acquisition of facial geometry. We additionally present a novel set of binary illumination patterns for efficient acquisition of reflectance and photometric normals using our setup, with diffuse-specular separation. We demonstrate high-quality results with two different variants of the capture setup - one entirely consisting of portable mobile devices targeting static facial capture, and the other consisting of desktop LCD displays targeting both static and dynamic facial capture.

#23 Physically-Based Editing of Indoor Scene Lighting from a Single Image

Authors: Zhengqin Li ; Jia Shi ; Sai Bi ; Rui Zhu ; Kalyan Sunkavalli ; Miloš Hašan ; Zexiang Xu ; Ravi Ramamoorthi ; Manmohan Chandraker

We present a method to edit complex indoor lighting from a single image with its predicted depth and light source segmentation masks. This is an extremely challenging problem that requires modeling complex light transport, and disentangling HDR lighting from material and geometry with only a partial LDR observation of the scene. We tackle this problem using two novel components: 1) a holistic scene reconstruction method that estimates scene reflectance and parametric 3D lighting, and 2) a neural rendering framework that re-renders the scene from our predictions. We use physically-based indoor light representations that allow for intuitive editing, and infer both visible and invisible light sources. Our neural rendering framework combines physically-based direct illumination and shadow rendering with deep networks to approximate global illumination. It can capture challenging lighting effects, such as soft shadows, directional lighting, specular materials, and interreflections. Previous single image inverse rendering methods usually entangle scene lighting and geometry and only support applications like object insertion. Instead, by combining parametric 3D lighting estimation with neural scene rendering, we demonstrate the first automatic method to achieve full scene relighting, including light source insertion, removal, and replacement, from a single image. All source code and data will be publicly released.

#24 CT2: Colorization Transformer via Color Tokens

Authors: Shuchen Weng ; Jimeng Sun ; Yu Li ; Si Li ; Boxin Shi

Automatic image colorization is an ill-posed problem with multi-modal uncertainty, and two main challenges remain with previous methods: incorrect semantic colors and under-saturation. In this paper, we propose an end-to-end transformer-based model to overcome these challenges. Benefiting from the long-range context extraction of the transformer and our holistic architecture, our method can colorize images with more diverse colors. Besides, we introduce color tokens into our approach and treat the colorization task as a classification problem, which increases the saturation of results. We also propose a series of modules that make image features interact with color tokens and restrict the range of possible color candidates, which makes our results visually pleasing and reasonable. In addition, our method does not require any additional external priors, which ensures good generalization capability. Extensive experiments and user studies demonstrate that our method achieves superior performance compared with previous works.

#25 Synthesizing Light Field Video from Monocular Video

Authors: Shrisudhan Govindarajan ; Prasan Shedligeri ; Sarah ; Kaushik Mitra

The hardware challenges associated with light-field (LF) imaging have made it difficult for consumers to access its benefits, such as applications in post-capture focus and aperture control. Learning-based techniques which solve the ill-posed problem of LF reconstruction from sparse (1, 2 or 4) views have significantly reduced the requirement for complex hardware. LF video reconstruction from sparse views poses a special challenge, as acquiring ground truth for training these models is hard. Hence, we propose a self-supervised learning-based algorithm for LF video reconstruction from monocular videos. We use self-supervised geometric, photometric and temporal consistency constraints inspired by a recent self-supervised technique for LF video reconstruction from stereo video. Additionally, we propose three key techniques that are relevant to our monocular video input. We propose an explicit disocclusion handling technique that encourages the network to inpaint disoccluded regions in an LF frame, using information from adjacent input temporal frames. This is crucial for a self-supervised technique, as a single input frame does not contain any information about the disoccluded regions. We also propose an adaptive low-rank representation that provides a significant boost in performance by tailoring the representation to each input scene. Finally, we propose a novel refinement block that is able to exploit the available LF image data using supervised learning to further refine the reconstruction quality. Our qualitative and quantitative analysis demonstrates the significance of each of the proposed building blocks and also the superior results compared to previous state-of-the-art monocular LF reconstruction techniques. We further validate our algorithm by reconstructing LF videos from monocular videos acquired using a commercial GoPro camera.