CVPR.2024 - Accept

Total: 2302

#1 Efficient Vision-Language Pre-training by Cluster Masking

Authors: Zihao Wei ; Zixuan Pan ; Andrew Owens

The quest for optimal vision-language pretraining strategies has led to the exploration of masking techniques as a way to enhance data efficiency. Previous approaches include random masking and semantic masking, the latter requiring the retention or exclusion of patches in areas with similar semantics. Despite its effectiveness, semantic masking often needs an additional, complex model for identifying semantically related patches, increasing computational demands. Unlike other approaches that rely on text supervision, our method utilizes clusters that emerge naturally within images. We employ random clusters of image patches for masking, using the raw RGB values of patches as the feature representation. This method capitalizes on the observation that basic visual similarity measures can effectively identify coherent visual structures, such as parts of objects. Our approach, therefore, combines the computational efficiency of random patch dropping with the enhanced performance achieved through masking coherent visual structures.
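
The clustering cue described above is simple enough to sketch. Below is a minimal, hypothetical illustration (not the authors' code) of masking whole clusters of patches grouped by their raw RGB values; the function name `cluster_mask`, the use of scikit-learn's KMeans, and all hyperparameters are assumptions for illustration.

```python
# Minimal sketch of RGB-based cluster masking (illustrative only; not the authors' code).
# Patches are clustered by their raw RGB values, then whole clusters are dropped at random.
import numpy as np
from sklearn.cluster import KMeans

def cluster_mask(image: np.ndarray, patch: int = 16, n_clusters: int = 8,
                 drop_ratio: float = 0.5, seed: int = 0) -> np.ndarray:
    """Return a boolean mask over patches; True = patch is masked (dropped)."""
    H, W, _ = image.shape
    gh, gw = H // patch, W // patch
    # Flatten each patch into a raw-RGB feature vector (the cheap similarity cue).
    patches = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, 3)
    feats = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, -1).astype(np.float32)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(feats)
    # Drop entire clusters at random until roughly drop_ratio of patches are masked.
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_clusters)
    mask = np.zeros(gh * gw, dtype=bool)
    for c in order:
        if mask.mean() >= drop_ratio:
            break
        mask |= labels == c
    return mask.reshape(gh, gw)

# Example: mask patches of a random 224x224 "image".
mask = cluster_mask(np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8))
print(mask.shape, mask.mean())  # (14, 14), roughly the requested drop ratio
```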

#2 MicroDiffusion: Implicit Representation-Guided Diffusion for 3D Reconstruction from Limited 2D Microscopy Projections

Authors: Mude Hui ; Zihao Wei ; Hongru Zhu ; Fei Xia ; Yuyin Zhou

Volumetric optical microscopy using non-diffracting beams offers rapid imaging of 3D volumes by projecting them axially to 2D images, but it often falls short in providing depth information. To address this limitation, we introduce MicroDiffusion, a pioneering tool designed for high-quality, depth-resolved 3D volume reconstruction from a limited set of 2D microscopy projections. Existing 3D reconstruction methods, such as Implicit Neural Representation (INR) models, often result in incomplete and noisy outputs. In contrast, Denoising Diffusion Probabilistic Models (DDPM) excel at capturing fine-grained details. Our method merges INR's ability to maintain structural 3D coherence with DDPM's proficiency in enhancing details. Initially, we pretrain an INR model that transforms the 2D axially-projected images into a preliminary 3D volume. Then, the pretrained INR serves as a global prior, directing DDPM's generative process through linear interpolation between INR outputs and noise inputs. This strategy effectively enriches the diffusion process with structured 3D information while simultaneously enhancing detail and minimizing noise in localized 2D images. Furthermore, by conditioning the diffusion model on the closest 2D image, MicroDiffusion substantially enhances the fidelity of the resulting 3D reconstructions. MicroDiffusion enables depth-resolved volumetric microscopy by delivering high-quality 3D reconstructions that are sharper than those produced by INR models and more coherent than standard DDPM outputs. Extensive results from three microscopy datasets demonstrate MicroDiffusion's superiority in producing 3D reconstructions with enhanced image quality, structural coherence, and fidelity, compared to traditional INR and diffusion models.
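
The interpolation-based guidance can be illustrated with a toy snippet. This is only a hedged sketch of the idea stated in the abstract (blend a pretrained INR's coarse volume with Gaussian noise before running the diffusion model's reverse process); `inr_guided_start` and the blending weight `alpha` are hypothetical names and values.

```python
# Illustrative sketch (assumptions, not the authors' code): use a pretrained INR volume
# as a global prior by linearly interpolating it with noise before DDPM denoising.
import torch

def inr_guided_start(inr_volume: torch.Tensor, alpha: float = 0.7) -> torch.Tensor:
    """Blend the INR prediction with Gaussian noise; alpha controls prior strength."""
    noise = torch.randn_like(inr_volume)
    return alpha * inr_volume + (1.0 - alpha) * noise

# Hypothetical usage: a coarse 3D volume predicted by the INR, blended with noise,
# would then be handed to the diffusion model's reverse process.
coarse = torch.rand(1, 1, 64, 64, 64)      # placeholder INR output
x_start = inr_guided_start(coarse, alpha=0.7)
print(x_start.shape)
```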

#3 Sparse Semi-DETR: Sparse Learnable Queries for Semi-Supervised Object Detection

Authors: Tahira Shehzadi ; Khurram Azeem Hashmi ; Didier Stricker ; Muhammad Zeshan Afzal

In this paper, we address the limitations of the DETR-based semi-supervised object detection (SSOD) framework, particularly focusing on the challenges posed by the quality of object queries. In DETR-based SSOD, the one-to-one assignment strategy provides inaccurate pseudo-labels, while the one-to-many assignment strategy leads to overlapping predictions. These issues compromise training efficiency and degrade model performance, especially in detecting small or occluded objects. We introduce Sparse Semi-DETR, a novel transformer-based, end-to-end semi-supervised object detection solution to overcome these challenges. Sparse Semi-DETR incorporates a Query Refinement Module to enhance the quality of object queries, significantly improving detection capabilities for small and partially obscured objects. Additionally, we integrate a Reliable Pseudo-Label Filtering Module that selectively filters high-quality pseudo-labels, thereby enhancing detection accuracy and consistency. On the MS-COCO and Pascal VOC object detection benchmarks, Sparse Semi-DETR achieves a significant improvement over current state-of-the-art methods, highlighting its effectiveness in semi-supervised object detection, particularly in challenging scenarios involving small or partially obscured objects.

#4 Normalizing Flows on the Product Space of SO(3) Manifolds for Probabilistic Human Pose Modeling

Authors: Olaf Dünkel ; Tim Salzmann ; Florian Pfaff

Normalizing flows have proven their efficacy for density estimation in Euclidean space, but their application to rotational representations, crucial in various domains such as robotics or human pose modeling, remains underexplored. Probabilistic models of the human pose can benefit from approaches that rigorously consider the rotational nature of human joints. For this purpose, we introduce HuProSO3, a normalizing flow model that operates on a high-dimensional product space of SO(3) manifolds, modeling the joint distribution for human joints with three degrees of freedom. HuProSO3's advantage over state-of-the-art approaches is demonstrated through its superior modeling accuracy in three different applications. This work not only addresses the technical challenge of learning densities on SO(3) manifolds, but it also has broader implications for domains where the probabilistic regression of correlated 3D rotations is of importance.

#5 Label Propagation for Zero-shot Classification with Vision-Language Models

Authors: Vladan Stojnić ; Yannis Kalantidis ; Giorgos Tolias

Vision-Language Models (VLMs) have demonstrated impressive performance on zero-shot classification, i.e. classification when provided merely with a list of class names. In this paper, we tackle the case of zero-shot classification in the presence of unlabeled data. We leverage the graph structure of the unlabeled data and introduce ZLaP, a method based on label propagation (LP) that utilizes geodesic distances for classification. We tailor LP to graphs containing both text and image features and further propose an efficient method for performing inductive inference based on a dual solution and a sparsification step. We perform extensive experiments to evaluate the effectiveness of our method on 14 common datasets and show that ZLaP outperforms the latest related works. Code: https://github.com/vladan-stojnic/ZLaP
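
For readers unfamiliar with label propagation, the snippet below shows standard Zhou-style propagation on a k-NN graph built over joint text and image features. It is a generic textbook sketch under our own assumptions, not the ZLaP algorithm itself (which adds geodesic distances, a dual solution, and sparsification); see the linked repository for the real implementation.

```python
# Generic label-propagation sketch over a joint text+image feature graph (illustrative
# assumptions, not the ZLaP implementation).
import numpy as np

def label_propagation(feats, seed_labels, n_classes, k=10, alpha=0.99, iters=30):
    """feats: (N, D) L2-normalized features; seed_labels: (N,) int, -1 for unlabeled."""
    N = feats.shape[0]
    sims = feats @ feats.T
    np.fill_diagonal(sims, -np.inf)
    # Sparse symmetric k-NN affinity matrix.
    W = np.zeros((N, N))
    knn = np.argpartition(-sims, k, axis=1)[:, :k]
    rows = np.repeat(np.arange(N), k)
    W[rows, knn.ravel()] = np.clip(sims[rows, knn.ravel()], 0, None)
    W = np.maximum(W, W.T)
    D = np.clip(W.sum(1), 1e-12, None)
    S = W / np.sqrt(D[:, None] * D[None, :])          # normalized adjacency
    Y = np.zeros((N, n_classes))
    labeled = seed_labels >= 0
    Y[labeled, seed_labels[labeled]] = 1.0
    F = Y.copy()
    for _ in range(iters):                            # F <- alpha*S*F + (1-alpha)*Y
        F = alpha * (S @ F) + (1 - alpha) * Y
    return F.argmax(1)

# Toy usage: 3 class (text) nodes with known labels plus 100 unlabeled image nodes.
rng = np.random.default_rng(0)
X = rng.normal(size=(103, 32)); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.full(103, -1); y[:3] = [0, 1, 2]
print(label_propagation(X, y, n_classes=3)[:10])
```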

#6 LTM: Lightweight Textured Mesh Extraction and Refinement of Large Unbounded Scenes for Efficient Storage and Real-time Rendering

Authors: Jaehoon Choi ; Rajvi Shah ; Qinbo Li ; Yipeng Wang ; Ayush Saraf ; Changil Kim ; Jia-Bin Huang ; Dinesh Manocha ; Suhib Alsisan ; Johannes Kopf

Advancements in neural signed distance fields (SDFs) have enabled modeling 3D surface geometry from a set of 2D images of real-world scenes. Baking neural SDFs can extract explicit meshes with appearance baked into texture maps as neural features. The baked meshes still have a large memory footprint and require a powerful GPU for real-time rendering. Neural optimization of such large meshes with differentiable rendering poses significant challenges. We propose a method to produce optimized meshes for large unbounded scenes with a low triangle budget and high fidelity of geometry and appearance. We achieve this by combining advancements in baking neural SDFs with classical mesh simplification techniques and proposing a joint appearance-geometry refinement step. The visual quality is comparable to or better than state-of-the-art neural meshing and baking methods with high geometric accuracy despite a significant reduction in triangle count, making the produced meshes efficient for storage, transmission, and rendering on mobile hardware. We validate the effectiveness of the proposed method on large unbounded scenes from the mip-NeRF 360, Tanks & Temples, and Deep Blending datasets, achieving on-par rendering quality with 73× fewer triangles and an 11× reduction in memory footprint.

#7 MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric

Authors: Haokun Lin ; Haoli Bai ; Zhili Liu ; Lu Hou ; Muyi Sun ; Linqi Song ; Ying Wei ; Zhenan Sun

Vision-language pre-trained models have achieved impressive performance on various downstream tasks. However, their large model sizes hinder their utilization on platforms with limited computational resources. We find that directly using smaller pre-trained models and applying magnitude-based pruning on CLIP models leads to inflexibility and inferior performance. Recent efforts for VLP compression either adopt uni-modal compression metrics resulting in limited performance or involve costly mask-search processes with learnable masks. In this paper, we first propose the Module-wise Pruning Error (MoPE) metric, accurately assessing CLIP module importance by performance decline on cross-modal tasks. Using the MoPE metric, we introduce a unified pruning framework applicable to both pre-training and task-specific fine-tuning compression stages. For pre-training, MoPE-CLIP effectively leverages knowledge from the teacher model, significantly reducing pre-training costs while maintaining strong zero-shot capabilities. For fine-tuning, consecutive pruning from width to depth yields highly competitive task-specific models. Extensive experiments in two stages demonstrate the effectiveness of the MoPE metric, and MoPE-CLIP outperforms previous state-of-the-art VLP compression methods.
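
As a rough illustration of what a module-wise pruning-error style metric could look like, the sketch below ablates each candidate module and records the drop in a proxy score. This is our own toy reading of the abstract, not the MoPE-CLIP implementation; `module_pruning_error`, the identity-replacement trick, and the `evaluate` callback are all assumptions.

```python
# Hedged sketch of a module-wise pruning-error style metric: importance of a module is
# approximated by the drop in a cross-modal proxy score when that module is ablated.
import copy
import torch.nn as nn

def module_pruning_error(model: nn.Module, module_names, evaluate) -> dict:
    """evaluate(model) -> float proxy score (e.g., zero-shot retrieval accuracy)."""
    base = evaluate(model)
    errors = {}
    for name in module_names:
        pruned = copy.deepcopy(model)
        # Replace the candidate module with an identity mapping to simulate pruning it away.
        parent = pruned
        *path, leaf = name.split(".")
        for p in path:
            parent = getattr(parent, p)
        setattr(parent, leaf, nn.Identity())
        errors[name] = base - evaluate(pruned)  # larger drop => more "important" module
    return errors

# Toy usage with a dummy two-block model and a fake evaluation score.
toy = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 4))
score = lambda m: float(sum(p.abs().sum() for p in m.parameters()))
print(module_pruning_error(toy, ["0", "2"], evaluate=score))
```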

#8 HEAL-SWIN: A Vision Transformer On The Sphere

Authors: Oscar Carlsson ; Jan E. Gerken ; Hampus Linander ; Heiner Spiess ; Fredrik Ohlsson ; Christoffer Petersson ; Daniel Persson

High-resolution wide-angle fisheye images are becoming more and more important for robotics applications such as autonomous driving. However, using ordinary convolutional neural networks or vision transformers on this data is problematic due to projection and distortion losses introduced when projecting to a rectangular grid on the plane. We introduce the HEAL-SWIN transformer, which combines the highly uniform Hierarchical Equal Area iso-Latitude Pixelation (HEALPix) grid used in astrophysics and cosmology with the Hierarchical Shifted-Window (SWIN) transformer to yield an efficient and flexible model capable of training on high-resolution, distortion-free spherical data. In HEAL-SWIN, the nested structure of the HEALPix grid is used to perform the patching and windowing operations of the SWIN transformer, enabling the network to process spherical representations with minimal computational overhead. We demonstrate the superior performance of our model on both synthetic and real automotive datasets, as well as a selection of other image datasets, for semantic segmentation, depth regression and classification tasks. Our code will be made available.

#9 Loopy-SLAM: Dense Neural SLAM with Loop Closures

Authors: Lorenzo Liso ; Erik Sandström ; Vladimir Yugay ; Luc Van Gool ; Martin R. Oswald

Neural RGBD SLAM techniques have shown promise in dense Simultaneous Localization And Mapping (SLAM), yet face challenges such as error accumulation during camera tracking, resulting in distorted maps. In response, we introduce Loopy-SLAM, which globally optimizes poses and the dense 3D model. We use frame-to-model tracking with a data-driven, point-based submap generation method and trigger loop closures online by performing global place recognition. Robust pose graph optimization is used to rigidly align the local submaps. As our representation is point-based, map corrections can be performed efficiently without the need to store the entire history of input frames, as required by methods employing a grid-based mapping structure. Evaluation on the synthetic Replica and real-world TUM-RGBD and ScanNet datasets demonstrates competitive or superior performance in tracking, mapping, and rendering accuracy when compared to existing dense neural RGBD SLAM methods. Our source code will be made available.

#10 Bi-level Learning of Task-Specific Decoders for Joint Registration and One-Shot Medical Image Segmentation

Authors: Xin Fan ; Xiaolin Wang ; Jiaxin Gao ; Jia Wang ; Zhongxuan Luo ; Risheng Liu

One-shot medical image segmentation (MIS) aims to cope with expensive, time-consuming annotations that are subject to inherent human bias. One prevalent method to address one-shot MIS is joint registration and segmentation (JRS) with a shared encoder, which mainly explores the voxel-wise correspondence between the labeled data and unlabeled data for better segmentation. However, this method omits underlying connections between task-specific decoders for segmentation and registration, leading to unstable training. In this paper, we propose a novel Bi-level Learning of Task-Specific Decoders for one-shot MIS, employing a pretrained fixed shared encoder that is shown to adapt more quickly to brand-new datasets than the existing JRS paradigm without a fixed shared encoder. To be more specific, we introduce a bi-level optimization training strategy that treats registration as the major objective and segmentation as a learnable constraint by leveraging inter-task coupling dependencies. Furthermore, we design an appearance conformity constraint strategy that learns the backward transformations generating the fake labeled data used to perform data augmentation instead of the labeled image, to avoid performance degradation caused by inconsistent styles between unlabeled data and labeled data in previous methods. Extensive experiments on the brain MRI task across ABIDE, ADNI, and PPMI datasets demonstrate that the proposed Bi-JROS outperforms state-of-the-art one-shot MIS methods for both segmentation and registration tasks. The code will be available at https://github.com/Coradlut/Bi-JROS.

#11 On the Test-Time Zero-Shot Generalization of Vision-Language Models: Do We Really Need Prompt Learning?

Authors: Maxime Zanella ; Ismail Ben Ayed

The development of large vision-language models, notably CLIP, has catalyzed research into effective adaptation techniques, with a particular focus on soft prompt tuning. Conjointly, test-time augmentation, which utilizes multiple augmented views of a single image to enhance zero-shot generalization, is emerging as a significant area of interest. This has predominantly directed research efforts towards test-time prompt tuning. In contrast, we introduce a robust $\textbf{M}$eanShift for $\textbf{T}$est-time $\textbf{A}$ugmentation (MTA), which surpasses prompt-based methods without requiring this intensive training procedure. This positions MTA as an ideal solution for both standalone and API-based applications. Additionally, our method does not rely on ad hoc rules (e.g., confidence threshold) used in some previous test-time augmentation techniques to filter the augmented views. Instead, MTA incorporates a quality assessment variable for each view, termed the inlierness score, directly into its optimization process. This score is jointly optimized with a density mode seeking process, leading to an efficient training- and hyperparameter-free approach. We extensively benchmark our method on 15 datasets and demonstrate MTA's superiority and computational efficiency. Easily deployed as a plug-and-play module on top of zero-shot models and state-of-the-art few-shot methods, MTA shows systematic and consistent improvements.
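
A heavily simplified sketch of weighted mode seeking over augmented-view embeddings is shown below. It only illustrates the general mean-shift flavor; MTA's actual joint optimization of the inlierness scores and the density mode differs, and `weighted_mode_seek`, the Gaussian bandwidth, and the softmax weighting are assumptions for illustration.

```python
# Much-simplified sketch of mean-shift mode seeking over augmented-view embeddings with
# per-view inlier weights (illustrative; the actual MTA optimization differs).
import torch

def weighted_mode_seek(view_feats: torch.Tensor, bandwidth=0.1, iters=10):
    """view_feats: (V, D) L2-normalized embeddings of augmented views of one image."""
    mode = view_feats.mean(0, keepdim=True)                   # init at the mean view
    for _ in range(iters):
        d2 = ((view_feats - mode) ** 2).sum(-1)               # squared distances to mode
        w = torch.softmax(-d2 / bandwidth, dim=0)             # soft "inlierness" weights
        mode = (w[:, None] * view_feats).sum(0, keepdim=True)  # weighted mean-shift step
    return torch.nn.functional.normalize(mode, dim=-1), w

# Toy usage: 64 augmented views of one image in a 512-d CLIP-like embedding space.
feats = torch.nn.functional.normalize(torch.randn(64, 512), dim=-1)
mode, weights = weighted_mode_seek(feats)
print(mode.shape, weights.sum().item())   # (1, 512), weights sum to 1
```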

#12 SAOR: Single-View Articulated Object Reconstruction

Authors: Mehmet Aygun ; Oisin Mac Aodha

We introduce SAOR, a novel approach for estimating the 3D shape, texture, and viewpoint of an articulated object from a single image captured in the wild. Unlike prior approaches that rely on pre-defined category-specific 3D templates or tailored 3D skeletons, SAOR learns to articulate shapes from single-view image collections with a skeleton-free part-based model without requiring any 3D object shape priors. To prevent ill-posed solutions, we propose a cross-instance consistency loss that exploits disentangled object shape deformation and articulation. This is helped by a new silhouette-based sampling mechanism to enhance viewpoint diversity during training. Our method only requires estimated object silhouettes and relative depth maps from off-the-shelf pre-trained networks during training. At inference time, given a single-view image, it efficiently outputs an explicit mesh representation. We obtain improved qualitative and quantitative results on challenging quadruped animals compared to relevant existing work.

#13 Improving Training Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architecture

Authors: Huijie Zhang ; Yifu Lu ; Ismail Alkhouri ; Saiprasad Ravishankar ; Dogyoon Song ; Qing Qu

Diffusion models, emerging as powerful deep generative tools, excel in various applications. They operate through a two-step process: introducing noise into training samples and then employing a model to convert random noise into new samples (e.g., images). However, their remarkable generative performance is hindered by slow training and sampling. This is due to the necessity of tracking extensive forward and reverse diffusion trajectories, and employing a large model with numerous parameters across multiple timesteps (i.e., noise levels). To tackle these challenges, we present a multi-stage framework inspired by our empirical findings. These observations indicate the advantages of employing distinct parameters tailored to each timestep while retaining universal parameters shared across all timesteps. Our approach involves segmenting the time interval into multiple stages, where we employ a custom multi-decoder U-Net architecture that blends time-dependent models with a universally shared encoder. Our framework enables the efficient distribution of computational resources and mitigates inter-stage interference, which substantially improves training efficiency. Extensive numerical experiments affirm the effectiveness of our framework, showcasing significant training and sampling efficiency enhancements on three state-of-the-art diffusion models, including large-scale latent diffusion models. Furthermore, our ablation studies illustrate the impact of two important components in our framework: (i) a novel time-step clustering algorithm for stage division, and (ii) an innovative multi-decoder U-Net architecture, seamlessly integrating universal and customized hyperparameters.
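
The staging idea (shared encoder, per-interval decoders) can be sketched with a toy module. The snippet below is an architectural illustration under our own assumptions, not the paper's U-Net: `MultiStageDenoiser`, the stage boundaries, and the single-convolution encoder and decoders are placeholders.

```python
# Rough architectural sketch: a shared encoder with one lightweight decoder per timestep
# stage, selected by which interval t falls into (toy illustration, not the paper's model).
import torch
import torch.nn as nn

class MultiStageDenoiser(nn.Module):
    def __init__(self, stages=((0, 300), (300, 700), (700, 1000)), dim=64):
        super().__init__()
        self.stages = stages
        self.encoder = nn.Sequential(nn.Conv2d(3, dim, 3, padding=1), nn.SiLU())
        self.decoders = nn.ModuleList([nn.Conv2d(dim, 3, 3, padding=1) for _ in stages])

    def forward(self, x: torch.Tensor, t: int) -> torch.Tensor:
        h = self.encoder(x)                        # universal, shared across all timesteps
        idx = next(i for i, (lo, hi) in enumerate(self.stages) if lo <= t < hi)
        return self.decoders[idx](h)               # stage-specific decoder for this t

# Toy usage: route two noise levels to different decoders.
net = MultiStageDenoiser()
x = torch.randn(2, 3, 32, 32)
print(net(x, t=50).shape, net(x, t=850).shape)
```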

#14 Friendly Sharpness-Aware Minimization

Authors: Tao Li ; Pan Zhou ; Zhengbao He ; Xinwen Cheng ; Xiaolin Huang

Sharpness-Aware Minimization (SAM) has been instrumental in improving deep neural network training by minimizing both training loss and loss sharpness. Despite the practical success, the mechanisms behind SAM's generalization enhancements remain elusive, limiting its progress in deep learning optimization. In this work, we investigate SAM's core components for generalization improvement and introduce "Friendly-SAM" (F-SAM) to further enhance SAM's generalization. Our investigation reveals the key role of batch-specific stochastic gradient noise within the adversarial perturbation, i.e., the current minibatch gradient, which significantly influences SAM's generalization performance. By decomposing the adversarial perturbation in SAM into full gradient and stochastic gradient noise components, we discover that relying solely on the full gradient component degrades generalization while excluding it leads to improved performance. The possible reason lies in the full gradient component's increase in sharpness loss for the entire dataset, creating inconsistencies with the subsequent sharpness minimization step solely on the current minibatch data. Inspired by these insights, F-SAM aims to mitigate the negative effects of the full gradient component. It removes the full gradient estimated by an exponential moving average (EMA) of historical stochastic gradients, and then leverages stochastic gradient noise for improved generalization. Moreover, we provide theoretical validation for the EMA approximation and prove the convergence of F-SAM on non-convex problems. Extensive experiments demonstrate the superior generalization performance and robustness of F-SAM over vanilla SAM. Code is available at https://github.com/nblt/F-SAM.
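
A hedged sketch of the perturbation step, as we read the abstract, is given below: maintain an EMA estimate of the full gradient, perturb the weights along the remaining stochastic-gradient noise, and take the descent step with the gradient computed at the perturbed point. The function `fsam_step`, its hyperparameters, and the plain SGD update are assumptions; consult the released F-SAM optimizer for the real implementation.

```python
# Simplified F-SAM-style step (our reading of the abstract, not the authors' optimizer):
# perturb along (minibatch gradient - EMA full-gradient estimate), then descend.
import torch

def fsam_step(params, closure, ema_grads, rho=0.05, lam=0.6, lr=0.1):
    """closure() recomputes loss and gradients; ema_grads tracks the full-gradient estimate."""
    loss = closure()
    noise = []
    for p, m in zip(params, ema_grads):
        m.mul_(lam).add_(p.grad, alpha=1 - lam)      # EMA of historical stochastic gradients
        noise.append(p.grad - m)                     # stochastic gradient "noise" component
    scale = rho / (torch.sqrt(sum((n ** 2).sum() for n in noise)) + 1e-12)
    eps = [n * scale for n in noise]
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)                                # ascend along the noise direction
    closure()                                        # gradient at the perturbed point
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)                                # restore original weights
            p.sub_(p.grad, alpha=lr)                 # plain SGD update with the SAM gradient
    return loss

# Toy usage: one step on a quadratic loss.
w = torch.randn(5, requires_grad=True)
ema = [torch.zeros_like(w)]
def closure():
    if w.grad is not None:
        w.grad = None
    loss = ((w - 1.0) ** 2).sum()
    loss.backward()
    return loss
print(fsam_step([w], closure, ema).item())
```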

#15 Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering

Authors: Kim Youwang ; Tae-Hyun Oh ; Gerard Pons-Moll

We present Paint-it, a text-driven high-fidelity texture map synthesis method for 3D meshes via neural re-parameterized texture optimization. Paint-it synthesizes texture maps from a text description by synthesis-through-optimization, exploiting the Score-Distillation Sampling (SDS). We observe that directly applying SDS yields undesirable texture quality due to its noisy gradients. We reveal the importance of texture parameterization when using SDS. Specifically, we propose Deep Convolutional Physically-Based Rendering (DC-PBR) parameterization, which re-parameterizes the physically-based rendering (PBR) texture maps with randomly initialized convolution-based neural kernels, instead of a standard pixel-based parameterization. We show that DC-PBR inherently schedules the optimization curriculum according to texture frequency and naturally filters out the noisy signals from SDS. In experiments, Paint-it obtains remarkable quality PBR texture maps within 15 min., given only a text description. We demonstrate the generalizability and practicality of Paint-it by synthesizing high-quality texture maps for large-scale mesh datasets and showing test-time applications such as relighting and material control using a popular graphics engine. Code will be publicly available.
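
The texture re-parameterization idea can be sketched as follows: instead of optimizing texture pixels directly, optimize the weights of a small randomly initialized CNN that maps a fixed input to the PBR maps. The toy `DCTexture` module below is our illustration of that general principle, not Paint-it's DC-PBR parameterization; the architecture and channel layout are assumptions.

```python
# Rough sketch of a deep-convolutional texture parameterization (illustrative toy):
# gradients from any rendering loss flow into CNN weights rather than raw texture pixels.
import torch
import torch.nn as nn

class DCTexture(nn.Module):
    def __init__(self, res=256, channels=8):   # e.g., albedo(3) + roughness(1) + metallic(1) + normal(3)
        super().__init__()
        self.register_buffer("z", torch.randn(1, 8, res // 4, res // 4))  # fixed random input
        self.net = nn.Sequential(
            nn.Conv2d(8, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=4, mode="bilinear"),
            nn.Conv2d(64, channels, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self):
        return self.net(self.z)                 # (1, channels, res, res) texture maps

# Toy usage: any scalar loss on the maps back-propagates into the convolutional kernels.
tex = DCTexture()
maps = tex()
maps.mean().backward()
print(maps.shape)
```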

#16 From Variance to Veracity: Unbundling and Mitigating Gradient Variance in Differentiable Bundle Adjustment Layers

Authors: Swaminathan Gurumurthy ; Karnik Ram ; Bingqing Chen ; Zachary Manchester ; Zico Kolter

Various pose estimation and tracking problems in robotics can be decomposed into a correspondence estimation problem (often computed using a deep network) followed by a weighted least squares optimization problem to solve for the poses. Recent work has shown that coupling the two problems by iteratively refining one conditioned on the other's output yields SOTA results across domains. However, training these models has proved challenging, requiring a litany of tricks to stabilize and speed up training. In this work, we take the visual odometry problem as an example and identify three plausible causes: (1) flow loss interference, (2) linearization errors in the bundle adjustment (BA) layer, and (3) dependence of weight gradients on the BA residual. We show how these issues result in noisy and higher-variance gradients, potentially leading to a slowdown in training and instabilities. We then propose a simple solution to reduce the gradient variance by using the weights predicted by the network in the inner optimization loop to also weight the correspondence objective in the training problem. This helps the training objective 'focus' on the more important points, thereby reducing the variance and mitigating the influence of outliers. We show that the resulting method leads to faster training and can be more flexibly trained in varying training setups without sacrificing performance. In particular, we show 2-2.5x training speedups over a baseline visual odometry model we modify.
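
The proposed reweighting can be illustrated with a short sketch: reuse the per-correspondence confidence weights from the inner optimization to also weight the correspondence training loss, as we understand the abstract. The function below, its normalization, and its inputs are hypothetical.

```python
# Minimal sketch of the reweighting idea (not the authors' code): confidence weights from
# the BA layer also weight the flow/correspondence training loss, down-weighting outliers.
import torch

def weighted_flow_loss(pred_flow, gt_flow, conf_weights, eps=1e-6):
    """pred_flow, gt_flow: (N, 2); conf_weights: (N,) confidences from the BA layer."""
    per_point = (pred_flow - gt_flow).norm(dim=-1)    # endpoint error per correspondence
    w = conf_weights / (conf_weights.sum() + eps)      # normalize so the loss scale is stable
    return (w * per_point).sum()

# Toy usage.
pred, gt = torch.randn(100, 2), torch.randn(100, 2)
conf = torch.rand(100)
print(weighted_flow_loss(pred, gt, conf).item())
```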

#17 Geometry-aware Reconstruction and Fusion-refined Rendering for Generalizable Neural Radiance Fields

Authors: Tianqi Liu ; Xinyi Ye ; Min Shi ; Zihao Huang ; Zhiyu Pan ; Zhan Peng ; Zhiguo Cao

Generalizable NeRF aims to synthesize novel views for unseen scenes. Common practices involve constructing variance-based cost volumes for geometry reconstruction and encoding 3D descriptors for decoding novel views. However, existing methods show limited generalization ability in challenging conditions due to inaccurate geometry, sub-optimal descriptors, and decoding strategies. We address these issues point by point. First, we find the variance-based cost volume exhibits failure patterns as the features of pixels corresponding to the same point can be inconsistent across different views due to occlusions or reflections. We introduce an Adaptive Cost Aggregation (ACA) approach to amplify the contribution of consistent pixel pairs and suppress inconsistent ones. Unlike previous methods that solely fuse 2D features into descriptors, our approach introduces a Spatial-View Aggregator (SVA) to incorporate 3D context into descriptors through spatial and inter-view interaction. When decoding the descriptors, we observe the two existing decoding strategies excel in different areas, which are complementary. A Consistency-Aware Fusion (CAF) strategy is introduced to leverage the advantages of both. We incorporate the above ACA, SVA, and CAF into a coarse-to-fine framework, termed Geometry-aware Reconstruction and Fusion-refined Rendering (GeFu). GeFu attains state-of-the-art performance across multiple datasets. Code will be released.

#18 TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation

Authors: Sai Kumar Dwivedi ; Yu Sun ; Priyanka Patel ; Yao Feng ; Michael J. Black

We address the problem of regressing 3D human pose and shape from a single image, with a focus on 3D accuracy. The current best methods leverage large datasets of 3D pseudo-ground-truth (p-GT) and 2D keypoints, leading to robust performance. With such methods, however, we observe a paradoxical decline in 3D pose accuracy with increasing 2D accuracy. This is caused by biases in the p-GT and the use of an approximate camera projection model. We quantify the error induced by current camera models and show that fitting 2D keypoints and p-GT accurately causes incorrect 3D poses. Our analysis defines the invalid distances within which minimizing 2D and p-GT losses is detrimental. We use this to formulate a new loss, “Threshold-Adaptive Loss Scaling” (TALS), that penalizes gross 2D and p-GT errors but not smaller ones. With such a loss, there are many 3D poses that could equally explain the 2D evidence. To reduce this ambiguity we need a prior over valid human poses but such priors can introduce unwanted bias. To address this, we exploit a tokenized representation of human pose and reformulate the problem as token prediction. This restricts the estimated poses to the space of valid poses, effectively improving robustness to occlusion. Extensive experiments on the EMDB and 3DPW datasets show that our reformulated loss and tokenization allows us to train on in-the-wild data while improving 3D accuracy over the state-of-the-art. Our models and code are available for research at https://tokenhmr.is.tue.mpg.de.
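
A toy version of a thresholded keypoint loss in the spirit of TALS is sketched below; the paper's exact formulation, thresholds, and scaling differ, so treat this purely as illustration.

```python
# Hedged sketch of a threshold-style 2D keypoint loss: penalize gross reprojection errors
# but let small errors inside a tolerance go unpunished (illustrative, not the TALS loss).
import torch

def thresholded_keypoint_loss(pred_2d, gt_2d, tau=0.05):
    """Penalize per-keypoint errors only beyond a threshold tau (normalized coordinates)."""
    err = (pred_2d - gt_2d).norm(dim=-1)           # (..., K) per-keypoint error
    return torch.clamp(err - tau, min=0.0).mean()  # errors within tau incur no penalty

# Toy usage: a batch of 4 poses with 17 keypoints in normalized image coordinates.
pred, gt = torch.rand(4, 17, 2), torch.rand(4, 17, 2)
print(thresholded_keypoint_loss(pred, gt).item())
```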

#19 3D Multi-frame Fusion for Video Stabilization

Authors: Zhan Peng ; Xinyi Ye ; Weiyue Zhao ; Tianqi Liu ; Huiqiang Sun ; Baopu Li ; Zhiguo Cao

In this paper, we present RStab, a novel framework for video stabilization that integrates 3D multi-frame fusion through volume rendering. Departing from conventional methods, we introduce a 3D multi-frame perspective to generate stabilized images, addressing the challenge of full-frame generation while preserving structure. The core of our RStab framework lies in Stabilized Rendering (SR), a volume rendering module, fusing multi-frame information in 3D space. Specifically, SR involves warping features and colors from multiple frames by projection, fusing them into descriptors to render the stabilized image. However, the precision of warped information depends on the projection accuracy, a factor significantly influenced by dynamic regions. In response, we introduce the Adaptive Ray Range (ARR) module to integrate depth priors, adaptively defining the sampling range for the projection process. Additionally, we propose Color Correction (CC) assisting geometric constraints with optical flow for accurate color aggregation. Thanks to these three modules, our RStab demonstrates superior performance compared with previous stabilizers in terms of field of view (FOV), image quality, and video stability across various datasets.

#20 BodyMAP - Jointly Predicting Body Mesh and 3D Applied Pressure Map for People in Bed

Authors: Abhishek Tandon ; Anujraaj Goyal ; Henry M. Clever ; Zackory Erickson

Accurately predicting the 3D human posture and the pressure exerted on the body for people resting in bed, visualized as a body mesh (3D pose & shape) with a 3D pressure map, holds significant promise for healthcare applications, particularly, in the prevention of pressure ulcers. Current methods focus on singular facets of the problem---predicting only 2D/3D poses, generating 2D pressure images, predicting pressure only for certain body regions instead of the full body, or forming indirect approximations to the 3D pressure map. In contrast, we introduce BodyMAP, which jointly predicts the human body mesh and 3D applied pressure map across the entire human body. Our network leverages multiple visual modalities, incorporating both a depth image of a person in bed and its corresponding 2D pressure image acquired from a pressure-sensing mattress. The 3D pressure map is represented as a pressure value at each mesh vertex and thus allows for precise localization of high-pressure regions on the body. Additionally, we present BodyMAP-WS, a new formulation of pressure prediction in which we implicitly learn pressure in 3D by aligning sensed 2D pressure images with a differentiable 2D projection of the predicted 3D pressure maps. In evaluations with real-world human data, our method outperforms the current state-of-the-art technique by 25% on both body mesh and 3D applied pressure map prediction tasks for people in bed.
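
The weak-supervision idea behind BodyMAP-WS can be sketched with a differentiable projection: splat predicted per-vertex pressures onto the mattress plane and compare with the sensed 2D pressure image. The snippet below is our own simplified, assumption-laden illustration (orthographic nearest-cell scatter), not the paper's projection.

```python
# Illustrative sketch (assumptions, not the BodyMAP-WS code): project predicted per-vertex
# pressure onto the mattress plane and supervise it with the sensed 2D pressure image.
import torch

def project_pressure(verts_xy, vert_pressure, grid_hw=(64, 27)):
    """verts_xy in [0,1]^2; returns an (H, W) image via differentiable scatter-add."""
    H, W = grid_hw
    ix = (verts_xy[:, 0] * (W - 1)).long().clamp(0, W - 1)
    iy = (verts_xy[:, 1] * (H - 1)).long().clamp(0, H - 1)
    img = torch.zeros(H * W, dtype=vert_pressure.dtype, device=vert_pressure.device)
    img = img.scatter_add(0, iy * W + ix, vert_pressure)   # gradients flow to vert_pressure
    return img.view(H, W)

# Toy usage: align the projected pressure with a sensed image via an L2 loss.
verts = torch.rand(6890, 2)                    # e.g., an SMPL-sized vertex count
pressure = torch.rand(6890, requires_grad=True)
sensed = torch.rand(64, 27)
loss = ((project_pressure(verts, pressure) - sensed) ** 2).mean()
loss.backward()
print(pressure.grad.shape)
```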

#21 Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos

Authors: Sagnik Majumder ; Ziad Al-Halah ; Kristen Grauman

We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos. Our method uses a masked auto-encoding framework to synthesize masked binaural (multi-channel) audio through the synergy of audio and vision, thereby learning useful spatial relationships between the two modalities. We use our pretrained features to tackle two downstream video tasks requiring spatial understanding in social scenarios: active speaker detection and spatial audio denoising. Through extensive experiments, we show that our features are generic enough to improve over multiple state-of-the-art baselines on both tasks on two challenging egocentric video datasets that offer binaural audio, EgoCom and EasyCom.

#22 DiffAvatar: Simulation-Ready Garment Optimization with Differentiable Simulation

Authors: Yifei Li ; Hsiaoyu Chen ; Egor Larionov ; Nikolaos Sarafianos ; Wojciech Matusik ; Tuur Stuyck

The realism of digital avatars is crucial in enabling telepresence applications with self-expression and customization. A key aspect of this realism originates from the physical accuracy of both a true-to-life body shape and clothing. While physical simulations can produce high-quality, realistic motions for clothed humans, they require precise estimation of body shape and high-quality garment assets with associated physical parameters for cloth simulations. However, manually creating these assets and calibrating their parameters is labor-intensive and requires specialized expertise. To address this gap, we propose DiffAvatar, a novel approach that performs body and garment co-optimization using differentiable simulation. By integrating physical simulation into the optimization loop and accounting for the complex non-linear behavior of cloth and its intricate interaction with the body, our framework recovers body and garment geometry and extracts important material parameters in a physically plausible way. Our experiments demonstrate that our approach generates realistic clothing and body shape that can be easily used in downstream applications.

#23 CCEdit: Creative and Controllable Video Editing via Diffusion Models

Authors: Ruoyu Feng ; Wenming Weng ; Yanhui Wang ; Yuhui Yuan ; Jianmin Bao ; Chong Luo ; Zhibo Chen ; Baining Guo

In this paper, we present CCEdit, a versatile generative video editing framework based on diffusion models. Our approach employs a novel trident network structure that separates structure and appearance control, ensuring precise and creative editing capabilities. Utilizing the foundational ControlNet architecture, we maintain the structural integrity of the video during editing. The incorporation of an additional appearance branch enables users to exert fine-grained control over the edited key frame. These two side branches seamlessly integrate into the main branch, which is constructed upon existing text-to-image (T2I) generation models, through learnable temporal layers. The versatility of our framework is demonstrated through a diverse range of choices in both structure representations and personalized T2I models. To facilitate comprehensive evaluation, we introduce the BalanceCC benchmark dataset, comprising 100 videos and 4 target prompts for each video. Our extensive user studies compare CCEdit with eight state-of-the-art video editing methods. The outcomes demonstrate CCEdit's substantial superiority over all other methods, affirming its exceptional editing capability.

#24 Hierarchical Patch Diffusion Models for High-Resolution Video Generation

Authors: Ivan Skorokhodov ; Willi Menapace ; Aliaksandr Siarohin ; Sergey Tulyakov

Diffusion models have demonstrated remarkable performance in image and video synthesis. However, scaling them to high-resolution inputs is challenging and requires restructuring the diffusion pipeline into multiple independent components, limiting scalability and complicating downstream applications. In this work, we study patch diffusion models (PDMs) --- a diffusion paradigm which models the distribution of patches, rather than whole inputs, keeping up to ${\approx}$0.7\% of the original pixels. This makes it very efficient during training and unlocks end-to-end optimization on high-resolution videos. We improve PDMs in two principled ways. First, to enforce consistency between patches, we develop \emph{deep context fusion} --- an architectural technique that propagates the context information from low-scale to high-scale patches in a hierarchical manner. Second, to accelerate training and inference, we propose \emph{adaptive computation}, which allocates more network capacity and computation towards coarse image details. The resulting model sets a new state-of-the-art FVD score of 66.32 and Inception Score of 87.68 in class-conditional video generation on UCF-101 $256^2$, surpassing recent methods by more than 100\%. Then, we show that it can be rapidly fine-tuned from a base $36\times 64$ low-resolution generator for high-resolution $64 \times 288 \times 512$ text-to-video synthesis. To the best of our knowledge, our model is the first diffusion-based architecture which is trained on such high resolutions entirely end-to-end. Project webpage: https://snap-research.github.io/hpdm.

#25 Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches

Authors: Qing Yu ; Mikihiro Tanaka ; Kent Fujiwara

To build a cross-modal latent space between 3D human motion and language, acquiring large-scale and high-quality human motion data is crucial. However, unlike the abundance of image data, the scarcity of motion data has limited the performance of existing motion-language models. To counter this, we introduce ``motion patches'', a new representation of motion sequences, and propose using Vision Transformers (ViT) as motion encoders via transfer learning, aiming to extract useful knowledge from the image domain and apply it to the motion domain. These motion patches, created by dividing and sorting skeleton joints based on body parts in motion sequences, are robust to varying skeleton structures, and can be regarded as color image patches in ViT. We find that transfer learning with pre-trained weights of ViT obtained through training with 2D image data can boost the performance of motion analysis, presenting a promising direction for addressing the issue of limited motion data. Our extensive experiments show that the proposed motion patches, used jointly with ViT, achieve state-of-the-art performance in the benchmarks of text-to-motion retrieval, and other novel challenging tasks, such as cross-skeleton recognition, zero-shot motion classification, and human interaction recognition, which are currently impeded by the lack of data.
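
A toy construction of "motion patches" under our own assumptions is sketched below: group joint trajectories by body part and reshape each group into an image-like patch with xyz as channels. The grouping, shapes, and function name are hypothetical; the paper's construction (dividing and sorting joints by body part) is more careful.

```python
# Hedged sketch of turning a motion sequence into ViT-style "motion patches"
# (our toy reading of the abstract, not the paper's exact construction).
import torch

def motion_patches(motion: torch.Tensor, part_index: list):
    """motion: (T, J, 3) joint trajectories; part_index: list of joint-id lists per body part."""
    patches = []
    for joints in part_index:
        part = motion[:, joints, :]               # (T, Jp, 3), xyz treated as "RGB" channels
        patches.append(part.permute(2, 0, 1))     # (3, T, Jp) image-like patch
    return patches

# Toy usage: a 60-frame, 22-joint motion split into 5 hypothetical body parts.
motion = torch.randn(60, 22, 3)
parts = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15, 16], [17, 18, 19, 20, 21]]
print([p.shape for p in motion_patches(motion, parts)])
```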