ECCV.2024 - Poster

| Total: 2187

#1 4D Contrastive Superflows are Dense 3D Representation Learners [PDF5] [Copy] [Kimi1] [REL]

Authors: Xiang Xu ; Lingdong Kong ; Hui Shuai ; Wenwei Zhang ; Liang Pan ; Kai Chen ; Ziwei Liu ; Qingshan Liu

In the realm of autonomous driving, accurate 3D perception is the foundation. However, developing such models relies on extensive human annotations -- a process that is both costly and labor-intensive. To address this challenge from a data representation learning perspective, we introduce SuperFlow, a novel framework designed to harness consecutive LiDAR-camera pairs for establishing spatiotemporal pretraining objectives. SuperFlow stands out by integrating two key designs: 1) a dense-to-sparse consistency regularization, which promotes insensitivity to point cloud density variations during feature learning, and 2) a flow-based contrastive learning module, carefully crafted to extract meaningful temporal cues from readily available sensor calibrations. To further boost learning efficiency, we incorporate a plug-and-play view consistency module that enhances the alignment of the knowledge distilled from camera views. Extensive comparative and ablation studies across 11 heterogeneous LiDAR datasets validate our effectiveness and superiority. Additionally, we observe several interesting emerging properties by scaling up the 2D and 3D backbones during pretraining, shedding light on the future research of 3D foundation models for LiDAR-based perception. Our code is attached to the supplementary material and will be made publicly available.

#2 Octopus: Embodied Vision-Language Programmer from Environmental Feedback [PDF2] [Copy] [Kimi1] [REL]

Authors: Jingkang Yang ; Yuhao Dong ; Shuai Liu ; Bo Li ; Ziyue Wang ; ChenCheng Jiang ; Haoran Tan ; Jiamu Kang ; Yuanhan Zhang ; Kaiyang Zhou ; Ziwei Liu

Large vision-language models (VLMs) have achieved substantial progress in multimodal perception and reasoning. When integrated into an embodied agent, existing embodied VLM works either output detailed action sequences at the manipulation level or only provide plans at an abstract level, leaving a gap between high-level planning and real-world manipulation. To bridge this gap, we introduce Octopus, an embodied vision-language programmer that uses executable code generation as a medium to connect planning and manipulation. Octopus is designed to 1) proficiently comprehend an agent's visual and textual task objectives, 2) formulate intricate action sequences, and 3) generate executable code. To facilitate Octopus model development, we introduce OctoVerse: a suite of environments tailored for benchmarking vision-based code generators on a wide spectrum of tasks, ranging from mundane daily chores in simulators to sophisticated interactions in complex video games such as Grand Theft Auto (GTA) and Minecraft. To train Octopus, we leverage GPT-4 to control an explorative agent that generates training data, i.e., action blueprints and corresponding executable code. We also collect feedback that enables an enhanced training scheme called Reinforcement Learning with Environmental Feedback (RLEF). Through a series of experiments, we demonstrate Octopus's functionality and present compelling results, showing that the proposed RLEF refines the agent's decision-making. By open-sourcing our simulation environments, dataset, and model architecture, we aspire to ignite further innovation and foster collaborative applications within the broader embodied AI community.

#3 ItTakesTwo: Leveraging Peer Representations for Semi-supervised LiDAR Semantic Segmentation [PDF2] [Copy] [Kimi] [REL]

Authors: Yuyuan Liu ; Yuanhong Chen ; Hu Wang ; Vasileios Belagiannis ; Ian Reid ; Gustavo Carneiro

The costly and time-consuming annotation process to produce large training sets for modelling semantic LiDAR segmentation methods has motivated the development of semi-supervised learning (SSL) methods. However, such SSL approaches often concentrate on employing consistency learning only for individual LiDAR representations. This narrow focus results in limited perturbations that generally fail to enable effective consistency learning. Additionally, these SSL approaches employ contrastive learning based on the sampling from a limited set of positive and negative embedding samples, rather than considering a more effective sampling from a distribution of positive and negative embeddings. This paper introduces a novel semi-supervised LiDAR semantic segmentation framework called ItTakesTwo (IT2). IT2 is designed to ensure consistent predictions from peer LiDAR representations, thereby improving the perturbation effectiveness in consistency learning. Furthermore, our contrastive learning employs informative samples drawn from a distribution of positive and negative embeddings learned from the entire training set. Results on public benchmarks show that our approach achieves remarkable improvements over the previous state-of-the-art (SOTA) methods in the field. Code will be available.

#4 Weakly Supervised 3D Object Detection via Multi-Level Visual Guidance [PDF] [Copy] [Kimi1] [REL]

Authors: Kuan-Chih Huang ; Yi-Hsuan Tsai ; Ming-Hsuan Yang

Weakly supervised 3D object detection aims to learn a 3D detector with lower annotation cost, e.g., 2D labels. Unlike prior work which still relies on few accurate 3D annotations, we propose a framework to study how to leverage constraints between 2D and 3D domains without requiring any 3D labels. Specifically, we employ visual data from three perspectives to establish connections between 2D and 3D domains. First, we design a feature-level constraint to align LiDAR and image features based on object-aware regions. Second, the output-level constraint is developed to enforce the overlap between 2D and projected 3D box estimations. Finally, the training-level constraint is utilized by producing accurate and consistent 3D pseudo-labels that align with the visual data. We conduct extensive experiments on the KITTI dataset to validate the effectiveness of the proposed three constraints. Without using any 3D labels, our method achieves favorable performance against state-of-the-art approaches and is competitive with the method that uses 500-frame 3D annotations. Code and models will be made publicly available.

#5 Modeling and Driving Human Body Soundfields through Acoustic Primitives [PDF] [Copy] [Kimi] [REL]

Authors: Chao Huang ; Dejan Markovic ; Chenliang Xu ; Alexander Richard

While rendering and animation of photorealistic 3D human body models have matured and reached an impressive quality over the past years, modeling the spatial audio associated with such full body models has been largely ignored so far. In this work, we present a framework that allows for high-quality spatial audio generation, capable of rendering the full 3D soundfield generated by a human body, including speech, footsteps, hand-body interactions, and others. Given a basic audio-visual representation of the body in form of 3D body pose and audio from a head-mounted microphone, we demonstrate that we can render the full acoustic scene at any point in 3D space efficiently and accurately. To enable near-field and realtime rendering of sound, we borrow the idea of volumetric primitives from graphical neural rendering and transfer them into the acoustic domain. Our acoustic primitives result in an order of magnitude smaller soundfield representations and overcome deficiencies in near-field rendering compared to previous approaches.

#6 Motion Mamba: Efficient and Long Sequence Motion Generation [PDF] [Copy] [Kimi] [REL]

Authors: Zeyu Zhang ; Akide Liu ; Ian Reid ; Richard Hartley ; Bohan Zhuang ; Hao Tang

Human motion generation stands as a significant pursuit in generative computer vision, while achieving long-sequence and efficient motion generation remains challenging. Recent advancements in state space models (SSMs), notably Mamba, have showcased considerable promise in long sequence modeling with an efficient hardware-aware design, which appears to be a promising direction to build motion generation model upon it. Nevertheless, adapting the SSMs to motion generation faces hurdles since the lack of specialized design architecture for modeling motion sequence. To address these multifaceted challenges, we introduce three key contributions. Firstly, we proposed Motion Mamba, an innovative yet straightforward approach that presents the pioneering motion generation model utilized SSMs. Secondly, we designed a Hierarchical Temporal Mamba (HTM) block to process temporal data by traversing through a symmetric architecture aimed at preserving motion consistency between frames. We also designed a Bidirectional Spatial Mamba (BSM) block to bidirectionally process latent poses, in order to enhance accurate motion generation within a temporal frame. Lastly, the proposed method has outperformed other well-established methods on the HumanML3D and KIT-ML datasets, which demonstrates strong capabilities of high-quality long sequence motion modeling and real-time human motion generation.

#7 Learning to Generate Conditional Tri-plane for 3D-aware Expression Controllable Portrait Animation [PDF] [Copy] [Kimi] [REL]

Authors: Taekyung Ki ; Dongchan Min ; Gyeongsu Chae

In this paper, we present Export3D, a one-shot 3D-aware portrait animation method that is able to control the facial expression and camera view of a given portrait image. To achieve this, we introduce a tri-plane generator that directly generates a tri-plane of 3D prior by transferring the expression parameter of 3DMM into the source image. The tri-plane is then decoded into the image of different view through a differentiable volume rendering. Existing portrait animation methods heavily rely on image warping to transfer the expression in the motion space, challenging on disentanglement of appearance and expression. In contrast, we propose a contrastive pre-training framework for appearance-free expression parameter, eliminating undesirable appearance swap when transferring a cross-identity expression. Extensive experiments show that our pre-training framework can learn the appearance-free expression representation hidden in 3DMM, and our model can generate 3D-aware expression controllable portrait image without appearance swap in the cross-identity manner.

#8 SAGS: Structure-Aware 3D Gaussian Splatting [PDF1] [Copy] [Kimi] [REL]

Authors: Evangelos Ververas ; Rolandos Alexandros Potamias ; Song Jifei ; Jiankang Deng ; Stefanos Zafeiriou

Following the advent of NeRFs, 3D Gaussian Splatting (3D-GS) has paved the way to real-time neural rendering overcoming the computational burden of volumetric methods. Following the pioneering work of 3D-GS, several methods have attempted to achieve compressible and high-fidelity performance. However, by employing a geometry-agnostic optimization scheme, these methods neglect the inherent 3D structure of the scene, thereby restricting the expressivity and the quality of the representation, resulting in various floating points and artifacts. In this work, we propose a structure-aware Gaussian Splatting method (SAGS) that implicitly encodes the geometry of the scene, which reflects to state-of-the-art rendering performance and reduced storage requirements on benchmark novel-view synthesis datasets. SAGS is founded on a local-global graph representation that facilitates the learning of complex scenes and enforces meaningful point displacements that preserve the scene's geometry. Additionally, we introduce a lightweight version of SAGS, using a simple yet effective mid-point interpolation scheme, which showcases a compact representation of the scene with up to 20$\times$ size reduction without the reliance on any compression strategies. Extensive experiments across multiple benchmark datasets demonstrate the superiority of SAGS compared to state-of-the-art 3D-GS methods under both rendering quality and model size. Besides, we demonstrate that our structure-aware method can effectively mitigate floating artifacts and irregular distortions of previous methods while obtaining precise depth maps. Code and models will be publicly available.

#9 3DGazeNet: Generalizing Gaze Estimation with Weak Supervision from Synthetic Views [PDF] [Copy] [Kimi] [REL]

Authors: Evangelos Ververas ; Polydefkis Gkagkos ; Jiankang Deng ; Michail C Doukas ; Jia Guo ; Stefanos Zafeiriou

Developing gaze estimation models that generalize well to unseen domains and in-the-wild conditions remains a challenge with no known best solution. This is mostly due to the difficulty of acquiring ground truth data that cover the distribution of faces, head poses, and environments that exist in the real world. Most recent methods attempt to close the gap between specific source and target domains using domain adaptation. In this work, we propose to train general gaze estimation models which can be directly employed in novel environments without adaptation. To do so, we leverage the observation that head, body, and hand pose estimation benefit from revising them as dense 3D coordinate prediction, and similarly express gaze estimation as regression of dense 3D eye meshes. To close the gap between image domains, we create a large-scale dataset of diverse faces with gaze pseudo-annotations, which we extract based on the 3D geometry of the scene, and design a multi-view supervision framework to balance their effect during training. We test our method in the task of gaze generalization, in which we demonstrate improvement of up to 30% compared to state-of-the-art when no ground truth data are available, and up to 10% when they are. The project material will become available for research purposes.

#10 Generating Physically Realistic and Directable Human Motions from Multi-Modal Inputs [PDF] [Copy] [Kimi] [REL]

Authors: Aayam Shrestha ; Pan Liu ; German Ros ; Kai Yuan ; Alan Fern

This work focuses on generating realistic, physically-based human behaviors from multi-modal inputs, which may only partially specify the desired motion. For example, the input may come from a VR controller providing arm motion and body velocity, partial key-point animation, computer vision applied to videos, or even higher-level motion goals. This requires a versatile low-level humanoid controller that can handle such sparse, under-specified guidance, seamlessly switch between skills, and recover from failures. Current approaches for learning humanoid controllers from demonstration data capture some of these characteristics, but none achieve them all. To this end, we introduce the Masked Humanoid Controller (MHC), a novel approach that applies multi-objective imitation learning on augmented and selectively masked motion demonstrations. The training methodology results in an MHC that exhibits the key capabilities of catch-up to out-of-sync input commands, combining elements from multiple motion sequences, and completing unspecified parts of motions from sparse multimodal input. We demonstrate these key capabilities for an MHC learned over a dataset of 87 diverse skills and showcase different multi-modal use cases, including integration with planning frameworks to highlight MHC’s ability to solve new user-defined tasks without any finetuning.

#11 Disentangling Masked Autoencoders for Unsupervised Domain Generalization [PDF] [Copy] [Kimi1] [REL]

Authors: An Zhang ; Han Wang ; Xiang Wang ; Tat-Seng Chua

Domain Generalization (DG), designed to enhance out-of-distribution (OOD) generalization, is all about learning invariance against domain shifts utilizing sufficient supervision signals. Yet, the scarcity of such labeled data has led to the rise of unsupervised domain generalization (UDG) — a more important yet challenging task in that models are trained across diverse domains in an unsupervised manner and eventually tested on unseen domains. UDG is fast gaining attention but is still far from well-studied. To close the research gap, we propose a novel learning framework designed for UDG, termed the Disentangled Masked AutoEncoder (DisMAE), aiming to discover the disentangled representations that faithfully reveal the intrinsic features and superficial variations without access to the class label. At its core is the distillation of domain-invariant semantic features, which cannot be distinguished by domain classifier, while filtering out the domain-specific variations (for example, color schemes and texture patterns) that are unstable and redundant. Notably, DisMAE co-trains the asymmetric dual-branch architecture with semantic and lightweight variation encoders, offering dynamic data manipulation and representation level augmentation capabilities. Extensive experiments on four benchmark datasets (i.e., DomainNet, PACS, VLCS, and Colored MNIST) with both DG and UDG tasks demonstrate that DisMAE can achieve competitive OOD performance compared with the state-of-the-art DG and UDG baselines, which shed light on potential research lines for improving generalization ability with large-scale unlabeled data.

#12 BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos [PDF] [Copy] [Kimi] [REL]

Authors: Pilhyeon Lee ; Hyeran Byun

Temporal sentence grounding aims to localize moments relevant to a language description. Recently, DETR-like approaches achieved notable progress by predicting the center and length of a target moment. However, they suffer from the issue of center misalignment raised by the inherent ambiguity of moment centers, leading to inaccurate predictions. To remedy this problem, we propose a novel boundary-oriented moment formulation. In our paradigm, the model no longer needs to find the precise center but instead suffices to predict any anchor point within the interval, from which the boundaries are directly estimated. Based on this idea, we design a boundary-aligned moment detection transformer, equipped with a dual-pathway decoding process. Specifically, it refines the anchor and boundaries within parallel pathways using global and boundary-focused attention, respectively. This separate design allows the model to focus on desirable regions, enabling precise refinement of moment predictions. Further, we propose a quality-based ranking method, ensuring that proposals with high localization qualities are prioritized over incomplete ones. Experiments on three benchmarks validate the effectiveness of the proposed methods. The code will be publicly available.

#13 Optimizing Factorized Encoder Models: Time and Memory Reduction for Scalable and Efficient Action Recognition [PDF1] [Copy] [Kimi] [REL]

Authors: Shreyank Narayana Gowda ; Anurag Arnab ; Jonathan Huang

In this paper, we address the challenges posed by the substantial training time and memory consumption associated with video transformers, focusing on the ViViT (Video Vision Transformer) model, in particular the Factorised Encoder version, as our baseline for action recognition tasks. The factorised encoder variant follows the late-fusion approach that is adopted by many state of the art approaches. Despite standing out for its favorable speed/accuracy tradeoffs among the different variants of ViViT, its considerable training time and memory requirements still pose a significant barrier to entry. Our method is designed to lower this barrier and is based on the idea of freezing the spatial transformer during training. This leads to a low accuracy model if naively done. But we show that by (1) appropriately initializing the temporal transformer (a module responsible for processing temporal information) (2) introducing a compact adapter model connecting frozen spatial representations (a module that selectively focuses on regions of the input image) to the temporal transformer, we can enjoy the benefits of freezing the spatial transformer without sacrificing accuracy. Through extensive experimentation over 6 benchmarks, we demonstrate that our proposed training strategy significantly reduces training costs (by ) and memory consumption while maintaining or slightly improving performance by up to 1.79\% compared to the baseline model. Our approach additionally unlocks the capability to utilize larger image transformer models as our spatial transformer and access more frames with the same memory consumption. We also show the generalization of this approach to other factorized encoder models. The advancements made in this work have the potential to advance research in the video understanding domain and provide valuable insights for researchers and practitioners with limited resources, paving the way for more efficient and scalable alternatives in the action recognition field.

#14 CPT-VR: Improving Surface Rendering via Closest Point Transform with View-Reflection Appearance [PDF] [Copy] [Kimi] [REL]

Authors: Zhipeng Hu ; Yongqiang Zhang ; Chen Liu ; Lincheng Li ; Sida Peng ; Xiaowei Zhou ; Changjie Fan ; Xin Yu

Differentiable surface rendering has significantly advanced 3D reconstruction. Existing surface rendering methods assume that the local surface is planar, and thus employ linear approximation based on the Singed Distance Field (SDF) values to predict the point on the surface. However, this assumption overlooks the inherently irregular and non-planar nature of object surfaces in the real world. Consequently, the approximate points tend to deviate from the zero-level set, affecting the fidelity of the reconstructed shape. In this paper, we propose a novel surface rendering method termed CPT-VR, which leverages the Closet Point Transform (CPT) and View and Reflection direction vectors to enhance the quality of reconstruction. Specifically, leveraging the physical property of CPT that accurately projects points near the surface onto the zero-level set, we correct the deviated points, thus achieving an accurate geometry representation. Based on our accurate geometry representation, incorporating the reflection vector into our method can facilitate the appearance modeling of specular regions. Moreover, to enable our method to no longer be dependent on any prior knowledge of the background, we present a background model to learn the background appearance. Compared to previous state-of-the-art methods, CPT-VR achieves better surface reconstruction quality, even for cases with complex structures and specular highlights.

#15 Motion-prior Contrast Maximization for Dense Continuous-Time Motion Estimation [PDF] [Copy] [Kimi] [REL]

Authors: Friedhelm Hamann ; Ziyun Wang ; Ioannis Asmanis ; Kenneth Chaney ; Guillermo Gallego ; Kostas Daniilidis

Current optical flow and point-tracking methods rely heavily on synthetic datasets. Event cameras are novel vision sensors with advantages in challenging visual conditions, but state-of-the-art frame-based methods cannot be easily adapted to event data due to the limitations of current event simulators. We introduce a novel self-supervised loss combining the Contrast Maximization framework with a non-linear motion prior in the form of pixel-level trajectories and propose an efficient solution to solve the high-dimensional assignment problem between non-linear trajectories and events. Their effectiveness is demonstrated in two scenarios: In dense continuous-time motion estimation, our method improves the zero-shot performance of a synthetically trained model on the real-world dataset EVIMO2 by 29%. In optical flow estimation, our method elevates a simple UNet to achieve state-of-the-art performance among self-supervised methods on the DSEC optical flow benchmark.

#16 SplatFields: Neural Gaussian Splats for Sparse 3D and 4D Reconstruction [PDF] [Copy] [Kimi] [REL]

Authors: Marko Mihajlovic ; Sergey Prokudin ; Siyu Tang ; Robert Maier ; Federica Bogo ; Tony Tung ; Edmond Boyer

Digitizing static scenes and dynamic events from multi-view images has long been a challenge in the fields of computer vision and graphics. Recently, 3D Gaussian Splatting has emerged as a practical and scalable method for reconstruction, gaining popularity due to its impressive quality of reconstruction, real-time rendering speeds, and compatibility with widely used visualization tools. However, the method requires a substantial number of input views to achieve high-quality scene reconstruction, introducing a significant practical bottleneck. This challenge is especially pronounced in capturing dynamic scenes, where deploying an extensive camera array can be prohibitively costly. In this work, we identify the lack of spatial autocorrelation as one of the factors contributing to the suboptimal performance of the 3DGS technique in sparse reconstruction settings. To address the issue, we propose an optimization strategy that effectively regularizes splat features by modeling them as the outputs of a corresponding implicit neural field. This results in a consistent enhancement of reconstruction quality across various scenarios. Our approach adeptly manages both static and dynamic cases, as demonstrated by extensive testing across different setups and scene complexities.

#17 OGNI-DC: Robust Depth Completion with Optimization-Guided Neural Iterations [PDF] [Copy] [Kimi1] [REL]

Authors: Yiming Zuo ; Jia Deng

Depth completion is the task of generating a dense depth map given an image and a sparse depth map as inputs. It has important applications in various downstream tasks. In this paper, we present OGNI-DC, a novel framework for depth completion. The key to our method is "Optimization-Guided Neural Iterations" (OGNI). It consists of a recurrent unit that refines a depth gradient field and a differentiable depth integrator that integrates the depth gradients into a depth map. OGNI-DC exhibits strong generalization, outperforming baselines by a large margin on unseen datasets and across various sparsity levels. Moreover, OGNI-DC has high accuracy, achieving state-of-the-art performance on the NYUv2 and the KITTI benchmarks. Code is attached for reviewing and will be released if the paper is accepted.

#18 MapDistill: Boosting Efficient Camera-based HD Map Construction via Camera-LiDAR Fusion Model Distillation [PDF] [Copy] [Kimi] [REL]

Authors: Xiaoshuai Hao ; Ruikai Li ; Hui Zhang ; Rong Yin ; Dingzhe Li ; Sangil Jung ; Seung-In Park ; ByungIn Yoo ; Haimei Zhao ; Jing Zhang

Online high-definition (HD) map construction is an important and challenging task in autonomous driving. Recently, there has been a growing interest in cost-effective multi-view camera-based methods without relying on other sensors like LiDAR. However, these methods suffer from a lack of explicit depth information, necessitating the use of large models to achieve satisfactory performance. To address this, we employ the Knowledge Distillation (KD) idea for efficient HD map construction for the first time and introduce a novel KD-based approach called MapDistill to transfer knowledge from a high-performance Camera-LiDAR fusion model to a lightweight camera-only model. Specifically, we adopt the teacher-student architecture, i.e., a Camera-LiDAR fusion model as the teacher and a lightweight camera model as the student, and devise a dual BEV transform module to facilitate cross-modal knowledge distillation while maintaining cost-effective camera-only deployment. Additionally, we present a comprehensive distillation scheme encompassing cross-modal relation distillation, dual-level feature distillation, and map head distillation. This approach alleviates knowledge transfer challenges between modalities, enabling the student model to learn improved feature representations for HD map construction. Experimental results on the challenging nuScenes dataset demonstrate the effectiveness of MapDistill, surpassing existing competitors by over 7.7 mAP or 4.5x speedup. The source code will be released.

#19 High-Resolution and Few-shot View Synthesis from Asymmetric Dual-lens Inputs [PDF] [Copy] [Kimi] [REL]

Authors: Ruikang Xu ; Mingde Yao ; Yue Li ; Yueyi Zhang ; Zhiwei Xiong

Novel view synthesis has achieved remarkable quality and efficiency by the paradigm of 3D Gaussian Splatting (3D-GS), but still faces two challenges: 1) significant performance degradation when trained with only few-shot samples due to a lack of geometry constraint, and 2) incapability of rendering at a higher resolution that is beyond the input resolution of training samples. In this paper, we propose Dual-Lens 3D-GS (DL-GS) to achieve high-resolution (HR) and few-shot view synthesis, by leveraging the characteristics of the asymmetric dual-lens system commonly equipped on mobile devices. This kind of system captures the same scene with different focal lengths (\textit{i.e.}, wide-angle and telephoto) under an asymmetric stereo configuration, which naturally provides geometric hints for few-shot training and HR guidance for resolution improvement. Nevertheless, there remain two major technical problems to achieving this goal. First, how to effectively exploit the geometry information from the asymmetric stereo configuration? To this end, we propose a consistency-aware training strategy, which integrates a dual-lens-consistent loss to regularize the 3D-GS optimization. Second, how to make the best use of the dual-lens training samples to effectively improve the resolution of newly synthesized views? To this end, we design a multi-reference-guided refinement module to select proper telephoto and wide-angle guided images from training samples based on the camera pose distances, and then exploit their information for high-frequency detail enhancement. Extensive experiments on simulated and real-captured datasets validate the distinct superiority of our DL-GS over various competitors on the task of HR and few-shot view synthesis.

#20 AFreeCA: Annotation-Free Counting for All [PDF2] [Copy] [Kimi1] [REL]

Authors: Adriano DAlessandro ; Ali Mahdavi-Amiri ; Ghassan Hamarneh

Object counting methods typically rely on manually annotated datasets. The cost of creating such datasets has restricted the versatility of these networks to count objects from specific classes (such as humans or penguins), and counting objects from diverse categories remains a challenge. The availability of robust text-to-image latent diffusion models (LDMs) raises the question of whether these models can be utilized to generate counting datasets. However, LDMs struggle to create images with an exact number of objects based solely on text prompts but they can be used to offer a dependable \textit{sorting} signal by adding and removing objects within an image. Leveraging this data, we initially introduce an unsupervised sorting methodology to learn object-related features that are subsequently refined and anchored for counting purposes using counting data generated by LDMs. Further, we present a density classifier-guided method for dividing an image into patches containing objects that can be reliably counted. Consequently, we can generate counting data for any type of object and count them in an unsupervised manner. Our approach outperforms other unsupervised and few-shot alternatives and is not restricted to specific object classes for which counting data is available. Code to be released upon acceptance.

#21 Adversarially Robust Distillation by Reducing the Student-Teacher Variance Gap [PDF] [Copy] [Kimi] [REL]

Authors: Junhao Dong ; Piotr Koniusz ; Junxi Chen ; Yew Soon Ong

Adversarial robustness generally relies on large-scale architectures and datasets alongside extensive adversary generation, hindering resource-efficient deployment. For scalable solutions, adversarially robust knowledge distillation has emerged as a principle strategy, facilitating the transfer of adversarial robustness from a large-scale teacher model to a lightweight student model. However, existing works focus solely on sample-to-sample alignment of features or predictions between the teacher and student models, overlooking the vital role of their statistical alignment. Thus, we propose a novel adversarially robust knowledge distillation method that integrates the alignment of feature distributions between the teacher and student backbones under adversarial and clean sample sets. To motivate our idea, for an adversarially trained model (e.g., student or teacher), we show that the adversarially robust accuracy (evaluated on testing adversarial samples under an increasing perturbation radius) correlates negatively with the gap between the feature variance evaluated on testing adversarial samples and testing clean samples. Such a negative correlation exhibits a strong linear trend, suggesting that aligning the feature covariance of the student model toward the feature covariance of the teacher model should improve the adversarial robustness of the student model by reducing the variance gap. A similar trend is observed by reducing the variance gap between the gram matrices of the student and teacher models. Extensive evaluations highlight the state-of-the-art adversarial robustness and natural performance of our method across diverse datasets and distillation scenarios.

#22 LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation [PDF] [Copy] [Kimi] [REL]

Authors: Yushi Lan ; Fangzhou Hong ; Shuai Yang ; Shangchen Zhou ; Xuyi Meng ; Bo Dai ; Xingang Pan ; Chen Change Loy

The field of neural rendering has witnessed significant progress with advancements in generative models and differentiable rendering techniques. Though 2D diffusion has achieved success, a unified 3D diffusion pipeline remains unsettled. This paper introduces a novel framework called LN3Diff to address this gap and enable fast, high-quality, and generic conditional 3D generation. Our approach harnesses a 3D-aware architecture and variational autoencoder (VAE) to encode the input image into a structured, compact, and perceptually equivalent latent space. The latent is decoded by a transformer-based decoder into a high-capacity 3D neural field. Through training a diffusion model on this 3D-aware latent space, our method achieves state-of-the-art performance on ShapeNet for 3D generation and demonstrates superior performance in monocular 3D reconstruction and conditional 3D generation across various datasets. Moreover, it surpasses existing 3D diffusion methods in terms of inference speed, requiring no per-instance optimization. Our proposed LN3Diff presents a significant advancement in 3D generative modeling and holds promise for various applications in 3D vision and graphics tasks.

#23 Motion and Structure from Event-based Normal Flow [PDF] [Copy] [Kimi] [REL]

Authors: Zhongyang Ren ; Bangyan Liao ; Delei Kong ; Jinghang Li ; Peidong Liu ; Laurent Kneip ; Guillermo Gallego ; Yi Zhou

Recovering the camera motion and scene geometry from visual data is a fundamental problem in the field of computer vision. Its success in standard vision is attributed to the maturity of feature extraction, data association and multi-view geometry. The recent emergence of neuromorphic event-based cameras places great demands on approaches that use raw event data as input to solve this fundamental problem. Existing state-of-the-art solutions typically infer implicitly data association by iteratively reversing the event data generation process. However, the nonlinear nature of these methods limits their applicability in real-time tasks, and the constant-motion assumption leads to unstable results under agile motion. To this end, we rethink the problem formulation in a way that aligns better with the differential working principle of event cameras. We show that the event-based normal flow can be used, via the proposed geometric error term, as an alternative to the full flow in solving a family of geometric problems that involve instantaneous first-order kinematics and scene geometry. Furthermore, we develop a fast linear solver and a continuous-time nonlinear solver on top of the proposed geometric error term. Experiments on both synthetic and real data demonstrate the superiority of our linear solver in terms of accuracy and efficiency, and indicate its complementary feature as an initialization method for existing nonlinear solvers. Besides, our continuous-time non-linear solver exhibits exceptional capability in accommodating sudden variations in motion since it does not rely on the constant-motion assumption.

#24 Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [PDF1] [Copy] [Kimi] [REL]

Authors: Bohan Li ; Jiajun Deng ; Wenyao Zhang ; Zhujin Liang ; Dalong Du ; Xin Jin ; Wenjun Zeng

Camera-based 3D semantic scene completion (SSC) is pivotal for predicting complicated 3D layouts with limited 2D image observations. The existing mainstream solutions generally leverage temporal information by roughly stacking history frames to supplement the current frame, such straightforward temporal modeling inevitably diminishes valid clues and increases learning difficulty. To address this problem, we present \textbf{HTCL}, a novel \textbf{H}ierarchical \textbf{T}emporal \textbf{C}ontext \textbf{L}earning paradigm for improving camera-based semantic scene completion. The primary innovation of this work involves decomposing temporal context learning into two hierarchical steps: (a) cross-frame affinity measurement and (b) affinity-based dynamic refinement. Firstly, to separate critical relevant context from redundant information, we introduce the pattern affinity with scale-aware isolation and multiple independent learners for fine-grained contextual correspondence modeling. Subsequently, to dynamically compensate for incomplete observations, we adaptively refine the feature sampling locations based on initially identified locations with high affinity and their neighboring relevant regions. Our method ranks $1^{st}$ on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU on the OpenOccupancy benchmark. Our code is available in the supplementary material.

#25 DiscoMatch: Fast Discrete Optimisation for Geometrically Consistent 3D Shape Matching [PDF] [Copy] [Kimi] [REL]

Authors: Paul Roetzer ; Ahmed Abbas ; Dongliang Cao ; Florian Bernard ; Paul Swoboda

In this work we propose to combine the advantages of learning-based and combinatorial formalisms for 3D shape matching. While learning-based methods lead to state-of-the-art matching performance, they do not ensure geometric consistency, so that obtained matchings are locally non-smooth. On the contrary, axiomatic, optimisation-based methods allow to take geometric consistency into account by explicitly constraining the space of valid matchings. However, existing axiomatic formalisms do not scale to practically relevant problem sizes, and require user input for the initialisation of non-convex optimisation problems. We work towards closing this gap by proposing a novel combinatorial solver that combines a unique set of favourable properties: our approach (i) is initialisation free, (ii) is massively parallelisable and powered by a quasi-Newton method, (iii) provides optimality gaps, and (iv) delivers improved matching quality with decreased runtime and globally optimal results for many instances.