Computer Vision and Pattern Recognition

Date: Fri, 19 Jul 2024 | Total: 137

#1 GroupMamba: Parameter-Efficient and Accurate Group Visual State Space Model

Authors: Abdelrahman Shaker ; Syed Talal Wasim ; Salman Khan ; Juergen Gall ; Fahad Shahbaz Khan

Recent advancements in state-space models (SSMs) have showcased effective performance in modeling long-range dependencies with subquadratic complexity. However, pure SSM-based models still face challenges related to stability and achieving optimal performance on computer vision tasks. Our paper addresses the challenges of scaling SSM-based models for computer vision, particularly the instability and inefficiency of large model sizes. To address this, we introduce a Modulated Group Mamba layer which divides the input channels into four groups and applies our proposed SSM-based efficient Visual Single Selective Scanning (VSSS) block independently to each group, with each VSSS block scanning in one of the four spatial directions. The Modulated Group Mamba layer also wraps the four VSSS blocks into a channel modulation operator to improve cross-channel communication. Furthermore, we introduce a distillation-based training objective to stabilize the training of large models, leading to consistent performance gains. Our comprehensive experiments demonstrate the merits of the proposed contributions, leading to superior performance over existing methods for image classification on ImageNet-1K, object detection, instance segmentation on MS-COCO, and semantic segmentation on ADE20K. Our tiny variant with 23M parameters achieves state-of-the-art performance with a classification top-1 accuracy of 83.3% on ImageNet-1K, while being 26% more efficient in terms of parameters compared to the best existing Mamba design of the same model size. Our code and models are available at:
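The channel grouping described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not name the four scan directions, so the sketch assumes forward and backward row-major and column-major traversals, and the function names `scan_order` and `group_scan_sequences` are hypothetical. The SSM block itself is omitted; the code only shows how each channel group would be turned into a token sequence for its own direction.

```python
# Assumed scan directions; the abstract only says "four spatial directions".
DIRECTIONS = ["row_forward", "row_backward", "col_forward", "col_backward"]


def scan_order(height, width, direction):
    """Return the list of (row, col) positions in visiting order."""
    row_major = [(r, c) for r in range(height) for c in range(width)]
    col_major = [(r, c) for c in range(width) for r in range(height)]
    orders = {
        "row_forward": row_major,
        "row_backward": row_major[::-1],
        "col_forward": col_major,
        "col_backward": col_major[::-1],
    }
    return orders[direction]


def group_scan_sequences(feature_map):
    """Split channels into four groups and flatten each along one direction.

    feature_map is indexed as feature_map[channel][row][col].
    Returns a list of (direction, tokens); each token holds the values of
    the group's channels at one spatial position.
    """
    channels = len(feature_map)
    if channels % 4 != 0:
        raise ValueError("channel count must be divisible by 4")
    group_size = channels // 4
    height, width = len(feature_map[0]), len(feature_map[0][0])

    sequences = []
    for index, direction in enumerate(DIRECTIONS):
        group = feature_map[index * group_size:(index + 1) * group_size]
        order = scan_order(height, width, direction)
        tokens = [[channel[r][c] for channel in group] for r, c in order]
        sequences.append((direction, tokens))
    return sequences
```

In a full layer, each sequence would be processed by its own scanning block and the four outputs would be reshaped and concatenated along the channel axis before the channel modulation step.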

Subject: Computer Vision and Pattern Recognition

Publish: 2024-07-18 17:59:58 UTC

#2 Training-Free Model Merging for Multi-target Domain Adaptation

Authors: Wenyi Li ; Huan-ang Gao ; Mingju Gao ; Beiwen Tian ; Rong Zhi ; Hao Zhao

In this paper, we study multi-target domain adaptation of scene understanding models. While previous methods achieved commendable results through inter-domain consistency losses, they often assumed unrealistic simultaneous access to images from all target domains, overlooking constraints such as data transfer bandwidth limitations and data privacy concerns. Given these challenges, we pose the question: How to merge models adapted independently on distinct domains while bypassing the need for direct access to training data? Our solution to this problem involves two components, merging model parameters and merging model buffers (i.e., normalization layer statistics). For merging model parameters, empirical analyses of mode connectivity surprisingly reveal that linear merging suffices when employing the same pretrained backbone weights for adapting separate models. For merging model buffers, we model the real-world distribution with a Gaussian prior and estimate new statistics from the buffers of separately trained models. Our method is simple yet effective, achieving comparable performance with data combination training baselines, while eliminating the need for accessing training data. Project page:
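A rough sketch of the two merging steps mentioned in this abstract follows. The parameter step is plain linear interpolation, as the abstract states. The buffer step is an assumption: the abstract only says new statistics are estimated under a Gaussian prior, so the sketch uses moment matching of a weighted Gaussian mixture, which may differ from the paper's estimator. The names `merge_parameters` and `merge_gaussian_stats` are hypothetical, and parameters are represented as flat lists of floats for simplicity.

```python
def merge_parameters(state_dicts, weights=None):
    """Linearly merge parameters: theta = sum_i w_i * theta_i."""
    count = len(state_dicts)
    weights = weights or [1.0 / count] * count
    merged = {}
    for name in state_dicts[0]:
        length = len(state_dicts[0][name])
        merged[name] = [
            sum(w * sd[name][i] for w, sd in zip(weights, state_dicts))
            for i in range(length)
        ]
    return merged


def merge_gaussian_stats(means, variances, weights=None):
    """Merge per-model normalization statistics for one feature channel.

    Assumption: each model's statistics describe a Gaussian, and the merged
    statistics match the first two moments of their weighted mixture.
    """
    count = len(means)
    weights = weights or [1.0 / count] * count
    mean = sum(w * m for w, m in zip(weights, means))
    second_moment = sum(
        w * (v + m * m) for w, m, v in zip(weights, means, variances)
    )
    return mean, second_moment - mean * mean
```

For example, two models with channel statistics (mean 0, variance 1) and (mean 2, variance 1) merge to mean 1 and variance 2: the extra variance accounts for the gap between the two domain means.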

Subject: Computer Vision and Pattern Recognition

Publish: 2024-07-18 17:59:57 UTC

#3 Addressing Imbalance for Class Incremental Learning in Medical Image Classification

Authors: Xuze Hao ; Wenqian Ni ; Xuhao Jiang ; Weimin Tan ; Bo Yan

Deep convolutional neural networks have made significant breakthroughs in medical image classification, under the assumption that training samples from all classes are simultaneously available. However, in real-world medical scenarios, there's a common need to continuously learn about new diseases, leading to the emerging field of class incremental learning (CIL) in the medical domain. Typically, CIL suffers from catastrophic forgetting when trained on new classes. This phenomenon is mainly caused by the imbalance between old and new classes, and it becomes even more challenging with imbalanced medical datasets. In this work, we introduce two simple yet effective plug-in methods to mitigate the adverse effects of the imbalance. First, we propose a CIL-balanced classification loss to mitigate the classifier bias toward majority classes via logit adjustment. Second, we propose a distribution margin loss that not only alleviates the inter-class overlap in embedding space but also enforces the intra-class compactness. We evaluate the effectiveness of our method with extensive experiments on three benchmark datasets (CCH5000, HAM10000, and EyePACS). The results demonstrate that our approach outperforms state-of-the-art methods.
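To make the idea of logit adjustment concrete, here is a minimal sketch of a generic logit-adjusted cross-entropy, in which the log of each class's empirical prior is added to its logit before the softmax. This is the standard formulation, not necessarily the CIL-balanced loss proposed in the paper; the function name, the `tau` scaling parameter, and the use of raw class counts as the prior are assumptions for illustration.

```python
import math


def logit_adjusted_cross_entropy(logits, target, class_counts, tau=1.0):
    """Cross-entropy on logits shifted by tau * log(class prior)."""
    total = sum(class_counts)
    adjusted = [
        z + tau * math.log(n / total) for z, n in zip(logits, class_counts)
    ]
    # Numerically stable log-sum-exp over the adjusted logits.
    max_z = max(adjusted)
    log_norm = max_z + math.log(sum(math.exp(z - max_z) for z in adjusted))
    return log_norm - adjusted[target]
```

With balanced counts the adjustment cancels and the loss reduces to ordinary cross-entropy. With imbalanced counts, a minority-class sample needs a larger raw logit to reach the same loss, which counteracts the classifier's bias toward majority classes.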

Subject: Computer Vision and Pattern Recognition

Publish: 2024-07-18 17:59:44 UTC

#4 Visual Haystacks: Answering Harder Questions About Sets of Images

Authors: Tsung-Han Wu ; Giscard Biamby ; Jerome Quenum ; Ritwik Gupta ; Joseph E. Gonzalez ; Trevor Darrell ; David M. Chan

Recent advancements in Large Multimodal Models (LMMs) have made significant progress in the field of single-image visual question answering. However, these models face substantial challenges when tasked with queries that span extensive collections of images, similar to real-world scenarios like searching through large photo albums, finding specific information across the internet, or monitoring environmental changes through satellite imagery. This paper explores the task of Multi-Image Visual Question Answering (MIQA): given a large set of images and a natural language query, the task is to generate a relevant and grounded response. We propose a new public benchmark, dubbed "Visual Haystacks (VHs)," specifically designed to evaluate LMMs' capabilities in visual retrieval and reasoning over sets of unrelated images, where we perform comprehensive evaluations demonstrating that even robust closed-source models struggle significantly. Towards addressing these shortcomings, we introduce MIRAGE (Multi-Image Retrieval Augmented Generation), a novel retrieval/QA framework tailored for LMMs that confronts the challenges of MIQA with marked efficiency and accuracy improvements over baseline methods. Our evaluation shows that MIRAGE surpasses closed-source GPT-4o models by up to 11% on the VHs benchmark and offers up to 3.4x improvements in efficiency over text-focused multi-stage approaches.

Subject: Computer Vision and Pattern Recognition

Publish: 2024-07-18 17:59:30 UTC

#5 Shape of Motion: 4D Reconstruction from a Single Video

Authors: Qianqian Wang ; Vickie Ye ; Hang Gao ; Jake Austin ; Zhengqi Li ; Angjoo Kanazawa

Monocular dynamic reconstruction is a challenging and long-standing vision problem due to the highly ill-posed nature of the task. Existing approaches are limited in that they either depend on templates, are effective only in quasi-static scenes, or fail to model 3D motion explicitly. In this work, we introduce a method capable of reconstructing generic dynamic scenes, featuring explicit, full-sequence-long 3D motion, from casually captured monocular videos. We tackle the under-constrained nature of the problem with two key insights: First, we exploit the low-dimensional structure of 3D motion by representing scene motion with a compact set of SE3 motion bases. Each point's motion is expressed as a linear combination of these bases, facilitating soft decomposition of the scene into multiple rigidly-moving groups. Second, we utilize a comprehensive set of data-driven priors, including monocular depth maps and long-range 2D tracks, and devise a method to effectively consolidate these noisy supervisory signals, resulting in a globally consistent representation of the dynamic scene. Experiments show that our method achieves state-of-the-art performance for both long-range 3D/2D motion estimation and novel view synthesis on dynamic scenes. Project Page:

Subject: Computer Vision and Pattern Recognition

Publish: 2024-07-18 17:59:08 UTC

#6 SegPoint: Segment Any Point Cloud via Large Language Model

Authors: Shuting He ; Henghui Ding ; Xudong Jiang ; Bihan Wen

Despite significant progress in 3D point cloud segmentation, existing methods primarily address specific tasks and depend on explicit instructions to identify targets, lacking the capability to infer and understand implicit user intentions in a unified framework. In this work, we propose a model, called SegPoint, that leverages the reasoning capabilities of a multi-modal Large Language Model (LLM) to produce point-wise segmentation masks across a diverse range of tasks: 1) 3D instruction segmentation, 2) 3D referring segmentation, 3) 3D semantic segmentation, and 4) 3D open-vocabulary semantic segmentation. To advance 3D instruction research, we introduce a new benchmark, Instruct3D, designed to evaluate segmentation performance from complex and implicit instructional texts, featuring 2,565 point cloud-instruction pairs. Our experimental results demonstrate that SegPoint achieves competitive performance on established benchmarks such as ScanRefer for referring segmentation and ScanNet for semantic segmentation, while delivering outstanding outcomes on the Instruct3D dataset. To our knowledge, SegPoint is the first model to address these varied segmentation tasks within a single framework, achieving satisfactory performance.

Subject: Computer Vision and Pattern Recognition

Publish: 2024-07-18 17:58:03 UTC

#7 Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion

Authors: Boyang Deng ; Richard Tucker ; Zhengqi Li ; Leonidas Guibas ; Noah Snavely ; Gordon Wetzstein

We present a method for generating Streetscapes: long sequences of views through an on-the-fly synthesized city-scale scene. Our generation is conditioned by language input (e.g., city name, weather), as well as an underlying map/layout hosting the desired trajectory. Compared to recent models for video generation or 3D view synthesis, our method can scale to much longer-range camera trajectories, spanning several city blocks, while maintaining visual quality and consistency. To achieve this goal, we build on recent work on video diffusion, used within an autoregressive framework that can easily scale to long sequences. In particular, we introduce a new temporal imputation method that prevents our autoregressive approach from drifting from the distribution of realistic city imagery. We train our Streetscapes system on a compelling source of data (posed imagery from Google Street View, along with contextual map data), which allows users to generate city views conditioned on any desired city layout, with controllable camera poses. Please see more results at our project page at

Subjects: Computer Vision and Pattern Recognition ; Graphics

Publish: 2024-07-18 17:56:30 UTC

#8 Exploring Facial Biomarkers for Depression through Temporal Analysis of Action Units

Authors: Aditya Parikh ; Misha Sadeghi ; Bjorn Eskofier

Depression is a widespread mental disorder characterized by persistent sadness and loss of interest that significantly impairs daily functioning. Traditional diagnostic methods rely on subjective assessments, necessitating objective approaches for accurate diagnosis. Our study investigates the use of facial action units (AUs) and emotions as biomarkers for depression. We analyzed facial expressions from video data of participants classified with or without depression. Our methodology involved detailed feature extraction, mean intensity comparisons of key AUs, and the application of time series classification models. Furthermore, we employed Principal Component Analysis (PCA) and various clustering algorithms to explore the variability in emotional expression patterns. Results indicate significant differences in the intensities of AUs associated with sadness and happiness between the groups, highlighting the potential of facial analysis in depression assessment.

Subject: Computer Vision and Pattern Recognition

Publish: 2024-07-18 17:55:01 UTC

#9 LogoSticker: Inserting Logos into Diffusion Models for Customized Generation

Authors: Mingkang Zhu ; Xi Chen ; Zhongdao Wang ; Hengshuang Zhao ; Jiaya Jia

Recent advances in text-to-image model customization have underscored the importance of integrating new concepts with a few examples. Yet, this progress is largely confined to widely recognized subjects, which can be learned with relative ease through models' adequate shared prior knowledge. In contrast, logos, characterized by unique patterns and textual elements, are hard to establish shared knowledge for within diffusion models, thus presenting a unique challenge. To bridge this gap, we introduce the task of logo insertion. Our goal is to insert logo identities into diffusion models and enable their seamless synthesis in varied contexts. We present a novel two-phase pipeline, LogoSticker, to tackle this task. First, we propose the actor-critic relation pre-training algorithm, which addresses the nontrivial gaps in models' understanding of the potential spatial positioning of logos and interactions with other objects. Second, we propose a decoupled identity learning algorithm, which enables precise localization and identity extraction of logos. LogoSticker can generate logos accurately and harmoniously in diverse contexts. We comprehensively validate the effectiveness of LogoSticker over customization methods and large models such as DALLE 3. Project page.

Subject: Computer Vision and Pattern Recognition

Publish: 2024-07-18 17:54:49 UTC

#10 Pose-guided multi-task video transformer for driver action recognition

Authors: Ricardo Pizarro ; Roberto Valle ; Luis Miguel Bergasa ; José M. Buenaposada ; Luis Baumela

We investigate the task of identifying situations of distracted driving through analysis of in-car videos. To tackle this challenge we introduce a multi-task video transformer that predicts both distracted actions and driver pose. Leveraging VideoMAEv2, a large pre-trained architecture, our approach incorporates semantic information from human keypoint locations to enhance action recognition and decrease computational overhead by minimizing the number of spatio-temporal tokens. By guiding token selection with pose and class information, we notably reduce the model's computational requirements while preserving the baseline accuracy. Our model surpasses existing state-of-the-art results in driver action recognition while exhibiting superior efficiency compared to current video transformer-based approaches.

Subject: Computer Vision and Pattern Recognition

Publish: 2024-07-18 17:53:51 UTC

#11 General Geometry-aware Weakly Supervised 3D Object Detection

Authors: Guowen Zhang ; Junsong Fan ; Liyi Chen ; Zhaoxiang Zhang ; Zhen Lei ; Lei Zhang

3D object detection is an indispensable component for scene understanding. However, the annotation of large-scale 3D datasets requires significant human effort. To tackle this problem, many methods adopt weakly supervised 3D object detection that estimates 3D boxes by leveraging 2D boxes and scene/class-specific priors. However, these approaches generally depend on sophisticated manual priors, which are hard to generalize to novel categories and scenes. In this paper, we propose a general approach that can be easily adapted to new scenes and/or classes. A unified framework is developed for learning 3D object detectors from RGB images and associated 2D boxes. Specifically, we propose three general components: a prior injection module to obtain general object geometric priors from an LLM, a 2D space projection constraint to minimize the discrepancy between the boundaries of projected 3D boxes and their corresponding 2D boxes on the image plane, and a 3D space geometry constraint to build a Point-to-Box alignment loss to further refine the pose of estimated 3D boxes. Experiments on KITTI and SUN-RGBD datasets demonstrate that our method yields surprisingly high-quality 3D bounding boxes with only 2D annotation. The source code is available at
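The 2D projection constraint mentioned in this abstract can be sketched as follows. This is an illustration under stated assumptions rather than the paper's loss: it assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, box corners already expressed in camera coordinates with positive depth, and an L1 distance between box boundaries. The function names are hypothetical.

```python
def project_box_to_image(corners_3d, fx, fy, cx, cy):
    """Project 3D box corners and return the enclosing 2D box.

    corners_3d is a list of (x, y, z) points in camera coordinates, z > 0.
    The result is (u_min, v_min, u_max, v_max) in pixels.
    """
    us = [fx * x / z + cx for x, y, z in corners_3d]
    vs = [fy * y / z + cy for x, y, z in corners_3d]
    return (min(us), min(vs), max(us), max(vs))


def projection_discrepancy(corners_3d, box_2d, fx, fy, cx, cy):
    """L1 distance between the projected box boundaries and a 2D box."""
    projected = project_box_to_image(corners_3d, fx, fy, cx, cy)
    return sum(abs(p - b) for p, b in zip(projected, box_2d))
```

Minimizing this discrepancy with respect to the 3D box parameters pulls the projected box toward the annotated 2D box, which is how 2D annotations alone can supervise a 3D estimate.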

Subject: Computer Vision and Pattern Recognition

Publish: 2024-07-18 17:52:08 UTC

#12 MaRINeR: Enhancing Novel Views by Matching Rendered Images with Nearby References

Authors: Lukas Bösiger ; Mihai Dusmanu ; Marc Pollefeys ; Zuria Bauer

Rendering realistic images from 3D reconstruction is an essential task of many Computer Vision and Robotics pipelines, notably for mixed-reality applications as well as training autonomous agents in simulated environments. However, the quality of novel views heavily depends on the source reconstruction, which is often imperfect due to noisy or missing geometry and appearance. Inspired by the recent success of reference-based super-resolution networks, we propose MaRINeR, a refinement method that leverages information of a nearby mapping image to improve the rendering of a target viewpoint. We first establish matches between the raw rendered image of the scene geometry from the target viewpoint and the nearby reference based on deep features, followed by hierarchical detail transfer. We show improved renderings in quantitative metrics and qualitative examples from both explicit and implicit scene representations. We further employ our method on the downstream tasks of pseudo-ground-truth validation, synthetic data enhancement and detail recovery for renderings of reduced 3D reconstructions.

Subject: Computer Vision and Pattern Recognition

Publish: 2024-07-18 17:50:03 UTC

#13 HazeCLIP: Towards Language Guided Real-World Image Dehazing

Authors: Ruiyi Wang ; Wenhao Li ; Xiaohong Liu ; Chunyi Li ; Zicheng Zhang ; Xiongkuo Min ; Guangtao Zhai

Existing methods have achieved remarkable performance in single image dehazing, particularly on synthetic datasets. However, they often struggle with real-world hazy images due to domain shift, limiting their practical applicability. This paper introduces HazeCLIP, a language-guided adaptation framework designed to enhance the real-world performance of pre-trained dehazing networks. Inspired by the Contrastive Language-Image Pre-training (CLIP) model's ability to distinguish between hazy and clean images, we utilize it to evaluate dehazing results. Combined with a region-specific dehazing technique and tailored prompt sets, the CLIP model accurately identifies hazy areas, providing a high-quality, human-like prior that guides the fine-tuning process of pre-trained networks. Extensive experiments demonstrate that HazeCLIP achieves state-of-the-art performance in real-world image dehazing, evaluated through both visual quality and no-reference quality assessments. The code is available.

Subject: Computer Vision and Pattern Recognition

Publish: 2024-07-18 17:18:25 UTC

#14 Attention Based Simple Primitives for Open World Compositional Zero-Shot Learning

Authors: Ans Munir ; Faisal Z. Qureshi ; Muhammad Haris Khan ; Mohsen Ali

Compositional Zero-Shot Learning (CZSL) aims to predict unknown compositions made up of attribute and object pairs. Predicting compositions unseen during training is a challenging task. We are exploring Open World Compositional Zero-Shot Learning (OW-CZSL) in this study, where our test space encompasses all potential combinations of attributes and objects. Our approach involves utilizing the self-attention mechanism between attributes and objects to achieve better generalization from seen to unseen compositions. Utilizing a self-attention mechanism facilitates the model's ability to identify relationships between attributes and objects. The similarity between the self-attended textual and visual features is subsequently calculated to generate predictions during the inference phase. The potential test space may encompass implausible object-attribute combinations arising from unrestricted attribute-object pairings. To mitigate this issue, we leverage external knowledge from ConceptNet to restrict the test space to realistic compositions. Our proposed model, Attention-based Simple Primitives (ASP), demonstrates competitive performance, achieving results comparable to the state-of-the-art.

Subjects: Computer Vision and Pattern Recognition ; Machine Learning

Publish: 2024-07-18 17:11:29 UTC

#15 Are We Ready for Out-of-Distribution Detection in Digital Pathology?

Authors: Ji-Hun Oh ; Kianoush Falahkheirkhah ; Rohit Bhargava

The detection of semantic and covariate out-of-distribution (OOD) examples is a critical yet overlooked challenge in digital pathology (DP). Recently, substantial insight and methods on OOD detection were presented by the ML community, but how do they fare in DP applications? To this end, we establish a benchmark study, our highlights being: 1) the adoption of proper evaluation protocols, 2) the comparison of diverse detectors in both a single and multi-model setting, and 3) the exploration into advanced ML settings like transfer learning (ImageNet vs. DP pre-training) and choice of architecture (CNNs vs. transformers). Through our comprehensive experiments, we contribute new insights and guidelines, paving the way for future research and discussion.

Subjects: Computer Vision and Pattern Recognition ; Machine Learning

Publish: 2024-07-18 17:07:32 UTC

#16 Cross-Task Attack: A Self-Supervision Generative Framework Based on Attention Shift

Authors: Qingyuan Zeng ; Yunpeng Gong ; Min Jiang

Studying adversarial attacks on artificial intelligence (AI) systems helps discover model shortcomings, enabling the construction of a more robust system. Most existing adversarial attack methods only concentrate on single-task single-model or single-task cross-model scenarios, overlooking the multi-task characteristic of artificial intelligence systems. As a result, most of the existing attacks do not pose a practical threat to a comprehensive and collaborative AI system. However, implementing cross-task attacks is highly demanding and challenging due to the difficulty in obtaining the real labels of different tasks for the same picture and harmonizing the loss functions across different tasks. To address this issue, we propose a self-supervised Cross-Task Attack framework (CTA), which utilizes co-attention and anti-attention maps to generate cross-task adversarial perturbation. Specifically, the co-attention map reflects the area to which different visual task models pay attention, while the anti-attention map reflects the area that different visual task models neglect. CTA generates cross-task perturbations by shifting the attention area of samples away from the co-attention map and closer to the anti-attention map. We conduct extensive experiments on multiple vision tasks and the experimental results confirm the effectiveness of the proposed design for adversarial attacks.

Subjects: Computer Vision and Pattern Recognition ; Artificial Intelligence

Publish: 2024-07-18 17:01:10 UTC

#17 HPix: Generating Vector Maps from Satellite Images

Authors: Aditya Taparia ; Keshab Nath

Vector maps find widespread utility across diverse domains due to their capacity to not only store but also represent discrete data boundaries such as building footprints, disaster impact analysis, digitization, urban planning, location points, transport links, and more. Although extensive research exists on identifying building footprints and road types from satellite imagery, the generation of vector maps from such imagery remains an area with limited exploration. Furthermore, conventional map generation techniques rely on labor-intensive manual feature extraction or rule-based approaches, which impose inherent limitations. To surmount these limitations, we propose a novel method called HPix, which utilizes modified Generative Adversarial Networks (GANs) to generate vector tile maps from satellite images. HPix incorporates two hierarchical frameworks: one operating at the global level and the other at the local level, resulting in a comprehensive model. Through empirical evaluations, our proposed approach showcases its effectiveness in producing highly accurate and visually captivating vector tile maps derived from satellite images. We further extend our study's application to include mapping of road intersections and building footprints cluster based on their area.

Subjects: Computer Vision and Pattern Recognition ; Artificial Intelligence ; Image and Video Processing

Publish: 2024-07-18 16:54:02 UTC

#18 Media Insights Engine for Advanced Media Analysis: A Case Study of a Computer Vision Innovation for Pet Health Diagnosis

Author: Anjanava Biswas

This paper presents a case study of how Petco, a leading pet retailer, innovated their pet health analysis processes using the Media Insights Engine to reduce the time to first diagnosis. The company leveraged this framework to build custom applications for advanced computer vision tasks, such as identifying potential health issues in pet videos and images, and validating AI outcomes with pre-built veterinary diagnoses. The Media Insights Engine provides a modular and extensible solution that enabled Petco to quickly build machine learning applications for media workloads. By utilizing this framework, Petco was able to accelerate their project development, improve the efficiency of their pet health analysis, and ultimately reduce the time to first diagnosis for pet health issues. This paper discusses the challenges of pet health analysis using media, the benefits of using the Media Insights Engine, and the architecture of Petco's custom applications built using this framework.

Subject: Computer Vision and Pattern Recognition

Publish: 2024-05-28 16:48:06 UTC

#19 PASTA: Controllable Part-Aware Shape Generation with Autoregressive Transformers

Authors: Songlin Li ; Despoina Paschalidou ; Leonidas Guibas

The increased demand for tools that automate the 3D content creation process led to tremendous progress in deep generative models that can generate diverse 3D objects of high fidelity. In this paper, we present PASTA, an autoregressive transformer architecture for generating high quality 3D shapes. PASTA comprises two main components: an autoregressive transformer that generates objects as a sequence of cuboidal primitives, and a blending network, implemented with a transformer decoder, that composes the sequences of cuboids and synthesizes high quality meshes for each object. Our model is trained in two stages: first, we train our autoregressive generative model using only annotated cuboidal parts as supervision, and next, we train our blending network using explicit 3D supervision in the form of watertight meshes. Evaluations on various ShapeNet objects showcase the ability of our model to perform shape generation from diverse inputs, e.g., from scratch, from a partial object, from text and images, as well as size-guided generation, by explicitly conditioning on a bounding box that defines the object's boundaries. Moreover, as our model considers the underlying part-based structure of a 3D object, we are able to select a specific part and produce shapes with meaningful variations of this part. As evidenced by our experiments, our model generates 3D shapes that are both more realistic and diverse than existing part-based and non part-based methods, while at the same time being simpler to implement and train.

Subjects: Computer Vision and Pattern Recognition ; Artificial Intelligence ; Graphics ; Machine Learning

Publish: 2024-07-18 16:52:45 UTC

#20 MeshSegmenter: Zero-Shot Mesh Semantic Segmentation via Texture Synthesis

Authors: Ziming Zhong ; Yanxu Xu ; Jing Li ; Jiale Xu ; Zhengxin Li ; Chaohui Yu ; Shenghua Gao

We present MeshSegmenter, a simple yet effective framework designed for zero-shot 3D semantic segmentation. This model successfully extends the powerful capabilities of 2D segmentation models to 3D meshes, delivering accurate 3D segmentation across diverse meshes and segment descriptions. Specifically, our model leverages the Segment Anything Model (SAM) to segment the target regions from images rendered from the 3D shape. In light of the importance of the texture for segmentation, we also leverage the pretrained stable diffusion model to generate images with textures from the 3D shape, and leverage SAM to segment the target regions from images with textures. Textures supplement the shape for segmentation and facilitate accurate 3D segmentation even in geometrically non-prominent areas, such as segmenting a car door within a car mesh. To achieve the 3D segments, we render 2D images from different views and conduct segmentation for both textured and untextured images. Lastly, we develop a multi-view revoting scheme that integrates 2D segmentation results and confidence scores from various views onto the 3D mesh, ensuring the 3D consistency of segmentation results and eliminating inaccuracies from specific perspectives. Through these innovations, MeshSegmenter offers stable and reliable 3D segmentation results both quantitatively and qualitatively, highlighting its potential as a transformative tool in the field of 3D zero-shot segmentation. The code is available.
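One simple way to fuse per-view 2D segmentation results onto mesh faces is confidence-weighted voting, sketched below. The abstract does not detail the revoting scheme, so this is a generic stand-in, not the paper's algorithm. It assumes each view has already been mapped to the faces it sees, with a binary prediction and a confidence score per visible face; the function name `vote_face_labels` and the `threshold` parameter are hypothetical.

```python
from collections import defaultdict


def vote_face_labels(view_predictions, threshold=0.5):
    """Fuse per-view face predictions with confidence-weighted voting.

    view_predictions is a list with one dict per rendered view, mapping
    face_id -> (is_target, confidence) for faces visible in that view.
    Returns face_id -> bool for every face seen in at least one view.
    """
    positive = defaultdict(float)  # confidence mass voting "target"
    total = defaultdict(float)     # confidence mass from all votes
    for view in view_predictions:
        for face_id, (is_target, confidence) in view.items():
            total[face_id] += confidence
            if is_target:
                positive[face_id] += confidence
    return {
        face_id: total[face_id] > 0
        and positive[face_id] / total[face_id] >= threshold
        for face_id in total
    }
```

A face mislabeled in one low-confidence view is outvoted by confident views that see it correctly, which is the kind of cross-view consistency the abstract describes.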

Subject: Computer Vision and Pattern Recognition

Publish: 2024-07-18 16:50:59 UTC

#21 Beyond Dropout: Robust Convolutional Neural Networks Based on Local Feature Masking

Authors: Yunpeng Gong ; Chuangliang Zhang ; Yongjie Hou ; Lifei Chen ; Min Jiang

In the contemporary era of deep learning, where models often grapple with the challenge of simultaneously achieving robustness against adversarial attacks and strong generalization capabilities, this study introduces an innovative Local Feature Masking (LFM) strategy aimed at fortifying the performance of Convolutional Neural Networks (CNNs) on both fronts. During the training phase, we strategically incorporate random feature masking in the shallow layers of CNNs, effectively alleviating overfitting issues, thereby enhancing the model's generalization ability and bolstering its resilience to adversarial attacks. LFM compels the network to adapt by leveraging remaining features to compensate for the absence of certain semantic features, nurturing a more elastic feature learning mechanism. The efficacy of LFM is substantiated through a series of quantitative and qualitative assessments, collectively showcasing a consistent and significant improvement in CNNs' generalization ability and resistance against adversarial attacks, a phenomenon not observed in current and prior methodologies. The seamless integration of LFM into established CNN frameworks underscores its potential to advance both generalization and adversarial robustness within the deep learning paradigm. Through comprehensive experiments, including robust person re-identification baseline generalization experiments and adversarial attack experiments, we demonstrate the substantial enhancements offered by LFM in addressing the aforementioned challenges. This contribution represents a noteworthy stride in advancing robust neural network architectures.

Subject: Computer Vision and Pattern Recognition

Publish: 2024-07-18 16:25:16 UTC

#22 Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models

Authors: Xiaoyu Zhu ; Hao Zhou ; Pengfei Xing ; Long Zhao ; Hao Xu ; Junwei Liang ; Alexander Hauptmann ; Ting Liu ; Andrew Gallagher

In this paper, we investigate the use of diffusion models which are pre-trained on large-scale image-caption pairs for open-vocabulary 3D semantic understanding. We propose a novel method, namely Diff2Scene, which leverages frozen representations from text-image generative models, along with salient-aware and geometric-aware masks, for open-vocabulary 3D semantic segmentation and visual grounding tasks. Diff2Scene gets rid of any labeled 3D data and effectively identifies objects, appearances, materials, locations and their compositions in 3D scenes. We show that it outperforms competitive baselines and achieves significant improvements over state-of-the-art methods. In particular, Diff2Scene improves the state-of-the-art method on ScanNet200 by 12%.

Subject: Computer Vision and Pattern Recognition

Publish: 2024-07-18 16:20:56 UTC

#23 Beyond Augmentation: Empowering Model Robustness under Extreme Capture Environments

Authors: Yunpeng Gong ; Yongjie Hou ; Chuangliang Zhang ; Min Jiang

Person Re-identification (re-ID) in computer vision aims to recognize and track individuals across different cameras. While previous research has mainly focused on challenges like pose variations and lighting changes, the impact of extreme capture conditions is often not adequately addressed. These extreme conditions, including varied lighting, camera styles, angles, and image distortions, can significantly affect data distribution and re-ID accuracy. Current research typically improves model generalization under normal shooting conditions through data augmentation techniques such as adjusting brightness and contrast. However, these methods pay less attention to the robustness of models under extreme shooting conditions. To tackle this, we propose a multi-mode synchronization learning (MMSL) strategy. This approach divides images into grids, randomly selects grid blocks, and applies data augmentation methods such as contrast and brightness adjustments to them. This process introduces diverse transformations without altering the original image structure, helping the model adapt to extreme variations. The method improves the model's generalization under extreme conditions and enables it to learn diverse features, thus better addressing the challenges in re-ID. Extensive experiments on a simulated test set under extreme conditions demonstrate the effectiveness of our method. This approach is crucial for enhancing model robustness and adaptability in real-world scenarios, supporting the future development of person re-identification technology.
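The grid-based procedure described in this abstract (divide the image into a grid, randomly pick blocks, adjust contrast and brightness inside them) can be sketched in a few lines of Python. This is only a hedged illustration on a grayscale image stored as a list of rows. The function name `grid_block_augment`, the 4 x 4 default grid, the selection probability, and the contrast and brightness ranges are assumptions chosen for the sketch and do not come from the paper.

```python
import random


def grid_block_augment(image, grid=(4, 4), select_prob=0.5, rng=None):
    """Apply a random contrast/brightness change to randomly chosen grid blocks.

    Assumption: the linear adjustment `contrast * value + brightness` and the
    sampling ranges below are illustrative, not the paper's settings.
    """
    if rng is None:
        rng = random.Random()
    height, width = len(image), len(image[0])
    rows, cols = grid
    out = [row[:] for row in image]  # leave the input image unchanged
    chosen = []
    for r in range(rows):
        for c in range(cols):
            if rng.random() >= select_prob:
                continue  # this block keeps its original pixels
            # Integer block bounds that together tile the whole image.
            y0, y1 = r * height // rows, (r + 1) * height // rows
            x0, x1 = c * width // cols, (c + 1) * width // cols
            contrast = rng.uniform(0.5, 1.5)
            brightness = rng.uniform(-40.0, 40.0)
            for y in range(y0, y1):
                for x in range(x0, x1):
                    value = contrast * image[y][x] + brightness
                    out[y][x] = min(255.0, max(0.0, value))  # clip to [0, 255]
            chosen.append((r, c))
    return out, chosen


# Toy 8 x 8 grayscale image with values in [0, 255].
image = [[float((x * 16 + y * 8) % 256) for x in range(8)] for y in range(8)]
augmented, blocks = grid_block_augment(image, grid=(4, 4), select_prob=0.5)
```

Because each block is modified in place on the grid, pixel positions never move, which matches the abstract's point that the transformations do not alter the original image structure.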

Subject: Computer Vision and Pattern Recognition

Publish: 2024-07-18 16:18:58 UTC

#24 Data Alchemy: Mitigating Cross-Site Model Variability Through Test Time Data Calibration

Authors: Abhijeet Parida ; Antonia Alomar ; Zhifan Jiang ; Pooneh Roshanitabrizi ; Austin Tapp ; Maria Ledesma-Carbayo ; Ziyue Xu ; Syed Muhammed Anwar ; Marius George Linguraru ; Holger R. Roth

Deploying deep learning-based imaging tools across various clinical sites poses significant challenges due to inherent domain shifts and regulatory hurdles associated with site-specific fine-tuning. For histopathology, stain normalization techniques can mitigate discrepancies, but they often fall short of eliminating inter-site variations. Therefore, we present Data Alchemy, an explainable stain normalization method combined with test time data calibration via a template learning framework to overcome barriers in cross-site analysis. Data Alchemy handles shifts inherent to multi-site data and minimizes them without needing to change the weights of the normalization or classifier networks. Our approach extends to unseen sites in various clinical settings where data domain discrepancies are unknown. Extensive experiments highlight the efficacy of our framework in tumor classification in hematoxylin and eosin-stained patches. Our explainable normalization method boosts the area under the precision-recall curve (AUPR) for classification by 0.165, from 0.545 to 0.710. Data Alchemy's test time calibration further reduces the multisite classification domain gap, improving AUPR by an additional 0.142, from 0.710 to 0.852. Our Data Alchemy framework can popularize precision medicine with minimal operational overhead by allowing for the seamless integration of pre-trained deep learning-based clinical tools across multiple sites.
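The AUPR figures quoted in this abstract chain two gains on top of one baseline. The short Python check below, with variable names chosen only for illustration, confirms that the stated numbers are mutually consistent and shows the combined gain they imply.

```python
# Values as stated in the abstract.
baseline_aupr = 0.545
normalization_gain = 0.165  # from the explainable stain normalization
calibration_gain = 0.142    # from the additional test time data calibration

# Round to three decimals to avoid floating-point noise in the sums.
after_normalization = round(baseline_aupr + normalization_gain, 3)
after_calibration = round(after_normalization + calibration_gain, 3)
total_gain = round(after_calibration - baseline_aupr, 3)

print(after_normalization, after_calibration, total_gain)
```

The two steps give 0.710 and then 0.852, so the combined improvement over the 0.545 baseline is 0.307 AUPR.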

Subjects: Computer Vision and Pattern Recognition ; Machine Learning ; Image and Video Processing

Publish: 2024-07-18 16:03:59 UTC

#25 Training-free Composite Scene Generation for Layout-to-Image Synthesis

Authors: Jiaqi Liu ; Tao Huang ; Chang Xu

Recent breakthroughs in text-to-image diffusion models have significantly advanced the generation of high-fidelity, photo-realistic images from textual descriptions. Yet, these models often struggle with interpreting spatial arrangements from text, hindering their ability to produce images with precise spatial configurations. To bridge this gap, layout-to-image generation has emerged as a promising direction. However, training-based approaches are limited by the need for extensively annotated datasets, leading to high data acquisition costs and a constrained conceptual scope. Conversely, training-free methods face challenges in accurately locating and generating semantically similar objects within complex compositions. This paper introduces a novel training-free approach designed to overcome adversarial semantic intersections during the diffusion conditioning phase. By refining intra-token loss with selective sampling and enhancing the diffusion process with attention redistribution, we propose two innovative constraints: 1) an inter-token constraint that resolves token conflicts to ensure accurate concept synthesis; and 2) a self-attention constraint that improves pixel-to-pixel relationships. Our evaluations confirm the effectiveness of leveraging layout information for guiding the diffusion process, generating content-rich images with enhanced fidelity and complexity. Code is available at

Subjects: Computer Vision and Pattern Recognition ; Artificial Intelligence

Publish: 2024-07-18 15:48:07 UTC