AAAI.2025 - Computer Vision

| Total: 1044

#1 HSRDiff: A Hierarchical Self-Regulation Diffusion Model for Stochastic Semantic Segmentation [PDF⁵¹] [Copy] [Kimi¹¹] [REL]

Authors: Han Yang, Chuanguang Yang, Zhulin An, Libo Huang, Yongjun Xu

In safety-critical domains such as medical diagnostics and autonomous driving, single-image evidence is sometimes insufficient to reflect the inherent ambiguity of vision problems. Therefore, multiple plausible assumptions that match the image semantics may be needed to reflect the actual distribution of targets and support downstream tasks. However, balancing and improving the diversity and consistency of segmentation predictions under the high-dimensional output spaces and potential multimodal distributions is still challenging. This paper presents Hierarchical Self-Regulation Diffusion (HSRDiff), a unified framework that simulates joint probability distribution over entire labels. Our model self-regulates the balance between the two modes of predicting the label and noise in a novel ``differentiation to unification" pipeline and dynamically fits the optimal path to model the aleatoric uncertainty rooted in observations. In addition, we preserve the high-fidelity reconstruction of the delicate structure in images by leveraging the hierarchical multi-scale condition priors. We validate HSRDiff in three different semantic scenarios. Experimental results show that HSRDiff is superior to the comparison method with a considerable performance gap.

Subject: AAAI.2025 - Computer Vision

#2 GRICP: Granular-Ball Iterative Closest Point with Multikernel Correntropy for Point Cloud Fine Registration [PDF²⁷] [Copy] [Kimi⁴] [REL]

Authors: Yihao, Limei Hu, Feng Chen, Sen Zhao, Shukai Duan

The Iterative Closest Point (ICP) algorithm suffers from sensitivity to outliers and tendency to local optima in point cloud fine registration. In this paper, we introduce a global and robust ICP framework called Granular-Ball Iterative Closest Point with MultiKernel Correntropy (GRICP). This approach transforms the point cloud into a granular ball cloud and employs MultiKernel Correntropy (MKC) as the loss function, which is designed to smooth out the effects of noise points and provide global information for registration. Specifically, we propose a coarse-grained representation of the point cloud using the granular ball model, which adaptively captures the coarse-grained features of the data and converts the point cloud into a multi-granularity ball cloud. The normal points within each granular ball help mitigate the influence of noise points. To ensure that ICP finds the globally optimal transformation, MKC is introduced to measure the distribution of registration errors, thereby offering global insights for ICP to achieve the optimal solution. The transformations based on MKC and the granular ball cloud are then derived. Extensive experiments on both simulated and real-world datasets demonstrate that GRICP delivers superior registration performance, particularly in scenarios involving large rotation offsets, partial overlaps, and Gaussian noise.

Subject: AAAI.2025 - Computer Vision

#3 AQUAFace: Age-Invariant Quality Adaptive Face Recognition for Unconstrained Selfie vs ID Verification [PDF¹⁵] [Copy] [Kimi³] [REL]

Authors: Shivang Agarwal, Jyoti Chaudhary, Sadiq Siraj Ebrahim, Mayank Vatsa, Richa Singh, Shyam Prasad Adhikari, Sangeeth Reddy Battu

Face recognition in the presence of age and quality variations poses a formidable challenge. While recent margin-based loss functions have shown promise in addressing these variations individually, real-world scenarios such as selfie versus ID face matching often involve simultaneous variations of both age and quality. In response, we propose a comprehensive framework aimed at mitigating the impact of these variations while preserving vital identity-related information crucial for accurate face recognition. The proposed adaptive margin-based loss function AQUAFace exhibits adaptiveness towards hard samples characterized by significant age and quality variations. This loss function is meticulously designed to prioritize the preservation of identity-related features while simultaneously mitigating the adverse effects of age and quality variations on face recognition accuracy. To validate the effectiveness of our approach, we focus on the specific task of selfie versus ID document matching. Our results demonstrate that AQUAFace effectively handles age and quality differences, leading to enhanced recognition performance. Additionally, we explore the benefits of fine-tuning the recognition model with synthetic data, further boosting performance. As a result, our proposed model, AQUAFace, achieves state-of-the-art performance on six benchmark datasets (CALFW, CPLFW, CFP-FP, AgeDB, IJB-C, and TinyFace), each exhibiting diverse age and quality variations.

Subject: AAAI.2025 - Computer Vision

#4 ISR-DPO: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective DPO [PDF¹²] [Copy] [Kimi⁴] [REL]

Authors: Daechul Ahn, Yura Choi, San Kim, Youngjae Yu, Dongyeop Kang, Jonghyun Choi

Iterative self-improvement, a concept extending beyond personal growth, has found powerful applications in machine learning, particularly in transforming weak models into strong ones. While recent advances in natural language processing have shown its efficacy through iterative preference optimization, applying this approach to Video Large Multimodal Models (VLMMs) remains challenging due to modality misalignment. VLMMs struggle with this misalignment during iterative preference modeling, as the self-judge model often prioritizes linguistic knowledge over visual information. Additionally, iterative preference optimization can lead to visually hallucinated verbose responses due to length bias within the self-rewarding cycle. To address these issues, we propose Iterative Self-Retrospective Direct Preference Optimization (ISR-DPO), a method that uses self-retrospection to enhance preference modeling. This approach enhances the self-judge’s focus on informative video regions, resulting in more visually grounded preferences. In extensive empirical evaluations across diverse video question answering benchmarks, the ISR-DPO significantly outperforms the state of the art. We are committed to open-sourcing our code, models, and datasets to encourage further investigation.

Subject: AAAI.2025 - Computer Vision

#5 EMPLACE: Self-Supervised Urban Scene Change Detection [PDF¹⁴] [Copy] [Kimi³] [REL]

Authors: Tim Alpherts, Sennay Ghebreab, Nanne van Noord

Urban change is a constant process that influences the perception of neighbourhoods and the lives of the people within them. The field of Urban Scene Change Detection (USCD) aims to capture changes in street scenes using computer vision and can help raise awareness of changes that make it possible to better understand the city and its residents. Traditionally, the field of USCD has used supervised methods with small scale datasets. This constrains methods when applied to new cities, as it requires labour-intensive labeling processes and forces a priori definitions of relevant change. In this paper we introduce AC-1M the largest USCD dataset by far of over 1.1M images, together with EMPLACE, a self-supervising method to train a Vision Transformer using our adaptive triplet loss. We show EMPLACE outperforms SOTA methods both as a pre-training method for linear fine-tuning as well as a zero-shot setting. Lastly, in a case study of Amsterdam, we show that we are able to detect both small and large changes throughout the city and that changes uncovered by EMPLACE, depending on size, correlate with housing prices - which in turn is indicative of inequity.

Subject: AAAI.2025 - Computer Vision

#6 AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation [PDF¹³] [Copy] [Kimi²] [REL]

Authors: Jingkun An, Yinghao Zhu, Zongjian Li, Enshen Zhou, Haoran Feng, Xijie Huang, Bohua Chen, Yemin Shi, Chengwei Pan

Text-to-Image (T2I) diffusion models have achieved remarkable success in image generation. Despite their progress, challenges remain in both prompt-following ability, image quality and lack of high-quality datasets, which are essential for refining these models. As acquiring labeled data is costly, we introduce AGFSync, a framework that enhances T2I diffusion models through Direct Preference Optimization (DPO) in a fully AI-driven approach. AGFSync utilizes Vision-Language Models (VLM) to assess image quality across style, coherence, and aesthetics, generating feedback data within an AI-driven loop. By applying AGFSync to leading T2I models such as SD v1.4, v1.5, and SDXL-base, our extensive experiments on the TIFA dataset demonstrate notable improvements in VQA scores, aesthetic evaluations, and performance on the HPS v2 benchmark, consistently outperforming the base models. AGFSync's method of refining T2I diffusion models paves the way for scalable alignment techniques.

Subject: AAAI.2025 - Computer Vision

#7 Pre-training a Density-Aware Pose Transformer for Robust LiDAR-based 3D Human Pose Estimation [PDF¹⁶] [Copy] [Kimi] [REL]

Authors: Xiaoqi An, Lin Zhao, Chen Gong, Jun Li, Jian Yang

With the rapid development of autonomous driving, LiDAR-based 3D Human Pose Estimation (3D HPE) is becoming a research focus. However, due to the noise and sparsity of LiDAR-captured point clouds, robust human pose estimation remains challenging. Most of the existing methods use temporal information, multi-modal fusion, or SMPL optimization to correct biased results. In this work, we try to obtain sufficient information for 3D HPE only by modeling the intrinsic properties of low-quality point clouds. Hence, a simple yet powerful method is proposed, which provides insights both on modeling and augmentation of point clouds. Specifically, we first propose a concise and effective density-aware pose transformer (DAPT) to get stable keypoint representations. By using a set of joint anchors and a carefully designed exchange module, valid information is extracted from point clouds with different densities. Then 1D heatmaps are utilized to represent the precise locations of the keypoints. Secondly, a comprehensive LiDAR human synthesis and augmentation method is proposed to pre-train the model, enabling it to acquire a better human body prior. We increase the diversity of point clouds by randomly sampling human positions and orientations and by simulating occlusions through the addition of laser-level masks. Extensive experiments have been conducted on multiple datasets, including IMU-annotated LidarHuman26M, SLOPER4D, and manually annotated Waymo Open Dataset v2.0 (Waymo), HumanM3. Our method demonstrates SOTA performance in all scenarios. In particular, compared with LPFormer on Waymo, we reduce the average MPJPE by 10.0mm. Compared with PRN on SLOPER4D, we notably reduce the average MPJPE by 20.7mm.

Subject: AAAI.2025 - Computer Vision

#8 CA-MLIF: Cross-Attention and Multimodal Low-Rank Interaction Fusion Framework for Tumor Prognostic Prediction [PDF¹²] [Copy] [Kimi] [REL]

Authors: Yajun An, Jiale Chen, Huan Lin, Zhenbing Liu, Siyang Feng, Hualong Zhang, Rushi Lan, Zaiyi Liu, Xipeng Pan

Cancer is a leading cause of death worldwide due to its aggressive nature and complex variability. Accurate prognosis is therefore challenging but essential for guiding personalized treatment and follow-up. Previous research often relied on single data sources, missing the opportunity to combine various types of patient information for more comprehensive survival predictions. To address these challenges, we propose a two-stage fusion method named Cross-Attention and Multimodal Low-Rank Interaction Fusion Framework (CA-MLIF). In the first stage, we propose a CA mechanism for real-time feature updates and cross-modal mutual learning to capture rich semantic information. In the second stage, we design a novel multimodal low-rank interaction fusion method for survival prediction. Specifically, we present modal attention mechanism (MAM) for feature filtration, low-rank multimodal fusion (LMF) for model complexity reduction, and optimal weight concatenation (OWC) for maximizing feature integration. Extensive experiments on two public datasets TCGA-GBMLGG and TCGA-KIRC, as well as a multi-center in-house lung adenocarcinoma (LUAD) dataset validate the effectiveness of CA-MLIF, which demonstrate that our method outperforms existing approaches in survival prediction under both pathology-gene fusion and CT-pathology fusion scenarios.

Subject: AAAI.2025 - Computer Vision

#9 HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models [PDF⁶] [Copy] [Kimi³] [REL]

Authors: Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S. Nikolopoulos, Hans Vandierendonck, Deepu John, Bo Ji

High-resolution Vision-Language Models (VLMs) are widely used in multimodal tasks to enhance accuracy by preserving detailed image information. However, these models often generate an excessive number of visual tokens due to the need to encode multiple partitions of a high-resolution image input. Processing such a large number of visual tokens poses significant computational challenges, particularly for resource-constrained commodity GPUs. To address this challenge, we propose High-Resolution Early Dropping (HiRED), a plug-and-play token-dropping method designed to operate within a fixed token budget. HiRED leverages the attention of CLS token in the vision transformer (ViT) to assess the visual content of the image partitions and allocate an optimal token budget for each partition accordingly. The most informative visual tokens from each partition within the allocated budget are then selected and passed to the subsequent Large Language Model (LLM). We showed that HiRED achieves superior accuracy and performance, compared to existing token-dropping methods. Empirically, HiRED-20% (i.e., a 20% token budget) on LLaVA-Next-7B achieves a 4.7x increase in token generation throughput, reduces response latency by 78%, and saves 14% of GPU memory for single inference on an NVIDIA TESLA P40 (24 GB). For larger batch sizes (e.g., 4), HiRED-20% prevents out-of-memory errors by cutting memory usage by 30%, while preserving throughput and latency benefits.

Subject: AAAI.2025 - Computer Vision

#10 Multi-View Pedestrian Occupancy Prediction with a Novel Synthetic Dataset [PDF⁴] [Copy] [Kimi] [REL]

Authors: Sithu Aung, Min-Cheol Sagong, Junghyun Cho

We address an advanced challenge of predicting pedestrian occupancy as an extension of multi-view pedestrian detection in urban traffic. To support this, we have created a new synthetic dataset called MVP-Occ, designed for dense pedestrian scenarios in large-scale scenes. Our dataset provides detailed representations of pedestrians using voxel structures, accompanied by rich semantic scene understanding labels, facilitating visual navigation and insights into pedestrian spatial information. Furthermore, we present a robust baseline model, termed OmniOcc, capable of predicting both the voxel occupancy state and panoptic labels for the entire scene from multi-view images. Through in-depth analysis, we identify and evaluate the key elements of our proposed model, highlighting their specific contributions and importance.

Subject: AAAI.2025 - Computer Vision

#11 ProtoArgNet: Interpretable Image Classification with Super-Prototypes and Argumentation [PDF⁵] [Copy] [Kimi²] [REL]

Authors: Hamed Ayoobi, Nico Potyka, Francesca Toni

We propose ProtoArgNet, a novel interpretable deep neural architecture for image classification in the spirit of prototypical-part-learning as found, e.g., in ProtoPNet. While earlier approaches associate every class with multiple prototypical-parts, ProtoArgNet uses super-prototypes that combine prototypical-parts into a unified class representation. This is done by combining local activations of prototypes in an MLP-like manner, enabling the localization of prototypes and learning (non-linear) spatial relationships among them. By leveraging a form of argumentation, ProtoArgNet is capable of providing both supporting (i.e. `this looks like that') and attacking (i.e. `this differs from that') explanations. We demonstrate on several datasets that ProtoArgNet outperforms state-of-the-art prototypical-part-learning approaches. Moreover, the argumentation component in ProtoArgNet is customisable to the user's cognitive requirements by a process of sparsification, which leads to more compact explanations compared to state-of-the-art approaches.

Subject: AAAI.2025 - Computer Vision

#12 Can Generative Models Improve Self-Supervised Representation Learning? [PDF³] [Copy] [Kimi²] [REL]

Authors: Sana Ayromlou, Vahid Reza Khazaie, Fereshteh Forghani, Arash Afkanpour

The rapid advancement in self-supervised representation learning has highlighted its potential to leverage unlabeled data for learning rich visual representations. However, the existing techniques, particularly those employing different augmentations of the same image, often rely on a limited set of simple transformations that cannot fully capture variations in the real world. This constrains the diversity and quality of samples, which leads to sub-optimal representations. In this paper, we introduce a framework that enriches the self-supervised learning (SSL) paradigm by utilizing generative models to produce semantically consistent image augmentations. By directly conditioning generative models on a source image, our method enables the generation of diverse augmentations while maintaining the semantics of the source image, thus offering a richer set of data for SSL. Our extensive experimental results on various joint-embedding SSL techniques demonstrate that our framework significantly enhances the quality of learned visual representations by up to 10% Top-1 accuracy in downstream tasks. This research demonstrates that incorporating generative models into the joint-embedding SSL workflow opens new avenues for exploring the potential of synthetic data. This development paves the way for more robust and versatile representation learning techniques.

Subject: AAAI.2025 - Computer Vision

#13 The Master Key Filters Hypothesis: Deep Filters Are General [PDF⁴] [Copy] [Kimi¹] [REL]

Authors: Zahra Babaiee, Peyman M. Kiasari, Daniela Rus, Radu Grosu

This paper challenges the prevailing view that convolutional neural network (CNN) filters become increasingly specialized in deeper layers. Motivated by recent observations of clusterable repeating patterns in depthwise separable CNNs (DS-CNNs) trained on ImageNet, we extend this investigation across various domains and datasets. Our analysis of DS-CNNs reveals that deep filters maintain generality, contradicting the expected transition to class-specific features. We demonstrate the generalizability of these filters through transfer learning experiments, showing that frozen filters from models trained on different datasets perform well and can be further improved when sourced from larger, better-performing models. Our findings indicate that spatial features learned by depthwise separable convolutions remain generic across all layers, domains, and architectures. This research provides new insights into the nature of generalization in neural networks, particularly in DS-CNNs, and has significant implications for transfer learning and model design.

Subject: AAAI.2025 - Computer Vision

#14 Frozen Language Models Are Gradient Coherence Rectifiers in Vision Transformers [PDF⁹] [Copy] [Kimi³] [REL]

Authors: Lichen Bai, Zixuan Xiong, Hai Lin, Guangwei Xu, Xiangjin Xie, Ruijie Guo, Zhanhui Kang, Hai-Tao Zheng, Hong-Gee Kim

Large language models (LLMs) have demonstrated remarkable performance in multimodal tasks even with frozen LLM Block and only a few trainable parameters. However, the underlying mechanisms of how LLMs enhance multimodal performance remains unclear. In this work, we focus on the phenomenon that ``Merely concatenating a frozen LLM block to the Vision Transformer (ViT) encoder can yield significant performance enhancements. Moreover, the choice of LLM block and insertion position can have a substantial impact, leading to varying degrees of improvement''. We analyze the optimization of the training process from the perspective of gradient dynamics and find that frozen LLM blocks act as gradient coherence rectifiers, aligning the gradients of different samples more closely during training. Furthermore, we demonstrate that the representation similarity between the inserted LLM block and the adjacent ViT block influences performance, with greater similarity tending to yield larger positive gains. Through these findings, we can justify the selection of suitable LLM blocks to be inserted at appropriate positions, and introduce additional gradient backpropagation paths by incorporating LLM blocks, could improve the performance of vanilla ViT through the rectification effect of gradient consistency during the training process, without the need to add LLM blocks during inference. Our experiments demonstrate the effectiveness of this strategy, making the practical application of the gradient rectification effect feasible.

Subject: AAAI.2025 - Computer Vision

#15 Plug-and-Play Tri-Branch Invertible Block for Image Rescaling [PDF²] [Copy] [Kimi] [REL]

Authors: Jingwei Bao, Jinhua Hao, Pengcheng Xu, Ming Sun, Chao Zhou, Shuyuan Zhu

High-resolution (HR) images are commonly downscaled to low-resolution (LR) to reduce bandwidth, followed by upscaling to restore their original details. Recent advancements in image rescaling algorithms have employed invertible neural networks (INNs) to create a unified framework for downscaling and upscaling, ensuring a one-to-one mapping between LR and HR images. Traditional methods, utilizing dual-branch based vanilla invertible blocks, process high-frequency and low-frequency information separately, often relying on specific distributions to model high-frequency components. However, processing the low-frequency component directly in the RGB domain introduces channel redundancy, limiting the efficiency of image reconstruction. To address these challenges, we propose a plug-and-play tri-branch invertible block (T-InvBlocks) that decomposes the low- frequency branch into luminance (Y) and chrominance (CbCr) components, reducing redundancy and enhancing feature processing. Additionally, we adopt an all-zero mapping strategy for high-frequency components during upscaling, focusing essential rescaling information within the LR image. Our T-InvBlocks can be seamlessly integrated into existing rescaling models, improving performance in both general rescaling tasks and scenarios involving lossy compression. Extensive experiments confirm that our method advances the state of the art in HR image reconstruction.

Subject: AAAI.2025 - Computer Vision

#16 BEE: Metric-Adapted Explanations via Baseline Exploration-Exploitation [PDF²] [Copy] [Kimi¹] [REL]

Authors: Oren Barkan, Yehonatan Elisha, Jonathan Weill, Noam Koenigstein

Two prominent challenges in explainability research involve 1) the nuanced evaluation of explanations and 2) the modeling of missing information through baseline representations. The existing literature introduces diverse evaluation metrics, each scrutinizing the quality of explanations through distinct lenses. Additionally, various baseline representations have been proposed, each modeling the notion of missingness differently. Yet, a consensus on the ultimate evaluation metric and baseline representation remains elusive. This work acknowledges the diversity in explanation metrics and baselines, demonstrating that different metrics exhibit preferences for distinct explanation maps resulting from the utilization of different baseline representations and distributions. To address the diversity in metrics and accommodate the variety of baseline representations in a unified manner, we propose Baseline Exploration-Exploitation (BEE) - a path-integration method that introduces randomness to the integration process by modeling the baseline as a learned random tensor. This tensor follows a learned mixture of baseline distributions optimized through a contextual exploration-exploitation procedure to enhance performance on the specific metric of interest. By resampling the baseline from the learned distribution, BEE generates a comprehensive set of explanation maps, facilitating the selection of the best-performing explanation map in this broad set for the given metric. Extensive evaluations across various model architectures showcase the superior performance of BEE in comparison to state-of-the-art explanation methods on a variety of objective evaluation metrics.

Subject: AAAI.2025 - Computer Vision

#17 Dual Manifold Regularization Steered Robust Representation Learning for Point Cloud Analysis [PDF³] [Copy] [Kimi] [REL]

Authors: Jian Bi, Qianliang Wu, Jianjun Qian, Lei Luo, Jian Yang

With the rapid advancement of 3D scanning technology, point clouds have become a crucial data type in computer vision and machine learning. However, learning robust representations for point clouds remains a significant challenge due to their irregularity and sparsity. In this paper, we propose a novel Dual Manifold Regularization (DMR) framework that makes full use of the properties of positive and negative curvature in manifolds to improve the representation of point clouds. Specifically, we leverage DMR based on hyperbolic and hyperspherical manifolds to address the limitations of traditional single-manifold regularization techniques, including inadequate generalization ability and adaptability to data diversity, as well as the difficulty of capturing complex relationships between data. To begin, we utilize the tree-like structure of the hyperbolic manifold to model the part-whole hierarchical relationships within point clouds. This allows for a more comprehensive representation of the data, improving the model's capability to understand complex shapes. Additionally, we construct positive samples through topological consistency augmentation and employ contrastive learning techniques in the hyperspherical manifold to capture more discriminative features within the data. Our experimental results show that our method outperforms traditional supervised learning and single-manifold regularization techniques in point cloud analysis. Specifically, for shape classification, DMR achieves a new State-Of-The-Art (SOTA) performance with 94.8% Overall Accuracy (OA) on ModelNet40 and 90.7% OA on ScanObjectNN, surpassing the recent SOTA model without increasing the baseline parameters.

Subject: AAAI.2025 - Computer Vision

#18 Learning Fine-grained Domain Generalization via Hyperbolic State Space Hallucination [PDF³] [Copy] [Kimi²] [REL]

Authors: Qi Bi, Jingjun Yi, Haolan Zhan, Wei Ji, Gui-Song Xia

Fine-grained domain generalization (FGDG) aims to learn a fine-grained representation that can be well generalized to unseen target domains when only trained on the source domain data. Compared with generic domain generalization, FGDG is particularly challenging in that the fine-grained category can be only discerned by some subtle and tiny patterns. Such patterns are particularly fragile under the cross-domain style shifts caused by illumination, color and etc. To push this frontier, this paper presents a novel Hyperbolic State Space Hallucination (HSSH) method. It consists of two key components, namely, state space hallucination (SSH) and hyperbolic manifold consistency (HMC). SSH enriches the style diversity for the state embeddings by firstly extrapolating and then hallucinating the source images. Then, the pre- and post- style hallucinate state embeddings are projected into the hyperbolic manifold. The hyperbolic state space models the high-order statistics, and allows a better discernment of the fine-grained patterns. Finally, the hyperbolic distance is minimized, so that the impact of style variation on fine-grained patterns can be eliminated. Experiments on three FGDG benchmarks demonstrate its state-of-the-art performance.

Subject: AAAI.2025 - Computer Vision

#19 DGFamba: Learning Flow Factorized State Space for Visual Domain Generalization [PDF⁵] [Copy] [Kimi] [REL]

Authors: Qi Bi, Jingjun Yi, Hao Zheng, Haolan Zhan, Wei Ji, Yawen Huang, Yuexiang Li

Domain generalization aims to learn a representation from the source domain, which can be generalized to arbitrary unseen target domains. A fundamental challenge for visual domain generalization is the domain gap caused by the dramatic style variation whereas the image content is stable. The realm of selective state space, exemplified by VMamba, demonstrates its global receptive field in representing the content. However, the way exploiting the domain-invariant property for selective state space is rarely explored. In this paper, we propose a novel Flow Factorized State Space model, dubbed as DGFamba, for visual domain generalization. To maintain domain consistency, we innovatively map the style-augmented and the original state embeddings by flow factorization. In this latent flow space, each state embedding from a certain style is specified by a latent probability path. By aligning these probability paths in the latent space, the state embeddings are able to represent the same content distribution regardless of the style differences. Extensive experiments conducted on various visual domain generalization settings show its state-of-the-art performance.

Subject: AAAI.2025 - Computer Vision

#20 CustomTTT: Motion and Appearance Customized Video Generation via Test-Time Training [PDF¹] [Copy] [Kimi¹] [REL]

Authors: Xiuli Bi, Jian Lu, Bo Liu, Xiaodong Cun, Yong Zhang, Weisheng Li, Bin Xiao

Benefiting from large-scale pre-training of text-video pairs, current text-to-video (T2V) diffusion models can generate high-quality videos from the text description. Besides, given some reference images or videos, the parameter-efficient fine-tuning method, i.e. LoRA, can generate high-quality customized concepts, e.g., the specific subject or the motions from a reference video. However, combining the trained multiple concepts from different references into a single network shows obvious artifacts. To this end, we propose CustomTTT, where we can joint custom the appearance and the motion of the given video easily. In detail, we first analyze the prompt influence in the current video diffusion model and find the LoRAs are only needed for the specific layers for appearance and motion customization. Besides, since each LoRA is trained individually, we propose a novel test-time training technique to update parameters after combination utilizing the trained customized models. We conduct detailed experiments to verify the effectiveness of the proposed methods. Our method outperforms several state-of-the-art works in both qualitative and quantitative evaluations.

Subject: AAAI.2025 - Computer Vision

#21 MotionCraft: Crafting Whole-Body Motion with Plug-and-Play Multimodal Controls [PDF¹] [Copy] [Kimi] [REL]

Authors: Yuxuan Bian, Ailing Zeng, Xuan Ju, Xian Liu, Zhaoyang Zhang, Wei Liu, Qiang Xu

Whole-body multimodal motion generation, controlled by text, speech, or music, has numerous applications including video generation and character animation. However, employing a unified model to process different condition modalities presents two main challenges: motion distribution drifts across different tasks (e.g., co-speech gestures and text-driven daily actions) and the complex optimization of mixed conditions with varying granularities (e.g., text and audio). In this paper, we propose MotionCraft, a unified diffusion transformer that crafts whole-body motion with plug-and-play multimodal control. Our framework employs a coarse-to-fine training strategy, starting with the text-to-motion semantic pre-training, followed by the multimodal low-level control adaptation. To effectively learn and transfer motion knowledge across different distributions, we design MC-Attn for parallel modeling of static and dynamic human topology graphs. To overcome the motion format inconsistency of existing benchmarks, we introduce MC-Bench, the first available multimodal whole-body motion generation benchmark based on the unified SMPL-X format. Extensive experiments show that MotionCraft achieves state-of-the-art performance on various standard motion generation tasks.

Subject: AAAI.2025 - Computer Vision

#22 FAMNet: Frequency-aware Matching Network for Cross-domain Few-shot Medical Image Segmentation [PDF⁶] [Copy] [Kimi¹] [REL]

Authors: Yuntian Bo, Yazhou Zhu, Lunbo Li, Haofeng Zhang

Existing few-shot medical image segmentation (FSMIS) models fail to address a practical issue in medical imaging: the domain shift caused by different imaging techniques, which limits the applicability to current FSMIS tasks. To overcome this limitation, we focus on the cross-domain few-shot medical image segmentation (CD-FSMIS) task, aiming to develop a generalized model capable of adapting to a broader range of medical image segmentation scenarios with limited labeled data from the novel target domain. Inspired by the characteristics of frequency domain similarity across different domains, we propose a Frequency-aware Matching Network (FAMNet), which includes two key components: a Frequency-aware Matching (FAM) module and a Multi-Spectral Fusion (MSF) module. The FAM module tackles two problems during the meta-learning phase: 1) intra-domain variance caused by the inherent support-query bias, due to the different appearances of organs and lesions, and 2) inter-domain variance caused by different medical imaging techniques. Additionally, we design an MSF module to integrate the different frequency features decoupled by the FAM module, and further mitigate the impact of inter-domain variance on the model's segmentation performance. Combining these two modules, our FAMNet surpasses existing FSMIS models and Cross-domain Few-shot Semantic Segmentation models on three cross-domain datasets, achieving state-of-the-art performance in the CD-FSMIS task.

Subject: AAAI.2025 - Computer Vision

#23 FreeMask: Rethinking the Importance of Attention Masks for Zero-Shot Video Editing [PDF³] [Copy] [Kimi] [REL]

Authors: Lingling Cai, Kang Zhao, Hangjie Yuan, Yingya Zhang, Shiwei Zhang, Kejie Huang

Text-to-video diffusion models have made remarkable advancements. Driven by their ability to generate temporally coherent videos, research on zero-shot video editing using these fundamental models has expanded rapidly. To enhance editing quality, structural controls are frequently employed in video editing. Among these techniques, cross-attention mask control stands out for its effectiveness and efficiency. However, when cross-attention masks are naively applied to video editing, they can introduce artifacts such as blurring and flickering. Our experiments uncover a critical factor overlooked in previous video editing research: cross-attention masks are not consistently clear but vary with model structure and denoising timestep. To address this issue, we propose the metric Mask Matching Cost (MMC) that quantifies this variability and propose FreeMask, a method for selecting optimal masks tailored to specific video editing tasks. Using MMC-selected masks, we further improve the masked fusion mechanism within comprehensive attention features, e.g., temp, cross, and self-attention modules. Our approach can be seamlessly integrated into existing zero-shot video editing frameworks with better performance, requiring no control assistance or parameter fine-tuning but enabling adaptive decoupling of unedited semantic layouts with mask precision control. Extensive experiments demonstrate that FreeMask achieves superior semantic fidelity, temporal consistency, and editing quality compared to state-of-the-art methods.

Subject: AAAI.2025 - Computer Vision

#24 Dynamic Adapter with Semantics Disentangling for Cross-lingual Cross-modal Retrieval [PDF⁵] [Copy] [Kimi] [REL]

Authors: Rui Cai, Zhiyu Dong, Jianfeng Dong, Xun Wang

Existing cross-modal retrieval methods typically rely on large-scale vision-language pair data. This makes it challenging to efficiently develop a cross-modal retrieval model for under-resourced languages of interest. Therefore, Cross-lingual Cross-modal Retrieval (CCR), which aims to align vision and the low-resource language (the target language) without using any human-labeled target-language data, has gained increasing attention. As a general parameter-efficient way, a common solution is to utilize adapter modules to transfer the vision-language alignment ability of Vision-Language Pretraining (VLP) models from a source language to a target language. However, these adapters are usually static once learned, making it difficult to adapt to target-language captions with varied expressions. To alleviate it, we propose Dynamic Adapter with Semantics Disentangling (DASD), whose parameters are dynamically generated conditioned on the characteristics of the input captions. Considering that the semantics and expression styles of the input caption largely influence how to encode it, we propose a semantic disentangling module to extract the semantic-related and semantic-agnostic features from the input, ensuring that generated adapters are well-suited to the characteristics of input caption. Extensive experiments on two image-text datasets and one video-text dataset demonstrate the effectiveness of our model for cross-lingual cross-modal retrieval, as well as its good compatibility with various VLP models.

Subject: AAAI.2025 - Computer Vision

#25 Divide-and-Conquer: Tree-structured Strategy with Answer Distribution Estimator for Goal-Oriented Visual Dialogue [PDF³] [Copy] [Kimi] [REL]

Authors: Shuo Cai, Xinzhe Han, Shuhui Wang

Goal-oriented visual dialogue involves multi-round interaction between artificial agents, which has been of remarkable attention due to its wide applications. Given a visual scene, this task occurs when a Questioner asks an action-oriented question and an Answerer responds with the intent of letting the Questioner know the correct action to take. The quality of questions affects the accuracy and efficiency of the target search progress. However, existing methods lack a clear strategy to guide the generation of questions, resulting in the randomness in the search process and inconvergent results. We propose a Tree-Structured Strategy with Answer Distribution Estimator (TSADE) which guides the question generation by excluding half of the current candidate objects in each round. The above process is implemented by maximizing a binary reward inspired by the ``divide-and-conquer'' paradigm. We further design a candidate-minimization reward which encourages the model to narrow down the scope of candidate objects toward the end of the dialogue. We experimentally demonstrate that our method can enable the agents to achieve high task-oriented accuracy with fewer repeating questions and rounds compared to traditional ergodic question generation approaches. Qualitative results further show that TSADE facilitates agents to generate higher-quality questions.

Subject: AAAI.2025 - Computer Vision