Memory-based trackers are video object segmentation methods that form the target model by concatenating recently tracked frames into a memory buffer and localize the target by attending the current image to the buffered frames. While such trackers already achieve top performance on many benchmarks, it was the recent release of SAM2 that brought memory-based trackers into the focus of the visual object tracking community. Nevertheless, modern trackers still struggle in the presence of distractors. We argue that a more sophisticated memory model is required, and propose a new distractor-aware memory model for SAM2 together with an introspection-based update strategy that jointly addresses segmentation accuracy and tracking robustness. The resulting tracker is denoted SAM2.1++. We also propose a new distractor-distilled dataset, DiDi, to better study the distractor problem. SAM2.1++ outperforms SAM2.1 and related SAM memory extensions on seven benchmarks and sets a solid new state-of-the-art on six of them.
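For readers unfamiliar with the mechanism, the sketch below illustrates the core memory-readout step such trackers build on: current-frame features attend to features buffered from past frames via plain scaled dot-product attention. This is a generic illustration, not SAM2.1++'s distractor-aware memory; all names and shapes are illustrative.

```python
import torch

def read_memory(query_feats, memory_keys, memory_values):
    """Attend current-frame features to a buffer of past-frame features.

    query_feats:   (N_q, C) features of the current frame
    memory_keys:   (N_m, C) keys from buffered past frames
    memory_values: (N_m, C) values from buffered past frames
    Returns a (N_q, C) memory readout used to localize the target.
    """
    attn = torch.softmax(
        query_feats @ memory_keys.t() / memory_keys.shape[-1] ** 0.5, dim=-1
    )
    return attn @ memory_values

# toy usage: 1024 query tokens attending to a 3-frame memory of 1024 tokens each
q = torch.randn(1024, 256)
mem_k = torch.randn(3 * 1024, 256)
mem_v = torch.randn(3 * 1024, 256)
readout = read_memory(q, mem_k, mem_v)  # (1024, 256)
```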
We study model confidence calibration in class-incremental learning, where models learn from sequential tasks with different class sets. While existing work primarily focuses on accuracy, maintaining calibrated confidence has been largely overlooked. Unfortunately, most post-hoc calibration techniques are not designed to work with the limited memory of old-task data typical in class-incremental learning, since retaining a sufficiently large validation set would be impractical. We therefore propose T-CIL, a novel temperature scaling approach for class-incremental learning that requires no validation set for old tasks and instead leverages adversarially perturbed exemplars from memory. Directly using the exemplars is inadequate for temperature optimization, since they have already been used for training. The key idea of T-CIL is to perturb exemplars more strongly for old tasks than for the new task by adjusting the perturbation direction based on feature distance, with the single perturbation magnitude determined on the new-task validation set. This strategy makes the magnitude computed from the new task applicable to old tasks as well, exploiting the tendency that old-task accuracy is lower than new-task accuracy. We empirically show that T-CIL significantly outperforms various baselines in terms of calibration on real datasets and can be integrated into existing class-incremental learning techniques with minimal impact on accuracy.
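The abstract builds on standard post-hoc temperature scaling; the minimal sketch below shows only that base step (fitting a single temperature by minimizing NLL on held-out logits). T-CIL's specific contribution, substituting adversarially perturbed exemplars for the missing old-task validation set, is not reproduced here, and the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, steps=200, lr=0.01):
    """Standard temperature scaling: find T that minimizes NLL on held-out logits.

    logits: (N, K) float tensor, labels: (N,) long tensor.
    """
    log_t = torch.zeros(1, requires_grad=True)   # optimize log T so T stays positive
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()

# calibrated probabilities are then softmax(logits / T)
```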
As sketch research has collectively matured, its adoption for mass commercialisation is on the immediate horizon. Despite an already mature research effort on efficient inference for photos, there is no research on efficient inference specifically designed for sketch data. In this paper, we first demonstrate that existing state-of-the-art lightweight efficient models designed for photos do not work on sketches. We then propose two sketch-specific components that work in a plug-and-play manner on any efficient photo network to adapt it to sketch data. We specifically choose fine-grained sketch-based image retrieval (FG-SBIR) as a demonstrator, being the most recognised sketch problem with immediate commercial value. Technically, we first propose a cross-modal knowledge distillation network to transfer existing efficient photo networks to be compatible with sketch, which brings down the number of FLOPs and model parameters by 97.96\% and 84.89\%, respectively. We then exploit the abstract nature of sketch to introduce an RL-based canvas selector that dynamically adjusts to the abstraction level, which further cuts the number of FLOPs by two thirds. The end result is an overall reduction of 99.37\% in FLOPs (from 40.18G to 0.254G) compared with a full network, while retaining accuracy (33.03\% vs 32.77\%) -- finally yielding an efficient network for sparse sketch data that exhibits even fewer FLOPs than the best photo counterpart.
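As a point of reference for the cross-modal distillation component, here is the standard soft-label knowledge distillation objective (Hinton et al.) in PyTorch; the paper's cross-modal KD network for photo-to-sketch transfer presumably adds its own architecture and losses beyond this generic form.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Soft-label knowledge distillation: match the student's distribution to the
    (photo) teacher's temperature-softened distribution."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    # T^2 factor keeps gradient magnitudes comparable across temperatures
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```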
Unlike the restoration of modern, natively digital videos, old film restoration requires addressing degradations specific to analog sources. However, existing specialized methods still fall short of general video restoration techniques. In this work, we propose a new baseline to re-examine the challenges in old film restoration. First, we develop an improved Mamba-based framework, dubbed MambaOFR, which can dynamically adjust its degradation removal patterns by generating degradation-aware prompts to tackle the complex, composite degradations present in old films. Second, we introduce a flow-guided mask deformable alignment module to mitigate the propagation of structured defect features in the temporal domain. Third, we introduce the first benchmark dataset that includes both synthetic and real-world old film clips. Extensive experiments show that the proposed method achieves state-of-the-art performance, outperforming existing advanced approaches in old film restoration. The implementation and model will be released.
Non-verbal communication often comprises semantically rich gestures that help convey the meaning of an utterance. Producing such semantic co-speech gestures has been a major challenge for existing neural systems, which can generate rhythmic beat gestures but struggle to produce semantically meaningful ones. We therefore present RAG-Gesture, a diffusion-based gesture generation approach that leverages Retrieval Augmented Generation (RAG) to produce natural-looking and semantically rich gestures. Our neuro-explicit gesture generation approach is designed to produce semantic gestures grounded in interpretable linguistic knowledge. We achieve this by using explicit domain knowledge to retrieve exemplar motions from a database of co-speech gestures. Once retrieved, we inject these semantic exemplar gestures into our diffusion-based gesture generation pipeline using DDIM inversion and retrieval guidance at inference time, without any need for training. Further, we propose a guidance control paradigm that allows users to modulate the influence each retrieved exemplar has on the generated sequence. Our comparative evaluations demonstrate the validity of our approach against recent gesture generation approaches. The reader is urged to watch the supplementary video.
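Since the pipeline relies on DDIM inversion to inject retrieved exemplars, the standard DDIM inversion update is reproduced below for reference, with \(\bar{\alpha}_t\) the cumulative noise schedule and \(\epsilon_\theta\) the learned noise predictor; the abstract does not specify the exact guidance terms RAG-Gesture adds on top of it.

```latex
x_{t+1} \;=\; \sqrt{\bar{\alpha}_{t+1}}\;
\underbrace{\frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t,t)}{\sqrt{\bar{\alpha}_t}}}_{\hat{x}_0}
\;+\; \sqrt{1-\bar{\alpha}_{t+1}}\;\epsilon_\theta(x_t,t)
```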
Semantic future prediction is important for autonomous systems navigating dynamic environments. This paper introduces FUTURIST, a method for multimodal future semantic prediction that uses a unified and efficient visual sequence transformer architecture. Our approach incorporates a multimodal masked visual modeling objective and a novel masking mechanism designed for multimodal training. This allows the model to effectively integrate visible information from various modalities, improving prediction accuracy. Additionally, we propose a VAE-free hierarchical tokenization process, which reduces computational complexity, streamlines the training pipeline, and enables end-to-end training with high-resolution, multimodal inputs. We validate FUTURIST on the Cityscapes dataset, demonstrating state-of-the-art performance in future semantic segmentation for both short- and mid-term forecasting.
The search for exoplanets is an active field in astronomy, with direct imaging as one of the most challenging methods due to faint exoplanet signals buried within stronger residual starlight. Successful detection requires advanced image processing to separate the exoplanet signal from this nuisance component. This paper presents a novel statistical model that captures nuisance fluctuations using a multi-scale approach, leveraging problem symmetries and a joint spectral channel representation grounded in physical principles. Our model integrates into an interpretable, end-to-end learnable framework for simultaneous exoplanet detection and flux estimation. The proposed algorithm is evaluated against the state of the art using datasets from the SPHERE instrument operating at the Very Large Telescope (VLT). It significantly improves the precision-recall trade-off, notably on challenging datasets that are otherwise unusable by astronomers. The proposed approach is computationally efficient, robust to varying data quality, and well suited for large-scale observational surveys.
Federated Learning (FL) is an emerging direction in distributed machine learning that enables jointly training a model without sharing the data. However, as dataset sizes grow exponentially, the computational cost of FL increases. In this paper, we propose the first Coreset Selection criterion for Federated Learning (FedCS), which explores Distance Contrast (DC) in feature space. FedCS is inspired by the discovery that DC indicates intrinsic properties of samples regardless of the network. Based on this observation, we develop a mathematically formulated method that prunes samples with high DC. The principle behind this pruning is that high-DC samples either contain little information or represent rare extreme cases, so removing them can enhance aggregation performance. We also show experimentally that samples with low DC usually contain substantial information and reflect the common features of their classes, making them suitable for constructing the coreset. With only two operations of linear-logarithmic complexity, FedCS achieves significant savings in computational cost over methods using the whole dataset, at similar accuracy. For example, on the CIFAR-10 dataset with Dirichlet coefficient \alpha=0.1, FedCS achieves 58.88% accuracy using only 44% of the entire dataset, whereas other methods require twice the data volume of FedCS for the same performance.
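The abstract does not define Distance Contrast precisely, so the sketch below uses distance to the class centroid in feature space as an illustrative stand-in for the score: samples far from their centroid (high score) are pruned and the closest fraction is kept, with sorting as the dominant O(n log n) cost. Function and parameter names are hypothetical; keep_frac=0.44 merely mirrors the 44% figure quoted above.

```python
import numpy as np

def prune_by_distance(features, labels, keep_frac=0.44):
    """Rank samples by distance to their class centroid and drop the farthest
    fraction (a stand-in for 'high distance contrast'). Sorting dominates: O(n log n)."""
    keep = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        centroid = features[idx].mean(axis=0)
        dist = np.linalg.norm(features[idx] - centroid, axis=1)
        order = idx[np.argsort(dist)]                       # closest (low-score) first
        keep.append(order[: int(np.ceil(keep_frac * len(idx)))])
    return np.concatenate(keep)
```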
In this work, we introduce Monocular and Generalizable Gaussian Talking Head Animation (MGGTalk), which requires only monocular datasets and generalizes to unseen identities without personalized re-training. Compared with previous 3D Gaussian Splatting (3DGS) methods that require elusive multi-view datasets or tedious personalized learning/inference, MGGTalk enables more practical and broader applications. However, in the absence of multi-view and personalized training data, the incompleteness of geometric and appearance information poses a significant challenge. To address this challenge, MGGTalk exploits depth information and facial symmetry to supplement both geometric and appearance features. Initially, based on the pixel-wise geometric information obtained from depth estimation, we incorporate symmetry operations and point cloud filtering techniques to ensure complete and precise position parameters for 3DGS. Subsequently, we adopt a two-stage strategy with symmetric priors for predicting the remaining 3DGS parameters. We begin by predicting Gaussian parameters for the visible facial regions of the source image; these parameters are then used to improve the prediction of Gaussian parameters for the non-visible regions. Extensive experiments demonstrate that MGGTalk surpasses previous state-of-the-art methods, achieving superior performance across various metrics.
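As a toy illustration of the symmetry idea mentioned above, the snippet below mirrors a frontal point cloud across an assumed vertical symmetry plane to fill in occluded regions; MGGTalk's actual symmetry operations and point-cloud filtering are more involved, and the names here are hypothetical.

```python
import numpy as np

def mirror_complete(points, axis=0):
    """Reflect a frontal face point cloud across its (assumed) symmetry plane
    to fill in occluded regions, then concatenate with the original points.

    points: (N, 3) array in a head-centered frame whose symmetry plane is x = x_mid.
    """
    mirrored = points.copy()
    x_mid = points[:, axis].mean()                # crude estimate of the symmetry plane
    mirrored[:, axis] = 2.0 * x_mid - mirrored[:, axis]
    return np.concatenate([points, mirrored], axis=0)
```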
While existing pre-training-based methods have enhanced point cloud model performance, they have not fundamentally resolved the challenge of representing local structure in point clouds. The limited representational capacity of pure point cloud models continues to constrain the potential of cross-modal fusion methods and performance across various tasks. To address this challenge, we propose a Dynamic Acoustic Field Fitting Network (DAF-Net), inspired by physical acoustic principles. Specifically, we represent local point clouds as acoustic fields and introduce a novel Acoustic Field Convolution (AF-Conv), which treats local aggregation as an acoustic energy field modeling problem and captures fine-grained local shape awareness by dividing the local area into a near field and a far field. Furthermore, drawing inspiration from multi-frequency wave phenomena and dynamic convolution, we develop the Dynamic Acoustic Field Convolution (DAF-Conv) based on AF-Conv. DAF-Conv dynamically generates multiple weights based on local geometric priors, effectively enhancing adaptability to diverse geometric features. Additionally, we design a Global Shape-Aware (GSA) layer incorporating EdgeConv and multi-head attention, which combines with DAF-Conv to form the DAF block. These blocks are then stacked to create a hierarchical DAF-Net architecture. Extensive experiments on point cloud classification, part segmentation, and few-shot semantic segmentation demonstrate that DAF-Net significantly outperforms existing methods across multiple tasks.
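The near-field/far-field split can be pictured with the hedged sketch below: each point's k nearest neighbors are divided by the median neighbor distance and pooled separately. This is only a schematic stand-in for AF-Conv; the paper's acoustic energy field formulation and dynamic weight generation are not reproduced, and all names are illustrative.

```python
import torch

def near_far_aggregate(xyz, feats, k=16):
    """Split each point's k nearest neighbors into a near field and a far field
    by the median neighbor distance, and pool the two groups separately.

    xyz:   (N, 3) point coordinates
    feats: (N, C) point features
    Returns (N, 2C): concatenated near-field and far-field summaries.
    """
    dist = torch.cdist(xyz, xyz)                          # (N, N) pairwise distances
    knn_dist, knn_idx = dist.topk(k, largest=False)       # (N, k) nearest neighbors
    neigh = feats[knn_idx]                                # (N, k, C) neighbor features
    near_mask = (knn_dist <= knn_dist.median(dim=1, keepdim=True).values).unsqueeze(-1)
    near = (neigh * near_mask).sum(1) / near_mask.sum(1).clamp(min=1)
    far = (neigh * ~near_mask).sum(1) / (~near_mask).sum(1).clamp(min=1)
    return torch.cat([near, far], dim=-1)
```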
How many outliers are within an unlabeled and contaminated dataset? Although a series of unsupervised outlier detection (UOD) approaches has been proposed, they cannot correctly answer this critical question, resulting in unstable performance across real-world scenarios with varying contamination factors. To address this problem, we propose FlexUOD, built on a novel contamination factor estimation perspective. FlexUOD is not only remarkably robust but also a general, plug-and-play framework that can significantly improve the performance of existing UOD methods. Extensive experiments demonstrate that FlexUOD achieves state-of-the-art results as well as high efficacy on diverse evaluation benchmarks.
Recent studies have focused on introducing pre-trained foundation models into semi-supervised learning (SSL) tasks. Nevertheless, these foundation models can exhibit biases toward different classes and tend to generate imbalanced pseudo-labels for SSL. Thus, efforts have been made to introduce a logit adjustment offset to reduce the inherent bias of foundation models in SSL tasks. Despite their success, existing foundation model-based SSL methods face three challenges: 1) unreliability of the estimated logit adjustment offset, 2) overlooking the potential of linguistic knowledge for capturing model biases, and 3) failure to fully exploit unlabeled samples. To address these issues, we propose a Language-Assisted Debiasing and Smoothing framework, namely LADaS, for foundation model-based SSL. It consists of two components: 1) Language-assisted Pseudo-Label Debiasing (LPLD) to reduce biases in foundation models, and 2) Language-aware Pseudo-Label Smoothing (LPLS) to fully exploit low-confidence samples and facilitate SSL training. In particular, LPLD introduces a reliability score to dynamically assess the reliability of the logit adjustment. Additionally, it incorporates a language-oriented preference to reduce model biases using linguistic knowledge derived from pre-trained language models. Finally, LPLS introduces language-aware soft labels and devises a language-aware pseudo-label smoothing loss to guide the learning of unlabeled samples with low-quality pseudo-labels. Extensive experiments demonstrate the superiority of the proposed LADaS. We will release our source code to benefit other researchers.
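For context, the logit adjustment offset referred to above is typically applied as below (a Menon-et-al.-style post-hoc adjustment); LADaS's reliability score and language-oriented preference are not shown, and how the class prior is estimated is an assumption left open here.

```python
import torch

def adjust_logits(logits, class_prior, tau=1.0):
    """Post-hoc logit adjustment: subtract tau * log(prior) so that classes the
    (biased) model favors are down-weighted before pseudo-labeling."""
    return logits - tau * torch.log(class_prior + 1e-12)

# pseudo-labels from adjusted logits, e.g.:
# pseudo = adjust_logits(logits, estimated_prior).argmax(dim=-1)
```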
Model merging combines ``expert'' models---each finetuned from a shared foundation model on diverse tasks and domains---into a single, more capable base model. However, existing model merging approaches assume all experts to be available simultaneously. In reality, new tasks and domains emerge continuously, prompting the need for a dynamic process of integrating these experts over time, which we call \textit{temporal model merging}. The temporal dimension introduces unique challenges not addressed in prior work: At each task, should expert training start from the merged previous experts or from the original base model? Should all models be merged at every time step? Which merging techniques are best suited for temporal merging? Should different strategies be used for the training initialization and deployment phases? To tackle these questions, we propose a unified framework called \textsc{TIME}---\underline{T}emporal \underline{I}ntegration of \underline{M}odel \underline{E}xpertise---that defines temporal model merging along three axes: (1) Initialization Phase, (2) Deployment Phase, and (3) Merging Technique. Using \textsc{TIME}, we study temporal model merging across model sizes, tasks, and compute budgets on the large-scale FoMo-in-Flux benchmark for continual multimodal pretraining. Systematic experiments across \textsc{TIME} and FoMo-in-Flux yield several key insights into the current limits and best practices for successful model merging over time.
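One of the simplest merging techniques such a framework can schedule at each time step is plain weight averaging of expert checkpoints that share the base architecture; the hedged sketch below shows that operation only, with hypothetical names, and is not the specific recipe recommended by the paper.

```python
import copy
import torch

def merge_experts(base_model, expert_state_dicts, weights=None):
    """Average expert checkpoints (all finetuned from the same base architecture)
    into a single merged model. Uniform weights unless specified."""
    weights = weights or [1.0 / len(expert_state_dicts)] * len(expert_state_dicts)
    avg = {k: sum(w * sd[k] for w, sd in zip(weights, expert_state_dicts))
           for k in expert_state_dicts[0]}
    merged = copy.deepcopy(base_model)
    merged.load_state_dict(avg)   # integer buffers are cast back on load
    return merged
```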
Space-time video resampling aims to conduct both spatial-temporal downsampling and upsampling processes to achieve high-quality video reconstruction. Although there has been much progress, some major challenges still exist, such as how to preserve motion information during temporal resampling while avoiding blurring artifacts, and how to achieve flexible temporal and spatial resampling factors. In this paper, we introduce an Invertible Motion Steganography Module (IMSM), designed to preserve motion information in high-frame-rate videos. This module embeds motion information from high-frame-rate videos into downsampled frames with lower frame rates in a visually imperceptible manner. Its reversible nature allows the motion information to be recovered, facilitating the reconstruction of high-frame-rate videos. Furthermore, we propose a 3D implicit feature modulation technique that enables continuous spatiotemporal resampling. With tailored training strategies, our method supports flexible frame rate conversions, including non-integer changes like 30 FPS to 24 FPS and vice versa. Extensive experiments show that our method significantly outperforms existing solutions across multiple datasets in various video resampling tasks with high flexibility. Codes will be made available upon publication.
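Invertible embedding of side information is commonly built from coupling layers; the sketch below shows a generic RealNVP-style affine coupling block whose forward pass can be exactly inverted. It is meant only to illustrate invertibility, not the actual IMSM design, and the class name and sizes are assumptions.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Generic invertible coupling block: half the channels are transformed by an
    affine map whose scale/shift depend on the other half, so the operation can be
    inverted exactly to recover the hidden (e.g., motion) information."""
    def __init__(self, channels, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels // 2, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, channels, 3, padding=1),
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)
        log_s, t = self.net(x1).chunk(2, dim=1)
        return torch.cat([x1, x2 * log_s.exp() + t], dim=1)

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=1)
        log_s, t = self.net(y1).chunk(2, dim=1)
        return torch.cat([y1, (y2 - t) * (-log_s).exp()], dim=1)
```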
Deep learning–based models for All-In-One image Restoration (AIOR) have achieved significant advancements in recent years. However, their practical applicability is limited by poor generalization to samples outside the training distribution. This limitation arises primarily from insufficient diversity in degradation variations and scenes within existing datasets, resulting in inadequate representations of real-world scenarios. Additionally, capturing large-scale real-world paired data for degradations such as haze, low-light, and raindrops is often cumbersome and sometimes infeasible. In this paper, we leverage the generative capabilities of latent diffusion models to synthesize high-quality degraded images from their clean counterparts. Specifically, we introduce GenDeg, a degradation- and intensity-aware conditional diffusion model capable of producing diverse degradation patterns on clean images. Using GenDeg, we synthesize over 550k samples across six degradation types: haze, rain, snow, motion blur, low-light, and raindrops. These generated samples are integrated with existing datasets to form the GenDS dataset, comprising over 750k samples. Our experiments reveal that image restoration models trained on the GenDS dataset exhibit significant improvements in out-of-distribution performance compared to those trained solely on existing datasets. Furthermore, we provide comprehensive analyses of the implications of diffusion model-based synthetic degradations for AIOR. The code will be made publicly available.
This work explores whether a deep generative model can learn complex knowledge solely from visual input, in contrast to the prevalent focus on text-based models like large language models (LLMs). We develop an autoregressive video generation model, Visioner, trained exclusively on raw video data, and test its knowledge acquisition abilities in video-based Go and robotic control environments. Our experiments reveal two key findings: (1) video-only training provides sufficient information for learning extensive knowledge, and (2) the compactness of visual representations significantly enhances learning efficiency. To improve both the efficiency and efficacy of knowledge learning, we introduce the Latent Dynamics Model (LDM). Remarkably, Visioner reaches a 5-dan professional level in the Video-GoBench with just a 300-million-parameter model, without relying on search algorithms or reward mechanisms typical in reinforcement learning. This study opens new avenues for knowledge acquisition from visual data, with all code, data, and models to be open-sourced for further research.
The capability to track objects in low-light environments such as nighttime is crucial for numerous real-world applications such as crowd behavior analysis and traffic scene understanding. However, previous Multi-Camera Multi-Target (MCMT) tracking methods focus primarily on daytime tracking under favorable lighting and shy away from low-light environments. The main difficulty of tracking under low-light conditions is the lack of detailed visible appearance features. To address this issue, we incorporate the infrared modality into the MCMT tracking framework to provide more useful information. We construct the first Multi-modality (RGBT) Multi-camera Multi-target tracking dataset, named M3Track, which contains sequences captured in low-light environments, laying a solid foundation for all-day multi-camera tracking. Based on the proposed dataset, we propose an All-Day Multi-Camera Multi-Target tracking network, termed ADMCMT. Specifically, we propose an All-Day Mamba Fusion (ADMF) model to adaptively fuse information from different modalities. Within ADMF, the Lighting Guidance Model (IGM) extracts lighting-relevant information to guide the fusion process. Furthermore, the Nearby Target Collection (NTC) strategy is designed to enhance tracking accuracy by leveraging information from objects surrounding the target. Experiments conducted on M3Track demonstrate that ADMCMT exhibits strong generalization across different lighting conditions. The code will be released soon.
Autofocus is a crucial component of modern digital cameras. While recent learning-based methods achieve state-of-the-art in-focus prediction accuracy, they ignore the potential focus hunting phenomenon of back-and-forth lens movement in the multi-step focusing procedure. To address this, we propose an expert-regularized deep reinforcement learning (DRL) approach for autofocus, which utilizes the sequential information of the lens movement trajectory both to enhance multi-step in-focus prediction accuracy and to reduce the chance of focus hunting. Our method follows an actor-critic framework. To accelerate DRL training with higher sample efficiency, we initialize the policy with a pre-trained single-step prediction network, which is further improved by changing its output from an absolute in-focus position distribution to a relative lens-movement distribution, establishing a better mapping between input images and lens movement. To further stabilize DRL training and lower the occurrence of focus hunting in the resulting lens movement trajectories, we generate offline trajectories based on prior knowledge that avoid focus hunting, and leverage them as an offline dataset of expert trajectories to regularize the actor network's training. Empirical evaluations show that our method outperforms existing learning-based methods on public benchmarks, with higher single- and multi-step prediction accuracy and a significant reduction in focus hunting rate.
Global visual geolocation predicts where an image was captured on Earth. Since images vary in how precisely they can be localized, this task inherently involves a significant degree of ambiguity. However, existing approaches are deterministic and overlook this aspect. In this paper, we aim to close the gap between traditional geolocalization and modern generative methods. We propose the first generative geolocation approach based on diffusion and Riemannian flow matching, where the denoising process operates directly on the Earth's surface. Our model achieves state-of-the-art performance on three visual geolocation benchmarks: OpenStreetView-5M, YFCC-100M, and iNat21. In addition, we introduce the task of probabilistic visual geolocation, where the model predicts a probability distribution over all possible locations instead of a single point. We introduce new metrics and baselines for this task, demonstrating the advantages of our diffusion-based approach. Codes and models will be made available.
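For reference, Riemannian flow matching on the unit sphere is usually written with the spherical exponential and logarithm maps as below (geodesic interpolant and conditional vector field); whether the paper uses exactly this parameterization is not stated in the abstract.

```latex
\begin{aligned}
\exp_x(v) &= \cos\!\left(\lVert v\rVert\right) x \;+\; \sin\!\left(\lVert v\rVert\right)\frac{v}{\lVert v\rVert},\\[2pt]
\log_x(y) &= \frac{\theta}{\sin\theta}\left(y - \cos\theta\, x\right),
\qquad \theta = \arccos\!\left(\langle x, y\rangle\right),\\[2pt]
x_t &= \exp_{x_0}\!\bigl(t\,\log_{x_0}(x_1)\bigr),
\qquad u_t(x_t \mid x_1) = \frac{\log_{x_t}(x_1)}{1-t}.
\end{aligned}
```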
Diffusion transformers have shown exceptional performance in visual generation but are accompanied by high computational costs. Token reduction techniques that compress models by sharing the denoising process among similar tokens have been introduced. However, existing approaches neglect the denoising priors of diffusion models, leading to suboptimal acceleration and diminished image quality. This study proposes a novel concept: attend to prune feature redundancies in areas not attended to by the diffusion process. We analyze the location and degree of feature redundancies based on the structure-then-detail denoising priors. Subsequently, we introduce SDTM, a structure-then-detail token merging approach that dynamically compresses feature redundancies. Specifically, we design dynamic visual token merging, compression ratio adjustment, and prompt reweighting for different stages. Applied in a post-training manner, the proposed method can be integrated seamlessly into the DiT architecture. Extensive experiments across various backbones, schedulers, and datasets showcase the superiority of our method, which achieves 1.55 times acceleration with negligible impact on image quality.
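Token merging itself can be summarized by the simplified, uniform step below (ToMe-style bipartite matching by cosine similarity); SDTM's contribution is to schedule where and how aggressively such merging is applied according to the structure-then-detail prior, which this sketch does not capture. Names and the merge rule are illustrative.

```python
import torch
import torch.nn.functional as F

def merge_tokens(tokens, r):
    """Greedy similarity-based token merging (simplified): fold the r most similar
    tokens of one alternating set into their best match in the other set.

    tokens: (N, C); requires r <= N // 2; returns roughly (N - r, C).
    """
    a, b = tokens[0::2], tokens[1::2]                                   # bipartite split
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).t()           # (Na, Nb)
    best_sim, best_b = sim.max(dim=-1)
    merge_a = best_sim.topk(r).indices                                  # a-tokens to merge away
    b = b.clone()
    keep_a = torch.ones(a.shape[0], dtype=torch.bool)
    for i in merge_a.tolist():
        b[best_b[i]] = (b[best_b[i]] + a[i]) / 2                        # average into its match
        keep_a[i] = False
    return torch.cat([a[keep_a], b], dim=0)
```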
Text-to-image generation has recently emerged as a viable alternative to text-to-image retrieval, driven by the visually impressive results of generative diffusion models. Although query performance prediction is an active research topic in information retrieval, to the best of our knowledge, there is no prior study that analyzes the difficulty of queries (referred to as prompts) in text-to-image generation, based on human judgments. To this end, we introduce the first dataset of prompts which are manually annotated in terms of image generation performance. Additionally, we extend these evaluations to text-to-image retrieval by collecting manual annotations that represent retrieval performance. We thus establish the first joint benchmark for prompt and query performance prediction (PQPP) across both tasks, comprising over 10K queries. Our benchmark enables (i) the comparative assessment of prompt/query difficulty in both image generation and image retrieval, and (ii) the evaluation of prompt/query performance predictors addressing both generation and retrieval. We evaluate several pre- and post-generation/retrieval performance predictors, thus providing competitive baselines for future research. Our benchmark and code are publicly available at https://anonymous.4open.science/r/PQPP-D332.
Due to their multimodal capabilities, Vision-Language Models (VLMs) have found numerous impactful applications in real-world scenarios. However, recent studies have revealed that VLMs are vulnerable to image-based adversarial attacks, particularly targeted adversarial images that manipulate the model to generate harmful content specified by the adversary. Current attack methods rely on predefined target labels to create targeted adversarial attacks, which limits their scalability and applicability for large-scale robustness evaluations. In this paper, we propose **AnyAttack**, a self-supervised framework that generates targeted adversarial images for VLMs without label supervision, allowing **any** image to serve as a target for the **attack**. Our framework employs the "pre-training and fine-tuning" paradigm, with the adversarial noise generator pre-trained on the large-scale LAION-400M dataset. This large-scale pre-training endows our method with powerful transferability across a wide range of VLMs. Extensive experiments on five mainstream open-source VLMs (CLIP, BLIP, BLIP2, InstructBLIP, and MiniGPT-4) across three multimodal tasks (image-text retrieval, multimodal classification, and image captioning) demonstrate the effectiveness of our attack. Additionally, we successfully transfer AnyAttack to multiple commercial VLMs, including Google Gemini, Claude Sonnet, Microsoft Copilot and OpenAI GPT. These results reveal an unprecedented risk to VLMs, highlighting the need for effective countermeasures.
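AnyAttack trains a noise generator rather than optimizing each image, but its label-free objective, pushing a clean image's embedding toward that of an arbitrary target image, can be illustrated with the plain PGD stand-in below; `encoder` is assumed to be any differentiable image encoder (e.g., a VLM vision tower), and all hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def embedding_matching_attack(image, target_image, encoder,
                              eps=8 / 255, steps=40, alpha=1 / 255):
    """PGD that pushes a clean image's embedding toward a chosen target image's
    embedding, so any image can define the attack target (no label needed)."""
    with torch.no_grad():
        target_emb = F.normalize(encoder(target_image), dim=-1)
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        emb = F.normalize(encoder(image + delta), dim=-1)
        loss = -(emb * target_emb).sum()          # maximize cosine similarity
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()    # gradient-sign step
            delta.clamp_(-eps, eps)               # keep perturbation imperceptible
            delta.grad.zero_()
    return (image + delta.detach()).clamp(0, 1)
```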
Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education. However, generating RGBA video, which includes alpha channels for transparency, remains a challenge due to limited datasets and the difficulty of adapting existing models. Alpha channels are crucial for visual effects (VFX), allowing transparent elements like smoke and reflections to blend seamlessly into scenes. We introduce TransPixar, a method to extend pretrained video models for RGBA generation while retaining the original RGB capabilities. TransPixar leverages a diffusion transformer (DiT) architecture, incorporating alpha-specific tokens and using LoRA-based fine-tuning to jointly generate RGB and alpha channels with high consistency. By optimizing attention mechanisms, TransPixar preserves the strengths of the original RGB model and achieves strong alignment between RGB and alpha channels despite limited training data. Our approach effectively generates diverse and consistent RGBA videos, advancing the possibilities for VFX and interactive content creation.
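The LoRA-based fine-tuning mentioned above follows the standard recipe of freezing the pretrained weights and learning a low-rank update; a minimal sketch for a linear layer is given below (generic LoRA, not TransPixar's specific placement of adapters or alpha-specific tokens).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update: the standard
    LoRA recipe for adding new capacity without disturbing the original weights."""
    def __init__(self, base: nn.Linear, rank=16, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # keep pretrained weights intact
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```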
Text-based 3D human motion editing is a critical yet challenging task in computer vision and graphics. While training-free approaches have been explored, the recent release of the MotionFix dataset, which includes source-text-motion triplets, has opened new avenues for training, yielding promising results. However, existing methods struggle with precise control, often resulting in misalignment between motion semantics and language instructions. In this paper, we introduce MotionDiT, an advanced Diffusion-Transformer-based motion editing model that effectively incorporates editing features both as layer-wise control signals and as input prefixes. To enhance the model's semantic understanding, we also propose a novel auxiliary task, motion similarity prediction, which fosters the learning of semantically meaningful representations. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in both editing alignment and fidelity.
Despite advancements in Computer-Aided Design (CAD) generation, directly generating complex Boundary Representation (B-rep) CAD models remains challenging. This difficulty arises from the parametric nature of B-rep data, which complicates the encoding and generation of its geometric and topological information. To address this, we introduce BrepGiff, a lightweight approach for generating high-quality, complex B-rep models based on 3D graph diffusion. First, we transform B-rep models into a 3D graph representation. Specifically, BrepGiff extracts and integrates topological and geometric features to construct a 3D graph whose nodes correspond to face centroids in 3D space, preserving adjacency and degree information; geometric features are derived by sampling points in the UV domain and extracting face and edge features. Then, BrepGiff applies a Graph Attention Network (GAT) to enforce topological constraints from local to global during the degree-guided diffusion process. With the 3D graph representation and an efficient diffusion process, our method significantly reduces computational cost and improves quality, thus achieving lightweight generation of complex models. Experiments show that BrepGiff can generate complex B-rep models (>100 faces) using only 2 RTX 4090 GPUs, achieving state-of-the-art performance in B-rep generation.
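Assuming per-face UV sample points and face adjacency have already been extracted with a CAD kernel (which is outside this sketch), the 3D graph described above could be assembled roughly as follows; the feature choices and names here are hypothetical.

```python
import torch

def build_face_graph(face_uv_samples, adjacency_pairs):
    """Assemble a 3D-graph representation of a B-rep model: one node per face
    located at its centroid, edges from face adjacency, node features from points
    sampled on the face in the UV domain.

    face_uv_samples: list of (S_i, 3) tensors of 3D points sampled per face
    adjacency_pairs: list of (i, j) index pairs of adjacent faces
    """
    centroids = torch.stack([pts.mean(dim=0) for pts in face_uv_samples])     # (F, 3) node positions
    node_feats = torch.stack([torch.cat([pts.mean(0), pts.std(0)])            # crude per-face summary
                              for pts in face_uv_samples])                    # (F, 6)
    edge_index = torch.tensor(adjacency_pairs, dtype=torch.long).t()          # (2, E)
    degree = torch.bincount(edge_index.reshape(-1), minlength=len(face_uv_samples))
    return centroids, node_feats, edge_index, degree
```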