CVPR.2026 - Highlight

Total: 575

#1 Kaleidoscopic Scintillation Event Imaging

Authors: Alex Bocchieri, John Mamish, David Appleyard, Andreas Velten

Scintillators are transparent materials that interact with high-energy particles and emit visible light as a result. They are used in state-of-the-art methods of measuring high-energy particles and radiation sources. Most existing methods use fast single-pixel detectors to detect and time scintillation events. Cameras provide spatial resolution but can only capture an average over many events, making it difficult to image the events associated with an individual particle. Emerging single-photon avalanche diode cameras combine speed and spatial resolution to enable capturing images of individual events. This allows us to use machine vision techniques to analyze events, enabling new types of detectors. The main challenge is the very low brightness of the events. Techniques have to work with a very limited number of photons. We propose a kaleidoscopic scintillator to increase light collection in a single-photon camera while preserving the event's spatial information. The kaleidoscopic geometry creates mirror reflections of the event in known locations for a given event location that are captured by the camera. We introduce theory for imaging an event in a kaleidoscopic scintillator and an algorithm to estimate the event's 3D position. We find that the kaleidoscopic scintillator design provides sufficient light collection to perform high-resolution event measurements for advanced radiation imaging techniques using a commercial CMOS single-photon camera.

Subject: CVPR.2026 - Highlight

#2 Learning from Oblivion: Predicting Knowledge-Overflowed Weights via Retrodiction of Forgetting

Authors: Jinhyeok Jang, Jaehong Kim, Jung Uk Kim

Pre-trained weights have become a cornerstone of modern deep learning, enabling efficient knowledge transfer and improving downstream task performance, especially in data-scarce scenarios. However, a fundamental question remains: how can we obtain better pre-trained weights that encapsulate more knowledge beyond the given dataset? In this work, we introduce KNowledge-Overflowed Weights (KNOW) prediction, a novel strategy that leverages structured forgetting and its inversion to synthesize knowledge-enriched weights. Our key insight is that sequential fine-tuning on progressively downsized datasets induces a structured forgetting process, which can be modeled and reversed to recover knowledge as if trained on a larger dataset. We construct a dataset of weight transitions governed by this controlled forgetting and employ meta-learning to model weight prediction effectively. Specifically, our KNowledge-Overflowed Weights Nowcaster (KNOWN) acts as a hyper-model that learns the general evolution of weights and predicts enhanced weights with improved generalization. Extensive experiments across diverse datasets and architectures demonstrate that KNOW prediction consistently outperforms naive fine-tuning and simple weight prediction, leading to superior downstream performance. Our work provides a new perspective on reinterpreting forgetting dynamics to push the limits of knowledge transfer. The code and pre-trained model are available at https://github.com/jjh6297/KNOW

#3 Training One Model to Master Cross-Level Agentic Actions via Reinforcement Learning

Authors: Kaichen He, Zihao Wang, Muyao Li, Anji Liu, Yitao Liang

The paradigm of agentic AI is shifting from engineered complex workflows to post-training native models. However, existing agents are typically confined to static, predefined action spaces, such as exclusively using APIs, GUI events, or robotic commands. This rigidity limits their adaptability in dynamic environments where the optimal granularity of interaction varies contextually. To bridge this gap, we propose CrossHA, a unified agentic model that masters heterogeneous action spaces and autonomously selects the most effective interface for each step of a trajectory. We introduce a comprehensive training pipeline that integrates cold-start supervised fine-tuning with a Multi-Turn Group Relative Policy Optimization (GRPO) algorithm. This approach enables the agent to learn adaptive action switching, balancing high-level efficiency with low-level precision, without human-specified rules. Extensive experiments on over 800 tasks in the open-world Minecraft environment demonstrate that CrossHA achieves state-of-the-art performance. By dynamically leveraging the strengths of diverse action spaces, our model significantly outperforms fixed-action baselines, exhibiting superior generalization and efficiency in long-horizon reasoning. All code and models are available at https://github.com/CraftJarvis/OpenHA.
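
As a rough illustration of the group-relative idea behind GRPO (not the multi-turn variant used for CrossHA, whose details the abstract does not give), the sketch below normalizes each rollout's reward against the other rollouts sampled for the same task. The function name, the binary rewards, and the choice of the population standard deviation are assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each rollout relative to the other rollouts in its group.

    Hypothetical helper: a rollout's advantage is its reward minus the
    group mean, divided by the group standard deviation (plus eps so a
    group with identical rewards yields zeros instead of dividing by 0).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts of one task: 1.0 = success, 0.0 = failure.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful rollouts receive positive advantages and failed ones negative, so no separate value network is needed to supply a baseline.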

#4 GenErase: Generalizable and Semantically-Aware Concept Erasure in Diffusion Models

Authors: Korada Sri Vardhana, Soma Biswas

Text-to-Image (T2I) diffusion models power modern creative tools, but their open-ended generative nature raises safety, ethical, and copyright concerns. Retraining or fine-tuning to remove every unsafe or copyrighted concept is impractical, motivating training-free interventions that suppress specific semantics while preserving general visual quality. Existing guard-railing methods face a core trade-off: they are either rigid, failing to generalize to paraphrased or context-shifted prompts, or coarse, distorting unrelated content and fidelity. We present GenErase (GENeralizable ERAsure with SEmantic Awareness), a training-free, geometry-grounded framework for robust concept removal in diffusion models. GenErase enforces semantic orthogonality in the cross-attention value space via an explicit erase-and-replace operation, guided by a per-token preserve projector and a hard geometric gate. This design enables precise erasure, explicit protection of critical semantics, and stability across layers, paraphrases, and multi-concept cases. Extensive experiments on identity, object, and style erasure, together with a new GenBench-40 benchmark, show that GenErase achieves state-of-the-art erasure fidelity and superior paraphrase-level generalization, establishing it as a practical and principled guard-rail for safe, real-time diffusion deployment.

#5 DABO: Difficulty-Aware Bayesian Optimization with Diffusion-Learned Priors

Authors: Mengyang Li, Pinlong Zhao

The efficiency of hyperparameter optimization (HPO) is critical for deep learning, yet state-of-the-art methods share a fundamental flaw: they are difficulty-agnostic, treating all hyperparameter configurations homogeneously. This approach leads to inefficient resource allocation, wasting budget in simple regions while under-exploring complex, rugged landscapes, and thereby critically undermining both search efficiency and final performance. To address this universal challenge, we introduce DABO, a framework that pioneers difficulty-aware tuning within the efficient context of Freeze-Thaw Bayesian Optimization. We first model optimization difficulty hierarchically. Then, departing from hand-crafted priors, we train a conditional diffusion model on 120,000 real learning curves, generating synthetic data with 2.3x higher fidelity. This data trains our difficulty-aware surrogate model and acquisition function to dynamically adapt the search strategy. Across 75 tasks, DABO reduces regret by 11-18% compared to the leading difficulty-agnostic method, ifBO. Our work establishes a new paradigm for HPO, shifting the focus from configuration-centric to difficulty-aware resource allocation to enable more robust and efficient optimization.

#6 TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures

Authors: Hyeongjin Nam, Daniel Sungho Jung, Kyoung Mu Lee

Joint reconstruction of 3D human and object from a single image is an active research area, with pivotal applications in robotics and digital content creation. Despite recent advances, existing approaches suffer from two fundamental limitations. First, their reconstructions rely heavily on physical contact information, which inherently cannot capture non-contact human-object interactions, such as gazing at or pointing toward an object. Second, the reconstruction process is primarily driven by local geometric proximity, neglecting the human and object appearances that provide global context crucial for understanding holistic interactions. To address these issues, we introduce TeHOR, a framework built upon two core designs. First, beyond contact information, our framework leverages text descriptions of human-object interactions to enforce semantic alignment between the 3D reconstruction and its textual cues, enabling reasoning over a wider spectrum of interactions, including non-contact cases. Second, we incorporate appearance cues of the 3D human and object into the alignment process to capture holistic contextual information, thereby ensuring visually plausible reconstructions. As a result, our framework produces accurate and semantically coherent reconstructions, achieving state-of-the-art performance.

#7 EcoSplat: Efficiency-controllable Feed-forward 3D Gaussian Splatting from Multi-view Images

Authors: Jongmin Park, Minh-Quan Viet Bui, Juan Luis Gonzalez, Jaeho Moon, Jihyong Oh, Munchurl Kim

Feed-forward 3D Gaussian Splatting (3DGS) enables efficient one-pass scene reconstruction, providing 3D representations for novel view synthesis without per-scene optimization. However, existing methods typically predict pixel-aligned primitives per view, producing an excessive number of primitives in dense-view settings and offering no explicit control over the number of predicted Gaussians. To address this, we propose EcoSplat, the first efficiency-controllable feed-forward 3DGS framework that adaptively predicts the 3D representation for any given target primitive count at inference time. EcoSplat adopts a two-stage optimization process. The first stage is Pixel-aligned Gaussian Training (PGT), where our model learns initial primitive prediction. The second stage is Importance-aware Gaussian Finetuning (IGF), where our model learns to rank primitives and adaptively adjust their parameters based on the target primitive count. Extensive experiments across multiple dense-view settings show that EcoSplat is robust and outperforms state-of-the-art methods under strict primitive-count constraints, making it well-suited for flexible downstream rendering tasks. Code and project page will be released.

#8 AD-GBC: Anisotropic Granular-Ball Skip-Connection Refiner for UNet-Based Medical Image Segmentation

Authors: Xiya Shen, Qinglin Zhao, Li Feng

Prototype or region-attention modules have recently improved medical image segmentation but still suffer from two fundamental limitations: 1) they represent each semantic concept as a point or isotropic region, failing to capture the inherently anisotropic geometry of real feature distributions; and 2) many rely on non-differentiable clustering or one-way kernel weighting, which restricts their ability to form coherent region-level representations. We address these issues with the Anisotropic Differentiable Granular-Ball (AD-GBC) module, which generalizes prototypes into learnable geometric regions parameterized by a center and an anisotropic vector scale. AD-GBC aggregates local features into region-level semantics and redistributes the refined representation back to pixels in a fully differentiable manner, enabling geometry-aware refinement within modern UNet-style architectures. Two geometric regularizers, a Wasserstein-based diversity loss and a scale consistency loss, mitigate center collapse and encourage stable, well-formed region geometry. AD-GBC yields consistent improvements across four widely used medical segmentation benchmarks (BUSI, GlaS, CVC-ClinicDB, ISIC17) when integrated into two strong backbones (Rolling-UNet and U-KAN), demonstrating that the proposed geometric region formulation generalizes well across different imaging conditions. The code is available at https://github.com/SiaShen-dot/AD-GBC.

#9 Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers

Authors: Jian Ma, Qirong Peng, Xujie Zhu, Peixing Xie, Chen Chen, Haonan Lu

Diffusion Transformers (DiTs) have shown exceptional performance in image generation, yet their large parameter counts incur high computational costs, impeding deployment in resource-constrained settings. To address this, we propose Pluggable Pruning with Contiguous Layer Distillation (PPCL), a flexible structured pruning framework specifically designed for DiT architectures. First, we identify redundant layer intervals through a linear probing mechanism combined with the first-order differential trend analysis of similarity metrics. Subsequently, we propose a plug-and-play teacher-student alternating distillation scheme tailored to integrate depth-wise and width-wise pruning within a single training phase. This distillation framework enables flexible knowledge transfer across diverse pruning ratios, eliminating the need for per-configuration retraining. Extensive experiments on multiple Multi-Modal Diffusion Transformer architecture models demonstrate that PPCL achieves a 50% reduction in parameter count compared to the full model, with less than 3% degradation in key objective metrics. Notably, our method maintains high-quality image generation capabilities while achieving higher compression ratios, rendering it well-suited for resource-constrained environments.

#10 Fine-VAD: Towards Fine-Grained Video Anomaly Detection via Progressive Cross-Granularity Learning

Authors: Menghao Zhang, Yiyan Zhu, Pengfei Ren, Haifeng Sun, Qi Qi, Zirui Zhuang, Huazheng Wang, Lei Zhang, Jianxin Liao, Jingyu Wang

In this paper, we explore video anomaly detection (VAD) from a fine-grained perspective, which aims not only to detect anomalous events but also to identify their specific categories. Due to the limited number of examples per category, existing methods either fail to handle intra-class variation across diverse contexts or struggle with inter-class confusion caused by shared visual primitives. To address these challenges, we propose a progressive cross-granularity learning paradigm that leverages coarse- and fine-grained labels in a complementary manner to progressively refine representations from generic anomaly patterns to category-specific semantics. Building on this paradigm, we develop Fine-VAD, a progressive alignment framework that aligns video features with supervision signals at multiple granularities. Extensive experiments on two benchmark datasets demonstrate that Fine-VAD achieves up to 47.7% relative improvement in fine-grained anomaly classification. Notably, our paradigm generalizes well across diverse model architectures, offering an adaptable and effective solution for real-world fine-grained VAD.

#11 CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention

Authors: Jiacheng Tang, Zhiyuan Zhou, Zhuolin He, Jia Zhang, Kai Zhang, Jian Pu

Planning-oriented end-to-end driving models show great promise, yet they fundamentally learn statistical correlations instead of true causal relationships. This vulnerability leads to causal confusion, where models exploit dataset biases as shortcuts, critically harming their reliability and safety in complex scenarios. To address this, we introduce CausalVAD, a de-confounding training framework that leverages causal intervention. At its core, we design the sparse causal intervention scheme (SCIS), a lightweight, plug-and-play module to instantiate the backdoor adjustment theory in neural networks. SCIS constructs a dictionary of prototypes representing latent driving contexts. It then uses this dictionary to intervene on the model's sparse vectorized queries. This step actively eliminates spurious associations induced by confounders, thereby removing spurious factors from the representations for downstream tasks. Extensive experiments on benchmarks such as nuScenes show CausalVAD achieves state-of-the-art planning accuracy and safety. Furthermore, our method demonstrates superior robustness against both data bias and noisy scenarios configured to induce causal confusion.
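
For reference, the standard backdoor adjustment from causal inference, which SCIS is described as instantiating, can be written as below. Here $X$ is the model input, $Y$ the prediction, and $z$ ranges over confounder values. Reading $z$ as an entry of the dictionary of driving-context prototypes is an interpretation of the abstract, not a formula taken from the paper.

```latex
P\bigl(Y \mid \mathrm{do}(X)\bigr) = \sum_{z} P\bigl(Y \mid X, z\bigr)\, P(z)
```

The sum averages the prediction over all confounder values weighted by their prior $P(z)$ rather than by $P(z \mid X)$, which is what cuts the spurious path from the confounder to the input.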

#12 Vocabulary Scaling Law: Tuning Open-vocabulary Predictors for Their Openness

Authors: Ziliang Chen, Yulu Li, Liangda Fang, Jusheng Zhang, Yongsen Zheng, Quanlong Guan, Xipeng Chen

Open-vocabulary learning on CLIP generalizes remarkably well across diverse concepts, but it falters under realistic streaming open-world evaluations of Stability against distractor classes and Extensibility to novel classes. Current fine-tuning methods often fail these tests because they are mainly designed for closed-set conditions, leading to performance gaps as the target vocabulary progressively scales. We formalize a "vocabulary scaling law" showing that these openness measures can be lower-bounded by performance on the full class-name universe, implying that robust fine-tuning should: (i) account for the entire vocabulary, (ii) tune class-name embeddings rather than context, and (iii) enforce orthogonality between prompt embeddings, including those of training and open-set class names. Guided by our analysis, we propose Submodular-Vocabulary Fine-tuning (SVFT), a bi-level optimization framework that approximates the intractable objective of tuning all class-name embeddings by greedily selecting a small, informative subset of class names via constrained submodular maximization. This allows an efficient greedy algorithm to find a near-optimal class-name subset for fine-tuning CLIP instead of using all open classes. Across extensive experiments, SVFT consistently improves both stability and extensibility, advancing the openness and practical robustness of CLIP-based vision-language models.
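
The greedy subset selection mentioned above can be illustrated with a generic cardinality-constrained example. The abstract does not state SVFT's actual objective, so the facility-location function, the similarity matrix, and all names below are assumptions chosen only to show how greedy submodular maximization proceeds.

```python
def facility_location(sim, selected):
    """f(S) = sum_i max_{j in S} sim[i][j]; monotone submodular for sim >= 0."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def greedy_select(sim, budget):
    """Repeatedly add the candidate with the largest marginal gain."""
    selected = []
    candidates = set(range(len(sim)))
    for _ in range(min(budget, len(sim))):
        base = facility_location(sim, selected)
        best = max(
            sorted(candidates),
            key=lambda j: facility_location(sim, selected + [j]) - base,
        )
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical similarities: sim[i][j] is how well class name j covers class name i.
sim = [
    [1.0, 0.9, 0.1, 0.0, 0.1],
    [0.8, 1.0, 0.2, 0.1, 0.1],
    [0.1, 0.2, 1.0, 0.8, 0.7],
    [0.0, 0.1, 0.8, 1.0, 0.5],
    [0.1, 0.1, 0.7, 0.5, 1.0],
]
chosen = greedy_select(sim, budget=2)
```

With this matrix the first pick covers the cluster of names 2 to 4 and the second pick covers names 0 and 1, so two names stand in for the whole vocabulary. For monotone submodular objectives under a cardinality constraint, this greedy rule carries the classical (1 - 1/e) approximation guarantee.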

#13 A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens

Authors: Tommie Kerssies, Gabriele Berton, Ju He, Qihang Yu, Wufei Ma, Daan de Geus, Gijs Dubbelman, Liang-Chieh Chen

Anticipating diverse future states is a central challenge in video world modeling. Discriminative world models produce deterministic predictions that implicitly average over possible futures, while existing generative world models remain computationally expensive. Recent work demonstrates that predicting the future in the feature space of a vision foundation model (VFM), rather than a latent space optimized for pixel reconstruction, requires significantly fewer world model parameters. However, most such approaches remain discriminative. In this work, we introduce DeltaWorld, a generative VFM-based world model that efficiently generates diverse plausible futures. At the core of DeltaWorld is DeltaTok, a tokenizer that encodes the feature difference between consecutive frames into a single continuous "delta" token, reducing video from a three-dimensional spatio-temporal representation to a one-dimensional temporal sequence. For example, this yields a 1,024x token reduction with 512x512 frames. Delta tokens enable efficient and effective multi-hypothesis training, where many diverse futures are generated in parallel and only the best is supervised. At inference, this leads to diverse predictions in a single forward pass. Experiments on dense forecasting tasks demonstrate that DeltaWorld forecasts futures that more closely align with real-world outcomes, while having over 35x fewer parameters and using 2,000x fewer FLOPs than existing generative world models. Project page: https://deltatok.github.io.
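
Two of the claims above lend themselves to a small worked sketch: the token-count arithmetic and the "only the best is supervised" training rule. The patch size of 16 is an assumption picked because it reproduces the stated 1,024x figure for 512x512 frames; the mean-squared-error loss and the helper names are likewise illustrative and not taken from the paper.

```python
def tokens_per_frame(frame_size, patch_size):
    """Number of patch tokens in a square frame split into square patches."""
    side = frame_size // patch_size
    return side * side


def best_of_many_loss(hypotheses, target):
    """Return (index, loss) of the hypothesis closest to the target.

    Only this winning hypothesis would receive a training signal.
    """
    losses = [
        sum((h - t) ** 2 for h, t in zip(hyp, target)) / len(target)
        for hyp in hypotheses
    ]
    best = min(range(len(losses)), key=losses.__getitem__)
    return best, losses[best]


# 32 * 32 patch tokens per frame, replaced by a single delta token.
reduction = tokens_per_frame(512, 16)

# Three hypothetical predicted futures (as tiny feature vectors) and one outcome.
hypotheses = [[0.0, 0.0], [1.0, 2.0], [3.0, 3.0]]
best_index, best_loss = best_of_many_loss(hypotheses, target=[1.0, 2.5])
```

Because the other hypotheses are not penalized, they remain free to cover alternative futures instead of collapsing toward the average.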

#14 High-Quality and Efficient Turbulence Mitigation with Events

Authors: Xiaoran Zhang, Jian Ding, Yuxing Duan, Haoyue Liu, Gang Chen, Yi Chang, Luxin Yan

Turbulence mitigation (TM) is highly ill-posed due to the stochastic nature of atmospheric turbulence. Most methods rely on multiple frames recorded by conventional cameras to capture stable patterns in natural scenarios. However, they inevitably suffer from a trade-off between accuracy and efficiency: more frames enhance restoration at the cost of higher system latency and larger data overhead. Event cameras, equipped with microsecond temporal resolution and efficient sensing of dynamic changes, offer an opportunity to break the bottleneck. In this work, we present EHETM, a high-quality and efficient TM method inspired by the superiority of events to model motions in continuous sequences. We discover two key phenomena: (1) turbulence-induced events exhibit distinct polarity alternation correlated with sharp image gradients, providing structural cues for restoring scenes; and (2) dynamic objects form spatiotemporally coherent "event tubes" in contrast to irregular patterns within turbulent events, providing motion priors for disentangling objects from turbulence. Based on these insights, we design two complementary modules that respectively leverage polarity-weighted gradients for scene refinement and event-tube constraints for motion decoupling, achieving high-quality restoration with few frames. Furthermore, we construct two real-world event-frame turbulence datasets covering atmospheric and thermal cases. Extensive experiments show that EHETM outperforms SOTA methods, especially under scenes with dynamic objects, while reducing data overhead and system latency by approximately 77.3% and 89.5%, respectively.

#15 Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis

Authors: Yuanzhe Li, Hao Chen, Rui Yin, Juyan Ba, Yu Zhang, Sheng Lu

Recent vision-language models (VLMs) have shown strong generalization and multimodal reasoning abilities in natural domains. However, their application to medical diagnosis remains limited by the lack of comprehensive and structured datasets that capture real clinical workflows. To advance the development of VLMs for clinical applications, particularly in gastric cancer, we introduce Gastric-X, a large-scale multimodal benchmark for gastric cancer analysis providing 1.7K cases. Each case in Gastric-X includes paired resting and dynamic CT scans, an endoscopic image, a set of structured biochemical indicators, expert-authored diagnostic notes, and bounding box annotations of tumor regions, reflecting realistic clinical conditions. We systematically examine the capability of recent VLMs on five core tasks: Visual Question Answering (VQA), report generation, cross-modal retrieval, disease classification, and lesion localization. These tasks simulate critical stages of clinical workflow, from visual understanding and reasoning to multimodal decision support. Through this evaluation, we aim not only to assess model performance but also to probe the nature of VLM understanding: Can current VLMs meaningfully correlate biochemical signals with spatial tumor features and textual reports? We envision Gastric-X as a step toward aligning machine intelligence with the cognitive and evidential reasoning processes of physicians, and as a resource to inspire the development of next-generation medical VLMs.

#16 Efficient and Training-Free Single-Image Diffusion Models

Authors: Haojun Qiu, Kiriakos N. Kutulakos, David B. Lindell

We consider the problem of generating images whose internal structure, defined by the distribution of patches across multiple scales, matches that of a single reference image. Recent approaches address this problem by training a diffusion model on a single image. But even in this setting, training is computationally expensive and requires hours of optimization. Instead, we model the image using a dataset of its patches at different scales. As this dataset is finite and the dimensionality of its patches is small, the score function for a noisy patch can be computed tractably using an optimal, closed-form denoiser, eliminating the need for neural network training. We integrate this patch-based denoiser into an efficient, training-free image diffusion model, and we describe how our method connects to classical patch-based image restoration techniques. Our approach achieves state-of-the-art generation quality and diversity compared to trained single-image diffusion models, and we demonstrate applications, including unconditional image generation, text-guided stylization, image symmetrization, and retargeting. Further, we show that our approach is compatible with latent space diffusion, and we show multiple additional acceleration techniques to achieve megapixel single-image generation in one second, and gigapixel generation in minutes.
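
To make the "closed-form denoiser" concrete: for a finite, uniformly weighted set of clean patches and additive Gaussian noise of standard deviation sigma, the minimum-mean-squared-error estimate of the clean patch is a softmax-weighted average of the dataset patches. The sketch below assumes that noise model (noisy = clean + sigma * noise); the function name and the toy two-pixel patches are hypothetical, and the paper's multi-scale pipeline and acceleration techniques are not reproduced.

```python
import math


def optimal_denoise(noisy, patches, sigma):
    """Posterior mean of the clean patch given a noisy one.

    Weights are softmax(-||noisy - patch||^2 / (2 * sigma^2)) over the
    finite patch set, computed with the usual max-subtraction for stability.
    """
    logits = [
        -sum((n - p) ** 2 for n, p in zip(noisy, patch)) / (2.0 * sigma ** 2)
        for patch in patches
    ]
    top = max(logits)
    weights = [math.exp(value - top) for value in logits]
    total = sum(weights)
    return [
        sum(w * patch[d] for w, patch in zip(weights, patches)) / total
        for d in range(len(noisy))
    ]


patches = [[0.0, 0.0], [1.0, 1.0]]  # toy "dataset" of two-pixel patches

# High noise: both patches are equally plausible, so the estimate is their mean.
midpoint = optimal_denoise([0.5, 0.5], patches, sigma=1.0)

# Low noise: the estimate snaps to the nearest dataset patch.
snapped = optimal_denoise([0.9, 1.0], patches, sigma=0.05)
```

The cost of one evaluation is linear in the number of patches and in the patch dimension, which is why a small patch size and a finite dataset keep the computation tractable.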

#17 OSA: Echocardiography Video Segmentation via Orthogonalized State Update and Anatomical Prior-aware Feature Enhancement

Authors: Rui Wang, Huisi Wu, Jing Qin

Accurate and temporally consistent segmentation of the left ventricle from echocardiography videos is essential for estimating the ejection fraction and assessing cardiac function. However, modeling spatiotemporal dynamics remains difficult due to severe speckle noise and rapid non-rigid deformations. Existing linear recurrent models offer efficient in-context associative recall for temporal tracking, but rely on unconstrained state updates, which cause progressive singular value decay in the state matrix, a phenomenon known as rank collapse, resulting in anatomical details being overwhelmed by noise. To address this, we propose OSA, a framework that constrains the state evolution on the Stiefel manifold. We introduce the Orthogonalized State Update (OSU) mechanism, which formulates the memory evolution as Euclidean projected gradient descent on the Stiefel manifold to prevent rank collapse and maintain stable temporal transitions. Furthermore, an Anatomical Prior-aware Feature Enhancement module explicitly separates anatomical structures from speckle noise through a physics-driven process, providing the temporal tracker with noise-resilient structural cues. Comprehensive experiments on the CAMUS and EchoNet-Dynamic datasets show that OSA achieves state-of-the-art segmentation accuracy and temporal stability, while maintaining real-time inference efficiency for clinical deployment. Codes are available at https://github.com/wangrui2025/OSA.

#18 Gated KalmaNet: A Fading Memory Layer through Test-time Ridge Regression

Authors: Liangzu Peng, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Wei Xia, Stefano Soatto

As efficient alternatives to softmax Attention, linear state space models (SSMs) achieve constant memory and linear compute, but maintain only a lossy, fading summary of the past, often leading to inferior performance in recall-oriented settings. We propose Gated KalmaNet (GKA), a layer that reduces this gap by accounting for the full past when predicting the next token, while maintaining SSM-style efficiency. GKA is inspired by the Kalman Filter and solves ridge regression problems online at test time, with constant memory and linear time in the sequence length. A key insight is that standard Kalman filter equations are numerically unstable in low-precision environments (e.g., bfloat16) and difficult to parallelize on modern GPUs. We address both challenges via two innovations: (1) an adaptive regularization strategy with input-dependent gating that controls the condition number of the problem, ensuring numerical stability and balancing memory retention; (2) the use of Chebyshev Iteration instead of conventional iterative solvers, which we show to be more stable in low-precision settings. To improve scalability, we implement Chebyshev Iteration in a hardware-aware, chunk-wise manner, along with custom kernels for backpropagating through our adaptive regularization and gating mechanisms. On short-context tasks, GKA shows strong language understanding capabilities and outperforms existing SSMs (e.g., Mamba2 and Gated DeltaNet). On long-context tasks, GKA excels at real-world RAG and LongQA tasks up to 128k tokens with more than 10% relative improvement over baselines. Finally, we show GKA outperforms Mamba when extended for ImageNet classification.
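
A minimal sketch of the textbook Chebyshev iteration applied to a ridge regression system may help place the two ingredients named above. It solves (K^T K + lam * I) w = K^T v for a tiny hand-written example. It is not the paper's chunk-wise, gated implementation; the eigenvalue bounds (lam below, lam plus the trace of K^T K above) and all names are assumptions for this illustration.

```python
def matvec(A, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def chebyshev_solve(A, b, lam_min, lam_max, iters=60):
    """Chebyshev iteration for a symmetric positive definite system A x = b.

    lam_min and lam_max must enclose the eigenvalues of A. Unlike conjugate
    gradient, the step sizes depend only on these bounds, not on inner
    products of the residual.
    """
    d = (lam_max + lam_min) / 2.0  # center of the spectral interval
    c = (lam_max - lam_min) / 2.0  # half-width of the spectral interval
    x = [0.0] * len(b)
    r = list(b)                    # residual for the zero initial guess
    p, alpha = list(r), 1.0 / d
    for k in range(iters):
        if k > 0:
            beta = 0.5 * (c * alpha) ** 2 if k == 1 else (c * alpha / 2.0) ** 2
            alpha = 1.0 / (d - beta / alpha)
            p = [ri + beta * pi for ri, pi in zip(r, p)]
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [bi - ai for bi, ai in zip(b, matvec(A, x))]
    return x


# Hypothetical keys K = [[1, 0], [0, 1], [1, 1]] and values v = [1, 2, 3].
lam = 0.5
gram = [[2.0, 1.0], [1.0, 2.0]]  # K^T K
A = [[gram[i][j] + (lam if i == j else 0.0) for j in range(2)] for i in range(2)]
b = [4.0, 5.0]                   # K^T v
w = chebyshev_solve(A, b, lam_min=lam, lam_max=lam + gram[0][0] + gram[1][1])
residual = max(abs(bi - ai) for bi, ai in zip(b, matvec(A, w)))
```

Raising lam narrows the ratio between the two bounds, which lowers the condition number and speeds convergence; this is the general reason a regularization strength can be used to control numerical behavior.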

#19 Plug-and-Play Incomplete Multi-View Clustering via Janus-Faced Affinity Learning with Topology Harmonization

Authors: Shengju Yu, Suyuan Liu, Wenhao Shao, Siwei Wang, Ke Liang, Xihong Yang, Tiejun Li, Xinwang Liu

Prevailing incomplete multi-view clustering (IMVC) approaches typically fail to account for the interference of view-exclusive artifacts when learning view-consensus representations, which could compromise the fidelity of the resulting similarity measure. Moreover, inconsistencies in anchor order across views may distort the graph structure, impairing the clustering performance. The reliance on carefully tuned regularization hyper-parameters also usually undermines the model's practical utility. To alleviate these issues, we propose a plug-and-play IMVC framework named PJFTH that incorporates Janus-faced affinity learning with topology harmonization. It explicitly models the exclusive-to-consensus interplay, derives a view-private graph from each view, and adaptively integrates them into a global consensus affinity according to the respective view's intrinsic characteristics. Furthermore, a permutation transformation with unary encoding constraints is applied to the anchor matrix, realigning anchor topology while preserving the values. This process synchronizes anchor order prior to similarity integration and maintains original anchor properties. Notably, all components are coupled seamlessly and optimized in a joint manner. In addition, the provable overall linear complexity further improves its scalability and practicality. Experimental results confirm that PJFTH achieves competitive performance compared to several leading methods.

#20 Anchoring and Rescaling Attention for Semantically Coherent Inbetweening

Authors: Tae Eun Choi, Sumin Shim, Junhyeok Kim, Seong Jae Hwang

Generative inbetweening (GI) seeks to synthesize realistic intermediate frames between the first and last keyframes beyond mere interpolation. As sequences become sparser and motions larger, previous GI models produce inconsistent frames with unstable pacing and semantic misalignment. Since GI involves fixed endpoints and numerous plausible paths, the task requires additional guidance from the keyframes and text to specify the intended path. We therefore inject semantic and temporal guidance from the keyframes and text into each intermediate frame through Keyframe-anchored Attention Bias. We also better enforce frame consistency with Rescaled Temporal RoPE, which allows self-attention to attend to keyframes more faithfully. TGI-Bench, the first benchmark specifically designed for text-conditioned GI evaluation, enables challenge-targeted evaluation to analyze GI models. Without additional training, our method achieves state-of-the-art frame consistency, semantic fidelity, and pace stability for both short and long sequences across diverse challenges.

Subject: CVPR.2026 - Highlight

#21 Frequency-domain Manipulation for Face Obfuscation [PDF] [Copy] [Kimi] [REL]

Authors: Jintae Kim, Keunsoo Ko, Chang-Su Kim

Facial image datasets have become essential resources for various face analysis tasks, but their use raises significant privacy concerns. To address this issue, face obfuscation has emerged as a practical approach to hide identity from humans while retaining cues decipherable by machines. However, existing methods often leave exploitable visual traces, making them vulnerable to reconstruction attacks that restore the hidden identity. To overcome this weakness, we propose a frequency-domain manipulation framework, called FreM, which adjusts frequency subbands differently to hide identity, retain machine-decipherable cues, and improve robustness against reconstruction attacks. Specifically, FreM first decomposes a facial image into frequency subbands and applies subband-adaptive modulation that regulates information according to the characteristics of each subband. The modulation parameters are then refined to yield a reliable obfuscated result. Extensive experiments across multiple face analysis benchmarks demonstrate that FreM achieves superior obfuscation quality and strong robustness against reconstruction attacks.

Subject: CVPR.2026 - Highlight

#22 ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions [PDF] [Copy] [Kimi] [REL]

Authors: Zikai Wang, Zhilu Zhang, Yiqing Wang, Hui Li, Wangmeng Zuo

Existing hand-object interaction (HOI) methods are largely limited to rigid objects, while 4D reconstruction methods for articulated objects generally require pre-scanning the object or even multi-view videos. It remains an unexplored but significant challenge to reconstruct 4D human-articulated-object interactions from a single monocular RGB video. Fortunately, recent advancements in foundation models present a new opportunity to address this highly ill-posed problem. To this end, we introduce ArtHOI, an optimization-based framework that integrates and refines priors from multiple foundation models. Our key contribution is a suite of novel methodologies designed to resolve the inherent inaccuracies and physical implausibility of these priors. In particular, we introduce an Adaptive Sampling Refinement (ASR) method to optimize the object's metric scale and pose for grounding its normalized mesh in world space. Furthermore, we propose a Multimodal Large Language Model (MLLM) guided hand-object alignment method, which uses contact reasoning information as constraints for hand-object mesh composition optimization. To facilitate a comprehensive evaluation, we also contribute two new datasets, ArtHOI-RGBD and ArtHOI-Wild. Extensive experiments validate the robustness and effectiveness of ArtHOI across diverse objects and interactions. Project: https://arthoi-reconstruction.github.io.

Subject: CVPR.2026 - Highlight

#23 Hidden Monotonicity: Explaining Deep Neural Networks via their DC Decomposition [PDF] [Copy] [Kimi] [REL]

Authors: Jakob Paul Zimmermann, Georg Loho

It has been demonstrated in various contexts that monotonicity leads to better explainability in neural networks. However, not every function can be well approximated by a monotone neural network. We demonstrate that monotonicity can still be used in two ways to boost explainability. First, we use an adaptation of the decomposition of a trained ReLU network into two monotone and convex parts, thereby overcoming numerical obstacles from an inherent blowup of the weights in this procedure. Our proposed saliency methods -- SplitCAM and SplitLRP -- improve on state-of-the-art results on both VGG16 and ResNet18 networks on ImageNet-S across all Quantus saliency metric categories. Second, we show that training a model as the difference between two monotone neural networks results in a system with strong self-explainability properties.
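
As a rough illustration of a decomposition into two monotone and convex parts, the sketch below implements the textbook difference-of-convex (DC) construction for a ReLU network. It splits each weight matrix into nonnegative parts and uses the identity relu(a - c) = max(a, c) - c. This is not the authors' adaptation and does nothing about the weight blowup they mention; the function names and the tiny example network are hypothetical.

```python
def split_pos_neg(W):
    """Split a matrix into nonnegative parts so that W = W_pos - W_neg."""
    W_pos = [[max(w, 0.0) for w in row] for row in W]
    W_neg = [[max(-w, 0.0) for w in row] for row in W]
    return W_pos, W_neg

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu_forward(layers, x):
    """Plain ReLU network; the last layer is affine with no ReLU."""
    for i, (W, b) in enumerate(layers):
        x = [z + bi for z, bi in zip(matvec(W, x), b)]
        if i < len(layers) - 1:
            x = [max(z, 0.0) for z in x]
    return x

def dc_forward(layers, x):
    """Propagate a pair (g, h) whose difference g - h is the activation."""
    g, h = list(x), [0.0] * len(x)
    for i, (W, b) in enumerate(layers):
        W_pos, W_neg = split_pos_neg(W)
        # a - c equals the pre-activation W (g - h) + b.
        a = [p + q + bi
             for p, q, bi in zip(matvec(W_pos, g), matvec(W_neg, h), b)]
        c = [p + q for p, q in zip(matvec(W_neg, g), matvec(W_pos, h))]
        if i < len(layers) - 1:
            # relu(a - c) = max(a, c) - c keeps both parts convex.
            g, h = [max(ai, ci) for ai, ci in zip(a, c)], c
        else:
            g, h = a, c
    return g, h

# Hypothetical 2-3-1 network used only to exercise the functions.
layers = [
    ([[1.0, -2.0], [-0.5, 0.75], [0.25, 1.5]], [0.1, -0.2, 0.3]),
    ([[1.0, -1.0, 0.5]], [-0.4]),
]
x = [0.6, -0.3]
g, h = dc_forward(layers, x)
print(relu_forward(layers, x), [gi - hi for gi, hi in zip(g, h)])
```

Both `g` and `h` are built only from nonnegative weight combinations and maxima, so each is convex and coordinate-wise nondecreasing in the input, while their difference reproduces the original network output.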

Subject: CVPR.2026 - Highlight

#24 GenHOI: Towards Object-Consistent Hand-Object Interaction with Temporally Balanced and Spatially Selective Object Injection [PDF] [Copy] [Kimi] [REL]

Authors: Xuan Huang, Mochu Xiang, Zhelun Shen, Jinbo Wu, Chenming Wu, Chen Zhao, Kaisiyuan Wang, Hang Zhou, Shanshan Liu, Haocheng Feng, Wei He, Jingdong Wang

Hand-Object Interaction (HOI) remains a core challenge in digital human video synthesis, where models must generate physically plausible contact and preserve object identity across frames. Although recent HOI reenactment approaches have achieved progress, they are typically trained and evaluated in-domain and fail to generalize to complex, in-the-wild scenarios. In contrast, all-in-one video editing models exhibit broader robustness but still struggle with HOI-specific issues such as inconsistent object appearance. In this paper, we present GenHOI, a lightweight augmentation to pretrained video generation models that injects reference-object information in a temporally balanced and spatially selective manner. For temporal balancing, we propose Head-Sliding RoPE, which assigns head-specific temporal offsets to reference tokens, distributing their influence evenly across frames and mitigating the temporal decay of 3D RoPE to improve long-range object consistency. For spatial selectivity, we design a two-level spatial attention gate that concentrates object-conditioned attention on HOI regions and adaptively scales its strength, preserving background realism while enhancing interaction fidelity. Extensive qualitative and quantitative evaluations on unseen, in-the-wild scenes demonstrate that GenHOI significantly outperforms state-of-the-art HOI reenactment and all-in-one video editing competitors.

Subject: CVPR.2026 - Highlight

#25 End-to-End Language-Action Model for Humanoid Whole Body Control [PDF] [Copy] [Kimi] [REL]

Authors: Yuxuan Wang, Haobin Jiang, Shiqing Yao, Ziluo Ding, Zongqing Lu

Existing humanoid control systems often rely on teleoperation or modular generation pipelines that separate language understanding from physical execution. However, the former is entirely human-driven, and the latter lacks tight alignment between language commands and physical behaviors. In this paper, we present SENTINEL, a fully end-to-end language-action framework for humanoid whole-body control. We construct a large-scale dataset by tracking human motions in simulation using a pretrained whole-body controller and pairing the resulting trajectories with their text annotations. The model directly maps language commands and proprioceptive inputs to low-level actions without any intermediate representation. It generates action chunks using flow matching, which can subsequently be refined by a residual action head for real-world deployment. Our method exhibits strong semantic understanding and stable execution on humanoid robots in both simulation and real-world deployment, and also supports multi-modal extensions by converting inputs into text.
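
To make the flow-matching sampling step concrete, here is a minimal sketch of how an action chunk could be drawn by integrating a velocity field from noise with Euler steps. It is not the SENTINEL implementation. The learned network, which would be conditioned on language and proprioception, is replaced by a toy oracle field that points at a known target, and every name and parameter below is hypothetical.

```python
import random

def oracle_velocity(x, t, target):
    """Toy stand-in for a learned velocity field (not a trained model).

    For the straight-line path x_t = (1 - t) * x_0 + t * x_1, the velocity
    toward the endpoint x_1 is (x_1 - x_t) / (1 - t).
    """
    return [(ti - xi) / (1.0 - t) for xi, ti in zip(x, target)]

def sample_action_chunk(velocity_fn, horizon, action_dim, steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    rng = random.Random(seed)
    # Start from Gaussian noise over the flattened chunk.
    x = [rng.gauss(0.0, 1.0) for _ in range(horizon * action_dim)]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    # Reshape the flat vector into `horizon` actions of size `action_dim`.
    return [x[i * action_dim:(i + 1) * action_dim] for i in range(horizon)]

# Hypothetical target chunk: 4 time steps of 2-dimensional actions.
target = [0.1 * i for i in range(8)]
chunk = sample_action_chunk(
    lambda x, t: oracle_velocity(x, t, target), horizon=4, action_dim=2
)
print(chunk)
```

With the oracle field the integration lands on the target chunk; with a trained model, `velocity_fn` would be the network's prediction given the command and the robot state, and the output could then be passed to a residual refinement stage as the abstract describes.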

Subject: CVPR.2026 - Highlight