Large language models (LLMs) have significantly transformed natural language understanding and generation, but they raise privacy concerns due to the potential exposure of sensitive information. Studies have highlighted the risk of information leakage, where adversaries can extract sensitive information embedded in prompts. In this work, we introduce a novel private prediction framework for generating high-quality synthetic text with strong privacy guarantees. Our approach leverages the Differential Privacy (DP) framework to ensure worst-case theoretical bounds on information leakage without requiring any fine-tuning of the underlying models. The proposed method performs inference on private records and aggregates the resulting per-token output distributions, enabling the generation of longer, coherent synthetic text while maintaining privacy guarantees. Additionally, we propose a simple blending operation that combines private and public inference to further enhance utility. Empirical evaluations demonstrate that our approach outperforms previous state-of-the-art methods on in-context learning (ICL) tasks, making it a promising direction for privacy-preserving text generation while maintaining high utility.
Prompt tuning enables Vision-Language Models (VLMs) to efficiently adapt to new tasks through learnable prompt vectors. This naturally raises a question: do these prompts leak private information about their training data? While Membership Inference Attacks (MIAs) can quantify this risk, current methods rely on access to model outputs or internal gradients. This limitation prevents a clear assessment of a prompt’s standalone privacy leakage, particularly in deployment scenarios where such information is inaccessible. In this paper, we propose Prompt Intrinsic Privacy Risk Analyzer (PIPRA) to address this gap. As the first output-free MIA, PIPRA leverages open-source pre-trained VLMs to extract features from both prompts and samples within a shared cross-modal semantic space. By employing a contrastive learning-based feature projector to enhance these representations, PIPRA enables a subsequent discriminator to effectively perform membership inference. Extensive experiments across nine benchmark datasets and multiple VLMs show PIPRA achieves an average AUC of 87.58%, significantly outperforming traditional output-dependent methods (77.05%). These findings reveal that prompts pose a substantially greater privacy risk than previously recognized, highlighting the urgent need for prompt-level privacy protection.
Multimodal Large Language Models (MLLMs) are susceptible to the implicit reasoning risk, wherein innocuous unimodal inputs synergistically assemble into risky multimodal data that produce harmful outputs. We attribute this vulnerability to the difficulty MLLMs have in maintaining safety alignment through long-chain reasoning. To address this issue, we introduce Safe-Semantics-but-Unsafe-Interpretation (SSUI), the first dataset featuring interpretable reasoning paths tailored for such a cross-modal challenge. A novel training framework, Safety-aware Reasoning Path Optimization (SRPO), is also designed based on the SSUI dataset to align the MLLM's internal reasoning process with human safety values. Experimental results show that our SRPO-trained models achieve state-of-the-art results on key safety benchmarks, including the proposed Reasoning Path Benchmark (RSBench), significantly outperforming both open-source and top-tier commercial MLLMs.
Video Large Language Models (VideoLLMs) are increasingly deployed in numerous critical applications, where users rely on auto-generated summaries while casually skimming the video stream. We show that this interaction hides a critical safety gap: if harmful content is embedded in a video, either as full-frame inserts or as small corner patches, state-of-the-art VideoLLMs rarely mention the harmful content in the output, despite its clear visibility to human viewers. A root-cause analysis reveals three compounding design flaws: (1) insufficient temporal coverage resulting from the sparse, uniformly spaced frame sampling used by most leading VideoLLMs, (2) spatial information loss introduced by aggressive token downsampling within sampled frames, and (3) encoder-decoder disconnection, whereby visual cues are only weakly utilized during text generation. Leveraging these insights, we craft three zero-query black-box attacks, each aligned with one of these flaws in the processing pipeline. Our large-scale evaluation across five leading VideoLLMs shows that the harmfulness omission rate exceeds 90% in most cases. Even when harmful content is clearly present in all frames, these models consistently fail to identify it. These results underscore a fundamental vulnerability in current VideoLLMs' designs and highlight the urgent need for sampling strategies, token compression, and decoding mechanisms that guarantee semantic coverage rather than speed alone.
LiDAR-based 3D object detection is widely used in safety-critical systems. However, these systems remain vulnerable to backdoor attacks that embed hidden malicious behaviors during training. A key limitation of existing backdoor attacks is their lack of physical realizability, primarily due to the digital-to-physical domain gap. Digital triggers often fail in real-world settings because they overlook material-dependent LiDAR reflection properties. On the other hand, physically constructed triggers are often unoptimized, leading to low effectiveness or easy detectability. This paper introduces Material-Oriented Backdoor Attack (MOBA), a novel framework that bridges the digital–physical gap by explicitly modeling the material properties of real-world triggers. MOBA tackles two key challenges in physical backdoor design: 1) robustness of the trigger material under diverse environmental conditions, 2) alignment between the physical trigger's behavior and its digital simulation. First, we propose a systematic approach to selecting robust trigger materials, identifying titanium dioxide (TiO₂) for its high diffuse reflectivity and environmental resilience. Second, to ensure the digital trigger accurately mimics the physical behavior of the material-based trigger, we develop a novel simulation pipeline that features: (1) an angle-independent approximation of the Oren–Nayar BRDF model to generate realistic LiDAR intensities, and (2) a distance-aware scaling mechanism to maintain spatial consistency across varying depths. We conduct extensive experiments on state-of-the-art LiDAR-based and Camera-LiDAR fusion models, showing that MOBA achieves a 93.50% attack success rate, outperforming prior methods by over 41%. Our work reveals a new class of physically realizable threats and underscores the urgent need for defenses that account for material-level properties in real-world environments.
The versatility of diffusion models in generating customized images has led to unauthorized usage of personal artwork, which poses a significant threat to the intellectual property of artists. Existing approaches relying on embedding additional information, such as perturbations, watermarks, and backdoors, suffer from limited defensive capabilities and fail to protect artwork published online. In this paper, we propose StyleSentinel, an approach for copyright protection of artwork by verifying an inherent stylistic fingerprint in the artist's artwork. Specifically, we employ a semantic self-reconstruction process to enhance stylistic expressiveness within the artwork, which establishes a dense and style-consistent manifold foundation for feature learning. Subsequently, we adaptively fuse multi-layer image features to encode abstract artistic style into a compact stylistic fingerprint. Finally, we model the target artist's style as a minimal enclosing hypersphere boundary in the feature space, transforming complex copyright verification into a robust one-class learning task. Extensive experiments demonstrate that, compared with the state-of-the-art, StyleSentinel achieves superior performance on the one-sample verification task. We also demonstrate its effectiveness on online platforms.
The rapid development of image manipulation technologies poses significant challenges to multimedia forensics, especially in accurate localization of manipulated regions. Existing methods often fail to fully explore the intrinsic discrepancies between manipulated and authentic regions, resulting in sub-optimal performance. To address this limitation, we propose the Focus Region Discrepancy Network (FRD-Net), a novel and efficient framework that significantly enhances manipulation localization by amplifying discrepancies at both macro- and micro-levels. Specifically, our proposed Iterative Clustering Module (ICM) groups features into two discriminative clusters and refines representations via backward propagation from cluster centers, improving the distinction between tampered and authentic regions at the macro level. Thereafter, our Differential Progressive Module (DPM) is constructed to capture fine-grained structural inconsistencies within local neighborhoods and integrate them into a Central Difference Convolution, increasing sensitivity to subtle manipulation details at the micro level. Finally, these complementary modules are seamlessly integrated into a compact architecture that achieves a favorable balance between accuracy and efficiency. Extensive experiments on multiple benchmarks demonstrate that FRD-Net consistently surpasses state-of-the-art methods in terms of manipulation localization performance while maintaining a lower computational cost.
Large language models (LLMs) have advanced to encompass extensive knowledge across diverse domains. Yet controlling what an LLM should not know is important for ensuring alignment and thus safe use. However, effective unlearning in LLMs is difficult due to the fuzzy boundary between knowledge retention and forgetting. This challenge is exacerbated by the entangled parameter spaces resulting from continuous multi-domain training, often leading to collateral damage, especially under aggressive unlearning strategies. Furthermore, the computational overhead required to optimize state-of-the-art (SOTA) models with billions of parameters poses an additional barrier. In this work, we present ALTER, a lightweight unlearning framework for LLMs that addresses both knowledge entanglement and unlearning efficiency. ALTER operates in two phases: (I) high-entropy tokens are captured and learned via the shared A matrix in LoRA, followed by (II) an asymmetric LoRA architecture that achieves a specified forgetting objective through parameter isolation and the unlearning of tokens within the target subdomains, serving as a new research direction for unlearning via token-level isolation in an asymmetric framework. ALTER achieves SOTA performance on the TOFU, WMDP, and MUSE benchmarks with over 95% forget quality and shows minimal side effects by preserving foundational tokens. By decoupling unlearning from LLMs' billion-scale parameters, the framework delivers excellent efficiency while preserving over 90% of model utility, exceeding baseline preservation rates of 47.8-83.6%.
Recommender systems have been widely deployed across domains such as e-commerce and social media, where they suggest items like products and potential friends to users based on their preferences and interaction history, which are often privacy-sensitive. Recent studies have revealed that recommender systems are prone to membership inference attacks (MIAs), where an attacker aims to infer whether or not a user’s data has been used to train a target recommender system. However, existing MIAs fail to exploit the unique characteristics of recommender systems, and are therefore only applicable to mixed recommender systems consisting of two recommendation algorithms. This leaves a gap in investigating MIAs against hybrid-based recommender systems, where a single algorithm utilizes both user-item historical interactions and the attributes of users and items to produce personalised recommendations. To investigate how the personalisation in hybrid-based recommender systems influences MIAs, we propose a novel metric-based MIA. Specifically, we leverage this personalisation to obtain a reference recommendation for any target user. Then, a relative membership metric is proposed that exploits a target user’s historical interactions, target recommendation, and reference recommendation to infer the membership of the target user’s data. Finally, we theoretically and empirically demonstrate the efficacy of the proposed metric-based MIA on hybrid-based recommender systems.
Current sparse autoencoder (SAE) approaches to neural network interpretability assume that activations can be decomposed through linear superposition into sparse, interpretable features. Despite high reconstruction fidelity, SAEs consistently fail to eliminate polysemanticity and exhibit pathological behavioral errors. We propose that neural networks encode information in two complementary spaces compressed into the same substrate: feature identity and feature integration. To test this dual encoding hypothesis, we develop sequential and joint-training architectures to capture identity and integration patterns simultaneously. Joint training achieves 41.3% reconstruction improvement and 51.6% reduction in KL divergence errors. This architecture spontaneously develops bimodal feature organization: low squared norm features contributing to integration pathways and the rest contributing directly to the residual. Small nonlinear components (3% of parameters) achieve 16.5% standalone improvements, demonstrating parameter-efficient capture of computational relationships crucial for behavior. Additionally, intervention experiments using 2×2 factorial stimulus designs demonstrated that integration features exhibit selective sensitivity to experimental manipulations and produce systematic behavioral effects on model outputs, including significant interaction effects across semantic dimensions. This work provides systematic evidence for (1) dual encoding in neural representations, (2) meaningful nonlinearly encoded feature interactions, and (3) introduces an architectural paradigm shift from post-hoc feature analysis to integrated computational design, establishing foundations for next-generation SAEs.
Large language models (LLMs) are increasingly used for decision-making tasks where fairness is an essential desideratum. But what does fairness even mean to an LLM? To investigate this, we conduct a comprehensive evaluation of how LLMs perceive fairness in the context of resource allocation, using both synthetic and real-world data. We find that several state-of-the-art LLMs, when instructed to be fair, tend to prioritize improving collective welfare rather than distributing benefits equally. Their perception of fairness is somewhat sensitive to how user preferences are represented, but less so to the real-world context of the decision-making task. Finally, we show that the best strategy for aligning an LLM's perception of fairness to a specific criterion is to provide it as a mathematical objective, without referencing "fairness", as this prevents the LLM from mixing the criterion with its own prior notions of fairness. Our results provide practical insights for understanding and shaping how LLMs interpret fairness in resource allocation problems.
Existing audio adversarial attack methods suffer from poor transferability, primarily due to insufficient exploration of model decision mechanisms and an overreliance on heuristic-driven algorithm design. This paper aims to close this gap. Specifically, through observations across three mainstream audio tasks (Automatic Speech Recognition, Speaker Verification, and Keyword Spotting), we reveal that these models primarily rely on local temporal features: inputs whose time segments are shuffled retain 83.7% of the original accuracy. SHAP-based visualization further validates that time shuffling significantly shifts the model's salient regions, yet the samples are still correctly identified, indicating the presence of redundant features that can affect decision-making. Inspired by these findings, we propose the Time-Shuffle (TS) adversarial attack (including segment-based TS and phoneme-level TS-p). This method divides audio or phonemes into segments, randomly shuffles them, and computes gradients on the shuffled structure. By forcing perturbations to exploit transferable local temporal features and reducing overfitting to source-specific patterns, TS/TS-p inherently enhances transferability. As a model-agnostic framework, TS/TS-p can seamlessly integrate with existing attack methods. Comprehensive experiments demonstrate that TS-p achieves SOTA performance, boosting transferability by about 23%/14.7%/6.3% on ASR/ASV/KWS.
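The segment-based time-shuffle operation described in the abstract above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the segment count, zero-padding scheme, and RNG handling are all assumptions:

```python
import numpy as np

def time_shuffle(audio, n_segments=8, rng=None):
    """Split a 1-D audio signal into equal-length segments and shuffle them.

    Illustrative sketch of the segment-based time-shuffle idea: the attack
    would compute gradients on the shuffled signal so that perturbations
    exploit local temporal features rather than the source model's global
    temporal layout.
    """
    rng = np.random.default_rng(rng)
    # Pad with zeros so the signal divides evenly into segments.
    pad = (-len(audio)) % n_segments
    padded = np.concatenate([audio, np.zeros(pad, dtype=audio.dtype)])
    # Reshape into (n_segments, segment_length) and permute the rows.
    segments = padded.reshape(n_segments, -1)
    order = rng.permutation(n_segments)
    return segments[order].reshape(-1)[: len(audio)]

x = np.arange(16, dtype=np.float32)
y = time_shuffle(x, n_segments=4, rng=0)
# When no padding is needed, shuffling preserves the multiset of samples.
assert sorted(y.tolist()) == sorted(x.tolist())
```

In the attack setting, the gradient of the loss would then be taken with respect to the shuffled input, which is what discourages the perturbation from overfitting to source-specific temporal patterns.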
Generating a class integration test order (CITO) is essential to reduce the overhead of test stub construction (the primary cost in integration testing) and to ensure system reliability in complex software systems. Although reinforcement learning (RL) has shown promise in automating CITO generation, existing methods suffer from unstable policy learning and limited robustness against structural perturbations and defect injection. These challenges stem from insufficient reward shaping and the lack of reliable oracles for validation. To address these limitations, we propose LM-CITO, a stability-aware RL framework that integrates Lyapunov-guided reward shaping with semantic validation through metamorphic testing (MT). Specifically, we design a Lyapunov energy function over class dependency graphs to promote monotonic structural convergence during training, and define metamorphic relations (MRs) to verify behavioral consistency under controlled perturbations. Extensive experiments on six real-world systems demonstrate that LM-CITO consistently produces more effective policies, yielding CITOs with significantly reduced stubbing costs compared to baseline models. Furthermore, MT verifies the capability of our MRs to detect defects in 19 injected bug variants, confirming the robustness of LM-CITO under various fault-induced perturbations. These results highlight the synergy of stability guidance and MR-based validation, offering an effective, principled solution for oracle-free RL in software testing.
Attributing synthetic images to their source generative models is critical for digital forensics and security. While most existing attribution methods can distinguish images produced by known models and reject those from unknown ones, they are unable to verify whether a given image was produced by a specific, previously unseen model. To address this limitation, we formulate an open-set verification problem: determining whether a given image was generated by a specific model. Our key insight is that synthetic images from different models show consistent, content-independent fingerprints in their amplitude spectra. Based on this insight, we design a dynamic fingerprint simulator capable of simulating over 1.6 trillion generative model architectures. We further train an extractor to capture model-specific fingerprint representations with supervised contrastive learning, enabling accurate attribution of synthetic images, even from previously unseen models. Our method does not rely on any synthetic images; instead, it is trained solely on real images. On DMDetection and AIGCBenchmark, which together comprise dozens of state-of-the-art and in-the-wild generative models, our method improves the attribution performance (AUC) of the prior method from chance level to 94.05% and 83.05%, respectively. On the GenImage and OSMA datasets, we obtain 85.08% and 88.48% OSCR, outperforming the SOTA methods by 4.30% and 9.37% under the same settings.
Existing solutions for differentially private deep learning (DPDL) either require the assumption of a trusted data server (centralized DPDL) or suffer from poor utility (local DPDL), which hampers their adoption in real-world scenarios. We present CRYPTDP, a crypto-assisted differentially private deep learning approach in the two-server model. CRYPTDP employs two non-colluding servers to collaboratively and efficiently train differentially private deep learning models over secret shares of data owners' private data, while protecting the confidentiality of the data from untrusted servers. CRYPTDP is the first approach to combine the best of both the local and centralized DPDL models: like local DPDL, it does not rely on a trusted server, and like centralized DPDL, it preserves utility. In particular, we address the major challenges of performance and security that beset such a design: we introduce a new activation function that is friendly to both secure computation and differential privacy; we propose a novel garbled-circuits-free most-significant-bit extraction protocol, and, using this protocol, an efficient and secure garbled-circuits-free protocol for evaluating the activation function over secret shares. Exhaustive experiments show that CRYPTDP delivers significantly better performance than the state-of-the-art local DPDL, yields higher accuracy than the state-of-the-art centralized DPDL, and achieves up to two orders of magnitude faster runtime than the state-of-the-art approach.
Stochastic dynamical systems have emerged as fundamental models across numerous application domains, providing powerful mathematical representations for capturing uncertain system behavior. In this paper, we address the problem of runtime safety and reach-avoid probability prediction for discrete-time stochastic systems with online observations, i.e., estimating the probability that the system satisfies a given safety or reach-avoid specification. Unlike traditional approaches that rely solely on offline models, we propose a framework that incorporates real-time observations to dynamically refine probability estimates for safety and reach-avoid events. By introducing observation-aware barrier functions, our method adaptively updates probability bounds as new observations are collected, combining efficient offline computation with online backward iteration. This approach enables rigorous and responsive prediction of safety and reach-avoid probabilities under uncertainty. In addition to the theoretical guarantees, experimental results on benchmark systems demonstrate the practical effectiveness of the proposed method.
Citation recommendation aims to provide researchers with the most relevant references for their manuscripts, helping them swiftly discover pertinent studies and bolster the reliability of their arguments. However, some individuals manipulate these recommendation systems by injecting false information, such as deliberately inflated citation counts of their own papers, to obtain favorable recommendations and ratings. This form of attack, commonly termed a “shilling attack”, is not only highly concealed but also has a far-reaching impact on scientific research. To address this problem, we theoretically reveal the impact of shilling attacks on citation recommendation and propose three feasible resistance strategies: historical collaborations, significant citations, and content constraints. Based on these insights, we introduce RSA-CR, a robust, hybrid citation recommendation algorithm resistant to shilling attacks. The algorithm constructs a two-layer academic graph and uses random and content-generation strategies to initialize author and paper embeddings. Confidence-guided inductive aggregations based on collaboration and citation relationships are then performed on the author and paper sides, where the author aggregation results directly influence the paper aggregation strength. Finally, recommendations are made by measuring the distances between the fused paper embeddings. The entire learning process resembles a dumbbell, hence the term “dumbbell inductive learning”. Experiments on four academic datasets demonstrate that our method outperforms baselines in both effectiveness and robustness.
Recent advances in deep learning have enabled highly accurate six-degree-of-freedom (6DoF) object pose estimation, leading to its widespread use in real-world applications such as robotics, augmented reality, virtual reality, and autonomous systems. However, backdoor attacks pose a major security risk to deep learning models. By injecting malicious triggers into training data, an attacker can cause a model to perform normally on benign inputs but behave incorrectly under specific conditions. While most research on backdoor attacks has focused on 2D vision tasks, their impact on 6DoF pose estimation remains largely unexplored. Furthermore, unlike traditional backdoors that only change the object class, backdoors against 6DoF pose estimation must additionally control continuous pose parameters, such as translation and rotation, making existing 2D backdoor attack methods not directly applicable to this setting. To address this gap, we propose a novel backdoor attack framework (6DAttack) that exposes vulnerabilities in 6DoF pose estimation. 6DAttack uses synthetic and real 3D objects of varying shapes as triggers and assigns target poses to induce controlled erroneous pose outputs while maintaining normal behavior on clean inputs. We evaluate this attack on multiple models (including PVNet, DenseFusion, and PoseDiffusion) and datasets (including LINEMOD, YCB-Video, and CO3D). Experimental results demonstrate that 6DAttack achieves extremely high attack success rates (ASRs) without compromising performance on legitimate tasks. Across various models and objects, the backdoored models achieve up to 100% ADD accuracy on clean data, while also achieving 100% ASR under trigger conditions. The accuracy of controlled erroneous pose output is also extremely high, with triggered samples achieving 97.70% ADD-P.
These results demonstrate that the backdoor can be reliably implanted and activated, achieving a high ASR under trigger conditions while maintaining a negligible impact on benign data. Furthermore, we evaluate a representative defense and show that it remains ineffective under 6DAttack. Overall, our findings reveal a potentially serious and previously underexplored threat to modern 6DoF pose estimation models.
In-context learning (ICL) has proven to be adept at adapting large language models (LLMs) to downstream tasks without parameter updates, based on a few demonstration examples. Prior work has found that the ICL performance is susceptible to the selection of examples in prompt and made efforts to stabilize it. However, existing example selection studies ignore the ethical risks behind the examples selected, such as gender and race bias. In this work, we conduct extensive experiments and discover that (1) example selection with high accuracy does not mean low bias; (2) example selection for ICL may amplify the biases of LLMs; (3) example selection contributes to spurious correlations of LLMs. Based on the above observations, we propose the Remind with Bias-aware Embedding (ReBE), which removes the spurious correlations through contrastive learning and obtains bias-aware embedding for LLMs based on prompt tuning. Finally, we demonstrate that ReBE effectively mitigates biases of LLMs without significantly compromising accuracy and is highly compatible with existing example selection methods.
World models have garnered substantial interest in the AI community. These are internal representations that simulate aspects of the external world, track entities and states, capture causal relationships, and enable prediction of consequences. This contrasts with representations based solely on statistical correlations. A key motivation behind this research direction is that humans possess such mental world models, and finding evidence of similar representations in AI models might indicate that these models "understand" the world in a human-like way. In this paper, we use case studies from the philosophy of science literature to critically examine whether the world model framework adequately characterizes human-level understanding. We focus on specific philosophical analyses where the distinction between world model capabilities and human understanding is most pronounced. While these represent particular views of understanding rather than universal definitions, they help us explore the limits of world models.
Recently, Large Vision-Language Models (LVLMs) have been demonstrated to be vulnerable to jailbreak attacks, highlighting the urgent need for further research to comprehensively identify and mitigate these threats. Unfortunately, existing jailbreak studies primarily focus on coarse-grained input manipulation to elicit specific responses, overlooking the exploitation of internal representations, i.e., intermediate activations, which constrains their ability to penetrate alignment safeguards and generate harmful responses. To tackle this issue, we propose the Activation Manipulation (ActMan) Attack framework, which performs fine-grained activation manipulations inspired by the perception and cognition stages of human decision-making, enhancing both the penetration capability and harmfulness of attacks. To improve penetration capability, we introduce a Deceptive Visual Camouflage module inspired by the masking effect in human perception. This module uses a benign activation-guided attention redirection strategy to conceal abnormal activation patterns, thereby suppressing LVLM's defense detection during early-stage decoding. To enhance harmfulness, we design a Malicious Semantic Induction module drawing from the framing effect in human cognition, which reconstructs jailbreak instructions using malicious activation guidance to change LVLM’s risk assessment during late-stage decoding, thereby amplifying the harmfulness of model responses. Extensive experiments on six mainstream LVLMs demonstrate that our method remarkably outperforms state-of-the-art baselines, achieving an average relative ASR improvement of 12.06%.
Vertical Federated Learning (VFL) is a distributed machine learning paradigm in which participants train models with vertically partitioned data. Many previous studies have identified backdoor vulnerabilities in VFL systems. However, limited effort has been devoted to developing defenses against such attacks. Unlike centralized machine learning or horizontal FL, VFL poses new challenges for defending against backdoor attacks, particularly because the central server lacks control over the entire model. In this paper, we first explore defenses against backdoor attacks in VFL when the attacker possesses sufficient knowledge of the label information. Specifically, we propose FILTER, a framework for defending against backdoor attacks in VFL to ensure the integrity of VFL systems during training in the presence of malicious participants. To address backdoor risks in VFL, it incorporates two novel filters: an embedding-based filter and a loss-based filter, which effectively identify and remove poisoned samples in later stages of training. Through extensive experiments on five benchmark datasets against four state-of-the-art backdoor attacks, we demonstrate that FILTER significantly reduces the success rate of attacks while maintaining accuracy on clean data close to that of the models trained without such defenses.
Transformers based on the self-attention mechanism have become foundational models across a wide range of domains, creating an urgent need for effective formal verification techniques to better understand their behavior and ensure safety guarantees. In this paper, we propose two parameterized linear abstract domains for the inner products in the self-attention module, aiming to improve verification precision. The first constructs symbolic quadratic upper and lower bounds for the product of two scalars, and then derives parameterized affine bounds using tangents. The second constructs parameterized bounds by interpolating the affine bounds proposed in prior work. We evaluate these two parameterization methods and demonstrate that both outperform the state-of-the-art approach, which is regarded as optimal with respect to a certain mean gap. Experimental results show that, in the context of robustness verification, our approach is able to verify many instances that cannot be verified by existing methods. In interval analysis, our method achieves tighter results than the SOTA, with the advantage becoming more pronounced as network depth increases.
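To make the tangent-based construction in the abstract above concrete, here is one standard way affine bounds for a scalar product can be derived from quadratic bounds; this is a sketch of the general technique, not necessarily the paper's exact parameterization:

```latex
% Rewrite the product of two scalars as a difference of squares:
%   xy = \tfrac{1}{4}\bigl((x+y)^2 - (x-y)^2\bigr).
% For a variable u \in [l, h], the convex function u^2 admits
%   a parameterized affine LOWER bound from the tangent at any t:
%     u^2 \ge 2tu - t^2,
%   and an affine UPPER bound from the secant through the endpoints:
%     u^2 \le (l+h)\,u - lh.
% Substituting the tangent bound for (x+y)^2 and the secant bound for
% (x-y)^2 (and vice versa) yields affine lower and upper bounds on xy,
% with the tangent point t as the tunable parameter of the abstract domain.
```

Optimizing over the tangent point t is what makes such a domain "parameterized": different choices of t trade precision across the input box.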
Autonomous driving systems have achieved remarkable capabilities in real-world deployment, yet ensuring safety under corner cases remains a significant challenge due to the scarcity and constrained diversity of safety-critical scenarios. Existing generation methods may either lead to irrational vehicle behaviors or be limited by fixed collision patterns, and both heavily rely on existing map datasets, restricting diversity. To address these fundamental limitations, we introduce Any2Critical, the first framework that can encode arbitrary real-world scenarios and generate contextually relevant safety-critical scenarios with realistic driving behaviors. Specifically, Any2Critical addresses two key challenges: (1) developing comprehensive, diverse map data by successfully leveraging everyday traffic situations as the most abundant source of real-world driving contexts, and (2) proposing an RAG-based Safety-Critical Scenario Generation Strategy based on our curated NHTSA-5K database for achieving an optimal balance between scenario diversity and behavioral rationality. Through comprehensive evaluation, we demonstrate that Any2Critical consistently achieves an average collision rate of 89.69% across diverse scenarios and autonomous driving systems, significantly outperforming current state-of-the-art generation methods.
We present Modular Subset Selection (MSS), a new algorithm for locally differentially private (LDP) frequency estimation. Given a universe of size k and n users, our ε-LDP mechanism encodes each input via a Residue Number System (RNS) over ℓ pairwise-coprime moduli m_0, ..., m_{ℓ−1}, and reports a randomly chosen index j ∈ [ℓ] along with the perturbed residue using the statistically optimal Subset Selection (SS) mechanism. This design reduces the user communication cost from the Θ(ω log₂(k/ω)) bits required by standard SS (with ω ≈ k/(e^ε+1)) down to ⌈log₂ ℓ⌉ + ⌈log₂ m_j⌉ bits, where m_j < k. Server-side decoding runs in Θ(n + rkℓ) time, where r is the number of LSMR iterations; in practice, with well-conditioned moduli (i.e., constant r and ℓ = Θ(log k)), this becomes Θ(n + k log k). We prove that MSS achieves worst-case MSE within a constant factor of state-of-the-art protocols such as SS and Projective Geometry Response (PGR), while avoiding the algebraic prerequisites and dynamic-programming decoder required by PGR. Empirically, MSS matches the estimation accuracy of SS, PGR, and RAPPOR across realistic (k, ε) settings, while offering faster decoding than PGR and shorter user messages than SS. Lastly, by sampling from multiple moduli and reporting only a single perturbed residue, MSS achieves the lowest reconstruction-attack success rate among all evaluated LDP protocols.
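The RNS encoding underlying MSS can be illustrated with a minimal sketch. The helper names are hypothetical; note that in the actual mechanism each user reports only one (perturbed) residue, and the server aggregates via LSMR rather than decoding individuals, so the full CRT decode below only demonstrates that the encoding is lossless when all residues are available:

```python
from math import gcd, prod

def rns_encode(v, moduli):
    """Encode value v as its residues modulo pairwise-coprime moduli."""
    assert all(gcd(a, b) == 1 for i, a in enumerate(moduli)
               for b in moduli[i + 1:]), "moduli must be pairwise coprime"
    return [v % m for m in moduli]

def rns_decode(residues, moduli):
    """Recover v (mod prod(moduli)) via the Chinese Remainder Theorem."""
    M = prod(moduli)
    v = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        # pow(Mi, -1, m) is the modular inverse of Mi modulo m.
        v += r * Mi * pow(Mi, -1, m)
    return v % M

# Universe of size k = 3 * 5 * 7 = 105: each input in [105] is determined
# by its residues, yet a single residue costs only ceil(log2(m_j)) bits.
moduli = [3, 5, 7]
assert rns_encode(10, moduli) == [1, 0, 3]
assert rns_decode(rns_encode(42, moduli), moduli) == 42
```

Because prod(moduli) must cover the universe while each m_j is much smaller than k, reporting one residue (plus its index) is what yields the ⌈log₂ ℓ⌉ + ⌈log₂ m_j⌉ communication cost quoted above.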