This paper tackles the fundamental failure of Large Language Models (LLMs) to solve new tasks when prompted with a sufficient, yet overly complex, set of multi-modal episodes. This failure stems from the model's inability to distill underlying patterns from the noisy experiences. We propose Hypothesis-Driven Reasoning (HDR), a framework that enhances LLM reasoning by building an explicit semantic memory—a set of hypotheses induced from the multi-modal episodes. HDR employs a two-stage pipeline: it first extracts potential factors from the episodes and then iteratively refines hypotheses through a generate-verify loop over those factors. We first empirically demonstrate this failure and the potential of semantic memory, showing that oracle hypotheses can boost accuracy from 35.3% to 92.0% on a novel task we designed. We then evaluate HDR, achieving near-oracle performance and significantly outperforming baselines, especially on smaller models. This paper validates a shift from unstructured in-context recall to explicit knowledge abstraction for robust reasoning.
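As a rough illustration of the two-stage pipeline described above, the following Python sketch shows what a factor-extraction step followed by a generate-verify refinement loop could look like. The `llm` callable and the prompt wording are hypothetical placeholders, not the paper's actual implementation.

```python
# Minimal sketch of a factor-then-hypothesis pipeline; `llm` is a hypothetical
# callable standing in for whatever model HDR actually uses.
def induce_hypotheses(episodes, llm, max_rounds=5):
    # Stage 1: extract candidate factors from the raw multi-modal episodes.
    factors = llm(f"List candidate factors underlying these episodes:\n{episodes}")

    # Stage 2: generate-verify loop; keep only hypotheses that survive
    # verification against the episodes.
    hypotheses = []
    for _ in range(max_rounds):
        candidate = llm(
            f"Factors: {factors}\nKnown hypotheses: {hypotheses}\n"
            "Propose one new hypothesis explaining the episodes."
        )
        verdict = llm(
            f"Hypothesis: {candidate}\nEpisodes: {episodes}\n"
            "Answer 'consistent' or 'inconsistent'."
        )
        if "inconsistent" not in verdict.lower() and "consistent" in verdict.lower():
            hypotheses.append(candidate)
    return hypotheses  # the induced semantic memory
```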
Text-to-image diffusion models have demonstrated significant capabilities to generate diverse and detailed visuals in various domains, and story visualization is emerging as a particularly promising application. However, as their use in real-world creative domains increases, the need for enhanced control, refinement, and the ability to consistently modify images post-generation becomes an important challenge. Existing methods often lack the flexibility to apply fine or coarse edits while maintaining visual and narrative consistency across multiple frames, preventing creators from seamlessly crafting and refining their visual stories. To address these challenges, we introduce Plot'n Polish, a zero-shot framework that enables consistent story generation and provides fine-grained control over story visualizations at various levels of detail.
Conversion represents an effective approach for obtaining low-power models by transforming Artificial Neural Networks (ANNs) into event-driven Spiking Neural Networks (SNNs) without additional training. However, existing training-free conversion methods often incur substantial conversion errors. Here, we first reveal that these conversion errors primarily arise from a distributional mismatch, as the activation distributions of ANNs exhibit channel-wise shifts and scaling, whereas spike rates lack corresponding channel-specific characteristics. To address this limitation, we propose Adaptive Integrate-and-Fire (AIF) neurons with channel-specific thresholds and membrane-potential offsets that dynamically adjust spike rates. These parameters are optimized to jointly minimize conversion errors and maximize information entropy, enabling AIF neurons to capture the activation distribution characteristics of the original ANN. Moreover, AIF neurons can be seamlessly integrated into Transformer architectures with only negligible additional computational cost. Our method achieves state-of-the-art results on multiple vision and natural language processing benchmarks, in particular attaining a notable top-1 accuracy of 85.52% on ImageNet-1K.
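To make the channel-wise mechanism concrete, here is a minimal NumPy sketch of rate coding with an integrate-and-fire neuron that carries a channel-specific threshold and membrane-potential offset. The parameter names `theta` and `beta` are illustrative, and the entropy-based joint optimization of these parameters described above is not reproduced.

```python
import numpy as np

def aif_rate_code(x, theta, beta, T=32):
    """Channel-wise integrate-and-fire simulation (sketch).

    x     : (N, C) pre-activations from the source ANN layer
    theta : (C,)   channel-specific firing thresholds (assumed given)
    beta  : (C,)   channel-specific membrane-potential offsets (assumed given)
    T     : number of simulation time steps
    """
    v = np.zeros_like(x) + beta        # start each channel at its offset
    spikes = np.zeros_like(x)
    for _ in range(T):
        v += x                          # integrate the (constant) input
        fired = v >= theta              # channel-wise threshold comparison
        spikes += fired
        v -= fired * theta              # soft reset by subtraction
    return spikes * theta / T           # rate-coded approximation of the activation
```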
Spiking Neural Networks (SNNs) offer a promising energy-efficient computing paradigm owing to their event-driven properties and biologically inspired dynamics. Among various encoding schemes, Time-to-First-Spike (TTFS) is particularly notable for its extreme sparsity, utilizing a single spike per neuron to maximize energy efficiency. However, two significant challenges persist: effectively leveraging TTFS sparsity to minimize training costs on Graphics Processing Units (GPUs), and bridging the performance gap between TTFS-based SNNs and their rate-based counterparts. To address these issues, we propose a parallel training algorithm for accelerated execution and a novel decoding strategy for enhanced performance. Specifically, we derive both forward and backward propagation equations for parallelized TTFS SNNs, enabling precise calculation of first-spike timings and gradients. Furthermore, we analyze the limitations of existing output decoders and introduce a membrane potential–based decoder, complemented by an incremental time-step training strategy, to improve accuracy. Our approach achieves state-of-the-art accuracy for TTFS SNNs on several benchmarks, including MNIST (99.51%), Fashion-MNIST (93.14%), CIFAR-10 (95.06%), and CIFAR-100 (74.07%).
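For readers unfamiliar with TTFS coding, the forward-pass quantity at the heart of the method, each neuron's first-spike time, can be extracted as in the small NumPy sketch below. The paper's parallel backward equations and membrane potential–based decoder are not shown; this is only the timing extraction.

```python
import numpy as np

def first_spike_times(potentials, threshold=1.0):
    """potentials: (T, N) membrane potential traces over T time steps.
    Returns the first time step at which each neuron crosses threshold,
    or T (i.e. 'never fired') for neurons that stay silent."""
    T, _ = potentials.shape
    crossed = potentials >= threshold          # (T, N) boolean
    has_spike = crossed.any(axis=0)
    t_first = np.where(has_spike, crossed.argmax(axis=0), T)  # argmax gives first True
    return t_first
```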
Zero-shot object navigation tasks agents with locating target objects in unseen environments—a core capability of embodied intelligence. While recent vision-language navigation methods leverage Large Language Models (LLMs) for multimodal reasoning, they suffer from two key limitations: (1) semantic misalignment between language-grounded maps and real-world layouts, and (2) inefficiency due to LLMs’ lack of specialization for navigation-specific tasks. To address these challenges, we propose Chain-of-Search (CoS), a novel parameter-efficient framework that enables human-like decision-making via iterative semantic reasoning. First, CoS replaces traditional global maps with an optimal-benefit multi-map construction that continuously balances expected gain and cost throughout the navigation process. Second, we introduce a Parameter-Efficient Intent Aligner (PEIA), trained via a prompt-guided paradigm to align directional decisions with navigation intent. PEIA injects semantic cues into benefit-aware maps, enabling more rational and goal-consistent exploration. Finally, a Reflection-Guided Destination Verifier (RDV) confirms whether the target is reached via language-driven reasoning and corrects potential errors through self-reflection. CoS achieves state-of-the-art performance on HM3D (+2.8% SR) and MP3D (+1.2% SR) without relying on LLMs, demonstrating the effectiveness of lightweight, reasoning-centered navigation.
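One simple reading of the "expected gain versus cost" balance mentioned above is a scored frontier selection. The sketch below uses a linear gain-minus-weighted-cost score with a hypothetical weight `lam`; this is an assumption for illustration, not the paper's exact formulation.

```python
import numpy as np

def select_frontier(frontiers, gains, costs, lam=0.5):
    """Pick the exploration frontier with the best benefit trade-off,
    scoring each candidate by expected semantic gain minus weighted path cost."""
    scores = np.asarray(gains, dtype=float) - lam * np.asarray(costs, dtype=float)
    return frontiers[int(np.argmax(scores))]
```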
Spiking Large Language Models (LLMs) have emerged as an energy-efficient alternative to conventional LLMs through their event-driven computation. To effectively obtain spiking LLMs, researchers have developed various ANN-to-SNN conversion methods that leverage pre-trained ANN parameters while inheriting the energy efficiency of SNNs. However, existing conversion methods struggle with extreme activation outliers and incompatible nonlinear operations of ANN-based LLMs. To address this, we propose a loss-less ANN-SNN conversion for fully spike-driven LLMs, termed LAS. Specifically, LAS introduces two novel neurons to convert the activation outliers and nonlinear operations of ANN-based LLMs. Moreover, LAS tailors spike-equivalent Transformer components for spiking LLMs, ensuring full spiking conversion without any loss of performance. Experimental results on six language models and two vision-language models demonstrate that LAS achieves loss-less conversion. Notably, on OPT-66B, LAS even improves accuracy by 2% on the WSC task. In addition, parameter and ablation studies further verify the effectiveness of LAS.
Large language model (LLM) agents have emerged as a promising solution for enhancing recommendation systems via user simulation. However, existing studies predominantly resort to prompt-based simulation using frozen LLMs, which frequently results in suboptimal item modeling and user preference learning, thereby ultimately constraining recommendation performance. To address these challenges, we introduce VRAgent-R1, a novel agent-based paradigm that incorporates human-like intelligence in user simulation. Specifically, VRAgent-R1 comprises two distinct agents: the Item Perception (IP) Agent and the User Simulation (US) Agent, designed for interactive user-item modeling. Firstly, the IP Agent emulates human-like progressive thinking based on MLLMs, effectively capturing hidden recommendation semantics in videos. With a more comprehensive multimodal content understanding provided by the IP Agent, the video recommendation system is equipped to provide higher-quality candidate items. Subsequently, the US Agent refines the recommended video sets based on in-depth chain-of-thought (CoT) reasoning and achieves better alignment with real user preferences through reinforcement learning. Experimental results on a large-scale video recommendation benchmark MicroLens-100k have demonstrated the effectiveness of our proposed VRAgent-R1 method, e.g., the IP Agent achieves a 6.0% improvement in NDCG@10, while the US Agent shows approximately 45.0% higher accuracy in user decision simulation compared to state-of-the-art baselines.
Depression is a widespread mental disorder that affects millions worldwide. While automated depression assessment shows promise, most studies rely on limited or non-clinically validated data, and often prioritize complex model design over real-world effectiveness. In this paper, we aim to unveil the landscape of clinical depression assessment. We introduce C-MIND, a clinical multimodal neuropsychiatric diagnosis dataset collected over two years from real hospital visits. Each participant completes three structured psychiatric tasks and receives a final diagnosis from expert clinicians, with informative audio, video, transcript, and functional near-infrared spectroscopy (fNIRS) signals recorded. Using C-MIND, we first analyze behavioral signatures relevant to diagnosis. We train a range of classical models to quantify how different tasks and modalities contribute to diagnostic performance, and dissect the effectiveness of their combinations. We then explore whether LLMs can perform psychiatric reasoning like clinicians and identify their clear limitations in realistic clinical settings. In response, we propose to guide the reasoning process with clinical expertise and consistently improve LLM diagnostic performance by up to 10% in Macro-F1 score. We aim to build an infrastructure for clinical depression assessment from both data and algorithmic perspectives, enabling C-MIND to facilitate grounded and reliable research for mental healthcare.
Large language models (LLMs) present new opportunities for creating pedagogical agents that engage in meaningful dialogue to support student learning. However, current LLM systems used in classrooms often lack the solid theoretical foundations found in earlier intelligent tutoring systems. To bridge this gap, we propose a framework that combines Evidence-Centered Design with Social Cognitive Theory and Zone of Proximal Development for adaptive scaffolding in LLM-based agents focused on STEM+C learning. We instantiate this framework with Inquizzitor, an LLM-based formative assessment agent that integrates human-AI hybrid intelligence and provides feedback grounded in cognitive science principles. Our findings show that Inquizzitor delivers high-quality assessment and interaction aligned with core learning theories, offering effective guidance that students value. This research demonstrates the potential for theory-driven LLM integration in education, highlighting the ability of these systems to provide adaptive and principled instruction.
Cognitive computing models offer a formal and interpretable way to characterize human deliberation and decision-making, yet their development remains labor-intensive. In this paper, we propose NL2CA, a novel method for auto-formalizing cognitive decision-making rules from natural language descriptions of human experience. Unlike most related work, which relies on either purely manual or human-guided interactive modeling, our method is fully automated without any human intervention. The approach first translates text into Linear Temporal Logic (LTL) using a fine-tuned large language model (LLM), then refines the logic via an unsupervised Critic Tree, and finally transforms the output into executable production rules compatible with symbolic cognitive frameworks. Based on the resulting rules, a cognitive agent is further constructed and optimized through cognitive reinforcement learning on real-world behavioral data. Our method is validated in two domains: (1) NL-to-LTL translation, where our CriticNL2LTL module achieves consistent performance across both expert and large-scale benchmarks without human-in-the-loop feedback, and (2) cognitive driving simulation, where agents automatically constructed from human interviews successfully learn the diverse decision patterns of about 70 trials in different critical scenarios. Experimental results demonstrate that NL2CA enables scalable, interpretable, and human-aligned cognitive modeling from unstructured textual data, offering a novel paradigm for automatically designing symbolic cognitive agents.
Neural coupling is a fundamental mechanism in neuroscience that facilitates the emergence of cognitive functions through dynamic interactions and synchronization among distributed brain regions. Inspired by this principle, we pose the question: Might the biological mechanism of neural oscillatory synchronization inspire feature representation learning for neuroscience? By addressing this question through the Kuramoto model, renowned for simulating oscillatory dynamics, we present SyncBrain, a novel physics-informed deep model that represents brain regions as interacting oscillatory units and simulates their temporal dynamics and synchronization patterns to distinguish cognitive states. Furthermore, inspired by the brain's inherent ability to dynamically attend to critical temporal information, we incorporate an adaptive control module that introduces an attention-like mechanism to guide information flow. We evaluate our model on multiple functional neuroimaging datasets, where it demonstrates promising performance and enhanced interpretability in both cognitive state decoding and early disease diagnosis, outperforming existing computational methods. These results demonstrate the effectiveness of neural oscillatory mechanisms in shaping robust and interpretable machine learning models for neuroscience applications.
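The Kuramoto dynamics that SyncBrain builds on are compact enough to state directly. The NumPy sketch below simulates the standard phase-coupling equation and the synchronization order parameter for a handful of toy "regions"; it contains none of SyncBrain's learned components or its adaptive control module.

```python
import numpy as np

def kuramoto_step(theta, omega, K, dt=0.01):
    """One Euler step of the Kuramoto model:
    d(theta_i)/dt = omega_i + (K/N) * sum_j sin(theta_j - theta_i)."""
    N = len(theta)
    coupling = np.sin(theta[None, :] - theta[:, None]).sum(axis=1)  # sum_j sin(theta_j - theta_i)
    return theta + dt * (omega + (K / N) * coupling)

def order_parameter(theta):
    """Degree of synchronization r in [0, 1]; r -> 1 as phases align."""
    return np.abs(np.exp(1j * theta).mean())

# Toy run: phases of 8 'regions' synchronize under sufficiently strong coupling.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 8)
omega = rng.normal(0, 0.5, 8)
for _ in range(2000):
    theta = kuramoto_step(theta, omega, K=2.0)
print(round(order_parameter(theta), 2))
```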
Teachers' emotional states are critical in educational scenarios, profoundly impacting teaching efficacy, student engagement, and learning achievements. However, existing studies often fail to accurately capture teachers' emotions due to the performative nature of teaching, and they overlook the critical impact of instructional information on emotional expression. In this paper, we systematically investigate teacher sentiment analysis by building both the dataset and the model accordingly. We construct the first large-scale teacher multimodal sentiment analysis dataset, T-MED. To ensure labeling accuracy and efficiency, we employ a human-machine collaborative labeling process. The T-MED dataset includes 14,938 instances of teacher emotional data from 250 real classrooms across 11 subjects ranging from K-12 to higher education, integrating multimodal text, audio, video, and instructional information. Furthermore, we propose AAM-TSA, a novel asymmetric attention-based multimodal teacher sentiment analysis model. AAM-TSA introduces an asymmetric attention mechanism and a hierarchical gating unit to enable differentiated cross-modal feature fusion and precise emotion classification. Experimental results demonstrate that AAM-TSA significantly outperforms existing state-of-the-art methods in terms of accuracy and interpretability on the T-MED dataset.
Swarm UAV autonomous flight for Embodied Long-Horizon (ELH) tasks is crucial for advancing the low-altitude economy. However, existing methods focus only on specific basic tasks due to dataset limitations, failing in real-world deployment for ELH tasks. ELH tasks are not mere concatenations of basic tasks, requiring handling long-term dependencies, maintaining embodied persistent states, and adapting to dynamic goal shifts. This paper presents U2UData+, the first large-scale swarm UAV autonomous flight dataset for ELH tasks and the first scalable swarm UAV data online collection and algorithm closed-loop verification platform. The dataset is captured by 15 UAVs in autonomous collaborative flights for ELH tasks, comprising 12 scenes, 720 traces, 120 hours, 600 seconds per trajectory, 4.32M LiDAR frames, and 12.96M RGB frames. This dataset also includes brightness, temperature, humidity, smoke, and airflow values covering all flight routes. The platform supports the customization of simulators, UAVs, sensors, flight algorithms, formation modes, and ELH tasks. Through a visual control window, this platform allows users to collect customized datasets through one-click deployment online and to verify algorithms by closed-loop simulation. U2UData+ also introduces an ELH task for wildlife conservation and provides comprehensive benchmarks with 9 SOTA models.
Depression represents a global mental health challenge requiring efficient and reliable automated detection methods. Current Transformer- or Graph Neural Network (GNN)-based multimodal depression detection methods face significant challenges in modeling individual differences and cross-modal temporal dependencies across diverse behavioral contexts. Therefore, we propose P³HF (Personality-guided Public-Private Domain Disentangled Hypergraph-Former Network) with three key innovations: (1) personality-guided representation learning using LLMs to transform discrete individual features into contextual descriptions for personalized encoding; (2) a Hypergraph-Former architecture modeling high-order cross-modal temporal relationships; (3) event-level domain disentanglement with contrastive learning for improved generalization across behavioral contexts. Experiments on the MPDD-Young dataset show that P³HF achieves around 10% improvement in accuracy and weighted F1 on binary and ternary depression classification tasks over existing methods. Extensive ablation studies validate the independent contribution of each architectural component, confirming that personality-guided representation learning and high-order hypergraph reasoning are both essential for generating robust, individual-aware depression-related representations.
The discovery of symbolic solutions—mathematical expressions, logical rules, and algorithmic structures—is fundamental to advancing scientific and engineering progress. However, traditional methods often struggle with search efficiency and fail to integrate knowledge effectively. While recent large language model-based (LLM-based) approaches have demonstrated improvements in search efficiency, they lack the ability to continually refine and expand upon discovered solutions and their underlying knowledge, limiting their potential for open-ended innovation. To address these limitations, we introduce CoEvo, a novel framework that leverages large language models within an evolutionary search methodology to continually generate and refine symbolic solutions. CoEvo integrates a dynamic knowledge library, enabling open-ended innovation of solutions through effective knowledge management. Additionally, CoEvo leverages multiple representations of solutions—including natural language, mathematical expressions, and code—to further enhance search efficiency. By combining the reasoning capabilities of LLMs with the exploratory power of evolutionary algorithms, CoEvo significantly improves the efficiency and scope of symbolic discovery. Our experimental results demonstrate that this method not only enhances the efficiency of searching for symbolic solutions but also supports the ongoing discovery process, akin to human scientific endeavors. This study represents a first effort in conceptualizing the search for symbolic solutions as a lifelong, iterative process, marking a significant step towards harnessing LLMs in the perpetual pursuit of scientific and engineering breakthroughs.
Empathetic dialogue systems aim to recognize user emotions and generate appropriate empathetic responses. However, existing approaches predominantly rely on dialogue history, contextual descriptions, and emotion category labels, failing to model the causal relationship between emotions and their underlying triggers. This limitation leads to generated responses that lack grounding, exhibit weak relevance, and suffer from poor interpretability in emotional expression. To address this, we propose MvP-ECR, a multi-perspective emotion cause reasoning framework that explicitly constructs emotion-cause structures to help models focus on the core emotional drivers. Additionally, we introduce an emotion-cause consistency evaluation metric to quantitatively assess a model's ability to identify causal relationships. Experiments across multiple large language models (LLMs) demonstrate that the MvP-ECR framework can serve as a plug-and-play tool that helps models correctly infer emotions and their causes in empathetic conversations and provide more immersive empathetic responses. All code and data will be publicly released to promote the development of empathetic dialogue research.
Theory of Mind (ToM) refers to the ability to infer others' mental states, which is an essential capability for embodied AI agents to effectively collaborate and interact with humans. While improving Large Language Models' ability to reason about characters' mental states in text-based stories and dialogues has been extensively studied, enhancing Multimodal Large Language Models' ToM capabilities, particularly in egocentric video from an embodied perspective, remains unexplored. In this paper, we propose a contrastive Reinforcement Learning (RL) paradigm that explicitly encourages models to leverage temporal and causal evolutionary patterns in user action sequences to infer users' mental states (goals, beliefs, and potential next actions). Evaluation results on in-domain and out-of-domain settings demonstrate that our method achieves performance improvements of (+30.00%, +2.00%) and (+5.83%, +5.00%) compared to the backbone model and the vanilla Group Relative Policy Optimization (GRPO) model, respectively. Additionally, we compare the performance of two post-training paradigms (Supervised Fine-Tuning and RL) and systematically analyze the reasoning trajectories across the base model, the vanilla GRPO model, and our proposed method.
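For context, the group-relative advantage at the core of GRPO, on which the proposed contrastive RL paradigm builds, normalizes each rollout's reward against its group's statistics, removing the need for a learned value baseline. The snippet below shows only that normalization, not the paper's contrastive reward shaping; the reward values are invented for illustration.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as used in GRPO: each sampled response in a
    group is scored relative to the group's mean and standard deviation."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# e.g. four rollouts for the same prompt, rewarded by a mental-state verifier
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))   # positive for correct rollouts
```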
Large Language Models (LLMs) have demonstrated remarkable proficiency in diverse tasks. This success raises a fundamental question in machine composition: Can symbolic music be considered a special form of language that can be jointly modeled with natural language for composition tasks? Recent studies validate that symbolic music can be modeled as a human language, yet composing structured music from partial symbolic inputs through natural language interaction remains underexplored. Even LLMs struggle to generate structurally coherent compositions in such hybrid input-output scenarios, highlighting a fundamental gap that calls for a domain-specific learning paradigm. To this end, we propose Inspiration-to-Structure (IoS), a cognitively inspired framework that enables LLMs to generate structured musical sections from melodic ideas. IoS employs a three-phase process—semantic, structural, and collaborative cognition—and is supported by two key components: (1) a new dataset and construction protocol called Structured Triplet Data (STD), and (2) a training method, Dual-Instance Structural Contrastive Optimization (DiSCO), designed to enhance structural awareness. Experiments show that IoS improves structural coherence by 47.8% and artistic creativity by 21.8% compared to the conventional language modeling paradigm and supervised fine-tuning, and even enables smaller LLMs to surpass larger ones. These results suggest that symbolic music, while language-like, demands specialized modeling beyond standard language modeling paradigms. IoS enables LLMs to transform music theory knowledge into structured composition, empowering users to compose music interactively via language and advancing toward general creative AI.
Image-guided surgery demands adaptive, real-time decision support, yet static AI models struggle with structured task planning and providing interactive guidance. Large language models (LLMs)-powered agents offer a promising solution by enabling dynamic task planning and predictive decision support. Despite recent advances, the absence of surgical agent datasets and robust parameter-efficient fine-tuning techniques limits the development of LLM agents capable of complex intraoperative reasoning. In this paper, we introduce Surgical AI Copilot, an LLM agent for image-guided pituitary surgery, capable of conversation, planning, and task execution in response to queries involving tasks such as MRI tumor segmentation, endoscope anatomy segmentation, overlaying preoperative imaging with intraoperative views, instrument tracking, and surgical visual question answering (VQA). To enable structured agent planning, we develop the PitAgent dataset, a surgical context-aware planning dataset covering surgical tasks like workflow analysis, instrument localization, anatomical segmentation, and query-based reasoning. Additionally, we propose DEFT-GaLore, a Deterministic Energy-based Fourier Transform (DEFT) gradient projection technique for efficient low-rank adaptation of recent LLMs (e.g., LLaMA 3.2, Qwen 2.5), enabling their use as surgical agent planners. We extensively validate our agent's performance and the proposed adaptation technique against other state-of-the-art low-rank adaptation methods on agent planning and prompt generation tasks, including a zero-shot surgical VQA benchmark, demonstrating the significant potential for truly efficient and scalable surgical LLM agents in real-time operative settings.
Brain-inspired spiking neural networks (SNNs) are recognized as a promising avenue for achieving efficient, low-energy neuromorphic computing. Direct training of SNNs typically relies on surrogate gradient (SG) learning to estimate derivatives of non-differentiable spiking activity. However, during training, the distribution of neuronal membrane potentials varies across timesteps and progressively deviates toward both sides of the firing threshold. When the firing threshold and SG remain fixed, this may lead to imbalanced spike firing and diminished gradient signals, preventing SNNs from performing well. To address these issues, we propose a novel dual-stage synergistic learning algorithm that achieves forward adaptive thresholding and backward dynamic SG. In forward propagation, we adaptively adjust thresholds based on the distribution of membrane potential dynamics (MPD) at each timestep, which enriches neuronal diversity and effectively balances firing rates across timesteps and layers. In backward propagation, drawing from the underlying association between MPD, threshold, and SG, we dynamically optimize SG to enhance gradient estimation through spatio-temporal alignment, effectively mitigating gradient information loss. Experimental results demonstrate that our method achieves significant performance improvements. Moreover, it allows neurons to fire stable proportions of spikes at each timestep and increases the proportion of neurons that obtain gradients in deeper layers.
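As a hedged illustration of forward adaptive thresholding, one plausible rule is to tie the threshold at each timestep to the batch statistics of the membrane potential distribution. The PyTorch snippet below uses a mean-plus-scaled-std rule with a hypothetical coefficient `k`; the paper's actual update rule and its backward-pass SG adjustment may differ.

```python
import torch

def adaptive_threshold(v, k=0.5, eps=1e-5):
    """One plausible way to tie the firing threshold to the membrane potential
    distribution (MPD) at a given timestep: centre it on the batch statistics
    so that a roughly stable fraction of neurons fires."""
    mu, sigma = v.mean(), v.std()
    return mu + k * sigma + eps

# Forward pass of one layer at one timestep (Heaviside spike, soft reset).
v = torch.randn(128, 256)            # membrane potentials for a batch of neurons
theta = adaptive_threshold(v)
spikes = (v >= theta).float()
v = v - spikes * theta
```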
The surrogate gradient (SG) method has shown significant promise in enhancing the performance of deep spiking neural networks (SNNs), but it also introduces vulnerabilities to adversarial attacks. Although spike coding strategies and neural dynamics parameters have been extensively studied for their impact on robustness, the critical role of gradient magnitude, which reflects the model's sensitivity to input perturbations, remains underexplored. In SNNs, the gradient magnitude is primarily determined by the interaction between the membrane potential distribution (MPD) and the SG function. In this study, we investigate the relationship between the MPD and SG and their implications for improving the robustness of SNNs. Our theoretical analysis reveals that reducing the proportion of membrane potentials lying within the gradient-available range of the SG function effectively mitigates the sensitivity of SNNs to input perturbations. Building upon this insight, we propose a novel MPD-driven surrogate gradient regularization (MPD-SGR) method, which enhances robustness by explicitly regularizing the MPD based on its interaction with the SG function. Extensive experiments across multiple image classification benchmarks and diverse network architectures confirm that the MPD-SGR method significantly enhances the resilience of SNNs to adversarial perturbations and exhibits strong generalizability across diverse network configurations, SG functions, and spike encoding schemes.
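To illustrate the MPD-SG interaction in code, the sketch below penalizes the fraction of membrane potentials falling inside a rectangular surrogate gradient's support around the threshold, which is the quantity the abstract argues should shrink. The smooth indicator, the rectangular SG shape, and the weight `lam` are assumptions for illustration rather than the exact MPD-SGR objective.

```python
import torch

def mpd_sg_penalty(v, threshold=1.0, width=1.0):
    """Hedged sketch of an MPD-driven regularizer: estimate the fraction of
    membrane potentials inside the surrogate gradient's support (a rectangular
    SG of half-width `width` around the threshold), using a smooth sigmoid
    surrogate of the indicator so the penalty stays differentiable."""
    dist = (v - threshold).abs()
    inside = torch.sigmoid((width - dist) * 10.0)   # ~1 inside the SG support, ~0 outside
    return inside.mean()

# total_loss = task_loss + lam * mpd_sg_penalty(v)   # lam is a hypothetical weight
```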
Automatic real personality recognition (RPR) aims to evaluate humans' real personality traits from their expressive behaviours. However, most existing solutions generally act as external observers, inferring observers' personality impressions from target individuals' expressive behaviours; these impressions significantly deviate from the targets' real personalities and consistently lead to inferior recognition performance. Inspired by the association between real personality and the human internal cognition underlying the generation of expressive behaviours, we propose a novel RPR approach that efficiently simulates personalised internal cognition from short external audio-visual behaviours expressed by the target individual. The simulated personalised cognition, represented as a set of network weights that enforce the personalised network to reproduce the individual-specific facial reactions, is further encoded as a graph containing two-dimensional node and edge feature matrices, with a novel 2D Graph Neural Network (2D-GNN) proposed for inferring real personality traits from it. To simulate real personality-related cognition, an end-to-end (E2E) strategy is designed to jointly train our cognition simulation, 2D graph construction, and personality recognition modules. Experiments show our approach's effectiveness in capturing real personality traits with superior computational efficiency.
Understanding human emotions from images is a challenging yet essential task for vision-language models. While recent efforts have fine-tuned vision-language models to enhance emotional awareness, most approaches rely on global visual representations and fail to capture the nuanced and multi-faceted nature of emotional cues. Furthermore, most existing approaches adopt instruction tuning, which requires costly dataset construction and involves training a large number of parameters, thereby limiting their scalability and efficiency. To address these challenges, we propose MASP, a novel framework for Multi-Aspect guided emotion reasoning with Soft Prompt tuning in vision-language models. MASP explicitly separates emotion-relevant visual cues via multi-aspect cross-attention modules and guides the language model using soft prompts, enabling efficient and scalable task adaptation without modifying the base model. Our method achieves state-of-the-art performance on various emotion recognition benchmarks, demonstrating that the explicit modeling of multi-aspect emotional cues with soft prompt tuning leads to more accurate and interpretable emotion reasoning in vision-language models.
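Since the approach above relies on soft prompt tuning, the minimal PyTorch module below shows the generic mechanism: a small set of learnable embeddings prepended to the frozen model's input embeddings, so only those vectors are updated during adaptation. MASP's multi-aspect cross-attention modules are not included, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Minimal soft-prompt module: learnable embeddings prepended to the
    (frozen) language model's input embeddings; only these parameters train."""
    def __init__(self, n_tokens=16, dim=768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)

    def forward(self, input_embeds):                  # input_embeds: (B, L, dim)
        B = input_embeds.size(0)
        p = self.prompt.unsqueeze(0).expand(B, -1, -1)
        return torch.cat([p, input_embeds], dim=1)    # (B, n_tokens + L, dim)
```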
A major challenge in developing robust and generalizable Human Activity Recognition (HAR) systems for smart homes is the lack of large and diverse labeled datasets. Variations in home layouts, sensor configurations, and individual behaviors further exacerbate this issue. To address this, we leverage the idea of embodied AI agents—virtual agents that perceive and act within simulated environments guided by internal world models. We introduce AgentSense, a virtual data generation pipeline in which agents live out daily routines in simulated smart homes, with behavior guided by Large Language Models (LLMs). The LLM generates diverse synthetic personas and realistic routines grounded in the environment, which are then decomposed into fine-grained actions. These actions are executed in an extended version of the VirtualHome simulator, which we augment with virtual ambient sensors that record the agents’ activities. Our approach produces rich, privacy-preserving sensor data that reflects real-world diversity. We evaluate AgentSense on five real HAR datasets. Models pretrained on the generated data consistently outperform baselines, especially in low-resource settings. Furthermore, combining the generated virtual sensor data with a small amount of real data achieves performance comparable to training on full real-world datasets. These results highlight the potential of using LLM-guided embodied agents for scalable and cost-effective sensor data generation in HAR.
Large language models (LLMs) are increasingly used in scientific domains. While they can produce reasoning-like content via methods such as chain-of-thought prompting, these outputs are typically unstructured and informal, obscuring whether models truly understand the fundamental reasoning paradigms that underpin scientific inference. To address this, we introduce a novel task named Latent Reasoning Chain Extraction (ARCHE), in which models must decompose complex reasoning arguments into combinations of standard reasoning paradigms in the form of a Reasoning Logic Tree (RLT). In an RLT, all reasoning steps are explicitly categorized as one of three variants of Peirce’s fundamental inference modes: deduction, induction, or abduction. To facilitate this task, we release ARCHE Bench, a new benchmark derived from 70 Nature Communications articles, including more than 1,900 references and 38,000 viewpoints. We propose two logic-aware evaluation metrics: Entity Coverage (EC) for content completeness and Reasoning Edge Accuracy (REA) for step-by-step logical validity. Evaluations on 10 leading LLMs on ARCHE Bench reveal that models exhibit a trade-off between REA and EC, and none are yet able to extract a complete and standard reasoning chain. These findings highlight a substantial gap between the abilities of current reasoning models and the rigor required for scientific argumentation.