EMNLP.2025 - Findings

Total: 1405

#1 Automating Alternative Generation in Decision-Making

Authors: Yevhen Kostiuk, Clara Seyfried, Chris Reed

In decision making, generating alternative solutions is crucial for solving a problem. However, cognitive biases can impede this process by constraining individual decision makers’ creativity. To address this issue, we introduce a new task for automatically generating alternatives, inspired by the process of human “brainstorming”. We define alternative options based on atomic action components and present a dataset of 106 annotated Reddit r/Advice posts containing unique alternative options extracted from users’ replies. We also introduce new metrics to assess the quality of generated components, including distinctiveness, creativity, upvote-weighted, crowd intersection, and final commit intersection scores. As a baseline, we evaluated the large language models (LLMs) LLaMa3:8b, LLaMa3.1:8b, and Gemma 2:9b on the alternative component generation task. On the one hand, models demonstrated high creativity (ability to generate options beyond what Reddit users suggested) and performed well at proposing distinct alternatives. A subset of generated components was manually evaluated and found overall useful. This indicates that LLMs might be used to extend lists of alternative options, helping decision makers consider a problem from different perspectives. On the other hand, LLMs’ outputs often failed to align with human suggestions, implying that they still tend to miss important components.

Subject: EMNLP.2025 - Findings


#2 Bias Analysis and Mitigation through Protected Attribute Detection and Regard Classification

Authors: Takuma Udagawa, Yang Zhao, Hiroshi Kanayama, Bishwaranjan Bhattacharjee

Large language models (LLMs) acquire general linguistic knowledge from massive-scale pretraining. However, pretraining data mainly comprised of web-crawled texts contain undesirable social biases which can be perpetuated or even amplified by LLMs. In this study, we propose an efficient yet effective annotation pipeline to investigate social biases in the pretraining corpora. Our pipeline consists of protected attribute detection to identify diverse demographics, followed by regard classification to analyze the language polarity towards each attribute. Through our experiments, we demonstrate the effect of our bias analysis and mitigation measures, focusing on Common Crawl as the most representative pretraining corpus.

#3 Large Language Models Might Not Care What You Are Saying: Prompt Format Beats Descriptions

Authors: Chenming Tang, Zhixiang Wang, Hao Sun, Yunfang Wu

With the help of in-context learning (ICL), large language models (LLMs) have achieved impressive performance across various tasks. However, the function of descriptive instructions during ICL remains under-explored. In this work, we propose an ensemble prompt framework to describe the selection criteria of multiple in-context examples, and preliminary experiments on machine translation (MT) across six translation directions confirm that this framework boosts ICL performance. But to our surprise, LLMs might not care what the descriptions actually say, and the performance gain is primarily caused by the ensemble format, since it could lead to improvement even with random descriptive nouns. We further apply this new ensemble framework to a range of commonsense, math, logical reasoning and hallucination tasks with three LLMs and achieve promising results, suggesting again that designing a proper prompt format would be much more effective and efficient than investing effort in specific descriptions.

#4 Boundary Matters: Leveraging Structured Text Plots for Long Text Outline Generation

Authors: Yuanchi Ma, Jiamou Liu, Hui He, Libo Zhang, Haoyuan Li, Zhendong Niu

Outline generation aims to uncover the internal content structure of a document by identifying potential chapter connections and generating corresponding summaries. A robust outline generation model strives for coherence between and within plots. However, existing methods perform well on short- and medium-length texts but struggle to generate readable outlines for very long texts (e.g., fictional literary works). The primary challenge lies in their inability to accurately segment plots within long texts. To address this issue, we propose a novel unsupervised guidance framework, LeStrTP, to guide large language model (LLM) outline generation. This framework ensures that each structured plot encapsulates complete causality by accurately identifying plot boundaries. Specifically, the LeStrTP framework constructs chapter-level graphs from long texts and learns their embeddings. Subsequently, by modeling chapter dependence as a Markov chain, a unique search operator is designed to achieve plot segmentation. To facilitate research on this task, we introduce a new annotated benchmark dataset, NovOutlineSet. Experimental results demonstrate that structured plots not only enhance the coherence and integrity of generated outlines but also significantly improve their quality.

#5 Can Large Language Models Personalize Dialogues to Generational Styles?

Authors: Pier Felice Balestrucci, Ondrej Dusek, Luca Anselma, Alessandro Mazzei

We investigate how large language models (LLMs) can produce personalized dialogue responses, specifically focusing on whether they reflect linguistic styles pertaining to different generations: Baby Boomers, Generation X, Generation Y, and Generation Z. We create P-MultiWoZ, a personalized, generation-specific version of MultiWOZ 2.2, by prompting LLMs, and validate its alignment with the original dataset through automatic and human evaluations. To validate the appropriateness of generational linguistic traits, we introduce GeMoSC, a corpus of generation-annotated movie dialogues. A linguistic analysis and a perplexity test suggest that P-MultiWoZ reflects patterns consistent with GeMoSC. Finally, a human evaluation reveals that annotators were mostly able to correctly identify the generation behind P-MultiWoZ dialogues, based only on a single query-reply pair.

#6 Toward Optimal LLM Alignments Using Two-Player Games

Authors: Rui Zheng, Hongyi Guo, Zhihan Liu, Xiaoying Zhang, Yuanshun Yao, Xiaojun Xu, Zhaoran Wang, Zhiheng Xi, Tao Gui, Qi Zhang, Xuanjing Huang, Yang Liu, Hang Li

Alignment of large language models (LLMs) is a process that ensures the model’s responses to user prompts align with human intentions and social values. This optimization typically relies on pre-collected prompts. Collecting these prompts often either requires careful human intervention or fails to achieve good coverage of all the scenarios in which an LLM can improve. To address this issue, we propose an alignment method based on a two-agent game, consisting of an adversarial agent and a defensive agent. The adversarial agent’s task is to generate prompts that expose the deficiencies of the defensive agent. At the same time, the defensive agent improves its performance on the prompts generated by the adversary based on feedback from the reward model. This iterative process is repeated to enhance the model’s performance. We theoretically demonstrate that, under mild assumptions, this iterative alignment process converges to a Nash equilibrium for both agents. Learning in this competitive environment results in policies with better generalization capabilities. We demonstrate the advantage of our framework using extensive experiments.

#7 Structural Patent Classification Using Label Hierarchy Optimization

Authors: Mengting Gui, Shufeng Hao, Chongyang Shi, Qi Zhang

Patent classification is a fundamental step in the patent examination process, directly impacting the efficiency and quality of substantive review. Existing methods mostly focus on general texts like titles and abstracts, thus ignoring the key technical content claims and the corresponding citation relationships. Meanwhile, these approaches treat labels as independent targets, failing to exploit the semantic and structural information within the label taxonomy. To address these problems, we propose a Claim Structure based Patent Classification model with Label Awareness (CSPC-LA). The method first utilizes the citation relationship of patent claim texts to construct the citation graph and the co-reference graph. Then structural graph learning is used on both graphs to mine the internal logic of patent claims. Finally, we optimize the tree hierarchy of IPC labels and employ tree propagation learning to enhance the patent representation. Extensive experiments on the latest patent classification dataset from USPTO demonstrate that the proposed method is more effective than the state-of-the-art baselines.

#8 Exploring Hyperbolic Hierarchical Structure for Multimodal Rumor Detection

Authors: Md Mahbubur Rahman, Shufeng Hao, Chongyang Shi, An Lao, Jinyan Liu

The rise of multimodal content on social platforms has led to the rapid spread of complex and persuasive false narratives combining text and images. Traditional rumor detection models attempt to identify such content by relying on textual cues or employing shallow multimodal fusion techniques. However, these methods often assume a simplistic one-to-one alignment between modalities, overlooking the richer hierarchical relationships across modalities and failing to capture the layered structure of meaning. In this paper, we present RumorCone, a novel method that employs hyperbolic geometry in order to preserve hierarchical, non-linear relationships, rather than representing them at a flat semantic level. First, RumorCone decomposes image and text content into three levels: base, mid, and high-level abstractions, and embeds them in hyperbolic space to model their tree-like semantic structure. Second, a dynamic hyperbolic multimodal attention mechanism aligns features across modalities and levels, and a flexible fusion strategy adjusts the contribution of each modality based on alignment quality. Our experiments indicate the importance of hierarchical semantic modeling for robust and interpretable multimodal rumor detection.

#9 Multi-Surrogate-Objective Optimization for Neural Topic Models

Authors: Tue Le, Hoang Tran Vuong, Tung Nguyen, Linh Ngo Van, Dinh Viet Sang, Trung Le, Thien Huu Nguyen

Neural topic modeling has substantially improved topic quality and document topic distribution compared to traditional probabilistic methods. These models often incorporate multiple loss functions. However, the disparate magnitudes of these losses can make hyperparameter tuning challenging, potentially creating obstacles for simultaneous optimization. While gradient-based Multi-objective Optimization (MOO) algorithms offer a potential solution, they are typically applied to shared parameters in multi-task learning, hindering their broader adoption, particularly in Neural Topic Models (NTMs). Furthermore, our experiments reveal that naïve MOO applications on NTMs can yield suboptimal results, even underperforming implementations without the MOO mechanism. This paper proposes a novel approach that integrates MOO algorithms independently of hard-parameter sharing architectures and effectively optimizes multiple NTM loss functions. Comprehensive evaluations on widely used benchmark datasets demonstrate that our approach significantly enhances baseline topic model performance and outperforms direct MOO applications on NTMs.

#10 How Diversely Can Language Models Solve Problems? Exploring the Algorithmic Diversity of Model-Generated Code

Authors: Seonghyeon Lee, HeeJae Chon, Joonwon Jang, Dongha Lee, Hwanjo Yu

Language models (LMs) have exhibited impressive abilities in generating code from natural language requirements. In this work, we highlight the diversity of code generated by LMs as a critical criterion for evaluating their code generation capabilities. Few studies have assessed the diversity of generated code, so its importance for code LMs has been overlooked. Therefore, we propose a systematic approach to evaluate code diversity, introducing various metrics with inter-code similarity. Specifically, we introduce code clustering methods that leverage LMs’ capabilities in code understanding and reasoning, resulting in a set of metrics that represent the number of algorithms in model-generated solutions. We extensively investigate the properties of model-generated solutions by contrasting them with human-written ones and quantifying the impact of various factors on code diversity: model size, temperature, instruction tuning, and problem complexity. Our analysis demonstrates that model-generated solutions exhibit low algorithmic diversity, an issue that has been neglected by the research community. Moreover, we explore methods to increase code diversity by combining solutions from different models and increasing sampling temperatures. Our findings highlight that code diversity can be enhanced by using heterogeneous models and by setting the temperature above 1.0, a regime that has not been fully explored because functional correctness degrades. To facilitate our research direction, we publicly share our code and datasets through open-source repositories.

#11 ReAL: How Can LLMs Simulate the Real Teacher? Retrieval-enhanced Agent for Adaptive Learning

Authors: Rui Lv, Qi Liu, Weibo Gao, Jiatong Li, Kai Zhang, Shiwei Tong

Adaptive learning focuses on recommending personalized materials (e.g., exercises, courses) tailored to the unique needs of learners. Despite significant research, these methods still lag behind real teachers because of two main limitations: (1) Prior methods model learner-item interactions based only on ID sequences, leading to insufficient use of both learner and item information, particularly the inability to leverage semantic content from item text; (2) The data-driven reinforcement learning frameworks struggle with stable performance in scenarios with sparse learning logs. To address these challenges, we introduce the Retrieval-enhanced Agent for Adaptive Learning (ReAL), powered by large language models (LLMs), to simulate teacher decision-making with extensive prior knowledge and teaching experience. Specifically, we approach the simulation from both internal and external perspectives. From the internal perspective, we utilize the superior natural language understanding ability of LLMs to analyze item texts and learner profiles. This mechanism contributes to the generation of personalized and appropriate item candidates. From the external perspective, we simulate the teacher experience by retrieving similar learners, further ensuring the model’s performance on sparse interaction data. Furthermore, we design a reflector based on learners’ feedback to refine the recommendation process. Evaluation on three real-world datasets demonstrates the superiority of ReAL in data utilization, recommendation accuracy, and stability compared to various representative baselines.

#12 LLMsPark: A Benchmark for Evaluating Large Language Models in Strategic Gaming Contexts

Authors: Junhao Chen, Jingbo Sun, Xiang Li, Haidong Xin, Yuhao Xue, Yibin Xu, Hao Zhao

As large language models (LLMs) advance across diverse tasks, the need for comprehensive evaluation beyond single metrics becomes increasingly important. To fully assess LLM intelligence, it is crucial to examine their interactive dynamics and strategic behaviors. We present LLMsPark, a game theory–based evaluation platform that measures LLMs’ decision-making strategies and social behaviors in classic game-theoretic settings, providing a multi-agent environment to explore strategic depth. Our system cross-evaluates 15 leading LLMs (both commercial and open-source) using leaderboard rankings and scoring mechanisms. Higher scores reflect stronger reasoning and strategic capabilities, revealing distinct behavioral patterns and performance differences across models. This work introduces a novel perspective for evaluating LLMs’ strategic intelligence, enriching existing benchmarks and broadening their assessment in interactive, game-theoretic scenarios. The benchmark and rankings are publicly available at https://llmsparks.github.io/.

#13 Versatile Framework for Song Generation with Prompt-based Control

Authors: Yu Zhang, Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Ruiqi Li, Jingyu Lu, Rongjie Huang, Ruiyuan Zhang, Zhiqing Hong, Ziyue Jiang, Zhou Zhao

Song generation focuses on producing controllable high-quality songs based on various prompts. However, existing methods struggle to generate vocals and accompaniments with prompt-based control and proper alignment. Additionally, they fall short in supporting various tasks. To address these challenges, we introduce VersBand, a multi-task song generation framework for synthesizing high-quality, aligned songs with prompt-based control. VersBand comprises these primary models: 1) VocalBand, a decoupled model, leverages the flow-matching method for generating singing styles, pitches, and mel-spectrograms, allowing fast, high-quality vocal generation with style control. 2) AccompBand, a flow-based transformer model, incorporates the Band-MOE, selecting suitable experts for enhanced quality, alignment, and control. This model allows for generating controllable, high-quality accompaniments aligned with vocals. 3) Two generation models, LyricBand for lyrics and MelodyBand for melodies, contribute to the comprehensive multi-task song generation system, allowing for extensive control based on multiple prompts. Experimental results demonstrate that VersBand outperforms baseline models across multiple song generation tasks on both objective and subjective metrics.

#14 InsBank: Evolving Instruction Subset for Ongoing Alignment

Authors: Jiayi Shi, Yiwei Li, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Yueqi Zhang, Chuyi Tan, Boyuan Pan, Huan Ren, Yao Hu, Kan Li

Large language models (LLMs) typically undergo instruction tuning to enhance alignment. Recent studies emphasize that quality and diversity of instruction data are more crucial than quantity, highlighting the need to select diverse, high-quality subsets to reduce training costs. However, how to evolve these selected subsets alongside the development of new instruction data remains insufficiently explored. To achieve LLMs’ ongoing alignment, we introduce Instruction Bank (InsBank), a continuously updated repository that integrates the latest valuable instruction data. We further propose Progressive Instruction Bank Evolution (PIBE), a novel framework designed to evolve InsBank effectively and efficiently over time. PIBE employs a gradual data selection strategy to maintain long-term efficiency, leveraging a representation-based diversity score to capture relationships between data points and retain historical information for comprehensive diversity evaluation. This also allows for flexible combination of diversity and quality scores during data selection and ranking. Extensive experiments demonstrate that PIBE significantly outperforms baselines in InsBank evolution and is able to extract budget-specific subsets, demonstrating its effectiveness and adaptability.

#15 TL-Training: A Task-Feature-Based Framework for Training Large Language Models in Tool Use

Authors: Junjie Ye, Yilong Wu, Sixian Li, Yuming Yang, Zhiheng Xi, Tao Gui, Qi Zhang, Xuanjing Huang, Peng Wang, Zhongchao Shi, Jianping Fan, Zhengyin Du

Large language models (LLMs) achieve remarkable advancements by leveraging tools to interact with environments, a critical step toward generalized AI. However, the standard supervised fine-tuning (SFT) approach, which relies on large-scale datasets, often overlooks task-specific characteristics in tool use, leading to performance bottlenecks. To address this issue, we analyze three existing LLMs and uncover key insights: training data can inadvertently impede tool-use behavior, token importance is distributed unevenly, and errors in tool calls fall into a small set of categories. Building on these findings, we propose TL-Training, a task-feature-based framework that mitigates the effects of suboptimal training data, dynamically adjusts token weights to prioritize key tokens during SFT, and incorporates a robust reward mechanism tailored to error categories, optimized through proximal policy optimization. We validate TL-Training by training CodeLLaMA-2-7B and evaluating it on four open-source test sets. Our results demonstrate that the LLM trained by our method matches or surpasses both open- and closed-source LLMs in tool-use performance using only 1,217 training data points. Additionally, our method enhances robustness in noisy environments and improves general task performance, offering a scalable and efficient paradigm for tool-use training in LLMs. Code and data are available at https://github.com/Junjie-Ye/TL-Training.

#16 DCMKC: A Dual Consistency Matching Approach for Multi-hop Question Answering in LLMs

Authors: Xinyi Wang, Yiping Song, Chang Liu, Tingjin Luo, Bo Liu, Zheng Xie, Minlie Huang

Reasoning based on chains of thought (CoTs) enables large language models (LLMs) to solve problems by thinking step by step and has become the mainstream solution for Question-Answering (QA) tasks. Knowledge graph (KG)-enhanced CoT technology helps correct factual errors or predict reasoning direction. Existing KG-enhanced methods find relevant information in KGs “within” each reasoning step of CoTs. However, in some cases, logical connections “between” reasoning steps may be missing or wrong, leading to broken reasoning chains and wrong reasoning directions. To solve this problem, we argue that errors between reasoning steps require collaborative verification and mining of multiple triplets and multiple paths in the KG. We therefore propose the DCMKC (Dual Consistency Matching for KG and CoT) method, which aims to maintain semantic and structural consistency between KG and CoT. The main idea is to convert CoTs and KGs into two granularity-aligned graphs, transforming multi-hop reasoning and KG matching into iterative matching and modification of two graphs. In each iteration, DCMKC matches the KG reasoning chains with CoTs based on semantic similarity and judges the structural consistency between them. Then it modifies CoTs using the matched chains. After the iterations, the CoTs and KG reasoning chains reach high semantic and structural consistency, which is theoretically and experimentally demonstrated by kernel and spectral methods. The two kinds of chains are then used to generate the final answers. Experimental results show that our method outperforms baselines on multiple datasets, especially on multi-answer questions, with up to 5.1% improvement over the baseline.

#17 On Domain-Adaptive Post-Training for Multimodal Large Language Models

Authors: Daixuan Cheng, Shaohan Huang, Ziyu Zhu, Xintong Zhang, Xin Zhao, Zhongzhi Luan, Bo Dai, Zhenliang Zhang

Adapting general multimodal large language models (MLLMs) to specific domains, such as scientific and industrial fields, is highly significant in promoting their practical applications. This paper systematically investigates domain adaptation of MLLMs via post-training, focusing on data synthesis, training pipeline, and task evaluation. (1) **Data Synthesis**: Using only open-source models, we develop a generate-then-filter pipeline that curates diverse visual instruction tasks based on domain-specific image-caption pairs. The resulting data surpass the data synthesized by manual rules or strong closed-source models in enhancing domain-specific performance. (2) **Training Pipeline**: Unlike general MLLMs that typically adopt a two-stage training paradigm, we find that a single-stage approach is more effective for domain adaptation. (3) **Task Evaluation**: We conduct extensive experiments in high-impact domains such as biomedicine, food, and remote sensing, by post-training a variety of MLLMs and then evaluating MLLM performance on various domain-specific tasks. Finally, we fully open-source our models, code, and data to encourage future research in this area.

#18 CPO: Addressing Reward Ambiguity in Role-playing Dialogue via Comparative Policy Optimization

Authors: Jing Ye, Rui Wang, Yuchuan Wu, Victor Ma, Feiteng Fang, Fei Huang, Yongbin Li

Reinforcement Learning Fine-Tuning (RLFT) has achieved notable success in tasks with objectively verifiable answers (e.g., code generation, mathematical reasoning), yet struggles with open-ended subjective tasks like role-playing dialogue. Traditional reward modeling approaches, which rely on independent sample-wise scoring, face dual challenges: subjective evaluation criteria and unstable reward signals. Motivated by the insight that human evaluation inherently combines explicit criteria with implicit comparative judgments, we propose Comparative Policy Optimization (CPO). CPO redefines the reward evaluation paradigm by shifting from sample-wise scoring to comparative group-wise scoring. Building on the same principle, we introduce the CharacterArena evaluation framework, which comprises two stages: (1) Contextualized Multi-turn Role-playing Simulation, and (2) Trajectory-level Comparative Evaluation. By operationalizing subjective scoring via objective trajectory comparisons, CharacterArena minimizes contextual bias and enables more robust and fair performance evaluation. Empirical results on CharacterEval, CharacterBench, and CharacterArena confirm that CPO effectively mitigates reward ambiguity and leads to substantial improvements in dialogue quality.

#19 SPPD: Self-training with Process Preference Learning Using Dynamic Value Margin

Authors: Hao Yi, Qingyang Li, Yulan Hu, Fuzheng Zhang, Di Zhang, Yong Liu

Enhancing the numerical and logical reasoning capabilities of Large Language Models (LLMs) has become a prominent research focus. Existing approaches exhibit notable limitations: inference-phase techniques, such as Chain of Thought, depend on prompt engineering and pretrained knowledge; sentence-level Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) struggle to ensure step-wise mathematical correctness and often rely on model distillation or human annotations; Reinforcement Learning (RL) methods entail high GPU memory consumption and training instability. To overcome these challenges, we propose Self-training with Process Preference learning using Dynamic value margin (SPPD). SPPD formulates reasoning as a process-based Markov Decision Process (MDP), leveraging the Bellman optimality equation to derive a dynamic value margin for step-level preference optimization. It further incorporates tree-based self-sampling of model responses, eliminating the need for distillation. We theoretically establish that SPPD is equivalent to on-policy policy gradient methods under constrained reward functions. Experimental results on 7B-scale models show consistent superiority across both in-domain and out-of-domain mathematical benchmarks.

#20 Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework

Authors: Zhangyue Yin, YuHong Sun, Xuanjing Huang, Xipeng Qiu, Hui Zhao

Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains. Math Word Problems (MWPs) serve as a crucial benchmark for evaluating LLMs’ reasoning abilities. While most research primarily focuses on improving accuracy, it often neglects understanding and addressing the underlying patterns of errors. Current error classification methods rely on static and predefined categories, which limit their ability to capture the full spectrum of error patterns in mathematical reasoning. To enable systematic error analysis, we collect error samples from 15 different LLMs of varying sizes across four distinct MWP datasets using multiple sampling strategies. Based on this extensive collection, we introduce MWPES-300K, a comprehensive dataset containing 304,865 error samples that cover diverse error patterns and reasoning paths. To reduce human bias and enable fine-grained analysis of error patterns, we propose a novel framework for automated dynamic error classification in mathematical reasoning. Experimental results demonstrate that dataset characteristics significantly shape error patterns, which evolve from basic to complex manifestations as model capabilities increase. With deeper insights into error patterns, we propose Error-Aware Prompting (EAP) that incorporates common error patterns as explicit guidance, leading to significant improvements in mathematical reasoning performance.

#21 sudoLLM: On Multi-role Alignment of Language Models

Authors: Soumadeep Saha, Akshay Chaturvedi, Joy Mahapatra, Utpal Garain

User authorization-based access privileges are a key feature in many safety-critical systems, but have not been extensively studied in the large language model (LLM) realm. In this work, drawing inspiration from such access control systems, we introduce sudoLLM, a novel framework that results in multi-role aligned LLMs, i.e., LLMs that account for, and behave in accordance with, user access rights. sudoLLM injects subtle user-based biases into queries and trains an LLM to utilize this bias signal in order to produce sensitive information if and only if the user is authorized. We present empirical results demonstrating that this approach shows substantially improved alignment, generalization, resistance to prefix-based jailbreaking attacks, and “fails-closed”. The persistent tension between the language modeling objective and safety alignment, which is often exploited to jailbreak LLMs, is somewhat resolved with the aid of the injected bias signal. Our framework is meant as an additional security layer, and complements existing guardrail mechanisms for enhanced end-to-end safety with LLMs.

#22 DAC: Decomposed Automation Correction for Text-to-SQL

Authors: Dingzirui Wang, Longxu Dou, Xuanliang Zhang, Qingfu Zhu, Wanxiang Che

Text-to-SQL is an important task that helps access databases by generating SQL queries. Currently, automatically correcting the generated SQL with large language models (LLMs) is an effective method to enhance its quality. However, previous research shows that it is hard for LLMs to detect mistakes in SQL directly, leading to poor performance. Therefore, in this paper, we propose to employ decomposed correction to enhance text-to-SQL performance. We first demonstrate that detecting and fixing mistakes based on the decomposed sub-tasks is easier than using SQL directly. Then, we introduce Decomposed Automation Correction (DAC), which first generates the entities and skeleton corresponding to the question, and then compares the differences between the initial SQL and the generated entities and skeleton as feedback for correction. Experimental results show that, compared with the previous automation correction method, DAC improves performance by 1.4% on average across Spider, Bird, and KaggleDBQA, demonstrating the effectiveness of DAC.


#23 VehicleWorld: A Highly Integrated Multi-Device Environment for Intelligent Vehicle Interaction

Authors: Jie Yang, Jiajun Chen, Zhangyue Yin, Shuo Chen, Yuxin Wang, Yiran Guo, Yuan Li, Yining Zheng, Xuanjing Huang, Xipeng Qiu

Intelligent vehicle cockpits present unique challenges for API Agents, requiring coordination across tightly coupled subsystems whose complexity exceeds that of typical task environments. Traditional Function Calling (FC) approaches operate statelessly, requiring multiple exploratory calls to build environmental awareness before execution, leading to inefficiency and limited error recovery. We introduce VehicleWorld, the first comprehensive environment for the automotive domain, featuring 30 modules, 250 APIs, and 680 properties with fully executable implementations that provide real-time state information during agent execution. This environment enables precise evaluation of vehicle agent behaviors across diverse, challenging scenarios. Through systematic analysis, we found that direct state prediction outperforms function calling for environmental control. Building on this insight, we propose State-based Function Call (SFC), a novel approach that maintains explicit system state awareness and implements direct state transitions to achieve target conditions. Experimental results demonstrate that SFC significantly outperforms traditional FC approaches, achieving superior execution accuracy and reduced latency. We have made all implementation code publicly available on GitHub.
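The contrast between stateless function calling and direct state transitions can be sketched with a toy Python example. This is not the VehicleWorld API: the `ToyCabin` class, its property names, and both agent functions are hypothetical, and the call counter only illustrates the idea of exploratory calls rather than reproducing any reported result.

```python
from dataclasses import dataclass, field

@dataclass
class ToyCabin:
    """Hypothetical cockpit with three properties (not taken from VehicleWorld)."""
    state: dict = field(default_factory=lambda: {
        "ac.power": False,
        "ac.temperature": 24,
        "window.driver": "closed",
    })
    calls: int = 0  # number of interactions the agent makes with the environment

    def get(self, key):
        self.calls += 1
        return self.state[key]

    def set(self, key, value):
        self.calls += 1
        self.state[key] = value

def function_calling(env, goal):
    """Stateless style: query each property first, then issue a setter if needed."""
    for key, value in goal.items():
        if env.get(key) != value:
            env.set(key, value)

def state_based(env, predicted_state):
    """State-based style: apply the difference between current and predicted state."""
    diff = {k: v for k, v in predicted_state.items() if env.state.get(k) != v}
    env.state.update(diff)
    env.calls += 1  # a single state transition
    return diff

if __name__ == "__main__":
    fc_env, sfc_env = ToyCabin(), ToyCabin()
    function_calling(fc_env, {"ac.power": True, "ac.temperature": 21})
    diff = state_based(
        sfc_env,
        {"ac.power": True, "ac.temperature": 21, "window.driver": "closed"},
    )
    print(fc_env.calls, sfc_env.calls, diff)
```

Both agents end in the same cabin state; the difference in this toy setup is only in how many interactions were needed to get there.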


#24 End-to-End Optimization for Multimodal Retrieval-Augmented Generation via Reward Backpropagation

Authors: Zhiyuan Fan, Longfei Yun, Ming Yan, Yumeng Wang, Dadi Guo, Brian Mak, James Kwok, Yi R. Fung

Multimodal Retrieval-Augmented Generation (MM-RAG) has emerged as a promising approach for enhancing the reliability and factuality of large vision-language models (LVLMs). Because end-to-end loss backpropagation is infeasible due to non-differentiable operations during the forward process, current methods primarily focus on component-level optimization, which necessitates extensive component-specific training datasets and suffers from a gap between local and global optimization objectives. In this paper, we propose a new paradigm that backpropagates global rewards from the system output to each component and then transforms these rewards into specific local losses, enabling each component to perform gradient descent and thus ensuring end-to-end optimization. Specifically, we first insert two lightweight multimodal components, a query translator and an adaptive reranker, to address the heterogeneity of multimodal knowledge and the varying knowledge demands of different questions, and then tune only these inserted components using our proposed paradigm to integrate the entire system. Our method achieves SOTA performance on multiple knowledge-intensive multimodal benchmarks with high training efficiency, relying exclusively on supervised signals from an external reward model. Experimental results and our detailed analysis of the evolution of components during training collectively reveal the advantages and considerable potential of this paradigm as a promising direction for MM-RAG research.


#25 Audio-Aware Large Language Models as Judges for Speaking Styles

Authors: Cheng-Han Chiang, Xiaofei Wang, Chung-Ching Lin, Kevin Lin, Linjie Li, Radu Kopetz, Yao Qian, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang

Audio-aware large language models (ALLMs) can understand the textual and non-textual information in audio input. In this paper, we explore using ALLMs as automatic judges to assess the speaking styles of speech. We use ALLM judges to evaluate speech generated by spoken language models (SLMs) on two tasks: voice style instruction following and role-playing. The speaking-style aspects we consider include emotion, volume, speaking pace, word emphasis, pitch control, and non-verbal elements. We use four SLMs to complete the two tasks and use humans and ALLMs to judge the SLMs’ responses. We compare two ALLM judges, GPT-4o-audio and Gemini-2.5-pro, with human evaluation results and show that the agreement between Gemini and human judges is comparable to the agreement between human evaluators. These promising results show that ALLMs can be used as judges to evaluate SLMs. Our results also reveal that current SLMs, even GPT-4o-audio, still have room for improvement in controlling the speaking style and generating natural dialogues.
