ACL.2025 - Student Research Workshop

Total: 86

#1 Advancing African-Accented English Speech Recognition: Epistemic Uncertainty-Driven Data Selection for Generalizable ASR Models

Author: Bonaventure F. P. Dossou

Accents play a pivotal role in shaping human communication, enhancing our ability to convey and comprehend messages with clarity and cultural nuance. While there has been significant progress in Automatic Speech Recognition (ASR), African-accented English ASR has been understudied due to a lack of training datasets, which are often expensive to create and demand colossal human labor. Combining several active learning paradigms with the core-set approach, we propose a new multi-round adaptation process that uses epistemic uncertainty to automate annotation, significantly reducing the associated costs and human labor. This novel method streamlines data annotation and strategically selects the data samples that contribute most to model uncertainty, enhancing training efficiency. We define a new U-WER metric to track model adaptation to hard accents. We evaluate our approach across several domains, datasets, and high-performing speech models. Our results show that our approach yields an average relative WER improvement of 27% while requiring, on average, 45% less data than established baselines. Our approach also improves out-of-distribution generalization for very low-resource accents, demonstrating its viability for building generalizable ASR models in the context of accented African ASR. We open-source the code here.
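A minimal sketch of the selection idea, assuming Monte Carlo dropout as the epistemic-uncertainty estimate; the helper names and the toy model below are illustrative, not the paper's actual pipeline:

```python
# Hedged sketch: rank an unlabeled pool by epistemic uncertainty (MC dropout)
# and send only the top-k samples to annotation. Illustrative names throughout.
import torch
import torch.nn as nn

def mc_dropout_uncertainty(model: nn.Module, x: torch.Tensor, n_passes: int = 10) -> float:
    """Epistemic uncertainty as the variance across stochastic forward passes."""
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_passes)])
    return preds.var(dim=0).mean().item()

def select_for_annotation(model: nn.Module, pool: list, k: int) -> list:
    """Pick the k pool items the model is most uncertain about."""
    scored = [(mc_dropout_uncertainty(model, x), i) for i, x in enumerate(pool)]
    return [i for _, i in sorted(scored, reverse=True)[:k]]

# Toy stand-in for an ASR encoder over a pool of 100 utterances.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Dropout(0.3), nn.Linear(32, 8))
pool = [torch.randn(1, 16) for _ in range(100)]
print(select_for_annotation(model, pool, k=5))
```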


#2 Beyond the Gold Standard in Analytic Automated Essay Scoring

Author: Gabrielle Gaudeau

Originally developed to reduce the manual burden of grading standardised language tests, Automated Essay Scoring (AES) research has long focused on holistic scoring methods which offer minimal formative feedback in the classroom. With the increasing demand for technological tools that support language acquisition, the field is turning to analytic AES (evaluating essays according to different linguistic traits). This approach holds promise for generating more detailed essay feedback, but relies on analytic scoring data that is both more cognitively demanding for humans to produce, and prone to bias. The dominant paradigm in AES is to aggregate disagreements between raters into a single gold-standard label, which fails to account for genuine examiner variability. In an attempt to make AES more representative and trustworthy, we propose to explore the sources of disagreements and lay out a novel AES system design that learns from individual raters instead of the gold standard labels.


#3 Confidence and Stability of Global and Pairwise Scores in NLP Evaluation

Authors: Georgii Levtsov, Dmitry Ustalov

With the advent of highly capable instruction-tuned neural language models, benchmarking in natural language processing (NLP) is increasingly shifting towards pairwise comparison leaderboards, such as LMSYS Arena, from traditional global pointwise scores (e.g., GLUE, BIG-bench, SWE-bench). This paper empirically investigates the strengths and weaknesses of both global scores and pairwise comparisons to aid decision-making in selecting appropriate model evaluation strategies. Through computational experiments on synthetic and real-world datasets using standard global metrics and the popular Bradley–Terry model for pairwise comparisons, we found that while global scores provide more reliable overall rankings, they can underestimate strong models with rare, significant errors or low confidence. Conversely, pairwise comparisons are particularly effective for identifying strong contenders among models with lower global scores, especially where quality metrics are hard to define (e.g., text generation), though they require more comparisons to converge if ties are frequent. Our code and data are available at https://github.com/HSPyroblast/srw-ranking under a permissive license.
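For readers unfamiliar with the pairwise side, a compact sketch of fitting Bradley–Terry strengths from a win matrix with the standard MM updates; the win counts are made up for illustration:

```python
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 100) -> np.ndarray:
    """MM fit of Bradley-Terry strengths; wins[i, j] = #times model i beat model j."""
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):
        total_wins = wins.sum(axis=1)
        games = wins + wins.T                         # comparisons per pair
        denom = games / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)
        p = total_wins / denom.sum(axis=1)
        p /= p.sum()                                  # fix the arbitrary scale
    return p

# Toy leaderboard of three models from 30 pairwise judgments.
wins = np.array([[0, 8, 9], [2, 0, 6], [1, 4, 0]])
print(bradley_terry(wins))  # strengths, highest = strongest contender
```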


#4 Zero-shot prompt-based classification: topic labeling in times of foundation models in German Tweets

Authors: Simon Münker, Kai Kugler, Achim Rettinger

Filtering and annotating textual data are routine tasks in many areas, like social media or news analytics. Automating these tasks makes it possible to scale the analyses with respect to speed and breadth of content covered, and decreases the manual effort required. Due to technical advancements in Natural Language Processing, specifically the success of large foundation models, a new tool has become available for automating such annotation processes: a text-to-text interface that follows written guidelines without requiring training samples. In this work, we assess these advancements in the wild by empirically testing them in an annotation task on German Twitter data about social and political European crises. We compare the prompt-based results with our human annotation and preceding classification approaches, including Naive Bayes and a BERT-based fine-tuning/domain adaptation pipeline. Our results show that the prompt-based approach, despite being limited by local computation resources during the model selection, is comparable with the fine-tuned BERT but requires no annotated training data. Our findings emphasize the ongoing paradigm shift in the NLP landscape, i.e., the unification of downstream tasks and the elimination of the need for pre-labeled training data.
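As a concrete picture of the text-to-text interface, here is a hedged sketch of guideline-based zero-shot labeling; the model choice (google/flan-t5-base), label set, and prompt wording are our assumptions, not the paper's setup:

```python
# Hedged sketch: classify tweets by topic from written guidelines only,
# with no training samples. Requires the `transformers` library.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")

GUIDELINE = (
    "Classify the tweet into exactly one topic: politics, economy, migration, "
    "or other. Answer with the label only.\n\nTweet: {tweet}\nLabel:"
)

def zero_shot_label(tweet: str) -> str:
    out = generator(GUIDELINE.format(tweet=tweet), max_new_tokens=5)
    return out[0]["generated_text"].strip().lower()

print(zero_shot_label("The EU summit on energy prices starts tomorrow."))
```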


#5 Rethinking Full Finetuning from Pretraining Checkpoints in Active Learning for African Languages

Authors: Bonaventure F. P. Dossou, Ines Arous, Jackie CK Cheung

Active learning (AL) aims to reduce annotation effort by iteratively selecting the most informative samples for labeling. The dominant strategy in AL involves fully finetuning the model on all acquired data after each round, which is computationally expensive in multilingual and low-resource settings. This paper investigates continual finetuning (CF), an alternative update strategy where the model is updated only on newly acquired samples at each round. We evaluate CF against full finetuning (FA) across 28 African languages using MasakhaNEWS and SIB-200. Our analysis reveals three key findings. First, CF matches or outperforms FA for languages included in the model’s pretraining, achieving up to 35% reductions in GPU memory, FLOPs, and training time. Second, CF performs comparably even for languages not seen during pretraining when they are typologically similar to those that were. Third, CF’s effectiveness depends critically on uncertainty-based acquisition; without it, performance deteriorates significantly. While FA remains preferable for some low-resource languages, the overall results establish CF as a robust, cost-efficient alternative for active learning in multilingual NLP. These findings motivate developing hybrid AL strategies that adapt fine-tuning behavior based on pretraining coverage, language typology, and acquisition dynamics.
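The CF/FA contrast can be made concrete in a toy active-learning loop; everything below (the linear probe, entropy acquisition, random data) is an illustrative stand-in rather than the paper's experimental setup:

```python
import torch, torch.nn as nn, torch.nn.functional as F

def train_on(model, xs, ys, steps=50):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(model(xs), ys).backward()
        opt.step()
    return model

def entropy(model, xs):
    with torch.no_grad():
        p = model(xs).softmax(-1)
    return -(p * p.clamp_min(1e-9).log()).sum(-1)

def al_loop(strategy="CF", rounds=3, k=20):
    torch.manual_seed(0)
    X, y = torch.randn(300, 8), torch.randint(0, 3, (300,))
    pool, labeled = torch.arange(300), torch.tensor([], dtype=torch.long)
    model = nn.Linear(8, 3)
    for _ in range(rounds):
        batch = pool[entropy(model, X[pool]).topk(k).indices]  # uncertainty-based acquisition
        pool = pool[~torch.isin(pool, batch)]
        labeled = torch.cat([labeled, batch])
        if strategy == "FA":
            model = train_on(nn.Linear(8, 3), X[labeled], y[labeled])  # retrain on all acquired data
        else:
            model = train_on(model, X[batch], y[batch])  # CF: update on the new batch only
    return model

al_loop("CF"); al_loop("FA")
```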


#6 HYPEROFA: Expanding LLM Vocabulary to New Languages via Hypernetwork-Based Embedding Initialization

Authors: Enes Özeren, Yihong Liu, Hinrich Schuetze

Many pre-trained language models (PLMs) exhibit suboptimal performance on mid- and low-resource languages, largely due to limited exposure to these languages during pre-training. A common strategy to address this is to introduce new tokens specific to the target languages, initialize their embeddings, and apply continual pre-training on target-language data. Among such methods, OFA (Liu et al., 2024a) proposes a similarity-based subword embedding initialization heuristic that is both effective and efficient. However, OFA restricts target-language token embeddings to be convex combinations of a fixed number of source-language embeddings, which may limit expressiveness. To overcome this limitation, we propose HYPEROFA, a hypernetwork-based approach for more adaptive token embedding initialization. The hypernetwork is trained to map from an external multilingual word vector space to the PLM’s token embedding space using source-language tokens. Once trained, it can generate flexible embeddings for target-language tokens, serving as a good starting point for continual pre-training. Experiments demonstrate that HYPEROFA consistently outperforms a random initialization baseline and matches or exceeds the performance of OFA in both continual pre-training convergence and downstream task performance. We make the code publicly available.
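A minimal sketch of the hypernetwork step, with toy dimensions and random tensors standing in for fastText-style external vectors and the PLM's embedding table:

```python
import torch, torch.nn as nn

d_ext, d_emb, n_src, n_tgt = 300, 768, 5000, 200
ext_src, ext_tgt = torch.randn(n_src, d_ext), torch.randn(n_tgt, d_ext)
plm_src = torch.randn(n_src, d_emb)  # stand-in for the PLM's source-token embeddings

# Hypernetwork: external word-vector space -> PLM embedding space.
hyper = nn.Sequential(nn.Linear(d_ext, 512), nn.GELU(), nn.Linear(512, d_emb))
opt = torch.optim.Adam(hyper.parameters(), lr=1e-3)
for _ in range(200):  # fit on source-language tokens only
    opt.zero_grad()
    nn.functional.mse_loss(hyper(ext_src), plm_src).backward()
    opt.step()

with torch.no_grad():
    new_embeddings = hyper(ext_tgt)  # initialization for added target-language tokens
```

Unlike a convex combination of a fixed set of source embeddings, the learned map can place new tokens anywhere in the embedding space.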


#7 SEPSIS: I Can Catch Your Lies – A New Paradigm for Deception Detection

Authors: Anku Rani, Dwip Dalal, Shreya Gautam, Pankaj Gupta, Vinija Jain, Aman Chadha, Amit Sheth, Amitava Das

Deception is the intentional practice of twisting information. It is a nuanced societal practice deeply intertwined with human societal evolution, characterized by a multitude of facets. This research explores the problem of deception through the lens of psychology, employing a framework that categorizes deception into three forms: lies of omission, lies of commission, and lies of influence. The primary focus of this study is specifically on investigating lies of omission. We propose a novel framework for deception detection leveraging NLP techniques. We curated an annotated dataset of 876,784 samples by amalgamating a popular large-scale fake news dataset and news headlines scraped from the Twitter handle of “Times of India”, a well-known Indian news media house. Each sample has been labeled with four layers, namely: (i) the type of omission (speculation, bias, distortion, sounds factual, and opinion), (ii) colors of lies (black, white, grey, and red), (iii) the intention of such lies (to influence, gain social prestige, etc.), and (iv) the topic of lies (political, educational, religious, racial, and ethnicity). We present a novel multi-task learning (MTL) pipeline that leverages the dataless merging of fine-tuned language models to address the deception detection task mentioned earlier. Our proposed model achieved an impressive F1 score of 0.87, demonstrating strong performance across all layers, including the type, color, intent, and topic aspects of deceptive content. Finally, our research aims to explore the relationship between lies of omission and propaganda techniques. To accomplish this, we conducted an in-depth analysis, uncovering compelling findings. For instance, our analysis revealed a significant correlation between loaded language and opinion, shedding light on their interconnectedness. To encourage further research in this field, we are releasing the SEPSIS dataset and code at https://huggingface.co/datasets/ankurani/deception.
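The abstract does not spell out the "dataless merging" step; one common reading is plain parameter averaging across fine-tuned checkpoints, sketched below purely as an assumption, not the authors' exact procedure:

```python
# Hedged sketch: merge several fine-tuned models without any data by
# averaging their parameters (a "model soup"-style merge).
import torch

def average_state_dicts(state_dicts: list) -> dict:
    keys = state_dicts[0].keys()
    return {k: torch.stack([sd[k].float() for sd in state_dicts]).mean(0) for k in keys}

# merged = average_state_dicts([m.state_dict() for m in per_layer_models])
# multitask_model.load_state_dict(merged)
```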


#8 Can Multi-turn Self-refined Single Agent LMs with Retrieval Solve Hard Coding Problems?

Authors: Md Tanzib Hosain, Md Kishor Morol

Among the hardest tasks for humans are those found in competitive programming, where problems require sophisticated algorithmic thinking, puzzle solving, and the creation of effective code. As a domain for assessing language models (LMs), however, it has not received enough attention. This study presents the ICPC benchmark, which consists of 254 International Collegiate Programming Contest (ICPC) tasks. Each problem includes an official analysis, reference code, and sample, high-quality unit, and hidden tests. We are able to develop and evaluate a variety of LM inference techniques for competitive programming with these resources. With zero-shot chain-of-thought prompting, we find that o1 achieves only a 19.1% pass@1 solve rate. Our best inference technique, which combines multi-turn self-judging with reflection and retrieval over episodic information, raises this to 42.2%. Furthermore, we conduct a new human-in-the-loop investigation to gain a deeper understanding of the remaining difficulties. Surprisingly, we discover that o1 can solve 17 out of 18 problems that were previously unsolvable by any model or technique with just a few specific instructions. A step toward LMs with grounded, imaginative, and algorithmic thinking is provided by our quantitative findings and qualitative research. We open source our code at https://github.com/kraritt/zolve.
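A skeleton of such a multi-turn self-judge/reflect loop with episodic retrieval is sketched below; `llm` is any text-in/text-out callable, and all prompts and helper names are illustrative assumptions:

```python
import os, subprocess, tempfile

def run_tests(code: str, tests: str) -> tuple:
    """Execute candidate code against unit tests in a scratch file."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + tests)
        path = f.name
    r = subprocess.run(["python", path], capture_output=True, text=True, timeout=30)
    os.unlink(path)
    return r.returncode == 0, r.stderr

def solve(llm, problem: str, tests: str, memory: list, turns: int = 5):
    context = "\n".join(memory[-3:])  # retrieve recent episodic notes
    code = llm(f"{context}\nSolve:\n{problem}\nReturn Python code only.")
    for _ in range(turns):
        ok, err = run_tests(code, tests)
        if ok:
            return code
        judgment = llm(f"The code failed with:\n{err}\nCritique it as a judge.")
        memory.append(judgment)  # episodic memory persists across turns
        code = llm(f"Problem:\n{problem}\nPrevious code:\n{code}\n"
                   f"Judge feedback:\n{judgment}\nReturn fixed Python code only.")
    return None
```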


#9 Do Androids Question Electric Sheep? A Multi-Agent Cognitive Simulation of Philosophical Reflection on Hybrid Table Reasoning

Author: Yiran Rex Ma

While LLMs demonstrate remarkable reasoning capabilities and multi-agent applicability, their tendencies to “overthink” and “groupthink” pose intriguing parallels to human cognitive limitations. Inspired by this observation, we conduct an exploratory simulation to investigate whether LLMs are wise enough to be thinkers of philosophical reflection. We design two frameworks, Philosopher and Symposium, which simulate self- and group-reflection for multiple personas in hybrid table reasoning tasks. Through experiments across four benchmarks, we discover that while introducing varied perspectives might help, LLMs tend to under-perform simpler end-to-end approaches. Through close reading, we identify five emergent behaviors that strikingly resemble human cognitive closure-seeking, and we find a consistent “overthinking threshold” pattern across all tasks, where collaborative reasoning often reaches a critical point of diminishing returns. This study sheds light on a fundamental challenge shared by both human and machine intelligence: the delicate balance between deliberation and decisiveness.


#10 Grouped Sequency-arranged Rotation: Optimizing Rotation Transformation for Quantization for Free

Authors: Euntae Choi, Sumin Song, Woosang Lim, Sungjoo Yoo

Large Language Models (LLMs) face deployment challenges due to high computational costs, and while Post-Training Quantization (PTQ) offers a solution, existing rotation-based methods struggle at very low bit-widths such as 2-bit. We introduce a novel, training-free approach to constructing an improved rotation matrix, addressing the limitations of current methods. The key contributions include leveraging the Walsh-Hadamard transform with sequency ordering, which clusters similar frequency components to reduce quantization error compared to standard Hadamard matrices, significantly improving performance. Furthermore, we propose Grouped Sequency-arranged Rotation (GSR), which uses block-diagonal matrices with smaller Walsh blocks, effectively isolating outlier impacts and achieving performance comparable to optimization-based methods without requiring any training. Our method demonstrates robust performance on reasoning tasks and in perplexity (PPL) on WikiText-2, and it improves results even when applied on top of existing learned rotation techniques.
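The sequency-ordering and grouping steps are easy to reproduce in a few lines; this is a hedged sketch of the matrix construction only (block size and dimensions are arbitrary here), not the full quantization pipeline:

```python
import numpy as np
from scipy.linalg import block_diag, hadamard

def walsh_sequency(n: int) -> np.ndarray:
    """Hadamard matrix with rows sorted by sequency (number of sign changes)."""
    H = hadamard(n)  # n must be a power of two
    sign_changes = (np.diff(H, axis=1) != 0).sum(axis=1)
    return H[np.argsort(sign_changes)] / np.sqrt(n)  # orthonormal rotation

def grouped_sequency_rotation(dim: int, block: int) -> np.ndarray:
    """GSR-style block-diagonal rotation built from smaller Walsh blocks."""
    assert dim % block == 0
    return block_diag(*[walsh_sequency(block)] * (dim // block))

R = grouped_sequency_rotation(dim=64, block=16)
print(np.allclose(R @ R.T, np.eye(64)))  # rotation property holds -> True
```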


#11 A Reproduction Study: The Kernel PCA Interpretation of Self-Attention Fails Under Scrutiny

Authors: Karahan Sarıtaş, Çağatay Yıldız

In this reproduction study, we revisit recent claims that self-attention implements kernel principal component analysis (KPCA) (Teo and Nguyen, 2024), positing that (i) value vectors V capture the eigenvectors of the Gram matrix of the keys, and (ii) self-attention projects queries onto the principal component axes of the key matrix K in a feature space. Our analysis reveals three critical inconsistencies: (1) no alignment exists between learned self-attention value vectors and what the KPCA perspective proposes, with average similarity metrics (optimal cosine similarity ≤ 0.32, linear CKA (Centered Kernel Alignment) ≤ 0.11, kernel CKA ≤ 0.32) indicating negligible correspondence; (2) reported decreases in the reconstruction loss J_proj, which arguably justify the claim that self-attention minimizes the projection error of KPCA, are misinterpreted, as the quantities involved differ by orders of magnitude (∼10³); (3) Gram matrix eigenvalue statistics, introduced to justify that V captures the eigenvectors of the Gram matrix, are irreproducible without undocumented implementation-specific adjustments. Across 10 transformer architectures, we conclude that the KPCA interpretation of self-attention lacks empirical support.
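Linear CKA, one of the similarity metrics quoted above, is a short computation; the random matrices below merely illustrate the "negligible correspondence" regime:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between representations X (n x d1) and Y (n x d2)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
V = rng.normal(size=(128, 64))        # e.g. learned value vectors
V_kpca = rng.normal(size=(128, 64))   # vs. KPCA-predicted counterparts
print(linear_cka(V, V_kpca))          # near 0 for unrelated representations
```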


#12 Transforming Brainwaves into Language: EEG Microstates Meet Text Embedding Models for Dementia Detection

Authors: Quoc-Toan Nguyen, Linh Le, Xuan-The Tran, Dorothy Bai, Nghia Duong-Trung, Thomas Do, Chin-teng Lin

This study proposes a novel, scalable, non-invasive and channel-independent approach for early dementia detection, particularly Alzheimer’s Disease (AD), by representing Electroencephalography (EEG) microstates as symbolic, language-like sequences. These representations are processed via text embedding and time-series deep learning models for classification. Developed on EEG data from 1001 participants across multiple countries, the proposed method achieves a high accuracy of 94.31% for AD detection. By eliminating the need for fixed EEG configurations and costly/invasive modalities, the introduced approach improves generalisability and enables cost-effective deployment without requiring separate AI models or specific devices. It facilitates scalable and accessible dementia screening, supporting timely interventions and enhancing AD detection in resource-limited communities.
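A hedged sketch of the symbolic-encoding step only: per-timepoint microstate labels (canonically classes A-D) become a language-like string that a text embedding model can consume. The collapsing rule and all names here are our assumptions, not the paper's implementation:

```python
import numpy as np

MICROSTATES = "ABCD"

def microstates_to_text(labels: np.ndarray) -> str:
    """Collapse consecutive repeats so microstate segments read like tokens."""
    tokens = [MICROSTATES[labels[0]]]
    for l in labels[1:]:
        if MICROSTATES[l] != tokens[-1]:
            tokens.append(MICROSTATES[l])
    return " ".join(tokens)

labels = np.random.default_rng(0).integers(0, 4, size=500)  # stand-in labeling
text = microstates_to_text(labels)  # e.g. "A C B D ..."
# embedding = text_embedding_model.encode(text)  # then a time-series classifier
```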


#13 Neuron-Level Language Tag Injection Improves Zero-Shot Translation Performance

Authors: Jay Orten, Ammon Shurtz, Nancy Fulda, Stephen D. Richardson

Language tagging, a method whereby source and target inputs are prefixed with a unique language token, has become the de facto standard for conditioning Multilingual Neural Machine Translation (MNMT) models on specific language directions. This conditioning can manifest effective zero-shot translation abilities in MT models at scale for many languages. Expanding on previous work, we propose a novel method of language tagging for MNMT, injection, in which the embedded representation of a language token is concatenated to the input of every linear layer. We explore a variety of different tagging methods, with and without injection, showing that injection improves zero-shot translation performance, with gains of up to 2+ BLEU points for certain language directions in our dataset.
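The injection mechanism reduces to a small module; dimensions and names below are illustrative, not the paper's configuration:

```python
import torch, torch.nn as nn

class TagInjectedLinear(nn.Module):
    """Linear layer whose input is concatenated with a language-tag embedding."""
    def __init__(self, d_in: int, d_out: int, d_tag: int):
        super().__init__()
        self.proj = nn.Linear(d_in + d_tag, d_out)

    def forward(self, x: torch.Tensor, tag: torch.Tensor) -> torch.Tensor:
        tag = tag.unsqueeze(1).expand(-1, x.size(1), -1)  # broadcast over sequence
        return self.proj(torch.cat([x, tag], dim=-1))

tag_table = nn.Embedding(10, 32)              # one embedding per language token
layer = TagInjectedLinear(d_in=512, d_out=512, d_tag=32)
x = torch.randn(4, 20, 512)                   # (batch, seq, hidden)
tag = tag_table(torch.tensor([3, 3, 7, 7]))   # per-sample target language
print(layer(x, tag).shape)                    # torch.Size([4, 20, 512])
```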


#14 Voices of Dissent: A Multimodal Analysis of Protest Songs through Lyrics and Audio

Authors: Utsav Shekhar, Radhika Mamidi

Music has long served as a vehicle for political expression, with protest songs playing a central role in articulating dissent and mobilizing collective action. Yet, despite their cultural significance, the linguistic and acoustic signatures that define protest music remain understudied. We present a multimodal computational analysis of protest and non-protest songs spanning multiple decades. Using NLP and audio analysis, we identify the linguistic and musical features that differentiate protest songs. Instead of focusing on classification performance, we treat classification as a diagnostic tool to investigate these features and reveal broader patterns. Protest songs are not just politically charged; they are acoustically and linguistically distinct, and we quantify how.


#15 Your Pretrained Model Tells the Difficulty Itself: A Self-Adaptive Curriculum Learning Paradigm for Natural Language Understanding

Authors: Qi Feng, Yihong Liu, Hinrich Schuetze

Curriculum learning is a widely adopted training strategy in natural language processing (NLP), where models are exposed to examples organized by increasing difficulty to enhance learning efficiency and performance. However, most existing approaches rely on manually defined difficulty metrics, such as text length, which may not accurately reflect the model’s own perspective. To overcome this limitation, we present a self-adaptive curriculum learning paradigm that prioritizes fine-tuning examples based on difficulty scores predicted by pre-trained language models (PLMs) themselves. Building on these scores, we explore various training strategies that differ in the ordering of examples for fine-tuning: easy-to-hard, hard-to-easy, and mixed sampling. We evaluate our method on four natural language understanding (NLU) datasets covering both binary and multi-class classification tasks. Experimental results show that our approach leads to faster convergence and improved performance compared to standard random sampling.
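One simple instantiation of model-predicted difficulty is the pretrained model's own loss on each example; the sketch below uses BERT's masked-language-modeling loss as a rough proxy and is an assumption, not the paper's exact scoring:

```python
# Hedged sketch: let the pretrained model score difficulty, then order
# the fine-tuning stream easy-to-hard. Requires `transformers`.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def difficulty(text: str) -> float:
    """Higher self-supervised loss = the PLM finds the text harder."""
    enc = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = mlm(**enc, labels=enc["input_ids"])
    return out.loss.item()

data = ["the movie was great", "an opaque, meandering denouement", "bad film"]
curriculum = sorted(data, key=difficulty)  # easy-to-hard; reverse for hard-to-easy
```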


#16 CausalGraphBench: a Benchmark for Evaluating Language Models capabilities of Causal Graph discovery

Authors: Nikolay Babakov, Ehud Reiter, Alberto Bugarín-Diz

This paper introduces CausalGraphBench, a benchmark designed to evaluate the ability of large language models (LLMs) to construct Causal Graphs (CGs), a critical component of reasoning models such as Bayesian Networks. The benchmark comprises 35 CGs sourced from publicly available repositories and academic papers, each enriched with detailed metadata to facilitate systematic and consistent evaluation. We explore various LLM-driven methods for CG discovery, analyzing their performance across different graph sizes and complexity levels. Additionally, we examine the effects of data contamination on the quality of the generated CGs. Our findings reveal that methods that issue a limited number of LLM queries, particularly those leveraging the full graph context, consistently outperform query-intensive and exhaustive approaches, which tend to overemphasize local relationships. Across all methods, performance declines as graph size increases.
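The contrast between query regimes can be pictured as follows, with `llm` as any text-in/text-out callable; prompts and parsing are illustrative only:

```python
from itertools import permutations

def full_graph_query(llm, variables: list) -> str:
    """One call that sees the whole graph context."""
    prompt = ("Given the variables " + ", ".join(variables) +
              ", list all direct causal edges as 'A -> B', one per line.")
    return llm(prompt)

def pairwise_queries(llm, variables: list) -> list:
    """O(n^2) calls, each with a purely local view."""
    edges = []
    for a, b in permutations(variables, 2):
        if llm(f"Does {a} directly cause {b}? Answer yes or no.").strip().lower() == "yes":
            edges.append(f"{a} -> {b}")
    return edges
```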


#17 Reasoning for Translation: Comparative Analysis of Chain-of-Thought and Tree-of-Thought Prompting for LLM Translation

Authors: Lam Nguyen, Yang Xu

As Large Language Models (LLMs) continue to advance in capability, prompt engineering has emerged as a crucial method for optimizing their performance on specialized tasks. While prompting strategies like Zero-shot, Few-shot, Chain-of-Thought, and Tree-of-Thought have demonstrated significant improvements in reasoning tasks, their application to machine translation has received comparatively less attention. This paper systematically evaluates these prompting techniques across diverse language pairs and domains, measuring their effect on translation quality. Our findings reveal substantial performance variations between prompting methods, with certain strategies offering consistent improvements for specific language directions and complexity levels. These results provide valuable insights for developing more effective LLM-based translation systems without requiring model fine-tuning and complement existing works in the field.


#18 iPrOp: Interactive Prompt Optimization for Large Language Models with a Human in the Loop

Authors: Jiahui Li, Roman Klinger

Prompt engineering has made significant contributions in the era of large language models, yet its effectiveness depends on the skills of the prompt author. This paper introduces iPrOp, a novel interactive prompt optimization approach that bridges manual prompt engineering and automatic prompt optimization while offering users the flexibility to assess evolving prompts. We aim to provide users with task-specific guidance to enhance human engagement in the optimization process, which is structured through prompt variations, informative instances, predictions generated by large language models along with their corresponding explanations, and relevant performance metrics. This approach empowers users to choose and further refine prompts based on their individual preferences and needs. It can not only assist non-technical domain experts in generating optimal prompts tailored to their specific tasks or domains, but also enable the study of the intrinsic parameters that influence the performance of prompt optimization. Our evaluation shows that our approach can generate improved prompts, leading to enhanced task performance.


#19 Evaluating Structured Output Robustness of Small Language Models for Open Attribute-Value Extraction from Clinical Notes

Authors: Nikita Neveditsin, Pawan Lingras, Vijay Kumar Mago

We present a comparative analysis of the parseability of structured outputs generated by small language models for open attribute-value extraction from clinical notes. We evaluate three widely used serialization formats: JSON, YAML, and XML, and find that JSON consistently yields the highest parseability. Structural robustness improves with targeted prompting and larger models, but declines for longer documents and certain note types. Our error analysis identifies recurring format-specific failure patterns. These findings offer practical guidance for selecting serialization formats and designing prompts when deploying language models in privacy-sensitive clinical settings.
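A parseability check of the kind underlying this comparison can be as small as the sketch below (PyYAML is assumed for the YAML case; the failure examples are made up):

```python
import json
import xml.etree.ElementTree as ET
import yaml  # pip install pyyaml

def is_parseable(output: str, fmt: str) -> bool:
    """Return True iff the model output parses in the requested format."""
    try:
        if fmt == "json":
            json.loads(output)
        elif fmt == "yaml":
            yaml.safe_load(output)
        elif fmt == "xml":
            ET.fromstring(output)
        else:
            raise ValueError(f"unknown format: {fmt}")
        return True
    except Exception:
        return False

print(is_parseable('{"attribute": "dose", "value": "5 mg"}', "json"))  # True
print(is_parseable('<note><dose>5 mg</dose>', "xml"))                  # False: unclosed tag
```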


#20 FaithfulSAE: Towards Capturing Faithful Features with Sparse Autoencoders without External Datasets Dependency

Authors: Seonglae Cho, Harryn Oh, Donghyun Lee, Luis Rodrigues Vieira, Andrew Bermingham, Ziad El Sayed

Sparse Autoencoders (SAEs) have emerged as a promising solution for decomposing large language model representations into interpretable features. However, Paulo & Belrose (2025) have highlighted instability across different initialization seeds, and Heap et al. (2025) have pointed out that SAEs may not capture model-internal features. These problems likely stem from training SAEs on external datasets, either collected from the Web or generated by another model, which may contain out-of-distribution (OOD) data beyond the model’s generalisation capabilities. This can result in hallucinated SAE features, which we term “Fake Features”, that misrepresent the model’s internal activations. To address these issues, we propose FaithfulSAE, a method that trains SAEs on the model’s own synthetic dataset. Using FaithfulSAEs, we demonstrate that training SAEs on less-OOD instruction datasets results in SAEs that are more stable across seeds. Notably, FaithfulSAEs outperform SAEs trained on web-based datasets in the SAE probing task and exhibit a lower Fake Feature Ratio in 5 out of 7 models. Overall, our approach eliminates the dependency on external datasets, advancing interpretability by better capturing model-internal features while highlighting the often neglected importance of SAE training datasets.
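For orientation, a vanilla SAE objective looks like the sketch below; FaithfulSAE's change is where the activations come from (the model's own synthetic generations), not the objective, and all specifics here are illustrative:

```python
import torch, torch.nn as nn

class SparseAutoencoder(nn.Module):
    """L1-sparse autoencoder over captured activations."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))  # sparse feature activations
        return self.dec(f), f

sae = SparseAutoencoder(d_model=512, d_dict=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(1024, 512)  # stand-in: activations from the model's own synthetic data
for _ in range(10):
    opt.zero_grad()
    recon, f = sae(acts)
    loss = (recon - acts).pow(2).mean() + 1e-3 * f.abs().mean()  # reconstruction + L1
    loss.backward()
    opt.step()
```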


#21 Translating Movie Subtitles by Large Language Models using Movie-meta Information

Authors: Ashmari Pramodya, Yusuke Sakai, Justin Vasselli, Hidetaka Kamigaito, Taro Watanabe

Large language models (LLMs) have advanced natural language processing by understanding, generating, and manipulating texts. Although recent studies have shown that prompt engineering can reduce computational effort and potentially improve translation quality, prompt designs specific to different domains remain challenging. Moreover, movie subtitle translation is particularly challenging and understudied, as it involves handling colloquial language, preserving cultural nuances, and incorporating contextual information such as the movie’s theme and storyline to ensure accurate meaning. This study aims to fill this gap by focusing on the translation of movie subtitles through prompting strategies that incorporate the movie’s meta-information, e.g., its title, summary, and genre. We build a multilingual dataset that aligns entries from the OpenSubtitles dataset with their corresponding Wikipedia articles, and we investigate different prompts and their effect on translation performance. Our experiments with GPT-3.5, GPT-4o, and LLaMA-3 models show that the presence of meta-information improves translation accuracy. These findings further emphasize the importance of designing appropriate prompts and highlight the potential of LLMs to enhance subtitle translation quality.


#22 Pun2Pun: Benchmarking LLMs on Textual-Visual Chinese-English Pun Translation via Pragmatics Model and Linguistic Reasoning

Authors: Yiran Rex Ma, Shan Huang, Yuting Xu, Ziyu Zhou, Yuanxi Wei

Puns, as a unique form of linguistic creativity, present significant challenges in cross-lingual translation, particularly between linguistically distant languages like Chinese and English, where the task is often considered a “mission impossible”. We introduce Pun2Pun, a novel benchmark for quantitatively evaluating pun translation between Chinese and English while preserving both linguistic mechanisms and humorous effects. We propose adapting the Constant-Variable Optimization (CVO) model for translation strategy, together with an accompanying Overlap (Ovl) metric for translation quality assessment. Our approach provides a robust quantitative evaluation framework for assessing models’ complex linguistic and cultural reasoning capabilities in pun translation. Through extensive experiments on both textual and visual puns, we demonstrate that our translation strategy model significantly improves performance, particularly for better-performing models. Our findings reveal the exciting potential and current limitations of LLMs in preserving sophisticated humor across linguistic and cultural boundaries.


#23 Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages

Authors: Daniil Gurgurov, Ivan Vykopal, Josef Van Genabith, Simon Ostermann

Low-resource languages (LRLs) face significant challenges in natural language processing (NLP) due to limited data. While current state-of-the-art large language models (LLMs) still struggle with LRLs, smaller multilingual models (mLMs) such as mBERT and XLM-R offer greater promise because their capacity is better matched to small training data sizes. This study systematically investigates parameter-efficient adapter-based methods for adapting mLMs to LRLs, evaluating three architectures: Sequential Bottleneck, Invertible Bottleneck, and Low-Rank Adaptation. Using unstructured text from GlotCC and structured knowledge from ConceptNet, we show that small adaptation datasets (e.g., up to 1 GB of free text or a few MB of knowledge graph data) yield gains in intrinsic (masked language modeling) and extrinsic tasks (topic classification, sentiment analysis, and named entity recognition). We find that Sequential Bottleneck adapters excel in language modeling, while Invertible Bottleneck adapters slightly outperform other methods on downstream tasks due to better embedding alignment and larger parameter counts. Adapter-based methods match or outperform full fine-tuning while using far fewer parameters, and smaller mLMs prove more effective for LRLs than massive LLMs like LLaMA-3, GPT-4, and DeepSeek-R1-based distilled models. While adaptation improves performance, pre-training data size remains the dominant factor, especially for languages with extensive pre-training coverage. The code for our experiments is available at https://github.com/d-gurgurov/Knowledge-Driven-Adaptation-LLMs.
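Of the three architectures, the Sequential Bottleneck is the simplest to picture; a hedged sketch with arbitrary dimensions follows (the surrounding transformer stays frozen):

```python
import torch, torch.nn as nn

class SequentialBottleneckAdapter(nn.Module):
    """Down-project, nonlinearity, up-project, residual: inserted after a
    transformer sublayer while the mLM's own weights stay frozen."""
    def __init__(self, d_model: int, d_bottleneck: int):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(torch.relu(self.down(h)))  # residual connection

adapter = SequentialBottleneckAdapter(d_model=768, d_bottleneck=48)
h = torch.randn(2, 128, 768)  # (batch, seq, hidden) from a frozen XLM-R layer
print(adapter(h).shape)       # torch.Size([2, 128, 768])
```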


#24 Exploring the Effect of Nominal Compound Structure in Scientific Texts on Reading Times of Experts and Novices

Authors: Isabell Landwehr, Marie-Pauline Krielke, Stefania Degaetano-Ortlieb

We explore how different types of nominal compound complexity in scientific writing, in particular different types of compound structure, affect the reading times of experts and novices. We consider both in-domain and out-of-domain reading and use PoTeC (Jakobi et al., 2024), a corpus containing eye-tracking data from German native speakers reading passages from scientific textbooks. Our results suggest that some compound types are associated with longer reading times and that experts may have an advantage not only while reading in-domain texts, but also while reading out-of-domain ones.


#25 Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks

Authors: Amir Saeidi, Shivanshu Verma, Md Nayem Uddin, Chitta Baral

This study evaluates Direct Preference Optimization (DPO) and its variants for aligning Large Language Models (LLMs) with human preferences, testing three configurations: (1) with Supervised Fine-Tuning (SFT), (2) without SFT, and (3) without SFT but using an instruction-tuned model. We further investigate how training set size influences model performance. Our evaluation spans 13 benchmarks—covering dialogue, reasoning, mathematical problem-solving, question answering, truthfulness, MT-Bench, Big Bench, and the Open LLM Leaderboard. We find that: (1) alignment methods often achieve near-optimal performance even with smaller subsets of training data; (2) although they offer limited improvements on complex reasoning tasks, they enhance mathematical problem-solving; and (3) using an instruction-tuned model improves truthfulness. These insights highlight the conditions under which alignment methods excel, as well as their limitations.
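For reference, the DPO objective itself fits in a few lines; the sketch below takes summed log-probabilities of chosen/rejected responses under the policy and a frozen reference model, with toy numbers:

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * [(chosen log-ratio) - (rejected log-ratio)])."""
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(logits).mean()

# Toy log-probs for a batch of four preference pairs.
pi_c = torch.tensor([-5.0, -6.0, -4.0, -7.0])
pi_r = torch.tensor([-9.0, -6.5, -8.0, -7.2])
ref_c = torch.tensor([-6.0, -6.0, -5.0, -7.0])
ref_r = torch.tensor([-8.0, -6.0, -7.0, -7.0])
print(dpo_loss(pi_c, pi_r, ref_c, ref_r))
```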
