NAACL.2025 - Student Research Workshop | Cool Papers

#1 Fine-Grained and Multi-Dimensional Metrics for Document-Level Machine Translation [PDF] [Copy] [Kimi] [REL]

Authors: Yirong Sun, Dawei Zhu, Yanjun Chen, Erjia Xiao, Xinghao Chen, Xiaoyu Shen

Large language models (LLMs) have excelled in various NLP tasks, including machine translation (MT), yet most studies focus on sentence-level translation. This work investigates the inherent capability of instruction-tuned LLMs for document-level translation (docMT). Unlike prior approaches that require specialized techniques, we evaluate LLMs by directly prompting them to translate entire documents in a single pass. Our results show that this method improves translation quality compared to translating sentences separately, even without document-level fine-tuning. However, this advantage is not reflected in BLEU scores, which often favor sentence-based translations. We propose using the LLM-as-a-judge paradigm for evaluation, where GPT-4 is used to assess document coherence, accuracy, and fluency in a more nuanced way than n-gram-based metrics. Overall, our work demonstrates that instruction-tuned LLMs can effectively leverage document context for translation. However, we caution against using BLEU scores for evaluating docMT, as they often provide misleading outcomes, failing to capture the quality of document-level translation.

Subject: NAACL.2025 - Student Research Workshop

#2 INSIGHTBUDDY-AI: Medication Extraction and Entity Linking using Pre-Trained Language Models and Ensemble Learning [PDF] [Copy] [Kimi] [REL]

Authors: Pablo Romero, Lifeng Han, Goran Nenadic

This paper presents our system, InsightBuddy-AI, designed for extracting medication mentions and their associated attributes, and for linking these entities to established clinical terminology resources, including SNOMED-CT, the British National Formulary (BNF), ICD, and the Dictionary of Medicines and Devices (dm+d).To perform medication extraction, we investigated various ensemble learning approaches, including stacked and voting ensembles (using first, average, and max voting methods) built upon eight pre-trained language models (PLMs). These models include general-domain PLMs—BERT, RoBERTa, and RoBERTa-Large—as well as domain-specific models such as BioBERT, BioClinicalBERT, BioMedRoBERTa, ClinicalBERT, and PubMedBERT.The system targets the extraction of drug-related attributes such as adverse drug effects (ADEs), dosage, duration, form, frequency, reason, route, and strength.Experiments conducted on the n2c2-2018 shared task dataset demonstrate that ensemble learning methods outperformed individually fine-tuned models, with notable improvements of 2.43% in Precision and 1.35% in F1-score.We have also developed cross-platform desktop applications for both entity recognition and entity linking, available for Windows and macOS.The InsightBuddy-AI application is freely accessible for research use at https://github.com/HECTA-UoM/InsightBuddy-AI.

Subject: NAACL.2025 - Student Research Workshop

#3 Linguistic Features in German BERT: The Role of Morphology, Syntax, and Semantics in Multi-Class Text Classification [PDF] [Copy] [Kimi] [REL]

Authors: Henrike Beyer, Diego Frassinelli

Most studies on the linguistic information encoded by BERT primarily focus on English. Our study examines a monolingual German BERT model using a semantic classification task on newspaper articles, analysing the linguistic features influencing classification decisions through SHAP values. We use the TüBa-D/Z corpus, a resource with gold-standard annotations for a set of linguistic features, including POS, inflectional morphology, phrasal, clausal, and dependency structures. Semantic features of nouns are evaluated via the GermaNet ontology using shared hypernyms. Our results indicate that the features identified in English also affect classification in German but suggests important language- and task-specific features as well.

Subject: NAACL.2025 - Student Research Workshop

#4 Thesis Proposal: Uncertainty in Knowledge Graph Embeddings [PDF] [Copy] [Kimi] [REL]

Author: Yuqicheng Zhu

Knowledge Graph Embedding (KGE) methods are widely used to map entities and relations from knowledge graphs (KGs) into continuous vector spaces, enabling non-classical reasoning over knowledge structures. Despite their effectiveness, the uncertainty of KGE methods has not been extensively studied in the literature. This gap poses significant challenges, particularly when deploying KGE models in high-stakes domains like medicine, where reliability and risk assessment are critical. This dissertation seeks to investigate various types of uncertainty in KGE methods and explore strategies to quantify, mitigate, and reason under uncertainty effectively. The outcomes of this research will contribute to enhancing the reliability of KGE methods, providing greater confidence in their use beyond benchmark datasets, and supporting their application in real-world, high-stakes domains.

Subject: NAACL.2025 - Student Research Workshop

#5 Detecting Sexism in Tweets: A Sentiment Analysis and Graph Neural Network Approach [PDF] [Copy] [Kimi] [REL]

Authors: Diana P. Madera-Espíndola, Zoe Caballero-Domínguez, Valeria J. Ramírez-Macías, Sabur Butt, Hector Ceballos

In the digital age, social media platforms like Twitter serve as an extensive repository of public discourse, including instances of sexism. It is important to identify such behavior since radicalized ideologies can lead to real-world violent acts. This project aims to develop a deep learning-based tool that leverages a combination of BERT (both English and multilingual versions) and GraphSAGE, a Graph Neural Network (GNN) model, alongside sentiment analysis and natural language processing (NLP) techniques. The tool is designed to analyze tweets for sexism detection and classify them into five categories.

Subject: NAACL.2025 - Student Research Workshop

#6 Towards Codec-LM Co-design for Neural Codec Language Models [PDF] [Copy] [Kimi] [REL]

Authors: Shih-Lun Wu, Aakash Lahoti, Arjun D Desai, Karan Goel, Chris Donahue, Albert Gu

Neural codec language models (or codec LMs) are emerging as a powerful framework for audio generation tasks like text-to-speech (TTS). These models leverage advancements in language modeling and residual vector quantization (RVQ)-based audio codecs, which compress audios into discrete codes for LMs to process. Despite the close interdependence of codecs and LMs in these systems, research on codecs and LMs has largely remained siloed. In this work, we propose three techniques for better codec-LM co-design: (i) a frame-wise codec encoder that improves both LM log-likelihood and end-to-end TTS metrics, (ii) LM codebook level dropout, a method to efficiently navigate a portion of the codec-LM design space by training a single LM, and (iii) increased codec frame duration, which we show can accelerate inference while maintaining end-to-end performance. Our experiments demonstrate that combining all three co-design techniques results in doubled inference speed, and improvements in intelligibility, audio quality, and speaker control in TTS relative to a siloed baseline.

Subject: NAACL.2025 - Student Research Workshop

#7 Low-resource Machine Translation for Code-switched Kazakh-Russian Language Pair [PDF] [Copy] [Kimi] [REL]

Authors: Maksim Borisov, Zhanibek Kozhirbayev, Valentin Malykh

Machine translation for low-resource language pairs is a challenging task. This task could become extremely difficult once a speaker uses code switching. We present the first code-switching Kazakh-Russian parallel corpus.Additionally, we propose a method to build a machine translation model for code-switched Kazakh-Russian language pair with no labeled data. Our method is basing on generation of synthetic data. This method results in a model beating an existing commercial system by human evaluation.

Subject: NAACL.2025 - Student Research Workshop

#8 Generative Product Recommendations for Implicit Superlative Queries [PDF] [Copy] [Kimi] [REL]

Authors: Kaustubh Dhole, Nikhita Vedula, Saar Kuzi, Giuseppe Castellucci, Eugene Agichtein, Shervin Malmasi

In recommender systems, users often seek the best products through indirect, vague, or under-specified queries such as “best shoes for trail running.” These queries, referred to as implicit superlative queries, pose a challenge for standard retrieval and ranking systems due to their lack of explicit attribute mentions and the need for identifying and reasoning over complex attributes. We investigate how Large Language Models (LLMs) can generate implicit attributes for ranking and reason over them to improve product recommendations for such queries. As a first step, we propose a novel four-point schema, called SUPERB, for annotating the best product candidates for superlative queries, paired with LLM-based product annotations. We then empirically evaluate several existing retrieval and ranking approaches on our newly created dataset, providing insights and discussing how to integrate these findings into real-world e-commerce production systems.

Subject: NAACL.2025 - Student Research Workshop

#9 ConQuer: A Framework for Concept-Based Quiz Generation [PDF] [Copy] [Kimi] [REL]

Authors: Yicheng Fu, Zikui Wang, Liuxin Yang, Meiqing Huo, Zhongdongming Dai

Quizzes play a crucial role in education by reinforcing students’ understanding of key concepts and encouraging self-directed exploration. However, compiling high-quality quizzes can be challenging and require deep expertise and insight into specific subject matter. Although LLMs have greatly enhanced the efficiency of quiz generation, concerns remain regarding the quality of these AI-generated quizzes and their educational impact on students. To address these issues, we introduce ConQuer, a concept-based quiz generation framework that leverages external knowledge sources. We employ comprehensive evaluation dimensions to assess the quality of the generated quizzes, using LLMs as judges. Our experiment results demonstrate a 4.8% improvement in evaluation scores and a 77.52% win rate in pairwise comparisons against baseline quiz sets. Ablation studies further underscore the effectiveness of each component in our framework.

Subject: NAACL.2025 - Student Research Workshop

#10 What is it? Towards a Generalizable Native American Language Identification System [PDF] [Copy] [Kimi] [REL]

Authors: Ivory Yang, Weicheng Ma, Carlos Guerrero Alvarez, William Dinauer, Soroush Vosoughi

This paper presents a research thesis proposal to develop a generalizable Native American language identification system. Despite their cultural and historical significance, Native American languages remain entirely unsupported by major commercial language identification systems. This omission not only underscores the systemic neglect of endangered languages in technological development, but also highlights the urgent need for dedicated, community-driven solutions. We propose a two-pronged approach: (1) systematically curating linguistic resources across all Native American languages for robust training, and (2) tailored data augmentation to generate synthetic yet linguistically coherent training samples. As proof of concept, we extend an existing rudimentary Athabaskan language classifier by integrating Plains Apache, an extinct Southern Athabaskan language, as an additional language class. We also adapt a data generation framework for low-resource languages to create synthetic Plains Apache data, highlighting the potential of data augmentation. This proposal advocates for a community-driven, technological approach to supporting Native American languages.

Subject: NAACL.2025 - Student Research Workshop

#11 Med-CoDE: Medical Critique based Disagreement Evaluation Framework [PDF] [Copy] [Kimi] [REL]

Authors: Mohit Gupta, Akiko Aizawa, Rajiv Ratn Shah

The emergence of large language models (LLMs) has significantly influenced numerous fields, including healthcare, by enhancing the capabilities of automated systems to process and generate human-like text. However, despite their advancements, the reliability and accuracy of LLMs in medical contexts remain critical concerns. Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance, leading to potential risks in clinical settings. In this work, we propose Med-CoDE, a specifically designed evaluation framework for medical LLMs to address these challenges. The framework leverages a critique-based approach to quantitatively measure the degree of disagreement between model-generated responses and established medical ground truths. This framework captures both accuracy and reliability in medical settings. The proposed evaluation framework aims to fill the existing gap in LLM assessment by offering a systematic method to evaluate the quality and trustworthiness of medical LLMs. Through extensive experiments and case studies, we illustrate the practicality of our framework in providing a comprehensive and reliable evaluation of medical LLMs.

Subject: NAACL.2025 - Student Research Workshop

#12 Sentimatic: Sentiment-guided Automatic Generation of Preference Datasets for Customer Support Dialogue System [PDF] [Copy] [Kimi] [REL]

Authors: Suhyun Lee, ChangHeon Han

Supervised Fine-tuning (SFT) and preference optimization (PO) are key methods for enhancing language models and aligning them with human preferences. However, scaling preference datasets for PO training is challenging, leading AI customer support systems to rely on SFT. To address this, we propose the Sentiment-guided Automatic Generation of Preference Datasets (Sentimatic) methodology to automatically generate customer preference datasets without human intervention using a publicly available dataset constructed for SFT. Our approach classifies responses by sentiment, fine-tunes models on them, and applies advanced sampling and evaluation techniques to ensure diversity and quality. Ultimately, we generated 1,174 customer preference datasets based on 357 test datasets, and through experiments, we confirmed that the AI customer support system trained on these datasets is capable of carefully considering customer emotions and generating professional and appropriate responses.

Subject: NAACL.2025 - Student Research Workshop

#13 Privacy-Preserving Federated Learning for Hate Speech Detection [PDF¹] [Copy] [Kimi] [REL]

Authors: Ivo de Souza Bueno Júnior, Haotian Ye, Axel Wisiorek, Hinrich Schütze

This paper presents a federated learning system with differential privacy for hate speech detection, tailored to low-resource languages. By fine-tuning pre-trained language models, ALBERT emerged as the most effective option for balancing performance and privacy. Experiments demonstrated that federated learning with differential privacy performs adequately in low-resource settings, though datasets with fewer than 20 sentences per client struggled due to excessive noise. Balanced datasets and augmenting hateful data with non-hateful examples proved critical for improving model utility. These findings offer a scalable and privacy-conscious framework for integrating hate speech detection into social media platforms and browsers, safeguarding user privacy while addressing online harm.

Subject: NAACL.2025 - Student Research Workshop

#14 From Annotation to Adaptation: Metrics, Synthetic Data, and Aspect Extraction for Aspect-Based Sentiment Analysis with Large Language Models [PDF] [Copy] [Kimi] [REL]

Authors: Nikita Neveditsin, Pawan Lingras, Vijay Kumar Mago

This study examines the performance of Large Language Models (LLMs) in Aspect-Based Sentiment Analysis (ABSA), with a focus on implicit aspect extraction in a novel domain. Using a synthetic sports feedback dataset, we evaluate open-weight LLMs’ ability to extract aspect-polarity pairs and propose a metric to facilitate the evaluation of aspect extraction with generative models. Our findings highlight both the potential and limitations of LLMs in the ABSA task.

Subject: NAACL.2025 - Student Research Workshop

#15 Developing Japanese CLIP Models Leveraging an Open-weight LLM for Large-scale Dataset Translation [PDF] [Copy] [Kimi] [REL]

Authors: Issa Sugiura, Shuhei Kurita, Yusuke Oda, Daisuke Kawahara, Naoaki Okazaki

CLIP is a foundational model that bridges images and text, widely adopted as a key component in numerous vision-language models.However, the lack of large-scale open Japanese image-text pairs poses a significant barrier to the development of Japanese vision-language models.In this study, we constructed a Japanese image-text pair dataset with 1.5 billion examples using machine translation with open-weight LLMs and pre-trained Japanese CLIP models on the dataset.The performance of the pre-trained models was evaluated across seven benchmark datasets, achieving competitive average scores compared to models of similar size without the need for extensive data curation. However, the results also revealed relatively low performance on tasks specific to Japanese culture, highlighting the limitations of translation-based approaches in capturing cultural nuances. Our dataset, models, and code are publicly available.

Subject: NAACL.2025 - Student Research Workshop

#16 Self-Vocabularizing Training for Neural Machine Translation [PDF] [Copy] [Kimi] [REL]

Authors: Pin-Jie Lin, Ernie Chang, Yangyang Shi, Vikas Chandra

Past vocabulary learning techniques identify relevant vocabulary before training, relying on statistical and entropy-based assumptions that largely neglect the role of model training.Empirically, we observe that trained translation models are induced to use a byte-pair encoding (BPE) vocabulary subset distinct from the original BPE vocabulary, leading to performance improvements when retrained with the induced vocabulary.In this paper, we analyze this discrepancy in neural machine translation by examining vocabulary and entropy shifts during self-training—where each iteration generates a labeled dataset by pairing source sentences with the model’s predictions to define a new vocabulary.Building on these insights, we propose *self-vocabularizing training*, an iterative method that self-selects a smaller, more optimal vocabulary, yielding up to a 1.49 BLEU improvement.Moreover, we find that deeper model architectures lead to both an increase in unique token usage and a 6–8% reduction in vocabulary size.

Subject: NAACL.2025 - Student Research Workshop

#17 CCT-Code: Cross-Consistency Training for Multilingual Clone Detection and Code Search [PDF] [Copy] [Kimi] [REL]

Authors: Nikita Sorokin, Tikhonov Anton, Dmitry Abulkhanov, Ivan Sedykh, Irina Piontkovskaya, Valentin Malykh

We consider the well-known and important tasks of clone detection and information retrieval for source code. The most standard setup is to search clones inside the same language code snippets. But it is also useful to find code snippets with identical behaviour in different programming languages. Nevertheless multi- and cross-lingual clone detection has been little studied in literature. We present a novel training procedure, cross-consistency training (CCT) leveraging cross-lingual similarity, that we apply to train language models on source code in various programming languages. We show that this training is effective both for encoder- and decoder-based models.The trained encoder-based CCT-LM model%and fine-tuned with CCT,achieves a new state of the art on POJ-104 (monolingual C++ clone detection benchmark) with 96.73% MAP and AdvTest (monolingual Python code search benchmark) with 47.18% MRR. The decoder-based CCT-LM model shows comparable performance in these tasks. In addition, we formulate the multi- and cross-lingual clone detection problem and present XCD, a new benchmark dataset produced from CodeForces submissions.

Subject: NAACL.2025 - Student Research Workshop

#18 Text Compression for Efficient Language Generation [PDF] [Copy] [Kimi] [REL]

Authors: David Gu, Peter Belcak, Roger Wattenhofer

We challenge the prevailing assumption that LLMs must rely fully on sub-word tokens for high-quality text generation. To this end, we propose the “Generative Pretrained Thoughtformer” (GPTHF), a hierarchical transformer language model capable of text generation by compressing text into sentence embeddings and employing a sentence attention mechanism. GPTHF retains GPT’s architecture, modifying only token interactions via dynamic sparse attention masks. Our experiments show that GPTHF achieves an up to an order of magnitude improvement in FLOPs efficiency and a threefold increase in runtime speed compared to equally-sized GPT models in the low-size regime. This is achieved through a unique generation method that caches and reuses sentence embeddings, allowing significant portions of the input to bypass large parts of the network.

Subject: NAACL.2025 - Student Research Workshop

#19 Multilingual Native Language Identification with Large Language Models [PDF] [Copy] [Kimi] [REL]

Authors: Dhiman Goswami, Marcos Zampieri, Kai North, Shervin Malmasi, Antonios Anastasopoulos

Native Language Identification (NLI) is the task of automatically identifying the native language (L1) of individuals based on their second language (L2) production. The introduction of Large Language Models (LLMs) with billions of parameters has renewed interest in text-based NLI, with new studies exploring LLM-based approaches to NLI on English L2. The capabilities of state-of-the-art LLMs on non-English NLI corpora, however, have not yet been fully evaluated. To fill this important gap, we present the first evaluation of LLMs for multilingual NLI. We evaluated the performance of several LLMs compared to traditional statistical machine learning models and language-specific BERT-based models on NLI corpora in English, Italian, Norwegian, and Portuguese. Our results show that fine-tuned GPT-4 models achieve state-of-the-art NLI performance.

Subject: NAACL.2025 - Student Research Workshop

#20 Generating Synthetic Free-text Medical Records with Low Re-identification Risk using Masked Language Modeling [PDF] [Copy] [Kimi] [REL]

Authors: Samuel Belkadi, Libo Ren, Nicolo Micheletti, Lifeng Han, Goran Nenadic

The abundance of medical records holds great promise for enhancing healthcare and advancing biomedical research. However, due to privacy constraints, access to such data is typically limited to internal use.Recent studies have attempted to overcome this challenge by generating synthetic data through Causal Language Modelling. Yet, this approach often fails to ensure patient anonymity and offers limited control over output diversity—unless additional computational cost is introduced.In response, we propose a method for generating synthetic free-text medical records based on Masked Language Modelling. Our approach retains key medical details while introducing variability in the generated texts and reducing the risk of patient re-identification. With a relatively lightweight architecture of approximately 120 million parameters, the system ensures low inference costs.Experimental results show that our method produces high-quality synthetic data, achieving a HIPAA-compliant PHI recall of 96% and a re-identification risk of only 3.5%. Furthermore, downstream evaluations reveal that models trained on the synthetic data perform comparably to those trained on real-world data. Our trained models are publicly available on Github as SynDeidMLM (at https://github.com/SamySam0/SynDeidMLM) (meaning synthetic and de-identified data generation using MLM).

Subject: NAACL.2025 - Student Research Workshop

#21 How many words does it take to understand a low-resource language? [PDF] [Copy] [Kimi] [REL]

Authors: Emily Chang, Nada Basit

When developing language technology, researchers have routinely turned to transfer learning to resolve the data scarcity conundrum presented in low-resource languages. As far as we know, this study is the first to evaluate the amount of documentation needed for transfer learning, specifically the smallest vocabulary size needed to create a sentence embedding space. In adopting widely spoken languages as a proxy for low-resource languages, our experiments show that the relationship between a sentence embedding’s vocabulary size and performance is logarithmic with performance leveling at a vocabulary size of 25,000. It should be noted that this relationship cannot be replicated across all languages and this level of documentation does not exist for many low-resource languages. We do observe, however, that performance accelerates at a vocabulary size of ≤ 1000, a quantity that is present in most low-resource language documentation. These results can aid researchers in understanding whether a low-resource language has enough documentation necessary to support the creation of a sentence embedding and language model.

Subject: NAACL.2025 - Student Research Workshop

#22 Linear Relational Decoding of Morphology in Language Models [PDF] [Copy] [Kimi] [REL]

Authors: Eric Xia, Jugal Kalita

A two-part affine approximation has been found to be a good approximation for transformer computations over certain subject-object relations. Adapting the Bigger Analogy Test Set, we show that the linear transformation W s , where s is a middle-layer representation of a subject token and W is derived from model derivatives, can accurately reproduce final object states for many relations. This linear technique achieves 90% faithfulness on morphological relations, with similar findings across languages and models. Our results suggest that some conceptual relationships in language models, such as morphology, are readily interpretable from latent space and are sparsely encoded by cross-layer linear transformations.

Subject: NAACL.2025 - Student Research Workshop

#23 SPY: Enhancing Privacy with Synthetic PII Detection Dataset [PDF] [Copy] [Kimi] [REL]

Authors: Maksim Savkin, Timur Ionov, Vasily Konovalov

We introduce **SPY Dataset**: a novel synthetic dataset for the task of **Personal Identifiable Information (PII) detection**, underscoring the significance of protecting PII in modern data processing. Our research innovates by leveraging Large Language Models (LLMs) to generate a dataset that emulates real-world PII scenarios. Through evaluation, we validate the dataset’s quality, providing a benchmark for PII detection. Comparative analyses reveal that while PII and Named Entity Recognition (NER) share similarities, **dedicated NER models exhibit limitations** when applied to PII-specific contexts. This work contributes to the field by making the generation methodology and the generated dataset publicly, thereby enabling further research and development in this field.

Subject: NAACL.2025 - Student Research Workshop

#24 Tighter Clusters, Safer Code? Improving Vulnerability Detection with Enhanced Contrastive Loss [PDF] [Copy] [Kimi] [REL]

Authors: Pranav Kapparad, Biju R Mohan

Distinguishing vulnerable code from non-vulnerable code is challenging due to high inter-class similarity. Supervised contrastive learning (SCL) improves embedding separation but struggles with intra-class clustering, especially when variations within the same class are subtle. We propose Cluster-Enhanced Supervised Contrastive Loss (CESCL), an extension of SCL with a distance-based regularization term that tightens intra-class clustering while maintaining inter-class separation. Evaluating on CodeBERT and GraphCodeBERT with Binary Cross Entropy (BCE), BCE + SCL, and BCE + CESCL, our method improves F1 score by 1.76% on CodeBERT and 4.1% on GraphCodeBERT, demonstrating its effectiveness in code vulnerability detection and broader applicability to high-similarity classification tasks.

Subject: NAACL.2025 - Student Research Workshop

#25 Text Extraction and Script Completion in Images of Arabic Script-Based Calligraphy: A Thesis Proposal [PDF] [Copy] [Kimi] [REL]

Authors: Dilara Zeynep Gürer, Ümit Atlamaz, Şaziye Betül Özateş

Arabic calligraphy carries rich historical information and meaning. However, the complexity of its artistic elements and the absence of a consistent baseline make text extraction from such works highly challenging. In this paper, we provide an in-depth analysis of the unique obstacles in processing and interpreting these images, including the variability in calligraphic styles, the influence of artistic distortions, and the challenges posed by missing or damaged text elements. We explore potential solutions by leveraging state-of-the-art architectures and deep learning models, including visual language models, to improve text extraction and script completion.

Subject: NAACL.2025 - Student Research Workshop