EMNLP.2025 - System Demonstrations

Total: 77

#1 Synthetic Data for Evaluation: Supporting LLM-as-a-Judge Workflows with EvalAssist

Authors: Martín Santillán Cooper, Zahra Ashktorab, Hyo Jin Do, Erik Miehling, Werner Geyer, Jasmina Gajcin, Elizabeth M. Daly, Qian Pan, Michael Desmond

We present a synthetic data generation tool integrated into EvalAssist. EvalAssist is a web-based application designed to assist human-centered evaluation of language model outputs by allowing users to refine LLM-as-a-Judge evaluation criteria. The synthetic data generation tool in EvalAssist is tailored for evaluation contexts and informed by findings from user studies with AI practitioners. Participants identified key pain points in current workflows, including circularity risks (where models are judged by criteria derived from their own outputs), compounded bias (amplification of biases across multiple stages of a pipeline), and poor support for edge cases, and they expressed a strong preference for real-world grounding and fine-grained control. In response, our tool supports flexible prompting, RAG-based grounding, persona diversity, and iterative generation workflows. We also incorporate features for quality assurance and edge case discovery.
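
A minimal sketch of what persona-conditioned generation of judge test cases can look like, assuming a generic `generate` LLM callable and illustrative prompt wording; this is not EvalAssist's actual API:

```python
from typing import Callable

# Hypothetical prompt template; the real tool's prompting is more elaborate
# (RAG-based grounding, quality assurance, edge case discovery).
PROMPT = (
    "You are {persona}. Write a model response to the task below that a "
    "judge should rate '{label}' for the criterion '{criterion}'.\n"
    "Task: {task}\nResponse:"
)

def synthesize_cases(generate: Callable[[str], str], task: str, criterion: str,
                     personas: list[str], labels: list[str]) -> list[dict]:
    cases = []
    for persona in personas:      # persona diversity
        for label in labels:      # cover both passing and failing cases
            text = generate(PROMPT.format(persona=persona, label=label,
                                          criterion=criterion, task=task))
            cases.append({"persona": persona, "target_label": label,
                          "text": text})
    return cases
```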


#2 ROBOTO2: An Interactive System and Dataset for LLM-assisted Clinical Trial Risk of Bias Assessment

Authors: Anthony Hevia, Sanjana Chintalapati, Veronica Ka Wai Lai, Nguyen Thanh Tam, Wai-Tat Wong, Terry P Klassen, Lucy Lu Wang

We present ROBoto2, an open-source, web-based platform for large language model (LLM)-assisted risk of bias (ROB) assessment of clinical trials. ROBoto2 streamlines the traditionally labor-intensive ROB v2 (ROB2) annotation process via an interactive interface that combines PDF parsing, retrieval-augmented LLM prompting, and human-in-the-loop review. Users can upload clinical trial reports, receive preliminary answers and supporting evidence for ROB2 signaling questions, and provide real-time feedback or corrections to system suggestions. ROBoto2 is publicly available at https://roboto2.vercel.app/, with code and data released to foster reproducibility and adoption. We construct and release a dataset of 521 pediatric clinical trial reports (8,954 signaling questions with 1,202 evidence passages), annotated using both manual and LLM-assisted methods, which serves as a benchmark and enables future research. Using this dataset, we benchmark ROB2 performance for four LLMs and provide an analysis of current model capabilities and ongoing challenges in automating this critical aspect of systematic review.


#3 SpiritRAG: A Q&A System for Religion and Spirituality in the United Nations Archive

Authors: Yingqiang Gao, Fabian Winiger, Patrick Montjourides, Anastassia Shaitarova, Nianlong Gu, Simon Peng-Keller, Gerold Schneider

Religion and spirituality (R/S) are complex and highly domain-dependent concepts which have long confounded researchers and policymakers. Due to their context-specificity, R/S are difficult to operationalize in conventional archival search strategies, particularly when datasets are very large, poorly accessible, and marked by information noise. As a result, considerable time investments and specialist knowledge are often needed to extract actionable insights related to R/S from general archival sources, increasing reliance on published literature and manual desk reviews. To address this challenge, we present SpiritRAG, an interactive Question Answering (Q&A) system based on Retrieval-Augmented Generation (RAG). Built using 7,500 United Nations (UN) resolution documents related to R/S in the domains of health and education, SpiritRAG allows researchers and policymakers to conduct complex, context-sensitive database searches of very large datasets using an easily accessible, chat-based web interface. SpiritRAG is lightweight to deploy and leverages both UN documents and user-provided documents as source material. A pilot test and evaluation with domain experts on 100 manually composed questions demonstrate the practical value and usefulness of SpiritRAG.


#4 LingConv: An Interactive Toolkit for Controlled Paraphrase Generation with Linguistic Attribute Control

Authors: Mohamed Elgaar, Hadi Amiri

We introduce LINGCONV, an interactive toolkit for paraphrase generation that enables fine-grained control over 40 specific lexical, syntactic, and discourse linguistic attributes. Users can directly manipulate target attributes using sliders, while unspecified attributes are imputed automatically, simplifying the control process. Our adaptive Quality Control mechanism employs iterative refinement guided by line search to precisely steer generation toward target attributes while preserving semantic meaning, overcoming the limitations of fixed control strengths. Applications of LINGCONV include enhancing text accessibility by adjusting complexity for different literacy levels, enabling personalized communication through style adaptation, providing a valuable tool for linguistics and NLP research, and facilitating second language learning by tailoring text complexity. The system is available at https://mohdelgaar-lingconv.hf.space, with a demo video at https://youtu.be/wRBJEJ6EALQ.
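
As a rough illustration of line-search-guided refinement, the sketch below halves the control strength whenever a candidate rewrite fails to move the text closer to the target attribute values; `rewrite` and `measure` are hypothetical stand-ins for a generator and an attribute estimator, not LingConv's real interface.

```python
# Sketch only: steer a text toward target attribute values by iterative
# rewriting, shrinking the step size (line search) when a rewrite overshoots.
def steer(text: str, target: dict, rewrite, measure,
          max_iters: int = 5, tol: float = 0.05) -> str:
    strength = 1.0
    for _ in range(max_iters):
        current = measure(text)
        error = {k: target[k] - current[k] for k in target}
        worst = max(abs(e) for e in error.values())
        if worst < tol:
            break  # all attributes within tolerance
        candidate = rewrite(text, {k: strength * e for k, e in error.items()})
        cand = measure(candidate)
        if max(abs(target[k] - cand[k]) for k in target) < worst:
            text = candidate      # accept: closer to the targets
        else:
            strength *= 0.5       # line search: back off the control strength
    return text
```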


#5 AgentMaster: A Multi-Agent Conversational Framework Using A2A and MCP Protocols for Multimodal Information Retrieval and Analysis

Authors: Callie C. Liao, Duoduo Liao, Sai Surya Gadiraju

The rise of Multi-Agent Systems (MAS) in Artificial Intelligence (AI), especially those integrated with Large Language Models (LLMs), has greatly facilitated the resolution of complex tasks. However, current systems still face challenges in inter-agent communication, coordination, and interaction with heterogeneous tools and resources. Most recently, the Model Context Protocol (MCP) by Anthropic and the Agent-to-Agent (A2A) communication protocol by Google have been introduced, and to the best of our knowledge, very few applications employ both protocols within a single MAS framework. We present a pilot study of AgentMaster, a novel modular multi-protocol MAS framework with self-implemented A2A and MCP, enabling dynamic coordination, flexible communication, and rapid development with faster iteration. Through a unified conversational interface, the system supports natural language interaction without prior technical expertise and responds to multimodal queries for tasks including information retrieval, question answering, and image analysis. The experiments are validated through both human evaluation and quantitative metrics, including BERTScore F1 (96.3%) and LLM-as-a-Judge G-Eval (87.1%). These results demonstrate robust automated inter-agent coordination, query decomposition, task allocation, dynamic routing, and domain-specific relevant responses. Overall, our proposed framework contributes to the potential capabilities of domain-specific, cooperative, and scalable conversational AI powered by MAS.


#6 The iRead4Skills Intelligent Complexity Analyzer

Authors: Wafa Aissa, Raquel Amaro, David Antunes, Thibault Bañeras-Roux, Jorge Baptista, Alejandro Catala, Luís Correia, Thomas François, Marcos Garcia, Mario Izquierdo-Álvarez, Nuno Mamede, Vasco Martins, Miguel Neves, Eugénio Ribeiro, Sandra Rodriguez Rey, Elodie Vanzeveren

We present the iRead4Skills Intelligent Complexity Analyzer, an open-access platform specifically designed to assist educators and content developers in addressing the needs of low-literacy adults by analyzing and diagnosing text complexity. This multilingual system integrates a range of Natural Language Processing (NLP) components to assess input texts along multiple levels of granularity and linguistic dimensions in Portuguese, Spanish, and French. It assigns four tailored difficulty levels using state-of-the-art models, and introduces four diagnostic yardsticks—textual structure, lexicon, syntax, and semantics—offering users actionable feedback on specific dimensions of textual complexity. Each component of the system is supported by experiments comparing alternative models on manually annotated data.


#7 AIPOM: Agent-aware Interactive Planning for Multi-Agent Systems

Authors: Hannah Kim, Kushan Mitra, Chen Shen, Dan Zhang, Estevam Hruschka

Large language models (LLMs) are being increasingly used for planning in orchestrated multi-agent systems. However, existing LLM-based approaches often fall short of human expectations and, critically, lack effective mechanisms for users to inspect, understand, and control their behaviors. These limitations call for enhanced transparency, controllability, and human oversight. To address this, we introduce AIPOM, a system supporting human-in-the-loop planning through conversational and graph-based interfaces. AIPOM enables users to transparently inspect, refine, and collaboratively guide LLM-generated plans, significantly enhancing user control and trust in multi-agent workflows. Our code and demo video are available at https://github.com/megagonlabs/aipom.


#8 LAD: LoRA-Adapted Diffusion

Authors: Ruurd Jan Anthonius Kuiper, Lars de Groot, Bram van Es, Maarten van Smeden, Ayoub Bagheri

Autoregressive models dominate text generation but suffer from left-to-right decoding constraints that limit efficiency and bidirectional reasoning. Diffusion-based models offer a flexible alternative but face challenges in adapting to discrete text efficiently. We propose LAD (LoRA-Adapted Diffusion), a framework for non-autoregressive generation that adapts LLaMA models for iterative, bidirectional sequence refinement using LoRA adapters. LAD employs a structural denoising objective combining masking with text perturbations (swaps, duplications, and span shifts), enabling full sequence editing during generation. We aim to demonstrate that LAD can be a viable and efficient alternative to training diffusion models from scratch, providing both validation results and two interactive demos, available online at https://ruurdkuiper.github.io/tini-lad/ and https://huggingface.co/spaces/Ruurd/tini-lad. Inference and training code: https://github.com/RuurdKuiper/lad-code
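
The structural corruptions named above are easy to picture on a token list. Below is a rough, self-contained sketch of such a noising function; the probabilities are illustrative and are not LAD's actual hyperparameters.

```python
import random

def corrupt(tokens: list[str], mask: str = "<mask>", p_mask: float = 0.15,
            p_dup: float = 0.05, p_swap: float = 0.05,
            p_shift: float = 0.1) -> list[str]:
    out = []
    for tok in tokens:
        r = random.random()
        if r < p_mask:
            out.append(mask)               # masking
        elif r < p_mask + p_dup:
            out.extend([tok, tok])         # duplication
        else:
            out.append(tok)
    for i in range(len(out) - 1):          # adjacent swaps
        if random.random() < p_swap:
            out[i], out[i + 1] = out[i + 1], out[i]
    if len(out) > 2 and random.random() < p_shift:
        i, j = sorted(random.sample(range(len(out)), 2))
        span, rest = out[i:j], out[:i] + out[j:]
        k = random.randrange(len(rest) + 1)
        out = rest[:k] + span + rest[k:]   # span shift: move a slice elsewhere
    return out
```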


#9 Automated Evidence Extraction and Scoring for Corporate Climate Policy Engagement: A Multilingual RAG Approach

Authors: Imene Kolli, Saeid Vaghefi, Chiara Colesanti Senni, Shantam Raj, Markus Leippold

InfluenceMap’s LobbyMap Platform monitors the climate policy engagement of over 500 companies and 250 industry associations, assessing each entity’s support or opposition to science-based policy pathways for achieving the Paris Agreement’s goal of limiting global warming to 1.5°C. Although InfluenceMap has made progress with automating key elements of the analytical workflow, a significant portion of the assessment remains manual, making it time- and labor-intensive and susceptible to human error. We propose an AI-assisted framework to accelerate the monitoring of corporate climate policy engagement by leveraging Retrieval-Augmented Generation to automate the most time-intensive step: extracting relevant evidence from large-scale textual data. Our evaluation shows that a combination of layout-aware parsing, the Nomic embedding model, and few-shot prompting strategies yields the best performance in extracting and classifying evidence from multilingual corporate documents. We conclude that while the automated RAG system effectively accelerates evidence extraction, the nuanced nature of the analysis necessitates a human-in-the-loop approach in which the technology augments, rather than replaces, expert judgment to ensure accuracy.


#10 GLiNER2: Schema-Driven Multi-Task Learning for Structured Information Extraction

Authors: Urchade Zaratiana, Gil Pasternak, Oliver Boyd, George Hurn-Maloney, Ash Lewis

Information extraction (IE) is fundamental to numerous NLP applications, yet existing solutions often require specialized models for different tasks or rely on computationally expensive large language models. We present GLiNER2, a unified framework that enhances the original GLiNER architecture to support named entity recognition, text classification, and hierarchical structured data extraction within a single efficient model. Built on a fine-tuned encoder architecture, GLiNER2 maintains CPU efficiency and compact size while introducing multi-task composition through an intuitive schema-based interface. Our experiments demonstrate competitive performance across diverse IE tasks with substantial improvements in deployment accessibility compared to LLM-based alternatives. We release GLiNER2 as an open-source library available through pip, complete with pre-trained models and comprehensive documentation.
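
To convey the schema-based interface idea, here is a hypothetical sketch of composing entity, classification, and structured-extraction tasks into one schema for a single call; the schema keys and the `predict` method are assumptions, so consult the GLiNER2 documentation for the actual API.

```python
# Illustrative schema composing three IE tasks for one forward pass.
schema = {
    "entities": ["person", "organization", "drug"],
    "classification": {"sentiment": ["positive", "negative", "neutral"]},
    "structure": {"clinical_trial": ["condition", "sample_size", "outcome"]},
}

def extract_all(model, text: str, schema: dict) -> dict:
    """Run every task in the schema in a single efficient call.

    `model.predict` is an assumed method name, not the library's real API.
    """
    return model.predict(text, schema)
```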


#11 SciClaims: An End-to-End Generative System for Biomedical Claim Analysis

Authors: Raúl Ortega, Jose Manuel Gomez-Perez

We present SciClaims, an interactive web-based system for end-to-end scientific claim analysis in the biomedical domain. Designed for high-stakes use cases such as systematic literature reviews and patent validation, SciClaims extracts claims from text, retrieves relevant evidence from PubMed, and verifies their veracity. The system features a user-friendly interface where users can input scientific text and view extracted claims, predictions, supporting or refuting evidence, and justifications in natural language. Unlike prior approaches, SciClaims seamlessly integrates the entire scientific claim analysis process using a single large language model, without requiring additional fine-tuning. SciClaims is optimized to run efficiently on a single GPU and is publicly available for live interaction.


#12 AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning

Authors: Zhong Zhang, Yaxi Lu, Yikun Fu, Yupeng Huo, Shenzhi Yang, Yesai Wu, Han Si, Xin Cong, Haotian Chen, Yankai Lin, Xie Xie, Wei Zhou, Wang Xu, Zhou Su, Zhongwu Zhai, Xiaoming Liu, Meiyudong, Jianming Xu, Hongyan Tian, Chongyi Wang, Chi Chen, Yuan Yao, Zhiyuan Liu, Maosong Sun

Large language model agents have enabled GUI-based automation, particularly for mobile devices. However, deployment remains limited by noisy data, poor generalization, and lack of support for non-English GUIs. In this work, we present AgentCPM-GUI, an 8B-parameter GUI agent built for robust and efficient on-device GUI interaction. Our training pipeline includes grounding-aware pre-training to enhance perception, supervised fine-tuning on high-quality Chinese and English trajectories to imitate human-like actions, and reinforcement fine-tuning with GRPO to improve reasoning capability. AgentCPM-GUI achieves promising performance on five public benchmarks and our proposed Chinese benchmark CAGUI. To facilitate reproducibility and further research, we publicly release all code, the model checkpoint, and evaluation data at: https://github.com/OpenBMB/AgentCPM-GUI


#13 Marcel: A Lightweight and Open-Source Conversational Agent for University Student Support

Authors: Jan Trienes, Anastasiia Derzhanskaia, Roland Schwarzkopf, Markus Mühling, Jörg Schlötterer, Christin Seifert

We present Marcel, a lightweight and open-source conversational agent designed to support prospective students with admission-related inquiries. The system aims to provide fast and personalized responses while reducing the workload of university staff. We employ retrieval-augmented generation to ground answers in university resources and to provide users with verifiable, contextually relevant information. We introduce a Frequently Asked Question (FAQ) retriever that maps user questions to knowledge-base entries, which allows administrators to steer retrieval and improves over standard dense/hybrid retrieval strategies. The system is engineered for easy deployment in resource-constrained academic settings. We detail the system architecture, provide a technical evaluation of its components, and report insights from a real-world deployment.
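
A minimal sketch of the FAQ-retriever idea: embed the incoming question, match it against curated FAQ entries, and fall back to standard retrieval below a similarity threshold. `embed` stands in for any sentence-embedding model; this is not Marcel's actual component interface.

```python
import numpy as np

def route(question: str, faq: list[dict], embed, threshold: float = 0.8):
    """faq entries look like {"question": ..., "answer": ...}."""
    q = embed(question)
    q = q / np.linalg.norm(q)
    best, best_sim = None, -1.0
    for entry in faq:
        v = embed(entry["question"])
        sim = float(q @ (v / np.linalg.norm(v)))   # cosine similarity
        if sim > best_sim:
            best, best_sim = entry, sim
    if best_sim >= threshold:
        return best["answer"]   # curated answer steered by administrators
    return None                 # fall back to dense/hybrid RAG retrieval
```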


#14 Alpha-GPT: Human-AI Interactive Alpha Mining for Quantitative Investment

Authors: Saizhuo Wang, Hang Yuan, Leon Zhou, Lionel Ni, Heung-Yeung Shum, Jian Guo

One of the most important tasks in quantitative investment research is mining new alphas (effective trading signals or factors). Traditional alpha mining methods, whether hand-crafted factor synthesis or algorithmic factor mining (e.g., search with genetic programming), have inherent limitations, especially in implementing the ideas of quant researchers. In this work, we propose a new alpha mining paradigm based on human-AI interaction, along with a novel prompt engineering algorithmic framework that implements this paradigm by leveraging the power of large language models. Moreover, we develop Alpha-GPT, a new interactive alpha mining system framework that provides a heuristic way to “understand” the ideas of quant researchers and outputs creative, insightful, and effective alphas. We demonstrate the effectiveness and advantage of Alpha-GPT via a number of alpha mining experiments. In particular, we evaluated Alpha-GPT’s performance in the WorldQuant International Quant Championship, where it demonstrated results comparable to those of top-performing human participants, ranking among the top 10 of over 41,000 teams worldwide. These findings suggest Alpha-GPT’s significant potential in generating highly effective alphas that may surpass human capabilities in quantitative investment strategies.


#15 AgentDiagnose: An Open Toolkit for Diagnosing LLM Agent Trajectories

Authors: Tianyue Ou, Wanyao Guo, Apurva Gandhi, Graham Neubig, Xiang Yue

Large Language Model (LLM) agents produce rich, multi-step trajectories that interleave observations, internal reasoning, and tool actions. However, most evaluation pipelines focus solely on end-task success, leaving the agent’s decision-making process opaque and poorly understood. We introduce AgentDiagnose, an open-source, modular framework for diagnosing agent trajectories. The present release fully supports the web domain, and AgentDiagnose is architected as an extensible, open platform compatible with most agent trajectories. AgentDiagnose consists of (i) an evaluation module that quantifies five core agentic competencies—backtracking & exploration, task decomposition, observation reading, self-verification, and objective quality—and (ii) a visualization module that highlights trajectory semantics through t-SNE action embeddings, interactive word clouds, and state-transition timelines. On a set of 30 manually annotated trajectories, our automatic metrics achieve a mean Pearson correlation of 0.57 with human judgments, rising to 0.78 for task decomposition. Furthermore, filtering the 46k-example NNetNav-Live dataset with AgentDiagnose and fine-tuning a Llama-3.1-8B model on the top 6k trajectories improves WebArena success rates by 0.98, despite using only 13% of the original data. AgentDiagnose thus serves as both a diagnostic lens for agent analysis and a practical tool for curating higher-quality training data. The toolkit and demo are publicly available.
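
The data-curation use above reduces to a simple recipe: score each trajectory with the competency metrics, keep the top-k, and sanity-check the metrics against human labels. A sketch under assumed field names:

```python
from statistics import correlation  # Pearson r (Python 3.10+)

def curate(trajectories: list[dict], k: int = 6000) -> list[dict]:
    """Keep the k trajectories with the highest mean competency score."""
    def overall(t: dict) -> float:
        return sum(t["scores"].values()) / len(t["scores"])
    return sorted(trajectories, key=overall, reverse=True)[:k]

def agreement(auto_scores: list[float], human_scores: list[float]) -> float:
    """Pearson correlation between automatic metrics and human judgments."""
    return correlation(auto_scores, human_scores)
```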


#16 Tau-Eval: A Unified Evaluation Framework for Useful and Private Text Anonymization

Authors: Gabriel Loiseau, Damien Sileo, Damien Riquet, Maxime Meyer, Marc Tommasi

Text anonymization is the process of removing or obfuscating information from textual data to protect the privacy of individuals. This process inherently involves a complex trade-off between privacy protection and information preservation, where stringent anonymization methods can significantly impact the text’s utility for downstream applications. Evaluating the effectiveness of text anonymization proves challenging from both privacy and utility perspectives, as there is no universal benchmark that can comprehensively assess anonymization techniques across diverse and sometimes contradictory contexts. We present Tau-Eval, an open-source framework for benchmarking text anonymization methods through the lens of privacy and utility task sensitivity. A Python library, code, documentation and tutorials are publicly available.
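
The privacy/utility trade-off above can be captured in a few lines. Below is a hypothetical sketch of the benchmarking loop; the function names are assumptions and not Tau-Eval's real API.

```python
def benchmark(anonymize, records: list[dict], privacy_attack, utility_task):
    """Evaluate one anonymization method on both axes of the trade-off."""
    anonymized = [dict(r, text=anonymize(r["text"])) for r in records]
    # Privacy: how often can an attacker still re-identify individuals?
    privacy = 1.0 - privacy_attack(anonymized)   # higher is safer
    # Utility: does a downstream task still succeed on the redacted text?
    utility = utility_task(anonymized)           # higher is better
    return {"privacy": privacy, "utility": utility}
```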


#17 ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning

Authors: Yichen Lu, Wei Dai, Jiaen Liu, Ching Wing Kwok, Zongheng Wu, Xudong Xiao, Ao Sun, Sheng Fu, Jianyuan Zhan, Yian Wang, Takatomo Saito, Sicheng Lai

LLM-based translation agents have achieved highly human-like translation results and are capable of handling longer and more complex contexts with greater efficiency. However, they are typically limited to text-only inputs. In this paper, we introduce ViDove, a translation agent system designed for multimodal input. Inspired by the workflow of human translators, ViDove leverages visual and contextual background information to enhance the translation process. Additionally, we integrate a multimodal memory system and long- and short-term memory modules enriched with domain-specific knowledge, enabling the agent to perform more accurately and adaptively in real-world scenarios. As a result, ViDove achieves significantly higher translation quality in both subtitle generation and general translation tasks, with a 28% improvement in BLEU scores and a 15% improvement in SubER compared to previous state-of-the-art baselines. Moreover, we introduce DoveBench, a new benchmark for long-form automatic video subtitling and translation, featuring 17 hours of high-quality, human-annotated data. Our demo is available here: https://vidove.willbe03.com/


#18 Sanskrit Voyager: Unified Web Platform for Interactive Reading and Linguistic Analysis of Sanskrit Texts

Authors: Giacomo De Luca, Danilo Croce, Roberto Basili

Sanskrit Voyager is a web application for searching, reading, and analyzing texts in the Sanskrit literary corpus. Unlike previous tools that require expert linguistic knowledge or manual normalization, Sanskrit Voyager enables users to search for words and phrases as they actually appear in texts, handling inflection, sandhi, and compound forms automatically while supporting any transliteration. The system integrates four core functionalities: (1) multi-dictionary lookup with morphological analysis and inflection tables; (2) real-time text parsing and annotation; (3) an interactive reader for over 900 digitized texts; and (4) advanced corpus search with fuzzy matching and filtering. Evaluation shows over 92% parsing accuracy on complex compounds and substantially higher search recall than BuddhaNexus on challenging queries. The source code is publicly available under a CC-BY-NC license; the system is resource-efficient and designed for both learners and researchers, offering the first fully integrated, user-friendly platform for computational Sanskrit studies.


#19 PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation

Authors: Eliya Habba, Noam Dahan, Gili Lior, Gabriel Stanovsky

Evaluating LLMs with a single prompt has proven unreliable, with small changes leading to significant performance differences. However, generating the prompt variations needed for a more robust multi-prompt evaluation is challenging, limiting its adoption in practice. To address this, we introduce PromptSuite, a framework that enables the automatic generation of various prompts. PromptSuite is flexible – working out of the box on a wide range of tasks and benchmarks. It follows a modular prompt design, allowing controlled perturbations to each component, and is extensible, supporting the addition of new components and perturbation types. Through a series of case studies, we show that PromptSuite provides meaningful variations to support strong evaluation practices. All resources, including the Python API, source code, user-friendly web interface, and demonstration video, are available at: https://eliyahabba.github.io/PromptSuite/.
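
The modular design can be pictured as independent perturbations per prompt component whose combinations form the variation space. A toy sketch, not PromptSuite's actual API:

```python
import itertools
import random

# Each component has interchangeable variants; the cross-product of
# variants yields the pool of prompt variations.
COMPONENTS = {
    "instruction": ["Answer the question.",
                    "Please answer the following question."],
    "separator": ["\n", "\nQ: "],
    "cot": ["", " Think step by step."],
}

def prompt_variants(question: str, n: int = 10) -> list[str]:
    combos = list(itertools.product(*COMPONENTS.values()))
    random.shuffle(combos)   # sample n variants from the full space
    return [f"{inst}{cot}{sep}{question}" for inst, sep, cot in combos[:n]]
```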


#20 LionGuard 2: Building Lightweight, Data-Efficient & Localised Multilingual Content Moderators

Authors: Leanne Tan, Gabriel Chua, Ziyu Ge, Roy Ka-Wei Lee

Modern moderation systems increasingly support multiple languages, but often fail to address localisation and low-resource variants—creating safety gaps in real-world deployments. Small models offer a potential alternative to large LLMs, yet still demand considerable data and compute. We present LionGuard 2, a lightweight, multilingual moderation classifier tailored to the Singapore context, supporting English, Chinese, Malay, and partial Tamil. Built on pre-trained OpenAI embeddings and a multi-head ordinal classifier, LionGuard 2 outperforms several commercial and open-source systems across 17 benchmarks, including both Singapore-specific and public English datasets. The system is actively deployed within the Singapore Government, demonstrating practical efficacy at scale. Our findings show that high-quality local data and robust multilingual embeddings can achieve strong moderation performance, without fine-tuning large models. We release our model weights and part of our training data to support future work on LLM safety.
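
The architecture lends itself to a compact sketch: frozen pre-trained embeddings feed a small trunk with one ordinal head per harm category. Dimensions, category names, and severity levels below are illustrative assumptions, not LionGuard 2's released configuration.

```python
import torch
import torch.nn as nn

class MultiHeadOrdinal(nn.Module):
    """Multi-head ordinal classifier over fixed text embeddings."""

    def __init__(self, emb_dim: int = 3072, hidden: int = 256,
                 categories=("hateful", "sexual", "violent"), levels: int = 3):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU())
        # Ordinal formulation: each head predicts P(severity > k)
        # for each of the levels-1 thresholds k.
        self.heads = nn.ModuleDict(
            {c: nn.Linear(hidden, levels - 1) for c in categories})

    def forward(self, embeddings: torch.Tensor) -> dict:
        h = self.trunk(embeddings)
        return {c: torch.sigmoid(head(h)) for c, head in self.heads.items()}
```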


#21 GraphMind: Interactive Novelty Assessment System for Accelerating Scientific Discovery

Authors: Italo Luis da Silva, Hanqi Yan, Lin Gui, Yulan He

Large Language Models (LLMs) show strong reasoning and text generation capabilities, prompting their use in scientific literature analysis, including novelty assessment. While evaluating the novelty of scientific papers is crucial for peer review, it requires extensive knowledge of related work, something not all reviewers have. While recent work on LLM-assisted scientific literature analysis supports literature comparison, existing approaches offer limited transparency and lack mechanisms for result traceability via an information retrieval module. To address this gap, we introduce GraphMind, an easy-to-use interactive web tool designed to assist users in evaluating the novelty of scientific papers or drafted ideas. Specifically, GraphMind enables users to capture the main structure of a scientific paper, annotate its key elements, explore related papers through various relationships, and assess novelty with verifiable contextual insights. The tool integrates external APIs such as arXiv and Semantic Scholar with LLMs to support annotation, extraction, retrieval, and classification of papers. This combination provides users with a rich, structured view of a scientific idea’s core contributions and its connections to existing work. GraphMind is available at https://oyarsa.github.io/graphmind, with a demonstration video at https://youtu.be/wKbjQpSvwJg.
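
One of the integrations mentioned above, retrieval of candidate related work, can be sketched against the public Semantic Scholar Graph API (endpoint and fields per its documentation); error handling is kept minimal here.

```python
import requests

def related_papers(query: str, limit: int = 10) -> list[dict]:
    """Fetch candidate related papers from the Semantic Scholar Graph API."""
    resp = requests.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={"query": query, "limit": limit,
                "fields": "title,abstract,year,externalIds"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("data", [])
```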


#22 Pico: A Modular Framework for Hypothesis-Driven Small Language Model Research

Authors: Richard Diehl Martinez, David Demitri Africa, Yuval Weiss, Suchir Salhan, Ryan Daniels, Paula Buttery

Building language models (LMs), especially small and medium ones, remains more art than science. While large LMs often improve by sheer scale, it is still unclear why many design choices work. For small LMs, this uncertainty is more limiting: tight parameter budgets make each decision critical, yet researchers still lack systematic, scientific ways to test and refine new ideas. We introduce Pico, a lightweight, modular framework that enables systematic, hypothesis-driven research for small and medium-scale language model development. Pico consists of two libraries that together provide a practical sandbox where researchers can make targeted changes to a model’s architecture or training procedures and directly observe their effects on the model’s behavior. To support reproducible experimentation, we also release a suite of baseline models, pico-decoder, trained under standardized conditions and open-sourced for the community. Case studies highlight how Pico can support iterative small LM design and analysis.


#23 DistaLs: a Comprehensive Collection of Language Distance Measures

Authors: Rob Van Der Goot, Esther Ploeger, Verena Blaschke, Tanja Samardzic

Languages vary along a wide variety of dimensions. In Natural Language Processing (NLP), it is useful to know how “distant” languages are from each other, so that we can inform NLP models about these differences or predict good transfer languages. Furthermore, it can tell us how diverse language samples are. However, there are many different perspectives on how distances across languages could be measured, and previous work has predominantly relied on either intuition or a single type of distance, such as genealogical or typological distance. Therefore, we propose DistaLs, a toolkit designed to give users easy access to a wide variety of language distance measures. We also provide a filtered subset containing less redundant and more reliable features. DistaLs is designed to be accessible for a variety of use cases and offers Python, CLI, and web interfaces. It is easily updatable and available as a pip package. Finally, we present a case study in which we use DistaLs to measure correlations between distance measures and performance on four morphosyntactic tasks.
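
Two of the use cases above, transfer-language prediction and distance-performance correlation, fit in a short sketch. The `distals.distance` call is a hypothetical stand-in, not the package's documented interface.

```python
from statistics import correlation  # Pearson r (Python 3.10+)

def transfer_ranking(distals, candidates: list[str], target: str,
                     measure: str = "typological") -> list[str]:
    """Rank candidate transfer languages by distance to the target."""
    dists = {lang: distals.distance(lang, target, measure=measure)
             for lang in candidates}
    return sorted(dists, key=dists.get)

def distance_vs_performance(distances: list[float],
                            scores: list[float]) -> float:
    """Correlate a distance measure with task scores, as in the case study."""
    return correlation(distances, scores)
```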


#24 MedTutor: A Retrieval-Augmented LLM System for Case-Based Medical Education

Authors: Dongsuk Jang, Ziyao Shangguan, Kyle Tegtmeyer, Anurag Gupta, Jan T Czerminski, Sophie Chheang, Arman Cohan

The learning process for medical residents presents significant challenges, demanding both the ability to interpret complex case reports and the rapid acquisition of accurate medical knowledge from reliable sources. Residents typically study case reports and engage in discussions with peers and mentors, but finding relevant educational materials and evidence to support their learning from these cases is often time-consuming and challenging. To address this, we introduce MedTutor, a novel system designed to augment resident training by automatically generating evidence-based educational content and multiple-choice questions from clinical case reports. MedTutor leverages a Retrieval-Augmented Generation (RAG) pipeline that takes clinical case reports as input and produces targeted educational materials. The system’s architecture features a hybrid retrieval mechanism that synergistically queries a local knowledge base of medical textbooks and academic literature (via the PubMed and Semantic Scholar APIs) for the latest related research, ensuring the generated content is both foundationally sound and current. The retrieved evidence is filtered and ordered by a state-of-the-art reranking model, and an LLM then generates the final long-form output describing the main educational content for the case report. We conduct a rigorous evaluation of the system. First, two radiologists assessed the quality of its outputs, finding them to be of high clinical and educational value. Second, we perform a large-scale evaluation using an LLM-as-a-Judge to understand whether LLMs can be used to evaluate the system’s output. Our analysis of the correlation between LLM and human expert judgments reveals moderate alignment and highlights the continued necessity of expert oversight.
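
Condensed to its skeleton, the pipeline above is retrieve (local plus live literature), rerank, then generate. The sketch below assumes stand-in components (`local_kb`, `pubmed_search`, `reranker`, `llm`) rather than MedTutor's actual interfaces.

```python
def generate_tutorial(case_report: str, local_kb, pubmed_search, reranker,
                      llm, k_retrieve: int = 50, k_keep: int = 8) -> str:
    # Hybrid retrieval: curated textbooks plus current literature.
    candidates = (local_kb.search(case_report, k=k_retrieve)
                  + pubmed_search(case_report, k=k_retrieve))
    # Rerank and keep only the most relevant evidence.
    evidence = reranker.rank(case_report, candidates)[:k_keep]
    context = "\n\n".join(doc["text"] for doc in evidence)
    return llm(f"Case report:\n{case_report}\n\nEvidence:\n{context}\n\n"
               "Write an educational summary with citations and "
               "multiple-choice questions.")
```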


#25 Co-DETECT: Collaborative Discovery of Edge Cases in Text Classification

Authors: Chenfei Xiong, Jingwei Ni, Yu Fan, Vilém Zouhar, Donya Rooein, Lorena Calvo-Bartolomé, Alexander Miserlis Hoyle, Zhijing Jin, Mrinmaya Sachan, Markus Leippold, Dirk Hovy, Mennatallah El-Assady, Elliott Ash

We introduce Co-DETECT (Collaborative Discovery of Edge cases in TExt ClassificaTion), a novel mixed-initiative annotation framework that integrates human expertise with automatic annotation guided by large language models (LLMs). Co-DETECT starts with an initial, sketch-level codebook and dataset provided by a domain expert, then leverages the LLM to annotate the data and identify edge cases that are not well described by the initial codebook. Specifically, Co-DETECT flags challenging examples, induces high-level, generalizable descriptions of edge cases, and assists users in incorporating edge-case handling rules to improve the codebook. This iterative process enables more effective handling of nuanced phenomena through compact, generalizable annotation rules. An extensive user study and qualitative and quantitative analyses demonstrate the effectiveness of Co-DETECT.
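
One round of the iterative process above can be sketched as: annotate with the current codebook, flag low-confidence examples, induce a generalizable edge-case rule, and hand it to the expert for review. All names below are illustrative, not Co-DETECT's real interface.

```python
def codetect_round(codebook: str, dataset: list[str], annotate, induce_rule,
                   confidence_floor: float = 0.6):
    """Run one annotate-flag-induce iteration; the expert reviews the rule."""
    flagged = []
    for text in dataset:
        label, confidence = annotate(codebook, text)  # LLM annotation
        if confidence < confidence_floor:
            flagged.append(text)                      # edge-case candidate
    if flagged:
        rule = induce_rule(codebook, flagged)  # high-level description
        codebook = codebook + "\nEdge case: " + rule
    return codebook, flagged
```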
