Realistic Training Data Generation and Rule Enhanced Decoding in LLM for NameGuess


Authors: Yikuan Xia, Jiazun Chen, Sujian Li, Jun Gao

The wide use of abbreviated column names (derived from English words or Chinese Pinyin) in database tables poses significant challenges for table-centric tasks in natural language processing and database management. The task of expanding such column names, referred to as NameGuess, has previously been addressed by fine-tuning Large Language Models (LLMs) on synthetic, rule-generated data. However, current approaches yield suboptimal performance due to two fundamental limitations: 1) the rule-generated abbreviation data fails to reflect the real-world distribution, and 2) LLMs persistently fail to follow the rule-sensitive patterns of NameGuess. To address data realism, we propose an approach that integrates a subsequence abbreviation generator trained on human-annotated data and collects non-subsequence abbreviations to improve the training set. To address rule violations, we propose a decoding system constrained by an automaton that represents the rules of abbreviation expansion. We extend the original English NameGuess test set to include non-subsequence and Pinyin scenarios. Experimental results show that properly tuned, moderate-size LLMs (7B/8B) with a refined decoding system can surpass the few-shot performance of state-of-the-art LLMs, such as the GPT-4 series. The code and data are presented in the supplementary material.
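To make the distinction between subsequence and non-subsequence abbreviations concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a "subsequence abbreviation" is one whose characters appear, in order, within the expanded name; the function names `is_subsequence` and `filter_candidates` and the example strings are hypothetical. A check of this kind is one simple rule that a constrained decoder could enforce on candidate expansions.

```python
def is_subsequence(abbr: str, expansion: str) -> bool:
    """Return True if the characters of `abbr` appear in order in `expansion`."""
    chars = iter(expansion.lower())
    # `ch in chars` consumes the iterator up to the first match,
    # so each later character must be found after the previous one.
    return all(ch in chars for ch in abbr.lower())


def filter_candidates(abbr: str, candidates: list[str]) -> list[str]:
    """Keep only the candidate expansions that satisfy the subsequence rule."""
    return [c for c in candidates if is_subsequence(abbr, c)]


if __name__ == "__main__":
    print(is_subsequence("qty", "quantity"))
    print(is_subsequence("xmit", "transmit"))
    print(filter_candidates("addr", ["address", "added", "advertiser"]))
```

Under this assumed definition, an abbreviation such as `xmit` for `transmit` fails the check, which is why non-subsequence cases would need to be collected separately rather than produced by a subsequence rule.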

Subject: EMNLP.2025 - Main

2025.emnlp-main.357@ACL
