2025.naacl-long.398@ACL


#1 Functional Lexicon in Subword Tokenization

Authors: Zachary William Hopton, Yves Scherrer, Tanja Samardzic

The distinction between function and content units of the lexicon has been somewhat neglected in recent NLP work, but it could still be useful for working with low-resource languages and, in particular, for improving cross-lingual transfer. In this paper, we investigate to what extent BPE subword tokenization can be used to identify units of the functional lexicon in a language without any annotated data. We analyze subword tokens in terms of their productivity and attempt to find thresholds that best distinguish function from content tokens. On a sample of seven diverse languages, we find that the best results are obtained with 50 BPE merges. We also show that this subword tokenization setting can be beneficial for the interlinear glossing task.
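The abstract's core idea can be illustrated with a toy sketch (this is not the authors' code): learn a small number of BPE merges over a tiny corpus, then score each subword's "productivity" as the number of distinct subwords that follow it. Short, highly productive subwords are candidate function units; the paper searches for a threshold on such statistics. The corpus, the merge budget, and this particular productivity definition are all illustrative assumptions.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    vocab = Counter(words)
    seqs = {w: tuple(w) for w in vocab}  # current segmentation of each word
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w, freq in vocab.items():
            s = seqs[w]
            for a, b in zip(s, s[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        for w in seqs:  # apply the merge to every word's segmentation
            s, out, i = seqs[w], [], 0
            while i < len(s):
                if i + 1 < len(s) and (s[i], s[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(s[i])
                    i += 1
            seqs[w] = tuple(out)
    return seqs, merges

def productivity(words, seqs):
    """Number of distinct right neighbours of each subword (one possible
    productivity measure; the paper's exact definition may differ)."""
    right = {}
    for w in words:
        s = seqs[w]
        for a, b in zip(s, s[1:]):
            right.setdefault(a, set()).add(b)
    return {tok: len(neighbours) for tok, neighbours in right.items()}

# Hypothetical mini-corpus: inflected English forms share functional suffixes.
corpus = ["walked", "walking", "talked", "talking", "jumped", "jumping"]
seqs, merges = learn_bpe(corpus, 10)
prod = productivity(corpus, seqs)
```

With enough data, tokens like an "-ing" subword would follow many different stems and score high on such a measure, while content stems combine with few distinct continuations; a threshold on this score is what separates the two classes.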

Subject: NAACL.2025 - Long Papers