PACUTE: Phonology-, Affix-, and Character-level Understanding of Tokens for Filipino

Authors: Jann Railey Montalan, David Demitri Africa, Jimson Paulo Layacan, Richell Isaiah Flores, Ivan Yuri De Leon, Lance Calvin Gamboa

Large language models (LLMs) process text as sequences of subword tokens, which can obscure the character-level and morphological structure that underlies word formation. This limitation is most acute for languages with non-concatenative morphology, where standard tokenizers systematically misalign token boundaries with morpheme boundaries. We introduce PACUTE, a diagnostic benchmark of 4,600 tasks designed to evaluate morphological understanding in Filipino, a language characterized by productive infixation, reduplication, and lexical distinctions marked by diacritics that are typically omitted in written text. PACUTE includes a hierarchical diagnostic framework of six compositional levels that localizes where morphological understanding breaks down. Evaluating open-weight LLMs and frontier commercial models, we find that open-weight models perform near chance on morpheme decomposition regardless of scale. Frontier models perform much better, often recovering individual affixes under contains-match scoring, but remain far below their character-level ceilings on compositional tasks involving morpheme transformation and syllabification. These results identify productive morphological composition, rather than character access alone, as the persistent bottleneck for Filipino word-structure understanding.
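The abstract reports affix recovery "under contains-match scoring" without defining it. A minimal sketch of how such a lenient metric typically differs from exact match follows, assuming contains-match means the gold answer appears as a substring of the model output. The function names, the normalization, and the sample output are illustrative assumptions, not the benchmark's actual implementation. The example word is "sumulat", formed from the root "sulat" with the infix "-um-".

```python
def exact_match(prediction: str, gold: str) -> bool:
    """Strict scoring: the normalized prediction must equal the gold answer."""
    return prediction.strip().lower() == gold.strip().lower()


def contains_match(prediction: str, gold: str) -> bool:
    """Lenient scoring: the gold answer only has to appear inside the prediction."""
    return gold.strip().lower() in prediction.strip().lower()


# Hypothetical model output for "Which affix appears in 'sumulat'?"
# The gold answer is the infix "um" (sulat + -um- -> sumulat).
prediction = "The word contains the infix -um- attached to the root sulat."
gold = "um"

print(exact_match(prediction, gold))     # False
print(contains_match(prediction, gold))  # True
```

Under this assumed definition, a verbose answer that names the right affix is credited even though it would fail exact match. The trade-off is that a substring test can also credit an output that merely repeats the input word, since "um" occurs inside "sumulat" itself.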

Subjects: Computation and Language, Artificial Intelligence

Published: 2026-06-13 06:12:56 UTC

2606.15144
