2410.01487

Total: 1

#1 Small Language Models Like Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas [PDF] [Copy] [Kimi2] [REL]

Authors: Bastian Bunzeck, Daniel Duran, Leonie Schade, Sina Zarrieß

Current language models use subword-based tokenization algorithms like Byte Pair Encoding, which put their validity as models of linguistic representations into question. In this paper, we explore the potential of tokenization-free, phoneme- and grapheme-based language models. We demonstrate that small models based on the Llama architecture can achieve strong linguistic performance on standard syntactic and novel lexical/phonetic benchmarks when trained with character-level vocabularies. We further show that phoneme-based models without any graphemic biases almost match grapheme-based models in standard tasks and novel evaluations. Our findings suggest a promising direction for creating more linguistically plausible language models that are better suited for computational studies of language acquisition and processing.

Subject: Computation and Language

Publish: 2024-10-02 12:36:08 UTC