xVal: A Continuous Numerical Tokenization for Scientific Language Models

Authors: Siavash Golkar, Mariel Pettee, Michael Eickenberg, Alberto Bietti, Miles Cranmer, Geraud Krawezik, Francois Lanusse, Michael McCabe, Ruben Ohana, Liam Parker, Bruno Régaldo-Saint Blancard, Tiberiu Tesileanu, Kyunghyun Cho, Shirley Ho

Due in part to their discontinuous and discrete default encodings for numbers, Large Language Models (LLMs) have not yet been commonly used to process numerically-dense scientific datasets. Rendering datasets as text, however, could help aggregate diverse and multi-modal scientific data into a single training corpus, thereby potentially facilitating the development of foundation models for science. In this work, we introduce xVal, a strategy for continuously tokenizing numbers within language models that results in a more appropriate inductive bias for scientific applications. By training specially-modified language models from scratch on a variety of scientific datasets formatted as text, we find that xVal generally outperforms other common numerical tokenization strategies on metrics including out-of-distribution generalization and computational efficiency.

Subjects: Machine Learning, Artificial Intelligence, Computation and Language

Published: 2023-10-04 17:26:16 UTC

arXiv: 2310.02989
