Pretraining Language Models on Historical Text

Authors: Xiaoxi Luo, Zachary Shinnick, Niclas Griesshaber, Yixuan Wang, Junchi Yu, Freda Shi, Philip Torr, Yao Lu

We introduce TypewriterLM, a 7.24B History language model (LM) trained exclusively on English text predating 1913. Developing History LMs requires addressing challenges in data quality and availability, preventing temporal leakage, designing temporally consistent post-training pipelines, and constructing reliable evaluations. To address these issues, we construct TypewriterCorpus, a 54B-token historical corpus collected from diverse archival and linguistically annotated sources with extensive data cleaning and leakage mitigation procedures. Furthermore, we introduce lexically grounded instruction tuning, a post-training framework that constrains responses to remain directly grounded in historical source documents. Using this framework, we construct two historical instruction tuning datasets: History-LIMA and History-SelfInstruct. To evaluate capability and temporal consistency, we introduce History-Event, a benchmark suite for evaluating competence, temporal grounding, and data leakage. We release TypewriterLM and all associated resources to support future research on historical language models.

Subjects: Computation and Language, Artificial Intelligence

Publish: 2026-06-02 00:59:06 UTC

2606.02991
