The frequency distribution of words in human-written texts roughly follows a simple mathematical form known as Zipf’s law. Somewhat less well known is the related Heaps’ law, which describes a sublinear power-law growth of vocabulary size with document size. We study the applicability of Zipf’s and Heaps’ laws to texts generated by Large Language Models (LLMs). We show empirically that Zipf’s and Heaps’ laws hold for LLM-generated texts only within a narrow, model-dependent range of sampling temperatures. The optimal temperature is close to t=1 for all base models except the large Llama models, is higher for instruction-finetuned models, and does not depend on model size or prompting. This independently confirms the recent discovery of sampling-temperature-dependent phase transitions in LLM-generated texts.
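For illustration, a minimal sketch of how the two laws can be measured on a token sequence, assuming the standard formulations: Zipf’s law states that the frequency of the r-th most common word scales as r^(-alpha), and Heaps’ law states that the vocabulary size after N tokens scales as N^beta with beta < 1. The whitespace tokenization, the file name, and the plain log-log least-squares fits are assumptions for the example, not the paper’s actual estimation procedure.

```python
from collections import Counter
import numpy as np

def zipf_heaps_exponents(tokens):
    """Estimate Zipf and Heaps exponents from a token sequence via log-log fits."""
    # Zipf: frequency of the r-th most frequent word ~ r^(-alpha)
    counts = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(counts) + 1, dtype=float)
    alpha = -np.polyfit(np.log(ranks), np.log(counts), 1)[0]

    # Heaps: vocabulary size after reading N tokens ~ N^beta (sublinear, beta < 1)
    seen, vocab_sizes = set(), []
    for tok in tokens:
        seen.add(tok)
        vocab_sizes.append(len(seen))
    ns = np.arange(1, len(tokens) + 1, dtype=float)
    beta = np.polyfit(np.log(ns), np.log(np.array(vocab_sizes, dtype=float)), 1)[0]
    return alpha, beta

# Hypothetical usage on an LLM-generated text file:
# tokens = open("generated.txt").read().split()
# alpha, beta = zipf_heaps_exponents(tokens)
```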