Reinforcement learning is widely used for post-training large language models, especially on reasoning-style tasks such as maths questions. However, as we show, most existing methods provably fail to learn from questions that are either too hard, where the model always fails, or too easy, where the model always succeeds. Much human effort is therefore spent continually producing datasets of questions at a suitable difficulty for state-of-the-art models. Given this, we consider how to algorithmically identify the questions that allow for maximally efficient training. We introduce a method, LILO (Learnability Improves LLMs Optimally), that prioritises training on questions with high variance of success, known as learnability, and we provide theory proving that LILO maximises the expected improvement of the model. We run a wide range of experiments over multiple base models, algorithms and reasoning datasets to demonstrate that LILO consistently improves final test accuracy and can yield a 3x reduction in the number of training steps required to reach it. We explore how questions with high learnability can be efficiently identified, and discuss how learnability can be scaled to produce LLM agents that autonomously and open-endedly expand the frontier of human knowledge.
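As a rough illustration of the selection criterion described above, the sketch below scores each question by its empirical success variance p_hat * (1 - p_hat), estimated from a few rollouts under the current policy, and keeps the highest-scoring questions for training. The function names, the top-k selection rule and the rollout counts are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def estimate_learnability(success_flags):
    """Empirical learnability of one question: the variance of its
    Bernoulli success indicator, p_hat * (1 - p_hat), estimated from
    a handful of rollouts sampled from the current policy."""
    p_hat = float(np.mean(success_flags))
    return p_hat * (1.0 - p_hat)

def select_training_questions(rollout_successes, k):
    """Rank questions by estimated learnability and keep the top k.
    `rollout_successes` maps each question id to a list of 0/1 flags,
    one per sampled rollout. (Hypothetical helper, not the paper's API.)"""
    scores = {q: estimate_learnability(flags) for q, flags in rollout_successes.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

# Questions the model always solves (p_hat = 1) or never solves (p_hat = 0)
# score zero and are filtered out; partially solved questions score highest.
rollouts = {
    "easy": [1, 1, 1, 1],    # p_hat = 1.0 -> learnability 0.0
    "hard": [0, 0, 0, 0],    # p_hat = 0.0 -> learnability 0.0
    "useful": [1, 0, 1, 0],  # p_hat = 0.5 -> learnability 0.25
}
print(select_training_questions(rollouts, k=1))  # ['useful']
```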