Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet code generation remains a major challenge. Although code data is abundant, constructing high-quality training datasets at scale is difficult: pre-training code corpora suffer from inconsistent quality, while instruction-based methods, which rely on a small high-quality subset as seed samples, suffer from limited task diversity. In this paper, we introduce UnitCoder, a framework that directly supervises pre-training data quality through automatically generated unit tests and ensures correctness via an iterative fix-and-refine flow. Code synthesized by UnitCoder benefits from both the diversity of pre-training corpora and the quality guaranteed by unit-test supervision. Our experiments demonstrate that models fine-tuned on our synthetic dataset exhibit consistent performance improvements. Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code from pre-training corpora, demonstrating the potential for producing diverse and high-quality post-training data at scale. All code and data will be released.
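The abstract describes the pipeline only at a high level. As an illustration of how such a unit-test-supervised fix-and-refine loop might be structured, the following minimal Python sketch assumes a hypothetical llm.generate interface and a fixed repair budget (MAX_FIX_ROUNDS); these names and the prompt wording are assumptions for illustration, not the paper's actual implementation.

import subprocess
import tempfile

MAX_FIX_ROUNDS = 3  # assumed repair budget; not specified in the abstract

def run_unit_tests(candidate: str, unit_tests: str) -> tuple[bool, str]:
    """Execute the candidate code against model-generated unit tests
    in a subprocess; return (passed, error output)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate + "\n\n" + unit_tests)
        path = f.name
    result = subprocess.run(
        ["python", path], capture_output=True, text=True, timeout=30
    )
    return result.returncode == 0, result.stderr

def synthesize(snippet: str, llm) -> str | None:
    """Hypothetical pipeline: generate unit tests from a pre-training
    code snippet, then iteratively refine the code until all tests pass."""
    unit_tests = llm.generate(f"Write unit tests for:\n{snippet}")
    candidate = snippet
    for _ in range(MAX_FIX_ROUNDS):
        passed, errors = run_unit_tests(candidate, unit_tests)
        if passed:
            return candidate  # keep only verified samples
        # feed the failure trace back to the model for a repair attempt
        candidate = llm.generate(
            f"Fix this code so the tests pass.\n"
            f"Code:\n{candidate}\nTests:\n{unit_tests}\nErrors:\n{errors}"
        )
    return None  # discard samples that never pass within the budget

The key design point this sketch captures is that the unit tests act as an automatic quality filter over diverse pre-training snippets: samples that never pass are discarded rather than included, so scale comes from the corpus while correctness comes from the tests.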