
#1 Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training

Authors: Mozhi Zhang, Hao Sun, Lu Wang, Xipeng Qiu

We introduce *Domain2Vec*, a novel approach that decomposes any dataset into a linear combination of several *meta-domains*, a new concept designed to capture the key underlying features of datasets. *Domain2Vec* maintains a vocabulary of meta-domains and uses a classifier to decompose any given dataset into a domain vector that corresponds to a distribution over this vocabulary. These domain vectors enable the identification of the optimal data mixture for language model (LM) pretraining in a training-free manner under the ***D**istribution **A**lignment **A**ssumption* (DA$^{2}$), which posits that validation loss is lower when the data distributions of the training and validation sets are more closely aligned. Moreover, *Domain2Vec* can be seamlessly integrated into previous works to model the relationship between domain vectors and LM performance, greatly enhancing the efficiency and scalability of those methods. Extensive experiments demonstrate that *Domain2Vec* finds data mixtures that enhance downstream task performance with minimal computational overhead. Specifically, *Domain2Vec* achieves the same validation loss on Pile-CC using only $51.5\%$ of the compute required when training on the original mixture of The Pile dataset. Under an equivalent compute budget, *Domain2Vec* improves downstream performance by an average of $2.83\%$.
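To make the DA$^{2}$ idea concrete, the following is a minimal sketch (not the authors' implementation) of how one might pick mixture weights over candidate training datasets so that the mixed domain vector aligns with the validation set's domain vector. The function names, shapes, and the choice of KL divergence as the alignment measure are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of the DA^2 idea: choose mixture weights on the
# simplex so the weighted combination of the training datasets' domain
# vectors matches the validation set's domain vector. The use of
# KL divergence as the alignment measure is an assumption here.
import numpy as np
from scipy.optimize import minimize

def optimal_mixture(train_vecs: np.ndarray, valid_vec: np.ndarray) -> np.ndarray:
    """train_vecs: (k, m) domain vectors of k candidate datasets over m meta-domains.
    valid_vec: (m,) domain vector of the validation set.
    Returns mixture weights on the k-simplex minimizing KL(valid || mixture)."""
    k = train_vecs.shape[0]

    def kl_to_valid(w: np.ndarray) -> float:
        mix = w @ train_vecs                      # mixed distribution over meta-domains
        mix = np.clip(mix, 1e-12, None)           # avoid log(0)
        terms = np.where(valid_vec > 0.0,
                         valid_vec * np.log(valid_vec / mix),
                         0.0)
        return float(np.sum(terms))

    constraints = ({"type": "eq", "fun": lambda w: w.sum() - 1.0},)
    bounds = [(0.0, 1.0)] * k
    w0 = np.full(k, 1.0 / k)                      # start from the uniform mixture
    result = minimize(kl_to_valid, w0, bounds=bounds, constraints=constraints)
    return result.x

# Toy example: three candidate datasets, four meta-domains.
train_vecs = np.array([[0.7, 0.1, 0.1, 0.1],
                       [0.1, 0.6, 0.2, 0.1],
                       [0.1, 0.1, 0.2, 0.6]])
valid_vec = np.array([0.3, 0.3, 0.2, 0.2])
print(optimal_mixture(train_vecs, valid_vec))
```

Because both the domain vectors and the optimization live in a low-dimensional meta-domain space, this search requires no model training, which is what makes the approach training-free.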

Subject: ICML.2025 - Poster