Past works have shown that lexical, syntactic, and semantic differences across heterogeneous data sources can cause challenges such as negative interference or the ``curse of multilinguality''. Because of this, training on such heterogeneous corpora requires extensive and costly efforts to balance data mixtures. We propose a novel pre-training framework to alleviate this curse. Our method, DEPT, decouples embeddings from the transformer body while simultaneously training the latter in multiple contexts without a shared global vocabulary. DEPT: (1) trains robustly and effectively under significant data heterogeneity, (2) reduces token embedding parameters by up to 80% and communication costs by 714x for billion-scale models, (3) enhances transformer body plasticity and generalization, improving average perplexity by 15.3-20% and improving downstream fine-tuning performance in our experiments, and (4) permits training with custom optimized vocabularies per data source. We demonstrate DEPT's potential via the first vocabulary-agnostic federated multilingual pre-training of a billion-scale model, reducing total parameters by 24% versus standard training.
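To make the decoupling concrete, the following is a minimal sketch, not the authors' implementation: each data source keeps its own token embeddings and language-modeling head over its own vocabulary, while a single transformer body is shared (and is the only part whose parameters would be aggregated across sources). All class names, sizes, and parameters below are illustrative assumptions.

\begin{verbatim}
# Hypothetical sketch of decoupled embeddings with a shared transformer body.
import torch
import torch.nn as nn

class SharedBody(nn.Module):
    """Transformer body shared across all data sources."""
    def __init__(self, d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, hidden):
        return self.encoder(hidden)

class SourceSpecificLM(nn.Module):
    """Per-source embeddings and LM head around the shared body, so each
    source can use its own (differently sized) vocabulary."""
    def __init__(self, body, vocab_size, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # local to the source
        self.body = body                                # shared parameters
        self.lm_head = nn.Linear(d_model, vocab_size)   # local to the source

    def forward(self, token_ids):
        return self.lm_head(self.body(self.embed(token_ids)))

# Two sources with different vocabularies share one body; only the body's
# gradients/weights would need to be communicated across sources.
body = SharedBody()
source_a = SourceSpecificLM(body, vocab_size=32000)
source_b = SourceSpecificLM(body, vocab_size=50000)
logits_a = source_a(torch.randint(0, 32000, (2, 16)))
\end{verbatim}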