As generative models gain attention, it is crucial to adapt them efficiently even with limited high-quality data and computational resources. In this work, we investigate parameter-efficient fine-tuning (PEFT) for low-resource text-to-speech, transferring pre-trained knowledge to a new language using only a single-speaker dataset and a single NVIDIA TITAN RTX GPU. We propose three types of adapters: the Conditioning Adapter, which enhances text embeddings; the Prompt Adapter, which refines input representations; and the DiT LoRA Adapter, which enables efficient speech generation. We further explore the optimal adapter configuration for single-speaker and multi-speaker scenarios. Consequently, under resource constraints, we achieve effective adaptation to a new language while updating only 1.72% of the total parameters. Audio samples, source code, and checkpoints will be made available.
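To make the parameter-efficiency claim concrete, the following is a minimal, dependency-free sketch of a standard LoRA low-rank update, the general mechanism behind a "DiT LoRA Adapter" style module. It assumes nothing about the paper's actual architecture: a frozen weight `W` is augmented by a trainable product `B @ A` of rank `r`, scaled by `alpha / r`, so only `A` and `B` are updated during fine-tuning. All dimensions and names here are illustrative.

```python
# Minimal LoRA-style update (illustrative sketch, not the paper's implementation).
# Frozen weight W: (d_out x d_in); trainable A: (r x d_in), B: (d_out x r), r << d.
import random

def matmul(M, N):
    """Multiply two matrices represented as lists of rows."""
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def lora_forward(W, A, B, x, alpha=1.0):
    """y = W x + (alpha / r) * B (A x); x is a column vector."""
    r = len(A)
    base = matmul(W, x)
    delta = matmul(B, matmul(A, x))
    return [[base[i][0] + (alpha / r) * delta[i][0]] for i in range(len(W))]

d_out, d_in, r = 8, 8, 2
random.seed(0)
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.1) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # B initialized to zero: adapter is a no-op at init
x = [[1.0] for _ in range(d_in)]

# At initialization the adapted output equals the frozen model's output.
y0 = matmul(W, x)
y1 = lora_forward(W, A, B, x)
assert all(abs(y0[i][0] - y1[i][0]) < 1e-12 for i in range(d_out))

full_params = d_out * d_in          # trainable params if W itself were tuned
lora_params = r * d_in + d_out * r  # trainable params with the low-rank adapter
print(full_params, lora_params)     # → 64 32
```

With these toy dimensions rank 2 already halves the trainable parameter count; at the hidden sizes of a real diffusion transformer, where `r` is far smaller than the layer width, the ratio shrinks to the low single-digit percentages the abstract reports.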