The rapid development of speech synthesis algorithms makes it challenging to construct up-to-date training datasets for speech anti-spoofing systems in real-world scenarios. The copy-synthesis method offers a simple yet effective solution to this problem. However, this method exploits only the artifacts generated by vocoders, neglecting those introduced by acoustic models. This paper aims to locate the artifacts introduced by the acoustic models of Text-to-Speech (TTS) and Voice Conversion (VC) algorithms, and to optimize the copy-synthesis pipeline accordingly. The proposed rhythm and speaker perturbation modules enable anti-spoofing models to leverage the artifacts introduced by acoustic models, thereby enhancing their generalization ability when facing various TTS and VC algorithms.
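To make the copy-synthesis idea concrete, the following is a minimal sketch, not the paper's exact implementation: a bona fide utterance is analyzed into a mel-spectrogram, perturbed along the time axis to mimic rhythm artifacts of an acoustic model's duration prediction, and resynthesized into a waveform. It assumes librosa; Griffin-Lim inversion stands in for the neural vocoder a real pipeline would use, and the random time-stretch is an illustrative stand-in for the proposed rhythm perturbation module.

```python
# Sketch of copy-synthesis with a rhythm-perturbation step (assumptions:
# librosa available; Griffin-Lim replaces a neural vocoder for simplicity).
import numpy as np
import librosa


def copy_synthesize(wav_path, stretch_range=(0.9, 1.1), sr=16000):
    """Create a pseudo-spoofed training sample from bona fide speech.

    1. Analysis: extract a mel-spectrogram from the real utterance.
    2. Perturbation: randomly stretch the time axis to mimic rhythm
       artifacts of a TTS/VC acoustic model (illustrative assumption).
    3. Synthesis: invert back to a waveform, imprinting vocoder artifacts.
    """
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

    # Rhythm perturbation: resample frames by a random stretch factor.
    rate = np.random.uniform(*stretch_range)
    n_frames = max(1, int(mel.shape[1] * rate))
    idx = np.linspace(0, mel.shape[1] - 1, n_frames)
    mel_stretched = np.stack(
        [np.interp(idx, np.arange(mel.shape[1]), band) for band in mel])

    # Resynthesis: Griffin-Lim inversion of the perturbed mel-spectrogram.
    return librosa.feature.inverse.mel_to_audio(
        mel_stretched, sr=sr, n_fft=1024, hop_length=256)
```

The resulting waveform carries both vocoder artifacts (from resynthesis) and acoustic-model-like artifacts (from the perturbation), which is the property the optimized pipeline relies on when training anti-spoofing models.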