Advancements in diarization have prompted the development of supervised learning models that extract fixed-length embeddings from audio recordings of varying lengths. Toolkits and commercial APIs such as Speechbrain, Resemblyzer, Whisper AI, and Pyannote address this problem, but they typically rely on Mel-Frequency Cepstral Coefficient (MFCC) features, convolutional layers, and dimension-reduction techniques to produce embeddings. Our proposed method instead uses the Wavelet Scattering Transform (WST), which prioritizes information content and lets users tailor the shape of the embeddings to their model's requirements. Coupling WST with autoencoders (WST-AE) in a residual manner yields richer semantic latent-space representations, which can be clustered segment-wise in an unsupervised fashion. Experiments on the AMI and VoxConverse datasets show a reduction in Diarization Error Rate (DER) with fewer training parameters and without the need for a separate embedding model.
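To make the pipeline described above concrete, the following is a minimal sketch, not the paper's exact architecture: it assumes kymatio for the scattering transform, PyTorch for the autoencoder, and scikit-learn for the segment-wise clustering. The names SEGMENT_LEN, LATENT_DIM, ResidualAE, embed_segments, and diarize, as well as the specific residual wiring and clustering choice, are illustrative assumptions rather than details taken from the paper.

```python
# Illustrative WST-AE sketch: scattering features -> residual autoencoder ->
# latent embeddings -> unsupervised segment-wise clustering.
import numpy as np
import torch
import torch.nn as nn
from kymatio.torch import Scattering1D          # assumed WST implementation
from sklearn.cluster import AgglomerativeClustering

SEGMENT_LEN = 16000   # 1 s of 16 kHz audio per segment (assumed)
LATENT_DIM = 64       # user-chosen embedding size

# 1) Wavelet Scattering Transform: fixed-length, deformation-stable features.
scattering = Scattering1D(J=6, shape=SEGMENT_LEN, Q=8)

class ResidualAE(nn.Module):
    """Autoencoder over scattering features; a skip (residual) path around the
    encoder-decoder lets the latent code keep the speaker-discriminative part."""
    def __init__(self, in_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, in_dim),
        )
        self.skip = nn.Linear(in_dim, in_dim)   # residual path

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z) + self.skip(x), z

def embed_segments(segments: np.ndarray) -> np.ndarray:
    """segments: (n_segments, SEGMENT_LEN) float32 waveform chunks."""
    x = torch.from_numpy(segments)
    feats = scattering(x).flatten(start_dim=1)   # (n_segments, feat_dim)
    model = ResidualAE(feats.shape[1], LATENT_DIM)
    # ... train the autoencoder with a reconstruction loss (omitted) ...
    with torch.no_grad():
        _, z = model(feats)                      # latent speaker embeddings
    return z.numpy()

def diarize(segments: np.ndarray, n_speakers: int) -> np.ndarray:
    """Assign a speaker label to each segment by clustering the embeddings."""
    z = embed_segments(segments.astype(np.float32))
    return AgglomerativeClustering(n_clusters=n_speakers).fit_predict(z)
```

Because the embedding size falls out of the autoencoder's latent dimension (LATENT_DIM here) rather than a fixed pretrained model, the embedding shape can be adjusted to downstream requirements, which is the customizability the abstract refers to.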