Pitch estimation is of fundamental importance in audio processing and music information retrieval. YOLO is a well-developed model designed for object detection in images. Here we introduce YOLOv7 into the pitch estimation task and improve it by adding a time-frequency (TF) dual-branch structure to the model, motivated by human auditory pitch perception. An additional advantage of the model over state-of-the-art (SOTA) models is that it achieves joint pitch estimation and voicing determination simply by adding an unvoiced class, without a separate voiced/unvoiced detection stage. Experiments show that, for both music and speech, the proposed TF dual-branch structure improves pitch estimation accuracy over the backbone. Our model exhibits superior pitch estimation performance over the SOTA models and shows minimal performance degradation in noisy conditions. The overall accuracy on the MDB-stem-synth dataset peaks at 99.4%, and the voicing determination F-score reaches 99.9%.
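To illustrate how a single extra class can yield both pitch and voicing decisions, the following is a minimal sketch and not the implementation used in this work. It assumes a hypothetical output layout: each frame carries `N_BINS` pitch-class scores followed by one unvoiced score. The bin count, bin spacing, lowest frequency, and the function names `decode_frame` and `decode` are all illustrative assumptions.

```python
# Hypothetical class layout (not taken from the paper): N_BINS pitch classes
# spaced CENTS_PER_BIN apart starting at F_MIN_HZ, plus one "unvoiced" class.
N_BINS = 360
CENTS_PER_BIN = 20.0
F_MIN_HZ = 32.70
UNVOICED = N_BINS  # index of the extra unvoiced class


def decode_frame(scores):
    """Return (f0_hz, voiced) for one frame of N_BINS + 1 class scores."""
    if len(scores) != N_BINS + 1:
        raise ValueError("expected N_BINS pitch scores plus one unvoiced score")
    # The highest-scoring class decides pitch and voicing at the same time.
    best = max(range(len(scores)), key=scores.__getitem__)
    if best == UNVOICED:
        return 0.0, False  # unvoiced frames are reported as 0 Hz
    # Convert the winning bin index from cents above F_MIN_HZ to Hz.
    return F_MIN_HZ * 2.0 ** (best * CENTS_PER_BIN / 1200.0), True


def decode(frames):
    """Decode a sequence of per-frame score vectors."""
    return [decode_frame(frame) for frame in frames]
```

Under this assumed layout, no separate voicing threshold or detector is needed: a frame is unvoiced exactly when the unvoiced class outscores every pitch class.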