Recent advances in Large Language Models (LLMs) have introduced a new architecture for Automatic Speech Recognition (ASR), in which an audio encoder is followed by a powerful LLM. Refining the audio encoder's embeddings so that they align better with textual embeddings can enhance the performance of LLM-based ASR. However, current LLM-based ASR research focuses mainly on aligning textual and audio features via paired audio-text data; the use of unpaired audio-text data for such alignment remains under-explored. This paper proposes a cross-modality pre-training method that uses readily available unpaired audio-text data to better align audio embeddings with the text modality. Experimental results show that an LLM-based ASR system using this text-enhanced audio encoder significantly outperforms one using an audio encoder pre-trained on audio data alone. The method also has strong potential for further improvement, given the abundance of easily accessible unpaired audio-text data.
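As a rough illustration of the architecture described above, the following is a minimal sketch, not the paper's implementation, of how audio-encoder outputs might be projected into an LLM's embedding space and aligned with text embeddings. The linear projector, the embedding dimensions, the cosine-similarity objective, and the frame-to-token pairing in the toy batch are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioToLLMProjector(nn.Module):
    """Sketch: map audio-encoder frames into an LLM's embedding space.

    The single linear layer and the dimensions (512 -> 4096) are
    illustrative assumptions, not the paper's actual design.
    """

    def __init__(self, audio_dim: int = 512, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_embeddings: torch.Tensor) -> torch.Tensor:
        # audio_embeddings: (batch, frames, audio_dim)
        return self.proj(audio_embeddings)


def alignment_loss(audio_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """One plausible alignment objective: push projected audio embeddings
    toward text embeddings by maximizing their cosine similarity."""
    return 1.0 - F.cosine_similarity(audio_emb, text_emb, dim=-1).mean()


if __name__ == "__main__":
    projector = AudioToLLMProjector()
    # Toy tensors standing in for encoder outputs and text embeddings;
    # a frame-aligned pairing is assumed here purely to exercise the shapes.
    audio = torch.randn(2, 100, 512)
    text = torch.randn(2, 100, 4096)
    loss = alignment_loss(projector(audio), text)
    print(f"alignment loss: {loss.item():.4f}")
```

In practice, an objective like this could in principle be driven by unpaired audio and text, for example via embeddings produced from separately sourced corpora, which is the gap the paper's cross-modality pre-training method targets.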