Ultrasound Report Generation with Multimodal Large Language Models for Standardized Texts

Authors: Peixuan Ge, Tongkun Su, Faqin Lv, Baoliang Zhao, Peng Zhang, Chi Hong Wong, Liang Yao, Yu Sun, Zenan Wang, Pak Kin Wong, Ying Hu

Ultrasound (US) report generation is a challenging task due to the variability of US images, operator dependence, and the need for standardized text. Unlike X-ray and CT, US imaging lacks consistent datasets, making automation difficult. In this study, we propose a unified framework for multi-organ and multilingual US report generation, integrating fragment-based multilingual training and leveraging the standardized nature of US reports. By aligning modular text fragments with diverse imaging data and curating a bilingual English-Chinese dataset, the method achieves consistent and clinically accurate text generation across organ sites and languages. Fine-tuning with selective unfreezing of the vision transformer (ViT) further improves text-image alignment. Compared to the previous state-of-the-art KMVE method, our approach achieves relative gains of about 2% in BLEU scores, approximately 3% in ROUGE-L, and about 15% in CIDEr, while significantly reducing errors such as missing or incorrect content. By unifying multi-organ and multi-language report generation into a single, scalable framework, this work demonstrates strong potential for real-world clinical workflows.

Subjects: Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition

Published: 2025-05-13 08:27:01 UTC

2505.08838
