From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes

Authors: Karen Zhou, John Giorgi, Pranav Mani, Peng Xu, Davis Liang, Chenhao Tan

AI-generated clinical notes are increasingly used in healthcare, but evaluating their quality remains a challenge due to high subjectivity and limited scalability of expert review. Existing automated metrics often fail to align with real-world physician preferences. To address this, we propose a pipeline that systematically distills real user feedback into structured checklists for note evaluation. These checklists are designed to be interpretable, grounded in human feedback, and enforceable by LLM-based evaluators. Using deidentified data from over 21,000 clinical encounters (prepared in accordance with the HIPAA safe harbor standard) from a deployed AI medical scribe system, we show that our feedback-derived checklist outperforms a baseline approach in our offline evaluations in coverage, diversity, and predictive power for human ratings. Extensive experiments confirm the checklist's robustness to quality-degrading perturbations, significant alignment with clinician preferences, and practical value as an evaluation methodology. In offline research settings, our checklist offers a practical tool for flagging notes that may fall short of our defined quality standards.
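As a rough illustration of what "enforceable by LLM-based evaluators" could look like in practice, the sketch below scores a note as the fraction of checklist items it passes. This is not the authors' implementation: the `ChecklistItem` structure, the `score_note` function, the pass-fraction scoring rule, the example checklist questions, and the sample note are all assumptions made for illustration. The `keyword_judge` function is a toy stand-in for an LLM evaluator so that the example runs without any model access.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ChecklistItem:
    """One yes/no quality question (hypothetical structure, not from the paper)."""
    item_id: str
    question: str


# A judge answers one checklist question about one note; True means the note passes.
# In a real setup this would wrap an LLM call; here it is left abstract.
Judge = Callable[[str, ChecklistItem], bool]


def score_note(note: str, checklist: List[ChecklistItem], judge: Judge) -> float:
    """Return the fraction of checklist items the note passes (assumed scoring rule)."""
    if not checklist:
        raise ValueError("checklist must contain at least one item")
    passed = sum(1 for item in checklist if judge(note, item))
    return passed / len(checklist)


def keyword_judge(note: str, item: ChecklistItem) -> bool:
    """Toy stand-in for an LLM evaluator: pass if the item id appears in the note."""
    return item.item_id.lower() in note.lower()


# Invented example data, for illustration only.
checklist = [
    ChecklistItem("medications", "Does the note list current medications?"),
    ChecklistItem("allergies", "Does the note document allergies?"),
]
note = "Medications: lisinopril 10 mg daily. Follow up in two weeks."

score = score_note(note, checklist, keyword_judge)
print(f"checklist score: {score:.2f}")
```

Keeping the judge as a plain callable means the same scoring loop can be exercised with a deterministic stub in tests and with a model-backed evaluator elsewhere. A note whose score falls below some chosen threshold could then be flagged for review, in the spirit of the offline use described in the abstract.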

Subjects: Computation and Language, Artificial Intelligence

Published: 2025-07-23 17:28:31 UTC

arXiv: 2507.17717
