Total: 1
Large language models (LLMs) are being piloted for clinical coding and decision support, yet no open benchmark targets the hospital-funding layer where Diagnosis-Related Groups (DRGs) determine reimbursement. In most OECD systems, DRGs route a substantial share of multi-trillion-dollar health spending through governed grouper software, making transparency and auditability first-order concerns. We release NordDRG-AI-Benchmark, the first public, rule-complete test bed for DRG reasoning. The package includes (i) machine-readable approximately 20-sheet NordDRG definition tables and (ii) expert manuals and change-log templates that capture governance workflows. It exposes two suites: a 13-task Logic benchmark (code lookup, cross-table inference, grouping features, multilingual terminology, and CC/MCC validity checks) and a 13-task Grouper benchmark that requires full DRG grouper emulation with strict exact-match scoring on both the DRG and the triggering drg_logic.id. Lightweight reference agents (LogicAgent, GrouperAgent) enable artefact-only evaluation. Under an artefact-only (no web) setting, on the 13 Logic tasks GPT-5 Thinking and Opus 4.1 score 13/13, o3 scores 12/13; mid-tier models (GPT-5 Thinking Mini, o4-mini, GPT-5 Fast) achieve 6-8/13, and remaining models score 5/13 or below. On full grouper emulation across 13 tasks, GPT-5 Thinking solves 7/13, o3 6/13, o4-mini 3/13; GPT-5 Thinking Mini solves 1/13, and all other tested endpoints score 0/13. To our knowledge, this is the first public report of an LLM partially emulating the complete NordDRG grouper logic with governance-grade traceability. Coupling a rule-complete release with exact-match tasks and open scoring provides a reproducible yardstick for head-to-head and longitudinal evaluation in hospital funding. Benchmark materials available in Github.