
Assess whether frontier large language models match clinicians' reasoning by examining a workflow-focused benchmark and its implications for clinical use.
Key Takeaways
- Twenty-one frontier LLMs were evaluated on twenty-nine MSD Manual clinical vignettes
- Each vignette preserved sequential clinical context and was run in triplicate for scoring
- PrIME-LLM scores are computed as a normalized polygonal area across five clinical reasoning domains
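
The "normalized polygonal area" in the last takeaway can be pictured as the area of a radar-chart polygon divided by the largest area that polygon could have. The summary does not give the exact formula, so the following is only a minimal sketch under stated assumptions: each domain score is already scaled to [0, 1], the five axes are equally spaced, and normalization divides by the area of the polygon with every score at 1. The function name `normalized_polygon_area` and the sample values are hypothetical, not taken from the benchmark.

```python
import math


def normalized_polygon_area(scores):
    """Area of a radar-chart polygon relative to its maximum possible area.

    Assumptions (not confirmed by the source): scores lie in [0, 1] and are
    plotted on equally spaced axes in the order given.
    """
    n = len(scores)
    if n < 3:
        raise ValueError("a polygon needs at least three axes")
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("scores must be scaled to [0, 1]")

    # Each pair of adjacent axes spans a triangle with area
    # 0.5 * r_i * r_next * sin(2*pi / n).
    wedge = 0.5 * math.sin(2 * math.pi / n)
    area = wedge * sum(scores[i] * scores[(i + 1) % n] for i in range(n))

    # The maximum area is reached when every score equals 1.
    max_area = wedge * n
    return area / max_area


# Hypothetical per-domain scores for one model, in a fixed axis order.
example_scores = [0.9, 0.8, 0.7, 0.85, 0.6]
print(round(normalized_polygon_area(example_scores), 3))
```

Under these assumptions the result depends on products of adjacent scores, so a weak domain lowers the area more than it would lower a plain average, and the value changes if the axes are reordered.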
