Evaluation of large language models in supporting autoimmune liver disease diagnosis and clinical decision-making: advantages of reasoning-based models

Tengyu Guo, Xiong Ma

International Journal of Clinical Pharmacy · 2026-03

Abstract

INTRODUCTION: Large language models (LLMs) have demonstrated increasing potential in healthcare, including applications in clinical information retrieval, medical reasoning, and decision support. Autoimmune liver diseases (AILDs), including autoimmune hepatitis, primary biliary cholangitis, primary sclerosing cholangitis, and IgG4-related liver disease, are rare and heterogeneous conditions that often present with nonspecific features and require specialized expertise for accurate diagnosis and management. In resource-limited settings, scarce access to hepatology subspecialists may contribute to delayed diagnoses and suboptimal care. Although reasoning-oriented LLMs are designed to support structured clinical inference, their performance on AILD-related tasks has not been systematically evaluated.

AIM: To evaluate and compare the accuracy, safety, readability, comprehensiveness, and diagnostic performance of six LLMs in addressing AILD-related clinical questions and real-world cases.

METHOD: We developed 26 clinically relevant questions spanning the key domains of AILDs, including pathogenesis, risk factors, clinical presentation, diagnosis, treatment, and prognosis. Six publicly available LLMs (o1-preview, Claude-3.5-Sonnet, GPT-4o, GPT-4o-mini, GPT-3.5-Turbo, and LLaMA-3.1-405B) generated responses that were independently evaluated by 11 board-certified hepatologists with more than 10 years of specialty experience. Accuracy, safety, readability, comprehensiveness, and clinical helpfulness were assessed using predefined rating scales. Diagnostic performance was further evaluated in three models (o1-preview, GPT-4o, and GPT-3.5-Turbo) using 21 confirmed real-world AILD cases, with accuracy determined by comparison with gold-standard clinical diagnoses.

RESULTS: Among the six models, o1-preview demonstrated the highest overall performance. It achieved the greatest proportion of accurate responses (78.3%), the highest mean safety score, and the most favorable readability profile, despite generating the longest responses. Comprehensiveness and helpfulness ratings were also highest for o1-preview. In diagnostic testing, o1-preview achieved an overall accuracy of 81.0%, outperforming GPT-4o and GPT-3.5-Turbo, particularly in single-entity AILD diagnoses. All models demonstrated reduced diagnostic accuracy for autoimmune hepatitis-primary biliary cholangitis overlap syndrome.

CONCLUSION: Reasoning-based LLMs, particularly o1-preview, demonstrate potential value in supporting clinical information delivery and diagnostic reasoning for AILDs, especially in resource-limited settings. However, LLM outputs should serve as adjunctive support rather than as substitutes for specialist judgment. Further domain-specific refinement and multicenter validation are necessary to ensure safe and effective integration into hepatology and clinical pharmacy practice.

MeSH terms

Medicine
Helpfulness
Hepatology
Readability
Medical diagnosis
Specialty
Autoimmune hepatitis
Liver disease
Disease
Primary sclerosing cholangitis
Internal medicine
Intensive care medicine
MEDLINE
Clinical diagnosis
