TB Research

Scenario-based evaluation of large language models for reference accuracy in dermatology: literature retrieval on latent tuberculosis in psoriasis patients on anti-IL-17/23 therapy

Altunisik N, Altunisik Toplu S, Turkmen D

Cutaneous and ocular toxicology · 2026-04

Abstract

Background: Large language models (LLMs) could accelerate clinical literature searches, but their reliability is compromised by "hallucinations" that produce false references. This study compared three general-purpose LLMs on reference accuracy, relevance, and hallucination rate using a standardized dermatology literature retrieval prompt.

Methods: A clinical scenario on latent tuberculosis management in psoriasis patients on IL-17/23 inhibitors was defined. To establish a reference standard, the references (n=74) of the two most recent and comprehensive systematic reviews on the topic were screened. These reviews were selected because they represented the most current and complete syntheses of evidence on this clinical question; using their reference lists ensured a focused, expert-validated foundation for evaluating LLM outputs. Screening yielded 16 studies directly addressing the scenario. Each LLM (ChatGPT, Gemini, DeepSeek-V3.2) was prompted to list 15 recent, specific references. The 45 retrieved references were manually validated and classified as "True and Relevant," "True but Irrelevant/General," or "False/Hallucination." The distributions across models were compared using Pearson's chi-square test.

Results: A significant difference was found between models (p

Conclusion: LLM performance varies considerably, with a high risk of hallucination. The findings highlight the need for caution and independent verification of LLM-generated references. Future research should test advanced query techniques and hybrid systems that integrate LLMs with academic databases.
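The comparison described in the Methods (three models, 15 references each, three validation categories, Pearson's chi-square test) can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the counts and the labels `model_a`, `model_b`, `model_c` are hypothetical placeholders and are not the study's data, and the helper names `pearson_chi_square` and `chi_square_sf_even_dof` are invented for this sketch. The p-value helper uses the closed-form survival function that holds only for even degrees of freedom, which covers the 3 x 3 case (4 degrees of freedom).

```python
from math import exp, factorial

CATEGORIES = (
    "True and Relevant",
    "True but Irrelevant/General",
    "False/Hallucination",
)

# HYPOTHETICAL counts for illustration only (NOT the study's results).
# Each row sums to 15, matching the 15 references requested per model.
counts = {
    "model_a": [10, 3, 2],
    "model_b": [6, 5, 4],
    "model_c": [2, 4, 9],
}


def pearson_chi_square(table):
    """Return (statistic, degrees of freedom) for an r x c contingency table.

    Assumes every row and column total is non-zero.
    """
    rows = [list(r) for r in table]
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            # Expected count under independence of model and category.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return statistic, dof


def chi_square_sf_even_dof(statistic, dof):
    """Upper-tail probability of a chi-square variable; even dof only."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("closed form used here requires a positive even dof")
    half = statistic / 2.0
    return exp(-half) * sum(half ** i / factorial(i) for i in range(dof // 2))


if __name__ == "__main__":
    stat, dof = pearson_chi_square(counts.values())
    p_value = chi_square_sf_even_dof(stat, dof)
    print(f"chi2 = {stat:.3f}, dof = {dof}, p = {p_value:.4f}")
```

In practice a statistics package would normally be used for this test; the hand-written version is shown only to make the calculation on the 3 x 3 table explicit.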