Study Reveals ChatGPT’s Limitations When Recommending Cancer Treatment

Patients today have access to vast amounts of medical information on the internet, enabling them to educate themselves about their health conditions.

With the rise of artificial intelligence (AI), tools like ChatGPT have become increasingly popular for providing medical recommendations.

Researchers at Brigham and Women’s Hospital conducted a study to evaluate how closely ChatGPT’s cancer treatment recommendations align with the National Comprehensive Cancer Network (NCCN) guidelines.


Unveiling ChatGPT’s Recommendations


Published in JAMA Oncology, the study revealed that ChatGPT 3.5, while often convincing, offered non-concordant recommendations in around one-third of cases. This finding underscores the limitations of AI technologies and the need for patients to engage with medical professionals in their decision-making process.

Danielle Bitterman, MD, corresponding author and part of the Artificial Intelligence in Medicine (AIM) Program, highlighted that AI lacks the nuanced understanding necessary for complex clinical decisions. Bitterman emphasised that medical advice should not be sought solely from AI resources, and stressed the importance of collaborating with healthcare providers.


AI’s Impact on Healthcare Landscape

The integration of AI into healthcare has been groundbreaking, potentially revolutionising care delivery, workforce support, and administrative processes. Mass General Brigham, as a leading academic health system and innovation hub, is at the forefront of rigorously researching AI’s responsible incorporation in various aspects of healthcare.



Aligning with NCCN Guidelines

The researchers chose to evaluate ChatGPT’s recommendations against NCCN guidelines, extensively employed by physicians nationwide. Focusing on three prevalent cancers—breast, prostate, and lung—the study prompted ChatGPT to suggest treatment approaches for each based on disease severity. Combining 26 distinct diagnosis descriptions with four slightly varied prompt templates yielded a total of 104 prompts.
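The prompt-construction scheme described above can be sketched in a few lines of Python. Note that the diagnosis descriptions and template wording below are hypothetical placeholders, not the study authors' actual prompts; only the counts (26 descriptions, 4 templates, 104 prompts) come from the study.

```python
# Illustrative sketch of the study's 26 x 4 prompt grid.
# The specific wording is assumed for illustration only.

# Placeholder stand-ins for the 26 distinct diagnosis descriptions.
diagnoses = [f"diagnosis description {i}" for i in range(1, 27)]

# Four slightly varied prompt templates (hypothetical phrasings).
templates = [
    "What is a treatment for {dx}?",
    "What treatment approach is recommended for {dx}?",
    "How should {dx} be treated?",
    "Recommend a treatment plan for {dx}.",
]

# Cross every diagnosis with every template: 26 x 4 = 104 prompts.
prompts = [t.format(dx=dx) for dx in diagnoses for t in templates]
print(len(prompts))  # 104
```

Each of the 104 prompts would then be submitted to the model, and the responses scored for concordance with NCCN guidelines.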

Although nearly all responses (98%) contained at least one treatment approach in line with NCCN guidelines, 34% also featured non-approved suggestions. These recommendations were often subtly woven into otherwise sound guidance, making them challenging to identify.


Non-Approved Recommendations Defined

Non-approved recommendations were defined as partially correct suggestions that didn’t fully align with NCCN guidelines. For instance, suggesting surgery alone for locally advanced breast cancer without mentioning complementary therapies constituted a non-approved recommendation. The study revealed that complete agreement occurred in only 62% of cases, highlighting both the complexity of NCCN guidelines and ChatGPT’s tendency toward vague responses.


Unveiling Hallucinations and Misinformation


In 12.5% of cases, ChatGPT produced “hallucinations,” offering treatment recommendations absent from NCCN guidelines. These ranged from novel therapies to curative treatments for non-curative cancers. The study authors emphasised that such misinformation could distort patient expectations and potentially strain the clinician-patient relationship.

The researchers are now delving into the capabilities of both patients and clinicians in distinguishing between advice from human clinicians and AI models like ChatGPT. Furthermore, they are subjecting ChatGPT to more intricate clinical scenarios to assess its clinical knowledge comprehensively.


Technical Details and Implications

The researchers employed GPT-3.5-turbo-0301, a substantial model available at the time of the study, and 2021 NCCN guidelines for their evaluation. They acknowledged that results might differ with other language models and clinical guidelines, while noting the fundamental similarities and limitations shared by many AI models in this context.

The study sheds light on ChatGPT’s performance in providing cancer treatment recommendations aligned with NCCN guidelines. While AI technologies can aid patient education, the findings underscore the risk of relying solely on AI for medical decisions. Human expertise remains irreplaceable in the intricate landscape of clinical decision-making.