Study Finds 52% of ChatGPT Answers Are Incorrect

Purdue University Uncovers ChatGPT’s Performance Gap in Software Programming Domain


A recent study carried out by researchers from Purdue University has shed light on the performance of OpenAI’s chatbot, ChatGPT, when it comes to answering software programming questions. The study delves into the accuracy, language style, and user preferences in over 500 responses generated by the AI model, providing insights into its strengths and shortcomings.


A Dismal 52% Accuracy


The Purdue study, which analysed 517 queries from the coding community platform Stack Overflow, unveiled a notable flaw in ChatGPT’s performance. Shockingly, the chatbot produced incorrect answers in more than half of the cases: a 52 percent error rate. Even more perplexing was the discovery that a staggering 77 percent of its responses were overly verbose, potentially contributing to user confusion.


Style Over Substance


Curiously, despite the inaccuracies plaguing ChatGPT’s responses, the study disclosed a peculiar trend. Users opted for the AI’s answers 39.34 percent of the time, drawn by its eloquent and comprehensive language style. Astonishingly, 77 percent of these preferred responses were incorrect. This phenomenon underscores the allure of the AI’s articulate manner, often overshadowing the factual correctness of the information provided.


Confidence Trumps Correctness


The researchers also observed a fascinating phenomenon: users often failed to spot errors in ChatGPT’s responses, especially when those errors weren’t easily verifiable or required consulting external references. Even in cases where errors were apparent, a significant number of participants still favoured the AI’s response because of its confident and authoritative delivery. This highlights the power of persuasive language in cultivating user trust, even in the face of inaccuracies.


Language Style Comparison


The Purdue study also extended its analysis to the language style employed by ChatGPT as compared to typical Stack Overflow posts. It found that the AI model’s answers more frequently used language expressing drive and accomplishment, while discussing risks less consistently than the community-driven platform did. This discrepancy underscores the need for a more balanced approach to information dissemination.


Recommendations for the Future


In light of the study’s findings, the researchers put forth a series of recommendations for enhancing the software programming Q&A landscape. Firstly, they propose that platforms like Stack Overflow should explore effective strategies for identifying toxic and negative sentiment in comments and answers, to foster a more positive user experience. Secondly, they advocate clearer guidelines encouraging answerers to structure their responses in a methodical, step-by-step manner.

This could enhance both the discoverability and the comprehensibility of answers.

Owen Morris, Director of Enterprise Architecture at Doherty Associates, commented, “While AI offers numerous benefits, there are certain disadvantages that users should be aware of. One risk is the careless use of AI, relying on it without thorough evaluation or critical analysis. As new research has found, ChatGPT is incorrect 52% of the time, with it being more likely to make conceptual rather than factual errors.

“Tools like ChatGPT offer insights based on the data on which they’re trained (including crawls of the Internet and other sources), and will retain their biases, so human involvement remains essential for accuracy and value addition. It’s important to remember to make use of your team so that they can contribute their own domain-specific knowledge and data to enhance the models’ applicability. Despite fears that these models will eventually replace human workers, the research shows that this is unlikely to materialise. Without human oversight to contextualise the responses and critically evaluate their accuracy, there’s a considerable risk that you’ll incorporate incorrect or harmful information into your work, jeopardising its quality and, more widely, your professional reputation.”

The Road Ahead


The study, released as a pre-print paper, is a first step towards understanding ChatGPT’s performance in a specific domain, and the researchers are keen to see its findings validated through larger-scale studies. As of now, OpenAI has yet to comment on the Purdue results. As AI continues to evolve, the insights gained from research like this could pave the way for improvements that align more closely with user needs and expectations.