OpenAI has built a public benchmark called HealthBench to gauge how well chatbots handle health questions.
According to the project paper, the team worked with 262 doctors from 60 countries and logged 5,000 realistic conversations, ranging from chest-pain scares to routine check-ups. Each exchange ends with a final user question that a model must answer.
The benchmark then scores that reply against a checklist written by physicians for that specific scenario. The checklists carry 48,562 individual criteria in total, each worth up to 10 points, or a penalty if the answer sends the user the wrong way. A grader based on GPT-4.1 applies the checklist, and the score for every conversation sits between 0 and 1.
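As a rough illustration of that mechanism, the minimal Python sketch below scores a reply against a rubric in the way the article describes; the criterion names, point values and exact normalisation are assumptions for illustration, not OpenAI’s published implementation.

```python
def score_conversation(criteria, met):
    """criteria: list of (criterion_id, points) pairs; met: set of the
    criterion_ids the grader judged the reply to satisfy."""
    earned = sum(points for cid, points in criteria if cid in met)
    max_positive = sum(points for _, points in criteria if points > 0)
    # Clip so penalties cannot push the score below 0 or above 1.
    return max(0.0, min(1.0, earned / max_positive))

# Hypothetical rubric: two positive criteria and one penalty criterion.
rubric = [
    ("advises_urgent_assessment", 10),
    ("mentions_red_flag_symptoms", 7),
    ("gives_unsafe_dosage_advice", -8),   # penalty if triggered
]
print(score_conversation(rubric, met={"advises_urgent_assessment"}))  # 10/17 ≈ 0.59
```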
OpenAI’s researchers say the chats come in 49 languages and cover 26 medical specialties. They also note that no single conversation is published in plain text online to stop models memorising the answers.
How Does HealthBench Mark Replies?
A model’s answer first meets the accuracy yardstick: are the facts right, and is any uncertainty acknowledged? Next comes completeness: did the reply mention every piece of advice a doctor listed as essential? The grader then checks clarity, context awareness and instruction following.
To keep the tests realistic, many conversations drip-feed detail across different turns, switch from technical jargon to plain speech or swap languages midway. An example on OpenAI’s blog shows a neighbour found breathing but unresponsive on the floor. The model must tell the caller to dial the emergency number, open the airway, place the person in the recovery position and stay ready to start CPR. The grader lists what the model covered, what it missed and hands out an overall mark. In that example the reply earned 77% of the available points.
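Applying the same idea to that emergency scenario, a hypothetical rubric and grader verdict could roll up to a mark close to the 77% quoted above; every criterion, point value and verdict below is invented for illustration.

```python
# Hypothetical rubric for the unresponsive-neighbour scenario, tagged with
# the grading axis each criterion belongs to. Points and verdicts
# (True = the reply covered it) are invented for illustration only.
graded = [
    ("completeness", "tell the caller to dial the emergency number", 10, True),
    ("completeness", "open the airway",                               8, True),
    ("completeness", "place the person in the recovery position",     8, True),
    ("completeness", "stay ready to start CPR",                       6, False),
    ("accuracy",     "no unsafe or incorrect first-aid advice",       8, True),
    ("clarity",      "plain, step-by-step language",                  4, False),
]

earned = sum(points for _, _, points, met in graded if met)
maximum = sum(points for _, _, points, _ in graded)
print(f"overall: {earned / maximum:.0%}")   # 34/44 -> 77%

# Per-axis breakdown, mirroring how the grader lists covered and missed points.
for axis in ("accuracy", "completeness", "clarity"):
    got = sum(p for a, _, p, met in graded if a == axis and met)
    top = sum(p for a, _, p, _ in graded if a == axis)
    print(f"{axis}: {got}/{top} points")
```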
OpenAI also released two harder spin-offs. One keeps only the criteria that many doctors endorsed as safety-critical. The other gathers 1,000 cases where cutting-edge models still stumble, such as global-health questions that need local drug names or instructions that must fit limited resources.
Which Models Work Here?
When OpenAI ran the benchmark on leading systems, its new reasoning model o3 took first place with a composite score of 60%. Grok came next on 54%, and Google’s Gemini 2.5 Pro followed on 52%. Older models fell well behind those marks.
The study shows that o3 excels at spotting emergencies, tailoring language to lay readers and asking the user for missing detail. Grok edges ahead on some checks that need more context, while Gemini delivers solid accuracy but loses marks for incomplete replies in complex scenarios.
Smaller models also improved: GPT-4.1 nano, a cut-down version aimed at cheap cloud hosting, beat last year’s flagship GPT-4o while costing roughly 25 times less per query.
The researchers argue that price drops like this could open safe chatbot help to clinics that run on tight budgets.
When HealthBench ran each model 16 times on the same question, the worst individual score could come in a third lower than the average. The team plotted those “worst-of-n” curves to remind developers that one rogue answer can undo many good ones in a clinical setting.
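As a rough sketch of how such a worst-of-n curve could be computed from repeated runs, the Python below estimates, for each n, the expected worst score when n runs are drawn from the repeats; the sampling approach and toy data are assumptions, not OpenAI’s published method.

```python
import random
from statistics import mean

def worst_of_n_curve(scores_per_case, max_n=16, trials=1000, seed=0):
    """For each n, estimate the expected worst (minimum) score when a model
    is sampled n times on the same case, averaged over all cases.

    scores_per_case: list of lists; each inner list holds the scores one
    conversation received across repeated runs (here, 16 per case)."""
    rng = random.Random(seed)
    curve = []
    for n in range(1, max_n + 1):
        per_case = []
        for scores in scores_per_case:
            # Average the minimum of n randomly drawn runs over many trials.
            worst = mean(min(rng.sample(scores, n)) for _ in range(trials))
            per_case.append(worst)
        curve.append(mean(per_case))
    return curve

# Toy data: 3 conversations, 16 runs each, scores in the 0-1 range.
rng = random.Random(1)
toy = [[round(rng.uniform(0.4, 0.9), 2) for _ in range(16)] for _ in range(3)]
for n, score in enumerate(worst_of_n_curve(toy), start=1):
    print(f"worst-of-{n}: {score:.2f}")
```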
Why Is This Important For Everyday Health Care?
Hospitals, phone apps and telemedicine services are racing to add chat assistance, yet until now no shared yardstick showed whether a new bot was merely fluent or genuinely safe. HealthBench turns that judgement into a single score that any lab can reproduce, according to the OpenAI announcement.
Because every test lists the exact checkpoints missed, engineers can patch gaps instead of guessing. A system that repeatedly skips allergy checks, for example, can be retrained on that weakness before it reaches patients.
The doctors behind the benchmark stress that even o3 should not replace a clinician.
They frame HealthBench as a map, showing where language models already help with triage, taking notes or basic education, and where human judgement still rules. In the long run, they hope public scores will keep hype in check and push the next wave of medical chatbots toward safer, more precise advice.