Inside Health AI: What Training Data Can Miss

Health AI is one of the fastest-evolving technologies: algorithms can now analyse mammograms, predict sepsis and even expedite drug discovery. Given this potential, large tech and healthcare companies are trying to establish pathways for integrating health AI into medicine. These systems will only be as good as the data they are trained on, and training AI systems in healthcare is extremely data intensive.

 

How Health AI is Actually Trained

 

For healthcare systems to use AI properly, it is essential that the models are trained on high-quality data sets. AI is trained on huge amounts of data in the form of medical images, electronic health records, lab results and much more. From these, it learns to identify patterns in diagnoses, treatments, outcomes and responses. The more relevant, high-quality and representative the data is, the more accurate and generalisable the resulting model becomes.

 

Importance of ‘High-Quality’ and ‘Representative’ Data

 

The most common source of data used to train these systems, de-identified electronic health records, is frequently inconsistent and incomplete. Health data often mirrors the inequities of real-world healthcare systems, which skews the final AI-driven results. When data is incomplete and does not reflect the population the system will serve, the system can generate inaccurate outputs, which affects the care and treatment of the individuals involved and has the potential to cause direct harm.
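
One simple way to check whether a data set is representative is to compare each group's share of the training records with its share of the population the tool is meant to serve. The sketch below is illustrative only: the function name, the field name "sex", the 5% tolerance and all of the figures are assumptions made up for this example, not values taken from any real data set.

```python
from collections import Counter


def representation_gaps(records, reference_shares, key="sex", tolerance=0.05):
    """Compare group shares in the records with reference population shares.

    Returns {group: dataset_share - reference_share} for every group whose
    absolute gap is larger than `tolerance`. Records missing the field are
    counted under "unknown", so incomplete data stays visible.
    """
    counts = Counter(record.get(key, "unknown") for record in records)
    total = sum(counts.values())
    gaps = {}
    for group, reference_share in reference_shares.items():
        dataset_share = counts.get(group, 0) / total if total else 0.0
        gap = dataset_share - reference_share
        if abs(gap) > tolerance:
            gaps[group] = round(gap, 3)
    return gaps


# Made-up toy data, for illustration only: 70 male and 30 female records.
toy_records = [{"sex": "male"}] * 70 + [{"sex": "female"}] * 30
# Hypothetical population shares, also made up.
toy_reference = {"male": 0.49, "female": 0.51}

print(representation_gaps(toy_records, toy_reference))
```

A positive gap means a group is over-represented in the training data and a negative gap means it is under-represented. A real audit would look at many more attributes, such as age and ethnicity, and at combinations of them.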

 

 

Who Are The Big-Tech Players Training Health AI?

 

NVIDIA is establishing itself as a strong player in the health AI infrastructure sector with its hardware and Clara, its health imaging and genomics framework. At the J.P. Morgan Healthcare Conference in January 2025, it announced partnerships with Mayo Clinic, Illumina and IQVIA on new approaches to drug discovery, digital pathology and genomic analysis.

 

Microsoft and Epic Electronic Health Systems Partnership

 

Microsoft Azure AI for Health partnered with Epic Systems to embed AI copilots in electronic health records, putting predictive care tools directly in front of clinicians. Microsoft had earlier acquired Nuance to bring clinical speech recognition and natural language processing into the Microsoft Cloud for Healthcare. Azure AI grew 34% in 2025, with healthcare among its fastest-growing sectors.

 

Google Health and DeepMind in Radiology

 

Google Health and DeepMind have reported some of the most compelling accuracy results in the space. Their published research suggests that their AI system outperformed human radiologists in breast cancer detection by 11.5%. Their work has also given the field of drug discovery access to AlphaFold's protein-structure predictions.

 

The Specialist Health AI Companies

 

Alongside the infrastructure giants, many specialist companies are developing health-focused applications. Aidoc, which has FDA-approved algorithms, is now deployed in 2,000 hospitals worldwide for real-time stroke detection and other emergency care uses.

 

Aidoc

 

In 2023, Aidoc invested in and unveiled its CARE (Clinical AI Reasoning Engine) foundation model. In 2025-2026, it also received FDA approval and Breakthrough Device Designation for technology built on this model, described as the first AI ‘foundation model’ in medical imaging. Unlike traditional models, it is flexible and can be adapted to many diagnostic tasks.

 

Tempus 

 

Tempus has one of the largest and most comprehensive molecular and clinical data libraries in the world, which supports precision cancer care for half of all oncologists in the US. The library houses tens of millions of research records along with billions of clinical notes.

 

Hippocratic AI

 

Hippocratic AI takes a different route: constructing a large language model for healthcare in partnership with clinicians from Stanford, Johns Hopkins and El Camino Health and supported by funding from NVIDIA’s venture arm, among others. Its core design principle is telling: the model must perform at the safety level of an average clinician before it’s trusted with any given task. That emphasis on validated safety, rather than raw capability, reflects a growing recognition across the sector that health AI carries risks that other AI applications simply don’t.

 

PathAI

 

PathAI trains its algorithms on an expansive data set backed by more than 15 million expert pathology annotations. It is one of the world’s leading digital pathology companies, using AI to improve diagnostic accuracy and aiming to advance drug development.

 

Women’s Health: The Largest Gap in Health Data

 

Women’s health is a recognised blind spot, and a source of discrimination, in health AI training data. The majority of participants in clinical trials have been males of a certain age and ethnicity. This pattern has been prevalent for decades and has shaped the medical information that AI learns from. If AI is trained on data sets that exclude women or other under-represented groups, the gap and bias will be reproduced every time the model is used.
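
A common way to make this kind of gap visible is to measure a model's performance separately for each group instead of reporting a single overall figure. The sketch below computes sensitivity (the share of true cases the model catches) per group. It is a minimal illustration under stated assumptions: the function name, the group labels and every number are hypothetical and made up for this example.

```python
from collections import defaultdict


def sensitivity_by_group(examples):
    """Return {group: sensitivity} from (group, has_condition, model_flagged) triples.

    Sensitivity is the share of people who truly have the condition that the
    model correctly flagged. Groups with no true cases are left out.
    """
    true_positives = defaultdict(int)
    actual_positives = defaultdict(int)
    for group, has_condition, model_flagged in examples:
        if has_condition:
            actual_positives[group] += 1
            if model_flagged:
                true_positives[group] += 1
    return {
        group: true_positives[group] / count
        for group, count in actual_positives.items()
    }


# Made-up toy predictions, for illustration only: every person here has the
# condition; the model catches 9 of 10 men but only 6 of 10 women.
toy_examples = (
    [("men", True, True)] * 9
    + [("men", True, False)] * 1
    + [("women", True, True)] * 6
    + [("women", True, False)] * 4
)

rates = sensitivity_by_group(toy_examples)
print(rates)
```

In this toy case the overall sensitivity is 75%, which hides the fact that the model misses four in ten women with the condition. Reporting results per group is what exposes the difference.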

 

How Is AI Being Trained to Combat These Issues?

 

Increasingly, businesses are intentionally building large, accurately annotated datasets rather than relying only on historical datasets that may be of low quality. The research community is arriving at a consensus that valid AI models require many different datasets. In addition, women’s health AI needs to include research on diseases and conditions that are relevant to both men and women, so that male health conditions do not become the default baseline.

 

Policy Attention for AI Training

 

Regulators and policymakers are beginning to engage with this area directly. In early 2025, the UK Minister of State for Women’s Health said that, without representative data, health-related AI tools would likely exacerbate health inequities rather than alleviate them. The expectation in the field is that public health bodies, medical institutions and technology companies will have to collaborate to establish standards for bias testing, transparency and validation before the large-scale deployment of health-related AI applications.