GPT, Gemini, and Claude surpass medical AI tools in physician tests

The news: General-purpose LLMs outperformed specialized clinical AI tools across three medical benchmarks, according to a study recently published in Nature Medicine.

  1. US medical licensing exam questions. Gemini led with 97.4% accuracy, and all three general-purpose models topped OpenEvidence (89.6%) and UpToDate (88.4%).
  2. Realistic healthcare scenarios. In open-ended medical conversations designed to assess how AI handles real-world patient and clinician interactions, GPT was the clear leader at 88.0%, followed by Gemini (79.3%) and Claude (77.0%). OpenEvidence (62.6%) and UpToDate (61.3%) lagged well behind.
  3. Real-world physician queries. The three general-purpose LLMs also outperformed the two healthcare AI tools across 100 physician questions spanning topics such as diagnosis, treatment, medications, guidelines, and clinical workflows. The models’ responses to these questions were graded on clinical correctness, completeness, safety/harm avoidance, and clarity.

For context, NYU Langone researchers compared OpenEvidence and UpToDate Expert AI with three publicly available frontier models: GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6.

Why it matters: Clinician adoption is accelerating across both general-purpose platforms and healthcare-specific tools, often for different tasks.

Physician AI use is climbing fast: 74% now use AI daily, up 10 percentage points year over year, and 40% use it multiple times a day, per a March 2026 Doximity report. About two-thirds of US physicians use OpenEvidence, and millions of clinicians worldwide rely on UpToDate. But if foundation models match or beat those tools on clinical tasks, that raises questions about the value of purpose-built clinical AI.

However, key caveats in the study should temper any rush to judgment. Frontier models train on massive public datasets that include medical exams, and HealthBench was created by OpenAI, so part of the performance edge could reflect familiarity with the questions rather than genuine clinical reasoning. By contrast, OpenEvidence and UpToDate were trained on evidence-based clinical content. In its response to the study, OpenEvidence labeled the data “contaminated,” in part because the real-world clinician queries benchmark dataset was not publicly released.

Implications for health systems and healthcare AI companies: Peer review doesn't shield a study from bias or methodological constraints, and the companies selling AI to hospitals have every reason to challenge findings that threaten the case for investments in medical AI.

One study won’t redirect the market. Health systems and tech players are already co-developing industry-specific tools, a signal that healthcare organizations value products built for clinical use. In other words, benchmark results are only part of the equation. Physician adoption depends as much on trust, workflow fit, and risk tolerance as on raw model performance. Expect clinicians to rely on specialized healthcare AI tools for higher-stakes clinical work while using frontier models for lower-risk activities, such as explaining concepts, conducting quick research, and handling administrative work.
