Microsoft has developed an AI-enabled diagnostic system, the Microsoft AI Diagnostic Orchestrator (MAI-DxO), that can accurately diagnose complex medical cases more than four times as often as human doctors, according to a recent experiment.
“When paired with OpenAI’s o3 model, MAI-DxO achieves 80% diagnostic accuracy – four times higher than the 20% average of generalist physicians. MAI-DxO also reduces diagnostic costs by 20% compared to physicians, and 70% compared to off-the-shelf o3,” the study authors wrote.
“When configured for maximum accuracy, MAI-DxO achieves 85.5% accuracy. These performance gains with MAI-DxO generalize across models from the OpenAI, Gemini, Claude, Grok, DeepSeek and Llama families.”
The Microsoft team tested MAI-DxO against 304 real-world case studies from the New England Journal of Medicine. In its maximum-accuracy configuration, the AI system not only correctly diagnosed 85.5% of cases but also used fewer resources to do so than the group of experienced physicians.
Researchers also evaluated 21 practicing physicians from the UK and U.S., each with five to 20 years of clinical experience. Given the same cases, the physicians achieved a mean accuracy of 20% across the cases they completed.
Researchers also stated that although medical specialists are experts in a specific area of the body or a particular type of disease, no doctor can be an expert in every complex medical case.
The Microsoft team stated that AI does not have that limitation and can draw knowledge across various medical fields simultaneously, going beyond what any single doctor can do.
“The MAI-Dx Orchestrator turns any language model into a virtual panel of clinicians: it can ask follow-up questions, order tests or deliver a diagnosis, then run a cost check and verify its own reasoning before deciding whether to proceed,” the authors wrote. “This kind of advanced thinking could change the way healthcare works.”
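The loop the authors describe, propose an action, check its cost, verify reasoning before committing to a diagnosis, can be illustrated with a toy sketch. This is not Microsoft's implementation; all class and function names here (`PanelOrchestrator`, `Action`, the `verify` callback) are hypothetical stand-ins for the orchestration pattern the quote outlines.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str       # "question", "test", or "diagnose"
    detail: str
    cost: float = 0.0

class PanelOrchestrator:
    """Toy sketch of a 'virtual panel' loop: propose an action,
    run a cost check against a budget, and only commit a diagnosis
    after a verification step. Names are illustrative, not MAI-DxO's API."""

    def __init__(self, budget: float):
        self.budget = budget
        self.spent = 0.0
        self.findings: list[str] = []

    def consider(self, action: Action) -> bool:
        # Cost check: reject any action that would exceed the budget.
        if self.spent + action.cost > self.budget:
            return False
        self.spent += action.cost
        if action.kind in ("question", "test"):
            self.findings.append(action.detail)
        return True

    def diagnose(self, hypothesis: str, verify):
        # Verify the reasoning against gathered findings before committing.
        if verify(hypothesis, self.findings):
            return hypothesis
        return None

# Example run: two cheap actions are accepted, an over-budget test is
# rejected, and the diagnosis is committed only after verification passes.
panel = PanelOrchestrator(budget=500.0)
panel.consider(Action("question", "duration of fever", cost=0.0))
panel.consider(Action("test", "complete blood count", cost=50.0))
rejected = panel.consider(Action("test", "whole-body MRI", cost=3000.0))
dx = panel.diagnose("viral infection", lambda h, findings: len(findings) >= 2)
```

The key design point the quote highlights is that cost checking and self-verification are separate gates: an action can be clinically reasonable yet rejected on cost, and a diagnosis is never emitted without the verification step.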
THE LARGER TREND
Microsoft’s researchers noted limitations in their experiment, including an unrealistic case mix, as the benchmark cases examined were derived from complex, teaching-focused cases in the NEJM and did not include healthy individuals or patients with mild conditions.
Researchers said it was unclear whether the AI would perform as well on everyday, routine cases or how often it would give false positives.
The test was also limited as it lacked real-world constraints, including factors such as patient discomfort, wait times, insurance restrictions, test availability and delays in receiving results.
Evaluation of the test costs was based on simplified U.S. averages and did not account for differences in costs among payers, providers, health systems or geography.
Lastly, the study compared Microsoft’s AI only to internal medicine and primary care physicians, not specialists. Additionally, the doctors who participated were barred from using internet resources, whereas in practice doctors often consult guidelines, colleagues and numerous other tools during diagnosis.
“While acknowledging these limitations, our results indicate possible accuracy gains, especially when considering clinicians working in remote and under-resourced settings, and also give us a picture of how LMs could augment medical expertise to improve health outcomes even in well-resourced settings,” the Microsoft team wrote.