AI interprets imaging data as well as physicians—but there’s a catch

AI models can interpret medical images with a diagnostic accuracy comparable to that of actual physicians, according to new findings published in The Lancet Digital Health. The researchers noted, however, that some AI studies were never externally validated and others included subpar reporting.

The study’s authors noted that there might be a strong need for AI technologies that can read medical images, but that doesn’t mean the science can be rushed—it’s still crucial that healthcare professionals take their time and get the research right.

“Reports of deep learning models matching or exceeding humans in diagnostic performance has generated considerable excitement, but this enthusiasm should not overrule the need for critical appraisal,” wrote Xiaoxuan Liu, MBChB, University Hospitals Birmingham NHS Foundation Trust, and colleagues. “Concerns raised in this field include whether some study designs are biased in favor of the new technology, whether the findings are generalizable, whether the study was performed in silico or in a clinical environment, and therefore to what degree the study results are applicable to the real-world setting.”

Liu et al. performed a meta-analysis of 82 studies focused on the diagnostic performance of various AI models. Accuracy could be calculated for 69 of those studies, and their reported sensitivity ranged from 9.7% to 100%. Specificity, meanwhile, ranged from 38.9% to 100%.

In 14 studies, the performance of deep learning models was compared with that of physicians. When restricting their analysis to each study’s contingency table, the authors found a pooled sensitivity of 87% for the deep learning models and 86.4% for physicians. The pooled specificity was 92.5% for deep learning models and 90.5% for physicians.
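For readers less familiar with these metrics, here is a minimal Python sketch of how sensitivity and specificity fall out of a 2x2 contingency table. The counts are hypothetical and chosen only for illustration, and the class and function names are made up for this example. The pooling shown simply sums counts across studies, which is a simplification and not necessarily the statistical method Liu and colleagues used.

```python
from dataclasses import dataclass


@dataclass
class ContingencyTable:
    """Counts from one study's 2x2 table (hypothetical structure)."""
    tp: int  # diseased cases correctly flagged as diseased
    fn: int  # diseased cases missed
    tn: int  # healthy cases correctly cleared
    fp: int  # healthy cases wrongly flagged


def sensitivity(t: ContingencyTable) -> float:
    """Share of diseased cases that were detected: TP / (TP + FN)."""
    return t.tp / (t.tp + t.fn)


def specificity(t: ContingencyTable) -> float:
    """Share of healthy cases that were cleared: TN / (TN + FP)."""
    return t.tn / (t.tn + t.fp)


def pool_by_summing(tables: list[ContingencyTable]) -> ContingencyTable:
    """Naive pooling: add up the counts across studies.

    Assumption: this ignores between-study variation, which a formal
    meta-analysis would normally model.
    """
    return ContingencyTable(
        tp=sum(t.tp for t in tables),
        fn=sum(t.fn for t in tables),
        tn=sum(t.tn for t in tables),
        fp=sum(t.fp for t in tables),
    )


# Hypothetical counts for two studies, not taken from the paper.
study_a = ContingencyTable(tp=45, fn=5, tn=90, fp=10)
study_b = ContingencyTable(tp=30, fn=10, tn=50, fp=10)
pooled = pool_by_summing([study_a, study_b])

print(f"Study A sensitivity: {sensitivity(study_a):.1%}")
print(f"Study A specificity: {specificity(study_a):.1%}")
print(f"Pooled sensitivity:  {sensitivity(pooled):.1%}")
print(f"Pooled specificity:  {specificity(pooled):.1%}")
```

High sensitivity means few diseased patients are missed, and high specificity means few healthy patients are wrongly flagged. Both numbers are needed to judge a diagnostic test.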

A key conclusion of the team’s study, however, is that many of the studies they explored contained significant limitations. The terminology was inconsistent from one study to the next, for instance, and most studies were not performed in a “real clinical environment.”

“Most studies were retrospective, in silico, and based on previously assembled datasets,” the authors wrote. “The ground truth labels were mostly derived from data collected for other purposes, such as in retrospectively collected routine clinical care notes or radiology or histology reports, and the criteria for the presence or absence of disease were often poorly defined. The reporting around handling of missing information in these datasets was also poor across all studies.”

External validation was also an issue, with many researchers failing to perform out-of-sample validation for both the algorithms themselves and the physicians being compared with those algorithms.

“Our finding when comparing performance on internal versus external validation was that, as expected, internal validation overestimates diagnostic accuracy in both health-care professionals and deep learning algorithms,” the authors wrote. “This finding highlights the need for out-of-sample external validation in all predictive models.”

Ultimately, however, Liu and colleagues said they “cautiously state that the accuracy of deep learning algorithms is equivalent to health-care professionals.” There is still a need, they added, for additional studies focused on the use of deep learning algorithms in real-world clinical settings.