Why AI models must always be tested on data from outside health systems

Artificial intelligence (AI) tools should always be tested across “a wide range of populations,” according to new research published in PLOS Medicine. The authors shared this warning after seeing some models perform worse when tested on data from an outside health system. 

“Our findings should give pause to those considering rapid deployment of artificial intelligence platforms without rigorously assessing their performance in real-world clinical settings reflective of where they are being deployed,” lead author Eric Oermann, MD, of the Icahn School of Medicine at Mount Sinai in New York City, said in a prepared statement. “Deep learning models trained to perform medical diagnosis can generalize well, but this cannot be taken for granted since patient populations and imaging techniques differ significantly across institutions.”

Oermann and colleagues examined the performance of convolutional neural networks (CNNs), developed at the Icahn School of Medicine to identify pneumonia on chest x-rays, at three different institutions: the National Institutes of Health, The Mount Sinai Hospital and Indiana University Hospital.

Overall, the researchers found, the CNNs' performance in detecting pneumonia on x-rays from hospitals outside their own network was “significantly lower” than on x-rays from the originating institution. In fact, in three out of five natural comparisons, the CNNs detected pneumonia better on x-rays from their own institution than on x-rays from other institutions.
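
The kind of internal-versus-external comparison described here can be sketched in a few lines of Python. This is an illustration only, not the researchers' code: the `auc` helper, the variable names and all of the scores and labels are hypothetical, and the metric shown (area under the ROC curve) is an assumption about how such a comparison might be scored.

```python
def auc(labels, scores):
    """Area under the ROC curve: the fraction of (positive, negative)
    pairs in which the positive case gets the higher score; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical model outputs (1 = pneumonia, 0 = no pneumonia).
# Internal test set: x-rays from the hospital system used for training.
internal_labels = [1, 1, 1, 0, 0, 0]
internal_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]

# External test set: x-rays from a different hospital system.
external_labels = [1, 1, 1, 0, 0, 0]
external_scores = [0.9, 0.5, 0.3, 0.6, 0.4, 0.2]

internal_auc = auc(internal_labels, internal_scores)
external_auc = auc(external_labels, external_scores)

print(f"internal AUC: {internal_auc:.2f}")
print(f"external AUC: {external_auc:.2f}")
```

In this made-up example the model separates every internal case perfectly but ranks only six of nine external pairs correctly, which is the sort of gap that reporting internal results alone would hide.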

“If CNN systems are to be used for medical diagnosis, they must be tailored to carefully consider clinical questions, tested for a variety of real-world scenarios, and carefully assessed to determine how they impact accurate diagnosis,” John Zech, a medical student at the Icahn School of Medicine at Mount Sinai, said in the same statement.

Additionally, the researchers found the CNNs were able to detect the hospital where an x-ray was acquired with “extremely high accuracy.”

“The performance of CNNs in diagnosing diseases on x-rays may reflect not only their ability to identify disease-specific imaging findings on x-rays but also their ability to exploit confounding information,” the researchers concluded. “Estimates of CNN performance based on test data from hospital systems used for model training may overstate their likely real-world performance.”
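
One way such confounding could inflate results can be sketched in Python. The sketch assumes, purely for illustration, that pneumonia is more common at one hospital than another. Under that assumption a "model" that only recognizes the hospital, and knows nothing about disease, still scores well on pooled test data. The sites, case counts, scores and the `auc` and `shortcut_score` helpers are all hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical pooled test set of (site, label) pairs.
# Assumption: site A has high pneumonia prevalence, site B has low prevalence.
cases = ([("A", 1)] * 4 + [("A", 0)] * 1 +
         [("B", 1)] * 1 + [("B", 0)] * 4)

def shortcut_score(site):
    """A 'model' that ignores the image findings and scores by hospital only."""
    return 0.8 if site == "A" else 0.2

labels = [y for _, y in cases]
scores = [shortcut_score(site) for site, _ in cases]
pooled_auc = auc(labels, scores)

# Within a single site the shortcut carries no information at all.
site_a_labels = [y for site, y in cases if site == "A"]
site_a_scores = [shortcut_score(site) for site, _ in cases if site == "A"]
within_a_auc = auc(site_a_labels, site_a_scores)

print(f"pooled AUC:        {pooled_auc:.2f}")
print(f"within-site A AUC: {within_a_auc:.2f}")
```

Here the pooled score looks respectable while the within-site score is no better than chance, so evaluating each site separately is one simple check for this kind of shortcut.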