AI trained on more than 1M medical images accurately detects breast cancer

Researchers have developed a new convolutional neural network (CNN) that can predict the presence of breast cancer with the accuracy of an experienced radiologist. The network, trained on more than 220,000 mammography exams comprising more than 1 million images, is detailed in a new paper published on Cornell University’s arXiv.org archive.

“The development of deep CNNs to aid in the evaluation of screening mammography would save significant health care costs,” wrote co-author Nan Wu, of New York University’s Center for Data Science, and colleagues. “In this work, we propose a novel neural network architecture and an appropriate two-stage training procedure to efficiently handle a large dataset of high-resolution breast mammograms with biopsy-proven labels.”

Building their dataset, Wu et al. indicated whether or not each image included a malignant or benign finding. For more than 5,000 of the mammograms, a biopsy had been performed within 120 days of the examination; the biopsies confirmed malignant findings in 8.4 percent of those mammograms, benign findings in 47.6 percent and both malignant and benign findings in 2 percent. For each breast included in the study, two binary labels were used: one noted the absence or presence of a malignant finding and the other noted the absence or presence of a benign finding.
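The two-label scheme described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' code: the `BreastLabels` class and the `labels_from_findings` helper are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreastLabels:
    """Two independent binary labels for one breast (hypothetical structure)."""
    malignant: int  # 1 if a malignant finding is present, else 0
    benign: int     # 1 if a benign finding is present, else 0


def labels_from_findings(findings):
    """Turn a list of finding types ('malignant' or 'benign') into two labels."""
    found = set(findings)
    unknown = found - {"malignant", "benign"}
    if unknown:
        raise ValueError(f"unexpected finding types: {sorted(unknown)}")
    return BreastLabels(
        malignant=int("malignant" in found),
        benign=int("benign" in found),
    )


# A breast can carry both labels, one of them, or neither.
both = labels_from_findings(["malignant", "benign"])
neither = labels_from_findings([])
```

Keeping the labels independent, rather than using a single three-way class, lets one breast be marked as having both a malignant and a benign finding, which the article notes occurred in the dataset.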

The team tested several models, comparing them by the area under the receiver operating characteristic curve (AUC) for malignant/not malignant classifications and benign/not benign classifications. The most successful model, using both images and heatmaps, had an AUC of 0.895 for malignant/not malignant classifications and an AUC of 0.756 for benign/not benign classifications.
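For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it that way for each of the two classification tasks. The `roc_auc` function and all labels and scores are made-up illustrations, not data or code from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: one score per breast for each task.
malignant_labels = [1, 0, 0, 1, 0, 0]
malignant_scores = [0.9, 0.2, 0.4, 0.35, 0.1, 0.3]
benign_labels = [0, 1, 1, 0, 0, 1]
benign_scores = [0.3, 0.8, 0.4, 0.5, 0.2, 0.6]

malignant_auc = roc_auc(malignant_labels, malignant_scores)
benign_auc = roc_auc(benign_labels, benign_scores)
print(f"malignant AUC: {malignant_auc:.3f}, benign AUC: {benign_auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the two tasks are scored on the same scale even though they use separate labels.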

The researchers also performed a reader study with 12 attending radiologists with various levels of experience, a resident and a medical student. Each participant read 740 mammograms from the test set, including 368 matched with a biopsy and 372 not matched with a biopsy. All mammograms were randomly selected.

The first 20 exams each participant interpreted were treated as “a practice set to familiarize readers with the format of the reader study,” and the team evaluated reading performance on the remaining 720 examinations. The individual readers had AUCs ranging from 0.705 to 0.860.

A “human-machine hybrid” combining the assessments of the CNN and the radiologists, the authors noted, scored an average AUC of 0.891.
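One simple way to build such a hybrid is to take a weighted average of a reader's probability estimate and the model's estimate for each case. The sketch below assumes equal weighting by default. The article does not specify how the assessments were combined, so the weighting, the `hybrid_scores` function and the example numbers are all assumptions made for illustration.

```python
def hybrid_scores(reader_scores, model_scores, reader_weight=0.5):
    """Blend per-case reader and model probabilities with a weighted average.

    reader_weight=0.5 (an assumed default) gives both sources equal say.
    """
    if len(reader_scores) != len(model_scores):
        raise ValueError("reader and model must score the same cases")
    if not 0.0 <= reader_weight <= 1.0:
        raise ValueError("reader_weight must be between 0 and 1")
    return [
        reader_weight * r + (1.0 - reader_weight) * m
        for r, m in zip(reader_scores, model_scores)
    ]


# Hypothetical malignancy probabilities for three cases.
reader = [0.2, 0.8, 0.5]
model = [0.4, 0.6, 0.1]
combined = hybrid_scores(reader, model)
```

Averaging can help when the two sources make different kinds of errors, which is consistent with the authors' remark that the model captured different aspects of the task than the radiologists did.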

“These results suggest our model can be used as a tool to assist radiologists in reading breast cancer screening exams and that it captured different aspects of the task compared to experienced breast radiologists,” the authors wrote.

Wu et al. observed that their AI model was “relatively simple” and “more sophisticated and accurate models are possible.” They added that predicting breast cancer before it is even visible in screening mammograms is a “clear next step” for the team’s research.