Can machine learning accurately interpret free-text CT exams?

Interpreting free-text radiology reports can be a challenge for machine learning, according to an article published in the Journal of the American College of Radiology. The authors suggested this is one reason for specialists to embrace structured report templates.

Researchers sought to quantify the variability of language in free-text reports for CT exams performed to rule out pulmonary embolus (PE). They were interested in whether the text of a report could indicate the presence or absence of a particular finding.

“Use of free text reporting was associated with extensive variability in report length and report terms used,” wrote lead author Rickhesvar Mahraj, MD, of Milton S. Hershey Medical Center in Hershey, Pennsylvania, and colleagues. “Such variability may impact the transfer of clinical information to end users.”

Taking more than 1,100 contrast-enhanced chest CT exams as the study sample, the researchers applied text-mining and predictive analytics software to analyze and describe the reports. Machine learning rules were generated that could potentially predict the “gold standard” radiological diagnosis of PE.

Of the studies reviewed, 92 percent were categorized as exhibiting no PE and approximately 8 percent as exhibiting PE. Reports with a PE diagnosis had a median findings plus impression length of 1,170 characters, while reports with no PE diagnosis had a median length of 1,544 characters.

“Using text-mining approaches, we found that radiologist report texts for chest CT examinations done to rule out PE exhibited substantial variation in length, types, and frequencies of report terms, with a long tail of words used in only a few reports,” the authors wrote. 

More than 2,200 unique words were generated by the radiologists. Embolism was the second most common word, and embolus was the 34th most common word. Additionally, diagnostic, quality and variably were the 632nd, 675th and 522nd most commonly used words, respectively.

“A small number of words were commonly used, but the overwhelming majority of words were used in only small numbers of reports,” the authors wrote. “Despite the clinical setting of ruling out PE, the most common term in the Findings section was not some form of embol-, and that stem was only the second most common word in the Impression section. Despite common technical issues with contrast administration that sometimes prevent a diagnosis, terms such as diagnostic and quality were not frequently used.”

The 20 most frequent terms were used in 66 percent of all reports, and the 100 most common words were used in 6 percent of all reports. There were 896 distinct words used in only one report each.
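The study used commercial text-mining software, but the kind of term-frequency profile described above can be illustrated with a minimal Python sketch. The sample report snippets below are invented for illustration and are not taken from the study; the sketch counts how often each word appears and separates the frequent "head" terms from the long tail of words used only once.

```python
from collections import Counter
import re

# Hypothetical report snippets, for illustration only -- not from the study.
reports = [
    "No evidence of pulmonary embolism. Lungs are clear.",
    "Filling defect in the right lower lobe consistent with acute pulmonary embolus.",
    "Study limited by contrast bolus timing; no central embolism identified.",
]

# Tokenize each report into lowercase words and tally term frequencies.
counts = Counter(
    word
    for text in reports
    for word in re.findall(r"[a-z]+", text.lower())
)

# Head of the distribution: the few terms that recur across reports.
print(counts.most_common(5))

# Long tail of the distribution: words appearing only once in the corpus.
singletons = [word for word, n in counts.items() if n == 1]
print(len(singletons), "words used only once")
```

Even on three toy reports, most words land in the singleton tail; the study found the same shape at scale, with 896 of more than 2,200 distinct words appearing in only one report.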

Combining the text from both the findings and impression sections, six machine learning rules were produced. They achieved a sensitivity of 100 percent, a positive predictive value of 73 percent, and a specificity of 97 percent. The negative predictive value was 100 percent, with no false negatives. The overall accuracy was 97 percent, with a misclassification rate of 3 percent.
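The reported rates all follow from a single confusion matrix. The cell counts below are an illustrative reconstruction assuming roughly 88 PE-positive and 1,012 PE-negative studies (8 percent of ~1,100); the authors did not publish these exact counts, but they reproduce the stated percentages.

```python
# Illustrative confusion-matrix cells, reverse-engineered from the reported
# rates; these exact counts are an assumption, not figures from the study.
tp, fn = 88, 0      # true positives, false negatives (sensitivity 100%)
fp, tn = 32, 980    # false positives, true negatives

sensitivity = tp / (tp + fn)                 # 1.00 -> "100 percent"
specificity = tn / (tn + fp)                 # ~0.97
ppv = tp / (tp + fp)                         # ~0.73
npv = tn / (tn + fn)                         # 1.00
accuracy = (tp + tn) / (tp + tn + fp + fn)   # ~0.97

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} "
      f"ppv={ppv:.2f} npv={npv:.2f} accuracy={accuracy:.2f}")
```

Note how a perfect sensitivity can coexist with a modest positive predictive value: because only about 8 percent of studies were positive, even a small number of false positives dilutes the PPV.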

“In the common clinical situation of ruling out PE from the emergency department, a very large number of words is used variably to describe the findings and impressions on chest CTs with contrast studies, and a machine learning proxy for human understanding was only imperfectly able to diagnose the presence and absence of PE,” the authors concluded. “The impact of this on consumers of our reports is not clear, but these results support the imperative of designing and implementing structured templates in specific clinical situations such as patients with clinical suspicion for PE.”