We present a large-scale, high-resolution labeled dataset of chest X-rays and their associated reports for the automated exploration of medical images. The dataset includes more than 160,000 images from 67,000 patients, interpreted and reported by radiologists at San Juan Hospital (Spain) between 2009 and 2017, covering six different position views and including additional information on image acquisition and patient demographics.
The reports were labeled with 174 different radiographic findings, 19 differential diagnoses, and 104 anatomical locations, organized as a hierarchical taxonomy mapped to the Unified Medical Language System (UMLS) standard terminology. Twenty-seven percent of the reports were manually annotated by trained physicians; the remainder were labeled with a supervised method based on a recurrent neural network with attention mechanisms. The generated labels were validated on an independent test set, achieving a Micro-F1 score of 0.93.
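As a rough illustration of this kind of pipeline (not the authors' exact architecture), the sketch below shows a bidirectional GRU with an additive attention layer producing multi-label predictions over report text, followed by a Micro-F1 check of the kind reported above. All module names, dimensions, thresholds, and the toy data are illustrative assumptions.

    # Minimal sketch, assuming a PyTorch-style attention RNN labeler; names and
    # hyperparameters are hypothetical, not taken from the dataset paper.
    import torch
    import torch.nn as nn
    from sklearn.metrics import f1_score

    class AttentionRNNLabeler(nn.Module):
        def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, num_labels=174):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
            self.attn = nn.Linear(2 * hidden_dim, 1)          # additive attention scores
            self.classifier = nn.Linear(2 * hidden_dim, num_labels)

        def forward(self, token_ids):
            # token_ids: (batch, seq_len) integer-encoded report text
            states, _ = self.rnn(self.embedding(token_ids))    # (batch, seq_len, 2*hidden)
            weights = torch.softmax(self.attn(states), dim=1)  # attention over tokens
            context = (weights * states).sum(dim=1)            # attention-pooled summary
            return self.classifier(context)                    # one logit per label

    # Toy usage: random token ids stand in for tokenized Spanish reports.
    model = AttentionRNNLabeler(vocab_size=5000)
    logits = model(torch.randint(1, 5000, (4, 60)))
    predictions = (torch.sigmoid(logits) > 0.5).int().numpy()
    ground_truth = torch.randint(0, 2, predictions.shape).numpy()
    print("Micro-F1:", f1_score(ground_truth, predictions, average="micro", zero_division=0))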
To our knowledge, this is the public chest X-ray database annotated with the largest number of different labels suitable for supervised training, and the first to include X-ray reports in Spanish.