{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
},
"colab": {
"name": "EDA and Data Cleaning.ipynb",
"provenance": [],
"collapsed_sections": []
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "Xl323Q-gMutB",
"colab_type": "text"
},
"source": [
"Credit to Jeremy Howard, who discovered that some files have a corrupted rescale intercept and that others show little or no brain matter (https://www.kaggle.com/jhoward/cleaning-the-data-for-rapid-prototyping-fastai)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YnJowERx3Sik",
"colab_type": "text"
},
"source": [
"# **Importing Dependencies**"
]
},
{
"cell_type": "code",
"metadata": {
"id": "CPe_QcSy3DLo",
"colab_type": "code",
"colab": {}
},
"source": [
"import glob\n",
"import os\n",
"import pickle\n",
"\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"import tqdm.notebook as tqdm\n",
"\n",
"pd.set_option('display.max_columns', 500)\n",
"# None disables column-width truncation (pandas >= 1.0; older versions used -1)\n",
"pd.set_option('display.max_colwidth', None)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "yObm5-yx4FSf",
"colab_type": "text"
},
"source": [
"# **Reading in Metadata & Data Labels**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YA20uuhBVxAn",
"colab_type": "text"
},
"source": [
"DICOM files, the type of data we are dealing with here, contain not only images (such as a slice of a patient's brain) but also extensive metadata. Using parts of this metadata (specifically, patient IDs) lets us \"piece together\" *full* scans of patient brains from otherwise unrelated images.\n",
"\n",
"This step is crucial for our approach: For our sequential model to make use of the spatial relations inherent in CT scans, we must first reconstruct those spatial relations."
]
},
{
"cell_type": "code",
"metadata": {
"id": "-7aoQvsa3DMP",
"colab_type": "code",
"colab": {}
},
"source": [
"labels = pd.read_feather(\"./labels_jhoward.fth\")\n",
"train_metadata = pd.read_feather(\"./df_trn.fth\")"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "Ok_pdN2o3DMd",
"colab_type": "code",
"colab": {}
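},
"source": [
"# Keep only the ID and \"any\" columns, then add a .png extension to the file IDs\n",
"# (.copy() avoids pandas' SettingWithCopyWarning on the assignment below)\n",
"labels = labels[[\"ID\", \"any\"]].copy()\n",
"labels[\"ID\"] = labels[\"ID\"] + \".png\""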
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "pRB1cxyd3DMg",
"colab_type": "code",
"outputId": "f3b361aa-0fe2-4f6a-977e-7957846c169e",
"colab": {}
},
"source": [
"# Verify that the above cell executed correctly\n",
"labels.head()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ID</th>\n",
" <th>any</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>ID_000039fa0.png</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>ID_00005679d.png</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>ID_00008ce3c.png</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>ID_0000950d7.png</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>ID_0000aee4b.png</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ID any\n",
"0 ID_000039fa0.png 0 \n",
"1 ID_00005679d.png 0 \n",
"2 ID_00008ce3c.png 0 \n",
"3 ID_0000950d7.png 0 \n",
"4 ID_0000aee4b.png 0 "
]
},
"metadata": {
"tags": []
},
"execution_count": 99
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "czhB_jGTMRGn",
"colab_type": "text"
},
"source": [
"# **EDA**"
]
},
{
"cell_type": "code",
"metadata": {
"id": "GSdYiywS3DMp",
"colab_type": "code",
"outputId": "c8375c53-cb4c-4dcb-dcbe-746fc203665f",
"colab": {}
},
"source": [
"# There are 674,258 images, about 14.4% of which contain a hemorrhage\n",
"# (the mean of \"any\" is the positive rate: hemorrhage = 1, no hemorrhage = 0)\n",
"labels.describe()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>any</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>674258.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>0.144015</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>0.351105</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" any\n",
"count 674258.000000\n",
"mean 0.144015 \n",
"std 0.351105 \n",
"min 0.000000 \n",
"25% 0.000000 \n",
"50% 0.000000 \n",
"75% 0.000000 \n",
"max 1.000000 "
]
},
"metadata": {
"tags": []
},
"execution_count": 100
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "7Aep4NdF3DMs",
"colab_type": "code",
"outputId": "44228e66-21e0-4eae-c68b-c97f71da97f1",
"colab": {}
},
"source": [
"# For reference, the full metadata of the first 5 DICOM files\n",
"train_metadata.head()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>SOPInstanceUID</th>\n",
" <th>Modality</th>\n",
" <th>PatientID</th>\n",
" <th>StudyInstanceUID</th>\n",
" <th>SeriesInstanceUID</th>\n",
" <th>StudyID</th>\n",
" <th>ImagePositionPatient</th>\n",
" <th>ImageOrientationPatient</th>\n",
" <th>SamplesPerPixel</th>\n",
" <th>PhotometricInterpretation</th>\n",
" <th>Rows</th>\n",
" <th>Columns</th>\n",
" <th>PixelSpacing</th>\n",
" <th>BitsAllocated</th>\n",
" <th>BitsStored</th>\n",
" <th>HighBit</th>\n",
" <th>PixelRepresentation</th>\n",
" <th>WindowCenter</th>\n",
" <th>WindowWidth</th>\n",
" <th>RescaleIntercept</th>\n",
" <th>RescaleSlope</th>\n",
" <th>fname</th>\n",
" <th>MultiImagePositionPatient</th>\n",
" <th>ImagePositionPatient1</th>\n",
" <th>ImagePositionPatient2</th>\n",
" <th>MultiImageOrientationPatient</th>\n",
" <th>ImageOrientationPatient1</th>\n",
" <th>ImageOrientationPatient2</th>\n",
" <th>ImageOrientationPatient3</th>\n",
" <th>ImageOrientationPatient4</th>\n",
" <th>ImageOrientationPatient5</th>\n",
" <th>MultiPixelSpacing</th>\n",
" <th>PixelSpacing1</th>\n",
" <th>img_min</th>\n",
" <th>img_max</th>\n",
" <th>img_mean</th>\n",
" <th>img_std</th>\n",
" <th>img_pct_window</th>\n",
" <th>MultiWindowCenter</th>\n",
" <th>WindowCenter1</th>\n",
" <th>MultiWindowWidth</th>\n",
" <th>WindowWidth1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>ID_231d901c1</td>\n",
" <td>CT</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>ID_dd37ba3adb</td>\n",
" <td>ID_15dcd6057a</td>\n",
" <td></td>\n",
" <td>-125.0</td>\n",
" <td>1.0</td>\n",
" <td>1</td>\n",
" <td>MONOCHROME2</td>\n",
" <td>512</td>\n",
" <td>512</td>\n",
" <td>0.488281</td>\n",
" <td>16</td>\n",
" <td>16</td>\n",
" <td>15</td>\n",
" <td>1</td>\n",
" <td>40.0</td>\n",
" <td>100.0</td>\n",
" <td>-1024.0</td>\n",
" <td>1.0</td>\n",
" <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_231d901c1.dcm</td>\n",
" <td>1</td>\n",
" <td>-123.101000</td>\n",
" <td>104.307000</td>\n",
" <td>1</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.984808</td>\n",
" <td>-0.173648</td>\n",
" <td>1</td>\n",
" <td>0.488281</td>\n",
" <td>-1024</td>\n",
" <td>3263</td>\n",
" <td>171.462490</td>\n",
" <td>828.102464</td>\n",
" <td>0.164074</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>ID_994bc0470</td>\n",
" <td>CT</td>\n",
" <td>ID_400facde</td>\n",
" <td>ID_c5277f0c63</td>\n",
" <td>ID_4ba12c2161</td>\n",
" <td></td>\n",
" <td>-125.0</td>\n",
" <td>1.0</td>\n",
" <td>1</td>\n",
" <td>MONOCHROME2</td>\n",
" <td>512</td>\n",
" <td>512</td>\n",
" <td>0.488281</td>\n",
" <td>16</td>\n",
" <td>12</td>\n",
" <td>11</td>\n",
" <td>0</td>\n",
" <td>47.0</td>\n",
" <td>80.0</td>\n",
" <td>-1024.0</td>\n",
" <td>1.0</td>\n",
" <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_994bc0470.dcm</td>\n",
" <td>1</td>\n",
" <td>53.628222</td>\n",
" <td>223.572015</td>\n",
" <td>1</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.933580</td>\n",
" <td>-0.358368</td>\n",
" <td>1</td>\n",
" <td>0.488281</td>\n",
" <td>0</td>\n",
" <td>2507</td>\n",
" <td>430.418091</td>\n",
" <td>599.742963</td>\n",
" <td>0.198139</td>\n",
" <td>1.0</td>\n",
" <td>47.0</td>\n",
" <td>1.0</td>\n",
" <td>80.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>ID_127689cce</td>\n",
" <td>CT</td>\n",
" <td>ID_42910d3d</td>\n",
" <td>ID_db93ade25b</td>\n",
" <td>ID_c4b4931314</td>\n",
" <td></td>\n",
" <td>-125.0</td>\n",
" <td>1.0</td>\n",
" <td>1</td>\n",
" <td>MONOCHROME2</td>\n",
" <td>512</td>\n",
" <td>512</td>\n",
" <td>0.488281</td>\n",
" <td>16</td>\n",
" <td>16</td>\n",
" <td>15</td>\n",
" <td>1</td>\n",
" <td>30.0</td>\n",
" <td>80.0</td>\n",
" <td>-1024.0</td>\n",
" <td>1.0</td>\n",
" <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_127689cce.dcm</td>\n",
" <td>1</td>\n",
" <td>-123.646240</td>\n",
" <td>124.321068</td>\n",
" <td>1</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.972370</td>\n",
" <td>-0.233445</td>\n",
" <td>1</td>\n",
" <td>0.488281</td>\n",
" <td>-2000</td>\n",
" <td>2810</td>\n",
" <td>12.801376</td>\n",
" <td>1209.046168</td>\n",
" <td>0.250923</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>ID_25457734a</td>\n",
" <td>CT</td>\n",
" <td>ID_329aafa7</td>\n",
" <td>ID_8dd6d32f3b</td>\n",
" <td>ID_116558f409</td>\n",
" <td></td>\n",
" <td>-114.0</td>\n",
" <td>1.0</td>\n",
" <td>1</td>\n",
" <td>MONOCHROME2</td>\n",
" <td>512</td>\n",
" <td>512</td>\n",
" <td>0.445312</td>\n",
" <td>16</td>\n",
" <td>12</td>\n",
" <td>11</td>\n",
" <td>0</td>\n",
" <td>36.0</td>\n",
" <td>80.0</td>\n",
" <td>-1024.0</td>\n",
" <td>1.0</td>\n",
" <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_25457734a.dcm</td>\n",
" <td>1</td>\n",
" <td>-6.000000</td>\n",
" <td>171.999939</td>\n",
" <td>1</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1</td>\n",
" <td>0.445312</td>\n",
" <td>0</td>\n",
" <td>2647</td>\n",
" <td>566.557011</td>\n",
" <td>610.152845</td>\n",
" <td>0.298386</td>\n",
" <td>1.0</td>\n",
" <td>36.0</td>\n",
" <td>1.0</td>\n",
" <td>80.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>ID_81c9aa125</td>\n",
" <td>CT</td>\n",
" <td>ID_6b544c3c</td>\n",
" <td>ID_2685c5d5c0</td>\n",
" <td>ID_f56d7bd0f9</td>\n",
" <td></td>\n",
" <td>-115.0</td>\n",
" <td>1.0</td>\n",
" <td>1</td>\n",
" <td>MONOCHROME2</td>\n",
" <td>512</td>\n",
" <td>512</td>\n",
" <td>0.449219</td>\n",
" <td>16</td>\n",
" <td>12</td>\n",
" <td>11</td>\n",
" <td>0</td>\n",
" <td>36.0</td>\n",
" <td>80.0</td>\n",
" <td>-1024.0</td>\n",
" <td>1.0</td>\n",
" <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_81c9aa125.dcm</td>\n",
" <td>1</td>\n",
" <td>-1.000000</td>\n",
" <td>230.500000</td>\n",
" <td>1</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1</td>\n",
" <td>0.449219</td>\n",
" <td>4</td>\n",
" <td>1570</td>\n",
" <td>178.512295</td>\n",
" <td>358.235071</td>\n",
" <td>0.006176</td>\n",
" <td>1.0</td>\n",
" <td>36.0</td>\n",
" <td>1.0</td>\n",
" <td>80.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" SOPInstanceUID Modality PatientID StudyInstanceUID SeriesInstanceUID \\\n",
"0 ID_231d901c1 CT ID_b81a287f ID_dd37ba3adb ID_15dcd6057a \n",
"1 ID_994bc0470 CT ID_400facde ID_c5277f0c63 ID_4ba12c2161 \n",
"2 ID_127689cce CT ID_42910d3d ID_db93ade25b ID_c4b4931314 \n",
"3 ID_25457734a CT ID_329aafa7 ID_8dd6d32f3b ID_116558f409 \n",
"4 ID_81c9aa125 CT ID_6b544c3c ID_2685c5d5c0 ID_f56d7bd0f9 \n",
"\n",
" StudyID ImagePositionPatient ImageOrientationPatient SamplesPerPixel \\\n",
"0 -125.0 1.0 1 \n",
"1 -125.0 1.0 1 \n",
"2 -125.0 1.0 1 \n",
"3 -114.0 1.0 1 \n",
"4 -115.0 1.0 1 \n",
"\n",
" PhotometricInterpretation Rows Columns PixelSpacing BitsAllocated \\\n",
"0 MONOCHROME2 512 512 0.488281 16 \n",
"1 MONOCHROME2 512 512 0.488281 16 \n",
"2 MONOCHROME2 512 512 0.488281 16 \n",
"3 MONOCHROME2 512 512 0.445312 16 \n",
"4 MONOCHROME2 512 512 0.449219 16 \n",
"\n",
" BitsStored HighBit PixelRepresentation WindowCenter WindowWidth \\\n",
"0 16 15 1 40.0 100.0 \n",
"1 12 11 0 47.0 80.0 \n",
"2 16 15 1 30.0 80.0 \n",
"3 12 11 0 36.0 80.0 \n",
"4 12 11 0 36.0 80.0 \n",
"\n",
" RescaleIntercept RescaleSlope \\\n",
"0 -1024.0 1.0 \n",
"1 -1024.0 1.0 \n",
"2 -1024.0 1.0 \n",
"3 -1024.0 1.0 \n",
"4 -1024.0 1.0 \n",
"\n",
" fname \\\n",
"0 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_231d901c1.dcm \n",
"1 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_994bc0470.dcm \n",
"2 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_127689cce.dcm \n",
"3 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_25457734a.dcm \n",
"4 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_81c9aa125.dcm \n",
"\n",
" MultiImagePositionPatient ImagePositionPatient1 ImagePositionPatient2 \\\n",
"0 1 -123.101000 104.307000 \n",
"1 1 53.628222 223.572015 \n",
"2 1 -123.646240 124.321068 \n",
"3 1 -6.000000 171.999939 \n",
"4 1 -1.000000 230.500000 \n",
"\n",
" MultiImageOrientationPatient ImageOrientationPatient1 \\\n",
"0 1 0.0 \n",
"1 1 0.0 \n",
"2 1 0.0 \n",
"3 1 0.0 \n",
"4 1 0.0 \n",
"\n",
" ImageOrientationPatient2 ImageOrientationPatient3 \\\n",
"0 0.0 0.0 \n",
"1 0.0 0.0 \n",
"2 0.0 0.0 \n",
"3 0.0 0.0 \n",
"4 0.0 0.0 \n",
"\n",
" ImageOrientationPatient4 ImageOrientationPatient5 MultiPixelSpacing \\\n",
"0 0.984808 -0.173648 1 \n",
"1 0.933580 -0.358368 1 \n",
"2 0.972370 -0.233445 1 \n",
"3 1.000000 0.000000 1 \n",
"4 1.000000 0.000000 1 \n",
"\n",
" PixelSpacing1 img_min img_max img_mean img_std img_pct_window \\\n",
"0 0.488281 -1024 3263 171.462490 828.102464 0.164074 \n",
"1 0.488281 0 2507 430.418091 599.742963 0.198139 \n",
"2 0.488281 -2000 2810 12.801376 1209.046168 0.250923 \n",
"3 0.445312 0 2647 566.557011 610.152845 0.298386 \n",
"4 0.449219 4 1570 178.512295 358.235071 0.006176 \n",
"\n",
" MultiWindowCenter WindowCenter1 MultiWindowWidth WindowWidth1 \n",
"0 NaN NaN NaN NaN \n",
"1 1.0 47.0 1.0 80.0 \n",
"2 NaN NaN NaN NaN \n",
"3 1.0 36.0 1.0 80.0 \n",
"4 1.0 36.0 1.0 80.0 "
]
},
"metadata": {
"tags": []
},
"execution_count": 101
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6Xai9HMWZIjT",
"colab_type": "text"
},
"source": [
"Sorting by `PatientID` groups each patient's files together, and sorting by `ImagePositionPatient2` (the slice's position along the scan axis) then puts that patient's brain slices in the correct order. The 20 rows output by the cell below are therefore consecutive slices of a single patient's brain."
]
},
{
"cell_type": "code",
"metadata": {
"scrolled": true,
"id": "laKYqlaO3DMy",
"colab_type": "code",
"colab": {}
},
"source": [
"train_metadata.sort_values(by=[\"PatientID\", \"ImagePositionPatient2\"]).head(20)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "RgF5wYFK3DM-",
"colab_type": "code",
"outputId": "4f8c3482-d623-4668-83dc-972423dd2bc3",
"colab": {}
},
"source": [
"# Verify that we can retrieve the name of our files from the metadata\n",
"# (important for matching our PNGs extracted from DICOM files to their metadata)\n",
"train_metadata[[\"SOPInstanceUID\", \"fname\"]].sort_values(\"SOPInstanceUID\")"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>SOPInstanceUID</th>\n",
" <th>fname</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>409738</th>\n",
" <td>ID_000039fa0</td>\n",
" <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_000039fa0.dcm</td>\n",
" </tr>\n",
" <tr>\n",
" <th>470057</th>\n",
" <td>ID_00005679d</td>\n",
" <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_00005679d.dcm</td>\n",
" </tr>\n",
" <tr>\n",
" <th>548095</th>\n",
" <td>ID_00008ce3c</td>\n",
" <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_00008ce3c.dcm</td>\n",
" </tr>\n",
" <tr>\n",
" <th>204704</th>\n",
" <td>ID_0000950d7</td>\n",
" <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_0000950d7.dcm</td>\n",
" </tr>\n",
" <tr>\n",
" <th>291987</th>\n",
" <td>ID_0000aee4b</td>\n",
" <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_0000aee4b.dcm</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>544908</th>\n",
" <td>ID_ffff73ede</td>\n",
" <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_ffff73ede.dcm</td>\n",
" </tr>\n",
" <tr>\n",
" <th>385867</th>\n",
" <td>ID_ffff80705</td>\n",
" <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_ffff80705.dcm</td>\n",
" </tr>\n",
" <tr>\n",
" <th>674027</th>\n",
" <td>ID_ffff82e46</td>\n",
" <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_ffff82e46.dcm</td>\n",
" </tr>\n",
" <tr>\n",
" <th>52232</th>\n",
" <td>ID_ffff922b9</td>\n",
" <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_ffff922b9.dcm</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5317</th>\n",
" <td>ID_fffff9393</td>\n",
" <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_fffff9393.dcm</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>674258 rows × 2 columns</p>\n",
"</div>"
],
"text/plain": [
" SOPInstanceUID \\\n",
"409738 ID_000039fa0 \n",
"470057 ID_00005679d \n",
"548095 ID_00008ce3c \n",
"204704 ID_0000950d7 \n",
"291987 ID_0000aee4b \n",
"... ... \n",
"544908 ID_ffff73ede \n",
"385867 ID_ffff80705 \n",
"674027 ID_ffff82e46 \n",
"52232 ID_ffff922b9 \n",
"5317 ID_fffff9393 \n",
"\n",
" fname \n",
"409738 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_000039fa0.dcm \n",
"470057 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_00005679d.dcm \n",
"548095 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_00008ce3c.dcm \n",
"204704 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_0000950d7.dcm \n",
"291987 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_0000aee4b.dcm \n",
"... ... \n",
"544908 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_ffff73ede.dcm \n",
"385867 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_ffff80705.dcm \n",
"674027 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_ffff82e46.dcm \n",
"52232 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_ffff922b9.dcm \n",
"5317 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_fffff9393.dcm \n",
"\n",
"[674258 rows x 2 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 103
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "P8goUuHTaBA9",
"colab_type": "text"
},
"source": [
"Next, we want to know how many slices of a patient's brain the CT scans in our data usually contain. The histogram below tells us that the answer is ~30."
]
},
{
"cell_type": "code",
"metadata": {
"id": "Tg-1hmrC3DND",
"colab_type": "code",
"outputId": "0bfdf379-e0c9-4eeb-f97d-8b2a70e5853d",
"colab": {}
},
"source": [
"plt.figure(figsize=(20, 6))\n",
"train_metadata.groupby(\"PatientID\").Modality.count().hist(bins=150)\n",
"plt.show()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1440x432 with 1 Axes>"
]
},
"metadata": {
"tags": [],
"needs_background": "light"
}
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Nn0cztnmajeI",
"colab_type": "text"
},
"source": [
"# **Data Cleaning**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "10FFXYu53DNM",
"colab_type": "text"
},
"source": [
"## **Step 1: Removing Files with Incorrect Rescale Intercept**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tSztirfuaz-t",
"colab_type": "text"
},
"source": [
"The rescale intercept of our DICOM files (see the `RescaleIntercept` column of the metadata) should be -1024 for every file. However, as the histogram below shows, some files are corrupted: their rescale intercept is much larger."
]
},
{
"cell_type": "code",
"metadata": {
"id": "pAgE8QJI3DNN",
"colab_type": "code",
"outputId": "e1c38fcb-0e95-47c4-cc2f-ed0b1d2f1a2a",
"colab": {}
},
"source": [
"train_metadata.RescaleIntercept.hist()\n",
"plt.show()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"tags": [],
"needs_background": "light"
}
}
]
},
{
"cell_type": "code",
"metadata": {
"scrolled": true,
"id": "km5BztPZ3DNR",
"colab_type": "code",
"outputId": "196e276a-5c36-43c1-f88f-9912cee61fd5",
"colab": {}
},
"source": [
"# We identify the corrupted files in the next three cells\n",
"train_metadata.query(\"RescaleIntercept!=-1024\").groupby(\"PatientID\").count().sort_values(\"SOPInstanceUID\")"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>SOPInstanceUID</th>\n",
" <th>Modality</th>\n",
" <th>StudyInstanceUID</th>\n",
" <th>SeriesInstanceUID</th>\n",
" <th>StudyID</th>\n",
" <th>ImagePositionPatient</th>\n",
" <th>ImageOrientationPatient</th>\n",
" <th>SamplesPerPixel</th>\n",
" <th>PhotometricInterpretation</th>\n",
" <th>Rows</th>\n",
" <th>Columns</th>\n",
" <th>PixelSpacing</th>\n",
" <th>BitsAllocated</th>\n",
" <th>BitsStored</th>\n",
" <th>HighBit</th>\n",
" <th>PixelRepresentation</th>\n",
" <th>WindowCenter</th>\n",
" <th>WindowWidth</th>\n",
" <th>RescaleIntercept</th>\n",
" <th>RescaleSlope</th>\n",
" <th>fname</th>\n",
" <th>MultiImagePositionPatient</th>\n",
" <th>ImagePositionPatient1</th>\n",
" <th>ImagePositionPatient2</th>\n",
" <th>MultiImageOrientationPatient</th>\n",
" <th>ImageOrientationPatient1</th>\n",
" <th>ImageOrientationPatient2</th>\n",
" <th>ImageOrientationPatient3</th>\n",
" <th>ImageOrientationPatient4</th>\n",
" <th>ImageOrientationPatient5</th>\n",
" <th>MultiPixelSpacing</th>\n",
" <th>PixelSpacing1</th>\n",
" <th>img_min</th>\n",
" <th>img_max</th>\n",
" <th>img_mean</th>\n",
" <th>img_std</th>\n",
" <th>img_pct_window</th>\n",
" <th>MultiWindowCenter</th>\n",
" <th>WindowCenter1</th>\n",
" <th>MultiWindowWidth</th>\n",
" <th>WindowWidth1</th>\n",
" </tr>\n",
" <tr>\n",
" <th>PatientID</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>ID_b956c8dd</th>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ID_11e103d4</th>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ID_03ac0e28</th>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" <td>22</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ID_57a06f55</th>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ID_cd2e1b47</th>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>23</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ID_a579ac67</th>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" <td>42</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ID_2b35cfb8</th>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ID_00526c11</th>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ID_aa91c454</th>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ID_b4f7750e</th>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" <td>64</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>407 rows × 41 columns</p>\n",
"</div>"
],
"text/plain": [
" SOPInstanceUID Modality StudyInstanceUID SeriesInstanceUID \\\n",
"PatientID \n",
"ID_b956c8dd 20 20 20 20 \n",
"ID_11e103d4 20 20 20 20 \n",
"ID_03ac0e28 22 22 22 22 \n",
"ID_57a06f55 23 23 23 23 \n",
"ID_cd2e1b47 23 23 23 23 \n",
"... .. .. .. .. \n",
"ID_a579ac67 42 42 42 42 \n",
"ID_2b35cfb8 52 52 52 52 \n",
"ID_00526c11 52 52 52 52 \n",
"ID_aa91c454 56 56 56 56 \n",
"ID_b4f7750e 64 64 64 64 \n",
"\n",
" StudyID ImagePositionPatient ImageOrientationPatient \\\n",
"PatientID \n",
"ID_b956c8dd 20 20 20 \n",
"ID_11e103d4 20 20 20 \n",
"ID_03ac0e28 22 22 22 \n",
"ID_57a06f55 23 23 23 \n",
"ID_cd2e1b47 23 23 23 \n",
"... .. .. .. \n",
"ID_a579ac67 42 42 42 \n",
"ID_2b35cfb8 52 52 52 \n",
"ID_00526c11 52 52 52 \n",
"ID_aa91c454 56 56 56 \n",
"ID_b4f7750e 64 64 64 \n",
"\n",
" SamplesPerPixel PhotometricInterpretation Rows Columns \\\n",
"PatientID \n",
"ID_b956c8dd 20 20 20 20 \n",
"ID_11e103d4 20 20 20 20 \n",
"ID_03ac0e28 22 22 22 22 \n",
"ID_57a06f55 23 23 23 23 \n",
"ID_cd2e1b47 23 23 23 23 \n",
"... .. .. .. .. \n",
"ID_a579ac67 42 42 42 42 \n",
"ID_2b35cfb8 52 52 52 52 \n",
"ID_00526c11 52 52 52 52 \n",
"ID_aa91c454 56 56 56 56 \n",
"ID_b4f7750e 64 64 64 64 \n",
"\n",
" PixelSpacing BitsAllocated BitsStored HighBit \\\n",
"PatientID \n",
"ID_b956c8dd 20 20 20 20 \n",
"ID_11e103d4 20 20 20 20 \n",
"ID_03ac0e28 22 22 22 22 \n",
"ID_57a06f55 23 23 23 23 \n",
"ID_cd2e1b47 23 23 23 23 \n",
"... .. .. .. .. \n",
"ID_a579ac67 42 42 42 42 \n",
"ID_2b35cfb8 52 52 52 52 \n",
"ID_00526c11 52 52 52 52 \n",
"ID_aa91c454 56 56 56 56 \n",
"ID_b4f7750e 64 64 64 64 \n",
"\n",
" PixelRepresentation WindowCenter WindowWidth RescaleIntercept \\\n",
"PatientID \n",
"ID_b956c8dd 20 20 20 20 \n",
"ID_11e103d4 20 20 20 20 \n",
"ID_03ac0e28 22 22 22 22 \n",
"ID_57a06f55 23 23 23 23 \n",
"ID_cd2e1b47 23 23 23 23 \n",
"... .. .. .. .. \n",
"ID_a579ac67 42 42 42 42 \n",
"ID_2b35cfb8 52 52 52 52 \n",
"ID_00526c11 52 52 52 52 \n",
"ID_aa91c454 56 56 56 56 \n",
"ID_b4f7750e 64 64 64 64 \n",
"\n",
" RescaleSlope fname MultiImagePositionPatient \\\n",
"PatientID \n",
"ID_b956c8dd 20 20 20 \n",
"ID_11e103d4 20 20 20 \n",
"ID_03ac0e28 22 22 22 \n",
"ID_57a06f55 23 23 23 \n",
"ID_cd2e1b47 23 23 23 \n",
"... .. .. .. \n",
"ID_a579ac67 42 42 42 \n",
"ID_2b35cfb8 52 52 52 \n",
"ID_00526c11 52 52 52 \n",
"ID_aa91c454 56 56 56 \n",
"ID_b4f7750e 64 64 64 \n",
"\n",
" ImagePositionPatient1 ImagePositionPatient2 \\\n",
"PatientID \n",
"ID_b956c8dd 20 20 \n",
"ID_11e103d4 20 20 \n",
"ID_03ac0e28 22 22 \n",
"ID_57a06f55 23 23 \n",
"ID_cd2e1b47 23 23 \n",
"... .. .. \n",
"ID_a579ac67 42 42 \n",
"ID_2b35cfb8 52 52 \n",
"ID_00526c11 52 52 \n",
"ID_aa91c454 56 56 \n",
"ID_b4f7750e 64 64 \n",
"\n",
" MultiImageOrientationPatient ImageOrientationPatient1 \\\n",
"PatientID \n",
"ID_b956c8dd 20 20 \n",
"ID_11e103d4 20 20 \n",
"ID_03ac0e28 22 22 \n",
"ID_57a06f55 23 23 \n",
"ID_cd2e1b47 23 23 \n",
"... .. .. \n",
"ID_a579ac67 42 42 \n",
"ID_2b35cfb8 52 52 \n",
"ID_00526c11 52 52 \n",
"ID_aa91c454 56 56 \n",
"ID_b4f7750e 64 64 \n",
"\n",
" ImageOrientationPatient2 ImageOrientationPatient3 \\\n",
"PatientID \n",
"ID_b956c8dd 20 20 \n",
"ID_11e103d4 20 20 \n",
"ID_03ac0e28 22 22 \n",
"ID_57a06f55 23 23 \n",
"ID_cd2e1b47 23 23 \n",
"... .. .. \n",
"ID_a579ac67 42 42 \n",
"ID_2b35cfb8 52 52 \n",
"ID_00526c11 52 52 \n",
"ID_aa91c454 56 56 \n",
"ID_b4f7750e 64 64 \n",
"\n",
" ImageOrientationPatient4 ImageOrientationPatient5 \\\n",
"PatientID \n",
"ID_b956c8dd 20 20 \n",
"ID_11e103d4 20 20 \n",
"ID_03ac0e28 22 22 \n",
"ID_57a06f55 23 23 \n",
"ID_cd2e1b47 23 23 \n",
"... .. .. \n",
"ID_a579ac67 42 42 \n",
"ID_2b35cfb8 52 52 \n",
"ID_00526c11 52 52 \n",
"ID_aa91c454 56 56 \n",
"ID_b4f7750e 64 64 \n",
"\n",
" MultiPixelSpacing PixelSpacing1 img_min img_max img_mean \\\n",
"PatientID \n",
"ID_b956c8dd 20 20 20 20 20 \n",
"ID_11e103d4 20 20 20 20 20 \n",
"ID_03ac0e28 22 22 22 22 22 \n",
"ID_57a06f55 23 23 23 23 23 \n",
"ID_cd2e1b47 23 23 23 23 23 \n",
"... .. .. .. .. .. \n",
"ID_a579ac67 42 42 42 42 42 \n",
"ID_2b35cfb8 52 52 52 52 52 \n",
"ID_00526c11 52 52 52 52 52 \n",
"ID_aa91c454 56 56 56 56 56 \n",
"ID_b4f7750e 64 64 64 64 64 \n",
"\n",
" img_std img_pct_window MultiWindowCenter WindowCenter1 \\\n",
"PatientID \n",
"ID_b956c8dd 20 20 20 20 \n",
"ID_11e103d4 20 20 20 20 \n",
"ID_03ac0e28 22 22 22 22 \n",
"ID_57a06f55 23 23 23 23 \n",
"ID_cd2e1b47 23 23 0 0 \n",
"... .. .. .. .. \n",
"ID_a579ac67 42 42 42 42 \n",
"ID_2b35cfb8 52 52 52 52 \n",
"ID_00526c11 52 52 52 52 \n",
"ID_aa91c454 56 56 56 56 \n",
"ID_b4f7750e 64 64 64 64 \n",
"\n",
" MultiWindowWidth WindowWidth1 \n",
"PatientID \n",
"ID_b956c8dd 20 20 \n",
"ID_11e103d4 20 20 \n",
"ID_03ac0e28 22 22 \n",
"ID_57a06f55 23 23 \n",
"ID_cd2e1b47 0 0 \n",
"... .. .. \n",
"ID_a579ac67 42 42 \n",
"ID_2b35cfb8 52 52 \n",
"ID_00526c11 52 52 \n",
"ID_aa91c454 56 56 \n",
"ID_b4f7750e 64 64 \n",
"\n",
"[407 rows x 41 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 106
}
]
},
{
"cell_type": "code",
"metadata": {
"scrolled": true,
"id": "A4FcWqQk3DNb",
"colab_type": "code",
"colab": {}
},
"source": [
"png_IDs_to_remove = sorted(train_metadata.query(\"RescaleIntercept!=-1024\").SOPInstanceUID.tolist())"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "tT9LMoLl3DNf",
"colab_type": "code",
"outputId": "8a4deaf2-005f-4821-d992-98eaf00f6a18",
"colab": {}
},
"source": [
"png_IDs_to_remove[0:10]"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['ID_0007ff5d1',\n",
" 'ID_000aa2bce',\n",
" 'ID_000bd8380',\n",
" 'ID_0012b1611',\n",
" 'ID_0015e926e',\n",
" 'ID_001bdd8fb',\n",
" 'ID_001d4ce1c',\n",
" 'ID_0023e98ab',\n",
" 'ID_0024b1888',\n",
" 'ID_00382ea5e']"
]
},
"metadata": {
"tags": []
},
"execution_count": 108
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3O7qOYMr3DNv",
"colab_type": "text"
},
"source": [
"\n",
"Now, we remove the corrupted files and update the metadata & labels to no longer contain references to said files."
]
},
{
"cell_type": "code",
"metadata": {
"id": "P7KMYBqb3DNw",
"colab_type": "code",
"colab": {}
},
"source": [
"# Removing corrupted files\n",
"folder_path = \"C:/Users/Administrator/Downloads/rsna_stage1_png_128/stage_1_train_images\"\n",
"for ID in tqdm.tqdm(png_IDs_to_remove):\n",
" os.remove(\"./rsna_stage1_png_128/stage_1_train_images/{}.png\".format(ID))"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "PM0qyR8_3DOA",
"colab_type": "code",
"outputId": "e75bd07a-2354-4208-c353-284f6212cdae",
"colab": {}
},
"source": [
"# Verify that corrupted files have been removed\n",
"os.path.isfile(\"./rsna_stage1_png_128/stage_1_train_images/ID_0007ff5d1.png\")"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"False"
]
},
"metadata": {
"tags": []
},
"execution_count": 112
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "s8Vp_bom3DOE",
"colab_type": "code",
"colab": {}
},
"source": [
"# Update labels and metadata\n",
"labels = labels[~labels[\"ID\"].isin(png_IDs_to_remove)]\n",
"train_metadata = train_metadata[~train_metadata[\"SOPInstanceUID\"].isin(png_IDs_to_remove)]"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "hIWgnYiA3DOO",
"colab_type": "text"
},
"source": [
"### **Step 2: Remove Images w/ Low Brain Percentage**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WwumGFykeJcG",
"colab_type": "text"
},
"source": [
"Some images contain virtually no brain matter (that is, they are slices of the patient's skull either above or below their brain). Here, we identify and remove these files."
]
},
{
"cell_type": "code",
"metadata": {
"id": "fcmgGDXQ3DOP",
"colab_type": "code",
"outputId": "d2f970c7-c46f-46a9-d2df-dffeec85b02f",
"colab": {}
},
"source": [
"# This histogram shows the percentage of brain matter present in all of our files \n",
"train_metadata.img_pct_window.hist(bins=100)\n",
"plt.show()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"tags": [],
"needs_background": "light"
}
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "t0C89Ux_3DPZ",
"colab_type": "code",
"colab": {}
},
"source": [
"# We choose to remove images containing less than 2% brain matter\n",
"png_IDs_to_remove = train_metadata.query(\"img_pct_window<0.02\").SOPInstanceUID.tolist()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "HjiMcCVX3DPl",
"colab_type": "code",
"colab": {}
},
"source": [
"for ID in tqdm.tqdm(png_IDs_to_remove):\n",
" os.remove(\"./rsna_stage1_png_128/stage_1_train_images/{}.png\".format(ID))"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "boyg1HUM3DPx",
"colab_type": "code",
"colab": {}
},
"source": [
"# Once again, we update labels and metadata to reflect the changes we have undertaken\n",
"labels = labels[~labels[\"ID\"].isin(png_IDs_to_remove)]\n",
"train_metadata = train_metadata[~train_metadata[\"SOPInstanceUID\"].isin(png_IDs_to_remove)]"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "SoMn02rk3DP1",
"colab_type": "text"
},
"source": [
"## **Step 3: Create New CSVs**"
]
},
{
"cell_type": "code",
"metadata": {
"id": "MVFPLKrO3DP1",
"colab_type": "code",
"colab": {}
},
"source": [
"# Here, we write out the cleaned CSV files for labels and metadata\n",
"labels.to_csv(\"labels_cleaned.csv\")\n",
"train_metadata.to_csv(\"train_metadata_cleaned.csv\")"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "0rpJfRme3DQD",
"colab_type": "text"
},
"source": [
"## **Step 4: Check that Label File & Metadata File Agree**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3uHZfip1mFpm",
"colab_type": "text"
},
"source": [
"We have made substantial udpates to both the label file and to the metadata file. It is now prudent to check that both files still agree with each other to ensure the absence of bugs in the above code. We also conduct a few other sanity checks."
]
},
{
"cell_type": "code",
"metadata": {
"id": "PNaF8UrE3DQE",
"colab_type": "code",
"outputId": "f29604dd-e13e-41d1-8819-d79e921c26d9",
"colab": {}
},
"source": [
"csv_labels = pd.read_csv(\"./rsna_stage1_png_128/stage_1_train.csv\")\n",
"csv_labels = csv_labels.iloc[5::6, :]\n",
"csv_labels.head()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ID</th>\n",
" <th>Label</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>ID_63eb1e259_any</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>ID_2669954a7_any</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>ID_52c9913b1_any</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>ID_4e6ff6126_any</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>ID_7858edd88_any</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ID Label\n",
"5 ID_63eb1e259_any 0 \n",
"11 ID_2669954a7_any 0 \n",
"17 ID_52c9913b1_any 0 \n",
"23 ID_4e6ff6126_any 0 \n",
"29 ID_7858edd88_any 0 "
]
},
"metadata": {
"tags": []
},
"execution_count": 73
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "x6KgfO-Z3DQI",
"colab_type": "code",
"outputId": "8ddd8d0d-4f1d-46af-95ef-14e49f779c6b",
"colab": {}
},
"source": [
"# Get rid of the \"_any\" part of the IDs & verify that it worked\n",
"csv_labels[\"ID\"] = csv_labels[\"ID\"].str.replace(\"_any\", \"\")\n",
"csv_labels.head()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ID</th>\n",
" <th>Label</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>ID_63eb1e259</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>ID_2669954a7</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>ID_52c9913b1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>ID_4e6ff6126</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>ID_7858edd88</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ID Label\n",
"5 ID_63eb1e259 0 \n",
"11 ID_2669954a7 0 \n",
"17 ID_52c9913b1 0 \n",
"23 ID_4e6ff6126 0 \n",
"29 ID_7858edd88 0 "
]
},
"metadata": {
"tags": []
},
"execution_count": 74
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "c_PQHFQv3DQU",
"colab_type": "code",
"colab": {}
},
"source": [
"# Verify that \"csv_labels\" and \"labels\" contain the same number of elements (they do)\n",
"csv_labels_set = set(csv_labels[\"ID\"].tolist())\n",
"labels_set = set(labels[\"ID\"].str.replace(\".png\", \"\").tolist())\n",
"\n",
"print(len(csv_labels_set))\n",
"print(len(labels_set))"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "rYF4FM4K3DQY",
"colab_type": "code",
"outputId": "454a4d95-f875-4872-ee10-918fa7d15b8e",
"colab": {}
},
"source": [
"# Verify that \"csv_labels\" and \"labels\" contain the same elements\n",
"# (Indeed they do since their difference is the empty set)\n",
"csv_labels_set.symmetric_difference(labels_set)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"set()"
]
},
"metadata": {
"tags": []
},
"execution_count": 76
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "SDT9V2Fj3DQk",
"colab_type": "code",
"outputId": "b71a8c73-b206-4a5c-b279-2659bf1818ca",
"colab": {}
},
"source": [
"# Also verify that \"metadata_labels\" and \"labels\" contain the same elements\n",
"metadata_labels_set = set(train_metadata[\"SOPInstanceUID\"].tolist())\n",
"metadata_labels_set.symmetric_difference(labels_set)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"set()"
]
},
"metadata": {
"tags": []
},
"execution_count": 79
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "GjiX_onx3DQn",
"colab_type": "code",
"outputId": "67fab19e-a2b7-41d7-e72f-e0c606112535",
"colab": {}
},
"source": [
"# Get rid of .png extension\n",
"fnames = glob.glob(\"./rsna_stage1_png_128/stage_1_train_images/*\")\n",
"fnames = [fname.replace(\"./rsna_stage1_png_128/stage_1_train_images\\\\\", \"\").replace(\".png\", \"\") for fname in fnames]\n",
"print(fnames[0:10])\n",
"print(len(fnames))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"['ID_000039fa0', 'ID_00005679d', 'ID_00008ce3c', 'ID_0000950d7', 'ID_0000aee4b', 'ID_0000f1657', 'ID_000178e76', 'ID_00019828f', 'ID_0001dcc25', 'ID_0001de0e8']\n",
"641386\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "M-8LSvt50L28",
"colab_type": "text"
},
"source": [
"We have cleaned up the data by deleting ~30,000 images from it:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "1ZoqfyXc3DQ1",
"colab_type": "code",
"outputId": "1a5a26fa-e566-48aa-f4bc-7573d8c22ba8",
"colab": {}
},
"source": [
"fnames_set = set(fnames)\n",
"len(fnames_set.symmetric_difference(metadata_labels_set))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"32872"
]
},
"metadata": {
"tags": []
},
"execution_count": 94
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ai0rn23E3DQ5",
"colab_type": "text"
},
"source": [
"# **Reconstructing the Underlying CT Scans**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TdrbjOxp0-aL",
"colab_type": "text"
},
"source": [
"Now, we can finally order the images in the sequence that they were taken in. That is, for each patient, we will now reconstruct their brain by arranging the images from their CT scan by increasing scan depth. This will allow us to make use of the **spatial relations** present between a patient's various images (specifically, we do this through the use of a bidirectional LSTM)."
]
},
{
"cell_type": "code",
"metadata": {
"id": "c3062b_a3DRC",
"colab_type": "code",
"outputId": "c221707f-c0b0-49d6-fdf1-1974a6c43785",
"colab": {}
},
"source": [
"# This is the part of the metadata the we care about:\n",
"# \"ImagePositionPatient2\" specifies the slice depth\n",
"train_metadata_slice_patient_depth = train_metadata[[\"SOPInstanceUID\", \"PatientID\", \"ImagePositionPatient2\"]]\n",
"train_metadata_slice_patient_depth.head()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>SOPInstanceUID</th>\n",
" <th>PatientID</th>\n",
" <th>ImagePositionPatient2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>ID_231d901c1</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>104.307000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>ID_994bc0470</td>\n",
" <td>ID_400facde</td>\n",
" <td>223.572015</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>ID_127689cce</td>\n",
" <td>ID_42910d3d</td>\n",
" <td>124.321068</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>ID_25457734a</td>\n",
" <td>ID_329aafa7</td>\n",
" <td>171.999939</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>ID_87e8b2528</td>\n",
" <td>ID_d6e578fb</td>\n",
" <td>156.828114</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" SOPInstanceUID PatientID ImagePositionPatient2\n",
"0 ID_231d901c1 ID_b81a287f 104.307000 \n",
"1 ID_994bc0470 ID_400facde 223.572015 \n",
"2 ID_127689cce ID_42910d3d 124.321068 \n",
"3 ID_25457734a ID_329aafa7 171.999939 \n",
"5 ID_87e8b2528 ID_d6e578fb 156.828114 "
]
},
"metadata": {
"tags": []
},
"execution_count": 141
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "pP3JfpDP3DRF",
"colab_type": "code",
"colab": {}
},
"source": [
"# We create the \"patient_slices\" dictionary:\n",
"# keys are patients, values are (slice depth and file names)\n",
"patient_slices = dict()\n",
"for i, row in tqdm.tqdm(train_metadata_slice_patient_depth.iterrows()):\n",
" if row[\"PatientID\"] not in patient_slices:\n",
" patient_slices[row[\"PatientID\"]] = [(row[\"ImagePositionPatient2\"], row[\"SOPInstanceUID\"])]\n",
" else:\n",
" patient_slices[row[\"PatientID\"]].append((row[\"ImagePositionPatient2\"], row[\"SOPInstanceUID\"]))"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "b4qQe0SF3DRI",
"colab_type": "code",
"colab": {}
},
"source": [
"# We now sort the dictionary created above so as to arrange slices in the order that they belong in,\n",
"# thereby reconstructing all patient brains\n",
"patient_slices_sorted = dict()\n",
"for i, (key, val) in enumerate(patient_slices.items()):\n",
" val.sort()\n",
" patient_slices_sorted[key] = [ID for depth, ID in val]"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "0Rq67G_v_wHZ",
"colab_type": "text"
},
"source": [
"Below is an example of one (1) patient's reconstructed CT scan. Note the increasing slice depth:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "-WHINOCm3DRi",
"colab_type": "code",
"outputId": "15307222-89cf-43c8-ffa4-685749cf0109",
"colab": {}
},
"source": [
"train_metadata_slice_patient_depth.query(\"PatientID == 'ID_b81a287f'\").sort_values(\"ImagePositionPatient2\")"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>SOPInstanceUID</th>\n",
" <th>PatientID</th>\n",
" <th>ImagePositionPatient2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>214258</th>\n",
" <td>ID_9f601fc5d</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>7.956</td>\n",
" </tr>\n",
" <tr>\n",
" <th>427173</th>\n",
" <td>ID_19cb96474</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>13.033</td>\n",
" </tr>\n",
" <tr>\n",
" <th>636486</th>\n",
" <td>ID_496ab2661</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>18.110</td>\n",
" </tr>\n",
" <tr>\n",
" <th>436756</th>\n",
" <td>ID_06fe4adc5</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>23.187</td>\n",
" </tr>\n",
" <tr>\n",
" <th>68576</th>\n",
" <td>ID_59bc3960f</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>28.266</td>\n",
" </tr>\n",
" <tr>\n",
" <th>331217</th>\n",
" <td>ID_aef3564ac</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>33.343</td>\n",
" </tr>\n",
" <tr>\n",
" <th>503169</th>\n",
" <td>ID_c1f0895bb</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>38.420</td>\n",
" </tr>\n",
" <tr>\n",
" <th>467952</th>\n",
" <td>ID_c85081ef5</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>43.497</td>\n",
" </tr>\n",
" <tr>\n",
" <th>53879</th>\n",
" <td>ID_079fef6c2</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>48.456</td>\n",
" </tr>\n",
" <tr>\n",
" <th>245073</th>\n",
" <td>ID_5ab7d0d9a</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>53.533</td>\n",
" </tr>\n",
" <tr>\n",
" <th>287067</th>\n",
" <td>ID_54d628968</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>58.610</td>\n",
" </tr>\n",
" <tr>\n",
" <th>218156</th>\n",
" <td>ID_d3e4638f6</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>63.687</td>\n",
" </tr>\n",
" <tr>\n",
" <th>611699</th>\n",
" <td>ID_21af3f314</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>68.766</td>\n",
" </tr>\n",
" <tr>\n",
" <th>435682</th>\n",
" <td>ID_3eb115349</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>73.843</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10788</th>\n",
" <td>ID_c489f8a64</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>78.920</td>\n",
" </tr>\n",
" <tr>\n",
" <th>653377</th>\n",
" <td>ID_55e73915f</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>83.997</td>\n",
" </tr>\n",
" <tr>\n",
" <th>552646</th>\n",
" <td>ID_678a1a095</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>89.076</td>\n",
" </tr>\n",
" <tr>\n",
" <th>492896</th>\n",
" <td>ID_508bd479e</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>94.153</td>\n",
" </tr>\n",
" <tr>\n",
" <th>297718</th>\n",
" <td>ID_6609a6357</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>99.230</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>ID_231d901c1</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>104.307</td>\n",
" </tr>\n",
" <tr>\n",
" <th>214284</th>\n",
" <td>ID_826c83df4</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>109.386</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18784</th>\n",
" <td>ID_008507574</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>114.463</td>\n",
" </tr>\n",
" <tr>\n",
" <th>562637</th>\n",
" <td>ID_f49234b83</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>119.540</td>\n",
" </tr>\n",
" <tr>\n",
" <th>125258</th>\n",
" <td>ID_ec9041d07</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>124.617</td>\n",
" </tr>\n",
" <tr>\n",
" <th>486312</th>\n",
" <td>ID_bcbee580c</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>129.696</td>\n",
" </tr>\n",
" <tr>\n",
" <th>411439</th>\n",
" <td>ID_df700e73f</td>\n",
" <td>ID_b81a287f</td>\n",
" <td>134.773</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" SOPInstanceUID PatientID ImagePositionPatient2\n",
"214258 ID_9f601fc5d ID_b81a287f 7.956 \n",
"427173 ID_19cb96474 ID_b81a287f 13.033 \n",
"636486 ID_496ab2661 ID_b81a287f 18.110 \n",
"436756 ID_06fe4adc5 ID_b81a287f 23.187 \n",
"68576 ID_59bc3960f ID_b81a287f 28.266 \n",
"331217 ID_aef3564ac ID_b81a287f 33.343 \n",
"503169 ID_c1f0895bb ID_b81a287f 38.420 \n",
"467952 ID_c85081ef5 ID_b81a287f 43.497 \n",
"53879 ID_079fef6c2 ID_b81a287f 48.456 \n",
"245073 ID_5ab7d0d9a ID_b81a287f 53.533 \n",
"287067 ID_54d628968 ID_b81a287f 58.610 \n",
"218156 ID_d3e4638f6 ID_b81a287f 63.687 \n",
"611699 ID_21af3f314 ID_b81a287f 68.766 \n",
"435682 ID_3eb115349 ID_b81a287f 73.843 \n",
"10788 ID_c489f8a64 ID_b81a287f 78.920 \n",
"653377 ID_55e73915f ID_b81a287f 83.997 \n",
"552646 ID_678a1a095 ID_b81a287f 89.076 \n",
"492896 ID_508bd479e ID_b81a287f 94.153 \n",
"297718 ID_6609a6357 ID_b81a287f 99.230 \n",
"0 ID_231d901c1 ID_b81a287f 104.307 \n",
"214284 ID_826c83df4 ID_b81a287f 109.386 \n",
"18784 ID_008507574 ID_b81a287f 114.463 \n",
"562637 ID_f49234b83 ID_b81a287f 119.540 \n",
"125258 ID_ec9041d07 ID_b81a287f 124.617 \n",
"486312 ID_bcbee580c ID_b81a287f 129.696 \n",
"411439 ID_df700e73f ID_b81a287f 134.773 "
]
},
"metadata": {
"tags": []
},
"execution_count": 162
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "-_kOiXjQ3DRt",
"colab_type": "code",
"colab": {}
},
"source": [
"# Save the ordered dictionary to a file\n",
"with open(\"ordered_slices_by_patient.pkl\", \"wb\") as f:\n",
" pickle.dump(patient_slices_sorted, f)"
],
"execution_count": 0,
"outputs": []
}
]
}