{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.7.5"
    },
    "colab": {
      "name": "EDA and Data Cleaning.ipynb",
      "provenance": [],
      "collapsed_sections": []
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Xl323Q-gMutB",
        "colab_type": "text"
      },
      "source": [
        "Credits to Jeremy Howard, who discovered that some files have a corrupted rescale intercept and that other files show little or no brain matter (https://www.kaggle.com/jhoward/cleaning-the-data-for-rapid-prototyping-fastai)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YnJowERx3Sik",
        "colab_type": "text"
      },
      "source": [
        "# **Importing Dependencies**"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "CPe_QcSy3DLo",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import glob\n",
        "import os\n",
        "import pickle\n",
        "\n",
        "import pandas as pd\n",
        "import matplotlib.pyplot as plt\n",
        "\n",
        "import tqdm.notebook as tqdm\n",
        "\n",
        "pd.set_option('display.max_columns', 500)\n",
        "# None disables truncation of wide columns (replaces the -1 sentinel deprecated in pandas 1.0)\n",
        "pd.set_option('display.max_colwidth', None)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yObm5-yx4FSf",
        "colab_type": "text"
      },
      "source": [
        "# **Reading in Metadata & Data Labels**"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YA20uuhBVxAn",
        "colab_type": "text"
      },
      "source": [
        "DICOM files, the type of data we are dealing with here, contain not only images (such as a slice of a patient's brain) but also a large amount of metadata. Using parts of this metadata (specifically, patient IDs and slice positions) lets us \"piece together\" *full* scans of patient brains from otherwise unrelated images.\n",
        "\n",
        "This step is crucial for our approach: for our sequential model to make use of the spatial relations inherent in CT scans, we must first reconstruct those spatial relations."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "-7aoQvsa3DMP",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "labels = pd.read_feather(\"./labels_jhoward.fth\")\n",
        "train_metadata = pd.read_feather(\"./df_trn.fth\")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Ok_pdN2o3DMd",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Keep only the ID and \"any\" columns; .copy() avoids pandas' SettingWithCopyWarning\n",
        "labels = labels[[\"ID\", \"any\"]].copy()\n",
        "# Add the .png extension to the file IDs\n",
        "labels[\"ID\"] = labels[\"ID\"] + \".png\""
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "pRB1cxyd3DMg",
        "colab_type": "code",
        "outputId": "f3b361aa-0fe2-4f6a-977e-7957846c169e",
        "colab": {}
      },
      "source": [
        "# Verify that the above cell executed correctly\n",
        "labels.head()"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>ID</th>\n",
              "      <th>any</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>ID_000039fa0.png</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>ID_00005679d.png</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>ID_00008ce3c.png</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>ID_0000950d7.png</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>ID_0000aee4b.png</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                 ID  any\n",
              "0  ID_000039fa0.png  0  \n",
              "1  ID_00005679d.png  0  \n",
              "2  ID_00008ce3c.png  0  \n",
              "3  ID_0000950d7.png  0  \n",
              "4  ID_0000aee4b.png  0  "
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 99
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "czhB_jGTMRGn",
        "colab_type": "text"
      },
      "source": [
        "# **EDA**"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "GSdYiywS3DMp",
        "colab_type": "code",
        "outputId": "c8375c53-cb4c-4dcb-dcbe-746fc203665f",
        "colab": {}
      },
      "source": [
        "# There are ~674,000 labeled images, about 14% of which contain a hemorrhage\n",
        "# (see \"mean\": images with a hemorrhage are labeled 1, all others 0)\n",
        "labels.describe()"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>any</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>count</th>\n",
              "      <td>674258.000000</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>mean</th>\n",
              "      <td>0.144015</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>std</th>\n",
              "      <td>0.351105</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>min</th>\n",
              "      <td>0.000000</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>25%</th>\n",
              "      <td>0.000000</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>50%</th>\n",
              "      <td>0.000000</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>75%</th>\n",
              "      <td>0.000000</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>max</th>\n",
              "      <td>1.000000</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                 any\n",
              "count  674258.000000\n",
              "mean   0.144015     \n",
              "std    0.351105     \n",
              "min    0.000000     \n",
              "25%    0.000000     \n",
              "50%    0.000000     \n",
              "75%    0.000000     \n",
              "max    1.000000     "
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 100
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "7Aep4NdF3DMs",
        "colab_type": "code",
        "outputId": "44228e66-21e0-4eae-c68b-c97f71da97f1",
        "colab": {}
      },
      "source": [
        "# For reference, the full metadata contained in 5 DICOM files\n",
        "train_metadata.head()"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>SOPInstanceUID</th>\n",
              "      <th>Modality</th>\n",
              "      <th>PatientID</th>\n",
              "      <th>StudyInstanceUID</th>\n",
              "      <th>SeriesInstanceUID</th>\n",
              "      <th>StudyID</th>\n",
              "      <th>ImagePositionPatient</th>\n",
              "      <th>ImageOrientationPatient</th>\n",
              "      <th>SamplesPerPixel</th>\n",
              "      <th>PhotometricInterpretation</th>\n",
              "      <th>Rows</th>\n",
              "      <th>Columns</th>\n",
              "      <th>PixelSpacing</th>\n",
              "      <th>BitsAllocated</th>\n",
              "      <th>BitsStored</th>\n",
              "      <th>HighBit</th>\n",
              "      <th>PixelRepresentation</th>\n",
              "      <th>WindowCenter</th>\n",
              "      <th>WindowWidth</th>\n",
              "      <th>RescaleIntercept</th>\n",
              "      <th>RescaleSlope</th>\n",
              "      <th>fname</th>\n",
              "      <th>MultiImagePositionPatient</th>\n",
              "      <th>ImagePositionPatient1</th>\n",
              "      <th>ImagePositionPatient2</th>\n",
              "      <th>MultiImageOrientationPatient</th>\n",
              "      <th>ImageOrientationPatient1</th>\n",
              "      <th>ImageOrientationPatient2</th>\n",
              "      <th>ImageOrientationPatient3</th>\n",
              "      <th>ImageOrientationPatient4</th>\n",
              "      <th>ImageOrientationPatient5</th>\n",
              "      <th>MultiPixelSpacing</th>\n",
              "      <th>PixelSpacing1</th>\n",
              "      <th>img_min</th>\n",
              "      <th>img_max</th>\n",
              "      <th>img_mean</th>\n",
              "      <th>img_std</th>\n",
              "      <th>img_pct_window</th>\n",
              "      <th>MultiWindowCenter</th>\n",
              "      <th>WindowCenter1</th>\n",
              "      <th>MultiWindowWidth</th>\n",
              "      <th>WindowWidth1</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>ID_231d901c1</td>\n",
              "      <td>CT</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>ID_dd37ba3adb</td>\n",
              "      <td>ID_15dcd6057a</td>\n",
              "      <td></td>\n",
              "      <td>-125.0</td>\n",
              "      <td>1.0</td>\n",
              "      <td>1</td>\n",
              "      <td>MONOCHROME2</td>\n",
              "      <td>512</td>\n",
              "      <td>512</td>\n",
              "      <td>0.488281</td>\n",
              "      <td>16</td>\n",
              "      <td>16</td>\n",
              "      <td>15</td>\n",
              "      <td>1</td>\n",
              "      <td>40.0</td>\n",
              "      <td>100.0</td>\n",
              "      <td>-1024.0</td>\n",
              "      <td>1.0</td>\n",
              "      <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_231d901c1.dcm</td>\n",
              "      <td>1</td>\n",
              "      <td>-123.101000</td>\n",
              "      <td>104.307000</td>\n",
              "      <td>1</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.984808</td>\n",
              "      <td>-0.173648</td>\n",
              "      <td>1</td>\n",
              "      <td>0.488281</td>\n",
              "      <td>-1024</td>\n",
              "      <td>3263</td>\n",
              "      <td>171.462490</td>\n",
              "      <td>828.102464</td>\n",
              "      <td>0.164074</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>ID_994bc0470</td>\n",
              "      <td>CT</td>\n",
              "      <td>ID_400facde</td>\n",
              "      <td>ID_c5277f0c63</td>\n",
              "      <td>ID_4ba12c2161</td>\n",
              "      <td></td>\n",
              "      <td>-125.0</td>\n",
              "      <td>1.0</td>\n",
              "      <td>1</td>\n",
              "      <td>MONOCHROME2</td>\n",
              "      <td>512</td>\n",
              "      <td>512</td>\n",
              "      <td>0.488281</td>\n",
              "      <td>16</td>\n",
              "      <td>12</td>\n",
              "      <td>11</td>\n",
              "      <td>0</td>\n",
              "      <td>47.0</td>\n",
              "      <td>80.0</td>\n",
              "      <td>-1024.0</td>\n",
              "      <td>1.0</td>\n",
              "      <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_994bc0470.dcm</td>\n",
              "      <td>1</td>\n",
              "      <td>53.628222</td>\n",
              "      <td>223.572015</td>\n",
              "      <td>1</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.933580</td>\n",
              "      <td>-0.358368</td>\n",
              "      <td>1</td>\n",
              "      <td>0.488281</td>\n",
              "      <td>0</td>\n",
              "      <td>2507</td>\n",
              "      <td>430.418091</td>\n",
              "      <td>599.742963</td>\n",
              "      <td>0.198139</td>\n",
              "      <td>1.0</td>\n",
              "      <td>47.0</td>\n",
              "      <td>1.0</td>\n",
              "      <td>80.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>ID_127689cce</td>\n",
              "      <td>CT</td>\n",
              "      <td>ID_42910d3d</td>\n",
              "      <td>ID_db93ade25b</td>\n",
              "      <td>ID_c4b4931314</td>\n",
              "      <td></td>\n",
              "      <td>-125.0</td>\n",
              "      <td>1.0</td>\n",
              "      <td>1</td>\n",
              "      <td>MONOCHROME2</td>\n",
              "      <td>512</td>\n",
              "      <td>512</td>\n",
              "      <td>0.488281</td>\n",
              "      <td>16</td>\n",
              "      <td>16</td>\n",
              "      <td>15</td>\n",
              "      <td>1</td>\n",
              "      <td>30.0</td>\n",
              "      <td>80.0</td>\n",
              "      <td>-1024.0</td>\n",
              "      <td>1.0</td>\n",
              "      <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_127689cce.dcm</td>\n",
              "      <td>1</td>\n",
              "      <td>-123.646240</td>\n",
              "      <td>124.321068</td>\n",
              "      <td>1</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.972370</td>\n",
              "      <td>-0.233445</td>\n",
              "      <td>1</td>\n",
              "      <td>0.488281</td>\n",
              "      <td>-2000</td>\n",
              "      <td>2810</td>\n",
              "      <td>12.801376</td>\n",
              "      <td>1209.046168</td>\n",
              "      <td>0.250923</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>ID_25457734a</td>\n",
              "      <td>CT</td>\n",
              "      <td>ID_329aafa7</td>\n",
              "      <td>ID_8dd6d32f3b</td>\n",
              "      <td>ID_116558f409</td>\n",
              "      <td></td>\n",
              "      <td>-114.0</td>\n",
              "      <td>1.0</td>\n",
              "      <td>1</td>\n",
              "      <td>MONOCHROME2</td>\n",
              "      <td>512</td>\n",
              "      <td>512</td>\n",
              "      <td>0.445312</td>\n",
              "      <td>16</td>\n",
              "      <td>12</td>\n",
              "      <td>11</td>\n",
              "      <td>0</td>\n",
              "      <td>36.0</td>\n",
              "      <td>80.0</td>\n",
              "      <td>-1024.0</td>\n",
              "      <td>1.0</td>\n",
              "      <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_25457734a.dcm</td>\n",
              "      <td>1</td>\n",
              "      <td>-6.000000</td>\n",
              "      <td>171.999939</td>\n",
              "      <td>1</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>1.000000</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>1</td>\n",
              "      <td>0.445312</td>\n",
              "      <td>0</td>\n",
              "      <td>2647</td>\n",
              "      <td>566.557011</td>\n",
              "      <td>610.152845</td>\n",
              "      <td>0.298386</td>\n",
              "      <td>1.0</td>\n",
              "      <td>36.0</td>\n",
              "      <td>1.0</td>\n",
              "      <td>80.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>ID_81c9aa125</td>\n",
              "      <td>CT</td>\n",
              "      <td>ID_6b544c3c</td>\n",
              "      <td>ID_2685c5d5c0</td>\n",
              "      <td>ID_f56d7bd0f9</td>\n",
              "      <td></td>\n",
              "      <td>-115.0</td>\n",
              "      <td>1.0</td>\n",
              "      <td>1</td>\n",
              "      <td>MONOCHROME2</td>\n",
              "      <td>512</td>\n",
              "      <td>512</td>\n",
              "      <td>0.449219</td>\n",
              "      <td>16</td>\n",
              "      <td>12</td>\n",
              "      <td>11</td>\n",
              "      <td>0</td>\n",
              "      <td>36.0</td>\n",
              "      <td>80.0</td>\n",
              "      <td>-1024.0</td>\n",
              "      <td>1.0</td>\n",
              "      <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_81c9aa125.dcm</td>\n",
              "      <td>1</td>\n",
              "      <td>-1.000000</td>\n",
              "      <td>230.500000</td>\n",
              "      <td>1</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>1.000000</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>1</td>\n",
              "      <td>0.449219</td>\n",
              "      <td>4</td>\n",
              "      <td>1570</td>\n",
              "      <td>178.512295</td>\n",
              "      <td>358.235071</td>\n",
              "      <td>0.006176</td>\n",
              "      <td>1.0</td>\n",
              "      <td>36.0</td>\n",
              "      <td>1.0</td>\n",
              "      <td>80.0</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "  SOPInstanceUID Modality    PatientID StudyInstanceUID SeriesInstanceUID  \\\n",
              "0  ID_231d901c1   CT       ID_b81a287f  ID_dd37ba3adb    ID_15dcd6057a      \n",
              "1  ID_994bc0470   CT       ID_400facde  ID_c5277f0c63    ID_4ba12c2161      \n",
              "2  ID_127689cce   CT       ID_42910d3d  ID_db93ade25b    ID_c4b4931314      \n",
              "3  ID_25457734a   CT       ID_329aafa7  ID_8dd6d32f3b    ID_116558f409      \n",
              "4  ID_81c9aa125   CT       ID_6b544c3c  ID_2685c5d5c0    ID_f56d7bd0f9      \n",
              "\n",
              "  StudyID  ImagePositionPatient  ImageOrientationPatient  SamplesPerPixel  \\\n",
              "0         -125.0                 1.0                      1                 \n",
              "1         -125.0                 1.0                      1                 \n",
              "2         -125.0                 1.0                      1                 \n",
              "3         -114.0                 1.0                      1                 \n",
              "4         -115.0                 1.0                      1                 \n",
              "\n",
              "  PhotometricInterpretation  Rows  Columns  PixelSpacing  BitsAllocated  \\\n",
              "0  MONOCHROME2               512   512      0.488281      16              \n",
              "1  MONOCHROME2               512   512      0.488281      16              \n",
              "2  MONOCHROME2               512   512      0.488281      16              \n",
              "3  MONOCHROME2               512   512      0.445312      16              \n",
              "4  MONOCHROME2               512   512      0.449219      16              \n",
              "\n",
              "   BitsStored  HighBit  PixelRepresentation  WindowCenter  WindowWidth  \\\n",
              "0  16          15       1                    40.0          100.0         \n",
              "1  12          11       0                    47.0          80.0          \n",
              "2  16          15       1                    30.0          80.0          \n",
              "3  12          11       0                    36.0          80.0          \n",
              "4  12          11       0                    36.0          80.0          \n",
              "\n",
              "   RescaleIntercept  RescaleSlope  \\\n",
              "0 -1024.0            1.0            \n",
              "1 -1024.0            1.0            \n",
              "2 -1024.0            1.0            \n",
              "3 -1024.0            1.0            \n",
              "4 -1024.0            1.0            \n",
              "\n",
              "                                                                                   fname  \\\n",
              "0  ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_231d901c1.dcm   \n",
              "1  ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_994bc0470.dcm   \n",
              "2  ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_127689cce.dcm   \n",
              "3  ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_25457734a.dcm   \n",
              "4  ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_81c9aa125.dcm   \n",
              "\n",
              "   MultiImagePositionPatient  ImagePositionPatient1  ImagePositionPatient2  \\\n",
              "0  1                         -123.101000             104.307000              \n",
              "1  1                          53.628222              223.572015              \n",
              "2  1                         -123.646240             124.321068              \n",
              "3  1                         -6.000000               171.999939              \n",
              "4  1                         -1.000000               230.500000              \n",
              "\n",
              "   MultiImageOrientationPatient  ImageOrientationPatient1  \\\n",
              "0  1                             0.0                        \n",
              "1  1                             0.0                        \n",
              "2  1                             0.0                        \n",
              "3  1                             0.0                        \n",
              "4  1                             0.0                        \n",
              "\n",
              "   ImageOrientationPatient2  ImageOrientationPatient3  \\\n",
              "0  0.0                       0.0                        \n",
              "1  0.0                       0.0                        \n",
              "2  0.0                       0.0                        \n",
              "3  0.0                       0.0                        \n",
              "4  0.0                       0.0                        \n",
              "\n",
              "   ImageOrientationPatient4  ImageOrientationPatient5  MultiPixelSpacing  \\\n",
              "0  0.984808                 -0.173648                  1                   \n",
              "1  0.933580                 -0.358368                  1                   \n",
              "2  0.972370                 -0.233445                  1                   \n",
              "3  1.000000                  0.000000                  1                   \n",
              "4  1.000000                  0.000000                  1                   \n",
              "\n",
              "   PixelSpacing1  img_min  img_max    img_mean      img_std  img_pct_window  \\\n",
              "0  0.488281      -1024     3263     171.462490  828.102464   0.164074         \n",
              "1  0.488281       0        2507     430.418091  599.742963   0.198139         \n",
              "2  0.488281      -2000     2810     12.801376   1209.046168  0.250923         \n",
              "3  0.445312       0        2647     566.557011  610.152845   0.298386         \n",
              "4  0.449219       4        1570     178.512295  358.235071   0.006176         \n",
              "\n",
              "   MultiWindowCenter  WindowCenter1  MultiWindowWidth  WindowWidth1  \n",
              "0 NaN                NaN            NaN               NaN            \n",
              "1  1.0                47.0           1.0               80.0          \n",
              "2 NaN                NaN            NaN               NaN            \n",
              "3  1.0                36.0           1.0               80.0          \n",
              "4  1.0                36.0           1.0               80.0          "
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 101
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6Xai9HMWZIjT",
        "colab_type": "text"
      },
      "source": [
        "Sorting by patient ID groups each patient's images together, while sorting by \"ImagePositionPatient2\" puts each patient's brain slices in the correct spatial order (thus, the 20 rows output by the cell below are consecutive slices of a single patient's brain)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "scrolled": true,
        "id": "laKYqlaO3DMy",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "train_metadata.sort_values(by=[\"PatientID\", \"ImagePositionPatient2\"]).head(20)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "RgF5wYFK3DM-",
        "colab_type": "code",
        "outputId": "4f8c3482-d623-4668-83dc-972423dd2bc3",
        "colab": {}
      },
      "source": [
        "# Verify that we can retrieve the name of our files from the metadata\n",
        "# (important for matching our PNGs extracted from DICOM files to their metadata)\n",
        "train_metadata[[\"SOPInstanceUID\", \"fname\"]].sort_values(\"SOPInstanceUID\")"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>SOPInstanceUID</th>\n",
              "      <th>fname</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>409738</th>\n",
              "      <td>ID_000039fa0</td>\n",
              "      <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_000039fa0.dcm</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>470057</th>\n",
              "      <td>ID_00005679d</td>\n",
              "      <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_00005679d.dcm</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>548095</th>\n",
              "      <td>ID_00008ce3c</td>\n",
              "      <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_00008ce3c.dcm</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>204704</th>\n",
              "      <td>ID_0000950d7</td>\n",
              "      <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_0000950d7.dcm</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>291987</th>\n",
              "      <td>ID_0000aee4b</td>\n",
              "      <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_0000aee4b.dcm</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>...</th>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>544908</th>\n",
              "      <td>ID_ffff73ede</td>\n",
              "      <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_ffff73ede.dcm</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>385867</th>\n",
              "      <td>ID_ffff80705</td>\n",
              "      <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_ffff80705.dcm</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>674027</th>\n",
              "      <td>ID_ffff82e46</td>\n",
              "      <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_ffff82e46.dcm</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>52232</th>\n",
              "      <td>ID_ffff922b9</td>\n",
              "      <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_ffff922b9.dcm</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5317</th>\n",
              "      <td>ID_fffff9393</td>\n",
              "      <td>../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_fffff9393.dcm</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "<p>674258 rows × 2 columns</p>\n",
              "</div>"
            ],
            "text/plain": [
              "       SOPInstanceUID  \\\n",
              "409738  ID_000039fa0    \n",
              "470057  ID_00005679d    \n",
              "548095  ID_00008ce3c    \n",
              "204704  ID_0000950d7    \n",
              "291987  ID_0000aee4b    \n",
              "...              ...    \n",
              "544908  ID_ffff73ede    \n",
              "385867  ID_ffff80705    \n",
              "674027  ID_ffff82e46    \n",
              "52232   ID_ffff922b9    \n",
              "5317    ID_fffff9393    \n",
              "\n",
              "                                                                                        fname  \n",
              "409738  ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_000039fa0.dcm  \n",
              "470057  ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_00005679d.dcm  \n",
              "548095  ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_00008ce3c.dcm  \n",
              "204704  ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_0000950d7.dcm  \n",
              "291987  ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_0000aee4b.dcm  \n",
              "...                                                                                       ...  \n",
              "544908  ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_ffff73ede.dcm  \n",
              "385867  ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_ffff80705.dcm  \n",
              "674027  ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_ffff82e46.dcm  \n",
              "52232   ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_ffff922b9.dcm  \n",
              "5317    ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_fffff9393.dcm  \n",
              "\n",
              "[674258 rows x 2 columns]"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 103
        }
      ]
    },
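    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The table above only shows the two columns side by side. As a minimal sketch of an explicit check, assuming every `fname` ends in `<SOPInstanceUID>.dcm` as in the rows displayed and that `train_metadata` is the DataFrame loaded earlier, the cell below counts rows where the two disagree. The names `fname_ids` and `mismatches` are illustrative and not part of the original analysis."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# Sketch: strip the directory and the .dcm extension from each path\n",
        "fname_ids = train_metadata[\"fname\"].map(lambda p: os.path.splitext(os.path.basename(p))[0])\n",
        "# Rows whose file name does not match their SOPInstanceUID (expected to be none)\n",
        "mismatches = train_metadata[fname_ids != train_metadata[\"SOPInstanceUID\"]]\n",
        "print(f\"{len(mismatches)} of {len(train_metadata)} rows have a mismatched file name\")"
      ],
      "execution_count": 0,
      "outputs": []
    },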
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "P8goUuHTaBA9",
        "colab_type": "text"
      },
      "source": [
        "Next, we want to know how many slices of a patient's brain the CT scans in our data usually contain. The histogram below tells us that the answer is ~30."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Tg-1hmrC3DND",
        "colab_type": "code",
        "outputId": "0bfdf379-e0c9-4eeb-f97d-8b2a70e5853d",
        "colab": {}
      },
      "source": [
        "plt.figure(figsize=(20, 6))\n",
        "train_metadata.groupby(\"PatientID\").Modality.count().hist(bins=150)\n",
        "plt.show()"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "image/png": "iVBORw0KGgoAAAANSUhEUgAABIoAAAFlCAYAAACEOwMFAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8li6FKAAAbUUlEQVR4nO3db4xd9Xkn8O8TO6EolAaUMEIYrVnJ6pY/SrpYLKuoqyFki3eJCm+QXNHGrFhZimiVSkit6ZuqL5B4VTVRS7RWksVR0lpW2wgrlLTI7ShaiYRAmy4BgrCCl7h48bZNWpwXtKbPvphftnfN2L5jj2fGcz8f6eqe89zfmfO76PlF8VfnnFvdHQAAAAB411pPAAAAAID1QVAEAAAAQBJBEQAAAACDoAgAAACAJIIiAAAAAAZBEQAAAABJks1rPYGzef/7399bt26devwPf/jDvPe9771wE4J1TP8zy/Q/s0z/M+usAWaZ/udcPffcc3/T3R84tb7ug6KtW7fm2WefnXr8wsJC5ufnL9yEYB3T/8wy/c8s0//MOmuAWab/OVdV9b+Wqrv1DAAAAIAkgiIAAAAABkERAAAAAEkERQAAAAAMgiIAAAAAkgiKAAAAABgERQAAAAAkERQBAAAAMAiKAAAAAEgiKAIAAABgEBQBAAAAkERQBAAAAMAgKAIAAAAgSbJ5rSfAudu654kl60ceuXOVZwIAAABsBK4oAgAAACCJoAgAAACAQVAEAAAAQBJBEQAAAACDoAgAAACAJIIiAAAAAAZBEQAAAABJBEUAAAAADFMFRVX1vqr6g6r6TlW9VFX/vqqurKqnquqV8X7FxPiHqupwVb1cVXdM1G+uqufHZ5+uqroQXwoAAACA5Zv2iqJPJflqd/+bJB9M8lKSPUkOdfe2JIfGfqrq+iQ7k9yQZEeSR6tq0/g7n0myO8m28dqxQt8DAAAAgPN01qCoqi5P8h+SfC5Juvsfu/sHSe5Ksm8M25fk7rF9V5L93f1Wd7+a5HCSW6rq6iSXd/fT3d1JvjBxDAAAAABrbPMUY/51kv+T5L9X1QeTPJfkk0nmuvtYknT3saq6aoy/JsnXJ44/Omr/NLZPrb9DVe3O4pVHmZuby8LCwrTfJydOnFjW+IvZgzedXLI+K9+fd5ql/odT6X9mmf5n1lkDzDL9z0qbJijanOTfJvnl7v5GVX0q4zaz01jquUN9hvo7i917k+xNku3bt/f8/PwU01y0sLCQ5Yy/mN2354kl60funV/dibBuzFL/w6n0P7NM/zPrrAFmmf5npU3zjKKjSY529zfG/h9kMTh6Y9xOlvF+fGL8tRPHb0ny+qhvWaIOAAAAwDpw1qCou/93ku9V1U+O0u1JXkxyMMmuUduV5PGxfTDJzqq6pKquy+JDq58Zt6m9WVW3jl87+/jEMQAAAACssWluPUuSX07ypap6T5LvJvkvWQyZDlTV/UleS3JPknT3C1V1IIth0skkD3T32+PvfCLJY0kuTfLkeAEAAACwDkwVFHX3t5JsX+Kj208z/uEkDy9RfzbJjcuZIAAAAACrY5pnFAEAAAAwAwRFAAAAACQRFAEAAAAwCIoAAAAASCIoAgAAAGAQFAEAAACQRFAEAAAAwCAoAgAAACCJoAgAAACAQVAEAAAAQBJBEQAAAACDoAgAAACAJIIiAAAAAAZBEQAAAABJBEUAAAAADIIiAAAAAJIIigAAAAAYBEUAAAAAJBEUAQAAADAIigAAAABIIigCAAAAYBAUAQAAAJBEUAQAAADAICgCAAAAIImgCAAAAIBBUAQAAABAEkERAAAAAIOgCAAAAIAkgiIAAAAABkERAAAAAEkERQAAAAAMgiIAAAAAkgiKAAAAABgERQAAAAAkERQBAAAAMAiKAAAAAEgiKAIAAABgEBQBAAAAkERQBAAAAMAgKAIAAAAgiaAIAAAAgGGqoK
iqjlTV81X1rap6dtSurKqnquqV8X7FxPiHqupwVb1cVXdM1G8ef+dwVX26qmrlvxIAAAAA52I5VxTd1t0f6u7tY39PkkPdvS3JobGfqro+yc4kNyTZkeTRqto0jvlMkt1Jto3XjvP/CgAAAACshPO59eyuJPvG9r4kd0/U93f3W939apLDSW6pqquTXN7dT3d3J/nCxDEAAAAArLFazGzOMqjq1STfT9JJ/lt3762qH3T3+ybGfL+7r6iq30ny9e7+4qh/LsmTSY4keaS7PzrqP5Pk17r7Y0ucb3cWrzzK3Nzczfv375/6C504cSKXXXbZ1OMvZs//9d8vWb/pmp9Y5ZmwXsxS/8Op9D+zTP8z66wBZpn+51zddtttz03cNfb/bJ7y+A939+tVdVWSp6rqO2cYu9Rzh/oM9XcWu/cm2Zsk27dv7/n5+SmnmSwsLGQ54y9m9+15Ysn6kXvnV3cirBuz1P9wKv3PLNP/zDprgFmm/1lpU9161t2vj/fjSb6c5JYkb4zbyTLej4/hR5NcO3H4liSvj/qWJeoAAAAArANnDYqq6r1V9eM/2k7ys0m+neRgkl1j2K4kj4/tg0l2VtUlVXVdFh9a/Ux3H0vyZlXdOn7t7OMTxwAAAACwxqa59WwuyZfHL9lvTvJ73f3VqvpmkgNVdX+S15LckyTd/UJVHUjyYpKTSR7o7rfH3/pEkseSXJrF5xY9uYLfBQAAAIDzcNagqLu/m+SDS9T/Nsntpznm4SQPL1F/NsmNy58mAAAAABfaVM8oAgAAAGDjExQBAAAAkERQBAAAAMAgKAIAAAAgiaAIAAAAgEFQBAAAAEASQREAAAAAg6AIAAAAgCSCIgAAAAAGQREAAAAASQRFAAAAAAyCIgAAAACSCIoAAAAAGARFAAAAACQRFAEAAAAwCIoAAAAASCIoAgAAAGAQFAEAAACQRFAEAAAAwCAoAgAAACCJoAgAAACAQVAEAAAAQBJBEQAAAACDoAgAAACAJIIiAAAAAAZBEQAAAABJBEUAAAAADIIiAAAAAJIIigAAAAAYBEUAAAAAJBEUAQAAADAIigAAAABIIigCAAAAYBAUAQAAAJBEUAQAAADAICgCAAAAIImgCAAAAIBBUAQAAABAEkERAAAAAIOgCAAAAIAkywiKqmpTVf1lVX1l7F9ZVU9V1Svj/YqJsQ9V1eGqermq7pio31xVz4/PPl1VtbJfBwAAAIBztZwrij6Z5KWJ/T1JDnX3tiSHxn6q6vokO5PckGRHkkeratM45jNJdifZNl47zmv2AAAAAKyYqYKiqtqS5M4kn50o35Vk39jel+Tuifr+7n6ru19NcjjJLVV1dZLLu/vp7u4kX5g4BgAAAIA1Nu0VRb+d5FeT/PNEba67jyXJeL9q1K9J8r2JcUdH7ZqxfWodAAAAgHVg89kGVNXHkhzv7ueqan6Kv7nUc4f6DPWlzrk7i7eoZW5uLgsLC1OcdtGJEyeWNf5i9uBNJ5esz8r3551mqf/hVPqfWab/mXXWALNM/7PSzhoUJflwkp+rqv+c5MeSXF5VX0zyRlVd3d3Hxm1lx8f4o0munTh+S5LXR33LEvV36O69SfYmyfbt23t+fn7qL7SwsJDljL+Y3bfniSXrR+6dX92JsG7MUv/DqfQ/s0z/M+usAWaZ/melnfXWs+5+qLu3dPfWLD6k+s+6+xeSHEyyawzbleTxsX0wyc6quqSqrsviQ6ufGbenvVlVt45fO/v4xDEAAAAArLFprig6nUeSHKiq+5O8luSeJOnuF6rqQJIXk5xM8kB3vz2O+USSx5JcmuTJ8QIAAABgHVhWUNTdC0kWxvbfJrn9NOMeTvLwEvVnk9y43EkCAAAAcOFN+6tnAAAAAGxwgiIAAAAAkgiKAAAAABgERQAAAAAkERQBAAAAMAiKAAAAAEgiKAIAAABgEBQBAAAAkERQBAAAAMAgKAIAAAAgiaAIAAAAgEFQBAAAAEASQREAAAAAg6
AIAAAAgCSCIgAAAAAGQREAAAAASQRFAAAAAAyCIgAAAACSCIoAAAAAGARFAAAAACQRFAEAAAAwCIoAAAAASCIoAgAAAGAQFAEAAACQRFAEAAAAwCAoAgAAACCJoAgAAACAQVAEAAAAQBJBEQAAAACDoAgAAACAJIIiAAAAAAZBEQAAAABJBEUAAAAADIIiAAAAAJIIigAAAAAYBEUAAAAAJBEUAQAAADAIigAAAABIIigCAAAAYBAUAQAAAJBEUAQAAADAcNagqKp+rKqeqaq/qqoXquo3R/3Kqnqqql4Z71dMHPNQVR2uqper6o6J+s1V9fz47NNVVRfmawEAAACwXNNcUfRWko909weTfCjJjqq6NcmeJIe6e1uSQ2M/VXV9kp1JbkiyI8mjVbVp/K3PJNmdZNt47VjB7wIAAADAeThrUNSLTozdd49XJ7kryb5R35fk7rF9V5L93f1Wd7+a5HCSW6rq6iSXd/fT3d1JvjBxDAAAAABrbKpnFFXVpqr6VpLjSZ7q7m8kmevuY0ky3q8aw69J8r2Jw4+O2jVj+9Q6AAAAAOvA5mkGdffbST5UVe9L8uWquvEMw5d67lCfof7OP1C1O4u3qGVubi4LCwvTTDNJcuLEiWWNv5g9eNPJJeuz8v15p1nqfziV/meW6X9mnTXALNP/rLSpgqIf6e4fVNVCFp8t9EZVXd3dx8ZtZcfHsKNJrp04bEuS10d9yxL1pc6zN8neJNm+fXvPz89PPceFhYUsZ/zF7L49TyxZP3Lv/OpOhHVjlvofTqX/mWX6n1lnDTDL9D8rbZpfPfvAuJIoVXVpko8m+U6Sg0l2jWG7kjw+tg8m2VlVl1TVdVl8aPUz4/a0N6vq1vFrZx+fOAYAAACANTbNFUVXJ9k3frnsXUkOdPdXqurpJAeq6v4kryW5J0m6+4WqOpDkxSQnkzwwbl1Lkk8keSzJpUmeHC/OYOtprhoCAAAAWGlnDYq6+38m+ekl6n+b5PbTHPNwkoeXqD+b5EzPNwIAAABgjUz1q2cAAAAAbHyCIgAAAACSCIoAAAAAGARFAAAAACQRFAEAAAAwCIoAAAAASCIoAgAAAGAQFAEAAACQRFAEAAAAwCAoAgAAACCJoAgAAACAQVAEAAAAQBJBEQAAAADD5rWeACtv654nTvvZkUfuXMWZAAAAABcTVxQBAAAAkERQBAAAAMAgKAIAAAAgiaAIAAAAgEFQBAAAAEASQREAAAAAg6AIAAAAgCSCIgAAAAAGQREAAAAASQRFAAAAAAyCIgAAAACSJJvXegKsrq17njjtZ0ceuXMVZwIAAACsN64oAgAAACCJoAgAAACAQVAEAAAAQBJBEQAAAACDoAgAAACAJIIiAAAAAAZBEQAAAABJBEUAAAAADIIiAAAAAJIIigAAAAAYBEUAAAAAJBEUAQAAADAIigAAAABIIigCAAAAYBAUAQAAAJBkiqCoqq6tqj+vqpeq6oWq+uSoX1lVT1XVK+P9ioljHqqqw1X1clXdMVG/uaqeH599uqrqwnwtAAAAAJZrmiuKTiZ5sLt/KsmtSR6oquuT7ElyqLu3JTk09jM+25nkhiQ7kjxaVZvG3/pMkt1Jto3XjhX8LgAAAACch7MGRd19rLv/Ymy/meSlJNckuSvJvjFsX5K7x/ZdSfZ391vd/WqSw0luqaqrk1ze3U93dyf5wsQxAAAAAKyxWsxsphxctTXJ15LcmOS17n7fxGff7+4rqup3kny9u7846p9L8mSSI0ke6e6PjvrPJPm17v7YEufZncUrjzI3N3fz/v37p57jiRMnctlll009fr17/q//ftXOddM1P7Fq5+LC2Gj9D8uh/5ll+p9ZZw0wy/Q/5+q22257rru3n1rfPO0fqKrLkvxhkl/p7n84w+OFlvqgz1B/Z7F7b5K9SbJ9+/aen5+fdppZWFjIcsavd/fteWLVznXk3vlVOxcXxkbrf1gO/c8s0//MOmuAWab/WWlT/epZVb
07iyHRl7r7j0b5jXE7Wcb78VE/muTaicO3JHl91LcsUQcAAABgHZjmV88qyeeSvNTdvzXx0cEku8b2riSPT9R3VtUlVXVdFh9a/Ux3H0vyZlXdOv7mxyeOAQAAAGCNTXPr2YeT/GKS56vqW6P260keSXKgqu5P8lqSe5Kku1+oqgNJXsziL6Y90N1vj+M+keSxJJdm8blFT67Q9wAAAADgPJ01KOru/5Glny+UJLef5piHkzy8RP3ZLD4IGwAAAIB1ZqpnFAEAAACw8QmKAAAAAEgiKAIAAABgEBQBAAAAkERQBAAAAMAgKAIAAAAgiaAIAAAAgEFQBAAAAEASQREAAAAAg6AIAAAAgCSCIgAAAAAGQREAAAAASQRFAAAAAAyCIgAAAACSCIoAAAAAGARFAAAAACQRFAEAAAAwCIoAAAAASCIoAgAAAGAQFAEAAACQRFAEAAAAwCAoAgAAACCJoAgAAACAQVAEAAAAQBJBEQAAAACDoAgAAACAJIIiAAAAAAZBEQAAAABJBEUAAAAADIIiAAAAAJIIigAAAAAYBEUAAAAAJBEUAQAAADAIigAAAABIIigCAAAAYBAUAQAAAJBEUAQAAADAICgCAAAAIImgCAAAAIBBUAQAAABAEkERAAAAAMNZg6Kq+nxVHa+qb0/Urqyqp6rqlfF+xcRnD1XV4ap6uarumKjfXFXPj88+XVW18l8HAAAAgHM1zRVFjyXZcUptT5JD3b0tyaGxn6q6PsnOJDeMYx6tqk3jmM8k2Z1k23id+jcBAAAAWENnDYq6+2tJ/u6U8l1J9o3tfUnunqjv7+63uvvVJIeT3FJVVye5vLuf7u5O8oWJYwAAAABYBzaf43Fz3X0sSbr7WFVdNerXJPn6xLijo/ZPY/vU+pKqancWrz7K3NxcFhYWpp7YiRMnljV+vXvwppOrdq6N9N9tVm20/ofl0P/MMv3PrLMGmGX6n5V2rkHR6Sz13KE+Q31J3b03yd4k2b59e8/Pz089gYWFhSxn/Hp3354nVu1cR+6dX7VzcWFstP6H5dD/zDL9z6yzBphl+p+Vdq6/evbGuJ0s4/34qB9Ncu3EuC1JXh/1LUvUAQAAAFgnzjUoOphk19jeleTxifrOqrqkqq7L4kOrnxm3qb1ZVbeOXzv7+MQxAAAAAKwDZ731rKp+P8l8kvdX1dEkv5HkkSQHqur+JK8luSdJuvuFqjqQ5MUkJ5M80N1vjz/1iSz+gtqlSZ4cLwAAAADWibMGRd3986f56PbTjH84ycNL1J9NcuOyZgcAAADAqjnXW88AAAAA2GAERQAAAAAkERQBAAAAMAiKAAAAAEgiKAIAAABgEBQBAAAAkERQBAAAAMAgKAIAAAAgiaAIAAAAgEFQBAAAAEASQREAAAAAg6AIAAAAgCSCIgAAAACGzWs9AdaPrXueOO1nRx65cxVnAgAAAKwFVxQBAAAAkERQBAAAAMAgKAIAAAAgiaAIAAAAgEFQBAAAAEASQREAAAAAg6AIAAAAgCSCIgAAAAAGQREAAAAASQRFAAAAAAyCIgAAAACSCIoAAAAAGARFAAAAACQRFAEAAAAwbF7rCXBx2LrnidN+duSRO1dxJgAAAMCF4ooiAAAAAJIIigAAAAAY3HrGhuDWOAAAADh/rigCAAAAIImgCAAAAIDBrWdcVM50ixkAAABwflxRBAAAAEASVxStC66S+f+t9H8PD7oGAACA6QiKmGlCJAAAAPgXgiIuGCEMAAAAXFw8owgAAACAJGtwRVFV7UjyqSSbkny2ux9Z7Tmw9i7m5zK5UgoAAICNalWDoqralOR3k/zHJEeTfLOqDnb3i6s5D1bWxRz6AAAAAP9ita8ouiXJ4e7+bpJU1f4kdyURFLHunEsAdq6h2blcibTUuR686WTuO8scznSu083flVIAAACzYbWDomuSfG9i/2iSf7fKc1gTrrrhTFazP1YzADsX5xJkrSenm/9qhojnc75zmYeAEQAANo
7q7tU7WdU9Se7o7v869n8xyS3d/cunjNudZPfY/ckkLy/jNO9P8jcrMF24GOl/Zpn+Z5bpf2adNcAs0/+cq3/V3R84tbjaVxQdTXLtxP6WJK+fOqi79ybZey4nqKpnu3v7uU0PLm76n1mm/5ll+p9ZZw0wy/Q/K+1dq3y+bybZVlXXVdV7kuxMcnCV5wAAAADAElb1iqLuPllVv5TkT5JsSvL57n5hNecAAAAAwNJW+9azdPcfJ/njC3iKc7plDTYI/c8s0//MMv3PrLMGmGX6nxW1qg+zBgAAAGD9Wu1nFAEAAACwTm2YoKiqdlTVy1V1uKr2rPV84EKoqs9X1fGq+vZE7cqqeqqqXhnvV0x89tBYEy9X1R1rM2s4f1V1bVX9eVW9VFUvVNUnR13/MxOq6seq6pmq+quxBn5z1K0BZkJVbaqqv6yqr4x9vc/MqKojVfV8VX2rqp4dNWuAC2ZDBEVVtSnJ7yb5T0muT/LzVXX92s4KLojHkuw4pbYnyaHu3pbk0NjPWAM7k9wwjnl0rBW4GJ1M8mB3/1SSW5M8MHpc/zMr3kryke7+YJIPJdlRVbfGGmB2fDLJSxP7ep9Zc1t3f6i7t499a4ALZkMERUluSXK4u7/b3f+YZH+Su9Z4TrDiuvtrSf7ulPJdSfaN7X1J7p6o7+/ut7r71SSHs7hW4KLT3ce6+y/G9ptZ/MfCNdH/zIhedGLsvnu8OtYAM6CqtiS5M8lnJ8p6n1lnDXDBbJSg6Jok35vYPzpqMAvmuvtYsviP6SRXjbp1wYZUVVuT/HSSb0T/M0PGrTffSnI8yVPdbQ0wK347ya8m+eeJmt5nlnSSP62q56pq96hZA1wwm9d6Aiuklqj5OTdmnXXBhlNVlyX5wyS/0t3/ULVUmy8OXaKm/7modffbST5UVe9L8uWquvEMw60BNoSq+liS4939XFXNT3PIEjW9z8Xuw939elVdleSpqvrOGcZaA5y3jXJF0dEk107sb0ny+hrNBVbbG1V1dZKM9+Ojbl2woVTVu7MYEn2pu/9olPU/M6e7f5BkIYvPnrAG2Og+nOTnqupIFh8v8ZGq+mL0PjOku18f78eTfDmLt5JZA1wwGyUo+maSbVV1XVW9J4sP7zq4xnOC1XIwya6xvSvJ4xP1nVV1SVVdl2RbkmfWYH5w3mrx0qHPJXmpu39r4iP9z0yoqg+MK4lSVZcm+WiS78QaYIPr7oe6e0t3b83i/8f/s+7+heh9ZkRVvbeqfvxH20l+Nsm3Yw1wAW2IW8+6+2RV/VKSP0myKcnnu/uFNZ4WrLiq+v0k80neX1VHk/xGkkeSHKiq+5O8luSeJOnuF6rqQJIXs/iLUQ+M2xbgYvThJL+Y5PnxjJYk+fXof2bH1Un2jV+ueVeSA939lap6OtYAs8n//jMr5rJ4u3Gy+O/33+vur1bVN2MNcIFUt9sVAQAAANg4t54BAAAAcJ4ERQAAAAAkERQBAAAAMAiKAAAAAEgiKAIAAABgEBQBAAAAkERQBAAAAMAgKAIAAAAgSfJ/AS9YWdOz7GB3AAAAAElFTkSuQmCC\n",
            "text/plain": [
              "<Figure size 1440x432 with 1 Axes>"
            ]
          },
          "metadata": {
            "tags": [],
            "needs_background": "light"
          }
        }
      ]
    },
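    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To put a number on the histogram, the following sketch summarises the slice counts, again assuming `train_metadata` from the cells above. Because a patient may have more than one study, it also groups by `StudyInstanceUID` for comparison. The names `slices_per_patient` and `slices_per_study` are illustrative and not part of the original analysis."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# Sketch: number of metadata rows (slices) per patient and per study\n",
        "slices_per_patient = train_metadata.groupby(\"PatientID\").size()\n",
        "slices_per_study = train_metadata.groupby(\"StudyInstanceUID\").size()\n",
        "# Side-by-side summary statistics (count, mean, quartiles, max)\n",
        "pd.DataFrame({\"per_patient\": slices_per_patient.describe(), \"per_study\": slices_per_study.describe()})"
      ],
      "execution_count": 0,
      "outputs": []
    },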
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Nn0cztnmajeI",
        "colab_type": "text"
      },
      "source": [
        "# **Data Cleaning**"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "10FFXYu53DNM",
        "colab_type": "text"
      },
      "source": [
        "## **Step 1: Removing Files w/ Incorrect Rescale Intercept**"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tSztirfuaz-t",
        "colab_type": "text"
      },
      "source": [
        "The rescale intercept of our DICOM files (see metadata) should be -1024 for all files. However, as the histogram below shows, some fraction of files are corrupted: Their rescale intercept is much larger."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "pAgE8QJI3DNN",
        "colab_type": "code",
        "outputId": "e1c38fcb-0e95-47c4-cc2f-ed0b1d2f1a2a",
        "colab": {}
      },
      "source": [
        "train_metadata.RescaleIntercept.hist()\n",
        "plt.show()"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYoAAAD8CAYAAABpcuN4AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8li6FKAAAZx0lEQVR4nO3df5DU933f8ecrXIypXGSQzJZyTMEj7AaJ2CkXRMZtZ5VzATsZo8xI7XnU6pQwQ0Koa3fopBD/wVQaZkRjVQ2TShnGogLFDVASj5jIBJ9Rt53OYH7IkYORTLlYWFygovERReeOSE9+94/9XPiy3vvs3o+9O/Zej5md3X1/v5/Pft4nya/7/tizIgIzM7PR/MR0L8DMzGY2B4WZmWU5KMzMLMtBYWZmWQ4KMzPLclCYmVlWw6CQ9FFJrxYefyXpC5IWSuqTdDE9LyiM2SGpX9IFSesL9dWSzqVteyQp1edKOpTqpyQtK4zpTZ9xUVLv5LZvZmaNaCzfo5A0B/hz4H5gKzAYEU9K2g4siIh/K2kl8PvAGuDvAt8APhIR70k6DXwe+CbwNWBPRByT9OvAT0fEr0nqAX4pIv6ZpIXAWaALCOAVYHVEXJ+c9s3MrJGxnnrqBv4sIr4PbAT2p/p+4MH0eiNwMCJuRMQbQD+wRtJiYH5EnIxqOh2oGTMy1xGgOx1trAf6ImIwhUMfsGHMXZqZ2bh1jHH/HqpHCwCliLgKEBFXJS1K9SVUjxhGDKTa/0uva+sjYy6nuYYlvQ3cVazXGVPX3XffHcuWLRtbV0364Q9/yB133NGSuWea2dQrzK5+3Wv7mki/r7zyyl9ExIfqbWs6KCS9D/gMsKPRrnVqkamPd0xxbZuBzQClUokvfelLDZY4PkNDQ3zgAx9oydwzzWzqFWZXv+61fU2k3wceeOD7o20byxHFp4BvRcRb6f1bkhano4nFwLVUHwCWFsZ1AldSvbNOvThmQFIHcCcwmOrlmjGV2oVFxF5gL0BXV1eUy+XaXSZFpVKhVXPPNLOpV5hd/brX9tWqfsdyjeKz3DztBHAUGLkLqRd4sVDvSXcyLQdWAKfTaap3JK1N1x8erRkzMtdDwMvpOsZxYJ2kBemuqnWpZmZmU6SpIwpJfwv4J8CvFspPAoclbQLeBB4GiIjzkg4DrwHDwNaIeC+N2QI8D8wDjqUHwHPAC5L6qR5J9KS5BiU9AZxJ+z0eEYPj6NPMzMapqaCIiP9L9eJysfYDqndB1dt/F7CrTv0scF+d+rukoKmzbR+wr5l1mpnZ5PM3s83MLMtBYWZmWQ4KMzPLclCYmVmWg8LMzLLG+ic82t6y7S+Num3bqmEey2yfiEtP/kJL5jUzmygfUZiZWZaDwszMshwUZmaW5aAwM7MsB4WZmWU5KMzMLMtBYWZmWQ4KMzPLclCYmVmWg8LMzLIcFGZmluWgMDOzLAeFmZllOSjMzCzLQWFmZlkOCjMzy2oqKCR9UNIRSd+V9Lqkn5O0UFKfpIvpeUFh/x2S+iVdkLS+UF8t6VzatkeSUn2upEOpfkrSssKY3vQZFyX1Tl7rZmbWjGaPKH4b+OOI+PvAx4DXge3AiYhYAZxI75G0EugB7gU2AM9ImpPmeRbYDKxIjw2pvgm4HhH3AE8Du9NcC4GdwP3AGmBnMZDMzKz1GgaFpPnAPwaeA4iIv46IvwQ2AvvTbvuBB9PrjcDBiLgREW8A/cAaSYuB+RFxMiICOFAzZmSuI0B3OtpYD/RFxGBEXAf6uBkuZmY2BZo5ovgw8H+A/yzpTyR9WdIdQCkirgKk50Vp/yXA5cL4gVRbkl7X1m8ZExHDwNvAXZm5zMxsinQ0uc8/AD4XEack/TbpNNMoVKcWmfp4x9z8QGkz1VNalEolKpVKZnl521YNj7qtNC+/fSImsuZWGBoamnFraqXZ1K97bV+t6reZoBgABiLiVH
p/hGpQvCVpcURcTaeVrhX2X1oY3wlcSfXOOvXimAFJHcCdwGCql2vGVGoXGBF7gb0AXV1dUS6Xa3dp2mPbXxp127ZVwzx1rpkf2dhdeqTcknnHq1KpMJGf4+1mNvXrXttXq/pteOopIv43cFnSR1OpG3gNOAqM3IXUC7yYXh8FetKdTMupXrQ+nU5PvSNpbbr+8GjNmJG5HgJeTtcxjgPrJC1IF7HXpZqZmU2RZn89/hzwFUnvA74H/DLVkDksaRPwJvAwQEScl3SYapgMA1sj4r00zxbgeWAecCw9oHqh/AVJ/VSPJHrSXIOSngDOpP0ej4jBcfZqZmbj0FRQRMSrQFedTd2j7L8L2FWnfha4r079XVLQ1Nm2D9jXzDrNzGzy+ZvZZmaW5aAwM7MsB4WZmWU5KMzMLMtBYWZmWQ4KMzPLclCYmVmWg8LMzLIcFGZmluWgMDOzLAeFmZllOSjMzCzLQWFmZlkOCjMzy3JQmJlZloPCzMyyHBRmZpbloDAzsywHhZmZZTkozMwsy0FhZmZZDgozM8tqKigkXZJ0TtKrks6m2kJJfZIupucFhf13SOqXdEHS+kJ9dZqnX9IeSUr1uZIOpfopScsKY3rTZ1yU1DtZjZuZWXPGckTxQER8PCK60vvtwImIWAGcSO+RtBLoAe4FNgDPSJqTxjwLbAZWpMeGVN8EXI+Ie4Cngd1proXATuB+YA2wsxhIZmbWehM59bQR2J9e7wceLNQPRsSNiHgD6AfWSFoMzI+IkxERwIGaMSNzHQG609HGeqAvIgYj4jrQx81wMTOzKdBsUATwdUmvSNqcaqWIuAqQnhel+hLgcmHsQKotSa9r67eMiYhh4G3grsxcZmY2RTqa3O8TEXFF0iKgT9J3M/uqTi0y9fGOufmB1fDaDFAqlahUKpnl5W1bNTzqttK8/PaJmMiaW2FoaGjGramVZlO/7rV9tarfpoIiIq6k52uSvkr1esFbkhZHxNV0Wula2n0AWFoY3glcSfXOOvXimAFJHcCdwGCql2vGVOqsby+wF6CrqyvK5XLtLk17bPtLo27btmqYp841m61jc+mRckvmHa9KpcJEfo63m9nUr3ttX63qt+GpJ0l3SPrbI6+BdcB3gKPAyF1IvcCL6fVRoCfdybSc6kXr0+n01DuS1qbrD4/WjBmZ6yHg5XQd4ziwTtKCdBF7XaqZmdkUaebX4xLw1XQnawfwXyLijyWdAQ5L2gS8CTwMEBHnJR0GXgOGga0R8V6aawvwPDAPOJYeAM8BL0jqp3ok0ZPmGpT0BHAm7fd4RAxOoF8zMxujhkEREd8DPlan/gOge5Qxu4Bddepngfvq1N8lBU2dbfuAfY3WaWZmreFvZpuZWZaDwszMshwUZmaW5aAwM7MsB4WZmWU5KMzMLMtBYWZmWQ4KMzPLclCYmVmWg8LMzLIcFGZmluWgMDOzLAeFmZllOSjMzCzLQWFmZlkOCjMzy3JQmJlZloPCzMyyHBRmZpbloDAzsywHhZmZZTkozMwsq+mgkDRH0p9I+qP0fqGkPkkX0/OCwr47JPVLuiBpfaG+WtK5tG2PJKX6XEmHUv2UpGWFMb3pMy5K6p2Mps3MrHljOaL4PPB64f124ERErABOpPdIWgn0APcCG4BnJM1JY54FNgMr0mNDqm8CrkfEPcDTwO4010JgJ3A/sAbYWQwkMzNrvaaCQlIn8AvAlwvljcD+9Ho/8GChfjAibkTEG0A/sEbSYmB+RJyMiAAO1IwZmesI0J2ONtYDfRExGBHXgT5uhouZmU2BZo8o/iPwG8CPCrVSRFwFSM+LUn0JcLmw30CqLUmva+u3jImIYeBt4K7MXGZmNkU6Gu0g6ReBaxHxiqRyE3OqTi0y9fGOKa5xM9VTWpRKJSqVShPLrG/bquFRt5Xm5bdPxETW3ApDQ0Mzbk2tNJv6da/tq1X9NgwK4BPAZyR9Gng/MF/S7wFvSVocEVfTaaVraf8BYGlhfCdwJd
U769SLYwYkdQB3AoOpXq4ZU6ldYETsBfYCdHV1Rblcrt2laY9tf2nUbdtWDfPUuWZ+ZGN36ZFyS+Ydr0qlwkR+jreb2dSve21freq34amniNgREZ0RsYzqReqXI+KfA0eBkbuQeoEX0+ujQE+6k2k51YvWp9PpqXckrU3XHx6tGTMy10PpMwI4DqyTtCBdxF6XamZmNkUm8uvxk8BhSZuAN4GHASLivKTDwGvAMLA1It5LY7YAzwPzgGPpAfAc8IKkfqpHEj1prkFJTwBn0n6PR8TgBNZsZmZjNKagiIgK6dRPRPwA6B5lv13Arjr1s8B9dervkoKmzrZ9wL6xrNPMzCaPv5ltZmZZDgozM8tyUJiZWZaDwszMshwUZmaW5aAwM7MsB4WZmWU5KMzMLMtBYWZmWQ4KMzPLclCYmVmWg8LMzLIcFGZmluWgMDOzLAeFmZllOSjMzCzLQWFmZlkOCjMzy3JQmJlZloPCzMyyHBRmZpbloDAzs6yGQSHp/ZJOS/q2pPOS/l2qL5TUJ+liel5QGLNDUr+kC5LWF+qrJZ1L2/ZIUqrPlXQo1U9JWlYY05s+46Kk3sls3szMGmvmiOIG8PMR8THg48AGSWuB7cCJiFgBnEjvkbQS6AHuBTYAz0iak+Z6FtgMrEiPDam+CbgeEfcATwO701wLgZ3A/cAaYGcxkMzMrPUaBkVUDaW3P5keAWwE9qf6fuDB9HojcDAibkTEG0A/sEbSYmB+RJyMiAAO1IwZmesI0J2ONtYDfRExGBHXgT5uhouZmU2Bpq5RSJoj6VXgGtX/4T4FlCLiKkB6XpR2XwJcLgwfSLUl6XVt/ZYxETEMvA3clZnLzMymSEczO0XEe8DHJX0Q+Kqk+zK7q94Umfp4x9z8QGkz1VNalEolKpVKZnl521YNj7qtNC+/fSImsuZWGBoamnFraqXZ1K97bV+t6repoBgREX8pqUL19M9bkhZHxNV0Wula2m0AWFoY1glcSfXOOvXimAFJHcCdwGCql2vGVOqsay+wF6CrqyvK5XLtLk17bPtLo27btmqYp86N6UfWtEuPlFsy73hVKhUm8nO83cymft1r+2pVv83c9fShdCSBpHnAJ4HvAkeBkbuQeoEX0+ujQE+6k2k51YvWp9PpqXckrU3XHx6tGTMy10PAy+k6xnFgnaQF6SL2ulQzM7Mp0syvx4uB/enOpZ8ADkfEH0k6CRyWtAl4E3gYICLOSzoMvAYMA1vTqSuALcDzwDzgWHoAPAe8IKmf6pFET5prUNITwJm03+MRMTiRhs3MbGwaBkVE/CnwM3XqPwC6RxmzC9hVp34W+LHrGxHxLilo6mzbB+xrtE4zM2sNfzPbzMyyHBRmZpbloDAzsywHhZmZZTkozMwsy0FhZmZZDgozM8tyUJiZWZaDwszMshwUZmaW5aAwM7MsB4WZmWU5KMzMLMtBYWZmWQ4KMzPLclCYmVmWg8LMzLIcFGZmluWgMDOzLAeFmZllOSjMzCzLQWFmZlkNg0LSUkn/TdLrks5L+nyqL5TUJ+liel5QGLNDUr+kC5LWF+qrJZ1L2/ZIUqrPlXQo1U9JWlYY05s+46Kk3sls3szMGmvmiGIY2BYRPwWsBbZKWglsB05ExArgRHpP2tYD3AtsAJ6RNCfN9SywGViRHhtSfRNwPSLuAZ4Gdqe5FgI7gfuBNcDOYiCZmVnrNQyKiLgaEd9Kr98BXgeWABuB/Wm3/cCD6fVG4GBE3IiIN4B+YI2kxcD8iDgZEQEcqBkzMtcRoDsdbawH+iJiMCKuA33cDBczM5sCY7pGkU4J/QxwCihFxFWohgmwKO22BLhcGDaQakvS69r6LWMiYhh4G7grM5eZmU2RjmZ3lPQB4A+AL0TEX6XLC3V3rVOLTH28Y4pr20z1lBalUolKpTLa2hratmp41G2lefntEzGRNbfC0NDQjFtTK82mft1r+2pVv00FhaSfpBoSX4mIP0zltyQtjoir6bTStV
QfAJYWhncCV1K9s069OGZAUgdwJzCY6uWaMZXa9UXEXmAvQFdXV5TL5dpdmvbY9pdG3bZt1TBPnWs6W8fk0iPllsw7XpVKhYn8HG83s6lf99q+WtVvM3c9CXgOeD0i/kNh01Fg5C6kXuDFQr0n3cm0nOpF69Pp9NQ7ktamOR+tGTMy10PAy+k6xnFgnaQF6SL2ulQzM7Mp0syvx58A/gVwTtKrqfabwJPAYUmbgDeBhwEi4rykw8BrVO+Y2hoR76VxW4DngXnAsfSAahC9IKmf6pFET5prUNITwJm03+MRMTjOXs3MbBwaBkVE/E/qXysA6B5lzC5gV536WeC+OvV3SUFTZ9s+YF+jdZqZWWv4m9lmZpbloDAzsywHhZmZZTkozMwsy0FhZmZZDgozM8tyUJiZWZaDwszMshwUZmaW5aAwM7MsB4WZmWU5KMzMLMtBYWZmWQ4KMzPLclCYmVmWg8LMzLIcFGZmluWgMDOzLAeFmZllOSjMzCzLQWFmZlkOCjMzy2oYFJL2Sbom6TuF2kJJfZIupucFhW07JPVLuiBpfaG+WtK5tG2PJKX6XEmHUv2UpGWFMb3pMy5K6p2sps3MrHnNHFE8D2yoqW0HTkTECuBEeo+klUAPcG8a84ykOWnMs8BmYEV6jMy5CbgeEfcATwO701wLgZ3A/cAaYGcxkMzMbGo0DIqI+B/AYE15I7A/vd4PPFioH4yIGxHxBtAPrJG0GJgfEScjIoADNWNG5joCdKejjfVAX0QMRsR1oI8fDywzM2ux8V6jKEXEVYD0vCjVlwCXC/sNpNqS9Lq2fsuYiBgG3gbuysxlZmZTqGOS51OdWmTq4x1z64dKm6me1qJUKlGpVBoudDTbVg2Puq00L799Iiay5lYYGhqacWtqpdnUr3ttX63qd7xB8ZakxRFxNZ1WupbqA8DSwn6dwJVU76xTL44ZkNQB3En1VNcAUK4ZU6m3mIjYC+wF6OrqinK5XG+3pjy2/aVRt21bNcxT5yY7W6suPVJuybzjValUmMjP8XYzm/p1r+2rVf2O99TTUWDkLqRe4MVCvSfdybSc6kXr0+n01DuS1qbrD4/WjBmZ6yHg5XQd4ziwTtKCdBF7XaqZmdkUavjrsaTfp/qb/d2SBqjeifQkcFjSJuBN4GGAiDgv6TDwGjAMbI2I99JUW6jeQTUPOJYeAM8BL0jqp3ok0ZPmGpT0BHAm7fd4RNReVDczsxZrGBQR8dlRNnWPsv8uYFed+lngvjr1d0lBU2fbPmBfozWamVnr+JvZZmaW5aAwM7MsB4WZmWU5KMzMLMtBYWZmWQ4KMzPLclCYmVmWg8LMzLIcFGZmluWgMDOzLAeFmZllOSjMzCzLQWFmZlkOCjMzy3JQmJlZloPCzMyyHBRmZpbloDAzsywHhZmZZTkozMwsy0FhZmZZDgozM8u6LYJC0gZJFyT1S9o+3esxM5tNZnxQSJoD/CfgU8BK4LOSVk7vqszMZo+O6V5AE9YA/RHxPQBJB4GNwGvTuiozs1Es2/7StHzu8xvuaMm8M/6IAlgCXC68H0g1MzObArfDEYXq1OKWHaTNwOb0dkjShVYs5F/B3cBftGJu7W7FrBPSsl5nqNnUr3ttUw/snlC/f2+0DbdDUAwASwvvO4ErxR0iYi+wt9ULkXQ2Irpa/TkzwWzqFWZXv+61fbWq39vh1NMZYIWk5ZLeB/QAR6d5TWZms8aMP6KIiGFJ/xI4DswB9kXE+WlelpnZrDHjgwIgIr4GfG2618EUnN6aQWZTrzC7+nWv7asl/SoiGu9lZmaz1u1wjcLMzKaRgyKR9LCk85J+JKmrZtuO9OdDLkhaX6ivlnQubdsjSak+V9KhVD8ladnUdjM2kj4u6ZuSXpV0VtKawrYx9X47kPS51M95Sf++UG+7XgEk/RtJIenuQq3tepX0W5K+K+lPJX1V0gcL29qu36KW/5mjiPCjevrtp4CPAhWgq1BfCXwbmAssB/4MmJ
O2nQZ+jup3PY4Bn0r1Xwd+N73uAQ5Nd38Nev96Ye2fBirj7X2mP4AHgG8Ac9P7Re3aa1r7Uqo3gnwfuLvNe10HdKTXu4Hd7dxvoe85qacPA+9Lva6czM/wEUUSEa9HRL0v6m0EDkbEjYh4A+gH1khaDMyPiJNR/ad1AHiwMGZ/en0E6J7hv6kEMD+9vpOb31MZT+8z3RbgyYi4ARAR11K9HXsFeBr4DW79kmpb9hoRX4+I4fT2m1S/cwVt2m/B3/yZo4j4a2DkzxxNGgdFY6P9CZEl6XVt/ZYx6V/ct4G7Wr7S8fsC8FuSLgNfAnak+nh6n+k+AvyjdErwv0v62VRvu14lfQb484j4ds2mtuu1jl+heoQA7d9vy//M0W1xe+xkkfQN4O/U2fTFiHhxtGF1apGp58ZMm1zvQDfwryPiDyT9U+A54JOMr/dp16DXDmABsBb4WeCwpA/Tnr3+JtXTMT82rE5txvcKzf03LOmLwDDwlZFhdfa/LfptUsv7mFVBERGfHMew0f6EyAA3D22L9eKYAUkdVE/nDI7jsydNrndJB4DPp7f/Ffhyej2e3qddg163AH+YTjWclvQjqn8PqK16lbSK6vn4b6eznp3At9KNCrdlr9D4v2FJvcAvAt3pnzHcxv02qeGfOZqw6b4QM9Me/PjF7Hu59ULY97h5IewM1d9MRy6EfTrVt3LrxezD091Xg55fB8rpdTfwynh7n+kP4NeAx9Prj1A9ZFc79lrT9yVuXsxuy16BDVT/7wc+VFNvy34L/XWknpZz82L2vZP6GdPd5Ex5AL9ENZlvAG8Bxwvbvkj1roILFO6KALqA76Rtv8PNLzC+n+pv5v1U76r48HT316D3fwi8kv4FOwWsHm/vM/2R/kP6vbT2bwE/36691vT9N0HRrr2m/94uA6+mx++2c781vX8a+F+pjy9O9vz+ZraZmWX5riczM8tyUJiZWZaDwszMshwUZmaW5aAwM7MsB4WZmWU5KMzMLMtBYWZmWf8fGulILBXr8WcAAAAASUVORK5CYII=\n",
            "text/plain": [
              "<Figure size 432x288 with 1 Axes>"
            ]
          },
          "metadata": {
            "tags": [],
            "needs_background": "light"
          }
        }
      ]
    },
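    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A default 10-bin histogram can hide small groups of values. As a rough sketch, assuming `train_metadata` from the cells above, the distinct rescale intercepts and how often each occurs can be listed directly:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# Sketch: frequency of each distinct RescaleIntercept value, sorted by value\n",
        "train_metadata[\"RescaleIntercept\"].value_counts().sort_index()"
      ],
      "execution_count": 0,
      "outputs": []
    },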
    {
      "cell_type": "code",
      "metadata": {
        "scrolled": true,
        "id": "km5BztPZ3DNR",
        "colab_type": "code",
        "outputId": "196e276a-5c36-43c1-f88f-9912cee61fd5",
        "colab": {}
      },
      "source": [
        "# We identify the corrupted files in the next three cells\n",
        "train_metadata.query(\"RescaleIntercept!=-1024\").groupby(\"PatientID\").count().sort_values(\"SOPInstanceUID\")"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>SOPInstanceUID</th>\n",
              "      <th>Modality</th>\n",
              "      <th>StudyInstanceUID</th>\n",
              "      <th>SeriesInstanceUID</th>\n",
              "      <th>StudyID</th>\n",
              "      <th>ImagePositionPatient</th>\n",
              "      <th>ImageOrientationPatient</th>\n",
              "      <th>SamplesPerPixel</th>\n",
              "      <th>PhotometricInterpretation</th>\n",
              "      <th>Rows</th>\n",
              "      <th>Columns</th>\n",
              "      <th>PixelSpacing</th>\n",
              "      <th>BitsAllocated</th>\n",
              "      <th>BitsStored</th>\n",
              "      <th>HighBit</th>\n",
              "      <th>PixelRepresentation</th>\n",
              "      <th>WindowCenter</th>\n",
              "      <th>WindowWidth</th>\n",
              "      <th>RescaleIntercept</th>\n",
              "      <th>RescaleSlope</th>\n",
              "      <th>fname</th>\n",
              "      <th>MultiImagePositionPatient</th>\n",
              "      <th>ImagePositionPatient1</th>\n",
              "      <th>ImagePositionPatient2</th>\n",
              "      <th>MultiImageOrientationPatient</th>\n",
              "      <th>ImageOrientationPatient1</th>\n",
              "      <th>ImageOrientationPatient2</th>\n",
              "      <th>ImageOrientationPatient3</th>\n",
              "      <th>ImageOrientationPatient4</th>\n",
              "      <th>ImageOrientationPatient5</th>\n",
              "      <th>MultiPixelSpacing</th>\n",
              "      <th>PixelSpacing1</th>\n",
              "      <th>img_min</th>\n",
              "      <th>img_max</th>\n",
              "      <th>img_mean</th>\n",
              "      <th>img_std</th>\n",
              "      <th>img_pct_window</th>\n",
              "      <th>MultiWindowCenter</th>\n",
              "      <th>WindowCenter1</th>\n",
              "      <th>MultiWindowWidth</th>\n",
              "      <th>WindowWidth1</th>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>PatientID</th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>ID_b956c8dd</th>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>ID_11e103d4</th>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "      <td>20</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>ID_03ac0e28</th>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "      <td>22</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>ID_57a06f55</th>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>ID_cd2e1b47</th>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>23</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>...</th>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>ID_a579ac67</th>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "      <td>42</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>ID_2b35cfb8</th>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>ID_00526c11</th>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "      <td>52</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>ID_aa91c454</th>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "      <td>56</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>ID_b4f7750e</th>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "      <td>64</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "<p>407 rows × 41 columns</p>\n",
              "</div>"
            ],
            "text/plain": [
              "             SOPInstanceUID  Modality  StudyInstanceUID  SeriesInstanceUID  \\\n",
              "PatientID                                                                    \n",
              "ID_b956c8dd  20              20        20                20                  \n",
              "ID_11e103d4  20              20        20                20                  \n",
              "ID_03ac0e28  22              22        22                22                  \n",
              "ID_57a06f55  23              23        23                23                  \n",
              "ID_cd2e1b47  23              23        23                23                  \n",
              "...          ..              ..        ..                ..                  \n",
              "ID_a579ac67  42              42        42                42                  \n",
              "ID_2b35cfb8  52              52        52                52                  \n",
              "ID_00526c11  52              52        52                52                  \n",
              "ID_aa91c454  56              56        56                56                  \n",
              "ID_b4f7750e  64              64        64                64                  \n",
              "\n",
              "             StudyID  ImagePositionPatient  ImageOrientationPatient  \\\n",
              "PatientID                                                             \n",
              "ID_b956c8dd  20       20                    20                        \n",
              "ID_11e103d4  20       20                    20                        \n",
              "ID_03ac0e28  22       22                    22                        \n",
              "ID_57a06f55  23       23                    23                        \n",
              "ID_cd2e1b47  23       23                    23                        \n",
              "...          ..       ..                    ..                        \n",
              "ID_a579ac67  42       42                    42                        \n",
              "ID_2b35cfb8  52       52                    52                        \n",
              "ID_00526c11  52       52                    52                        \n",
              "ID_aa91c454  56       56                    56                        \n",
              "ID_b4f7750e  64       64                    64                        \n",
              "\n",
              "             SamplesPerPixel  PhotometricInterpretation  Rows  Columns  \\\n",
              "PatientID                                                                \n",
              "ID_b956c8dd  20               20                         20    20        \n",
              "ID_11e103d4  20               20                         20    20        \n",
              "ID_03ac0e28  22               22                         22    22        \n",
              "ID_57a06f55  23               23                         23    23        \n",
              "ID_cd2e1b47  23               23                         23    23        \n",
              "...          ..               ..                         ..    ..        \n",
              "ID_a579ac67  42               42                         42    42        \n",
              "ID_2b35cfb8  52               52                         52    52        \n",
              "ID_00526c11  52               52                         52    52        \n",
              "ID_aa91c454  56               56                         56    56        \n",
              "ID_b4f7750e  64               64                         64    64        \n",
              "\n",
              "             PixelSpacing  BitsAllocated  BitsStored  HighBit  \\\n",
              "PatientID                                                       \n",
              "ID_b956c8dd  20            20             20          20        \n",
              "ID_11e103d4  20            20             20          20        \n",
              "ID_03ac0e28  22            22             22          22        \n",
              "ID_57a06f55  23            23             23          23        \n",
              "ID_cd2e1b47  23            23             23          23        \n",
              "...          ..            ..             ..          ..        \n",
              "ID_a579ac67  42            42             42          42        \n",
              "ID_2b35cfb8  52            52             52          52        \n",
              "ID_00526c11  52            52             52          52        \n",
              "ID_aa91c454  56            56             56          56        \n",
              "ID_b4f7750e  64            64             64          64        \n",
              "\n",
              "             PixelRepresentation  WindowCenter  WindowWidth  RescaleIntercept  \\\n",
              "PatientID                                                                       \n",
              "ID_b956c8dd  20                   20            20           20                 \n",
              "ID_11e103d4  20                   20            20           20                 \n",
              "ID_03ac0e28  22                   22            22           22                 \n",
              "ID_57a06f55  23                   23            23           23                 \n",
              "ID_cd2e1b47  23                   23            23           23                 \n",
              "...          ..                   ..            ..           ..                 \n",
              "ID_a579ac67  42                   42            42           42                 \n",
              "ID_2b35cfb8  52                   52            52           52                 \n",
              "ID_00526c11  52                   52            52           52                 \n",
              "ID_aa91c454  56                   56            56           56                 \n",
              "ID_b4f7750e  64                   64            64           64                 \n",
              "\n",
              "             RescaleSlope  fname  MultiImagePositionPatient  \\\n",
              "PatientID                                                     \n",
              "ID_b956c8dd  20            20     20                          \n",
              "ID_11e103d4  20            20     20                          \n",
              "ID_03ac0e28  22            22     22                          \n",
              "ID_57a06f55  23            23     23                          \n",
              "ID_cd2e1b47  23            23     23                          \n",
              "...          ..            ..     ..                          \n",
              "ID_a579ac67  42            42     42                          \n",
              "ID_2b35cfb8  52            52     52                          \n",
              "ID_00526c11  52            52     52                          \n",
              "ID_aa91c454  56            56     56                          \n",
              "ID_b4f7750e  64            64     64                          \n",
              "\n",
              "             ImagePositionPatient1  ImagePositionPatient2  \\\n",
              "PatientID                                                   \n",
              "ID_b956c8dd  20                     20                      \n",
              "ID_11e103d4  20                     20                      \n",
              "ID_03ac0e28  22                     22                      \n",
              "ID_57a06f55  23                     23                      \n",
              "ID_cd2e1b47  23                     23                      \n",
              "...          ..                     ..                      \n",
              "ID_a579ac67  42                     42                      \n",
              "ID_2b35cfb8  52                     52                      \n",
              "ID_00526c11  52                     52                      \n",
              "ID_aa91c454  56                     56                      \n",
              "ID_b4f7750e  64                     64                      \n",
              "\n",
              "             MultiImageOrientationPatient  ImageOrientationPatient1  \\\n",
              "PatientID                                                             \n",
              "ID_b956c8dd  20                            20                         \n",
              "ID_11e103d4  20                            20                         \n",
              "ID_03ac0e28  22                            22                         \n",
              "ID_57a06f55  23                            23                         \n",
              "ID_cd2e1b47  23                            23                         \n",
              "...          ..                            ..                         \n",
              "ID_a579ac67  42                            42                         \n",
              "ID_2b35cfb8  52                            52                         \n",
              "ID_00526c11  52                            52                         \n",
              "ID_aa91c454  56                            56                         \n",
              "ID_b4f7750e  64                            64                         \n",
              "\n",
              "             ImageOrientationPatient2  ImageOrientationPatient3  \\\n",
              "PatientID                                                         \n",
              "ID_b956c8dd  20                        20                         \n",
              "ID_11e103d4  20                        20                         \n",
              "ID_03ac0e28  22                        22                         \n",
              "ID_57a06f55  23                        23                         \n",
              "ID_cd2e1b47  23                        23                         \n",
              "...          ..                        ..                         \n",
              "ID_a579ac67  42                        42                         \n",
              "ID_2b35cfb8  52                        52                         \n",
              "ID_00526c11  52                        52                         \n",
              "ID_aa91c454  56                        56                         \n",
              "ID_b4f7750e  64                        64                         \n",
              "\n",
              "             ImageOrientationPatient4  ImageOrientationPatient5  \\\n",
              "PatientID                                                         \n",
              "ID_b956c8dd  20                        20                         \n",
              "ID_11e103d4  20                        20                         \n",
              "ID_03ac0e28  22                        22                         \n",
              "ID_57a06f55  23                        23                         \n",
              "ID_cd2e1b47  23                        23                         \n",
              "...          ..                        ..                         \n",
              "ID_a579ac67  42                        42                         \n",
              "ID_2b35cfb8  52                        52                         \n",
              "ID_00526c11  52                        52                         \n",
              "ID_aa91c454  56                        56                         \n",
              "ID_b4f7750e  64                        64                         \n",
              "\n",
              "             MultiPixelSpacing  PixelSpacing1  img_min  img_max  img_mean  \\\n",
              "PatientID                                                                   \n",
              "ID_b956c8dd  20                 20             20       20       20         \n",
              "ID_11e103d4  20                 20             20       20       20         \n",
              "ID_03ac0e28  22                 22             22       22       22         \n",
              "ID_57a06f55  23                 23             23       23       23         \n",
              "ID_cd2e1b47  23                 23             23       23       23         \n",
              "...          ..                 ..             ..       ..       ..         \n",
              "ID_a579ac67  42                 42             42       42       42         \n",
              "ID_2b35cfb8  52                 52             52       52       52         \n",
              "ID_00526c11  52                 52             52       52       52         \n",
              "ID_aa91c454  56                 56             56       56       56         \n",
              "ID_b4f7750e  64                 64             64       64       64         \n",
              "\n",
              "             img_std  img_pct_window  MultiWindowCenter  WindowCenter1  \\\n",
              "PatientID                                                                \n",
              "ID_b956c8dd  20       20              20                 20              \n",
              "ID_11e103d4  20       20              20                 20              \n",
              "ID_03ac0e28  22       22              22                 22              \n",
              "ID_57a06f55  23       23              23                 23              \n",
              "ID_cd2e1b47  23       23              0                  0               \n",
              "...          ..       ..              ..                 ..              \n",
              "ID_a579ac67  42       42              42                 42              \n",
              "ID_2b35cfb8  52       52              52                 52              \n",
              "ID_00526c11  52       52              52                 52              \n",
              "ID_aa91c454  56       56              56                 56              \n",
              "ID_b4f7750e  64       64              64                 64              \n",
              "\n",
              "             MultiWindowWidth  WindowWidth1  \n",
              "PatientID                                    \n",
              "ID_b956c8dd  20                20            \n",
              "ID_11e103d4  20                20            \n",
              "ID_03ac0e28  22                22            \n",
              "ID_57a06f55  23                23            \n",
              "ID_cd2e1b47  0                 0             \n",
              "...          ..                ..            \n",
              "ID_a579ac67  42                42            \n",
              "ID_2b35cfb8  52                52            \n",
              "ID_00526c11  52                52            \n",
              "ID_aa91c454  56                56            \n",
              "ID_b4f7750e  64                64            \n",
              "\n",
              "[407 rows x 41 columns]"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 106
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "scrolled": true,
        "id": "A4FcWqQk3DNb",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "png_IDs_to_remove = sorted(train_metadata.query(\"RescaleIntercept!=-1024\").SOPInstanceUID.tolist())"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "tT9LMoLl3DNf",
        "colab_type": "code",
        "outputId": "8a4deaf2-005f-4821-d992-98eaf00f6a18",
        "colab": {}
      },
      "source": [
        "png_IDs_to_remove[0:10]"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "['ID_0007ff5d1',\n",
              " 'ID_000aa2bce',\n",
              " 'ID_000bd8380',\n",
              " 'ID_0012b1611',\n",
              " 'ID_0015e926e',\n",
              " 'ID_001bdd8fb',\n",
              " 'ID_001d4ce1c',\n",
              " 'ID_0023e98ab',\n",
              " 'ID_0024b1888',\n",
              " 'ID_00382ea5e']"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 108
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3O7qOYMr3DNv",
        "colab_type": "text"
      },
      "source": [
        "\n",
        "Now, we remove the corrupted files and update the metadata & labels to no longer contain references to said files."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "P7KMYBqb3DNw",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Removing corrupted files\n",
        "folder_path = \"C:/Users/Administrator/Downloads/rsna_stage1_png_128/stage_1_train_images\"\n",
        "for ID in tqdm.tqdm(png_IDs_to_remove):\n",
        "    os.remove(\"./rsna_stage1_png_128/stage_1_train_images/{}.png\".format(ID))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "PM0qyR8_3DOA",
        "colab_type": "code",
        "outputId": "e75bd07a-2354-4208-c353-284f6212cdae",
        "colab": {}
      },
      "source": [
        "# Verify that corrupted files have been removed\n",
        "os.path.isfile(\"./rsna_stage1_png_128/stage_1_train_images/ID_0007ff5d1.png\")"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "False"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 112
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "s8Vp_bom3DOE",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Update labels and metadata\n",
        "labels = labels[~labels[\"ID\"].isin(png_IDs_to_remove)]\n",
        "train_metadata = train_metadata[~train_metadata[\"SOPInstanceUID\"].isin(png_IDs_to_remove)]"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hIWgnYiA3DOO",
        "colab_type": "text"
      },
      "source": [
        "### **Step 2: Remove Images w/ Low Brain Percentage**"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WwumGFykeJcG",
        "colab_type": "text"
      },
      "source": [
        "Some images contain virtually no brain matter (that is, they are slices of the patient's skull either above or below their brain). Here, we identify and remove these files."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "fcmgGDXQ3DOP",
        "colab_type": "code",
        "outputId": "d2f970c7-c46f-46a9-d2df-dffeec85b02f",
        "colab": {}
      },
      "source": [
        "# This histogram shows the percentage of brain matter present in all of our files \n",
        "train_metadata.img_pct_window.hist(bins=100)\n",
        "plt.show()"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYQAAAD4CAYAAADsKpHdAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8li6FKAAAXE0lEQVR4nO3df6zd9X3f8edrpqFOMhII5day2ew2VjLAiRZumdtO051Yh5dUMdVAc0SDaZmsMpqlm6PGrFIzabJEtGVp6AaVFTJMF4V4NB1WKV0Q2VU0lR9x0iQOEBq3MLjgxk2TEm7aUC5774/z8c3x9bF97j331zl+PqSj+/2+P9/PuZ+Pzzl+n8/n8/1+b6oKSZL+1ko3QJK0OpgQJEmACUGS1JgQJEmACUGS1Jyz0g1YqAsvvLA2btw4u/+9732P173udSvXoGVgH0eDfRwNw9rHL37xi9+qqh/pVTa0CWHjxo0cOnRodn9ycpKJiYmVa9AysI+jwT6OhmHtY5L/e6oyp4wkSYAJQZLUmBAkSYAJQZLUmBAkSYAJQZLUmBAkSYAJQZLUmBAkScAQX6msxbdxz/2z28/c+q4VbImklWBCOMt1J4FTxU0O0tnBKSNJEmBCkCQ1ThmdhU41TdTP8U4fSaPLEYIkCegjIST5RJJjSb7Wo+wDSSrJhV2xW5IcSfJUkqu64pcnOdzKbkuSFj83yadb/NEkGxena5Kk+ehnhHAXsG1uMMnFwM8Az3bFLgF2AJe2OrcnWdOK7wB2AZvb4/hz3gh8p6reDHwU+PBCOiJJGswZ1xCq6vOn+Nb+UeBXgfu6YtuBe6rqZeDpJEeAK5I8A5xXVQ8DJLkbuBp4oNX5963+vcB/SZKqqoV0SFqoU62tuG6is8WCFpWTvBt4vqq+0mZ+jlsPPNK1P9Vir7TtufHjdZ4DqKqZJC8CbwK+tZC2aWm5wCyNrnknhCSvBX4N+Ke9invE6jTx09Xp9bt30Zl2YmxsjMnJydmy6enpE/ZH0SB9PPz8i7Pbu7csTnuW4t97JV/H3VtmesYXuz2+V0fDKPZxISOEHwc2AcdHBxuALyW5gs43/4u7jt0AvNDiG3rE6aozleQc4A3At3v94qraB+wDGB8fr+4/cD2sf/B6Pgbp4w3zPNW0H89cN7Hoz7ncr+OJ00S9Pw6L3U/fq6NhFPs479NOq+pwVV1UVRuraiOd/9DfUVV/BhwEdrQzhzbRWTx+rKqOAi8l2drOLrqeH6w9HAR2tu1rgM+5fiBJy6+f004/BTwMvCXJVJIbT3VsVT0OHACeAP4AuLmqXm3FNwEfB44Af0JnQRngTuBNbQH63wJ7FtgXSdIA+jnL6D1nKN84Z38vsLfHcYeAy3rEvw9ce6Z2aP7me0WypLObVypLkgATgiSpMSFIkgDvdqqzkHd7lXpzhCBJAkwIkqTGKSMt2Nypl7NhOsXpI40yRwiSJMCEIElqnDLSolnN0yletS2dmQlhxPgfn6SFcspIkgSYECRJjQlBkgSYECRJjYvKWhKr+YwjSb05QpAkAY4QpAVzFKRR4whBkgQ4QtAI8yI9aX7OOEJI8okkx5J8rSv2H5N8PclXk/xukjd2ld2S5EiSp5Jc1RW/PMnhVnZbkrT4uUk+3eKPJtm4uF3UStu45/7Zh6TVq58po7uAbXNiDwKXVdXbgD8GbgFIcgmwA7i01bk9yZpW5w5gF7C5PY4/543Ad6rqzcBHgQ8vtDOSpIU7Y0Koqs8D354T+2xVzbTdR4ANbXs7cE9VvVxVTwNHgCuSrAPOq6qHq6qAu4Gru+rsb9v3AlceHz1IkpbPYqwh/CLw6ba9nk6COG6qxV5p23Pjx+s8B1BVM0leBN4EfGvuL0qyi84og7GxMSYnJ2fLpqenT9gfRf30cfeWmdOWr7Tf/OR9s9tb1r/hpPLFfB2X899iPm32vToaRrGPAy
WEJL8GzACfPB7qcVidJn66OicHq/YB+wDGx8drYmJitmxycpLu/VHUTx9vGKJ5+meumzgptpiv43L+W/Tqy6n4Xh0No9jHBSeEJDuBnwWubNNA0Pnmf3HXYRuAF1p8Q494d52pJOcAb2DOFJVOz8VaSYthQdchJNkGfBB4d1X9VVfRQWBHO3NoE53F48eq6ijwUpKtbX3geuC+rjo72/Y1wOe6EowkaZmccYSQ5FPABHBhkingQ3TOKjoXeLCt/z5SVb9UVY8nOQA8QWcq6eaqerU91U10zlhaCzzQHgB3Ar+d5AidkcGOxematHy8almj4IwJoare0yN852mO3wvs7RE/BFzWI/594NoztUOStLS8UllDzzUUaXF4LyNJEuAIQSvIeXdpdXGEIEkCHCFolTg+Wti9ZYaJlW2KdNZyhCBJAhwhaEh5ZpG0+BwhSJIAE4IkqTEhSJIA1xC0Cp1qfcBrFaSlZUIYUmfjouqw9HluO01kGhZOGUmSABOCJKkxIUiSABOCJKkxIUiSABOCJKkxIUiSgD4SQpJPJDmW5GtdsQuSPJjkG+3n+V1ltyQ5kuSpJFd1xS9PcriV3ZYkLX5ukk+3+KNJNi5uFyVJ/ehnhHAXsG1ObA/wUFVtBh5q+yS5BNgBXNrq3J5kTatzB7AL2Nwex5/zRuA7VfVm4KPAhxfaGUnSwp0xIVTV54FvzwlvB/a37f3A1V3xe6rq5ap6GjgCXJFkHXBeVT1cVQXcPafO8ee6F7jy+OhBkrR8FnrrirGqOgpQVUeTXNTi64FHuo6barFX2vbc+PE6z7XnmknyIvAm4Ftzf2mSXXRGGYyNjTE5OTlbNj09fcL+KOru4+4tMyvbmCUytnb0+jb3fXm2vVdH1Sj2cbHvZdTrm32dJn66OicHq/YB+wDGx8drYmJitmxycpLu/VHU3ccbhuS+PvO1e8sMHzk8WrfYeua6iRP2z7b36qgaxT4u9JP3zSTr2uhgHXCsxaeAi7uO2wC80OIbesS760wlOQd4AydPUUlDq/tmd97oTqvZQk87PQjsbNs7gfu64jvamUOb6CweP9aml15KsrWtD1w/p87x57oG+FxbZ5AkLaMzjhCSfAqYAC5MMgV8CLgVOJDkRuBZ4FqAqno8yQHgCWAGuLmqXm1PdROdM5bWAg+0B8CdwG8nOUJnZLBjUXomSZqXMyaEqnrPKYquPMXxe4G9PeKHgMt6xL9PSyiSpJXjlcqSJMCEIElqTAiSJMC/qTxUDj//4shefyBp5TlCkCQBJgRJUmNCkCQBJgRpWW3ccz+Hn3/xhNtZSKuFCUGSBJgQJEmNCUGSBJgQJEmNCUGSBJgQJEmNCUGSBJgQJEmNCUGSBJgQJEmNt7+WVkj37SueufVdK9gSqWOgEUKSf5Pk8SRfS/KpJD+c5IIkDyb5Rvt5ftfxtyQ5kuSpJFd1xS9PcriV3ZYkg7RLkjR/C04ISdYD/xoYr6rLgDXADmAP8FBVbQYeavskuaSVXwpsA25PsqY93R3ALmBze2xbaLtGzcY9988+JGkpDbqGcA6wNsk5wGuBF4DtwP5Wvh+4um1vB+6pqper6mngCHBFknXAeVX1cFUVcHdXHUnSMlnwGkJVPZ/kPwHPAn8NfLaqPptkrKqOtmOOJrmoVVkPPNL1FFMt9krbnhs/SZJddEYSjI2NMTk5OVs2PT19wv6o2L1lZnZ7bO2J+6PobO3jqL13R/Xz2G0U+7jghNDWBrYDm4C/BP5Hkp8/XZUesTpN/ORg1T5gH8D4+HhNTEzMlk1OTtK9Pyq6/4by7i0zfOTwaJ8HcLb28ZnrJlamMUtkVD+P3Uaxj4NMGf0T4Omq+vOqegX4DPBTwDfbNBDt57F2/BRwcVf9DXSmmKba9ty4JGkZDZIQngW2JnltOyvoSuBJ4CCwsx2zE7ivbR8EdiQ5N8kmOovHj7XppZeSbG3Pc31XHUnSMhlkDeHRJPcCXwJmgD+iM53zeuBAkh
vpJI1r2/GPJzkAPNGOv7mqXm1PdxNwF7AWeKA9JEnLaKDJ2qr6EPChOeGX6YwWeh2/F9jbI34IuGyQtkiSBuOtKyRJgAlBktSYECRJgDe3k1YFb3Sn1cARgiQJMCFIkhoTgiQJMCFIkhoTgiQJ8CyjVck/hiNpJThCkCQBJgRJUmNCkCQBJgRJUuOisrTKeBsLrRRHCJIkwIQgSWpMCJIkwIQgSWoGSghJ3pjk3iRfT/Jkkp9MckGSB5N8o/08v+v4W5IcSfJUkqu64pcnOdzKbkuSQdolSZq/QUcIHwP+oKreCrwdeBLYAzxUVZuBh9o+SS4BdgCXAtuA25Osac9zB7AL2Nwe2wZslyRpnhacEJKcB/wj4E6AqvqbqvpLYDuwvx22H7i6bW8H7qmql6vqaeAIcEWSdcB5VfVwVRVwd1cdSdIyGeQ6hB8D/hz4b0neDnwReD8wVlVHAarqaJKL2vHrgUe66k+12Ctte278JEl20RlJMDY2xuTk5GzZ9PT0CfvD5vDzL85u797S+5ixtbB7y8wytWhl2McTDet7etg/j/0YxT4OkhDOAd4BvK+qHk3yMdr00Cn0Wheo08RPDlbtA/YBjI+P18TExGzZ5OQk3fvD5oY+7nC6e8sMHzk82tcS2scTPXPdxNI2ZokM++exH6PYx0HWEKaAqap6tO3fSydBfLNNA9F+Hus6/uKu+huAF1p8Q4+4JGkZLTghVNWfAc8leUsLXQk8ARwEdrbYTuC+tn0Q2JHk3CSb6CweP9aml15KsrWdXXR9Vx1J0jIZdGz+PuCTSV4D/CnwC3SSzIEkNwLPAtcCVNXjSQ7QSRozwM1V9Wp7npuAu4C1wAPtIUlaRgMlhKr6MjDeo+jKUxy/F9jbI34IuGyQtkijyBvdaTl5pbIkCTAhSJIaE4IkCTAhSJIaE4IkCTAhSJIaE4IkCTAhSJKa0b6L2Cq3sY8b2knScnGEIEkCTAiSpMaEIEkCXEOQhoY3utNSc4QgSQJMCJKkxoQgSQJMCJKkxkXlZebFaJJWK0cIkiRgERJCkjVJ/ijJ77X9C5I8mOQb7ef5XcfekuRIkqeSXNUVvzzJ4VZ2W5IM2i5J0vwsxgjh/cCTXft7gIeqajPwUNsnySXADuBSYBtwe5I1rc4dwC5gc3tsW4R2SZLmYaCEkGQD8C7g413h7cD+tr0fuLorfk9VvVxVTwNHgCuSrAPOq6qHq6qAu7vqSJKWyaCLyr8B/Crwt7tiY1V1FKCqjia5qMXXA490HTfVYq+07bnxkyTZRWckwdjYGJOTk7Nl09PTJ+yvVru3zCy47tjaweoPA/vYn9X+Xh+Wz+MgRrGPC04ISX4WOFZVX0wy0U+VHrE6TfzkYNU+YB/A+Ph4TUz84NdOTk7Svb9a3TDAWUa7t8zwkcOjfWKYfezT4e/Nbq7G21gMy+dxEKPYx0HelT8NvDvJO4EfBs5L8t+BbyZZ10YH64Bj7fgp4OKu+huAF1p8Q4/4klnue8J4qqmkYbDgNYSquqWqNlTVRjqLxZ+rqp8HDgI722E7gfva9kFgR5Jzk2yis3j8WJteeinJ1nZ20fVddSRJy2Qpxua3AgeS3Ag8C1wLUFWPJzkAPAHMADdX1autzk3AXcBa4IH2kCQto0VJCFU1CUy27b8ArjzFcXuBvT3ih4DLFqMtkqSFGe3Vuz54j3lJ6vDWFZIkwBHCCQYdLXg2kaRhZkKQhpzTnlosJoRT8Nu+pLONawiSJMCEIElqTAiSJMCEIElqTAiSJMCEIElqTAiSJMDrEKSR4kVqGoQjBEkSYEKQJDUmBEkSYEKQJDUuKksjygVmzZcjBEkSMEBCSHJxkv+d5Mkkjyd5f4tfkOTBJN9oP8/vqnNLkiNJnkpyVVf88iSHW9ltSTJYtyRJ8zXICGEG2F1Vfw/YCtyc5BJgD/BQVW0GHmr7tLIdwKXANuD2JGvac90B7AI2t8e2AdolSVqABSeEqjpaVV9q2y8BTw
Lrge3A/nbYfuDqtr0duKeqXq6qp4EjwBVJ1gHnVdXDVVXA3V11JEnLZFHWEJJsBP4+8CgwVlVHoZM0gIvaYeuB57qqTbXY+rY9Ny5JWkYDn2WU5PXA7wC/UlXfPc30f6+COk281+/aRWdqibGxMSYnJ2fLpqenT9g/nd1bZvo6brUZWzu8be+XfVwa/X42Fst8Po/DahT7OFBCSPJDdJLBJ6vqMy38zSTrqupomw461uJTwMVd1TcAL7T4hh7xk1TVPmAfwPj4eE1MTMyWTU5O0r1/OjcM6d9L3r1lho8cHu0zhe3jEjn8vdnN5TgFdT6fx2E1in0c5CyjAHcCT1bVf+4qOgjsbNs7gfu64juSnJtkE53F48fatNJLSba257y+q44kaZkM8jXlp4H3AoeTfLnF/h1wK3AgyY3As8C1AFX1eJIDwBN0zlC6uapebfVuAu4C1gIPtIekJeAFazqVBSeEqvo/9J7/B7jyFHX2Ant7xA8Bly20LZKkwXmlsiQJMCFIkprRPp1D0mm5nqBujhAkSYAjBEmNowU5QpAkASYESVLjlJGkkzh9dHZyhCBJAhwhSDoDRwtnD0cIkiTAhCBJapwyktQ3p49GmyMESRLgCEHSAjlaGD2OECRJgCMESYvA0cJocIQgaVFt3HM/h59/8YQkoeHgCEHSknHkMFxMCJKWxdwRgwli9Vk1CSHJNuBjwBrg41V16wo3SdIScvSw+qyKhJBkDfBfgZ8BpoAvJDlYVU+sbMskLYd+1htMGktvVSQE4ArgSFX9KUCSe4DtgAlBEtBf0jgVk0l/UlUr3QaSXANsq6p/2fbfC/yDqvrlOcftAna13bcAT3UVXwh8axmau5Ls42iwj6NhWPv4d6vqR3oVrJYRQnrETspUVbUP2NfzCZJDVTW+2A1bTezjaLCPo2EU+7harkOYAi7u2t8AvLBCbZGks9JqSQhfADYn2ZTkNcAO4OAKt0mSziqrYsqoqmaS/DLwv+icdvqJqnp8nk/TcyppxNjH0WAfR8PI9XFVLCpLklbeapkykiStMBOCJAkYwoSQZFuSp5IcSbKnR3mS3NbKv5rkHSvRzkH00ce3Jnk4yctJPrASbRxUH328rr1+X03yh0nevhLtHEQffdze+vflJIeS/MOVaOcgztTHruN+Ismr7ZqjodLH6ziR5MX2On45ya+vRDsXRVUNzYPOgvOfAD8GvAb4CnDJnGPeCTxA59qGrcCjK93uJejjRcBPAHuBD6x0m5eojz8FnN+2/9mIvo6v5wfreG8Dvr7S7V7sPnYd9zng94FrVrrdS/A6TgC/t9JtXYzHsI0QZm9xUVV/Axy/xUW37cDd1fEI8MYk65a7oQM4Yx+r6lhVfQF4ZSUauAj66eMfVtV32u4jdK5NGSb99HG62v8owOvocTHmKtfP5xHgfcDvAMeWs3GLpN8+joRhSwjrgee69qdabL7HrGbD3v5+zLePN9IZ9Q2TvvqY5OeSfB24H/jFZWrbYjljH5OsB34O+K1lbNdi6ve9+pNJvpLkgSSXLk/TFt+wJYR+bnHR120wVrFhb38/+u5jkn9MJyF8cElbtPj6vR3L71bVW4Grgf+w5K1aXP308TeAD1bVq8vQnqXQTx+/ROf+QG8HfhP4n0veqiUybAmhn1tcDPttMIa9/f3oq49J3gZ8HNheVX+xTG1bLPN6Havq88CPJ7lwqRu2iPrp4zhwT5JngGuA25NcvTzNWxRn7GNVfbeqptv27wM/NGSv46xhSwj93OLiIHB9O9toK/BiVR1d7oYO4Gy4jccZ+5jk7wCfAd5bVX+8Am0cVD99fHOStO130Fm0HKbEd8Y+VtWmqtpYVRuBe4F/VVXD9A26n9fxR7texyvo/L86TK/jrFVx64p+1SlucZHkl1r5b9E5k+GdwBHgr4BfWKn2LkQ/fUzyo8Ah4Dzg/yX5FTpnPnx3xRo+D32+jr8OvInON0qAmRqiO0v22cd/TufLyyvAXw
P/omuRedXrs49Drc8+XgPclGSGzuu4Y5hex27eukKSBAzflJEkaYmYECRJgAlBktSYECRJgAlBktSYECRJgAlBktT8fwnxIIz8m5muAAAAAElFTkSuQmCC\n",
            "text/plain": [
              "<Figure size 432x288 with 1 Axes>"
            ]
          },
          "metadata": {
            "tags": [],
            "needs_background": "light"
          }
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "t0C89Ux_3DPZ",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# We choose to remove images containing less than 2% brain matter\n",
        "png_IDs_to_remove = train_metadata.query(\"img_pct_window<0.02\").SOPInstanceUID.tolist()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "HjiMcCVX3DPl",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "for ID in tqdm.tqdm(png_IDs_to_remove):\n",
        "    os.remove(\"./rsna_stage1_png_128/stage_1_train_images/{}.png\".format(ID))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "boyg1HUM3DPx",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Once again, we update labels and metadata to reflect the changes we have undertaken\n",
        "labels = labels[~labels[\"ID\"].isin(png_IDs_to_remove)]\n",
        "train_metadata = train_metadata[~train_metadata[\"SOPInstanceUID\"].isin(png_IDs_to_remove)]"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SoMn02rk3DP1",
        "colab_type": "text"
      },
      "source": [
        "## **Step 3: Create New CSVs**"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "MVFPLKrO3DP1",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Here, we write out the cleaned CSV files for labels and metadata\n",
        "labels.to_csv(\"labels_cleaned.csv\")\n",
        "train_metadata.to_csv(\"train_metadata_cleaned.csv\")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0rpJfRme3DQD",
        "colab_type": "text"
      },
      "source": [
        "## **Step 4: Check that Label File & Metadata File Agree**"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3uHZfip1mFpm",
        "colab_type": "text"
      },
      "source": [
        "We have made substantial udpates to both the label file and to the metadata file. It is now prudent to check that both files still agree with each other to ensure the absence of bugs in the above code. We also conduct a few other sanity checks."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "PNaF8UrE3DQE",
        "colab_type": "code",
        "outputId": "f29604dd-e13e-41d1-8819-d79e921c26d9",
        "colab": {}
      },
      "source": [
        "csv_labels = pd.read_csv(\"./rsna_stage1_png_128/stage_1_train.csv\")\n",
        "csv_labels = csv_labels.iloc[5::6, :]\n",
        "csv_labels.head()"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>ID</th>\n",
              "      <th>Label</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>ID_63eb1e259_any</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>11</th>\n",
              "      <td>ID_2669954a7_any</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>17</th>\n",
              "      <td>ID_52c9913b1_any</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>23</th>\n",
              "      <td>ID_4e6ff6126_any</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>29</th>\n",
              "      <td>ID_7858edd88_any</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                  ID  Label\n",
              "5   ID_63eb1e259_any  0    \n",
              "11  ID_2669954a7_any  0    \n",
              "17  ID_52c9913b1_any  0    \n",
              "23  ID_4e6ff6126_any  0    \n",
              "29  ID_7858edd88_any  0    "
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 73
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "x6KgfO-Z3DQI",
        "colab_type": "code",
        "outputId": "8ddd8d0d-4f1d-46af-95ef-14e49f779c6b",
        "colab": {}
      },
      "source": [
        "# Get rid of the \"_any\" part of the IDs & verify that it worked\n",
        "csv_labels[\"ID\"] = csv_labels[\"ID\"].str.replace(\"_any\", \"\")\n",
        "csv_labels.head()"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>ID</th>\n",
              "      <th>Label</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>ID_63eb1e259</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>11</th>\n",
              "      <td>ID_2669954a7</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>17</th>\n",
              "      <td>ID_52c9913b1</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>23</th>\n",
              "      <td>ID_4e6ff6126</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>29</th>\n",
              "      <td>ID_7858edd88</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "              ID  Label\n",
              "5   ID_63eb1e259  0    \n",
              "11  ID_2669954a7  0    \n",
              "17  ID_52c9913b1  0    \n",
              "23  ID_4e6ff6126  0    \n",
              "29  ID_7858edd88  0    "
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 74
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "c_PQHFQv3DQU",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Verify that \"csv_labels\" and \"labels\" contain the same number of elements (they do)\n",
        "csv_labels_set = set(csv_labels[\"ID\"].tolist())\n",
        "labels_set = set(labels[\"ID\"].str.replace(\".png\", \"\").tolist())\n",
        "\n",
        "print(len(csv_labels_set))\n",
        "print(len(labels_set))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "rYF4FM4K3DQY",
        "colab_type": "code",
        "outputId": "454a4d95-f875-4872-ee10-918fa7d15b8e",
        "colab": {}
      },
      "source": [
        "# Verify that \"csv_labels\" and \"labels\" contain the same elements\n",
        "# (Indeed they do since their difference is the empty set)\n",
        "csv_labels_set.symmetric_difference(labels_set)"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "set()"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 76
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "SDT9V2Fj3DQk",
        "colab_type": "code",
        "outputId": "b71a8c73-b206-4a5c-b279-2659bf1818ca",
        "colab": {}
      },
      "source": [
        "# Also verify that \"metadata_labels\" and \"labels\" contain the same elements\n",
        "metadata_labels_set = set(train_metadata[\"SOPInstanceUID\"].tolist())\n",
        "metadata_labels_set.symmetric_difference(labels_set)"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "set()"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 79
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "GjiX_onx3DQn",
        "colab_type": "code",
        "outputId": "67fab19e-a2b7-41d7-e72f-e0c606112535",
        "colab": {}
      },
      "source": [
        "# Get rid of .png extension\n",
        "fnames = glob.glob(\"./rsna_stage1_png_128/stage_1_train_images/*\")\n",
        "fnames = [fname.replace(\"./rsna_stage1_png_128/stage_1_train_images\\\\\", \"\").replace(\".png\", \"\") for fname in fnames]\n",
        "print(fnames[0:10])\n",
        "print(len(fnames))"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "['ID_000039fa0', 'ID_00005679d', 'ID_00008ce3c', 'ID_0000950d7', 'ID_0000aee4b', 'ID_0000f1657', 'ID_000178e76', 'ID_00019828f', 'ID_0001dcc25', 'ID_0001de0e8']\n",
            "641386\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "M-8LSvt50L28",
        "colab_type": "text"
      },
      "source": [
        "We have cleaned up the data by deleting ~30,000 images from it:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "1ZoqfyXc3DQ1",
        "colab_type": "code",
        "outputId": "1a5a26fa-e566-48aa-f4bc-7573d8c22ba8",
        "colab": {}
      },
      "source": [
        "fnames_set = set(fnames)\n",
        "len(fnames_set.symmetric_difference(metadata_labels_set))"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "32872"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 94
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ai0rn23E3DQ5",
        "colab_type": "text"
      },
      "source": [
        "# **Reconstructing the Underlying CT Scans**"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TdrbjOxp0-aL",
        "colab_type": "text"
      },
      "source": [
        "Now, we can finally order the images in the sequence that they were taken in. That is, for each patient, we will now reconstruct their brain by arranging the images from their CT scan by increasing scan depth. This will allow us to make use of the **spatial relations** present between a patient's various images (specifically, we do this through the use of a bidirectional LSTM)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "c3062b_a3DRC",
        "colab_type": "code",
        "outputId": "c221707f-c0b0-49d6-fdf1-1974a6c43785",
        "colab": {}
      },
      "source": [
        "# This is the part of the metadata the we care about:\n",
        "# \"ImagePositionPatient2\" specifies the slice depth\n",
        "train_metadata_slice_patient_depth = train_metadata[[\"SOPInstanceUID\", \"PatientID\", \"ImagePositionPatient2\"]]\n",
        "train_metadata_slice_patient_depth.head()"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>SOPInstanceUID</th>\n",
              "      <th>PatientID</th>\n",
              "      <th>ImagePositionPatient2</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>ID_231d901c1</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>104.307000</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>ID_994bc0470</td>\n",
              "      <td>ID_400facde</td>\n",
              "      <td>223.572015</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>ID_127689cce</td>\n",
              "      <td>ID_42910d3d</td>\n",
              "      <td>124.321068</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>ID_25457734a</td>\n",
              "      <td>ID_329aafa7</td>\n",
              "      <td>171.999939</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>ID_87e8b2528</td>\n",
              "      <td>ID_d6e578fb</td>\n",
              "      <td>156.828114</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "  SOPInstanceUID    PatientID  ImagePositionPatient2\n",
              "0  ID_231d901c1   ID_b81a287f  104.307000           \n",
              "1  ID_994bc0470   ID_400facde  223.572015           \n",
              "2  ID_127689cce   ID_42910d3d  124.321068           \n",
              "3  ID_25457734a   ID_329aafa7  171.999939           \n",
              "5  ID_87e8b2528   ID_d6e578fb  156.828114           "
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 141
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "pP3JfpDP3DRF",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# We create the \"patient_slices\" dictionary:\n",
        "# keys are patients, values are (slice depth and file names)\n",
        "patient_slices = dict()\n",
        "for i, row in tqdm.tqdm(train_metadata_slice_patient_depth.iterrows()):\n",
        "    if row[\"PatientID\"] not in patient_slices:\n",
        "        patient_slices[row[\"PatientID\"]] = [(row[\"ImagePositionPatient2\"], row[\"SOPInstanceUID\"])]\n",
        "    else:\n",
        "        patient_slices[row[\"PatientID\"]].append((row[\"ImagePositionPatient2\"], row[\"SOPInstanceUID\"]))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "b4qQe0SF3DRI",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# We now sort the dictionary created above so as to arrange slices in the order that they belong in,\n",
        "# thereby reconstructing all patient brains\n",
        "patient_slices_sorted = dict()\n",
        "for i, (key, val) in enumerate(patient_slices.items()):\n",
        "    val.sort()\n",
        "    patient_slices_sorted[key] = [ID for depth, ID in val]"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0Rq67G_v_wHZ",
        "colab_type": "text"
      },
      "source": [
        "Below is an example of one (1) patient's reconstructed CT scan. Note the increasing slice depth:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "-WHINOCm3DRi",
        "colab_type": "code",
        "outputId": "15307222-89cf-43c8-ffa4-685749cf0109",
        "colab": {}
      },
      "source": [
        "train_metadata_slice_patient_depth.query(\"PatientID == 'ID_b81a287f'\").sort_values(\"ImagePositionPatient2\")"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>SOPInstanceUID</th>\n",
              "      <th>PatientID</th>\n",
              "      <th>ImagePositionPatient2</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>214258</th>\n",
              "      <td>ID_9f601fc5d</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>7.956</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>427173</th>\n",
              "      <td>ID_19cb96474</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>13.033</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>636486</th>\n",
              "      <td>ID_496ab2661</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>18.110</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>436756</th>\n",
              "      <td>ID_06fe4adc5</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>23.187</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>68576</th>\n",
              "      <td>ID_59bc3960f</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>28.266</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>331217</th>\n",
              "      <td>ID_aef3564ac</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>33.343</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>503169</th>\n",
              "      <td>ID_c1f0895bb</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>38.420</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>467952</th>\n",
              "      <td>ID_c85081ef5</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>43.497</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>53879</th>\n",
              "      <td>ID_079fef6c2</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>48.456</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>245073</th>\n",
              "      <td>ID_5ab7d0d9a</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>53.533</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>287067</th>\n",
              "      <td>ID_54d628968</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>58.610</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>218156</th>\n",
              "      <td>ID_d3e4638f6</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>63.687</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>611699</th>\n",
              "      <td>ID_21af3f314</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>68.766</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>435682</th>\n",
              "      <td>ID_3eb115349</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>73.843</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>10788</th>\n",
              "      <td>ID_c489f8a64</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>78.920</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>653377</th>\n",
              "      <td>ID_55e73915f</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>83.997</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>552646</th>\n",
              "      <td>ID_678a1a095</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>89.076</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>492896</th>\n",
              "      <td>ID_508bd479e</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>94.153</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>297718</th>\n",
              "      <td>ID_6609a6357</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>99.230</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>ID_231d901c1</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>104.307</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>214284</th>\n",
              "      <td>ID_826c83df4</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>109.386</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>18784</th>\n",
              "      <td>ID_008507574</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>114.463</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>562637</th>\n",
              "      <td>ID_f49234b83</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>119.540</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>125258</th>\n",
              "      <td>ID_ec9041d07</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>124.617</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>486312</th>\n",
              "      <td>ID_bcbee580c</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>129.696</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>411439</th>\n",
              "      <td>ID_df700e73f</td>\n",
              "      <td>ID_b81a287f</td>\n",
              "      <td>134.773</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "       SOPInstanceUID    PatientID  ImagePositionPatient2\n",
              "214258  ID_9f601fc5d   ID_b81a287f  7.956                \n",
              "427173  ID_19cb96474   ID_b81a287f  13.033               \n",
              "636486  ID_496ab2661   ID_b81a287f  18.110               \n",
              "436756  ID_06fe4adc5   ID_b81a287f  23.187               \n",
              "68576   ID_59bc3960f   ID_b81a287f  28.266               \n",
              "331217  ID_aef3564ac   ID_b81a287f  33.343               \n",
              "503169  ID_c1f0895bb   ID_b81a287f  38.420               \n",
              "467952  ID_c85081ef5   ID_b81a287f  43.497               \n",
              "53879   ID_079fef6c2   ID_b81a287f  48.456               \n",
              "245073  ID_5ab7d0d9a   ID_b81a287f  53.533               \n",
              "287067  ID_54d628968   ID_b81a287f  58.610               \n",
              "218156  ID_d3e4638f6   ID_b81a287f  63.687               \n",
              "611699  ID_21af3f314   ID_b81a287f  68.766               \n",
              "435682  ID_3eb115349   ID_b81a287f  73.843               \n",
              "10788   ID_c489f8a64   ID_b81a287f  78.920               \n",
              "653377  ID_55e73915f   ID_b81a287f  83.997               \n",
              "552646  ID_678a1a095   ID_b81a287f  89.076               \n",
              "492896  ID_508bd479e   ID_b81a287f  94.153               \n",
              "297718  ID_6609a6357   ID_b81a287f  99.230               \n",
              "0       ID_231d901c1   ID_b81a287f  104.307              \n",
              "214284  ID_826c83df4   ID_b81a287f  109.386              \n",
              "18784   ID_008507574   ID_b81a287f  114.463              \n",
              "562637  ID_f49234b83   ID_b81a287f  119.540              \n",
              "125258  ID_ec9041d07   ID_b81a287f  124.617              \n",
              "486312  ID_bcbee580c   ID_b81a287f  129.696              \n",
              "411439  ID_df700e73f   ID_b81a287f  134.773              "
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 162
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "-_kOiXjQ3DRt",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Save the dictionary mapping each patient ID to its depth-ordered slice IDs\n",
        "with open(\"ordered_slices_by_patient.pkl\", \"wb\") as f:\n",
        "    pickle.dump(patient_slices_sorted, f)"
      ],
      "execution_count": 0,
      "outputs": []
    }
  ]
}