{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" }, "colab": { "name": "EDA and Data Cleaning.ipynb", "provenance": [], "collapsed_sections": [] } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "Xl323Q-gMutB", "colab_type": "text" }, "source": [ "Credit to Jeremy Howard, who discovered that some files have a corrupted rescale intercept and that other files show very little or no brain matter (https://www.kaggle.com/jhoward/cleaning-the-data-for-rapid-prototyping-fastai)" ] }, { "cell_type": "markdown", "metadata": { "id": "YnJowERx3Sik", "colab_type": "text" }, "source": [ "# **Importing Dependencies**" ] }, { "cell_type": "code", "metadata": { "id": "CPe_QcSy3DLo", "colab_type": "code", "colab": {} }, "source": [ "import glob\n", "import os\n", "import pickle\n", "\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "import tqdm.notebook as tqdm\n", "\n", "pd.set_option('display.max_columns', 500)\n", "# None disables column-width truncation (pandas >= 1.0, where -1 is deprecated)\n", "pd.set_option('display.max_colwidth', None)" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "yObm5-yx4FSf", "colab_type": "text" }, "source": [ "# **Reading in Metadata & Data Labels**" ] }, { "cell_type": "markdown", "metadata": { "id": "YA20uuhBVxAn", "colab_type": "text" }, "source": [ "DICOM files---the type of data we are dealing with here---contain not only images (like a slice of a patient's brain) but also a wealth of metadata. 
Using parts of this metadata (specifically, patient IDs and slice positions) lets us \"piece together\" *full* scans of patient brains from otherwise unrelated images.\n", "\n", "This step is crucial for our approach: for our sequential model to make use of the spatial relations inherent in CT scans, we must first reconstruct those spatial relations." ] }, { "cell_type": "code", "metadata": { "id": "-7aoQvsa3DMP", "colab_type": "code", "colab": {} }, "source": [ "labels = pd.read_feather(\"./labels_jhoward.fth\")\n", "train_metadata = pd.read_feather(\"./df_trn.fth\")" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "Ok_pdN2o3DMd", "colab_type": "code", "colab": {} }, "source": [ "# Keep only the ID and \"any\" columns; .copy() avoids a SettingWithCopyWarning on the assignment below\n", "labels = labels[[\"ID\", \"any\"]].copy()\n", "# Add .png extension to file IDs in the \"labels\" DataFrame\n", "labels[\"ID\"] = labels[\"ID\"] + \".png\"" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "pRB1cxyd3DMg", "colab_type": "code", "outputId": "f3b361aa-0fe2-4f6a-977e-7957846c169e", "colab": {} }, "source": [ "# Verify that the above cell executed correctly\n", "labels.head()" ], "execution_count": 0, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " ID any\n", "0 ID_000039fa0.png 0 \n", "1 ID_00005679d.png 0 \n", "2 ID_00008ce3c.png 0 \n", "3 ID_0000950d7.png 0 \n", "4 ID_0000aee4b.png 0 " ] }, "metadata": { "tags": [] }, "execution_count": 99 } ] }, { "cell_type": "markdown", "metadata": { "id": "czhB_jGTMRGn", "colab_type": "text" }, "source": [ "# **EDA**" ] }, { "cell_type": "code", "metadata": { "id": "GSdYiywS3DMp", "colab_type": "code", "outputId": "c8375c53-cb4c-4dcb-dcbe-746fc203665f", "colab": {} }, "source": [ "# We find that there are ~700,000 images, 14% of which contain hemorrhages\n", "# (see \"mean\" (note that hemorrhages have label 1 while no-hemorrhages have label 0))\n", "labels.describe()" ], "execution_count": 0, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " any\n", "count 674258.000000\n", "mean 0.144015 \n", "std 0.351105 \n", "min 0.000000 \n", "25% 0.000000 \n", "50% 0.000000 \n", "75% 0.000000 \n", "max 1.000000 " ] }, "metadata": { "tags": [] }, "execution_count": 100 } ] }, { "cell_type": "code", "metadata": { "id": "7Aep4NdF3DMs", "colab_type": "code", "outputId": "44228e66-21e0-4eae-c68b-c97f71da97f1", "colab": {} }, "source": [ "# For reference, the full metadata contained in 5 DICOM files\n", "train_metadata.head()" ], "execution_count": 0, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " SOPInstanceUID Modality PatientID StudyInstanceUID SeriesInstanceUID \\\n", "0 ID_231d901c1 CT ID_b81a287f ID_dd37ba3adb ID_15dcd6057a \n", "1 ID_994bc0470 CT ID_400facde ID_c5277f0c63 ID_4ba12c2161 \n", "2 ID_127689cce CT ID_42910d3d ID_db93ade25b ID_c4b4931314 \n", "3 ID_25457734a CT ID_329aafa7 ID_8dd6d32f3b ID_116558f409 \n", "4 ID_81c9aa125 CT ID_6b544c3c ID_2685c5d5c0 ID_f56d7bd0f9 \n", "\n", " StudyID ImagePositionPatient ImageOrientationPatient SamplesPerPixel \\\n", "0 -125.0 1.0 1 \n", "1 -125.0 1.0 1 \n", "2 -125.0 1.0 1 \n", "3 -114.0 1.0 1 \n", "4 -115.0 1.0 1 \n", "\n", " PhotometricInterpretation Rows Columns PixelSpacing BitsAllocated \\\n", "0 MONOCHROME2 512 512 0.488281 16 \n", "1 MONOCHROME2 512 512 0.488281 16 \n", "2 MONOCHROME2 512 512 0.488281 16 \n", "3 MONOCHROME2 512 512 0.445312 16 \n", "4 MONOCHROME2 512 512 0.449219 16 \n", "\n", " BitsStored HighBit PixelRepresentation WindowCenter WindowWidth \\\n", "0 16 15 1 40.0 100.0 \n", "1 12 11 0 47.0 80.0 \n", "2 16 15 1 30.0 80.0 \n", "3 12 11 0 36.0 80.0 \n", "4 12 11 0 36.0 80.0 \n", "\n", " RescaleIntercept RescaleSlope \\\n", "0 -1024.0 1.0 \n", "1 -1024.0 1.0 \n", "2 -1024.0 1.0 \n", "3 -1024.0 1.0 \n", "4 -1024.0 1.0 \n", "\n", " fname \\\n", "0 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_231d901c1.dcm \n", "1 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_994bc0470.dcm \n", "2 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_127689cce.dcm \n", "3 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_25457734a.dcm \n", "4 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_81c9aa125.dcm \n", "\n", " MultiImagePositionPatient ImagePositionPatient1 ImagePositionPatient2 \\\n", "0 1 -123.101000 104.307000 \n", "1 1 53.628222 223.572015 \n", "2 1 -123.646240 124.321068 \n", "3 1 -6.000000 171.999939 \n", "4 1 -1.000000 230.500000 \n", "\n", " 
MultiImageOrientationPatient ImageOrientationPatient1 \\\n", "0 1 0.0 \n", "1 1 0.0 \n", "2 1 0.0 \n", "3 1 0.0 \n", "4 1 0.0 \n", "\n", " ImageOrientationPatient2 ImageOrientationPatient3 \\\n", "0 0.0 0.0 \n", "1 0.0 0.0 \n", "2 0.0 0.0 \n", "3 0.0 0.0 \n", "4 0.0 0.0 \n", "\n", " ImageOrientationPatient4 ImageOrientationPatient5 MultiPixelSpacing \\\n", "0 0.984808 -0.173648 1 \n", "1 0.933580 -0.358368 1 \n", "2 0.972370 -0.233445 1 \n", "3 1.000000 0.000000 1 \n", "4 1.000000 0.000000 1 \n", "\n", " PixelSpacing1 img_min img_max img_mean img_std img_pct_window \\\n", "0 0.488281 -1024 3263 171.462490 828.102464 0.164074 \n", "1 0.488281 0 2507 430.418091 599.742963 0.198139 \n", "2 0.488281 -2000 2810 12.801376 1209.046168 0.250923 \n", "3 0.445312 0 2647 566.557011 610.152845 0.298386 \n", "4 0.449219 4 1570 178.512295 358.235071 0.006176 \n", "\n", " MultiWindowCenter WindowCenter1 MultiWindowWidth WindowWidth1 \n", "0 NaN NaN NaN NaN \n", "1 1.0 47.0 1.0 80.0 \n", "2 NaN NaN NaN NaN \n", "3 1.0 36.0 1.0 80.0 \n", "4 1.0 36.0 1.0 80.0 " ] }, "metadata": { "tags": [] }, "execution_count": 101 } ] }, { "cell_type": "markdown", "metadata": { "id": "6Xai9HMWZIjT", "colab_type": "text" }, "source": [ "Sorting by \"PatientID\" groups each patient's files together, while additionally sorting by \"ImagePositionPatient2\" puts that patient's brain slices in the correct spatial order (thus, the 20 files output by the cell below are consecutive slices of a single patient's brain)." 
] }, { "cell_type": "code", "metadata": { "scrolled": true, "id": "laKYqlaO3DMy", "colab_type": "code", "colab": {} }, "source": [ "train_metadata.sort_values(by=[\"PatientID\", \"ImagePositionPatient2\"]).head(20)" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "RgF5wYFK3DM-", "colab_type": "code", "outputId": "4f8c3482-d623-4668-83dc-972423dd2bc3", "colab": {} }, "source": [ "# Verify that we can retrieve the name of our files from the metadata\n", "# (important for matching our PNGs extracted from DICOM files to their metadata)\n", "train_metadata[[\"SOPInstanceUID\", \"fname\"]].sort_values(\"SOPInstanceUID\")" ], "execution_count": 0, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "

674258 rows × 2 columns

\n", "
" ], "text/plain": [ " SOPInstanceUID \\\n", "409738 ID_000039fa0 \n", "470057 ID_00005679d \n", "548095 ID_00008ce3c \n", "204704 ID_0000950d7 \n", "291987 ID_0000aee4b \n", "... ... \n", "544908 ID_ffff73ede \n", "385867 ID_ffff80705 \n", "674027 ID_ffff82e46 \n", "52232 ID_ffff922b9 \n", "5317 ID_fffff9393 \n", "\n", " fname \n", "409738 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_000039fa0.dcm \n", "470057 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_00005679d.dcm \n", "548095 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_00008ce3c.dcm \n", "204704 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_0000950d7.dcm \n", "291987 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_0000aee4b.dcm \n", "... ... \n", "544908 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_ffff73ede.dcm \n", "385867 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_ffff80705.dcm \n", "674027 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_ffff82e46.dcm \n", "52232 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_ffff922b9.dcm \n", "5317 ../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_fffff9393.dcm \n", "\n", "[674258 rows x 2 columns]" ] }, "metadata": { "tags": [] }, "execution_count": 103 } ] }, { "cell_type": "markdown", "metadata": { "id": "P8goUuHTaBA9", "colab_type": "text" }, "source": [ "Next, we want to know how many slices of a patient's brain the CT scans in our data usually contain. The histogram below tells us that the answer is ~30." 
] }, { "cell_type": "code", "metadata": { "id": "Tg-1hmrC3DND", "colab_type": "code", "outputId": "0bfdf379-e0c9-4eeb-f97d-8b2a70e5853d", "colab": {} }, "source": [ "plt.figure(figsize=(20, 6))\n", "train_metadata.groupby(\"PatientID\").Modality.count().hist(bins=150)\n", "plt.show()" ], "execution_count": 0, "outputs": [ { "output_type": "display_data", "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAABIoAAAFlCAYAAACEOwMFAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8li6FKAAAbUUlEQVR4nO3db4xd9Xkn8O8TO6EolAaUMEIYrVnJ6pY/SrpYLKuoqyFki3eJCm+QXNHGrFhZimiVSkit6ZuqL5B4VTVRS7RWksVR0lpW2wgrlLTI7ShaiYRAmy4BgrCCl7h48bZNWpwXtKbPvphftnfN2L5jj2fGcz8f6eqe89zfmfO76PlF8VfnnFvdHQAAAAB411pPAAAAAID1QVAEAAAAQBJBEQAAAACDoAgAAACAJIIiAAAAAAZBEQAAAABJks1rPYGzef/7399bt26devwPf/jDvPe9771wE4J1TP8zy/Q/s0z/M+usAWaZ/udcPffcc3/T3R84tb7ug6KtW7fm2WefnXr8wsJC5ufnL9yEYB3T/8wy/c8s0//MOmuAWab/OVdV9b+Wqrv1DAAAAIAkgiIAAAAABkERAAAAAEkERQAAAAAMgiIAAAAAkgiKAAAAABgERQAAAAAkERQBAAAAMAiKAAAAAEgiKAIAAABgEBQBAAAAkERQBAAAAMAgKAIAAAAgSbJ5rSfAudu654kl60ceuXOVZwIAAABsBK4oAgAAACCJoAgAAACAQVAEAAAAQBJBEQAAAACDoAgAAACAJIIiAAAAAAZBEQAAAABJBEUAAAAADFMFRVX1vqr6g6r6TlW9VFX/vqqurKqnquqV8X7FxPiHqupwVb1cVXdM1G+uqufHZ5+uqroQXwoAAACA5Zv2iqJPJflqd/+bJB9M8lKSPUkOdfe2JIfGfqrq+iQ7k9yQZEeSR6tq0/g7n0myO8m28dqxQt8DAAAAgPN01qCoqi5P8h+SfC5Juvsfu/sHSe5Ksm8M25fk7rF9V5L93f1Wd7+a5HCSW6rq6iSXd/fT3d1JvjBxDAAAAABrbPMUY/51kv+T5L9X1QeTPJfkk0nmuvtYknT3saq6aoy/JsnXJ44/Omr/NLZPrb9DVe3O4pVHmZuby8LCwrTfJydOnFjW+IvZgzedXLI+K9+fd5ql/odT6X9mmf5n1lkDzDL9z0qbJijanOTfJvnl7v5GVX0q4zaz01jquUN9hvo7i917k+xNku3bt/f8/PwU01y0sLCQ5Yy/mN2354kl60funV/dibBuzFL/w6n0P7NM/zPrrAFmmf5npU3zjKKjSY529zfG/h9kMTh6Y9xOlvF+fGL8tRPHb0ny+qhvWaIOAAAAwDpw1qCou/93ku9V1U+O0u1JXkxyMMmuUduV5PGxfTDJzqq6pKquy+JDq58Zt6m9WVW3jl87+/jEMQAAAACssWluPUuSX07ypap6T5LvJvkvWQyZDlTV/UleS3JPknT3C1V1IIth0skkD3T32+PvfCLJY0kuTfLkeAEAAACwDkwVFHX3t5JsX+Kj208z/uEkDy9RfzbJjcuZIAAAAACrY5pnFAEAAAAwAwRFAAAAACQRFAEAAAAwCIoAAAA
ASCIoAgAAAGAQFAEAAACQRFAEAAAAwCAoAgAAACCJoAgAAACAQVAEAAAAQBJBEQAAAACDoAgAAACAJIIiAAAAAAZBEQAAAABJBEUAAAAADIIiAAAAAJIIigAAAAAYBEUAAAAAJBEUAQAAADAIigAAAABIIigCAAAAYBAUAQAAAJBEUAQAAADAICgCAAAAIImgCAAAAIBBUAQAAABAEkERAAAAAIOgCAAAAIAkgiIAAAAABkERAAAAAEkERQAAAAAMgiIAAAAAkgiKAAAAABgERQAAAAAkERQBAAAAMAiKAAAAAEgiKAIAAABgEBQBAAAAkERQBAAAAMAgKAIAAAAgiaAIAAAAgGGqoKiqjlTV81X1rap6dtSurKqnquqV8X7FxPiHqupwVb1cVXdM1G8ef+dwVX26qmrlvxIAAAAA52I5VxTd1t0f6u7tY39PkkPdvS3JobGfqro+yc4kNyTZkeTRqto0jvlMkt1Jto3XjvP/CgAAAACshPO59eyuJPvG9r4kd0/U93f3W939apLDSW6pqquTXN7dT3d3J/nCxDEAAAAArLFazGzOMqjq1STfT9JJ/lt3762qH3T3+ybGfL+7r6iq30ny9e7+4qh/LsmTSY4keaS7PzrqP5Pk17r7Y0ucb3cWrzzK3Nzczfv375/6C504cSKXXXbZ1OMvZs//9d8vWb/pmp9Y5ZmwXsxS/8Op9D+zTP8z66wBZpn+51zddtttz03cNfb/bJ7y+A939+tVdVWSp6rqO2cYu9Rzh/oM9XcWu/cm2Zsk27dv7/n5+SmnmSwsLGQ54y9m9+15Ysn6kXvnV3cirBuz1P9wKv3PLNP/zDprgFmm/1lpU9161t2vj/fjSb6c5JYkb4zbyTLej4/hR5NcO3H4liSvj/qWJeoAAAAArANnDYqq6r1V9eM/2k7ys0m+neRgkl1j2K4kj4/tg0l2VtUlVXVdFh9a/Ux3H0vyZlXdOn7t7OMTxwAAAACwxqa59WwuyZfHL9lvTvJ73f3VqvpmkgNVdX+S15LckyTd/UJVHUjyYpKTSR7o7rfH3/pEkseSXJrF5xY9uYLfBQAAAIDzcNagqLu/m+SDS9T/Nsntpznm4SQPL1F/NsmNy58mAAAAABfaVM8oAgAAAGDjExQBAAAAkERQBAAAAMAgKAIAAAAgiaAIAAAAgEFQBAAAAEASQREAAAAAg6AIAAAAgCSCIgAAAAAGQREAAAAASQRFAAAAAAyCIgAAAACSCIoAAAAAGARFAAAAACQRFAEAAAAwCIoAAAAASCIoAgAAAGAQFAEAAACQRFAEAAAAwCAoAgAAACCJoAgAAACAQVAEAAAAQBJBEQAAAACDoAgAAACAJIIiAAAAAAZBEQAAAABJBEUAAAAADIIiAAAAAJIIigAAAAAYBEUAAAAAJBEUAQAAADAIigAAAABIIigCAAAAYBAUAQAAAJBEUAQAAADAICgCAAAAIImgCAAAAIBBUAQAAABAEkERAAAAAIOgCAAAAIAkywiKqmpTVf1lVX1l7F9ZVU9V1Svj/YqJsQ9V1eGqermq7pio31xVz4/PPl1VtbJfBwAAAIBztZwrij6Z5KWJ/T1JDnX3tiSHxn6q6vokO5PckGRHkkeratM45jNJdifZNl47zmv2AAAAAKyYqYKiqtqS5M4kn50o35Vk39jel+Tuifr+7n6ru19NcjjJLVV1dZLLu/vp7u4kX5g4BgAAAIA1Nu0VRb+d5FeT/PNEba67jyXJeL9q1K9J8r2JcUdH7ZqxfWodAAAAgHVg89kGVNXHkhzv7ueqan6Kv7nUc4f6DPWlzrk7i7eoZW5uLgsLC1OcdtGJEyeWNf5i9uBNJ5esz8r3551mqf/hVPqfWab/mXXWALNM/7PSzhoUJflwkp+rqv+c5MeSXF5VX0zyRlVd3d3Hxm1lx8f4o0munTh+S5LXR33LEvV36O69SfYmyfbt23t+fn7qL7SwsJDljL+Y3bfniSXrR+6dX92
JsG7MUv/DqfQ/s0z/M+usAWaZ/melnfXWs+5+qLu3dPfWLD6k+s+6+xeSHEyyawzbleTxsX0wyc6quqSqrsviQ6ufGbenvVlVt45fO/v4xDEAAAAArLFprig6nUeSHKiq+5O8luSeJOnuF6rqQJIXk5xM8kB3vz2O+USSx5JcmuTJ8QIAAABgHVhWUNTdC0kWxvbfJrn9NOMeTvLwEvVnk9y43EkCAAAAcOFN+6tnAAAAAGxwgiIAAAAAkgiKAAAAABgERQAAAAAkERQBAAAAMAiKAAAAAEgiKAIAAABgEBQBAAAAkERQBAAAAMAgKAIAAAAgiaAIAAAAgEFQBAAAAEASQREAAAAAg6AIAAAAgCSCIgAAAAAGQREAAAAASQRFAAAAAAyCIgAAAACSCIoAAAAAGARFAAAAACQRFAEAAAAwCIoAAAAASCIoAgAAAGAQFAEAAACQRFAEAAAAwCAoAgAAACCJoAgAAACAQVAEAAAAQBJBEQAAAACDoAgAAACAJIIiAAAAAAZBEQAAAABJBEUAAAAADIIiAAAAAJIIigAAAAAYBEUAAAAAJBEUAQAAADAIigAAAABIIigCAAAAYBAUAQAAAJBEUAQAAADAcNagqKp+rKqeqaq/qqoXquo3R/3Kqnqqql4Z71dMHPNQVR2uqper6o6J+s1V9fz47NNVVRfmawEAAACwXNNcUfRWko909weTfCjJjqq6NcmeJIe6e1uSQ2M/VXV9kp1JbkiyI8mjVbVp/K3PJNmdZNt47VjB7wIAAADAeThrUNSLTozdd49XJ7kryb5R35fk7rF9V5L93f1Wd7+a5HCSW6rq6iSXd/fT3d1JvjBxDAAAAABrbKpnFFXVpqr6VpLjSZ7q7m8kmevuY0ky3q8aw69J8r2Jw4+O2jVj+9Q6AAAAAOvA5mkGdffbST5UVe9L8uWquvEMw5d67lCfof7OP1C1O4u3qGVubi4LCwvTTDNJcuLEiWWNv5g9eNPJJeuz8v15p1nqfziV/meW6X9mnTXALNP/rLSpgqIf6e4fVNVCFp8t9EZVXd3dx8ZtZcfHsKNJrp04bEuS10d9yxL1pc6zN8neJNm+fXvPz89PPceFhYUsZ/zF7L49TyxZP3Lv/OpOhHVjlvofTqX/mWX6n1lnDTDL9D8rbZpfPfvAuJIoVXVpko8m+U6Sg0l2jWG7kjw+tg8m2VlVl1TVdVl8aPUz4/a0N6vq1vFrZx+fOAYAAACANTbNFUVXJ9k3frnsXUkOdPdXqurpJAeq6v4kryW5J0m6+4WqOpDkxSQnkzwwbl1Lkk8keSzJpUmeHC/OYOtprhoCAAAAWGlnDYq6+38m+ekl6n+b5PbTHPNwkoeXqD+b5EzPNwIAAABgjUz1q2cAAAAAbHyCIgAAAACSCIoAAAAAGARFAAAAACQRFAEAAAAwCIoAAAAASCIoAgAAAGAQFAEAAACQRFAEAAAAwCAoAgAAACCJoAgAAACAQVAEAAAAQBJBEQAAAADD5rWeACtv654nTvvZkUfuXMWZAAAAABcTVxQBAAAAkERQBAAAAMAgKAIAAAAgiaAIAAAAgEFQBAAAAEASQREAAAAAg6AIAAAAgCSCIgAAAAAGQREAAAAASQRFAAAAAAyCIgAAAACSJJvXegKsrq17njjtZ0ceuXMVZwIAAACsN64oAgAAACCJoAgAAACAQVAEAAAAQBJBEQAAAACDoAgAAACAJIIiAAAAAAZBEQAAAABJBEUAAAAADIIiAAAAAJIIigAAAAAYBEUAAAAAJBEUAQAAADAIigAAAABIIigCAAAAYBAUAQAAAJBkiqCoqq6tqj+vqpeq6oWq+uSoX1lVT1XVK+P9ioljHqqqw1X1clXdMVG/uaqeH599uqrqwnwtAAAAAJZrmiuKTiZ5sLt/KsmtSR6oquuT7ElyqLu3JTk09jM+25nkhiQ7kjxaVZvG3/pMkt1Jto3XjhX8LgAAAACch7MGRd19rLv/Ymy/meS
lJNckuSvJvjFsX5K7x/ZdSfZ391vd/WqSw0luqaqrk1ze3U93dyf5wsQxAAAAAKyxWsxsphxctTXJ15LcmOS17n7fxGff7+4rqup3kny9u7846p9L8mSSI0ke6e6PjvrPJPm17v7YEufZncUrjzI3N3fz/v37p57jiRMnctlll009fr17/q//ftXOddM1P7Fq5+LC2Gj9D8uh/5ll+p9ZZw0wy/Q/5+q22257rru3n1rfPO0fqKrLkvxhkl/p7n84w+OFlvqgz1B/Z7F7b5K9SbJ9+/aen5+fdppZWFjIcsavd/fteWLVznXk3vlVOxcXxkbrf1gO/c8s0//MOmuAWab/WWlT/epZVb07iyHRl7r7j0b5jXE7Wcb78VE/muTaicO3JHl91LcsUQcAAABgHZjmV88qyeeSvNTdvzXx0cEku8b2riSPT9R3VtUlVXVdFh9a/Ux3H0vyZlXdOv7mxyeOAQAAAGCNTXPr2YeT/GKS56vqW6P260keSXKgqu5P8lqSe5Kku1+oqgNJXsziL6Y90N1vj+M+keSxJJdm8blFT67Q9wAAAADgPJ01KOru/5Glny+UJLef5piHkzy8RP3ZLD4IGwAAAIB1ZqpnFAEAAACw8QmKAAAAAEgiKAIAAABgEBQBAAAAkERQBAAAAMAgKAIAAAAgiaAIAAAAgEFQBAAAAEASQREAAAAAg6AIAAAAgCSCIgAAAAAGQREAAAAASQRFAAAAAAyCIgAAAACSCIoAAAAAGARFAAAAACQRFAEAAAAwCIoAAAAASCIoAgAAAGAQFAEAAACQRFAEAAAAwCAoAgAAACCJoAgAAACAQVAEAAAAQBJBEQAAAACDoAgAAACAJIIiAAAAAAZBEQAAAABJBEUAAAAADIIiAAAAAJIIigAAAAAYBEUAAAAAJBEUAQAAADAIigAAAABIIigCAAAAYBAUAQAAAJBEUAQAAADAICgCAAAAIImgCAAAAIBBUAQAAABAEkERAAAAAMNZg6Kq+nxVHa+qb0/Urqyqp6rqlfF+xcRnD1XV4ap6uarumKjfXFXPj88+XVW18l8HAAAAgHM1zRVFjyXZcUptT5JD3b0tyaGxn6q6PsnOJDeMYx6tqk3jmM8k2Z1k23id+jcBAAAAWENnDYq6+2tJ/u6U8l1J9o3tfUnunqjv7+63uvvVJIeT3FJVVye5vLuf7u5O8oWJYwAAAABYBzaf43Fz3X0sSbr7WFVdNerXJPn6xLijo/ZPY/vU+pKqancWrz7K3NxcFhYWpp7YiRMnljV+vXvwppOrdq6N9N9tVm20/ofl0P/MMv3PrLMGmGX6n5V2rkHR6Sz13KE+Q31J3b03yd4k2b59e8/Pz089gYWFhSxn/Hp3354nVu1cR+6dX7VzcWFstP6H5dD/zDL9z6yzBphl+p+Vdq6/evbGuJ0s4/34qB9Ncu3EuC1JXh/1LUvUAQAAAFgnzjUoOphk19jeleTxifrOqrqkqq7L4kOrnxm3qb1ZVbeOXzv7+MQxAAAAAKwDZ731rKp+P8l8kvdX1dEkv5HkkSQHqur+JK8luSdJuvuFqjqQ5MUkJ5M80N1vjz/1iSz+gtqlSZ4cLwAAAADWibMGRd3986f56PbTjH84ycNL1J9NcuOyZgcAAADAqjnXW88AAAAA2GAERQAAAAAkERQBAAAAMAiKAAAAAEgiKAIAAABgEBQBAAAAkERQBAAAAMAgKAIAAAAgiaAIAAAAgEFQBAAAAEASQREAAAAAg6AIAAAAgCSCIgAAAACGzWs9AdaPrXueOO1nRx65cxVnAgAAAKwFVxQBAAAAkERQBAAAAMAgKAIAAAAgiaAIAAAAgEFQBAAAAEASQREAAAAAg6AIAAAAgCSCIgAAAAAGQREAAAAASQRFAAAAAAyCIgAAAACSCIoAAAAAGARFAAAAACQRFAEAAAAwbF7rCXBx2LrnidN+duSRO1dxJgAAAMCF4ooiAAAAAJIIigAAAAAY3HrGhuD
WOAAAADh/rigCAAAAIImgCAAAAIDBrWdcVM50ixkAAABwflxRBAAAAEASVxStC66S+f+t9H8PD7oGAACA6QiKmGlCJAAAAPgXgiIuGCEMAAAAXFw8owgAAACAJGtwRVFV7UjyqSSbkny2ux9Z7Tmw9i7m5zK5UgoAAICNalWDoqralOR3k/zHJEeTfLOqDnb3i6s5D1bWxRz6AAAAAP9ita8ouiXJ4e7+bpJU1f4kdyURFLHunEsAdq6h2blcibTUuR686WTuO8scznSu083flVIAAACzYbWDomuSfG9i/2iSf7fKc1gTrrrhTFazP1YzADsX5xJkrSenm/9qhojnc75zmYeAEQAANo7q7tU7WdU9Se7o7v869n8xyS3d/cunjNudZPfY/ckkLy/jNO9P8jcrMF24GOl/Zpn+Z5bpf2adNcAs0/+cq3/V3R84tbjaVxQdTXLtxP6WJK+fOqi79ybZey4nqKpnu3v7uU0PLm76n1mm/5ll+p9ZZw0wy/Q/K+1dq3y+bybZVlXXVdV7kuxMcnCV5wAAAADAElb1iqLuPllVv5TkT5JsSvL57n5hNecAAAAAwNJW+9azdPcfJ/njC3iKc7plDTYI/c8s0//MMv3PrLMGmGX6nxW1qg+zBgAAAGD9Wu1nFAEAAACwTm2YoKiqdlTVy1V1uKr2rPV84EKoqs9X1fGq+vZE7cqqeqqqXhnvV0x89tBYEy9X1R1rM2s4f1V1bVX9eVW9VFUvVNUnR13/MxOq6seq6pmq+quxBn5z1K0BZkJVbaqqv6yqr4x9vc/MqKojVfV8VX2rqp4dNWuAC2ZDBEVVtSnJ7yb5T0muT/LzVXX92s4KLojHkuw4pbYnyaHu3pbk0NjPWAM7k9wwjnl0rBW4GJ1M8mB3/1SSW5M8MHpc/zMr3kryke7+YJIPJdlRVbfGGmB2fDLJSxP7ep9Zc1t3f6i7t499a4ALZkMERUluSXK4u7/b3f+YZH+Su9Z4TrDiuvtrSf7ulPJdSfaN7X1J7p6o7+/ut7r71SSHs7hW4KLT3ce6+y/G9ptZ/MfCNdH/zIhedGLsvnu8OtYAM6CqtiS5M8lnJ8p6n1lnDXDBbJSg6Jok35vYPzpqMAvmuvtYsviP6SRXjbp1wYZUVVuT/HSSb0T/M0PGrTffSnI8yVPdbQ0wK347ya8m+eeJmt5nlnSSP62q56pq96hZA1wwm9d6Aiuklqj5OTdmnXXBhlNVlyX5wyS/0t3/ULVUmy8OXaKm/7modffbST5UVe9L8uWquvEMw60BNoSq+liS4939XFXNT3PIEjW9z8Xuw939elVdleSpqvrOGcZaA5y3jXJF0dEk107sb0ny+hrNBVbbG1V1dZKM9+Ojbl2woVTVu7MYEn2pu/9olPU/M6e7f5BkIYvPnrAG2Og+nOTnqupIFh8v8ZGq+mL0PjOku18f78eTfDmLt5JZA1wwGyUo+maSbVV1XVW9J4sP7zq4xnOC1XIwya6xvSvJ4xP1nVV1SVVdl2RbkmfWYH5w3mrx0qHPJXmpu39r4iP9z0yoqg+MK4lSVZcm+WiS78QaYIPr7oe6e0t3b83i/8f/s+7+heh9ZkRVvbeqfvxH20l+Nsm3Yw1wAW2IW8+6+2RV/VKSP0myKcnnu/uFNZ4WrLiq+v0k80neX1VHk/xGkkeSHKiq+5O8luSeJOnuF6rqQJIXs/iLUQ+M2xbgYvThJL+Y5PnxjJYk+fXof2bH1Un2jV+ueVeSA939lap6OtYAs8n//jMr5rJ4u3Gy+O/33+vur1bVN2MNcIFUt9sVAQAAANg4t54BAAAAcJ4ERQAAAAAkERQBAAAAMAiKAAAAAEgiKAIAAABgEBQBAAAAkERQBAAAAMAgKAIAAAAgSfJ/AS9YWdOz7GB3AAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "tags": [], "needs_background": "light" } } ] }, { "cell_type": "markdown", "metadata": { "id": "Nn0cztnmajeI", "colab_type": "text" }, "source": [ "# **Data Cleaning**" ] }, { "cell_type": "markdown", "metadata": { "id": "10FFXYu53DNM", "colab_type": "text" }, "source": [ "## **Step 1: Removing Files w/ Incorrect Rescale Intercept**" ] }, { "cell_type": "markdown", "metadata": { "id": "tSztirfuaz-t", "colab_type": "text" }, "source": [ "The rescale intercept of our DICOM files (see metadata) should be -1024 for all files. However, as the histogram below shows, some fraction of files are corrupted: Their rescale intercept is much larger." ] }, { "cell_type": "code", "metadata": { "id": "pAgE8QJI3DNN", "colab_type": "code", "outputId": "e1c38fcb-0e95-47c4-cc2f-ed0b1d2f1a2a", "colab": {} }, "source": [ "train_metadata.RescaleIntercept.hist()\n", "plt.show()" ], "execution_count": 0, "outputs": [ { "output_type": "display_data", "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYoAAAD8CAYAAABpcuN4AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8li6FKAAAZx0lEQVR4nO3df5DU933f8ecrXIypXGSQzJZyTMEj7AaJ2CkXRMZtZ5VzATsZo8xI7XnU6pQwQ0Koa3fopBD/wVQaZkRjVQ2TShnGogLFDVASj5jIBJ9Rt53OYH7IkYORTLlYWFygovERReeOSE9+94/9XPiy3vvs3o+9O/Zej5md3X1/v5/Pft4nya/7/tizIgIzM7PR/MR0L8DMzGY2B4WZmWU5KMzMLMtBYWZmWQ4KMzPLclCYmVlWw6CQ9FFJrxYefyXpC5IWSuqTdDE9LyiM2SGpX9IFSesL9dWSzqVteyQp1edKOpTqpyQtK4zpTZ9xUVLv5LZvZmaNaCzfo5A0B/hz4H5gKzAYEU9K2g4siIh/K2kl8PvAGuDvAt8APhIR70k6DXwe+CbwNWBPRByT9OvAT0fEr0nqAX4pIv6ZpIXAWaALCOAVYHVEXJ+c9s3MrJGxnnrqBv4sIr4PbAT2p/p+4MH0eiNwMCJuRMQbQD+wRtJiYH5EnIxqOh2oGTMy1xGgOx1trAf6ImIwhUMfsGHMXZqZ2bh1jHH/HqpHCwCliLgKEBFXJS1K9SVUjxhGDKTa/0uva+sjYy6nuYYlvQ3cVazXGVPX3XffHcuWLRtbV0364Q9/yB133NGSuWea2dQrzK5+3Wv7mki/r7zyyl9ExIfqbWs6KCS9D/gMsKPRrnVqkamPd0xxbZuBzQClUokvfelLDZY4PkNDQ3zgAx9oydwzzWzqFWZXv+61fU2k3wceeOD7o20byxHFp4BvRcRb6f1bkhano4nFwLVUHwCWFsZ1AldSvbNOvThmQFIHcCcwmOrlmjG
V2oVFxF5gL0BXV1eUy+XaXSZFpVKhVXPPNLOpV5hd/brX9tWqfsdyjeKz3DztBHAUGLkLqRd4sVDvSXcyLQdWAKfTaap3JK1N1x8erRkzMtdDwMvpOsZxYJ2kBemuqnWpZmZmU6SpIwpJfwv4J8CvFspPAoclbQLeBB4GiIjzkg4DrwHDwNaIeC+N2QI8D8wDjqUHwHPAC5L6qR5J9KS5BiU9AZxJ+z0eEYPj6NPMzMapqaCIiP9L9eJysfYDqndB1dt/F7CrTv0scF+d+rukoKmzbR+wr5l1mpnZ5PM3s83MLMtBYWZmWQ4KMzPLclCYmVmWg8LMzLLG+ic82t6y7S+Num3bqmEey2yfiEtP/kJL5jUzmygfUZiZWZaDwszMshwUZmaW5aAwM7MsB4WZmWU5KMzMLMtBYWZmWQ4KMzPLclCYmVmWg8LMzLIcFGZmluWgMDOzLAeFmZllOSjMzCzLQWFmZlkOCjMzy2oqKCR9UNIRSd+V9Lqkn5O0UFKfpIvpeUFh/x2S+iVdkLS+UF8t6VzatkeSUn2upEOpfkrSssKY3vQZFyX1Tl7rZmbWjGaPKH4b+OOI+PvAx4DXge3AiYhYAZxI75G0EugB7gU2AM9ImpPmeRbYDKxIjw2pvgm4HhH3AE8Du9NcC4GdwP3AGmBnMZDMzKz1GgaFpPnAPwaeA4iIv46IvwQ2AvvTbvuBB9PrjcDBiLgREW8A/cAaSYuB+RFxMiICOFAzZmSuI0B3OtpYD/RFxGBEXAf6uBkuZmY2BZo5ovgw8H+A/yzpTyR9WdIdQCkirgKk50Vp/yXA5cL4gVRbkl7X1m8ZExHDwNvAXZm5zMxsinQ0uc8/AD4XEack/TbpNNMoVKcWmfp4x9z8QGkz1VNalEolKpVKZnl521YNj7qtNC+/fSImsuZWGBoamnFraqXZ1K97bV+t6reZoBgABiLiVHp/hGpQvCVpcURcTaeVrhX2X1oY3wlcSfXOOvXimAFJHcCdwGCql2vGVGoXGBF7gb0AXV1dUS6Xa3dp2mPbXxp127ZVwzx1rpkf2dhdeqTcknnHq1KpMJGf4+1mNvXrXttXq/pteOopIv43cFnSR1OpG3gNOAqM3IXUC7yYXh8FetKdTMupXrQ+nU5PvSNpbbr+8GjNmJG5HgJeTtcxjgPrJC1IF7HXpZqZmU2RZn89/hzwFUnvA74H/DLVkDksaRPwJvAwQEScl3SYapgMA1sj4r00zxbgeWAecCw9oHqh/AVJ/VSPJHrSXIOSngDOpP0ej4jBcfZqZmbj0FRQRMSrQFedTd2j7L8L2FWnfha4r079XVLQ1Nm2D9jXzDrNzGzy+ZvZZmaW5aAwM7MsB4WZmWU5KMzMLMtBYWZmWQ4KMzPLclCYmVmWg8LMzLIcFGZmluWgMDOzLAeFmZllOSjMzCzLQWFmZlkOCjMzy3JQmJlZloPCzMyyHBRmZpbloDAzsywHhZmZZTkozMwsy0FhZmZZDgozM8tqKigkXZJ0TtKrks6m2kJJfZIupucFhf13SOqXdEHS+kJ9dZqnX9IeSUr1uZIOpfopScsKY3rTZ1yU1DtZjZuZWXPGckTxQER8PCK60vvtwImIWAGcSO+RtBLoAe4FNgDPSJqTxjwLbAZWpMeGVN8EXI+Ie4Cngd1proXATuB+YA2wsxhIZmbWehM59bQR2J9e7wceLNQPRsSNiHgD6AfWSFoMzI+IkxERwIGaMSNzHQG609HGeqAvIgYj4jrQx81wMTOzKdBsUATwdUmvSNqcaqWIuAqQnhel+hLgcmHsQKotSa9r67eMiYhh4G3grsxcZmY2RTqa3O8TEXFF0iKgT9J3M/uqTi0y9fGOufmB1fDaDFAqlahUKpnl5W1bNTzqttK8/PaJmMiaW2FoaGjGramVZlO/7rV9tarfpoIiIq6k52uSvkr1esFbkhZHxNV0Wula2n0AWFoY3glcSfXOOvXimAFJHcCdwGCql2vGVOqsby+
wF6CrqyvK5XLtLk17bPtLo27btmqYp841m61jc+mRckvmHa9KpcJEfo63m9nUr3ttX63qt+GpJ0l3SPrbI6+BdcB3gKPAyF1IvcCL6fVRoCfdybSc6kXr0+n01DuS1qbrD4/WjBmZ6yHg5XQd4ziwTtKCdBF7XaqZmdkUaebX4xLw1XQnawfwXyLijyWdAQ5L2gS8CTwMEBHnJR0GXgOGga0R8V6aawvwPDAPOJYeAM8BL0jqp3ok0ZPmGpT0BHAm7fd4RAxOoF8zMxujhkEREd8DPlan/gOge5Qxu4Bddepngfvq1N8lBU2dbfuAfY3WaWZmreFvZpuZWZaDwszMshwUZmaW5aAwM7MsB4WZmWU5KMzMLMtBYWZmWQ4KMzPLclCYmVmWg8LMzLIcFGZmluWgMDOzLAeFmZllOSjMzCzLQWFmZlkOCjMzy3JQmJlZloPCzMyyHBRmZpbloDAzsywHhZmZZTkozMwsq+mgkDRH0p9I+qP0fqGkPkkX0/OCwr47JPVLuiBpfaG+WtK5tG2PJKX6XEmHUv2UpGWFMb3pMy5K6p2Mps3MrHljOaL4PPB64f124ERErABOpPdIWgn0APcCG4BnJM1JY54FNgMr0mNDqm8CrkfEPcDTwO4010JgJ3A/sAbYWQwkMzNrvaaCQlIn8AvAlwvljcD+9Ho/8GChfjAibkTEG0A/sEbSYmB+RJyMiAAO1IwZmesI0J2ONtYDfRExGBHXgT5uhouZmU2BZo8o/iPwG8CPCrVSRFwFSM+LUn0JcLmw30CqLUmva+u3jImIYeBt4K7MXGZmNkU6Gu0g6ReBaxHxiqRyE3OqTi0y9fGOKa5xM9VTWpRKJSqVShPLrG/bquFRt5Xm5bdPxETW3ApDQ0Mzbk2tNJv6da/tq1X9NgwK4BPAZyR9Gng/MF/S7wFvSVocEVfTaaVraf8BYGlhfCdwJdU769SLYwYkdQB3AoOpXq4ZU6ldYETsBfYCdHV1Rblcrt2laY9tf2nUbdtWDfPUuWZ+ZGN36ZFyS+Ydr0qlwkR+jreb2dSve21freq34amniNgREZ0RsYzqReqXI+KfA0eBkbuQeoEX0+ujQE+6k2k51YvWp9PpqXckrU3XHx6tGTMy10PpMwI4DqyTtCBdxF6XamZmNkUm8uvxk8BhSZuAN4GHASLivKTDwGvAMLA1It5LY7YAzwPzgGPpAfAc8IKkfqpHEj1prkFJTwBn0n6PR8TgBNZsZmZjNKagiIgK6dRPRPwA6B5lv13Arjr1s8B9dervkoKmzrZ9wL6xrNPMzCaPv5ltZmZZDgozM8tyUJiZWZaDwszMshwUZmaW5aAwM7MsB4WZmWU5KMzMLMtBYWZmWQ4KMzPLclCYmVmWg8LMzLIcFGZmluWgMDOzLAeFmZllOSjMzCzLQWFmZlkOCjMzy3JQmJlZloPCzMyyHBRmZpbloDAzs6yGQSHp/ZJOS/q2pPOS/l2qL5TUJ+liel5QGLNDUr+kC5LWF+qrJZ1L2/ZIUqrPlXQo1U9JWlYY05s+46Kk3sls3szMGmvmiOIG8PMR8THg48AGSWuB7cCJiFgBnEjvkbQS6AHuBTYAz0iak+Z6FtgMrEiPDam+CbgeEfcATwO701wLgZ3A/cAaYGcxkMzMrPUaBkVUDaW3P5keAWwE9qf6fuDB9HojcDAibkTEG0A/sEbSYmB+RJyMiAAO1IwZmesI0J2ONtYDfRExGBHXgT5uhouZmU2Bpq5RSJoj6VXgGtX/4T4FlCLiKkB6XpR2XwJcLgwfSLUl6XVt/ZYxETEMvA3clZnLzMymSEczO0XEe8DHJX0Q+Kqk+zK7q94Umfp4x9z8QGkz1VNalEolKpVKZnl521YNj7qtNC+/fSImsuZWGBoamnFraqXZ1K97bV+t6repoBgREX8pqUL19M9bkhZHxNV0Wula2m0AWFoY1glcSfXOOvXimAFJHcCdwGCql2vGVOqsay+wF6CrqyvK5XL
tLk17bPtLo27btmqYp86N6UfWtEuPlFsy73hVKhUm8nO83cymft1r+2pVv83c9fShdCSBpHnAJ4HvAkeBkbuQeoEX0+ujQE+6k2k51YvWp9PpqXckrU3XHx6tGTMy10PAy+k6xnFgnaQF6SL2ulQzM7Mp0syvx4uB/enOpZ8ADkfEH0k6CRyWtAl4E3gYICLOSzoMvAYMA1vTqSuALcDzwDzgWHoAPAe8IKmf6pFET5prUNITwJm03+MRMTiRhs3MbGwaBkVE/CnwM3XqPwC6RxmzC9hVp34W+LHrGxHxLilo6mzbB+xrtE4zM2sNfzPbzMyyHBRmZpbloDAzsywHhZmZZTkozMwsy0FhZmZZDgozM8tyUJiZWZaDwszMshwUZmaW5aAwM7MsB4WZmWU5KMzMLMtBYWZmWQ4KMzPLclCYmVmWg8LMzLIcFGZmluWgMDOzLAeFmZllOSjMzCzLQWFmZlkNg0LSUkn/TdLrks5L+nyqL5TUJ+liel5QGLNDUr+kC5LWF+qrJZ1L2/ZIUqrPlXQo1U9JWlYY05s+46Kk3sls3szMGmvmiGIY2BYRPwWsBbZKWglsB05ExArgRHpP2tYD3AtsAJ6RNCfN9SywGViRHhtSfRNwPSLuAZ4Gdqe5FgI7gfuBNcDOYiCZmVnrNQyKiLgaEd9Kr98BXgeWABuB/Wm3/cCD6fVG4GBE3IiIN4B+YI2kxcD8iDgZEQEcqBkzMtcRoDsdbawH+iJiMCKuA33cDBczM5sCY7pGkU4J/QxwCihFxFWohgmwKO22BLhcGDaQakvS69r6LWMiYhh4G7grM5eZmU2RjmZ3lPQB4A+AL0TEX6XLC3V3rVOLTH28Y4pr20z1lBalUolKpTLa2hratmp41G2lefntEzGRNbfC0NDQjFtTK82mft1r+2pVv00FhaSfpBoSX4mIP0zltyQtjoir6bTStVQfAJYWhncCV1K9s069OGZAUgdwJzCY6uWaMZXa9UXEXmAvQFdXV5TL5dpdmvbY9pdG3bZt1TBPnWs6W8fk0iPllsw7XpVKhYn8HG83s6lf99q+WtVvM3c9CXgOeD0i/kNh01Fg5C6kXuDFQr0n3cm0nOpF69Pp9NQ7ktamOR+tGTMy10PAy+k6xnFgnaQF6SL2ulQzM7Mp0syvx58A/gVwTtKrqfabwJPAYUmbgDeBhwEi4rykw8BrVO+Y2hoR76VxW4DngXnAsfSAahC9IKmf6pFET5prUNITwJm03+MRMTjOXs3MbBwaBkVE/E/qXysA6B5lzC5gV536WeC+OvV3SUFTZ9s+YF+jdZqZWWv4m9lmZpbloDAzsywHhZmZZTkozMwsy0FhZmZZDgozM8tyUJiZWZaDwszMshwUZmaW5aAwM7MsB4WZmWU5KMzMLMtBYWZmWQ4KMzPLclCYmVmWg8LMzLIcFGZmluWgMDOzLAeFmZllOSjMzCzLQWFmZlkOCjMzy2oYFJL2Sbom6TuF2kJJfZIupucFhW07JPVLuiBpfaG+WtK5tG2PJKX6XEmHUv2UpGWFMb3pMy5K6p2sps3MrHnNHFE8D2yoqW0HTkTECuBEeo+klUAPcG8a84ykOWnMs8BmYEV6jMy5CbgeEfcATwO701wLgZ3A/cAaYGcxkMzMbGo0DIqI+B/AYE15I7A/vd4PPFioH4yIGxHxBtAPrJG0GJgfEScjIoADNWNG5joCdKejjfVAX0QMRsR1oI8fDywzM2ux8V6jKEXEVYD0vCjVlwCXC/sNpNqS9Lq2fsuYiBgG3gbuysxlZmZTqGOS51OdWmTq4x1z64dKm6me1qJUKlGpVBoudDTbVg2Puq00L799Iiay5lYYGhqacWtqpdnUr3ttX63qd7xB8ZakxRFxNZ1WupbqA8DSwn6dwJVU76xTL44ZkNQB3En1VNcAUK4ZU6m3mIjYC+wF6OrqinK5XG+3pjy2/aVRt21bNcxT5yY7W6suPVJuybzjValUmMjP8XYzm/p
1r+2rVf2O99TTUWDkLqRe4MVCvSfdybSc6kXr0+n01DuS1qbrD4/WjBmZ6yHg5XQd4ziwTtKCdBF7XaqZmdkUavjrsaTfp/qb/d2SBqjeifQkcFjSJuBN4GGAiDgv6TDwGjAMbI2I99JUW6jeQTUPOJYeAM8BL0jqp3ok0ZPmGpT0BHAm7fd4RNReVDczsxZrGBQR8dlRNnWPsv8uYFed+lngvjr1d0lBU2fbPmBfozWamVnr+JvZZmaW5aAwM7MsB4WZmWU5KMzMLMtBYWZmWQ4KMzPLclCYmVmWg8LMzLIcFGZmluWgMDOzLAeFmZllOSjMzCzLQWFmZlkOCjMzy3JQmJlZloPCzMyyHBRmZpbloDAzsywHhZmZZTkozMwsy0FhZmZZDgozM8u6LYJC0gZJFyT1S9o+3esxM5tNZnxQSJoD/CfgU8BK4LOSVk7vqszMZo+O6V5AE9YA/RHxPQBJB4GNwGvTuiozs1Es2/7StHzu8xvuaMm8M/6IAlgCXC68H0g1MzObArfDEYXq1OKWHaTNwOb0dkjShVYs5F/B3cBftGJu7W7FrBPSsl5nqNnUr3ttUw/snlC/f2+0DbdDUAwASwvvO4ErxR0iYi+wt9ULkXQ2Irpa/TkzwWzqFWZXv+61fbWq39vh1NMZYIWk5ZLeB/QAR6d5TWZms8aMP6KIiGFJ/xI4DswB9kXE+WlelpnZrDHjgwIgIr4GfG2618EUnN6aQWZTrzC7+nWv7asl/SoiGu9lZmaz1u1wjcLMzKaRgyKR9LCk85J+JKmrZtuO9OdDLkhaX6ivlnQubdsjSak+V9KhVD8ladnUdjM2kj4u6ZuSXpV0VtKawrYx9X47kPS51M95Sf++UG+7XgEk/RtJIenuQq3tepX0W5K+K+lPJX1V0gcL29qu36KW/5mjiPCjevrtp4CPAhWgq1BfCXwbmAssB/4MmJO2nQZ+jup3PY4Bn0r1Xwd+N73uAQ5Nd38Nev96Ye2fBirj7X2mP4AHgG8Ac9P7Re3aa1r7Uqo3gnwfuLvNe10HdKTXu4Hd7dxvoe85qacPA+9Lva6czM/wEUUSEa9HRL0v6m0EDkbEjYh4A+gH1khaDMyPiJNR/ad1AHiwMGZ/en0E6J7hv6kEMD+9vpOb31MZT+8z3RbgyYi4ARAR11K9HXsFeBr4DW79kmpb9hoRX4+I4fT2m1S/cwVt2m/B3/yZo4j4a2DkzxxNGgdFY6P9CZEl6XVt/ZYx6V/ct4G7Wr7S8fsC8FuSLgNfAnak+nh6n+k+AvyjdErwv0v62VRvu14lfQb484j4ds2mtuu1jl+heoQA7d9vy//M0W1xe+xkkfQN4O/U2fTFiHhxtGF1apGp58ZMm1zvQDfwryPiDyT9U+A54JOMr/dp16DXDmABsBb4WeCwpA/Tnr3+JtXTMT82rE5txvcKzf03LOmLwDDwlZFhdfa/LfptUsv7mFVBERGfHMew0f6EyAA3D22L9eKYAUkdVE/nDI7jsydNrndJB4DPp7f/Ffhyej2e3qddg163AH+YTjWclvQjqn8PqK16lbSK6vn4b6eznp3At9KNCrdlr9D4v2FJvcAvAt3pnzHcxv02qeGfOZqw6b4QM9Me/PjF7Hu59ULY97h5IewM1d9MRy6EfTrVt3LrxezD091Xg55fB8rpdTfwynh7n+kP4NeAx9Prj1A9ZFc79lrT9yVuXsxuy16BDVT/7wc+VFNvy34L/XWknpZz82L2vZP6GdPd5Ex5AL9ENZlvAG8Bxwvbvkj1roILFO6KALqA76Rtv8PNLzC+n+pv5v1U76r48HT316D3fwi8kv4FOwWsHm/vM/2R/kP6vbT2bwE/36691vT9N0HRrr2m/94uA6+mx++2c781vX8a+F+pjy9O9vz+ZraZmWX5riczM8tyUJiZWZaDwszMshwUZmaW5aAwM7MsB4WZmWU5KMzMLMtBYWZmWf8fGulILBXr8WcAAAAASUVORK5CYII
=\n", "text/plain": [ "
" ] }, "metadata": { "tags": [], "needs_background": "light" } } ] }, { "cell_type": "code", "metadata": { "scrolled": true, "id": "km5BztPZ3DNR", "colab_type": "code", "outputId": "196e276a-5c36-43c1-f88f-9912cee61fd5", "colab": {} }, "source": [ "# We identify the corrupted files in the next three cells\n", "train_metadata.query(\"RescaleIntercept!=-1024\").groupby(\"PatientID\").count().sort_values(\"SOPInstanceUID\")" ], "execution_count": 0, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
[HTML table: per-patient row counts for images with RescaleIntercept != -1024, 407 rows × 41 columns]
" ], "text/plain": [ " SOPInstanceUID Modality StudyInstanceUID SeriesInstanceUID \\\n", "PatientID \n", "ID_b956c8dd 20 20 20 20 \n", "ID_11e103d4 20 20 20 20 \n", "ID_03ac0e28 22 22 22 22 \n", "ID_57a06f55 23 23 23 23 \n", "ID_cd2e1b47 23 23 23 23 \n", "... .. .. .. .. \n", "ID_a579ac67 42 42 42 42 \n", "ID_2b35cfb8 52 52 52 52 \n", "ID_00526c11 52 52 52 52 \n", "ID_aa91c454 56 56 56 56 \n", "ID_b4f7750e 64 64 64 64 \n", "\n", " StudyID ImagePositionPatient ImageOrientationPatient \\\n", "PatientID \n", "ID_b956c8dd 20 20 20 \n", "ID_11e103d4 20 20 20 \n", "ID_03ac0e28 22 22 22 \n", "ID_57a06f55 23 23 23 \n", "ID_cd2e1b47 23 23 23 \n", "... .. .. .. \n", "ID_a579ac67 42 42 42 \n", "ID_2b35cfb8 52 52 52 \n", "ID_00526c11 52 52 52 \n", "ID_aa91c454 56 56 56 \n", "ID_b4f7750e 64 64 64 \n", "\n", " SamplesPerPixel PhotometricInterpretation Rows Columns \\\n", "PatientID \n", "ID_b956c8dd 20 20 20 20 \n", "ID_11e103d4 20 20 20 20 \n", "ID_03ac0e28 22 22 22 22 \n", "ID_57a06f55 23 23 23 23 \n", "ID_cd2e1b47 23 23 23 23 \n", "... .. .. .. .. \n", "ID_a579ac67 42 42 42 42 \n", "ID_2b35cfb8 52 52 52 52 \n", "ID_00526c11 52 52 52 52 \n", "ID_aa91c454 56 56 56 56 \n", "ID_b4f7750e 64 64 64 64 \n", "\n", " PixelSpacing BitsAllocated BitsStored HighBit \\\n", "PatientID \n", "ID_b956c8dd 20 20 20 20 \n", "ID_11e103d4 20 20 20 20 \n", "ID_03ac0e28 22 22 22 22 \n", "ID_57a06f55 23 23 23 23 \n", "ID_cd2e1b47 23 23 23 23 \n", "... .. .. .. .. \n", "ID_a579ac67 42 42 42 42 \n", "ID_2b35cfb8 52 52 52 52 \n", "ID_00526c11 52 52 52 52 \n", "ID_aa91c454 56 56 56 56 \n", "ID_b4f7750e 64 64 64 64 \n", "\n", " PixelRepresentation WindowCenter WindowWidth RescaleIntercept \\\n", "PatientID \n", "ID_b956c8dd 20 20 20 20 \n", "ID_11e103d4 20 20 20 20 \n", "ID_03ac0e28 22 22 22 22 \n", "ID_57a06f55 23 23 23 23 \n", "ID_cd2e1b47 23 23 23 23 \n", "... .. .. .. .. 
\n", "ID_a579ac67 42 42 42 42 \n", "ID_2b35cfb8 52 52 52 52 \n", "ID_00526c11 52 52 52 52 \n", "ID_aa91c454 56 56 56 56 \n", "ID_b4f7750e 64 64 64 64 \n", "\n", " RescaleSlope fname MultiImagePositionPatient \\\n", "PatientID \n", "ID_b956c8dd 20 20 20 \n", "ID_11e103d4 20 20 20 \n", "ID_03ac0e28 22 22 22 \n", "ID_57a06f55 23 23 23 \n", "ID_cd2e1b47 23 23 23 \n", "... .. .. .. \n", "ID_a579ac67 42 42 42 \n", "ID_2b35cfb8 52 52 52 \n", "ID_00526c11 52 52 52 \n", "ID_aa91c454 56 56 56 \n", "ID_b4f7750e 64 64 64 \n", "\n", " ImagePositionPatient1 ImagePositionPatient2 \\\n", "PatientID \n", "ID_b956c8dd 20 20 \n", "ID_11e103d4 20 20 \n", "ID_03ac0e28 22 22 \n", "ID_57a06f55 23 23 \n", "ID_cd2e1b47 23 23 \n", "... .. .. \n", "ID_a579ac67 42 42 \n", "ID_2b35cfb8 52 52 \n", "ID_00526c11 52 52 \n", "ID_aa91c454 56 56 \n", "ID_b4f7750e 64 64 \n", "\n", " MultiImageOrientationPatient ImageOrientationPatient1 \\\n", "PatientID \n", "ID_b956c8dd 20 20 \n", "ID_11e103d4 20 20 \n", "ID_03ac0e28 22 22 \n", "ID_57a06f55 23 23 \n", "ID_cd2e1b47 23 23 \n", "... .. .. \n", "ID_a579ac67 42 42 \n", "ID_2b35cfb8 52 52 \n", "ID_00526c11 52 52 \n", "ID_aa91c454 56 56 \n", "ID_b4f7750e 64 64 \n", "\n", " ImageOrientationPatient2 ImageOrientationPatient3 \\\n", "PatientID \n", "ID_b956c8dd 20 20 \n", "ID_11e103d4 20 20 \n", "ID_03ac0e28 22 22 \n", "ID_57a06f55 23 23 \n", "ID_cd2e1b47 23 23 \n", "... .. .. \n", "ID_a579ac67 42 42 \n", "ID_2b35cfb8 52 52 \n", "ID_00526c11 52 52 \n", "ID_aa91c454 56 56 \n", "ID_b4f7750e 64 64 \n", "\n", " ImageOrientationPatient4 ImageOrientationPatient5 \\\n", "PatientID \n", "ID_b956c8dd 20 20 \n", "ID_11e103d4 20 20 \n", "ID_03ac0e28 22 22 \n", "ID_57a06f55 23 23 \n", "ID_cd2e1b47 23 23 \n", "... .. .. 
\n", "ID_a579ac67 42 42 \n", "ID_2b35cfb8 52 52 \n", "ID_00526c11 52 52 \n", "ID_aa91c454 56 56 \n", "ID_b4f7750e 64 64 \n", "\n", " MultiPixelSpacing PixelSpacing1 img_min img_max img_mean \\\n", "PatientID \n", "ID_b956c8dd 20 20 20 20 20 \n", "ID_11e103d4 20 20 20 20 20 \n", "ID_03ac0e28 22 22 22 22 22 \n", "ID_57a06f55 23 23 23 23 23 \n", "ID_cd2e1b47 23 23 23 23 23 \n", "... .. .. .. .. .. \n", "ID_a579ac67 42 42 42 42 42 \n", "ID_2b35cfb8 52 52 52 52 52 \n", "ID_00526c11 52 52 52 52 52 \n", "ID_aa91c454 56 56 56 56 56 \n", "ID_b4f7750e 64 64 64 64 64 \n", "\n", " img_std img_pct_window MultiWindowCenter WindowCenter1 \\\n", "PatientID \n", "ID_b956c8dd 20 20 20 20 \n", "ID_11e103d4 20 20 20 20 \n", "ID_03ac0e28 22 22 22 22 \n", "ID_57a06f55 23 23 23 23 \n", "ID_cd2e1b47 23 23 0 0 \n", "... .. .. .. .. \n", "ID_a579ac67 42 42 42 42 \n", "ID_2b35cfb8 52 52 52 52 \n", "ID_00526c11 52 52 52 52 \n", "ID_aa91c454 56 56 56 56 \n", "ID_b4f7750e 64 64 64 64 \n", "\n", " MultiWindowWidth WindowWidth1 \n", "PatientID \n", "ID_b956c8dd 20 20 \n", "ID_11e103d4 20 20 \n", "ID_03ac0e28 22 22 \n", "ID_57a06f55 23 23 \n", "ID_cd2e1b47 0 0 \n", "... .. .. 
\n", "ID_a579ac67 42 42 \n", "ID_2b35cfb8 52 52 \n", "ID_00526c11 52 52 \n", "ID_aa91c454 56 56 \n", "ID_b4f7750e 64 64 \n", "\n", "[407 rows x 41 columns]" ] }, "metadata": { "tags": [] }, "execution_count": 106 } ] }, { "cell_type": "code", "metadata": { "scrolled": true, "id": "A4FcWqQk3DNb", "colab_type": "code", "colab": {} }, "source": [ "png_IDs_to_remove = sorted(train_metadata.query(\"RescaleIntercept!=-1024\").SOPInstanceUID.tolist())" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "tT9LMoLl3DNf", "colab_type": "code", "outputId": "8a4deaf2-005f-4821-d992-98eaf00f6a18", "colab": {} }, "source": [ "png_IDs_to_remove[0:10]" ], "execution_count": 0, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['ID_0007ff5d1',\n", " 'ID_000aa2bce',\n", " 'ID_000bd8380',\n", " 'ID_0012b1611',\n", " 'ID_0015e926e',\n", " 'ID_001bdd8fb',\n", " 'ID_001d4ce1c',\n", " 'ID_0023e98ab',\n", " 'ID_0024b1888',\n", " 'ID_00382ea5e']" ] }, "metadata": { "tags": [] }, "execution_count": 108 } ] }, { "cell_type": "markdown", "metadata": { "id": "3O7qOYMr3DNv", "colab_type": "text" }, "source": [ "\n", "Now, we remove the corrupted files and update the metadata & labels to no longer contain references to said files." 
] }, { "cell_type": "code", "metadata": { "id": "P7KMYBqb3DNw", "colab_type": "code", "colab": {} }, "source": [ "# Remove the corrupted files from disk\n", "folder_path = \"./rsna_stage1_png_128/stage_1_train_images\"\n", "for ID in tqdm.tqdm(png_IDs_to_remove):\n", "    os.remove(os.path.join(folder_path, \"{}.png\".format(ID)))" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "PM0qyR8_3DOA", "colab_type": "code", "outputId": "e75bd07a-2354-4208-c353-284f6212cdae", "colab": {} }, "source": [ "# Verify that corrupted files have been removed\n", "os.path.isfile(\"./rsna_stage1_png_128/stage_1_train_images/ID_0007ff5d1.png\")" ], "execution_count": 0, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "False" ] }, "metadata": { "tags": [] }, "execution_count": 112 } ] }, { "cell_type": "code", "metadata": { "id": "s8Vp_bom3DOE", "colab_type": "code", "colab": {} }, "source": [ "# Update labels and metadata\n", "labels = labels[~labels[\"ID\"].isin(png_IDs_to_remove)]\n", "train_metadata = train_metadata[~train_metadata[\"SOPInstanceUID\"].isin(png_IDs_to_remove)]" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "hIWgnYiA3DOO", "colab_type": "text" }, "source": [ "### **Step 2: Remove Images w/ Low Brain Percentage**" ] }, { "cell_type": "markdown", "metadata": { "id": "WwumGFykeJcG", "colab_type": "text" }, "source": [ "Some images contain virtually no brain matter (that is, they are slices of the patient's skull either above or below their brain). Here, we identify and remove these files." 
] }, { "cell_type": "code", "metadata": { "id": "fcmgGDXQ3DOP", "colab_type": "code", "outputId": "d2f970c7-c46f-46a9-d2df-dffeec85b02f", "colab": {} }, "source": [ "# This histogram shows the percentage of brain matter present in all of our files \n", "train_metadata.img_pct_window.hist(bins=100)\n", "plt.show()" ], "execution_count": 0, "outputs": [ { "output_type": "display_data", "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYQAAAD4CAYAAADsKpHdAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8li6FKAAAXE0lEQVR4nO3df6zd9X3f8edrpqFOMhII5day2ew2VjLAiRZumdtO051Yh5dUMdVAc0SDaZmsMpqlm6PGrFIzabJEtGVp6AaVFTJMF4V4NB1WKV0Q2VU0lR9x0iQOEBq3MLjgxk2TEm7aUC5774/z8c3x9bF97j331zl+PqSj+/2+P9/PuZ+Pzzl+n8/n8/1+b6oKSZL+1ko3QJK0OpgQJEmACUGS1JgQJEmACUGS1Jyz0g1YqAsvvLA2btw4u/+9732P173udSvXoGVgH0eDfRwNw9rHL37xi9+qqh/pVTa0CWHjxo0cOnRodn9ycpKJiYmVa9AysI+jwT6OhmHtY5L/e6oyp4wkSYAJQZLUmBAkSYAJQZLUmBAkSYAJQZLUmBAkSYAJQZLUmBAkScAQX6msxbdxz/2z28/c+q4VbImklWBCOMt1J4FTxU0O0tnBKSNJEmBCkCQ1ThmdhU41TdTP8U4fSaPLEYIkCegjIST5RJJjSb7Wo+wDSSrJhV2xW5IcSfJUkqu64pcnOdzKbkuSFj83yadb/NEkGxena5Kk+ehnhHAXsG1uMMnFwM8Az3bFLgF2AJe2OrcnWdOK7wB2AZvb4/hz3gh8p6reDHwU+PBCOiJJGswZ1xCq6vOn+Nb+UeBXgfu6YtuBe6rqZeDpJEeAK5I8A5xXVQ8DJLkbuBp4oNX5963+vcB/SZKqqoV0SFqoU62tuG6is8WCFpWTvBt4vqq+0mZ+jlsPPNK1P9Vir7TtufHjdZ4DqKqZJC8CbwK+tZC2aWm5wCyNrnknhCSvBX4N+Ke9invE6jTx09Xp9bt30Zl2YmxsjMnJydmy6enpE/ZH0SB9PPz8i7Pbu7csTnuW4t97JV/H3VtmesYXuz2+V0fDKPZxISOEHwc2AcdHBxuALyW5gs43/4u7jt0AvNDiG3rE6aozleQc4A3At3v94qraB+wDGB8fr+4/cD2sf/B6Pgbp4w3zPNW0H89cN7Hoz7ncr+OJ00S9Pw6L3U/fq6NhFPs479NOq+pwVV1UVRuraiOd/9DfUVV/BhwEdrQzhzbRWTx+rKqOAi8l2drOLrqeH6w9HAR2tu1rgM+5fiBJy6+f004/BTwMvCXJVJIbT3VsVT0OHACeAP4AuLmqXm3FNwEfB44Af0JnQRngTuBNbQH63wJ7FtgXSdIA+jnL6D1nKN84Z38vsLfHcYeAy3rEvw9ce6Z2aP7me0WypLObVypLkgATgiSpMSFIkgDvdqqzkHd7lXpzhCBJAkwIkqTGKSMt2Nypl7NhOsXpI40yRwiSJMCEIElqnDLSolnN0yletS2dmQlhxPgfn6SFcspIkgSYECRJjQlBkgSYECRJjYvKWhKr+YwjSb05QpAkAY4QpAVzFKRR4whBkgQ4QtAI8
yI9aX7OOEJI8okkx5J8rSv2H5N8PclXk/xukjd2ld2S5EiSp5Jc1RW/PMnhVnZbkrT4uUk+3eKPJtm4uF3UStu45/7Zh6TVq58po7uAbXNiDwKXVdXbgD8GbgFIcgmwA7i01bk9yZpW5w5gF7C5PY4/543Ad6rqzcBHgQ8vtDOSpIU7Y0Koqs8D354T+2xVzbTdR4ANbXs7cE9VvVxVTwNHgCuSrAPOq6qHq6qAu4Gru+rsb9v3AlceHz1IkpbPYqwh/CLw6ba9nk6COG6qxV5p23Pjx+s8B1BVM0leBN4EfGvuL0qyi84og7GxMSYnJ2fLpqenT9gfRf30cfeWmdOWr7Tf/OR9s9tb1r/hpPLFfB2X899iPm32vToaRrGPAyWEJL8GzACfPB7qcVidJn66OicHq/YB+wDGx8drYmJitmxycpLu/VHUTx9vGKJ5+meumzgptpiv43L+W/Tqy6n4Xh0No9jHBSeEJDuBnwWubNNA0Pnmf3HXYRuAF1p8Q494d52pJOcAb2DOFJVOz8VaSYthQdchJNkGfBB4d1X9VVfRQWBHO3NoE53F48eq6ijwUpKtbX3geuC+rjo72/Y1wOe6EowkaZmccYSQ5FPABHBhkingQ3TOKjoXeLCt/z5SVb9UVY8nOQA8QWcq6eaqerU91U10zlhaCzzQHgB3Ar+d5AidkcGOxematHy8almj4IwJoare0yN852mO3wvs7RE/BFzWI/594NoztUOStLS8UllDzzUUaXF4LyNJEuAIQSvIeXdpdXGEIEkCHCFolTg+Wti9ZYaJlW2KdNZyhCBJAhwhaEh5ZpG0+BwhSJIAE4IkqTEhSJIA1xC0Cp1qfcBrFaSlZUIYUmfjouqw9HluO01kGhZOGUmSABOCJKkxIUiSABOCJKkxIUiSABOCJKkxIUiSgD4SQpJPJDmW5GtdsQuSPJjkG+3n+V1ltyQ5kuSpJFd1xS9PcriV3ZYkLX5ukk+3+KNJNi5uFyVJ/ehnhHAXsG1ObA/wUFVtBh5q+yS5BNgBXNrq3J5kTatzB7AL2Nwex5/zRuA7VfVm4KPAhxfaGUnSwp0xIVTV54FvzwlvB/a37f3A1V3xe6rq5ap6GjgCXJFkHXBeVT1cVQXcPafO8ee6F7jy+OhBkrR8FnrrirGqOgpQVUeTXNTi64FHuo6barFX2vbc+PE6z7XnmknyIvAm4Ftzf2mSXXRGGYyNjTE5OTlbNj09fcL+KOru4+4tMyvbmCUytnb0+jb3fXm2vVdH1Sj2cbHvZdTrm32dJn66OicHq/YB+wDGx8drYmJitmxycpLu/VHU3ccbhuS+PvO1e8sMHzk8WrfYeua6iRP2z7b36qgaxT4u9JP3zSTr2uhgHXCsxaeAi7uO2wC80OIbesS760wlOQd4AydPUUlDq/tmd97oTqvZQk87PQjsbNs7gfu64jvamUOb6CweP9aml15KsrWtD1w/p87x57oG+FxbZ5AkLaMzjhCSfAqYAC5MMgV8CLgVOJDkRuBZ4FqAqno8yQHgCWAGuLmqXm1PdROdM5bWAg+0B8CdwG8nOUJnZLBjUXomSZqXMyaEqnrPKYquPMXxe4G9PeKHgMt6xL9PSyiSpJXjlcqSJMCEIElqTAiSJMC/qTxUDj//4shefyBp5TlCkCQBJgRJUmNCkCQBJgRpWW3ccz+Hn3/xhNtZSKuFCUGSBJgQJEmNCUGSBJgQJEmNCUGSBJgQJEmNCUGSBJgQJEmNCUGSBJgQJEmNt7+WVkj37SueufVdK9gSqWOgEUKSf5Pk8SRfS/KpJD+c5IIkDyb5Rvt5ftfxtyQ5kuSpJFd1xS9PcriV3ZYkg7RLkjR/C04ISdYD/xoYr6rLgDXADmAP8FBVbQYeavskuaSVXwpsA25PsqY93R3ALmBze2xbaLtGzcY9988+JGkpDbqGcA6wNsk5wGuBF4DtwP5Wvh+4um1vB+6pqper6mngCHBFknXAeVX1cFUVcHdXH
UnSMlnwGkJVPZ/kPwHPAn8NfLaqPptkrKqOtmOOJrmoVVkPPNL1FFMt9krbnhs/SZJddEYSjI2NMTk5OVs2PT19wv6o2L1lZnZ7bO2J+6PobO3jqL13R/Xz2G0U+7jghNDWBrYDm4C/BP5Hkp8/XZUesTpN/ORg1T5gH8D4+HhNTEzMlk1OTtK9Pyq6/4by7i0zfOTwaJ8HcLb28ZnrJlamMUtkVD+P3Uaxj4NMGf0T4Omq+vOqegX4DPBTwDfbNBDt57F2/BRwcVf9DXSmmKba9ty4JGkZDZIQngW2JnltOyvoSuBJ4CCwsx2zE7ivbR8EdiQ5N8kmOovHj7XppZeSbG3Pc31XHUnSMhlkDeHRJPcCXwJmgD+iM53zeuBAkhvpJI1r2/GPJzkAPNGOv7mqXm1PdxNwF7AWeKA9JEnLaKDJ2qr6EPChOeGX6YwWeh2/F9jbI34IuGyQtkiSBuOtKyRJgAlBktSYECRJgDe3k1YFb3Sn1cARgiQJMCFIkhoTgiQJMCFIkhoTgiQJ8CyjVck/hiNpJThCkCQBJgRJUmNCkCQBJgRJUuOisrTKeBsLrRRHCJIkwIQgSWpMCJIkwIQgSWoGSghJ3pjk3iRfT/Jkkp9MckGSB5N8o/08v+v4W5IcSfJUkqu64pcnOdzKbkuSQdolSZq/QUcIHwP+oKreCrwdeBLYAzxUVZuBh9o+SS4BdgCXAtuA25Osac9zB7AL2Nwe2wZslyRpnhacEJKcB/wj4E6AqvqbqvpLYDuwvx22H7i6bW8H7qmql6vqaeAIcEWSdcB5VfVwVRVwd1cdSdIyGeQ6hB8D/hz4b0neDnwReD8wVlVHAarqaJKL2vHrgUe66k+12Ctte278JEl20RlJMDY2xuTk5GzZ9PT0CfvD5vDzL85u797S+5ixtbB7y8wytWhl2McTDet7etg/j/0YxT4OkhDOAd4BvK+qHk3yMdr00Cn0Wheo08RPDlbtA/YBjI+P18TExGzZ5OQk3fvD5oY+7nC6e8sMHzk82tcS2scTPXPdxNI2ZokM++exH6PYx0HWEKaAqap6tO3fSydBfLNNA9F+Hus6/uKu+huAF1p8Q4+4JGkZLTghVNWfAc8leUsLXQk8ARwEdrbYTuC+tn0Q2JHk3CSb6CweP9aml15KsrWdXXR9Vx1J0jIZdGz+PuCTSV4D/CnwC3SSzIEkNwLPAtcCVNXjSQ7QSRozwM1V9Wp7npuAu4C1wAPtIUlaRgMlhKr6MjDeo+jKUxy/F9jbI34IuGyQtkijyBvdaTl5pbIkCTAhSJIaE4IkCTAhSJIaE4IkCTAhSJIaE4IkCTAhSJKa0b6L2Cq3sY8b2knScnGEIEkCTAiSpMaEIEkCXEOQhoY3utNSc4QgSQJMCJKkxoQgSQJMCJKkxkXlZebFaJJWK0cIkiRgERJCkjVJ/ijJ77X9C5I8mOQb7ef5XcfekuRIkqeSXNUVvzzJ4VZ2W5IM2i5J0vwsxgjh/cCTXft7gIeqajPwUNsnySXADuBSYBtwe5I1rc4dwC5gc3tsW4R2SZLmYaCEkGQD8C7g413h7cD+tr0fuLorfk9VvVxVTwNHgCuSrAPOq6qHq6qAu7vqSJKWyaCLyr8B/Crwt7tiY1V1FKCqjia5qMXXA490HTfVYq+07bnxkyTZRWckwdjYGJOTk7Nl09PTJ+yvVru3zCy47tjaweoPA/vYn9X+Xh+Wz+MgRrGPC04ISX4WOFZVX0wy0U+VHrE6TfzkYNU+YB/A+Ph4TUz84NdOTk7Svb9a3TDAWUa7t8zwkcOjfWKYfezT4e/Nbq7G21gMy+dxEKPYx0HelT8NvDvJO4EfBs5L8t+BbyZZ10YH64Bj7fgp4OKu+huAF1p8Q4/4klnue8J4qqmkYbDgNYSquqWqNlTVRjqLxZ+rqp8HDgI722E7gfva9kFgR5Jzk2yis3j8WJteeinJ1nZ20fVddSRJy2Qpxua3AgeS3Ag8C1wLUFWPJ
zkAPAHMADdX1autzk3AXcBa4IH2kCQto0VJCFU1CUy27b8ArjzFcXuBvT3ih4DLFqMtkqSFGe3Vuz54j3lJ6vDWFZIkwBHCCQYdLXg2kaRhZkKQhpzTnlosJoRT8Nu+pLONawiSJMCEIElqTAiSJMCEIElqTAiSJMCEIElqTAiSJMDrEKSR4kVqGoQjBEkSYEKQJDUmBEkSYEKQJDUuKksjygVmzZcjBEkSMEBCSHJxkv+d5Mkkjyd5f4tfkOTBJN9oP8/vqnNLkiNJnkpyVVf88iSHW9ltSTJYtyRJ8zXICGEG2F1Vfw/YCtyc5BJgD/BQVW0GHmr7tLIdwKXANuD2JGvac90B7AI2t8e2AdolSVqABSeEqjpaVV9q2y8BTwLrge3A/nbYfuDqtr0duKeqXq6qp4EjwBVJ1gHnVdXDVVXA3V11JEnLZFHWEJJsBP4+8CgwVlVHoZM0gIvaYeuB57qqTbXY+rY9Ny5JWkYDn2WU5PXA7wC/UlXfPc30f6+COk281+/aRWdqibGxMSYnJ2fLpqenT9g/nd1bZvo6brUZWzu8be+XfVwa/X42Fst8Po/DahT7OFBCSPJDdJLBJ6vqMy38zSTrqupomw461uJTwMVd1TcAL7T4hh7xk1TVPmAfwPj4eE1MTMyWTU5O0r1/OjcM6d9L3r1lho8cHu0zhe3jEjn8vdnN5TgFdT6fx2E1in0c5CyjAHcCT1bVf+4qOgjsbNs7gfu64juSnJtkE53F48fatNJLSba257y+q44kaZkM8jXlp4H3AoeTfLnF/h1wK3AgyY3As8C1AFX1eJIDwBN0zlC6uapebfVuAu4C1gIPtIekJeAFazqVBSeEqvo/9J7/B7jyFHX2Ant7xA8Bly20LZKkwXmlsiQJMCFIkprRPp1D0mm5nqBujhAkSYAjBEmNowU5QpAkASYESVLjlJGkkzh9dHZyhCBJAhwhSDoDRwtnD0cIkiTAhCBJapwyktQ3p49GmyMESRLgCEHSAjlaGD2OECRJgCMESYvA0cJocIQgaVFt3HM/h59/8YQkoeHgCEHSknHkMFxMCJKWxdwRgwli9Vk1CSHJNuBjwBrg41V16wo3SdIScvSw+qyKhJBkDfBfgZ8BpoAvJDlYVU+sbMskLYd+1htMGktvVSQE4ArgSFX9KUCSe4DtgAlBEtBf0jgVk0l/UlUr3QaSXANsq6p/2fbfC/yDqvrlOcftAna13bcAT3UVXwh8axmau5Ls42iwj6NhWPv4d6vqR3oVrJYRQnrETspUVbUP2NfzCZJDVTW+2A1bTezjaLCPo2EU+7harkOYAi7u2t8AvLBCbZGks9JqSQhfADYn2ZTkNcAO4OAKt0mSziqrYsqoqmaS/DLwv+icdvqJqnp8nk/TcyppxNjH0WAfR8PI9XFVLCpLklbeapkykiStMBOCJAkYwoSQZFuSp5IcSbKnR3mS3NbKv5rkHSvRzkH00ce3Jnk4yctJPrASbRxUH328rr1+X03yh0nevhLtHEQffdze+vflJIeS/MOVaOcgztTHruN+Ismr7ZqjodLH6ziR5MX2On45ya+vRDsXRVUNzYPOgvOfAD8GvAb4CnDJnGPeCTxA59qGrcCjK93uJejjRcBPAHuBD6x0m5eojz8FnN+2/9mIvo6v5wfreG8Dvr7S7V7sPnYd9zng94FrVrrdS/A6TgC/t9JtXYzHsI0QZm9xUVV/Axy/xUW37cDd1fEI8MYk65a7oQM4Yx+r6lhVfQF4ZSUauAj66eMfVtV32u4jdK5NGSb99HG62v8owOvocTHmKtfP5xHgfcDvAMeWs3GLpN8+joRhSwjrgee69qdabL7HrGbD3v5+zLePN9IZ9Q2TvvqY5OeSfB24H/jFZWrbYjljH5OsB34O+K1lbNdi6ve9+pNJvpLkgSSXLk/TFt+wJYR+bnHR120wVrFhb38/+u5jkn9MJyF8cElbtPj6vR3L71bVW4Grgf+w5K1aXP308
TeAD1bVq8vQnqXQTx+/ROf+QG8HfhP4n0veqiUybAmhn1tcDPttMIa9/f3oq49J3gZ8HNheVX+xTG1bLPN6Havq88CPJ7lwqRu2iPrp4zhwT5JngGuA25NcvTzNWxRn7GNVfbeqptv27wM/NGSv46xhSwj93OLiIHB9O9toK/BiVR1d7oYO4Gy4jccZ+5jk7wCfAd5bVX+8Am0cVD99fHOStO130Fm0HKbEd8Y+VtWmqtpYVRuBe4F/VVXD9A26n9fxR7texyvo/L86TK/jrFVx64p+1SlucZHkl1r5b9E5k+GdwBHgr4BfWKn2LkQ/fUzyo8Ah4Dzg/yX5FTpnPnx3xRo+D32+jr8OvInON0qAmRqiO0v22cd/TufLyyvAXwP/omuRedXrs49Drc8+XgPclGSGzuu4Y5hex27eukKSBAzflJEkaYmYECRJgAlBktSYECRJgAlBktSYECRJgAlBktT8fwnxIIz8m5muAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "tags": [], "needs_background": "light" } } ] }, { "cell_type": "code", "metadata": { "id": "t0C89Ux_3DPZ", "colab_type": "code", "colab": {} }, "source": [ "# We choose to remove images containing less than 2% brain matter\n", "png_IDs_to_remove = train_metadata.query(\"img_pct_window<0.02\").SOPInstanceUID.tolist()" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "HjiMcCVX3DPl", "colab_type": "code", "colab": {} }, "source": [ "for ID in tqdm.tqdm(png_IDs_to_remove):\n", " os.remove(\"./rsna_stage1_png_128/stage_1_train_images/{}.png\".format(ID))" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "boyg1HUM3DPx", "colab_type": "code", "colab": {} }, "source": [ "# Once again, we update labels and metadata to reflect the changes we have undertaken\n", "labels = labels[~labels[\"ID\"].isin(png_IDs_to_remove)]\n", "train_metadata = train_metadata[~train_metadata[\"SOPInstanceUID\"].isin(png_IDs_to_remove)]" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "SoMn02rk3DP1", "colab_type": "text" }, "source": [ "## **Step 3: Create New CSVs**" ] }, { "cell_type": "code", "metadata": { "id": "MVFPLKrO3DP1", "colab_type": "code", "colab": {} }, "source": [ "# Here, we write out the cleaned CSV files for labels and metadata\n", "labels.to_csv(\"labels_cleaned.csv\")\n", "train_metadata.to_csv(\"train_metadata_cleaned.csv\")" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "0rpJfRme3DQD", "colab_type": "text" }, "source": [ "## **Step 4: Check that Label File & Metadata File Agree**" ] }, { "cell_type": "markdown", "metadata": { "id": "3uHZfip1mFpm", "colab_type": "text" }, "source": [ "We have made substantial udpates to both the label file and to the metadata file. It is now prudent to check that both files still agree with each other to ensure the absence of bugs in the above code. 
We also conduct a few other sanity checks." ] }, { "cell_type": "code", "metadata": { "id": "PNaF8UrE3DQE", "colab_type": "code", "outputId": "f29604dd-e13e-41d1-8819-d79e921c26d9", "colab": {} }, "source": [ "csv_labels = pd.read_csv(\"./rsna_stage1_png_128/stage_1_train.csv\")\n", "# Each image has six label rows; every sixth row (starting at index 5) is its \"_any\" label\n", "csv_labels = csv_labels.iloc[5::6, :].copy()\n", "csv_labels.head()" ], "execution_count": 0, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
<table>\n",
"  <thead>\n",
"    <tr><th></th><th>ID</th><th>Label</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>5</th><td>ID_63eb1e259_any</td><td>0</td></tr>\n",
"    <tr><th>11</th><td>ID_2669954a7_any</td><td>0</td></tr>\n",
"    <tr><th>17</th><td>ID_52c9913b1_any</td><td>0</td></tr>\n",
"    <tr><th>23</th><td>ID_4e6ff6126_any</td><td>0</td></tr>\n",
"    <tr><th>29</th><td>ID_7858edd88_any</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>
" ], "text/plain": [ " ID Label\n", "5 ID_63eb1e259_any 0 \n", "11 ID_2669954a7_any 0 \n", "17 ID_52c9913b1_any 0 \n", "23 ID_4e6ff6126_any 0 \n", "29 ID_7858edd88_any 0 " ] }, "metadata": { "tags": [] }, "execution_count": 73 } ] }, { "cell_type": "code", "metadata": { "id": "x6KgfO-Z3DQI", "colab_type": "code", "outputId": "8ddd8d0d-4f1d-46af-95ef-14e49f779c6b", "colab": {} }, "source": [ "# Get rid of the \"_any\" part of the IDs & verify that it worked\n", "csv_labels[\"ID\"] = csv_labels[\"ID\"].str.replace(\"_any\", \"\")\n", "csv_labels.head()" ], "execution_count": 0, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
<table>\n",
"  <thead>\n",
"    <tr><th></th><th>ID</th><th>Label</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>5</th><td>ID_63eb1e259</td><td>0</td></tr>\n",
"    <tr><th>11</th><td>ID_2669954a7</td><td>0</td></tr>\n",
"    <tr><th>17</th><td>ID_52c9913b1</td><td>0</td></tr>\n",
"    <tr><th>23</th><td>ID_4e6ff6126</td><td>0</td></tr>\n",
"    <tr><th>29</th><td>ID_7858edd88</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>
" ], "text/plain": [ " ID Label\n", "5 ID_63eb1e259 0 \n", "11 ID_2669954a7 0 \n", "17 ID_52c9913b1 0 \n", "23 ID_4e6ff6126 0 \n", "29 ID_7858edd88 0 " ] }, "metadata": { "tags": [] }, "execution_count": 74 } ] }, { "cell_type": "code", "metadata": { "id": "c_PQHFQv3DQU", "colab_type": "code", "colab": {} }, "source": [ "# Verify that \"csv_labels\" and \"labels\" contain the same number of elements (they do)\n", "csv_labels_set = set(csv_labels[\"ID\"].tolist())\n", "labels_set = set(labels[\"ID\"].str.replace(\".png\", \"\").tolist())\n", "\n", "print(len(csv_labels_set))\n", "print(len(labels_set))" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "rYF4FM4K3DQY", "colab_type": "code", "outputId": "454a4d95-f875-4872-ee10-918fa7d15b8e", "colab": {} }, "source": [ "# Verify that \"csv_labels\" and \"labels\" contain the same elements\n", "# (Indeed they do since their difference is the empty set)\n", "csv_labels_set.symmetric_difference(labels_set)" ], "execution_count": 0, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "set()" ] }, "metadata": { "tags": [] }, "execution_count": 76 } ] }, { "cell_type": "code", "metadata": { "id": "SDT9V2Fj3DQk", "colab_type": "code", "outputId": "b71a8c73-b206-4a5c-b279-2659bf1818ca", "colab": {} }, "source": [ "# Also verify that \"metadata_labels\" and \"labels\" contain the same elements\n", "metadata_labels_set = set(train_metadata[\"SOPInstanceUID\"].tolist())\n", "metadata_labels_set.symmetric_difference(labels_set)" ], "execution_count": 0, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "set()" ] }, "metadata": { "tags": [] }, "execution_count": 79 } ] }, { "cell_type": "code", "metadata": { "id": "GjiX_onx3DQn", "colab_type": "code", "outputId": "67fab19e-a2b7-41d7-e72f-e0c606112535", "colab": {} }, "source": [ "# Get rid of .png extension\n", "fnames = glob.glob(\"./rsna_stage1_png_128/stage_1_train_images/*\")\n", "fnames = 
[os.path.splitext(os.path.basename(fname))[0] for fname in fnames]\n", "print(fnames[0:10])\n", "print(len(fnames))" ], "execution_count": 0, "outputs": [ { "output_type": "stream", "text": [ "['ID_000039fa0', 'ID_00005679d', 'ID_00008ce3c', 'ID_0000950d7', 'ID_0000aee4b', 'ID_0000f1657', 'ID_000178e76', 'ID_00019828f', 'ID_0001dcc25', 'ID_0001de0e8']\n", "641386\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "M-8LSvt50L28", "colab_type": "text" }, "source": [ "We have cleaned up the data by deleting roughly 33,000 images, which shows up as a mismatch between the image files on disk and the metadata:" ] }, { "cell_type": "code", "metadata": { "id": "1ZoqfyXc3DQ1", "colab_type": "code", "outputId": "1a5a26fa-e566-48aa-f4bc-7573d8c22ba8", "colab": {} }, "source": [ "fnames_set = set(fnames)\n", "len(fnames_set.symmetric_difference(metadata_labels_set))" ], "execution_count": 0, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "32872" ] }, "metadata": { "tags": [] }, "execution_count": 94 } ] }, { "cell_type": "markdown", "metadata": { "id": "ai0rn23E3DQ5", "colab_type": "text" }, "source": [ "# **Reconstructing the Underlying CT Scans**" ] }, { "cell_type": "markdown", "metadata": { "id": "TdrbjOxp0-aL", "colab_type": "text" }, "source": [ "Now we can finally order the images in the sequence in which they were taken. That is, for each patient, we reconstruct the CT scan by arranging its images by increasing slice depth. This lets us make use of the **spatial relations** between a patient's images (specifically, we do this with a bidirectional LSTM)."
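] }, { "cell_type": "markdown", "metadata": { "colab_type": "text" }, "source": [ "The ordering step described above can be sketched on toy data before running it on the full metadata. This is a minimal illustration, not the notebook's actual pipeline: the rows, the IDs, and every `toy_` name are made up, and only the Python standard library is assumed.

```python
from collections import defaultdict

# Toy rows of (slice ID, patient ID, slice depth); all values are hypothetical
toy_rows = [
    ('ID_c', 'P1', 30.0),
    ('ID_a', 'P1', 10.0),
    ('ID_b', 'P2', 5.0),
    ('ID_d', 'P1', 20.0),
]

# Group (depth, slice ID) pairs by patient
toy_slices = defaultdict(list)
for slice_id, patient_id, depth in toy_rows:
    toy_slices[patient_id].append((depth, slice_id))

# Sort each patient's pairs by depth and keep only the slice IDs
toy_slices_sorted = {
    patient_id: [slice_id for _, slice_id in sorted(pairs)]
    for patient_id, pairs in toy_slices.items()
}

print(toy_slices_sorted)
```

Sorting `(depth, ID)` tuples orders the slices by depth and breaks ties between equal depths by ID, which is an arbitrary but deterministic choice."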
] }, { "cell_type": "code", "metadata": { "id": "c3062b_a3DRC", "colab_type": "code", "outputId": "c221707f-c0b0-49d6-fdf1-1974a6c43785", "colab": {} }, "source": [ "# This is the part of the metadata that we care about:\n", "# \"ImagePositionPatient2\" specifies the slice depth\n", "train_metadata_slice_patient_depth = train_metadata[[\"SOPInstanceUID\", \"PatientID\", \"ImagePositionPatient2\"]]\n", "train_metadata_slice_patient_depth.head()" ], "execution_count": 0, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
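<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>SOPInstanceUID</th>\n", "      <th>PatientID</th>\n", "      <th>ImagePositionPatient2</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr>\n", "      <th>0</th>\n", "      <td>ID_231d901c1</td>\n", "      <td>ID_b81a287f</td>\n", "      <td>104.307000</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>1</th>\n", "      <td>ID_994bc0470</td>\n", "      <td>ID_400facde</td>\n", "      <td>223.572015</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>2</th>\n", "      <td>ID_127689cce</td>\n", "      <td>ID_42910d3d</td>\n", "      <td>124.321068</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>3</th>\n", "      <td>ID_25457734a</td>\n", "      <td>ID_329aafa7</td>\n", "      <td>171.999939</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>5</th>\n", "      <td>ID_87e8b2528</td>\n", "      <td>ID_d6e578fb</td>\n", "      <td>156.828114</td>\n", "    </tr>\n",
"  </tbody>\n", "</table>\n", "</div>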
" ], "text/plain": [ " SOPInstanceUID PatientID ImagePositionPatient2\n", "0 ID_231d901c1 ID_b81a287f 104.307000 \n", "1 ID_994bc0470 ID_400facde 223.572015 \n", "2 ID_127689cce ID_42910d3d 124.321068 \n", "3 ID_25457734a ID_329aafa7 171.999939 \n", "5 ID_87e8b2528 ID_d6e578fb 156.828114 " ] }, "metadata": { "tags": [] }, "execution_count": 141 } ] }, { "cell_type": "code", "metadata": { "id": "pP3JfpDP3DRF", "colab_type": "code", "colab": {} }, "source": [ "# We create the \"patient_slices\" dictionary:\n", "# keys are patients, values are (slice depth and file names)\n", "patient_slices = dict()\n", "for i, row in tqdm.tqdm(train_metadata_slice_patient_depth.iterrows()):\n", " if row[\"PatientID\"] not in patient_slices:\n", " patient_slices[row[\"PatientID\"]] = [(row[\"ImagePositionPatient2\"], row[\"SOPInstanceUID\"])]\n", " else:\n", " patient_slices[row[\"PatientID\"]].append((row[\"ImagePositionPatient2\"], row[\"SOPInstanceUID\"]))" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "b4qQe0SF3DRI", "colab_type": "code", "colab": {} }, "source": [ "# We now sort the dictionary created above so as to arrange slices in the order that they belong in,\n", "# thereby reconstructing all patient brains\n", "patient_slices_sorted = dict()\n", "for i, (key, val) in enumerate(patient_slices.items()):\n", " val.sort()\n", " patient_slices_sorted[key] = [ID for depth, ID in val]" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "0Rq67G_v_wHZ", "colab_type": "text" }, "source": [ "Below is an example of one (1) patient's reconstructed CT scan. 
Note the increasing slice depth:" ] }, { "cell_type": "code", "metadata": { "id": "-WHINOCm3DRi", "colab_type": "code", "outputId": "15307222-89cf-43c8-ffa4-685749cf0109", "colab": {} }, "source": [ "train_metadata_slice_patient_depth.query(\"PatientID == 'ID_b81a287f'\").sort_values(\"ImagePositionPatient2\")" ], "execution_count": 0, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
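<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>SOPInstanceUID</th>\n", "      <th>PatientID</th>\n", "      <th>ImagePositionPatient2</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr>\n", "      <th>214258</th>\n", "      <td>ID_9f601fc5d</td>\n", "      <td>ID_b81a287f</td>\n", "      <td>7.956</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>427173</th>\n", "      <td>ID_19cb96474</td>\n", "      <td>ID_b81a287f</td>\n", "      <td>13.033</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>636486</th>\n", "      <td>ID_496ab2661</td>\n", "      <td>ID_b81a287f</td>\n", "      <td>18.110</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>436756</th>\n", "      <td>ID_06fe4adc5</td>\n", "      <td>ID_b81a287f</td>\n", "      <td>23.187</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>68576</th>\n", "      <td>ID_59bc3960f</td>\n", "      <td>ID_b81a287f</td>\n", "      <td>28.266</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>331217</th>\n", "      <td>ID_aef3564ac</td>\n", "      <td>ID_b81a287f</td>\n", "      <td>33.343</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>503169</th>\n", "      <td>ID_c1f0895bb</td>\n", "      <td>ID_b81a287f</td>\n", "      <td>38.420</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>467952</th>\n", "      <td>ID_c85081ef5</td>\n", "      <td>ID_b81a287f</td>\n", "      <td>43.497</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>53879</th>\n", "      <td>ID_079fef6c2</td>\n", "      <td>ID_b81a287f</td>\n", "      <td>48.456</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>245073</th>\n", "      <td>ID_5ab7d0d9a</td>\n", "      <td>ID_b81a287f</td>\n", "      <td>53.533</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>287067</th>\n", "      <td>ID_54d628968</td>\n", "      <td>ID_b81a287f</td>\n", "      <td>58.610</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>218156</th>\n", "      <td>ID_d3e4638f6</td>\n", "      <td>ID_b81a287f</td>\n", "      <td>63.687</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>611699</th>\n", "      <td>ID_21af3f314</td>\n", "      <td>ID_b81a287f</td>\n", "      <td>68.766</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>435682</th>\n", "      <td>ID_3eb115349</td>\n", "      <td>ID_b81a287f</td>\n", "      <td>73.843</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>10788</th>\n", "      <td>ID_c489f8a64</td>\n", "      <td>ID_b81a287f</td>\n", "      <td>78.920</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>653377</th>\n", "      <td>ID_55e73915f</td>\n", "      <td>ID_b81a287f</td>\n", "      <td>83.997</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>552646</th>\n", "      <td>ID_678a1a095</td>\n", "      <td>ID_b81a287f</td>\n", "      <td>89.076</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>492896</th>\n", "      <td>ID_508bd479e</td>\n", "      <td>ID_b81a287f</td>\n", "      <td>94.153</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>297718</th>\n", "      <td>ID_6609a6357</td>\n", "      <td>ID_b81a287f</td>\n", "      <td>99.230</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>0</th>\n", "      <td>ID_231d901c1</td>\n", "      <td>ID_b81a287f</td>\n", "      <td>104.307</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>214284</th>\n", "      <td>ID_826c83df4</td>\n", "      <td>ID_b81a287f</td>\n", "      <td>109.386</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>18784</th>\n", "      <td>ID_008507574</td>\n", "      <td>ID_b81a287f</td>\n", "      <td>114.463</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>562637</th>\n", "      <td>ID_f49234b83</td>\n", "      <td>ID_b81a287f</td>\n", "      <td>119.540</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>125258</th>\n", "      <td>ID_ec9041d07</td>\n", "      <td>ID_b81a287f</td>\n", "      <td>124.617</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>486312</th>\n", "      <td>ID_bcbee580c</td>\n", "      <td>ID_b81a287f</td>\n", "      <td>129.696</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>411439</th>\n", "      <td>ID_df700e73f</td>\n", "      <td>ID_b81a287f</td>\n", "      <td>134.773</td>\n", "    </tr>\n",
"  </tbody>\n", "</table>\n", "</div>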
" ], "text/plain": [ " SOPInstanceUID PatientID ImagePositionPatient2\n", "214258 ID_9f601fc5d ID_b81a287f 7.956 \n", "427173 ID_19cb96474 ID_b81a287f 13.033 \n", "636486 ID_496ab2661 ID_b81a287f 18.110 \n", "436756 ID_06fe4adc5 ID_b81a287f 23.187 \n", "68576 ID_59bc3960f ID_b81a287f 28.266 \n", "331217 ID_aef3564ac ID_b81a287f 33.343 \n", "503169 ID_c1f0895bb ID_b81a287f 38.420 \n", "467952 ID_c85081ef5 ID_b81a287f 43.497 \n", "53879 ID_079fef6c2 ID_b81a287f 48.456 \n", "245073 ID_5ab7d0d9a ID_b81a287f 53.533 \n", "287067 ID_54d628968 ID_b81a287f 58.610 \n", "218156 ID_d3e4638f6 ID_b81a287f 63.687 \n", "611699 ID_21af3f314 ID_b81a287f 68.766 \n", "435682 ID_3eb115349 ID_b81a287f 73.843 \n", "10788 ID_c489f8a64 ID_b81a287f 78.920 \n", "653377 ID_55e73915f ID_b81a287f 83.997 \n", "552646 ID_678a1a095 ID_b81a287f 89.076 \n", "492896 ID_508bd479e ID_b81a287f 94.153 \n", "297718 ID_6609a6357 ID_b81a287f 99.230 \n", "0 ID_231d901c1 ID_b81a287f 104.307 \n", "214284 ID_826c83df4 ID_b81a287f 109.386 \n", "18784 ID_008507574 ID_b81a287f 114.463 \n", "562637 ID_f49234b83 ID_b81a287f 119.540 \n", "125258 ID_ec9041d07 ID_b81a287f 124.617 \n", "486312 ID_bcbee580c ID_b81a287f 129.696 \n", "411439 ID_df700e73f ID_b81a287f 134.773 " ] }, "metadata": { "tags": [] }, "execution_count": 162 } ] }, { "cell_type": "code", "metadata": { "id": "-_kOiXjQ3DRt", "colab_type": "code", "colab": {} }, "source": [ "# Save the ordered dictionary to a file\n", "with open(\"ordered_slices_by_patient.pkl\", \"wb\") as f:\n", " pickle.dump(patient_slices_sorted, f)" ], "execution_count": 0, "outputs": [] } ] }