{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.metrics import confusion_matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Read the set of labels into a dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "labels = pd.read_csv('labels.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "labels.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Start with assessing the radiologist's performance:\n",
    "* Assess the _accuracy_ of the radiologist by just looking at the percent of cases that they correctly labeled\n",
    "* Next, look at the true positive and true negative rates of the radiologist by generating a _confusion matrix_ "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "radiologist_accuracy = sum(labels.perfect_labeler == labels.radiologist)/len(labels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "radiologist_accuracy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "confusion_matrix(labels.perfect_labeler.values,labels.radiologist.values,labels=[\"cancer\",\"benign\"])"
   ]
  },
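  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A possible way to read the true positive and true negative rates off the confusion matrix above (a sketch; the variable names and index positions assume the `labels=[\"cancer\",\"benign\"]` ordering used in the cell above):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# With labels=[\"cancer\",\"benign\"], row/column 0 is cancer and row/column 1 is benign,\n",
    "# so cm[0,0] is TP, cm[0,1] FN, cm[1,0] FP, and cm[1,1] TN.\n",
    "cm = confusion_matrix(labels.perfect_labeler.values,labels.radiologist.values,labels=[\"cancer\",\"benign\"])\n",
    "tp, fn, fp, tn = cm[0,0], cm[0,1], cm[1,0], cm[1,1]\n",
    "radiologist_tpr = tp / (tp + fn)  # sensitivity: fraction of true cancers labeled cancer\n",
    "radiologist_tnr = tn / (tn + fp)  # specificity: fraction of true benigns labeled benign\n",
    "radiologist_tpr, radiologist_tnr"
   ]
  },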
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Now look at the algorithm's performance compared to the perfect labeler:\n",
    "* Since the algorithm doesn't create a binary label, it instead returns a _probability_ of cancer, choose a probability cut-off to use for the algorithm's labeling of cancer vs. bening. _(Hint: 0.5 is a reasonable starting place)_\n",
    "* Start with assessing _accuracy_ again here\n",
    "* Generate a confusion matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Here, I'm going to change my entire dataframe to 0's and 1's to make processing easier\n",
    "labels = labels.replace('cancer',1).replace('benign',0)\n",
    "labels.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "algorithm_thresh = (labels.algorithm > 0.5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "confusion_matrix(labels.perfect_labeler.values,algorithm_thresh,labels=[1,0])"
   ]
  },
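  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The prompt above also asks for _accuracy_; a minimal sketch, mirroring the radiologist calculation (the name `algorithm_accuracy` is just illustrative):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# algorithm_thresh is a boolean Series, which compares as 1/0 against the relabeled column\n",
    "algorithm_accuracy = sum(labels.perfect_labeler == algorithm_thresh)/len(labels)\n",
    "algorithm_accuracy"
   ]
  },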
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What happens now if you change the threshold cut-off for your algorithm's classification to 0.4? What if you raise it to 0.6? How do accuracy, fp, fn, tp, and tn change?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "algorithm_thresh = (labels.algorithm > 0.4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "confusion_matrix(labels.perfect_labeler.values,algorithm_thresh,labels=[1,0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "algorithm_thresh = (labels.algorithm > 0.6)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "confusion_matrix(labels.perfect_labeler.values,algorithm_thresh,labels=[1,0])"
   ]
  },
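  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to compare the three cut-offs side by side is to recompute accuracy and the confusion-matrix counts at each threshold (this helper loop is illustrative and not part of the original exercise):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sweep the three thresholds against the perfect labeler and print accuracy plus tp/fp/fn/tn\n",
    "for thresh in [0.4, 0.5, 0.6]:\n",
    "    preds = (labels.algorithm > thresh).astype(int)\n",
    "    acc = sum(labels.perfect_labeler == preds)/len(labels)\n",
    "    tn, fp, fn, tp = confusion_matrix(labels.perfect_labeler.values, preds).ravel()\n",
    "    print(f'threshold {thresh}: accuracy {acc:.3f}, tp {tp}, fp {fp}, fn {fn}, tn {tn}')"
   ]
  },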
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Finally, let's compare our algorithm to the radiologist\n",
    "* A \"perfect labeler\" might not exist in the real world, and in fact, if often does not\n",
    "* In AI for medical imaging, using a radiologist's labels as our \"true\" label is often the standard of practice, and algorithm performance is judged in both an academic setting as well as in the regulated industry landscape based on performance against an expert human\n",
    "\n",
    "* Repeat the steps above using a set threshold for your algorithm (again, 0.5 is perfectly reasonable) but now computing accuracy, tp, tn, fp, fn against the radiologist. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "algorithm_thresh = (labels.algorithm > 0.5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "confusion_matrix(labels.radiologist.values,algorithm_thresh,labels=[1,0])"
   ]
  },
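  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A sketch of the remaining numbers asked for above: accuracy and the individual counts for the algorithm (at the 0.5 cut-off) measured against the radiologist's labels (the variable names here are illustrative):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Same calculations as before, but with the radiologist standing in as ground truth\n",
    "accuracy_vs_radiologist = sum(labels.radiologist == algorithm_thresh)/len(labels)\n",
    "tn, fp, fn, tp = confusion_matrix(labels.radiologist.values, algorithm_thresh.astype(int)).ravel()\n",
    "accuracy_vs_radiologist, tp, fp, fn, tn"
   ]
  },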
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reflection: \n",
    "* In the above exercise you assess performances of a human as well as of an algorithm against a 'perfect labeler' and also against each other. \n",
    "* Does accuracy seem like the appropriate statistic to use when evaluating these labels? Why or why not? \n",
    "* In what clinical settings does it seem more or less acceptable to have a high level of FNs? FPs? \n",
    "* How did changing the threshold on the algorithm performance change the different performance statistics? \n",
    "* How did your opinion of the algorithm's performance change when you started comparing it to a radiologist instead of the perfect labeler? What does this mean for a real-world scenario when a perfect labeler doesn't exist, and we only have a radiologist's read to base our performance on? "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Given that there are so many fewer cancer cases than benign cases, accuracy would _not_ be a good statistic to use. A human or an algorithm could label _all_ of the cancer cases as benign and still achieve an accuracy of over 80%. \n",
    "\n",
    "Higher levels of false negatives aren't great ever in clinical settings, but they have less of an impact on the patient if we _know_ that there will be a second reader (i.e. our algorithm reads an image first, and then the label is confirmed by a radiologist). This also would only be appropriate if there wasn't a time-sensitive aspect to making the diagnosis. It seems more acceptable to have a high level of false positives in a situation as we saw in the previous exercise, where our algorithm was being used to prioritize emergency reads, in which case we would want to be somewhat liberal. \n",
    "\n",
    "Changing the threshold on the algorithm performance changed the FP and FN rates, one at the expense of the other. \n",
    "\n",
    "The algorithm's true positive rate increased when using the same threshold and comparing it to the human instead of the perfect labeler. This means that without the _true_ ground truth of diagnoses in an image, we may never be able to 100% accurately assess our algorithm, and its results may be inflated based on the quality of the radiologist labels that we use in comparing our outputs. "
   ]
  },
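  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check of the class-imbalance point above, we can compute the accuracy a labeler would get by calling _every_ case benign (illustrative; the exact value depends on the contents of `labels.csv`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# An 'all benign' labeler is right exactly on the benign cases, so its accuracy\n",
    "# equals the fraction of benign (0) cases in the dataset.\n",
    "all_benign_accuracy = sum(labels.perfect_labeler == 0)/len(labels)\n",
    "all_benign_accuracy"
   ]
  },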
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}