a b/tutorials/2_Labeling.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "43f4d50c-4e7b-4652-9701-be9366ff70c4",
   "metadata": {},
   "source": [
    "# Labeling\n",
    "\n",
    "A core component of FEMR is labeling patients.\n",
    "\n",
    "Labels within FEMR follow the [label schema within MEDS](https://github.com/Medical-Event-Data-Standard/meds/blob/e93f63a2f9642123c49a31ecffcdb84d877dc54a/src/meds/__init__.py#L70).\n",
    "\n",
    "Per MEDS, each label consists of three attributes:\n",
    "\n",
    "* `patient_id` (int64): The identifier for the patient to predict on\n",
    "* `prediction_time` (datetime.datetime): The time at which the prediction is made. It determines which features may be used for the prediction.\n",
    "* `boolean_value` (bool): The target to predict\n",
    "\n",
    "Additional types of labels will be added to MEDS over time, and then supported here."
   ]
  },
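  {
   "cell_type": "markdown",
   "id": "sketch-prediction-time-filter",
   "metadata": {},
   "source": [
    "The role of `prediction_time` described above can be illustrated with a minimal sketch in plain Python. It is not part of FEMR: the helper `events_available_at`, the toy patient, and the `Visit/IP` code are hypothetical. The sketch also assumes that each event carries a `time` field and that data recorded at or before the prediction time counts as available.\n",
    "\n",
    "```python\n",
    "import datetime\n",
    "from typing import Any, Dict, List\n",
    "\n",
    "\n",
    "def events_available_at(patient: Dict[str, Any], prediction_time: datetime.datetime) -> List[Dict[str, Any]]:\n",
    "    # Keep only events recorded at or before the prediction time (assumed convention)\n",
    "    return [event for event in patient['events'] if event['time'] <= prediction_time]\n",
    "\n",
    "\n",
    "# Hypothetical toy patient, not taken from the tutorial dataset\n",
    "toy_patient = {\n",
    "    'patient_id': 100,\n",
    "    'events': [\n",
    "        {'time': datetime.datetime(1990, 1, 1), 'measurements': [{'code': 'Gender/M'}]},\n",
    "        {'time': datetime.datetime(2000, 6, 1), 'measurements': [{'code': 'Visit/IP'}]},\n",
    "    ],\n",
    "}\n",
    "\n",
    "label = {'patient_id': 100, 'prediction_time': datetime.datetime(1994, 3, 2), 'boolean_value': False}\n",
    "\n",
    "# Only events up to the prediction time may feed into features for this label\n",
    "usable = events_available_at(toy_patient, label['prediction_time'])\n",
    "print(len(usable))\n",
    "```\n",
    "\n",
    "With this toy data, only the 1990 event is kept, because the 2000 event happens after the prediction time of the label."
   ]
  },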
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "c6ac5c41-bc99-4731-ad82-7152274c67e1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import shutil\n",
    "import os\n",
    "\n",
    "TARGET_DIR = 'trash/tutorial_2'\n",
    "\n",
    "# Start from a clean output directory\n",
    "if os.path.exists(TARGET_DIR):\n",
    "    shutil.rmtree(TARGET_DIR)\n",
    "\n",
    "# makedirs also creates the parent 'trash' directory if it is missing\n",
    "os.makedirs(TARGET_DIR)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7e98dd85",
   "metadata": {},
   "source": [
    "# Demonstration of some example labels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "8d9e2ccd-71c2-4ae0-897b-7ec022f9fdf4",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/esteinberg/miniconda3/envs/debug_document_femr/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n"
     ]
    }
   ],
   "source": [
    "# We can construct these labels manually\n",
    "\n",
    "import femr.labelers\n",
    "import datetime\n",
    "import meds\n",
    "\n",
    "# Predict False on March 2nd, 1994\n",
    "example_label = {'patient_id': 100, 'prediction_time': datetime.datetime(1994, 3, 2), 'boolean_value': False}\n",
    "\n",
    "# Predict True on March 2nd, 2009\n",
    "example_label2 = {'patient_id': 100, 'prediction_time': datetime.datetime(2009, 3, 2), 'boolean_value': True}\n",
    "\n",
    "\n",
    "# Multiple labels are stored using a list\n",
    "labels = [example_label, example_label2]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e77b1bfc-8d2d-4f79-b855-f90b3a73736e",
   "metadata": {},
   "source": [
    "# Generating labels programmatically within FEMR\n",
    "\n",
    "One core feature of FEMR is the ability to generate labels algorithmically with a labeling function class.\n",
    "\n",
    "The core of FEMR's labeling code is the abstract base class [Labeler](https://github.com/som-shahlab/femr/blob/main/src/femr/labelers/core.py#L40).\n",
    "\n",
    "Labeler has one abstract method:\n",
    "\n",
    "```python\n",
    "def label(self, patient: meds.Patient) -> List[meds.Label]:\n",
    "    \"\"\"Generate a list of labels for a patient.\"\"\"\n",
    "```\n",
    "\n",
    "Note that the patient is assumed to follow the [MEDS Patient schema](https://github.com/Medical-Event-Data-Standard/meds/blob/e93f63a2f9642123c49a31ecffcdb84d877dc54a/src/meds/__init__.py#L18).\n",
    "\n",
    "Once this method is implemented, the `apply` method becomes available for generating labels for every patient in a dataset."
   ]
  },
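  {
   "cell_type": "markdown",
   "id": "sketch-labeler-apply",
   "metadata": {},
   "source": [
    "To make the relationship between `label` and `apply` concrete, here is a simplified, self-contained sketch of the pattern. It is not FEMR's implementation: `SketchLabeler`, `HasLaterEventLabeler`, and the toy patients are hypothetical stand-ins, and the sketch assumes that `apply` amounts to calling `label` once per patient and collecting the results.\n",
    "\n",
    "```python\n",
    "import abc\n",
    "import datetime\n",
    "from typing import Any, Dict, Iterable, List\n",
    "\n",
    "\n",
    "class SketchLabeler(abc.ABC):\n",
    "    # Simplified stand-in for femr.labelers.Labeler (illustration only)\n",
    "\n",
    "    @abc.abstractmethod\n",
    "    def label(self, patient: Dict[str, Any]) -> List[Dict[str, Any]]:\n",
    "        # Subclasses return zero or more labels for one patient\n",
    "        ...\n",
    "\n",
    "    def apply(self, patients: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:\n",
    "        # Call label() once per patient and flatten the results into one list\n",
    "        return [label for patient in patients for label in self.label(patient)]\n",
    "\n",
    "\n",
    "class HasLaterEventLabeler(SketchLabeler):\n",
    "    # Hypothetical task: at the first event, predict whether any later event exists\n",
    "\n",
    "    def label(self, patient: Dict[str, Any]) -> List[Dict[str, Any]]:\n",
    "        times = sorted(event['time'] for event in patient['events'])\n",
    "        if not times:\n",
    "            return []  # Nothing to predict on for a patient without events\n",
    "        return [{\n",
    "            'patient_id': patient['patient_id'],\n",
    "            'prediction_time': times[0],\n",
    "            'boolean_value': times[-1] > times[0],\n",
    "        }]\n",
    "\n",
    "\n",
    "# Hypothetical toy patients, not taken from the tutorial dataset\n",
    "toy_patients = [\n",
    "    {'patient_id': 1, 'events': [{'time': datetime.datetime(1990, 1, 1)}, {'time': datetime.datetime(1991, 1, 1)}]},\n",
    "    {'patient_id': 2, 'events': [{'time': datetime.datetime(1992, 5, 5)}]},\n",
    "    {'patient_id': 3, 'events': []},\n",
    "]\n",
    "\n",
    "sketch_labels = HasLaterEventLabeler().apply(toy_patients)\n",
    "for sketch_label in sketch_labels:\n",
    "    print(sketch_label)\n",
    "```\n",
    "\n",
    "Because `label` returns a list, a labeler can emit zero, one, or many labels per patient. In this sketch the patient without events contributes no labels, so two labels are printed for three patients."
   ]
  },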
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "9ac22dbe-ef34-468a-8ab3-673e58e5a920",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 3040.98 examples/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'patient_id': 100, 'prediction_time': datetime.datetime(1992, 7, 15, 0, 0), 'boolean_value': False}\n",
      "{'patient_id': 101, 'prediction_time': datetime.datetime(1992, 8, 20, 0, 0), 'boolean_value': False}\n",
      "{'patient_id': 102, 'prediction_time': datetime.datetime(1991, 4, 13, 0, 0), 'boolean_value': True}\n",
      "{'patient_id': 103, 'prediction_time': datetime.datetime(1990, 10, 19, 0, 0), 'boolean_value': False}\n",
      "{'patient_id': 104, 'prediction_time': datetime.datetime(1990, 6, 15, 0, 0), 'boolean_value': True}\n",
      "{'patient_id': 105, 'prediction_time': datetime.datetime(1990, 6, 29, 0, 0), 'boolean_value': True}\n",
      "{'patient_id': 106, 'prediction_time': datetime.datetime(1992, 5, 25, 0, 0), 'boolean_value': True}\n",
      "{'patient_id': 107, 'prediction_time': datetime.datetime(1992, 5, 29, 0, 0), 'boolean_value': False}\n",
      "{'patient_id': 108, 'prediction_time': datetime.datetime(1991, 10, 20, 0, 0), 'boolean_value': True}\n",
      "{'patient_id': 109, 'prediction_time': datetime.datetime(1991, 6, 25, 0, 0), 'boolean_value': True}\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "from typing import List\n",
    "import femr.pat_utils\n",
    "import datasets\n",
    "\n",
    "class IsMaleLabeler(femr.labelers.Labeler):\n",
    "    # Dummy labeler to predict gender at birth\n",
    "\n",
    "    def label(self, patient: meds.Patient) -> List[meds.Label]:\n",
    "        # A patient is labeled male if any measurement carries the 'Gender/M' code\n",
    "        is_male = any(\n",
    "            measurement['code'] == 'Gender/M'\n",
    "            for event in patient['events']\n",
    "            for measurement in event['measurements']\n",
    "        )\n",
    "        return [{\n",
    "            'patient_id': patient['patient_id'],\n",
    "            'prediction_time': femr.pat_utils.get_patient_birthdate(patient),\n",
    "            'boolean_value': is_male,\n",
    "        }]\n",
    "\n",
    "dataset = datasets.Dataset.from_parquet(\"input/meds/data/*\")\n",
    "\n",
    "labeler = IsMaleLabeler()\n",
    "labeled_patients = labeler.apply(dataset)\n",
    "\n",
    "# Print ten of the generated labels\n",
    "for i in range(10):\n",
    "    print(labeled_patients[100 + i])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "20bd7859",
   "metadata": {},
   "outputs": [],
   "source": [
    "# We can use pyarrow to save these labels to a CSV file\n",
    "import pyarrow\n",
    "import pyarrow.csv\n",
    "\n",
    "table = pyarrow.Table.from_pylist(labeled_patients, schema=meds.label)\n",
    "pyarrow.csv.write_csv(table, os.path.join(TARGET_DIR, 'labels.csv'))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}