{ "cells": [ { "cell_type": "code", "execution_count": 17, "id": "3f87254e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Last modified by Xiaoqing: 211210\n" ] } ], "source": [ "from datetime import datetime\n", "date = datetime.today().strftime('%y%m%d')\n", "print('Last modified by Xiaoqing: ' + date)" ] }, { "cell_type": "code", "execution_count": 18, "id": "6114c11a", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "id": "d3e710b5", "metadata": {}, "source": [ "# Problem statement\n", "Words such as 'pregnant,' 'breast-feeding,' and 'lactating' frequently appear in clinical trial eligibility criteria. Unfortunately, stanza NER does not recognize them (stanza can, however, recognize 'pregnancy' as a problem).\n", "\n", "Let's fix this." ] }, { "cell_type": "markdown", "id": "86e4c32b", "metadata": {}, "source": [ "# Note:\n", "For this notebook to work, all key words must be extracted from a clinical trial.\n", "\n", "One clinical trial can have multiple rows; each row corresponds to a different key word.\n", "\n", "Under criteria, we should have ALL the bullet points, without separating them into different rows." ] }, { "cell_type": "code", "execution_count": 19, "id": "49cd1778", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('input_pregnant_121021.csv')\n", "df['criteria'] = df['criteria'].str.lower()\n", "df['key_words'] = df['key_words'].str.lower()" ] }, { "cell_type": "markdown", "id": "cb993e1c", "metadata": {}, "source": [ "# The less common scenarios\n", "In less common scenarios, a clinical trial may want to study, for example, pregnancy-related diabetes. These studies DO want to recruit pregnant women." ] }, { "cell_type": "code", "execution_count": 20, "id": "c7a7d7d8", "metadata": {}, "outputs": [], "source": [ "# Does a study want to recruit pregnant women?
\n", "# If pregnant = 1, the study wants to INCLUDE pregnant women.\n", "# If pregnant = 0, the study wants to EXCLUDE pregnant women.\n", "# Rows whose key word is not pregnancy-related stay NaN.\n", "\n", "df['pregnant'] = np.nan\n", "\n", "for index, row in df.iterrows():\n", "    if 'pregnant' in row['key_words'] or 'pregnancy' in row['key_words']:\n", "        df.loc[index, 'pregnant'] = 1\n" ] }, { "cell_type": "code", "execution_count": 21, "id": "903e2281", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr style=\"text-align: right;\"><th></th><th>id</th><th>criteria</th><th>key_words</th><th>pregnant</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>16</th><td>14</td><td>women pregnant with one fetus between 16 and 2...</td><td>sedentary time</td><td>NaN</td></tr>\n",
"<tr><th>17</th><td>14</td><td>women pregnant with one fetus between 16 and 2...</td><td>pregnancy</td><td>1.0</td></tr>\n",
"<tr><th>18</th><td>14</td><td>women pregnant with one fetus between 16 and 2...</td><td>pregnant women</td><td>1.0</td></tr>\n",
"<tr><th>19</th><td>14</td><td>women pregnant with one fetus between 16 and 2...</td><td>weight gain</td><td>NaN</td></tr>\n",
"<tr><th>20</th><td>15</td><td>this is a study that does not mention anything...</td><td>testing</td><td>NaN</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " id criteria key_words \\\n", "16 14 women pregnant with one fetus between 16 and 2... sedentary time \n", "17 14 women pregnant with one fetus between 16 and 2... pregnancy \n", "18 14 women pregnant with one fetus between 16 and 2... pregnant women \n", "19 14 women pregnant with one fetus between 16 and 2... weight gain \n", "20 15 this is a study that does not mention anything... testing \n", "\n", " pregnant \n", "16 NaN \n", "17 1.0 \n", "18 1.0 \n", "19 NaN \n", "20 NaN " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.tail()" ] }, { "cell_type": "code", "execution_count": 22, "id": "14d19db1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr style=\"text-align: right;\"><th></th><th>id</th><th>pregnant</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>10</th><td>11</td><td>NaN</td></tr>\n",
"<tr><th>11</th><td>12</td><td>NaN</td></tr>\n",
"<tr><th>12</th><td>13</td><td>NaN</td></tr>\n",
"<tr><th>13</th><td>14</td><td>1.0</td></tr>\n",
"<tr><th>14</th><td>15</td><td>NaN</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " id pregnant\n", "10 11 NaN\n", "11 12 NaN\n", "12 13 NaN\n", "13 14 1.0\n", "14 15 NaN" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Group by id: if any key word of a study is pregnancy-related, label the entire study as pregnant = 1.\n", "# max() skips NaN, so a study with no pregnancy-related key word stays NaN.\n", "df1 = df.groupby(['id'])['pregnant'].agg('max').reset_index()\n", "df1.tail()\n" ] }, { "cell_type": "code", "execution_count": 23, "id": "3f32dd16", "metadata": {}, "outputs": [], "source": [ "# Now merge the per-study label back into the long-format df.\n", "# Drop the row-level column first; the keyword form avoids the pandas FutureWarning about positional arguments.\n", "df = df.drop(columns='pregnant')\n", "df2 = df.merge(df1, on='id', how='outer')\n" ] }, { "cell_type": "code", "execution_count": 24, "id": "4bcb95f1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr style=\"text-align: right;\"><th></th><th>id</th><th>criteria</th><th>key_words</th><th>pregnant</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>1</td><td>for female participants, currently breastfeedi...</td><td>testing</td><td>NaN</td></tr>\n",
"<tr><th>1</th><td>1</td><td>for female participants, currently breastfeedi...</td><td>testing</td><td>NaN</td></tr>\n",
"<tr><th>2</th><td>1</td><td>for female participants, currently breastfeedi...</td><td>testing</td><td>NaN</td></tr>\n",
"<tr><th>3</th><td>2</td><td>patients will be excluded if they are pregnant.</td><td>testing</td><td>NaN</td></tr>\n",
"<tr><th>4</th><td>3</td><td>pregnant women or women currently breastfeeding;</td><td>testing</td><td>NaN</td></tr>\n",
"<tr><th>5</th><td>4</td><td>females who are pregnant or nursing</td><td>testing</td><td>NaN</td></tr>\n",
"<tr><th>6</th><td>5</td><td>in the case of women of childbearing age, urin...</td><td>testing</td><td>NaN</td></tr>\n",
"<tr><th>7</th><td>6</td><td>are pregnant or lactating or planning to becom...</td><td>testing</td><td>NaN</td></tr>\n",
"<tr><th>8</th><td>7</td><td>pregnant or breast-feeding</td><td>testing</td><td>NaN</td></tr>\n",
"<tr><th>9</th><td>8</td><td>patients who are pregnant or may be pregnant</td><td>testing</td><td>NaN</td></tr>\n",
"<tr><th>10</th><td>9</td><td>not pregnant or nursing</td><td>testing</td><td>NaN</td></tr>\n",
"<tr><th>11</th><td>10</td><td>females who are pregnant.</td><td>testing</td><td>NaN</td></tr>\n",
"<tr><th>12</th><td>11</td><td>for females of child-bearing age, current preg...</td><td>testing</td><td>NaN</td></tr>\n",
"<tr><th>13</th><td>12</td><td>for females of child-bearing age, current brea...</td><td>testing</td><td>NaN</td></tr>\n",
"<tr><th>14</th><td>13</td><td>are not pregnant (negative urine test) or brea...</td><td>testing</td><td>NaN</td></tr>\n",
"<tr><th>15</th><td>14</td><td>women pregnant with one fetus between 16 and 2...</td><td>physical activity</td><td>1.0</td></tr>\n",
"<tr><th>16</th><td>14</td><td>women pregnant with one fetus between 16 and 2...</td><td>sedentary time</td><td>1.0</td></tr>\n",
"<tr><th>17</th><td>14</td><td>women pregnant with one fetus between 16 and 2...</td><td>pregnancy</td><td>1.0</td></tr>\n",
"<tr><th>18</th><td>14</td><td>women pregnant with one fetus between 16 and 2...</td><td>pregnant women</td><td>1.0</td></tr>\n",
"<tr><th>19</th><td>14</td><td>women pregnant with one fetus between 16 and 2...</td><td>weight gain</td><td>1.0</td></tr>\n",
"<tr><th>20</th><td>15</td><td>this is a study that does not mention anything...</td><td>testing</td><td>NaN</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " id criteria key_words \\\n", "0 1 for female participants, currently breastfeedi... testing \n", "1 1 for female participants, currently breastfeedi... testing \n", "2 1 for female participants, currently breastfeedi... testing \n", "3 2 patients will be excluded if they are pregnant. testing \n", "4 3 pregnant women or women currently breastfeeding; testing \n", "5 4 females who are pregnant or nursing testing \n", "6 5 in the case of women of childbearing age, urin... testing \n", "7 6 are pregnant or lactating or planning to becom... testing \n", "8 7 pregnant or breast-feeding testing \n", "9 8 patients who are pregnant or may be pregnant testing \n", "10 9 not pregnant or nursing testing \n", "11 10 females who are pregnant. testing \n", "12 11 for females of child-bearing age, current preg... testing \n", "13 12 for females of child-bearing age, current brea... testing \n", "14 13 are not pregnant (negative urine test) or brea... testing \n", "15 14 women pregnant with one fetus between 16 and 2... physical activity \n", "16 14 women pregnant with one fetus between 16 and 2... sedentary time \n", "17 14 women pregnant with one fetus between 16 and 2... pregnancy \n", "18 14 women pregnant with one fetus between 16 and 2... pregnant women \n", "19 14 women pregnant with one fetus between 16 and 2... weight gain \n", "20 15 this is a study that does not mention anything... 
testing \n", "\n", " pregnant \n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "5 NaN \n", "6 NaN \n", "7 NaN \n", "8 NaN \n", "9 NaN \n", "10 NaN \n", "11 NaN \n", "12 NaN \n", "13 NaN \n", "14 NaN \n", "15 1.0 \n", "16 1.0 \n", "17 1.0 \n", "18 1.0 \n", "19 1.0 \n", "20 NaN " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "markdown", "id": "a15339fb", "metadata": {}, "source": [ "# Most common scenarios" ] }, { "cell_type": "markdown", "id": "d8aa02c4", "metadata": {}, "source": [ "In the most common scenario, clinical trials do not want to recruit women who are pregnant or breast-feeding, out of concern for the baby.\n", "\n", "If a clinical trial's key words do not contain pregnancy-related words AND its eligibility criteria mention pregnancy-related words, we mark it as a study that wants to AVOID recruiting pregnant women." ] }, { "cell_type": "code", "execution_count": 30, "id": "ca59627d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr style=\"text-align: right;\"><th></th><th>id</th><th>criteria</th><th>key_words</th><th>pregnant</th><th>lactating</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>1</td><td>for female participants, currently breastfeedi...</td><td>testing</td><td>NaN</td><td>0.0</td></tr>\n",
"<tr><th>1</th><td>1</td><td>for female participants, currently breastfeedi...</td><td>testing</td><td>NaN</td><td>0.0</td></tr>\n",
"<tr><th>2</th><td>1</td><td>for female participants, currently breastfeedi...</td><td>testing</td><td>NaN</td><td>0.0</td></tr>\n",
"<tr><th>3</th><td>2</td><td>patients will be excluded if they are pregnant.</td><td>testing</td><td>0.0</td><td>NaN</td></tr>\n",
"<tr><th>4</th><td>3</td><td>pregnant women or women currently breastfeeding;</td><td>testing</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>5</th><td>4</td><td>females who are pregnant or nursing</td><td>testing</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>6</th><td>5</td><td>in the case of women of childbearing age, urin...</td><td>testing</td><td>0.0</td><td>NaN</td></tr>\n",
"<tr><th>7</th><td>6</td><td>are pregnant or lactating or planning to becom...</td><td>testing</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>8</th><td>7</td><td>pregnant or breast-feeding</td><td>testing</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>9</th><td>8</td><td>patients who are pregnant or may be pregnant</td><td>testing</td><td>0.0</td><td>NaN</td></tr>\n",
"<tr><th>10</th><td>9</td><td>not pregnant or nursing</td><td>testing</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>11</th><td>10</td><td>females who are pregnant.</td><td>testing</td><td>0.0</td><td>NaN</td></tr>\n",
"<tr><th>12</th><td>11</td><td>for females of child-bearing age, current preg...</td><td>testing</td><td>0.0</td><td>NaN</td></tr>\n",
"<tr><th>13</th><td>12</td><td>for females of child-bearing age, current brea...</td><td>testing</td><td>NaN</td><td>0.0</td></tr>\n",
"<tr><th>14</th><td>13</td><td>are not pregnant (negative urine test) or brea...</td><td>testing</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>15</th><td>14</td><td>women pregnant with one fetus between 16 and 2...</td><td>physical activity</td><td>1.0</td><td>NaN</td></tr>\n",
"<tr><th>16</th><td>14</td><td>women pregnant with one fetus between 16 and 2...</td><td>sedentary time</td><td>1.0</td><td>NaN</td></tr>\n",
"<tr><th>17</th><td>14</td><td>women pregnant with one fetus between 16 and 2...</td><td>pregnancy</td><td>1.0</td><td>NaN</td></tr>\n",
"<tr><th>18</th><td>14</td><td>women pregnant with one fetus between 16 and 2...</td><td>pregnant women</td><td>1.0</td><td>NaN</td></tr>\n",
"<tr><th>19</th><td>14</td><td>women pregnant with one fetus between 16 and 2...</td><td>weight gain</td><td>1.0</td><td>NaN</td></tr>\n",
"<tr><th>20</th><td>15</td><td>this is a study that does not mention anything...</td><td>testing</td><td>NaN</td><td>NaN</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " id criteria key_words \\\n", "0 1 for female participants, currently breastfeedi... testing \n", "1 1 for female participants, currently breastfeedi... testing \n", "2 1 for female participants, currently breastfeedi... testing \n", "3 2 patients will be excluded if they are pregnant. testing \n", "4 3 pregnant women or women currently breastfeeding; testing \n", "5 4 females who are pregnant or nursing testing \n", "6 5 in the case of women of childbearing age, urin... testing \n", "7 6 are pregnant or lactating or planning to becom... testing \n", "8 7 pregnant or breast-feeding testing \n", "9 8 patients who are pregnant or may be pregnant testing \n", "10 9 not pregnant or nursing testing \n", "11 10 females who are pregnant. testing \n", "12 11 for females of child-bearing age, current preg... testing \n", "13 12 for females of child-bearing age, current brea... testing \n", "14 13 are not pregnant (negative urine test) or brea... testing \n", "15 14 women pregnant with one fetus between 16 and 2... physical activity \n", "16 14 women pregnant with one fetus between 16 and 2... sedentary time \n", "17 14 women pregnant with one fetus between 16 and 2... pregnancy \n", "18 14 women pregnant with one fetus between 16 and 2... pregnant women \n", "19 14 women pregnant with one fetus between 16 and 2... weight gain \n", "20 15 this is a study that does not mention anything... 
testing \n", "\n", " pregnant lactating \n", "0 NaN 0.0 \n", "1 NaN 0.0 \n", "2 NaN 0.0 \n", "3 0.0 NaN \n", "4 0.0 0.0 \n", "5 0.0 0.0 \n", "6 0.0 NaN \n", "7 0.0 0.0 \n", "8 0.0 0.0 \n", "9 0.0 NaN \n", "10 0.0 0.0 \n", "11 0.0 NaN \n", "12 0.0 NaN \n", "13 NaN 0.0 \n", "14 0.0 0.0 \n", "15 1.0 NaN \n", "16 1.0 NaN \n", "17 1.0 NaN \n", "18 1.0 NaN \n", "19 1.0 NaN \n", "20 NaN NaN " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2['lactating'] = np.nan\n", "\n", "# Only studies not already labelled pregnant = 1 are checked for exclusion wording.\n", "for index, row in df2.iterrows():\n", "    if row['pregnant'] != 1:\n", "        if 'pregnant' in row['criteria'] or 'pregnancy' in row['criteria']:\n", "            df2.loc[index, 'pregnant'] = 0\n", "        if 'nursing' in row['criteria'] or 'breast-feeding' in row['criteria'] or 'breastfeeding' in row['criteria'] or 'breast feeding' in row['criteria'] or 'lactating' in row['criteria']:\n", "            df2.loc[index, 'lactating'] = 0\n", "\n", "df2" ] }, { "cell_type": "markdown", "id": "a62e65a9", "metadata": {}, "source": [ "For each study, the table now indicates whether it wants to:\n", "- exclude women who are pregnant (pregnant = 0)\n", "- include women who are pregnant (pregnant = 1)\n", "- exclude women who are lactating (lactating = 0)\n", "- or does not specify a preference regarding pregnancy or lactation (NaN)\n" ] }, { "cell_type": "code", "execution_count": 31, "id": "54038d07", "metadata": {}, "outputs": [], "source": [ "df2.to_csv('output_pregnant_' + date + '.csv', index=False)" ] }, { "cell_type": "markdown", "id": "240c04d5", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 5 }