In [17]:
from datetime import datetime
date = datetime.today().strftime('%y%m%d')
print('Last modified by Xiaoqing: ' + date)

Last modified by Xiaoqing: 211210


In [18]:
import pandas as pd
import numpy as np

# Problem statement
Words such as 'pregnant', 'breast-feeding', and 'lactating' frequently appear in clinical trial eligibility criteria. Unfortunately, stanza NER does not recognize them (it can, however, recognize 'pregnancy' as a problem).

Let's fix this.

# Note:
For this notebook to work, the key words must already have been extracted from each clinical trial.

One clinical trial can have multiple rows; each row corresponds to a different key word.

Under criteria, each row should contain ALL the bullet points, without separating them into different rows.
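
As a rough illustration of the layout described in this note, here is a minimal sketch of a two-trial input. The ids, criteria text, and key words are made up for the example; only the column names `id`, `criteria`, and `key_words` come from this notebook.

```python
import pandas as pd

# Hypothetical example of the expected long format:
# one row per (trial, key word), with the full criteria text repeated on each row.
example = pd.DataFrame({
    'id': [1, 1, 2],
    'criteria': [
        '- Women pregnant with one fetus\n- Age 18 or older',
        '- Women pregnant with one fetus\n- Age 18 or older',
        '- Patients will be excluded if they are pregnant.',
    ],
    'key_words': ['pregnancy', 'weight gain', 'hypertension'],
})

# Every row of a trial should carry the same, unsplit criteria text,
# so each trial should have exactly one distinct criteria value.
criteria_per_trial = example.groupby('id')['criteria'].nunique()
assert (criteria_per_trial == 1).all()
```

A check like this can be run on the real input before the cells below, to catch criteria that were accidentally split across rows.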

In [19]:
df = pd.read_csv('input_pregnant_121021.csv')
df['criteria'] = df['criteria'].str.lower()
df['key_words'] = df['key_words'].str.lower()

# The less common scenarios
In less common scenarios, a clinical trial may study a pregnancy-related condition, for example pregnancy-related diabetes. These studies DO want to recruit pregnant women.

In [20]:
# Does a study want to recruit pregnant women? 
# If pregnant = 1, it means they want to INCLUDE pregnant women.
# If pregnant = 0, it means they want to EXCLUDE pregnant women.

df['pregnant'] = np.nan

for index, row in df.iterrows():
    if 'pregnant' in row['key_words'] or 'pregnancy' in row['key_words']:
        df.loc[index,'pregnant'] = 1
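
The same flag can also be computed without an explicit loop. The following is a minimal sketch on a toy frame (the rows are made up), assuming the goal is identical to the loop above: mark a row with 1 when its key word contains 'pregnant' or 'pregnancy', and leave it as NaN otherwise.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are hypothetical.
toy = pd.DataFrame({
    'id': [14, 14, 15],
    'key_words': ['pregnant women', 'weight gain', None],
})

# str.contains treats the pattern as a regular expression by default,
# so 'pregnant|pregnancy' matches either word.
# na=False makes a missing key word count as "no match" instead of raising or returning NaN.
is_pregnancy_keyword = toy['key_words'].str.contains('pregnant|pregnancy', na=False)

# 1.0 where the key word matches, NaN everywhere else (same encoding as the loop).
toy['pregnant'] = np.where(is_pregnancy_keyword, 1.0, np.nan)
```

One difference from the loop: a missing key word would make `'pregnant' in row['key_words']` fail with a `TypeError`, whereas `na=False` simply treats it as not matching.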


In [21]:
df.tail()

Unnamed: 0,id,criteria,key_words,pregnant
16,14,women pregnant with one fetus between 16 and 2...,sedentary time,
17,14,women pregnant with one fetus between 16 and 2...,pregnancy,1.0
18,14,women pregnant with one fetus between 16 and 2...,pregnant women,1.0
19,14,women pregnant with one fetus between 16 and 2...,weight gain,
20,15,this is a study that does not mention anything...,testing,


In [22]:
# Group by id: if any key word of a study contains a pregnancy-related word, label the entire study as pregnant = 1
df1 = df.groupby(['id'])['pregnant'].agg('max').reset_index()
df1.tail()


Unnamed: 0,id,pregnant
10,11,
11,12,
12,13,
13,14,1.0
14,15,


In [23]:
# now merge this with the long format df
# (drop the row-level flag first; passing the axis positionally is deprecated)
df = df.drop(columns='pregnant')
df2 = df.merge(df1, on='id', how='outer')


In [24]:
df2

Unnamed: 0,id,criteria,key_words,pregnant
0,1,"for female participants, currently breastfeedi...",testing,
1,1,"for female participants, currently breastfeedi...",testing,
2,1,"for female participants, currently breastfeedi...",testing,
3,2,patients will be excluded if they are pregnant.,testing,
4,3,pregnant women or women currently breastfeeding;,testing,
5,4,females who are pregnant or nursing,testing,
6,5,"in the case of women of childbearing age, urin...",testing,
7,6,are pregnant or lactating or planning to becom...,testing,
8,7,pregnant or breast-feeding,testing,
9,8,patients who are pregnant or may be pregnant,testing,
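
The aggregate-then-merge steps above could also be written as a single group-wise transform. This is a sketch on a toy frame with made-up rows, not a replacement the notebook depends on.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the long-format frame; the values are hypothetical.
toy = pd.DataFrame({
    'id': [14, 14, 15],
    'key_words': ['pregnancy', 'weight gain', 'testing'],
    'pregnant': [1.0, np.nan, np.nan],
})

# transform('max') returns one value per original row, so no merge is needed.
# max skips NaN within a group, and a group with only NaN stays NaN.
toy['pregnant'] = toy.groupby('id')['pregnant'].transform('max')
```

Here both rows of study 14 end up with `pregnant = 1.0`, while study 15 stays `NaN`.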


# Most common scenarios

In the most common scenario, clinical trials do not want to recruit women who are pregnant or breast-feeding, out of concern for the baby.

If a clinical trial's key words do not contain pregnancy-related words AND its eligibility criteria mention pregnancy-related words, we mark it as a study that wants to AVOID recruiting pregnant women. The same rule is applied to breast-feeding terms to fill a separate lactating column.
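
As a loop-free illustration of this rule, here is a minimal sketch on a toy frame. The rows are made up, and the search terms are the ones this notebook uses; it assumes the `pregnant` column already holds 1 for studies whose key words mention pregnancy.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df2; the values are hypothetical.
toy = pd.DataFrame({
    'criteria': [
        'pregnant or breast-feeding',
        'women pregnant with one fetus',
        'age 18 or older',
    ],
    'pregnant': [np.nan, 1.0, np.nan],  # 1.0 = study recruits pregnant women
})

# NaN != 1 evaluates to True, so unlabeled studies are eligible for a 0 label.
not_recruiting = toy['pregnant'] != 1
mentions_pregnancy = toy['criteria'].str.contains('pregnant|pregnancy', na=False)
mentions_lactation = toy['criteria'].str.contains(
    'nursing|breast-feeding|breastfeeding|breast feeding|lactating', na=False)

toy['lactating'] = np.nan
toy.loc[not_recruiting & mentions_pregnancy, 'pregnant'] = 0
toy.loc[not_recruiting & mentions_lactation, 'lactating'] = 0
```

The first study is labeled as excluding both pregnant and lactating women, the second keeps `pregnant = 1`, and the third stays unlabeled. Note that plain substring matching would also flag phrases such as 'nursing home'; whether that matters depends on the criteria text.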

In [30]:
df2['lactating'] = np.nan

lactation_terms = ['nursing', 'breast-feeding', 'breastfeeding', 'breast feeding', 'lactating']

for index, row in df2.iterrows():
    # Skip studies already flagged as recruiting pregnant women.
    if row['pregnant'] != 1:
        if 'pregnant' in row['criteria'] or 'pregnancy' in row['criteria']:
            df2.loc[index, 'pregnant'] = 0
        if any(term in row['criteria'] for term in lactation_terms):
            df2.loc[index, 'lactating'] = 0

df2


Unnamed: 0,id,criteria,key_words,pregnant,lactating
0,1,"for female participants, currently breastfeedi...",testing,,0.0
1,1,"for female participants, currently breastfeedi...",testing,,0.0
2,1,"for female participants, currently breastfeedi...",testing,,0.0
3,2,patients will be excluded if they are pregnant.,testing,0.0,
4,3,pregnant women or women currently breastfeeding;,testing,0.0,0.0
5,4,females who are pregnant or nursing,testing,0.0,0.0
6,5,"in the case of women of childbearing age, urin...",testing,0.0,
7,6,are pregnant or lactating or planning to becom...,testing,0.0,0.0
8,7,pregnant or breast-feeding,testing,0.0,0.0
9,8,patients who are pregnant or may be pregnant,testing,0.0,


Each study is now labeled according to whether it wants to...
- exclude women who are pregnant (pregnant = 0)
- include women who are pregnant (pregnant = 1)
- exclude women who are lactating (lactating = 0)
- or does not specify whether it cares about pregnancy or lactation (NaN)
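
For reading the output, the numeric codes can be translated into text labels. This is only a sketch: the label strings and the `pregnant_label` column name are hypothetical, while the codes 1, 0, and NaN are the ones listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical label text for the codes used in this notebook.
PREGNANT_LABELS = {1.0: 'include', 0.0: 'exclude'}

# Toy stand-in for df2; the values are made up.
toy = pd.DataFrame({'id': [2, 14, 15], 'pregnant': [0.0, 1.0, np.nan]})

# map() leaves values without a dictionary entry (here NaN) as NaN,
# which fillna then turns into 'not specified'.
toy['pregnant_label'] = toy['pregnant'].map(PREGNANT_LABELS).fillna('not specified')
```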


In [31]:
df2.to_csv('output_pregnant_' + date + '.csv', index=False)