{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# NLP Pipeline "
]
},
{
"cell_type": "code",
"execution_count": 157,
"metadata": {},
"outputs": [],
"source": [
"from pymongo import MongoClient\n",
"import re\n",
"import string\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from nltk.tokenize import TreebankWordTokenizer\n",
"from nltk.stem import PorterStemmer\n",
"from nltk.corpus import stopwords\n",
"from sklearn.decomposition import LatentDirichletAllocation\n",
"from sklearn.preprocessing import Normalizer\n",
"import itertools"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Connect to the Mongo clinical_trials database "
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"def connect_to_mongo(database, collection):\n",
" \n",
" \"\"\"\n",
" Opens a connection to a specified Mongo DB location\n",
" \n",
" Input Parameters:\n",
" database: name of database to connect to or create (str)\n",
" collection: name of collection to connect to or create (str)\n",
" \n",
" Returns:\n",
" The connection object for the database without a collection specified\n",
" The connection object for a specific Mongo location (database & collection)\n",
" \"\"\"\n",
" \n",
" client = MongoClient()\n",
" db = client[database]\n",
" mongo_loc = db[collection]\n",
" return db, mongo_loc"
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {},
"outputs": [],
"source": [
"trials_loc, eligibility_loc = connect_to_mongo('clinical_trials', 'eligibilities')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a Mongo cursor "
]
},
{
"cell_type": "code",
"execution_count": 131,
"metadata": {},
"outputs": [],
"source": [
"doc_cursor = eligibility_loc.find()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Clean the text "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test on 5 entires "
]
},
{
"cell_type": "code",
"execution_count": 282,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NCT00864942\n",
"\n",
"Criteria:\n",
" -------------\n",
" Documented relapsed or refractory B-cell NHL; CD-20 positive tumor. Indolent NHL: follicular B-cell lymphoma, diffuse small lymphocytic lymphoma, lymphoplasmacytic lymphoma, marginal zone lymphoma, transformed aggressive lymphomas, mantle cell lymphoma and chronic lymphocytic leukemia\n",
"\n",
"Criteria:\n",
" -------------\n",
" Maximum of 6 prior chemotherapy regimens. Prior rituximab is allowed.\n",
"\n",
"Criteria:\n",
" -------------\n",
" Bidimensionally measurable disease\n",
"\n",
"Criteria:\n",
" -------------\n",
" ECOG performance status 0-2\n",
"\n",
"Criteria:\n",
" -------------\n",
" Absolute neutrophil count >/= 1000 and platelet count >/= 50,000\n",
"\n",
"Criteria:\n",
" -------------\n",
" Serum creatinine </= 1.5 mg/dL\n",
"\n",
"Criteria:\n",
" -------------\n",
" Adequate hepatic function\n",
"\n",
"Criteria:\n",
" -------------\n",
" Estimated life expectancy of at least 3 months\n",
"\n",
"Criteria:\n",
" -------------\n",
" All study participants must be registered into the mandatory RevAssist program and be willing and able to comply with the requirements of RevAssist\n",
"\n",
"Criteria:\n",
" -------------\n",
" Able to take aspirin 81 mg daily as prophylactic anticoagulation\n",
"\n",
" Cleaned Criteria:\n",
"-----------\n",
" ['documented relapsed or refractory b-cell nhl cd-20 positive tumor indolent nhl follicular b-cell lymphoma diffuse small lymphocytic lymphoma lymphoplasmacytic lymphoma marginal zone lymphoma transformed aggressive lymphomas mantle cell lymphoma and chronic lymphocytic leukemia', 'maximum of 6 prior chemotherapy regimens prior rituximab is allow', 'bidimensionally measurable diseas', 'ecog performance status 0-2', 'absolute neutrophil count > 1000 and platelet count > 50000', 'serum creatinine < 1.5 mg/dl', 'adequate hepatic funct', 'estimated life expectancy of at least 3 month', 'all study participants must be registered into the mandatory revassist program and be willing and able to comply with the requirements of revassist', 'able to take aspirin 81 mg daily as prophylactic anticoagul']\n",
"NCT00864929\n",
"\n",
"Criteria:\n",
" -------------\n",
" Patient was diagnosed as nosocomial infection defined according to criteria established by the US CDC. The diagnosis criteria for ventilator-associated pneumonia are modified from those established by the American College of Chest Physicians.\n",
"\n",
"Criteria:\n",
" -------------\n",
" Patient received empiric antimicrobial therapy within 24 hour from onset of infection and had antimicrobial susceptibility.\n",
"\n",
" Cleaned Criteria:\n",
"-----------\n",
" ['patient was diagnosed as nosocomial infection defined according to criteria established by the us cdc the diagnosis criteria for ventilator-associated pneumonia are modified from those established by the american college of chest physician', 'patient received empiric antimicrobial therapy within 24 hour from onset of infection and had antimicrobial suscept']\n",
"NCT00864916\n",
"\n",
"Criteria:\n",
" -------------\n",
" Documentation of HIV infection with a positive HIV enzyme-linked immunosorbent assay (ELISA) test and confirmatory western blot test\n",
"\n",
"Criteria:\n",
" -------------\n",
" Has not received any antiretroviral therapies in the 6 months before screening\n",
"\n",
"Criteria:\n",
" -------------\n",
" Participant is planning to initiate cART, per the primary HIV caregiver (there is no CD4 or HIV-1 RNA level criteria)\n",
"\n",
" Cleaned Criteria:\n",
"-----------\n",
" ['documentation of hiv infection with a positive hiv enzyme-linked immunosorbent assay elisa test and confirmatory western blot test', 'has not received any antiretroviral therapies in the 6 months before screen', 'participant is planning to initiate cart per the primary hiv caregiver there is no cd4 or hiv-1 rna level criteria']\n",
"NCT00864903\n",
"\n",
"Criteria:\n",
" -------------\n",
" any child undergoing a spinal tap\n",
"\n",
"Criteria:\n",
" -------------\n",
" parents agreed on participation\n",
"\n",
" Cleaned Criteria:\n",
"-----------\n",
" ['any child undergoing a spinal tap', 'parents agreed on particip']\n",
"NCT00864370\n",
"\n",
"Criteria:\n",
" -------------\n",
" Medically healthy adult women (ages 18-45) fulfilling DSM-IV criteria for BPD of any subtype who are ≥ 16 weeks gestation dated by last menstrual period (LMP)\n",
"\n",
"Criteria:\n",
" -------------\n",
" Able to give informed consent and comply with study procedures\n",
"\n",
" Cleaned Criteria:\n",
"-----------\n",
" ['medically healthy adult women ages 18-45 fulfilling dsm-iv criteria for bpd of any subtype who are > 16 weeks gestation dated by last menstrual period lmp', 'able to give informed consent and comply with study procedur']\n",
"\n",
"Done.\n"
]
}
],
"source": [
"doc_cursor = eligibility_loc.find().limit(5)\n",
"stemmer = PorterStemmer()\n",
"doc_count = 1\n",
"\n",
"for doc in doc_cursor:\n",
" inclusion_criteria = doc['inclusion_criteria']\n",
" print(doc['study_id'])\n",
" clean_criteria_list = []\n",
" for criteria in inclusion_criteria:\n",
" print(\"\\nCriteria:\\n -------------\\n\", criteria)\n",
" remove_comma = re.sub(',', '', criteria)\n",
" remove_equals = re.sub('/=', '', remove_comma)\n",
" remove_period_space = re.sub('\\. ', ' ', remove_equals)\n",
" remove_less_than = re.sub('less than', '<', remove_period_space)\n",
" remove_less_than_equal = re.sub('less than or equal to', '<', remove_less_than)\n",
" remove_greater_than = re.sub('greater than', '>', remove_less_than_equal)\n",
" remove_greater_than_equal = re.sub('greater than or equal to', '>', remove_greater_than)\n",
" remove_gt_symbol = re.sub('≥', '>', remove_greater_than_equal)\n",
" remove_lt_symbolr = re.sub('≤', '<', remove_gt_symbol)\n",
" remove_semicolon = re.sub(';', '', remove_gt_symbol)\n",
" remove_colon = re.sub(':', '', remove_semicolon)\n",
" remove_lparen = re.sub('\\(', '', remove_colon)\n",
" remove_rparen = re.sub('\\)', '', remove_lparen)\n",
" same_dash = re.sub('–', '-', remove_rparen)\n",
" clean_crit = re.sub('\\.$', '', same_dash)\n",
" stem_crit = stemmer.stem(clean_crit)\n",
" clean_criteria_list.append(stem_crit)\n",
" print(\"\\n Cleaned Criteria:\\n-----------\\n\", clean_criteria_list)\n",
" eligibility_loc.update_one({'study_id':doc['study_id']}, {\"$set\": {\"cleaned_inclusion\": clean_criteria_list}}, upsert=False)\n",
" doc_count += 1\n",
"# if doc_count%2000 == 0:\n",
"# print(f'\\nCleaning doc {doc_count}')\n",
"print('\\nDone.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Clean inclusion criteria for 20000 documents "
]
},
{
"cell_type": "code",
"execution_count": 292,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Cleaning doc 2000\n",
"\n",
"Cleaning doc 4000\n",
"\n",
"Cleaning doc 6000\n",
"\n",
"Cleaning doc 8000\n",
"\n",
"Cleaning doc 10000\n",
"\n",
"Cleaning doc 12000\n",
"\n",
"Cleaning doc 14000\n",
"\n",
"Cleaning doc 16000\n",
"\n",
"Cleaning doc 18000\n",
"\n",
"Cleaning doc 20000\n",
"\n",
"Done.\n"
]
}
],
"source": [
"doc_cursor = eligibility_loc.find().limit(20000)\n",
"stemmer = PorterStemmer()\n",
"doc_count = 1\n",
"\n",
"for doc in doc_cursor:\n",
" inclusion_criteria = doc['inclusion_criteria']\n",
"# print(doc['study_id'])\n",
" clean_criteria_list = []\n",
" for criteria in inclusion_criteria:\n",
"# print(\"\\nCriteria:\\n -------------\\n\", criteria)\n",
" remove_comma = re.sub(',', '', criteria)\n",
" remove_equals = re.sub('/=', '', remove_comma)\n",
" remove_period_space = re.sub('\\. ', ' ', remove_equals)\n",
" remove_less_than = re.sub('less than', '<', remove_period_space)\n",
" remove_less_than_equal = re.sub('less than or equal to', '<', remove_less_than)\n",
" remove_greater_than = re.sub('greater than', '>', remove_less_than_equal)\n",
" remove_greater_than_equal = re.sub('greater than or equal to', '>', remove_greater_than)\n",
" remove_gt_symbol = re.sub('≥', '>', remove_greater_than_equal)\n",
" remove_lt_symbolr = re.sub('≤', '<', remove_gt_symbol)\n",
" remove_semicolon = re.sub(';', '', remove_gt_symbol)\n",
" remove_colon = re.sub(':', '', remove_semicolon)\n",
" remove_lparen = re.sub('\\(', '', remove_colon)\n",
" remove_rparen = re.sub('\\)', '', remove_lparen)\n",
" same_dash = re.sub('–', '-', remove_rparen)\n",
" clean_crit = re.sub('\\.$', '', same_dash)\n",
" stem_crit = stemmer.stem(clean_crit)\n",
" clean_criteria_list.append(stem_crit)\n",
"# print(\"\\n Cleaned Criteria:\\n-----------\\n\", clean_criteria_list)\n",
" eligibility_loc.update_one({'study_id':doc['study_id']}, {\"$set\": {\"cleaned_inclusion\": clean_criteria_list}}, upsert=False)\n",
" doc_count += 1\n",
" if doc_count%2000 == 0:\n",
" print(f'\\nCleaning doc {doc_count}')\n",
"print('\\nDone.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Vectorize "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Unpack inclusion critera so we can fit each one to the vectorizer separately "
]
},
{
"cell_type": "code",
"execution_count": 227,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['crit1', 'crit2', 'crit3', 'cirta', 'critb']"
]
},
"execution_count": 227,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"single_doc = [['crit1', 'crit2', 'crit3'], ['crita', 'critb']]\n",
"list(itertools.chain(*single_doc))\n",
"# use mongo explode next time"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test that unpacking works with the cursor to create a list the vectorizer will accept"
]
},
{
"cell_type": "code",
"execution_count": 234,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['documented relapsed or refractory b-cell nhl cd-20 positive tumor indolent nhl follicular b-cell lymphoma diffuse small lymphocytic lymphoma lymphoplasmacytic lymphoma marginal zone lymphoma transformed aggressive lymphomas mantle cell lymphoma and chronic lymphocytic leukemia',\n",
" 'maximum of 6 prior chemotherapy regimens prior rituximab is allow',\n",
" 'bidimensionally measurable diseas',\n",
" 'ecog performance status 0-2',\n",
" 'absolute neutrophil count > 1000 and platelet count > 50000',\n",
" 'serum creatinine < 1.5 mg/dl',\n",
" 'adequate hepatic funct',\n",
" 'estimated life expectancy of at least 3 month',\n",
" 'all study participants must be registered into the mandatory revassist program and be willing and able to comply with the requirements of revassist',\n",
" 'able to take aspirin 81 mg daily as prophylactic anticoagul',\n",
" 'patient was diagnosed as nosocomial infection defined according to criteria established by the us cdc the diagnosis criteria for ventilator-associated pneumonia are modified from those established by the american college of chest physician',\n",
" 'patient received empiric antimicrobial therapy within 24 hour from onset of infection and had antimicrobial suscept']"
]
},
"execution_count": 234,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trials_loc, eligibility_loc = connect_to_mongo('clinical_trials', 'eligibilities')\n",
"doc_cursor = eligibility_loc.find().limit(2)\n",
"\n",
"unpacked_criteria = list(itertools.chain(*(doc['cleaned_inclusion'] for doc in doc_cursor)))\n",
"unpacked_criteria"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create vectorizer "
]
},
{
"cell_type": "code",
"execution_count": 293,
"metadata": {},
"outputs": [],
"source": [
"count_vectorizer = CountVectorizer(ngram_range=(2, 4), \n",
" stop_words='english', \n",
" token_pattern=\"[a-zA-Z0-9_\\-ï/><\\.]+\",\n",
" lowercase=True,\n",
" max_df = 0.6,\n",
" min_df = 2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fit vectorizer on 1000 docs"
]
},
{
"cell_type": "code",
"execution_count": 315,
"metadata": {},
"outputs": [],
"source": [
"trials_loc, eligibility_loc = connect_to_mongo('clinical_trials', 'eligibilities')\n",
"doc_cursor = eligibility_loc.find().limit(1000)\n",
"\n",
"X = count_vectorizer.fit(list(itertools.chain(*(doc['cleaned_inclusion'] for doc in doc_cursor))))"
]
},
{
"cell_type": "code",
"execution_count": 321,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['- 1',\n",
" '- 10',\n",
" '- 2',\n",
" '- 30',\n",
" '- 30 kg/m2',\n",
" '- 35',\n",
" '- 35 kg/m2',\n",
" '- 4.5',\n",
" '- 50',\n",
" '- 60',\n",
" '- 70',\n",
" '- 70 year',\n",
" '- 80',\n",
" '- 80 year',\n",
" '- agree',\n",
" '- agree utilize',\n",
" '- agree utilize following',\n",
" '- revised',\n",
" '- revised cdrs-r',\n",
" '- revised cdrs-r total',\n",
" '- surgically',\n",
" '- surgically sterile',\n",
" '- surgically sterile years',\n",
" '--- agree',\n",
" '--- agree utilize',\n",
" '--- agree utilize following',\n",
" '--- surgically',\n",
" '--- surgically sterile',\n",
" '--- surgically sterile years',\n",
" '-b -drb1',\n",
" '/- additional',\n",
" '0 -',\n",
" '0 - 1',\n",
" '0 - 2',\n",
" '0 1',\n",
" '0 1 2',\n",
" '0 1 screen',\n",
" '0 2',\n",
" '0 screening',\n",
" '0 screening period',\n",
" '0.50 0.81',\n",
" '0.50 0.81 previous',\n",
" '0.81 previous',\n",
" '01 2',\n",
" '0501 nct00412360',\n",
" '1 2',\n",
" '1 2 3',\n",
" '1 21',\n",
" '1 21 years',\n",
" '1 21 years old',\n",
" '1 3',\n",
" '1 alcoholic',\n",
" '1 alcoholic drink',\n",
" '1 alcoholic drink day',\n",
" '1 cm',\n",
" '1 cm spiral',\n",
" '1 cm spiral ct',\n",
" '1 diabetes',\n",
" '1 diabetes >',\n",
" '1 diabetes > 5',\n",
" '1 diabetes using',\n",
" '1 diabetes using usual',\n",
" '1 dose',\n",
" '1 dose escalation',\n",
" '1 elective',\n",
" '1 elective pci',\n",
" '1 elective pci include',\n",
" '1 following',\n",
" '1 month',\n",
" '1 screen',\n",
" '1 type',\n",
" '1 type 2',\n",
" '1 type 2 diabetes',\n",
" '1 week',\n",
" '1 week accru',\n",
" '1 year',\n",
" '1 year post-menopausal',\n",
" '1 year prior',\n",
" '1.0 mg/dl',\n",
" '1.5 cm',\n",
" '1.5 mg/dl',\n",
" '1.5 mg/dl measured',\n",
" '1.5 mg/dl measured creatinine',\n",
" '1.5 mg/dl unless',\n",
" '1.5 pt/ptt',\n",
" '1.5 times',\n",
" '1.5 times uln',\n",
" '1.5 times upper',\n",
" '1.5 times upper limit',\n",
" '1.5 times upper limits',\n",
" '1.5 x',\n",
" '1.5 x 10',\n",
" '1.5 x 10 9/l',\n",
" '1.5 x 109/l',\n",
" '1.5 x 109/l platelets',\n",
" '1.5 x institutional',\n",
" '1.5 x institutional upper',\n",
" '1.5 x uln',\n",
" '1.5 x upper',\n",
" '1.5 x upper limit',\n",
" '1.7 prothrombin',\n",
" '1.7 prothrombin time',\n",
" '1.7 prothrombin time 2',\n",
" '10 9',\n",
" '10 9 cells/l',\n",
" '10 9/l',\n",
" '10 9/l platelet',\n",
" '10 9/l platelet count',\n",
" '10 baseline',\n",
" '10 baseline week',\n",
" '10 baseline week likert',\n",
" '10 day',\n",
" '10 g/dl',\n",
" '10 hours',\n",
" '10 hours final',\n",
" '10 hours final concentration',\n",
" '10 mg',\n",
" '10 mg 20',\n",
" '10 mg 20 mg',\n",
" '10 mm',\n",
" '10 mm spiral',\n",
" '10 normal',\n",
" '10 normal height',\n",
" '10 normal height body',\n",
" '10 pack-years',\n",
" '10 pack-years exposure',\n",
" '10 pack-years exposure obesity',\n",
" '10 year',\n",
" '10 years',\n",
" '100 g/l',\n",
" '100 iu/l',\n",
" '100 kg',\n",
" '100 mg/dl',\n",
" '100 x',\n",
" '100 x 10',\n",
" '100 x 109/l',\n",
" '1000/mm 3',\n",
" '100000/ l',\n",
" '100000/mm 3',\n",
" '109/l 1',\n",
" '109/l 1 week',\n",
" '109/l 1 week accru',\n",
" '109/l bilirubin',\n",
" '109/l bilirubin <',\n",
" '109/l bilirubin < 2',\n",
" '109/l platelet',\n",
" '109/l platelets',\n",
" '109/l platelets >',\n",
" '109/l platelets > 7.5',\n",
" '110 kg',\n",
" '110 lb',\n",
" '12 15',\n",
" '12 17',\n",
" '12 17 years',\n",
" '12 17 years age',\n",
" '12 hours',\n",
" '12 hours age',\n",
" '12 hours prior',\n",
" '12 hours prior study',\n",
" '12 lead',\n",
" '12 lead ecg',\n",
" '12 lead ecg vital',\n",
" '12 month',\n",
" '12 months',\n",
" '12 months prior',\n",
" '12 months prior study',\n",
" '12 week',\n",
" '12 weeks',\n",
" '12 weeks mor',\n",
" '12 weeks prior',\n",
" '12 year',\n",
" '12 years',\n",
" '12 years age',\n",
" '12-lead ecg',\n",
" '12-lead electrocardiogram',\n",
" '12-lead electrocardiogram ecg',\n",
" '135-246 pounds',\n",
" '135-246 pounds individual',\n",
" '135-246 pounds individual weight',\n",
" '14 days',\n",
" '14 days prior',\n",
" '14 years',\n",
" '15 ideal',\n",
" '15 ideal body',\n",
" '15 ideal body weight',\n",
" '15 mm',\n",
" '15 mmhg',\n",
" '15 post',\n",
" '15 post saline',\n",
" '15 post saline value',\n",
" '15 years',\n",
" '150 mg/dl',\n",
" '150 x',\n",
" '150 x 109/l',\n",
" '1500/ l',\n",
" '1500/mm 3',\n",
" '16 weeks',\n",
" '16 years',\n",
" '17 year',\n",
" '17 years',\n",
" '17 years age',\n",
" '17 years age inclusive',\n",
" '18 -',\n",
" '18 - 70',\n",
" '18 - 70 year',\n",
" '18 - 80',\n",
" '18 30',\n",
" '18 30 kg/m2',\n",
" '18 30 kg/m2 total',\n",
" '18 32',\n",
" '18 40',\n",
" '18 45',\n",
" '18 45 inclusive',\n",
" '18 45 inclusive years',\n",
" '18 45 years',\n",
" '18 45 years ag',\n",
" '18 49',\n",
" '18 49 years',\n",
" '18 49 years inclusive',\n",
" '18 50',\n",
" '18 50 years',\n",
" '18 50 years age',\n",
" '18 55',\n",
" '18 55 years',\n",
" '18 55 years ag',\n",
" '18 55 years age',\n",
" '18 55 years inclus',\n",
" '18 55 years inclusive',\n",
" '18 55 years old',\n",
" '18 65',\n",
" '18 65 year',\n",
" '18 65 years',\n",
" '18 65 years age',\n",
" '18 70',\n",
" '18 70 year',\n",
" '18 70 years',\n",
" '18 75',\n",
" '18 75 years',\n",
" '18 75 years ag',\n",
" '18 80',\n",
" '18 80 year',\n",
" '18 80 years',\n",
" '18 80 years ag',\n",
" '18 85',\n",
" '18 <',\n",
" '18 < 60',\n",
" '18 great',\n",
" '18 old',\n",
" '18 older',\n",
" '18 sex',\n",
" '18 sex male',\n",
" '18 sex male femal',\n",
" '18 upper',\n",
" '18 upper age',\n",
" '18 upper age limit',\n",
" '18 year',\n",
" '18 years',\n",
" '18 years 55',\n",
" '18 years 55 year',\n",
" '18 years 65',\n",
" '18 years 65 year',\n",
" '18 years ag',\n",
" '18 years age',\n",
" '18 years age old',\n",
" '18 years age older',\n",
" '18 years age ov',\n",
" '18 years informed',\n",
" '18 years informed consent',\n",
" '18 years legal',\n",
" '18 years legal guardian',\n",
" '18 years old',\n",
" '18 years old old',\n",
" '18 years older',\n",
" '18 years time',\n",
" '18 years time cons',\n",
" '18 yrs',\n",
" '18 yrs old',\n",
" '18-45 years',\n",
" '18-45 years old',\n",
" '18-60 year',\n",
" '18-60 years',\n",
" '18-65 year',\n",
" '18-65 years',\n",
" '18-65 years old',\n",
" '18-75 year',\n",
" '18-75 years',\n",
" '18-75 years old',\n",
" '180 days',\n",
" '189 mg/dl',\n",
" '19 30',\n",
" '19 30 kg/m2',\n",
" '19 30 kg/m2 inclusive',\n",
" '19 years',\n",
" '19.0 <',\n",
" '19.0 < 30.0',\n",
" '19.0 < 30.0 kg/m2',\n",
" '1983 height',\n",
" '1983 height weight',\n",
" '1983 height weight body',\n",
" '1987 revised',\n",
" '1987 revised criteria',\n",
" '1st screen',\n",
" '2 3',\n",
" '2 5',\n",
" '2 acs',\n",
" '2 acs include',\n",
" '2 acs include presented',\n",
" '2 additional',\n",
" '2 adequate',\n",
" '2 appendix',\n",
" '2 appendix 1',\n",
" '2 cm',\n",
" '2 consecutive',\n",
" '2 criteria',\n",
" '2 days',\n",
" '2 determined',\n",
" '2 determined ctcae',\n",
" '2 determined ctcae v3',\n",
" '2 diabet',\n",
" '2 diabetes',\n",
" '2 diabetes hypertension',\n",
" '2 diabetes hypertension hypercholesterolemia',\n",
" '2 diabetes mellitu',\n",
" '2 diabetes mellitus',\n",
" '2 dose',\n",
" '2 dose expansion',\n",
" '2 elective',\n",
" '2 elective pci',\n",
" '2 elective pci subjects',\n",
" '2 following',\n",
" '2 following conditions',\n",
" '2 following conditions type',\n",
" '2 inclus',\n",
" '2 mg/dl',\n",
" '2 mg/dl asat',\n",
" '2 mg/dl asat and/or',\n",
" '2 mg/dl ast',\n",
" '2 mg/dl ast alt',\n",
" '2 month',\n",
" '2 months',\n",
" '2 months age',\n",
" '2 months completion',\n",
" '2 months completion vaccination',\n",
" '2 months prior',\n",
" '2 ng/ml',\n",
" '2 prior',\n",
" '2 seconds',\n",
" '2 seconds control',\n",
" '2 times',\n",
" '2 times upper',\n",
" '2 times upper limit',\n",
" '2 uln',\n",
" '2 week',\n",
" '2 weeks',\n",
" '2 weeks apart',\n",
" '2 weeks apart prior',\n",
" '2 weeks prior',\n",
" '2 weeks prior enrol',\n",
" '2 x',\n",
" '2 x uln',\n",
" '2 x upper',\n",
" '2 x upper limit',\n",
" '2 year',\n",
" '2 years',\n",
" '2-channel infusion',\n",
" '2-channel infusion catheter',\n",
" '2.0 cm',\n",
" '2.0 mg/dl',\n",
" '2.0 times',\n",
" '2.0 x',\n",
" '2.0 x uln',\n",
" '2.27 glucose',\n",
" '2.5 kg',\n",
" '2.5 times',\n",
" '2.5 times uln',\n",
" '2.5 times upper',\n",
" '2.5 times upper limit',\n",
" '2.5 x',\n",
" '2.5 x institutional',\n",
" '2.5 x institutional upper',\n",
" '2.5 x uln',\n",
" '2.5 x uln liver',\n",
" '2.5 x upper',\n",
" '2.5 x upper limit',\n",
" '20 -',\n",
" '20 40',\n",
" '20 40 years',\n",
" '20 40 years old',\n",
" '20 85',\n",
" '20 85 years',\n",
" '20 85 years ag',\n",
" '20 ideal',\n",
" '20 mg',\n",
" '20 mm',\n",
" '20 post',\n",
" '20 post saline',\n",
" '20 post saline value',\n",
" '20 years',\n",
" '200 cells/ul',\n",
" '2006 guidelin',\n",
" '21 55',\n",
" '21 55 years',\n",
" '21 55 years inclus',\n",
" '21 70',\n",
" '21 70 years',\n",
" '21 70 years old',\n",
" '21 days',\n",
" '21 days initiation',\n",
" '21 days initiation study',\n",
" '21 days prior',\n",
" '21 years',\n",
" '21 years age',\n",
" '21 years age old',\n",
" '21 years old',\n",
" '21 years old eligible',\n",
" '24 36',\n",
" '24 consecutive',\n",
" '24 hour',\n",
" '24 hours',\n",
" '24 hours prior',\n",
" '24 month',\n",
" '24 months',\n",
" '24-hour urine',\n",
" '25 30',\n",
" '25 mg',\n",
" '27 kg/m2',\n",
" '28 days',\n",
" '28 days prior',\n",
" '28 mm',\n",
" '28 week',\n",
" '28 weeks',\n",
" '2nd screen',\n",
" '3 12',\n",
" '3 12 months',\n",
" '3 4',\n",
" '3 7',\n",
" '3 7 days',\n",
" '3 7 days prior',\n",
" '3 cm',\n",
" '3 days',\n",
" '3 days prior',\n",
" '3 hours',\n",
" '3 hours prior',\n",
" '3 hours prior 8',\n",
" '3 kg',\n",
" '3 kg weight',\n",
" '3 mm',\n",
" '3 month',\n",
" '3 months',\n",
" '3 months consent',\n",
" '3 months consent date',\n",
" '3 months dos',\n",
" '3 months dur',\n",
" '3 months duration',\n",
" '3 months prior',\n",
" '3 months prior screening',\n",
" '3 months prior study',\n",
" '3 prior',\n",
" '3 times',\n",
" '3 weeks',\n",
" '3 weeks prior',\n",
" '3 year',\n",
" '3.0 g/dl',\n",
" '3.0 x',\n",
" '30 day',\n",
" '30 days',\n",
" '30 days prior',\n",
" '30 days prior study',\n",
" '30 days prior vaccination',\n",
" '30 kg/m2',\n",
" '30 kg/m2 calculated',\n",
" '30 kg/m2 calculated according',\n",
" '30 kg/m2 inclus',\n",
" '30 kg/m2 inclusive',\n",
" '30 kg/m2 inclusive body',\n",
" '30 kg/m2 total',\n",
" '30 kg/m2 total body',\n",
" '30 mg/dl',\n",
" '30 mg/dl test',\n",
" '30 mg/dl test scheduled',\n",
" '30 minutes',\n",
" '30 minutes final',\n",
" '30 minutes final concentration',\n",
" '30 mmhg',\n",
" '30 mononuclear',\n",
" '30 mononuclear cells',\n",
" '30 mononuclear cells having',\n",
" '30 years',\n",
" '30.0 kg/m2',\n",
" '30.5 kg/m2',\n",
" '34 years',\n",
" '34 years age',\n",
" '34 years age time',\n",
" '35 days',\n",
" '35 kg',\n",
" '35 kg/m',\n",
" '35 kg/m 2',\n",
" '35 kg/m2',\n",
" '35 years',\n",
" '36 weeks',\n",
" '37 weeks',\n",
" '4 10',\n",
" '4 10 baseline',\n",
" '4 10 baseline week',\n",
" '4 10 hours',\n",
" '4 10 hours final',\n",
" '4 6',\n",
" '4 6 antigen',\n",
" '4 6 antigen match',\n",
" '4 attempts',\n",
" '4 attempts intercourse',\n",
" '4 days',\n",
" '4 hours',\n",
" '4 hours week',\n",
" '4 months',\n",
" '4 prior',\n",
" '4 screen',\n",
" '4 screen random',\n",
" '4 sexual',\n",
" '4 week',\n",
" '4 weeks',\n",
" '4 weeks discontinuation',\n",
" '4 weeks prior',\n",
" '4 weeks study',\n",
" '4.5 34',\n",
" '4.5 34 years',\n",
" '4.5 34 years age',\n",
" '40 80',\n",
" '40 80 years',\n",
" '40 kg',\n",
" '40 mg/dl',\n",
" '40 mg/dl men',\n",
" '40 screen',\n",
" '40 screen randomization',\n",
" '40 screen randomization clinical',\n",
" '40 years',\n",
" '40 years old',\n",
" '40 yr',\n",
" '40.0 kg/m2',\n",
" '45 -',\n",
" '45 days',\n",
" '45 days prior',\n",
" '45 inclusive',\n",
" '45 inclusive years',\n",
" '45 inclusive years age',\n",
" '45 kg',\n",
" '45 kg/m2',\n",
" '45 minut',\n",
" '45 year',\n",
" '45 years',\n",
" '45 years ag',\n",
" '45 years age',\n",
" '450 msec',\n",
" '48 hour',\n",
" '48 hours',\n",
" '48 hours age',\n",
" '48 hours age inclus',\n",
" '48 hours prior',\n",
" '48 hours prior study',\n",
" '48-hours prior',\n",
" '48-hours prior blood',\n",
" '48-hours prior blood draw',\n",
" '49 years',\n",
" '49 years inclusive',\n",
" '5 30',\n",
" '5 30 minutes',\n",
" '5 30 minutes final',\n",
" '5 6',\n",
" '5 6 hla',\n",
" '5 6 hla antigens',\n",
" '5 <',\n",
" '5 cm',\n",
" '5 cups',\n",
" '5 cups caffeine-containing',\n",
" '5 cups caffeine-containing beverages',\n",
" '5 days',\n",
" '5 mg',\n",
" '5 mg corticosteroid',\n",
" '5 ng/ml',\n",
" '5 times',\n",
" '5 times institutional',\n",
" '5 times institutional upper',\n",
" '5 times upper',\n",
" '5 times upper limit',\n",
" '5 unl',\n",
" '5 unl serum',\n",
" '5 unl serum creatinine',\n",
" '5 x',\n",
" '5 x uln',\n",
" '5 year',\n",
" '5 years',\n",
" '5 years negative',\n",
" '5 years negative c',\n",
" '5 years prior',\n",
" '5 years sci',\n",
" '5.0 times',\n",
" '5.0 times upper',\n",
" '5.0 times upper limits',\n",
" '50 copies/ml',\n",
" '50 kg',\n",
" '50 kg 110',\n",
" '50 kg 110 lb',\n",
" '50 mg/dl',\n",
" '50 ml/min',\n",
" '50 ng/dl',\n",
" '50 predict',\n",
" '50 predicted',\n",
" '50 year',\n",
" '50 years',\n",
" '50 years ag',\n",
" '50 years age',\n",
" '50 years age history',\n",
" '50 years age inclus',\n",
" '52 studi',\n",
" '55 year',\n",
" '55 years',\n",
" '55 years ag',\n",
" '55 years age',\n",
" '55 years age inclus',\n",
" '55 years inclus',\n",
" '55 years inclusive',\n",
" '55 years inclusive healthy',\n",
" '55 years old',\n",
" '59 months',\n",
" '6 12',\n",
" '6 8',\n",
" '6 8 weeks',\n",
" '6 8 weeks referral',\n",
" '6 antigen',\n",
" '6 antigen match',\n",
" '6 antigen match hla',\n",
" '6 capsules',\n",
" '6 capsules day',\n",
" '6 hla',\n",
" '6 hla antigens',\n",
" '6 hla antigens b',\n",
" '6 hour',\n",
" '6 month',\n",
" '6 months',\n",
" '6 months ago',\n",
" '6 months completion',\n",
" '6 months confirmed',\n",
" '6 months confirmed baseline',\n",
" '6 months minimum',\n",
" '6 months minimum vasectomi',\n",
" '6 months prior',\n",
" '6 months prior initiation',\n",
" '6 months prior screen',\n",
" '6 months prior screening',\n",
" '6 months prior study',\n",
" '6 months study',\n",
" '6 months study treat',\n",
" '6 week',\n",
" '6 weeks',\n",
" '6 weeks prior',\n",
" '6 weeks prior study',\n",
" '6 weeks treatment',\n",
" '6 weeks treatment discontinu',\n",
" '60 days',\n",
" '60 great',\n",
" '60 ml/min/1.73',\n",
" '60 ml/min/1.73 m2',\n",
" '60 year',\n",
" '60 years',\n",
" '60 years ag',\n",
" '60 years old',\n",
" '65 year',\n",
" '65 years',\n",
" '65 years ag',\n",
" '65 years age',\n",
" '65 years age inclus',\n",
" '65 years old',\n",
" '7 8',\n",
" '7 day',\n",
" '7 days',\n",
" '7 days prior',\n",
" '7 days prior starting',\n",
" '7 g/dl',\n",
" '7.5 -',\n",
" '7.5 x',\n",
" '7.5 x 109/l',\n",
" '7.5 x 109/l bilirubin',\n",
" '70 ml/min',\n",
" '70 year',\n",
" '70 years',\n",
" '70 years age',\n",
" '70 years old',\n",
" '70 years old participants',\n",
" '72 hour',\n",
" '72 hours',\n",
" '72 hours prior',\n",
" '75 year',\n",
" '75 years',\n",
" '75 years ag',\n",
" '75 years old',\n",
" '8 <',\n",
" '8 g/dl',\n",
" '8 hours',\n",
" '8 hours post',\n",
" '8 hours post study',\n",
" '8 week',\n",
" '8 weeks',\n",
" '8 weeks prior',\n",
" '8 weeks prior study',\n",
" '8 weeks referral',\n",
" '8 weeks referral transplant',\n",
" '80 predict',\n",
" '80 predicted',\n",
" '80 year',\n",
" '80 years',\n",
" '80 years ag',\n",
" '80 years age',\n",
" '80 years old',\n",
" '85 years',\n",
" '85 years ag',\n",
" '9 cells/l',\n",
" '9 g/dl',\n",
" '9 month',\n",
" '9.0 g/dl',\n",
" '9/l platelet',\n",
" '9/l platelet count',\n",
" '90 days',\n",
" '90 days following',\n",
" '90 days following dose',\n",
" '90 days prior',\n",
" '< 0.7',\n",
" '< 1',\n",
" '< 1 year',\n",
" '< 1.0',\n",
" '< 1.0 mg/dl',\n",
" '< 1.5',\n",
" '< 1.5 mg/dl',\n",
" '< 1.5 times',\n",
" '< 1.5 times upper',\n",
" '< 1.5 x',\n",
" '< 1.5 x uln',\n",
" '< 1.5 x upper',\n",
" '< 10',\n",
" '< 100',\n",
" '< 12',\n",
" '< 12.0',\n",
" '< 14',\n",
" '< 15',\n",
" '< 150/90',\n",
" '< 160/100',\n",
" '< 18',\n",
" '< 18 year',\n",
" '< 18 years',\n",
" '< 18 years age',\n",
" '< 1mg/dl',\n",
" '< 2',\n",
" '< 2 mg/dl',\n",
" '< 2 mg/dl asat',\n",
" '< 2 times',\n",
" '< 2 times upper',\n",
" '< 2 x',\n",
" '< 2 x uln',\n",
" '< 2.0',\n",
" '< 2.0 mg/dl',\n",
" '< 2.0 x',\n",
" '< 2.0 x uln',\n",
" '< 2.5',\n",
" '< 2.5 x',\n",
" '< 2.5 x institutional',\n",
" '< 2.5 x uln',\n",
" '< 20',\n",
" '< 200',\n",
" '< 28',\n",
" '< 3',\n",
" '< 3 cm',\n",
" '< 30',\n",
" '< 30 kg/m2',\n",
" '< 30.0',\n",
" '< 30.0 kg/m2',\n",
" '< 35',\n",
" '< 4',\n",
" '< 40',\n",
" '< 40 mg/dl',\n",
" '< 40 mg/dl men',\n",
" '< 400',\n",
" '< 45',\n",
" '< 45 kg/m2',\n",
" '< 450',\n",
" '< 450 msec',\n",
" '< 5',\n",
" '< 5 unl',\n",
" '< 5 unl serum',\n",
" '< 50',\n",
" '< 50 mg/dl',\n",
" '< 50 ng/dl',\n",
" '< 6',\n",
" '< 6 months',\n",
" '< 6 months ago',\n",
" '< 60',\n",
" '< 60 years',\n",
" '< 65',\n",
" '< 7',\n",
" '< 7 days',\n",
" '< 70',\n",
" '< 75',\n",
" '< 80',\n",
" '< 80 predicted',\n",
" '< equal',\n",
" '< equal 1.5',\n",
" '< equal 1.5 x',\n",
" '< equal 15',\n",
" '< equal 2.5',\n",
" '< equal 2.5 x',\n",
" '< equal 30',\n",
" '< equal 5',\n",
" '< equal 5.0',\n",
" '< equal 5.0 times',\n",
" '< equal 9.0',\n",
" '< equal 9.0 g/dl',\n",
" '< equal 90',\n",
" '<25 kg/m2',\n",
" '<grade 2',\n",
" '<grade 2 determined',\n",
" '<grade 2 determined ctcae',\n",
" '> 0.8',\n",
" '> 1',\n",
" '> 1 cm',\n",
" '> 1 cm spiral',\n",
" '> 1 year',\n",
" '> 1.0',\n",
" '> 1.5',\n",
" '> 1.5 x',\n",
" '> 1.5 x 10',\n",
" '> 1.5 x 109/l',\n",
" '> 1.7',\n",
" '> 10',\n",
" '> 10 g/dl',\n",
" '> 10 mm',\n",
" '> 10 mm spiral',\n",
" '> 100',\n",
" '> 100 g/l',\n",
" '> 100 mg/dl',\n",
" '> 100 x',\n",
" '> 100 x 10',\n",
" '> 100 x 109/l',\n",
" '> 1000',\n",
" '> 1000/mm',\n",
" '> 100000',\n",
" '> 100000/',\n",
" '> 100000/ l',\n",
" '> 100000/mm',\n",
" '> 100000/mm 3',\n",
" '> 12',\n",
" '> 12 week',\n",
" '> 13',\n",
" '> 14',\n",
" '> 140',\n",
" '> 15',\n",
" '> 150',\n",
" '> 150 mg/dl',\n",
" '> 1500',\n",
" '> 1500/',\n",
" '> 1500/ l',\n",
" '> 1500/mm',\n",
" '> 1500/mm 3',\n",
" '> 16',\n",
" '> 16 weeks',\n",
" '> 17',\n",
" '> 18',\n",
" '> 18 <',\n",
" '> 18 < 60',\n",
" '> 18 year',\n",
" '> 18 years',\n",
" '> 18 years ag',\n",
" '> 18 years age',\n",
" '> 18 years legal',\n",
" '> 18 years old',\n",
" '> 18 years time',\n",
" '> 2',\n",
" '> 2 cm',\n",
" '> 2 weeks',\n",
" '> 2 weeks apart',\n",
" '> 2.5',\n",
" '> 2.5 kg',\n",
" '> 20',\n",
" '> 20 mm',\n",
" '> 200',\n",
" '> 21',\n",
" '> 25',\n",
" '> 27',\n",
" '> 27 kg/m2',\n",
" '> 3',\n",
" '> 3 mm',\n",
" '> 3 month',\n",
" '> 3 months',\n",
" '> 3 months prior',\n",
" '> 3.0',\n",
" '> 3.0 g/dl',\n",
" '> 30',\n",
" '> 30 kg/m2',\n",
" '> 30 mmhg',\n",
" '> 30 mononuclear',\n",
" '> 30 mononuclear cells',\n",
" '> 35',\n",
" '> 36',\n",
" '> 38',\n",
" '> 4',\n",
" '> 4 10',\n",
" '> 4 10 baseline',\n",
" '> 4 6',\n",
" '> 4 6 antigen',\n",
" '> 4 weeks',\n",
" '> 40',\n",
" '> 40 years',\n",
" '> 45',\n",
" '> 48',\n",
" '> 5',\n",
" '> 5 ng/ml',\n",
" '> 5 year',\n",
" '> 5 years',\n",
" '> 5 years negative',\n",
" '> 50',\n",
" '> 50 ml/min',\n",
" '> 50 predict',\n",
" '> 50 predicted',\n",
" '> 50000',\n",
" '> 6',\n",
" '> 6 month',\n",
" '> 6 months',\n",
" '> 60',\n",
" '> 60 ml/min/1.73',\n",
" '> 60 year',\n",
" '> 65',\n",
" '> 65 year',\n",
" '> 7',\n",
" '> 7.5',\n",
" '> 7.5 x',\n",
" '> 7.5 x 109/l',\n",
" '> 70',\n",
" '> 70 ml/min',\n",
" '> 8',\n",
" '> 8 <',\n",
" '> 8 g/dl',\n",
" '> 80',\n",
" '> 9',\n",
" '> 9 g/dl',\n",
" '> 9.0',\n",
" '> 9.0 g/dl',\n",
" '> 90',\n",
" '> 90 days',\n",
" '> 95',\n",
" '> equal',\n",
" '> equal 12',\n",
" '> equal 13',\n",
" '> equal 18',\n",
" '> equal 18 year',\n",
" '> equal 19.0',\n",
" '> equal 19.0 <',\n",
" '> equal 2',\n",
" '> equal 4',\n",
" '> equal 4 screen',\n",
" '> equal 40',\n",
" '> equal 40 screen',\n",
" '> equal 50',\n",
" '> equal 50 ml/min',\n",
" '> equal 6',\n",
" '> equal 60',\n",
" '> lower',\n",
" '> lower limit',\n",
" '> uln',\n",
" '> uln >',\n",
" '> uln > 2',\n",
" '>1 year',\n",
" '>100000 cells/',\n",
" '>100000 cells/ l',\n",
" '>100000 cells/ l total',\n",
" '>12 48',\n",
" '>12 48 hours',\n",
" '>12 48 hours age',\n",
" '>12 week',\n",
" '>1500 cells/',\n",
" '>1500 cells/ l',\n",
" '>1500 cells/ l platelets',\n",
" '>18 year',\n",
" '>18 years',\n",
" '>18 years age',\n",
" '>18 years age histologically',\n",
" '>18 years old',\n",
" '>2cm physical',\n",
" '>2cm physical examin',\n",
" '>30 minutes',\n",
" '>30 years',\n",
" '>4 weeks',\n",
" '>4 weeks prior',\n",
" '>4 weeks prior enrol',\n",
" '>45 minutes',\n",
" '>50 kg',\n",
" '>50 kg 110',\n",
" '>50 kg 110 lb',\n",
" '>50 ml/min',\n",
" '>50 ml/min negative',\n",
" '>50 ml/min negative hcg',\n",
" '>6 months',\n",
" '>6 months prior',\n",
" '>60 ml/min',\n",
" ...]"
]
},
"execution_count": 321,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X.get_feature_names()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fit vectorizer on 20000 docs "
]
},
{
"cell_type": "code",
"execution_count": 245,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
" dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
" lowercase=True, max_df=0.6, max_features=None, min_df=1,\n",
" ngram_range=(2, 4), preprocessor=None, stop_words='english',\n",
" strip_accents=None, token_pattern='[a-zA-Z0-9_\\\\-ï/><\\\\.]+',\n",
" tokenizer=None, vocabulary=None)"
]
},
"execution_count": 245,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trials_loc, eligibility_loc = connect_to_mongo('clinical_trials', 'eligibilities')\n",
"doc_cursor = eligibility_loc.find().limit(20000)\n",
"\n",
"count_vectorizer.fit(list(itertools.chain(*(doc['cleaned_inclusion'] for doc in doc_cursor))))"
]
},
{
"cell_type": "code",
"execution_count": 249,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1050130"
]
},
"execution_count": 249,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(count_vectorizer.get_feature_names())"
]
},
{
"cell_type": "code",
"execution_count": 164,
"metadata": {},
"outputs": [],
"source": [
"doc_cursor.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Transform 1000 docs"
]
},
{
"cell_type": "code",
"execution_count": 317,
"metadata": {},
"outputs": [],
"source": [
"trials_loc, eligibility_loc = connect_to_mongo('clinical_trials', 'eligibilities')\n",
"doc_cursor = eligibility_loc.find().limit(1000)\n",
"\n",
"X_trans = count_vectorizer.transform(' '.join(doc['cleaned_inclusion']) for doc in doc_cursor)"
]
},
{
"cell_type": "code",
"execution_count": 318,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<1000x8907 sparse matrix of type '<class 'numpy.int64'>'\n",
"\twith 29687 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 318,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_trans"
]
},
{
"cell_type": "code",
"execution_count": 256,
"metadata": {},
"outputs": [],
"source": [
"X_array = X_trans.toarray()"
]
},
{
"cell_type": "code",
"execution_count": 260,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0, 0, 0, ..., 0, 0, 0],\n",
" [0, 0, 0, ..., 0, 0, 0],\n",
" [0, 0, 0, ..., 0, 0, 0],\n",
" ..., \n",
" [0, 0, 0, ..., 0, 0, 0],\n",
" [0, 0, 0, ..., 0, 0, 0],\n",
" [0, 0, 0, ..., 0, 0, 0]])"
]
},
"execution_count": 260,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_array"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fit model "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fit model to 1k transformed dataset "
]
},
{
"cell_type": "code",
"execution_count": 319,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0.98714798, 0.00256454, 0.00257188, 0.00257726, 0.00256543,\n",
" 0.00257291])"
]
},
"execution_count": 319,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"n_topics = 6\n",
"n_iter = 10\n",
"# n_components replaces the n_topics keyword, which is deprecated since scikit-learn 0.19\n",
"lda = LatentDirichletAllocation(n_components=n_topics,\n",
"                                max_iter=n_iter,\n",
"                                random_state=42,\n",
"                                learning_method='online')\n",
"data = lda.fit_transform(X_trans)\n",
"data[0]"
]
},
{
"cell_type": "code",
"execution_count": 320,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Topic 0\n",
"performance status 18 years years old pregnancy test informed consent written informed informed cons 18 years old ecog performance > 18 ecog performance status male female childbearing potential count > years age age > x uln written informed consent upper limit measurable disease\n",
"Topic 1\n",
"18 years informed consent years ag < equal informed cons years age > 18 score > total score male female > equal 18 years ag > 18 years consent form informed consent form sign informed patients > legal guardian female patients sign informed cons\n",
"Topic 2\n",
"informed cons written informed informed consent years old written informed cons age 18 type 2 signed informed written informed consent 2 diabetes type 2 diabetes years ag 3 month signed informed cons consent form men women signed informed consent patient s 3 months 6 month\n",
"Topic 3\n",
"body mass index mass index body mass informed consent index bmi mass index bmi body mass index bmi medical history informed cons written informed bmi > birth control prior study history physical 30 kg/m2 body weight healthy male able comply medical history physical ages 18\n",
"Topic 4\n",
"body weight informed consent written informed years ag provide written informed provide written written informed consent forms contraception provide written informed consent intensive care subject 18 subject willing sexually active written informed consent prior subject male informed consent prior consent prior female subjects years age 18 years\n",
"Topic 5\n",
"years age age old years age old 6 months months prior times upper limit normal screening visit upper limit normal limit normal upper limit informed consent visit 1 18 years age old prior screening 12 months times upper times upper limit 18 years vitamin d 18 years age\n"
]
}
],
"source": [
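"def display_topics(model, feature_names, no_top_words):\n",
"    # Print the highest-weighted n-grams for each fitted topic\n",
"    for ix, topic in enumerate(model.components_):\n",
"        print(\"Topic \", ix)\n",
"        # argsort is ascending, so slice from the end to get the top terms\n",
"        print(\" \".join([feature_names[i]\n",
"                        for i in topic.argsort()[:-no_top_words - 1:-1]]))\n",
"\n",
"# Feature names must come from the vectorizer that produced X_trans\n",
"display_topics(lda, count_vectorizer.get_feature_names(), 20)"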
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Next steps "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Improving the pipeline "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Refactor `connect_to_mongo()` so we don't have to reconnect when switching databases (this might also be pulled into other functions later on)\n",
"* Scrutinize the stop-word list: we currently use the default, but some words, such as 'not', may be worth keeping rather than removing"
]
},
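{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the `connect_to_mongo()` refactor from the first bullet above, assuming one shared client per URI is acceptable for this notebook. `get_client()` and `get_collection()` are hypothetical names that do not exist elsewhere in the pipeline, and the default URI is an assumption matching pymongo's localhost default.\n",
"\n",
"```python\n",
"from functools import lru_cache\n",
"\n",
"\n",
"@lru_cache(maxsize=None)\n",
"def get_client(uri='mongodb://localhost:27017'):\n",
"    # Hypothetical helper: build one MongoClient per URI and reuse it afterwards.\n",
"    # pymongo is imported lazily so get_collection can be tested without it.\n",
"    from pymongo import MongoClient\n",
"    return MongoClient(uri)\n",
"\n",
"\n",
"def get_collection(database, collection, client=None):\n",
"    # Hypothetical helper: return a collection handle, reusing the cached\n",
"    # client unless the caller passes one in explicitly.\n",
"    if client is None:\n",
"        client = get_client()\n",
"    return client[database][collection]\n",
"```\n",
"\n",
"With something like this, `get_collection('clinical_trials', 'eligibilities')` could stand in for the repeated `connect_to_mongo(...)` calls, and switching databases would only change the first argument."
]
},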
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Error checking and production aspects to add "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Add error messages\n",
"* Add docstrings\n",
"* Wrap repeated steps in functions\n",
"* Add comments explaining hard-coded choices and why each approach was used\n",
"* Add test cases for the cleaning step"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}