{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "nlp-pubmed-rct-classification.ipynb",
"provenance": [],
"collapsed_sections": [],
"mount_file_id": "1TGU5mtAosYsHxXnaw9_uAP91Pg9czPX1",
"authorship_tag": "ABX9TyOt8iFW5V7RdJakaMJ73Mwq",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# Medical Abstract Classification using Natural Language Processing"
],
"metadata": {
"id": "Uq_ssycxR8Ar"
}
},
{
"cell_type": "markdown",
"source": [
"### The objective is to build a deep learning model that makes medical research paper abstracts easier to read.\n",
" - The dataset used in this project is `PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts`: https://arxiv.org/abs/1710.06071\n",
" - The original deep learning research paper was built on PubMed 200k RCT.\n",
" - The dataset has about `200,000 labelled randomized controlled trial (RCT) abstracts`.\n",
" - The goal of the project is to build NLP models on this dataset that classify the sentences of an abstract in sequential order."
],
"metadata": {
"id": "bOGykK3Ad-wp"
}
},
{
"cell_type": "markdown",
"source": [
" - RCT research papers with unstructured abstracts slow down researchers navigating the literature. \n",
" - Unstructured abstracts can be hard to read and understand, which costs time, especially under deadlines.\n",
" - This NLP model classifies the sentences of an abstract into their respective roles:\n",
" - `Background`, `Objective`, `Methods`, `Results` and `Conclusions`.\n",
" "
],
"metadata": {
"id": "RgYVNMflgo86"
}
},
{
"cell_type": "markdown",
"source": [
"#### The PubMed 200k RCT Dataset - https://github.com/Franck-Dernoncourt/pubmed-rct"
],
"metadata": {
"id": "1Lmr_h4FipFw"
}
},
{
"cell_type": "markdown",
"source": [
"#### Similar projects using the dataset:\n",
" - Claim Extraction for Scientific Publications 2018: https://github.com/titipata/detecting-scientific-claim"
],
"metadata": {
"id": "N2d_bTBIi6of"
}
},
{
"cell_type": "markdown",
"source": [
"**Abstract** \n",
"\n",
"PubMed 200k RCT is a new dataset based on PubMed for sequential sentence classification. The dataset consists of approximately 200,000 abstracts of randomized controlled trials, totaling 2.3 million sentences. Each sentence of each abstract is labeled with its role in the abstract using one of the following classes: background, objective, method, result, or conclusion. The purpose of releasing this dataset is twofold. First, the majority of datasets for sequential short-text classification (i.e., classification of short texts that appear in sequences) are small: we hope that releasing a new large dataset will help develop more accurate algorithms for this task. Second, from an application perspective, researchers need better tools to efficiently skim through the literature. Automatically classifying each sentence in an abstract would help researchers read abstracts more efficiently, especially in fields where abstracts may be long, such as the medical field."
],
"metadata": {
"id": "WXb2Qlu8k2Fk"
}
},
{
"cell_type": "markdown",
"source": [
"**Data Dictionary**\n",
"\n",
"- `PubMed 20k` is a subset of `PubMed 200k`, i.e., any abstract present in `PubMed 20k` is also present in `PubMed 200k`.\n",
"- `PubMed_200k_RCT` is the same as `PubMed_200k_RCT_numbers_replaced_with_at_sign`, except that in the latter all numbers have been replaced by `@` (the same holds for `PubMed_20k_RCT` vs. `PubMed_20k_RCT_numbers_replaced_with_at_sign`).\n",
"- Since the GitHub file size limit is 100 MiB, the dataset authors compressed `PubMed_200k_RCT\\train.7z` and `PubMed_200k_RCT_numbers_replaced_with_at_sign\\train.zip`. \n",
"- To uncompress `train.7z`, you may use `7-Zip` on Windows, `Keka` on Mac OS X, or `p7zip` on Linux."
],
"metadata": {
"id": "MvAjGdF0lACB"
}
},
{
"cell_type": "markdown",
"source": [
"# Importing data and EDA"
],
"metadata": {
"id": "T1JweqbejVX4"
}
},
{
"cell_type": "code",
"source": [
"!git clone https://github.com/Franck-Dernoncourt/pubmed-rct.git\n",
"!ls pubmed-rct"
],
"metadata": {
"id": "zM12MSlAnKCo",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "bb4adf23-506b-4a92-8e86-010efcf4f61f"
},
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Cloning into 'pubmed-rct'...\n",
"remote: Enumerating objects: 33, done.\u001b[K\n",
"remote: Counting objects: 100% (3/3), done.\u001b[K\n",
"remote: Compressing objects: 100% (3/3), done.\u001b[K\n",
"remote: Total 33 (delta 0), reused 0 (delta 0), pack-reused 30\u001b[K\n",
"Unpacking objects: 100% (33/33), done.\n",
"Checking out files: 100% (13/13), done.\n",
"PubMed_200k_RCT\n",
"PubMed_200k_RCT_numbers_replaced_with_at_sign\n",
"PubMed_20k_RCT\n",
"PubMed_20k_RCT_numbers_replaced_with_at_sign\n",
"README.md\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"### Initial data exploration and modelling with the PubMed_20k dataset"
],
"metadata": {
"id": "FFpPwS_SqS48"
}
},
{
"cell_type": "code",
"source": [
"!ls pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "I3mLk-7tqk3K",
"outputId": "9ceb3826-9977-4664-bf2d-de3927555bdf"
},
"execution_count": 2,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"dev.txt test.txt train.txt\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"# imports\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import os\n",
"import random\n",
"import tensorflow as tf"
],
"metadata": {
"id": "TZiWPqoaqzBL"
},
"execution_count": 3,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Download helper functions pre-written for this workflow\n",
"!wget https://raw.githubusercontent.com/hecshzye/nlp-medical-abstract-pubmed-rct/main/helper_functions.py"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "wRZJgoCeFBsX",
"outputId": "0619c26c-096f-410b-f9c6-aae7ec239b53"
},
"execution_count": 4,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"--2022-01-17 02:17:42-- https://raw.githubusercontent.com/hecshzye/nlp-medical-abstract-pubmed-rct/main/helper_functions.py\n",
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...\n",
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 6442 (6.3K) [text/plain]\n",
"Saving to: ‘helper_functions.py’\n",
"\n",
"helper_functions.py 100%[===================>] 6.29K --.-KB/s in 0s \n",
"\n",
"2022-01-17 02:17:42 (59.4 MB/s) - ‘helper_functions.py’ saved [6442/6442]\n",
"\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"from helper_functions import create_tensorboard_callback, calculate_results, plot_loss_curves"
],
"metadata": {
"id": "i6KrTZnaFKtu"
},
"execution_count": 5,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Read a text file and return its lines as a list of strings\n",
"def get_doc(filename):\n",
"    with open(filename, \"r\") as f:\n",
"        return f.readlines()"
],
"metadata": {
"id": "JMifgqkxsC5w"
},
"execution_count": 6,
"outputs": []
},
{
"cell_type": "code",
"source": [
"data_dir = \"pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/\"\n",
"filenames = [data_dir + filename for filename in os.listdir(data_dir)]\n",
"filenames"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "RGZi11CDrVxa",
"outputId": "6a971c96-a1c9-43af-9936-98f0dfc75436"
},
"execution_count": 7,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/dev.txt',\n",
" 'pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/test.txt',\n",
" 'pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/train.txt']"
]
},
"metadata": {},
"execution_count": 7
}
]
},
{
"cell_type": "code",
"source": [
"# Read the training file and preview the first 30 raw lines\n",
"train_lines = get_doc(data_dir+\"train.txt\")\n",
"train_lines[:30]"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "3ZBDfw7gsX-8",
"outputId": "ffc2d907-985f-4a5f-9659-f03d3126390d"
},
"execution_count": 8,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['###24293578\\n',\n",
" 'OBJECTIVE\\tTo investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain , mobility , and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis ( OA ) .\\n',\n",
" 'METHODS\\tA total of @ patients with primary knee OA were randomized @:@ ; @ received @ mg/day of prednisolone and @ received placebo for @ weeks .\\n',\n",
" 'METHODS\\tOutcome measures included pain reduction and improvement in function scores and systemic inflammation markers .\\n',\n",
" 'METHODS\\tPain was assessed using the visual analog pain scale ( @-@ mm ) .\\n',\n",
" 'METHODS\\tSecondary outcome measures included the Western Ontario and McMaster Universities Osteoarthritis Index scores , patient global assessment ( PGA ) of the severity of knee OA , and @-min walk distance ( @MWD ) .\\n',\n",
" 'METHODS\\tSerum levels of interleukin @ ( IL-@ ) , IL-@ , tumor necrosis factor ( TNF ) - , and high-sensitivity C-reactive protein ( hsCRP ) were measured .\\n',\n",
" 'RESULTS\\tThere was a clinically relevant reduction in the intervention group compared to the placebo group for knee pain , physical function , PGA , and @MWD at @ weeks .\\n',\n",
" 'RESULTS\\tThe mean difference between treatment arms ( @ % CI ) was @ ( @-@ @ ) , p < @ ; @ ( @-@ @ ) , p < @ ; @ ( @-@ @ ) , p < @ ; and @ ( @-@ @ ) , p < @ , respectively .\\n',\n",
" 'RESULTS\\tFurther , there was a clinically relevant reduction in the serum levels of IL-@ , IL-@ , TNF - , and hsCRP at @ weeks in the intervention group when compared to the placebo group .\\n',\n",
" 'RESULTS\\tThese differences remained significant at @ weeks .\\n',\n",
" 'RESULTS\\tThe Outcome Measures in Rheumatology Clinical Trials-Osteoarthritis Research Society International responder rate was @ % in the intervention group and @ % in the placebo group ( p < @ ) .\\n',\n",
" 'CONCLUSIONS\\tLow-dose oral prednisolone had both a short-term and a longer sustained effect resulting in less knee pain , better physical function , and attenuation of systemic inflammation in older patients with knee OA ( ClinicalTrials.gov identifier NCT@ ) .\\n',\n",
" '\\n',\n",
" '###24854809\\n',\n",
" 'BACKGROUND\\tEmotional eating is associated with overeating and the development of obesity .\\n',\n",
" 'BACKGROUND\\tYet , empirical evidence for individual ( trait ) differences in emotional eating and cognitive mechanisms that contribute to eating during sad mood remain equivocal .\\n',\n",
" 'OBJECTIVE\\tThe aim of this study was to test if attention bias for food moderates the effect of self-reported emotional eating during sad mood ( vs neutral mood ) on actual food intake .\\n',\n",
" 'OBJECTIVE\\tIt was expected that emotional eating is predictive of elevated attention for food and higher food intake after an experimentally induced sad mood and that attentional maintenance on food predicts food intake during a sad versus a neutral mood .\\n',\n",
" 'METHODS\\tParticipants ( N = @ ) were randomly assigned to one of the two experimental mood induction conditions ( sad/neutral ) .\\n',\n",
" 'METHODS\\tAttentional biases for high caloric foods were measured by eye tracking during a visual probe task with pictorial food and neutral stimuli .\\n',\n",
" 'METHODS\\tSelf-reported emotional eating was assessed with the Dutch Eating Behavior Questionnaire ( DEBQ ) and ad libitum food intake was tested by a disguised food offer .\\n',\n",
" 'RESULTS\\tHierarchical multivariate regression modeling showed that self-reported emotional eating did not account for changes in attention allocation for food or food intake in either condition .\\n',\n",
" 'RESULTS\\tYet , attention maintenance on food cues was significantly related to increased intake specifically in the neutral condition , but not in the sad mood condition .\\n',\n",
" 'CONCLUSIONS\\tThe current findings show that self-reported emotional eating ( based on the DEBQ ) might not validly predict who overeats when sad , at least not in a laboratory setting with healthy women .\\n',\n",
" 'CONCLUSIONS\\tResults further suggest that attention maintenance on food relates to eating motivation when in a neutral affective state , and might therefore be a cognitive mechanism contributing to increased food intake in general , but maybe not during sad mood .\\n',\n",
" '\\n',\n",
" '###25165090\\n',\n",
" 'BACKGROUND\\tAlthough working smoke alarms halve deaths in residential fires , many households do not keep alarms operational .\\n',\n",
" 'BACKGROUND\\tWe tested whether theory-based education increases alarm operability .\\n']"
]
},
"metadata": {},
"execution_count": 8
}
]
},
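{
"cell_type": "markdown",
"source": [
"#### **Data dictionary**\n",
"\n",
"`\\t` = tab separator between the label and the sentence text\n",
"\n",
"`\\n` = new line\n",
"\n",
"`###` = prefix of the abstract ID line\n",
"\n",
"`\"line_number\"` = position of the sentence within its abstract (zero-indexed)\n",
"\n",
"`\"text\"` = sentence text (lower-cased)\n",
"\n",
"`\"total_lines\"` = index of the last sentence in the abstract (number of sentences minus one)\n",
"\n",
"`\"target\"` = label of the sentence (its role in the abstract, e.g. `METHODS`)\n",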
"\n"
],
"metadata": {
"id": "6ZOKbxQAsyNT"
}
},
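{
"cell_type": "markdown",
"source": [
"The fields above can be illustrated on a single raw line. This is a minimal sketch: `parse_labelled_line` is a hypothetical helper introduced only for this example (it is not part of the notebook's workflow), and the sample line is copied from the training preview above.\n",
"\n",
"```python\n",
"def parse_labelled_line(line):\n",
"    # Hypothetical helper: split 'LABEL<TAB>sentence' once on the tab separator\n",
"    target, text = line.rstrip('\\n').split('\\t', 1)\n",
"    return {'target': target, 'text': text.lower()}\n",
"\n",
"sample = 'METHODS\\tPain was assessed using the visual analog pain scale ( @-@ mm ) .\\n'\n",
"print(parse_labelled_line(sample))\n",
"```\n",
"\n",
"Splitting with a maximum of one split keeps any later tab characters inside the sentence text."
],
"metadata": {}
},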
{
"cell_type": "code",
"source": [
"# Function for preprocessing the data\n",
"\n",
"def preprocessing_text_with_line_number(filename):\n",
"    \"\"\"Return one dict per abstract sentence: target, text, line_number, total_lines.\"\"\"\n",
"    input_lines = get_doc(filename)\n",
"    abstract_lines = \"\"  # lines of the abstract currently being read\n",
"    abstract_samples = []\n",
"\n",
"    for line in input_lines:\n",
"        if line.startswith(\"###\"):  # abstract ID line starts a new abstract\n",
"            abstract_id = line\n",
"            abstract_lines = \"\"\n",
"        elif line.isspace():  # blank line marks the end of an abstract\n",
"            abstract_line_split = abstract_lines.splitlines()\n",
"\n",
"            for abstract_line_number, abstract_line in enumerate(abstract_line_split):\n",
"                line_data = {}\n",
"                target_text_split = abstract_line.split(\"\\t\")\n",
"                line_data[\"target\"] = target_text_split[0]\n",
"                line_data[\"text\"] = target_text_split[1].lower()\n",
"                line_data[\"line_number\"] = abstract_line_number\n",
"                line_data[\"total_lines\"] = len(abstract_line_split) - 1  # zero-indexed, like line_number\n",
"                abstract_samples.append(line_data)\n",
"\n",
"        else:  # labelled sentence: append it to the current abstract\n",
"            abstract_lines += line\n",
"    return abstract_samples"
],
"metadata": {
"id": "AEl-XZPQttN7"
},
"execution_count": 9,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Extracting data using the function\n",
"train_samples = preprocessing_text_with_line_number(data_dir + \"train.txt\")\n",
"val_samples = preprocessing_text_with_line_number(data_dir + \"dev.txt\")\n",
"test_samples = preprocessing_text_with_line_number(data_dir + \"test.txt\")\n",
"len(train_samples), len(test_samples), len(val_samples)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "SCD9DYIU17GH",
"outputId": "252fdd28-96fa-4cee-d94f-9065a1aef0b3"
},
"execution_count": 10,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(180040, 30135, 30212)"
]
},
"metadata": {},
"execution_count": 10
}
]
},
{
"cell_type": "code",
"source": [
"train_samples[:10]"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Hacxnrvq2r7n",
"outputId": "f756d753-5bd9-401f-db37-f4359c104962"
},
"execution_count": 11,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[{'line_number': 0,\n",
" 'target': 'OBJECTIVE',\n",
" 'text': 'to investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain , mobility , and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis ( oa ) .',\n",
" 'total_lines': 11},\n",
" {'line_number': 1,\n",
" 'target': 'METHODS',\n",
" 'text': 'a total of @ patients with primary knee oa were randomized @:@ ; @ received @ mg/day of prednisolone and @ received placebo for @ weeks .',\n",
" 'total_lines': 11},\n",
" {'line_number': 2,\n",
" 'target': 'METHODS',\n",
" 'text': 'outcome measures included pain reduction and improvement in function scores and systemic inflammation markers .',\n",
" 'total_lines': 11},\n",
" {'line_number': 3,\n",
" 'target': 'METHODS',\n",
" 'text': 'pain was assessed using the visual analog pain scale ( @-@ mm ) .',\n",
" 'total_lines': 11},\n",
" {'line_number': 4,\n",
" 'target': 'METHODS',\n",
" 'text': 'secondary outcome measures included the western ontario and mcmaster universities osteoarthritis index scores , patient global assessment ( pga ) of the severity of knee oa , and @-min walk distance ( @mwd ) .',\n",
" 'total_lines': 11},\n",
" {'line_number': 5,\n",
" 'target': 'METHODS',\n",
" 'text': 'serum levels of interleukin @ ( il-@ ) , il-@ , tumor necrosis factor ( tnf ) - , and high-sensitivity c-reactive protein ( hscrp ) were measured .',\n",
" 'total_lines': 11},\n",
" {'line_number': 6,\n",
" 'target': 'RESULTS',\n",
" 'text': 'there was a clinically relevant reduction in the intervention group compared to the placebo group for knee pain , physical function , pga , and @mwd at @ weeks .',\n",
" 'total_lines': 11},\n",
" {'line_number': 7,\n",
" 'target': 'RESULTS',\n",
" 'text': 'the mean difference between treatment arms ( @ % ci ) was @ ( @-@ @ ) , p < @ ; @ ( @-@ @ ) , p < @ ; @ ( @-@ @ ) , p < @ ; and @ ( @-@ @ ) , p < @ , respectively .',\n",
" 'total_lines': 11},\n",
" {'line_number': 8,\n",
" 'target': 'RESULTS',\n",
" 'text': 'further , there was a clinically relevant reduction in the serum levels of il-@ , il-@ , tnf - , and hscrp at @ weeks in the intervention group when compared to the placebo group .',\n",
" 'total_lines': 11},\n",
" {'line_number': 9,\n",
" 'target': 'RESULTS',\n",
" 'text': 'these differences remained significant at @ weeks .',\n",
" 'total_lines': 11}]"
]
},
"metadata": {},
"execution_count": 11
}
]
},
{
"cell_type": "code",
"source": [
"# Creating a DataFrame\n",
"train_df = pd.DataFrame(train_samples)\n",
"val_df = pd.DataFrame(val_samples)\n",
"test_df = pd.DataFrame(test_samples)\n",
"train_df.head()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
},
"id": "gEQRm8jZ40lD",
"outputId": "dd84beae-2b55-49ad-a5a8-9692705471a6"
},
"execution_count": 12,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"(HTML table output omitted)"
]
},
"metadata": {
"needs_background": "light"
}
}
]
},
{
"cell_type": "code",
"source": [
"# Saving model_6\n",
"model_6.save(\"nlp_pubmed_model_6\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "BXwJDHsbCvzR",
"outputId": "4bddef98-00bf-4b65-fe86-6f6a88e384da"
},
"execution_count": 87,
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"WARNING:absl:Found untraced functions such as lstm_cell_19_layer_call_fn, lstm_cell_19_layer_call_and_return_conditional_losses, lstm_cell_20_layer_call_fn, lstm_cell_20_layer_call_and_return_conditional_losses, lstm_cell_19_layer_call_fn while saving (showing 5 of 10). These functions will not be directly callable after loading.\n"
]
},
{
"output_type": "stream",
"name": "stdout",
"text": [
"INFO:tensorflow:Assets written to: nlp_pubmed_model_6/assets\n"
]
},
{
"output_type": "stream",
"name": "stderr",
"text": [
"INFO:tensorflow:Assets written to: nlp_pubmed_model_6/assets\n",
"WARNING:absl: has the same name 'LSTMCell' as a built-in Keras object. Consider renaming to avoid naming conflicts when loading with `tf.keras.models.load_model`. If renaming is not possible, pass the object in the `custom_objects` parameter of the load function.\n",
"WARNING:absl: has the same name 'LSTMCell' as a built-in Keras object. Consider renaming to avoid naming conflicts when loading with `tf.keras.models.load_model`. If renaming is not possible, pass the object in the `custom_objects` parameter of the load function.\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"!cp -r nlp_pubmed_model_6 /content/drive/MyDrive/NLP-projects/nlp_pubmed_model_6"
],
"metadata": {
"id": "vZDM_8TQC_rL"
},
"execution_count": 88,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# Evaluating on test data"
],
"metadata": {
"id": "5NrfgQJeDwAl"
}
},
{
"cell_type": "code",
"source": [
"# Preprocessing test dataset, evaluation & predictions\n",
"test_pos_character_token_data = tf.data.Dataset.from_tensor_slices((test_line_numbers_one_hot,\n",
" test_total_lines_one_hot,\n",
" test_sentences,\n",
" test_char))\n",
"test_pos_character_token_labels = tf.data.Dataset.from_tensor_slices(test_labels_one_hot)\n",
"test_pos_character_token_dataset = tf.data.Dataset.zip((test_pos_character_token_data, test_pos_character_token_labels))\n",
"test_pos_character_token_dataset = test_pos_character_token_dataset.batch(32).prefetch(tf.data.AUTOTUNE)\n",
"\n",
"# Predictions\n",
"test_pred_probs = model_6.predict(test_pos_character_token_dataset,\n",
" verbose=1)\n",
"test_preds = tf.argmax(test_pred_probs, axis=1)\n",
"\n",
"# Evaluation results\n",
"model_6_test_results = calculate_results(y_true=test_labels_encoded,\n",
" y_pred=test_preds)\n",
"model_6_test_results\n"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "N5c_1WXSFS7g",
"outputId": "2c2b334e-1471-4b90-8913-bae75d8c002f"
},
"execution_count": 89,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"942/942 [==============================] - 45s 48ms/step\n"
]
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"{'accuracy': 82.87041645926664,\n",
" 'f1': 0.8274737505797032,\n",
" 'precision': 0.8272332364828473,\n",
" 'recall': 0.8287041645926664}"
]
},
"metadata": {},
"execution_count": 89
}
]
},
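{
"cell_type": "markdown",
"source": [
"`calculate_results` is imported from `helper_functions.py`, whose source is not shown in this notebook. The sketch below assumes it reports accuracy as a percentage and support-weighted precision, recall and F1, which would be consistent with the recall above equalling the accuracy divided by 100. `weighted_scores` is a hypothetical pure-Python stand-in for illustration, not the helper itself.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"def weighted_scores(y_true, y_pred):\n",
"    # Hypothetical stand-in: support-weighted precision, recall and F1\n",
"    n = len(y_true)\n",
"    support = Counter(y_true)      # true samples per class\n",
"    predicted = Counter(y_pred)    # predicted samples per class\n",
"    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)\n",
"    precision = recall = f1 = 0.0\n",
"    for label, count in support.items():\n",
"        p = hits[label] / predicted[label] if predicted[label] else 0.0\n",
"        r = hits[label] / count\n",
"        f = 2 * p * r / (p + r) if p + r else 0.0\n",
"        weight = count / n         # class share of the true labels\n",
"        precision += weight * p\n",
"        recall += weight * r\n",
"        f1 += weight * f\n",
"    return {'accuracy': 100 * sum(hits.values()) / n,\n",
"            'precision': precision, 'recall': recall, 'f1': f1}\n",
"\n",
"scores = weighted_scores([0, 0, 1, 2], [0, 1, 1, 2])\n",
"print(scores)\n",
"```\n",
"\n",
"With support weighting, recall reduces to the fraction of correct predictions, so it always matches accuracy up to the factor of 100."
],
"metadata": {}
},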
{
"cell_type": "code",
"source": [
"# Collecting the most confidently wrong predictions\n",
"test_pred_classes = [label_encoder.classes_[pred] for pred in test_preds]\n",
"\n",
"# Adding the predictions to test_df\n",
"test_df[\"prediction\"] = test_pred_classes\n",
"test_df[\"pred_prob\"] = tf.reduce_max(test_pred_probs, axis=1).numpy()\n",
"test_df[\"correct\"] = test_df[\"prediction\"] == test_df[\"target\"]\n",
"\n",
"# 200 wrong predictions with the highest predicted probability\n",
"most_wrong_200 = test_df[~test_df[\"correct\"]].sort_values(\"pred_prob\", ascending=False)[:200]\n",
"most_wrong_200"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 424
},
"id": "g1Qo6kn8HJgg",
"outputId": "604f73d7-4c6d-426e-c0ab-68b00c2087b8"
},
"execution_count": 94,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>target</th><th>text</th><th>line_number</th><th>total_lines</th><th>prediction</th><th>pred_prob</th><th>correct</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>8545</th><td>METHODS</td><td>pretest-posttest .</td><td>1</td><td>11</td><td>BACKGROUND</td><td>0.947419</td><td>False</td></tr>\n",
"<tr><th>13874</th><td>CONCLUSIONS</td><td>symptom outcomes will be assessed and estimate...</td><td>4</td><td>6</td><td>METHODS</td><td>0.943379</td><td>False</td></tr>\n",
"<tr><th>16633</th><td>CONCLUSIONS</td><td>clinicaltrials.gov identifier : nct@ .</td><td>19</td><td>19</td><td>BACKGROUND</td><td>0.931800</td><td>False</td></tr>\n",
"<tr><th>16347</th><td>BACKGROUND</td><td>to evaluate the effects of the lactic acid bac...</td><td>0</td><td>12</td><td>OBJECTIVE</td><td>0.929503</td><td>False</td></tr>\n",
"<tr><th>2388</th><td>RESULTS</td><td>the primary endpoint is the cumulative three-y...</td><td>4</td><td>13</td><td>METHODS</td><td>0.926239</td><td>False</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>24794</th><td>RESULTS</td><td>we judged that informed consent would undermin...</td><td>11</td><td>13</td><td>CONCLUSIONS</td><td>0.791870</td><td>False</td></tr>\n",
"<tr><th>13921</th><td>RESULTS</td><td>primary outcome was in-hospital all-cause mort...</td><td>7</td><td>16</td><td>METHODS</td><td>0.791018</td><td>False</td></tr>\n",
"<tr><th>18003</th><td>RESULTS</td><td>this formulation produced highly significant a...</td><td>12</td><td>20</td><td>CONCLUSIONS</td><td>0.790558</td><td>False</td></tr>\n",
"<tr><th>29638</th><td>METHODS</td><td>significance for all tests was set at p &lt; @ .</td><td>8</td><td>12</td><td>RESULTS</td><td>0.790246</td><td>False</td></tr>\n",
"<tr><th>19047</th><td>RESULTS</td><td>a randomized controlled trial and resting-stat...</td><td>2</td><td>10</td><td>METHODS</td><td>0.789534</td><td>False</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>200 rows × 7 columns</p>\n",
"</div>"
],
"text/plain": [
" target ... correct\n",
"8545 METHODS ... False\n",
"13874 CONCLUSIONS ... False\n",
"16633 CONCLUSIONS ... False\n",
"16347 BACKGROUND ... False\n",
"2388 RESULTS ... False\n",
"... ... ... ...\n",
"24794 RESULTS ... False\n",
"13921 RESULTS ... False\n",
"18003 RESULTS ... False\n",
"29638 METHODS ... False\n",
"19047 RESULTS ... False\n",
"\n",
"[200 rows x 7 columns]"
]
},
"metadata": {},
"execution_count": 94
}
]
},
{
"cell_type": "code",
"source": [
"# Printing the 20 most confidently wrong predictions\n",
"for row in most_wrong_200[0:20].itertuples():\n",
"    _, target, text, line_number, total_lines, prediction, pred_prob, _ = row\n",
"    print(f\"Target: {target}, Prediction: {prediction}, Probability: {pred_prob}, Line Number: {line_number}, Total Lines: {total_lines}\\n\")\n",
"    print(f\"Text:\\n{text}\\n\")\n",
"    print(\"- - - - -\\n\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "cA8GkWujL_e7",
"outputId": "0630e122-f255-40eb-ac1f-76328962b0dd"
},
"execution_count": 96,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Target: METHODS, Prediction: BACKGROUND, Probability: 0.9474185705184937, Line Number: 1, Total Lines: 11\n",
"\n",
"Text:\n",
"pretest-posttest .\n",
"\n",
"- - - - -\n",
"\n",
"Target: CONCLUSIONS, Prediction: METHODS, Probability: 0.9433786869049072, Line Number: 4, Total Lines: 6\n",
"\n",
"Text:\n",
"symptom outcomes will be assessed and estimates of cost-effectiveness made .\n",
"\n",
"- - - - -\n",
"\n",
"Target: CONCLUSIONS, Prediction: BACKGROUND, Probability: 0.9317997097969055, Line Number: 19, Total Lines: 19\n",
"\n",
"Text:\n",
"clinicaltrials.gov identifier : nct@ .\n",
"\n",
"- - - - -\n",
"\n",
"Target: BACKGROUND, Prediction: OBJECTIVE, Probability: 0.9295032620429993, Line Number: 0, Total Lines: 12\n",
"\n",
"Text:\n",
"to evaluate the effects of the lactic acid bacterium lactobacillus salivarius on caries risk factors .\n",
"\n",
"- - - - -\n",
"\n",
"Target: RESULTS, Prediction: METHODS, Probability: 0.9262389540672302, Line Number: 4, Total Lines: 13\n",
"\n",
"Text:\n",
"the primary endpoint is the cumulative three-year hiv incidence .\n",
"\n",
"- - - - -\n",
"\n",
"Target: CONCLUSIONS, Prediction: BACKGROUND, Probability: 0.9260937571525574, Line Number: 18, Total Lines: 18\n",
"\n",
"Text:\n",
"nct@ ( clinicaltrials.gov ) .\n",
"\n",
"- - - - -\n",
"\n",
"Target: RESULTS, Prediction: BACKGROUND, Probability: 0.9218314290046692, Line Number: 8, Total Lines: 15\n",
"\n",
"Text:\n",
"non-diffuse-trickling '' ) .\n",
"\n",
"- - - - -\n",
"\n",
"Target: CONCLUSIONS, Prediction: BACKGROUND, Probability: 0.9202677607536316, Line Number: 15, Total Lines: 15\n",
"\n",
"Text:\n",
"-lsb- netherlands trial register ( http://www.trialregister.nl/trialreg/index.asp ) , nr @ , date of registration @ december @ . -rsb-\n",
"\n",
"- - - - -\n",
"\n",
"Target: RESULTS, Prediction: METHODS, Probability: 0.9188950657844543, Line Number: 3, Total Lines: 16\n",
"\n",
"Text:\n",
"a cluster randomised trial was implemented with @,@ children in @ government primary schools on the south coast of kenya in @-@ .\n",
"\n",
"- - - - -\n",
"\n",
"Target: CONCLUSIONS, Prediction: BACKGROUND, Probability: 0.9168384671211243, Line Number: 13, Total Lines: 13\n",
"\n",
"Text:\n",
"( clinicaltrials.gov : nct@ ) .\n",
"\n",
"- - - - -\n",
"\n",
"Target: RESULTS, Prediction: METHODS, Probability: 0.9142534136772156, Line Number: 4, Total Lines: 14\n",
"\n",
"Text:\n",
"a screening questionnaire for moh was sent to all @-@ year old patients on these gps ` list .\n",
"\n",
"- - - - -\n",
"\n",
"Target: RESULTS, Prediction: METHODS, Probability: 0.9101817011833191, Line Number: 6, Total Lines: 14\n",
"\n",
"Text:\n",
"the primary outcome was to evaluate changes in abdominal and shoulder-tip pain via a @-mm visual analog scale at @ , @ , and @hours postoperatively .\n",
"\n",
"- - - - -\n",
"\n",
"Target: METHODS, Prediction: RESULTS, Probability: 0.9094158411026001, Line Number: 6, Total Lines: 9\n",
"\n",
"Text:\n",
"-@ % vs. fish : -@ % vs. fish + s : -@ % ; p < @ ) but there were no significant differences between groups .\n",
"\n",
"- - - - -\n",
"\n",
"Target: METHODS, Prediction: BACKGROUND, Probability: 0.908202588558197, Line Number: 4, Total Lines: 9\n",
"\n",
"Text:\n",
"clinicaltrials.gov identifier : nct@ .\n",
"\n",
"- - - - -\n",
"\n",
"Target: BACKGROUND, Prediction: OBJECTIVE, Probability: 0.9057267904281616, Line Number: 0, Total Lines: 9\n",
"\n",
"Text:\n",
"to compare the efficacy of the newcastle infant dialysis and ultrafiltration system ( nidus ) with peritoneal dialysis ( pd ) and conventional haemodialysis ( hd ) in infants weighing < @ kg .\n",
"\n",
"- - - - -\n",
"\n",
"Target: BACKGROUND, Prediction: OBJECTIVE, Probability: 0.9013079404830933, Line Number: 0, Total Lines: 11\n",
"\n",
"Text:\n",
"to compare the safety and efficacy of dexmedetomidine/propofol ( dp ) - total i.v. anaesthesia ( tiva ) vs remifentanil/propofol ( rp ) - tiva , both with spontaneous breathing , during airway foreign body ( fb ) removal in children .\n",
"\n",
"- - - - -\n",
"\n",
"Target: METHODS, Prediction: OBJECTIVE, Probability: 0.8997651934623718, Line Number: 0, Total Lines: 7\n",
"\n",
"Text:\n",
"to determine whether the insulin resistance that exists in metabolic syndrome ( mets ) patients is modulated by dietary fat composition .\n",
"\n",
"- - - - -\n",
"\n",
"Target: RESULTS, Prediction: CONCLUSIONS, Probability: 0.8978807330131531, Line Number: 13, Total Lines: 15\n",
"\n",
"Text:\n",
"additionally , intervention effects were observed for information gathering in women with high genetic literacy , but not in women with low genetic literacy .\n",
"\n",
"- - - - -\n",
"\n",
"Target: CONCLUSIONS, Prediction: BACKGROUND, Probability: 0.8975597023963928, Line Number: 10, Total Lines: 10\n",
"\n",
"Text:\n",
"clinicaltrials.gov : nct@ .\n",
"\n",
"- - - - -\n",
"\n",
"Target: RESULTS, Prediction: METHODS, Probability: 0.8973494172096252, Line Number: 3, Total Lines: 11\n",
"\n",
"Text:\n",
"family practices were randomly assigned to receive the educational toolkit in june @ ( intervention group ) or may @ ( control group ) .\n",
"\n",
"- - - - -\n",
"\n"
]
}
]
},
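{
"cell_type": "markdown",
"source": [
"To see which label pairs are confused most often, the wrong predictions can be tallied by (target, prediction) pair. A minimal sketch, using only the pairs copied by hand from the first ten rows printed above; in the notebook the same tally could be taken over the `target` and `prediction` columns of `most_wrong_200`.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"# (target, prediction) pairs copied from the first ten rows printed above\n",
"pairs = [\n",
"    ('METHODS', 'BACKGROUND'), ('CONCLUSIONS', 'METHODS'),\n",
"    ('CONCLUSIONS', 'BACKGROUND'), ('BACKGROUND', 'OBJECTIVE'),\n",
"    ('RESULTS', 'METHODS'), ('CONCLUSIONS', 'BACKGROUND'),\n",
"    ('RESULTS', 'BACKGROUND'), ('CONCLUSIONS', 'BACKGROUND'),\n",
"    ('RESULTS', 'METHODS'), ('CONCLUSIONS', 'BACKGROUND'),\n",
"]\n",
"\n",
"# Count each confusion pair and list the most frequent first\n",
"confusions = Counter(pairs)\n",
"for (target, prediction), count in confusions.most_common():\n",
"    print(f'{target} -> {prediction}: {count}')\n",
"```\n",
"\n",
"In this small sample the most frequent pair is `CONCLUSIONS` predicted as `BACKGROUND`, and each of those four rows is a trial registration line rather than a regular sentence."
],
"metadata": {}
},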
{
"cell_type": "markdown",
"source": [
"# Predicting on PubMed NCBI research paper"
],
"metadata": {
"id": "aNXIEQDdN6xp"
}
},
{
"cell_type": "markdown",
"source": [
"**Source** - `https://pubmed.ncbi.nlm.nih.gov/20232240/`\n",
"\n",
"The example abstract comes from a research paper on `PubMed NCBI` by `Christopher Lopata, Marcus L Thomeer, et al.`\n",
"\n",
"**Name** - `Randomized Controlled Trial: RCT of a manualized social treatment for high-functioning autism spectrum disorders`\n",
"\n",
"\n",
"\n",
"**Abstract**:\n",
"\"This RCT examined the efficacy of a manualized social intervention for children with HFASDs. Participants were randomly assigned to treatment or wait-list conditions. Treatment included instruction and therapeutic activities targeting social skills, face-emotion recognition, interest expansion, and interpretation of non-literal language. A response-cost program was applied to reduce problem behaviors and foster skills acquisition. Significant treatment effects were found for five of seven primary outcome measures (parent ratings and direct child measures). Secondary measures based on staff ratings (treatment group only) corroborated gains reported by parents. High levels of parent, child and staff satisfaction were reported, along with high levels of treatment fidelity. Standardized effect size estimates were primarily in the medium and large ranges and favored the treatment group.\"\n",
"\n",
"**File** - Using the `Abstract` of the paper in `.json` format for readability.\n",
"\n",
"**Link**: `https://raw.githubusercontent.com/hecshzye/nlp-medical-abstract-pubmed-rct/main/pubmed_ncbi_autism_disorder.json`"
],
"metadata": {
"id": "RZfQ3OCURW2V"
}
},
{
"cell_type": "code",
"source": [
"import json\n",
"\n",
"# Loading the NCBI paper\n",
"!wget https://raw.githubusercontent.com/hecshzye/nlp-medical-abstract-pubmed-rct/main/pubmed_ncbi_autism_disorder.json\n",
"\n",
"with open(\"pubmed_ncbi_autism_disorder.json\", \"r\") as f:\n",
" ncbi_abstract = json.load(f)\n",
"\n",
"ncbi_abstract"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "VqCJ0IoYTvv9",
"outputId": "405f4b6c-c8bd-4dfd-8c91-a39f86450f4d"
},
"execution_count": 106,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"--2022-01-17 04:46:50-- https://raw.githubusercontent.com/hecshzye/nlp-medical-abstract-pubmed-rct/main/pubmed_ncbi_autism_disorder.json\n",
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n",
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 6737 (6.6K) [text/plain]\n",
"Saving to: ‘pubmed_ncbi_autism_disorder.json’\n",
"\n",
"\r pubmed_nc 0%[ ] 0 --.-KB/s \rpubmed_ncbi_autism_ 100%[===================>] 6.58K --.-KB/s in 0s \n",
"\n",
"2022-01-17 04:46:50 (76.7 MB/s) - ‘pubmed_ncbi_autism_disorder.json’ saved [6737/6737]\n",
"\n"
]
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[{'abstract': 'This RCT examined the efficacy of a manualized social intervention for children with HFASDs. Participants were randomly assigned to treatment or wait-list conditions. Treatment included instruction and therapeutic activities targeting social skills, face-emotion recognition, interest expansion, and interpretation of non-literal language. A response-cost program was applied to reduce problem behaviors and foster skills acquisition. Significant treatment effects were found for five of seven primary outcome measures (parent ratings and direct child measures). Secondary measures based on staff ratings (treatment group only) corroborated gains reported by parents. High levels of parent, child and staff satisfaction were reported, along with high levels of treatment fidelity. Standardized effect size estimates were primarily in the medium and large ranges and favored the treatment group.',\n",
" 'details': 'RCT of a manualized social treatment for high-functioning autism spectrum disorders',\n",
" 'source': 'https://pubmed.ncbi.nlm.nih.gov/20232240/'},\n",
" {'abstract': \"Postpartum depression (PPD) is the most prevalent mood disorder associated with childbirth. No single cause of PPD has been identified, however the increased risk of nutritional deficiencies incurred through the high nutritional requirements of pregnancy may play a role in the pathology of depressive symptoms. Three nutritional interventions have drawn particular interest as possible non-invasive and cost-effective prevention and/or treatment strategies for PPD; omega-3 (n-3) long chain polyunsaturated fatty acids (LCPUFA), vitamin D and overall diet. We searched for meta-analyses of randomised controlled trials (RCT's) of nutritional interventions during the perinatal period with PPD as an outcome, and checked for any trials published subsequently to the meta-analyses. Fish oil: Eleven RCT's of prenatal fish oil supplementation RCT's show null and positive effects on PPD symptoms. Vitamin D: no relevant RCT's were identified, however seven observational studies of maternal vitamin D levels with PPD outcomes showed inconsistent associations. Diet: Two Australian RCT's with dietary advice interventions in pregnancy had a positive and null result on PPD. With the exception of fish oil, few RCT's with nutritional interventions during pregnancy assess PPD. Further research is needed to determine whether nutritional intervention strategies during pregnancy can protect against symptoms of PPD. Given the prevalence of PPD and ease of administering PPD measures, we recommend future prenatal nutritional RCT's include PPD as an outcome.\",\n",
" 'details': 'Formatting removed (can be used to compare model to actual example)',\n",
" 'source': 'https://pubmed.ncbi.nlm.nih.gov/28012571/'},\n",
" {'abstract': 'Mental illness, including depression, anxiety and bipolar disorder, accounts for a significant proportion of global disability and poses a substantial social, economic and heath burden. Treatment is presently dominated by pharmacotherapy, such as antidepressants, and psychotherapy, such as cognitive behavioural therapy; however, such treatments avert less than half of the disease burden, suggesting that additional strategies are needed to prevent and treat mental disorders. There are now consistent mechanistic, observational and interventional data to suggest diet quality may be a modifiable risk factor for mental illness. This review provides an overview of the nutritional psychiatry field. It includes a discussion of the neurobiological mechanisms likely modulated by diet, the use of dietary and nutraceutical interventions in mental disorders, and recommendations for further research. Potential biological pathways related to mental disorders include inflammation, oxidative stress, the gut microbiome, epigenetic modifications and neuroplasticity. Consistent epidemiological evidence, particularly for depression, suggests an association between measures of diet quality and mental health, across multiple populations and age groups; these do not appear to be explained by other demographic, lifestyle factors or reverse causality. Our recently published intervention trial provides preliminary clinical evidence that dietary interventions in clinically diagnosed populations are feasible and can provide significant clinical benefit. Furthermore, nutraceuticals including n-3 fatty acids, folate, S-adenosylmethionine, N-acetyl cysteine and probiotics, among others, are promising avenues for future research. Continued research is now required to investigate the efficacy of intervention studies in large cohorts and within clinically relevant populations, particularly in patients with schizophrenia, bipolar and anxiety disorders.',\n",
" 'details': 'Effect of nutrition on mental health',\n",
" 'source': 'https://pubmed.ncbi.nlm.nih.gov/28942748/'},\n",
" {'abstract': \"Hepatitis C virus (HCV) and alcoholic liver disease (ALD), either alone or in combination, count for more than two thirds of all liver diseases in the Western world. There is no safe level of drinking in HCV-infected patients and the most effective goal for these patients is total abstinence. Baclofen, a GABA(B) receptor agonist, represents a promising pharmacotherapy for alcohol dependence (AD). Previously, we performed a randomized clinical trial (RCT), which demonstrated the safety and efficacy of baclofen in patients affected by AD and cirrhosis. The goal of this post-hoc analysis was to explore baclofen's effect in a subgroup of alcohol-dependent HCV-infected cirrhotic patients. Any patient with HCV infection was selected for this analysis. Among the 84 subjects randomized in the main trial, 24 alcohol-dependent cirrhotic patients had a HCV infection; 12 received baclofen 10mg t.i.d. and 12 received placebo for 12-weeks. With respect to the placebo group (3/12, 25.0%), a significantly higher number of patients who achieved and maintained total alcohol abstinence was found in the baclofen group (10/12, 83.3%; p=0.0123). Furthermore, in the baclofen group, compared to placebo, there was a significantly higher increase in albumin values from baseline (p=0.0132) and a trend toward a significant reduction in INR levels from baseline (p=0.0716). In conclusion, baclofen was safe and significantly more effective than placebo in promoting alcohol abstinence, and improving some Liver Function Tests (LFTs) (i.e. albumin, INR) in alcohol-dependent HCV-infected cirrhotic patients. Baclofen may represent a clinically relevant alcohol pharmacotherapy for these patients.\",\n",
" 'details': 'Baclofen promotes alcohol abstinence in alcohol dependent cirrhotic patients with hepatitis C virus (HCV) infection',\n",
" 'source': 'https://pubmed.ncbi.nlm.nih.gov/22244707/'}]"
]
},
"metadata": {},
"execution_count": 106
}
]
},
{
"cell_type": "code",
"source": [
"abstracts = pd.DataFrame(ncbi_abstract)\n",
"# Using spacy sentencizer\n",
"from spacy.lang.en import English\n",
"nlp = English()\n",
"sentencizer = nlp.create_pipe(\"sentencizer\")\n",
"nlp.add_pipe(sentencizer)\n",
"doc = nlp(ncbi_abstract[0][\"abstract\"])\n",
"abstract_lines = [str(sent) for sent in list(doc.sents)]\n",
"abstract_lines"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "fX_OB4o-UcU9",
"outputId": "fe0ee7bf-fb17-4359-88b8-b6137c7bc46a"
},
"execution_count": 107,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['This RCT examined the efficacy of a manualized social intervention for children with HFASDs.',\n",
" 'Participants were randomly assigned to treatment or wait-list conditions.',\n",
" 'Treatment included instruction and therapeutic activities targeting social skills, face-emotion recognition, interest expansion, and interpretation of non-literal language.',\n",
" 'A response-cost program was applied to reduce problem behaviors and foster skills acquisition.',\n",
" 'Significant treatment effects were found for five of seven primary outcome measures (parent ratings and direct child measures).',\n",
" 'Secondary measures based on staff ratings (treatment group only) corroborated gains reported by parents.',\n",
" 'High levels of parent, child and staff satisfaction were reported, along with high levels of treatment fidelity.',\n",
" 'Standardized effect size estimates were primarily in the medium and large ranges and favored the treatment group.']"
]
},
"metadata": {},
"execution_count": 107
}
]
},
{
"cell_type": "code",
"source": [
"# Preprocessing the ncbi_abstract\n",
"total_lines_in_ncbi_abstract = len(abstract_lines)\n",
"sample_lines = []\n",
"for i, line in enumerate(abstract_lines):\n",
" sample_dict = {}\n",
" sample_dict[\"text\"] = str(line)\n",
" sample_dict[\"line_number\"] = i\n",
" sample_dict[\"total_lines\"] = total_lines_in_ncbi_abstract - 1\n",
" sample_lines.append(sample_dict)\n",
"sample_lines "
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "byX76JhsU9gB",
"outputId": "5c25f915-d13c-41bd-d0d4-fd7f5f372494"
},
"execution_count": 108,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[{'line_number': 0,\n",
" 'text': 'This RCT examined the efficacy of a manualized social intervention for children with HFASDs.',\n",
" 'total_lines': 7},\n",
" {'line_number': 1,\n",
" 'text': 'Participants were randomly assigned to treatment or wait-list conditions.',\n",
" 'total_lines': 7},\n",
" {'line_number': 2,\n",
" 'text': 'Treatment included instruction and therapeutic activities targeting social skills, face-emotion recognition, interest expansion, and interpretation of non-literal language.',\n",
" 'total_lines': 7},\n",
" {'line_number': 3,\n",
" 'text': 'A response-cost program was applied to reduce problem behaviors and foster skills acquisition.',\n",
" 'total_lines': 7},\n",
" {'line_number': 4,\n",
" 'text': 'Significant treatment effects were found for five of seven primary outcome measures (parent ratings and direct child measures).',\n",
" 'total_lines': 7},\n",
" {'line_number': 5,\n",
" 'text': 'Secondary measures based on staff ratings (treatment group only) corroborated gains reported by parents.',\n",
" 'total_lines': 7},\n",
" {'line_number': 6,\n",
" 'text': 'High levels of parent, child and staff satisfaction were reported, along with high levels of treatment fidelity.',\n",
" 'total_lines': 7},\n",
" {'line_number': 7,\n",
" 'text': 'Standardized effect size estimates were primarily in the medium and large ranges and favored the treatment group.',\n",
" 'total_lines': 7}]"
]
},
"metadata": {},
"execution_count": 108
}
]
},
{
"cell_type": "code",
"source": [
"# Encoding\n",
"test_abstract_line_numbers = [line[\"line_number\"] for line in sample_lines]\n",
"test_abstract_line_numbers_one_hot = tf.one_hot(test_abstract_line_numbers, depth=15)\n",
"test_abstract_total_lines = [line[\"total_lines\"] for line in sample_lines]\n",
"test_abstract_total_lines_one_hot = tf.one_hot(test_abstract_total_lines, depth=20)\n",
"# Spliting into characters\n",
"abstract_characters = [split_character(sentence) for sentence in abstract_lines]\n",
"\n",
"# Predictions \n",
"test_abstract_pred_probs = model_6.predict(x=(test_abstract_line_numbers_one_hot,\n",
" test_abstract_total_lines_one_hot,\n",
" tf.constant(abstract_lines),\n",
" tf.constant(abstract_characters)))\n",
"test_abstract_preds = tf.argmax(test_abstract_pred_probs, axis=1)\n",
"test_abstract_pred_classes = [label_encoder.classes_[i] for i in test_abstract_preds]\n",
"\n",
"for i, line in enumerate(abstract_lines):\n",
" print(f\"{test_abstract_pred_classes[i]}: {line}\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "8IwaeIwWYDgb",
"outputId": "db7abe9f-c18b-4d27-c12a-3e377957d176"
},
"execution_count": 115,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"OBJECTIVE: This RCT examined the efficacy of a manualized social intervention for children with HFASDs.\n",
"METHODS: Participants were randomly assigned to treatment or wait-list conditions.\n",
"METHODS: Treatment included instruction and therapeutic activities targeting social skills, face-emotion recognition, interest expansion, and interpretation of non-literal language.\n",
"METHODS: A response-cost program was applied to reduce problem behaviors and foster skills acquisition.\n",
"RESULTS: Significant treatment effects were found for five of seven primary outcome measures (parent ratings and direct child measures).\n",
"METHODS: Secondary measures based on staff ratings (treatment group only) corroborated gains reported by parents.\n",
"RESULTS: High levels of parent, child and staff satisfaction were reported, along with high levels of treatment fidelity.\n",
"RESULTS: Standardized effect size estimates were primarily in the medium and large ranges and favored the treatment group.\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"# Model Predictions"
],
"metadata": {
"id": "-19LUH0waM9C"
}
},
{
"cell_type": "markdown",
"source": [
"**Original Abstract**:\n",
"\n",
"\"This RCT examined the efficacy of a manualized social intervention for children with HFASDs. Participants were randomly assigned to treatment or wait-list conditions. Treatment included instruction and therapeutic activities targeting social skills, face-emotion recognition, interest expansion, and interpretation of non-literal language. A response-cost program was applied to reduce problem behaviors and foster skills acquisition. Significant treatment effects were found for five of seven primary outcome measures (parent ratings and direct child measures). Secondary measures based on staff ratings (treatment group only) corroborated gains reported by parents. High levels of parent, child and staff satisfaction were reported, along with high levels of treatment fidelity. Standardized effect size estimates were primarily in the medium and large ranges and favored the treatment group.\""
],
"metadata": {
"id": "wet8KQcAbqqY"
}
},
{
"cell_type": "markdown",
"source": [
"**Model's `Predicted` Abstract which makes Abstract easier to read**\n",
"\n",
"`Abstract` after `Natural Language Processing` (`model_6`):\n",
"\n",
"\n",
"\n",
"\n",
"**OBJECTIVE**: This RCT examined the efficacy of a manualized social intervention for children with HFASDs.\n",
"\n",
"**METHODS**: Participants were randomly assigned to treatment or wait-list conditions.\n",
"\n",
"**METHODS**: Treatment included instruction and therapeutic activities targeting social skills, face-emotion recognition, interest expansion, and interpretation of non-literal language.\n",
"\n",
"**METHODS**: A response-cost program was applied to reduce problem behaviors and foster skills acquisition.\n",
"\n",
"**RESULTS**: Significant treatment effects were found for five of seven primary outcome measures (parent ratings and direct child measures).\n",
"\n",
"**METHODS**: Secondary measures based on staff ratings (treatment group only) corroborated gains reported by parents.\n",
"\n",
"**RESULTS**: High levels of parent, child and staff satisfaction were reported, along with high levels of treatment fidelity.\n",
"\n",
"**RESULTS**: Standardized effect size estimates were primarily in the medium and large ranges and favored the treatment group.\n",
"\n"
],
"metadata": {
"id": "7OmmHuG0b5xq"
}
}
]
}