# Disease Detection in Biomedical Free-Text
## 1. Topic

The project revolves around a natural language processing challenge whose primary
goal is to recognize specific named entities, in particular medical conditions and
diagnoses, in free text. The project used **Python 3.11.6**.
## 2. Type of Project

A key challenge of this project is gathering and annotating data, since publicly
available datasets for this particular problem are scarce. Nonetheless, existing
medical lexicons such as SNOMED CT and ICD-10 could potentially enhance the
model's vocabulary, although the feasibility of this idea is still unclear to me.

While there are pre-trained models on Huggingface suitable for similar NER tasks,
their training datasets have not been publicly disclosed.
## 3. Summary

### a. Idea and Approach

The main idea is free-text processing for extracting diagnoses and diseases from
medical notes. This type of named entity recognition could be useful, for example,
for converting unstructured data into structured standards suitable for deployment
in Hospital Information Systems.

For this kind of NER project, BERT models are expected to be effective. The plan is
therefore to take a BERT base model and adapt it to this specialized task, as
sketched below.
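
A minimal sketch of what this adaptation could look like with the Hugging Face `transformers` library; the label scheme and checkpoint name are illustrative assumptions, not the final setup:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical BIO label scheme with a single entity type (medical condition).
labels = ["O", "B-CONDITION", "I-CONDITION"]

# "bert-base-cased" is only an example checkpoint; any BERT base model fits here.
checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# The pre-trained encoder is reused as-is; only the token-classification head is
# new and is trained together with the encoder on the annotated admission notes.
```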
### b. Dataset Description

**ESSENTIALS**

The primary dataset originates from the TREC CT topics, publicly accessible
here: http://trec-cds.org/

Each topic has a similar structure and contains several diagnoses in free-text
format. The topics represent admission notes - notes with the most important
patient details, which a doctor takes as soon as a person is admitted to a
hospital. These include personal information and demographics, such as gender and
age, but most importantly the current medical conditions, personal medical history
and family medical history. For simplification purposes, the focus lies on
detecting diseases/diagnoses present in the text, covering conditions such as
diabetes mellitus or high blood pressure.

The dataset comprises a total of 255 entries (topics), made up as follows (a small
parsing sketch follows the list):

- **topics2016.xml** - valuable information in the *note*, *description* and *summary*
  fields. 30 topics in total; processing the three fields individually yields 90 topics.
- **topics2021.xml** - 75 topics in total.
- **topics2022.xml** - 50 topics in total.
- **topics2023.xml** - preprocessed into free text in admission-note style via an LLM -
  40 topics in total.
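
How the 2016 topics could be read into plain free-text entries; the tag names `note`, `description` and `summary` are taken from the description above, but the exact XML layout of the TREC files may differ:

```python
import xml.etree.ElementTree as ET

def load_topics_2016(path="topics2016.xml"):
    """Return each note/description/summary field as a separate free-text entry."""
    entries = []
    root = ET.parse(path).getroot()
    for topic in root.iter("topic"):
        for field in ("note", "description", "summary"):
            element = topic.find(field)
            if element is not None and element.text:
                entries.append(element.text.strip())
    return entries

notes = load_topics_2016()
print(f"Loaded {len(notes)} free-text entries")  # up to 90 for the 2016 file
```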
**ADDITIONALS**

Should the topic dataset prove insufficient (e.g. if error metrics are too high),
I will include more data from the [*ClinicalTrials.gov*](https://clinicaltrials.gov/)
database. It contains information on clinical trials, including free-text
descriptions of those trials. This may further enhance the model's performance -
but given the complexity of annotating this kind of data, I would consider it only
if the model's vocabulary does not suffice.

Since vocabulary in the medical world is complex and diverse, it might be very
useful to enhance the model's vocabulary with existing medical thesauri, some of
which (such as ICD-10) are publicly available and continuously updated by medical
professionals. However, I am still uncertain how to incorporate such a thesaurus
into a BERT model's vocabulary; one possible route is sketched below.
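
One option would be to add thesaurus terms as extra tokens to the tokenizer and resize the embedding matrix; whether this actually helps is an open question. A minimal sketch, assuming the ICD-10 terms have already been extracted into a plain Python list:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=3)

# Hypothetical terms taken from an ICD-10 release; the real list would be much larger.
icd10_terms = ["hyperlipidemia", "dyspnea", "atrial fibrillation"]

# add_tokens() only adds strings not already covered by the existing vocabulary.
num_added = tokenizer.add_tokens(icd10_terms)
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} new tokens")  # their embeddings start out randomly initialized
```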
**Language**: All data (text) used in this project will be in English.
### c. Work-Breakdown Structure

- **requirements engineering**
  *Goal*: Study and collect requirements for both the project and the BERT architecture.
  Possibly find tools for annotating data efficiently. Find error metrics suitable
  for this task.
  *Time*: 5h
  *Deadline*: 29th Oct.
- **capturing and annotating data**
  *Goal*: Collect all necessary data and make it ready for training the model.
  *Time*: 25h
  *Deadline*: 12th Nov.
- **describing data**
  *Goal*: Describe the data and create plots for data visualization (e.g. wordclouds)
  to better understand the data we are working with.
  *Time*: 5h
  *Deadline*: 14th Nov.
- **implementing BERT**
  *Goal*: Implement and train a working BERT model.
  *Time*: 15h
  *Deadline*: 5th Dec.
- **tuning BERT**
  *Goal*: Perform hyperparameter tuning and detect possible defects.
  *Time*: 10h
  *Deadline*: 19th Dec.
- **report, presentation and application**
  *Goal*: Write the final report and create a visually appealing presentation. Include a small Angular webapp.
  *Time*: 10h
  *Deadline*: 16th Jan.
## 4. Related Papers and Models

- Named entity recognition and normalization in biomedical literature: a practical case in SARS-CoV-2 literature (https://oa.upm.es/67933/)

  This work led to a BioBERT model fine-tuned for disease recognition in biomedical texts.
  Available at: https://huggingface.co/alvaroalon2/biobert_diseases_ner (see the usage sketch below)

- Launching into clinical space with medspaCy: a new clinical text processing toolkit in Python (https://arxiv.org/abs/2106.07799)

  The result was a Python package used by medical experts for processing biomedical texts.
  Interestingly, many of the entities and fields this tool captures are derived via
  rudimentary regular expressions, as seen in its source code. Available at:
  https://github.com/medspacy/medspacy
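
For reference, the fine-tuned BioBERT checkpoint listed above can be tried out of the box with the `transformers` pipeline; this is only a baseline sketch, and the example sentence is made up:

```python
from transformers import pipeline

# Load the referenced disease-NER checkpoint as an off-the-shelf baseline.
ner = pipeline(
    "token-classification",
    model="alvaroalon2/biobert_diseases_ner",
    aggregation_strategy="simple",  # merge sub-word pieces into whole entities
)

text = "The patient has a history of diabetes mellitus and hypertension."
for entity in ner(text):
    print(entity["word"], entity["entity_group"], round(float(entity["score"]), 3))
```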
# Notes during Development

## Data Labelling

Data labelling has been done via doccano (a sketch for converting its export into
training labels follows at the end of this section).

I encountered several interesting issues while labelling data as 'medical conditions', since the definition of a medical condition is not clear-cut and is subject
to interpretation. For instance, it is uncertain whether 'fever' should be classified as a medical condition (i.e. a disease) or a symptom. The same applies to
bone fractures etc. For the purpose of this exercise, I consulted several medical ontologies and websites to see how specific medical terminology
is classified. As an example, 'fever' and 'dyspnea' are, in fact, not listed as medical conditions, but rather as symptoms.

Since doctors use many abbreviations in admission notes (e.g. 'CAD' for 'Coronary Artery Disease'), this website was very helpful: https://www.allacronyms.com/
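
The sketch below shows how doccano's span annotations could be turned into token-level BIO tags for training; the JSONL field names (`text`, `label`) and the whitespace tokenization are assumptions and may need adjusting to the actual export:

```python
import json

def whitespace_tokens(text):
    """Yield (character offset, token) pairs for a simple whitespace tokenization."""
    offset = 0
    for token in text.split():
        offset = text.index(token, offset)
        yield offset, token
        offset += len(token)

def doccano_to_bio(path):
    """Convert a doccano JSONL export with [start, end, label] spans into BIO tags."""
    examples = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            record = json.loads(line)
            text, spans = record["text"], record.get("label", [])
            tokens, tags = [], []
            for token_start, token in whitespace_tokens(text):
                tag = "O"
                for start, end, label in spans:
                    if start <= token_start < end:
                        tag = ("B-" if token_start == start else "I-") + label
                        break
                tokens.append(token)
                tags.append(tag)
            examples.append((tokens, tags))
    return examples
```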
## Data Visualization

I decided on wordclouds and word-frequency tables to get an overview of the text data as well as the labelled data.
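
A minimal sketch of how such plots can be produced with the `wordcloud` package and `collections.Counter`; the `notes` list and the output path are placeholders:

```python
from collections import Counter
from wordcloud import WordCloud

# Placeholder: `notes` would be the list of free-text admission notes loaded earlier.
notes = ["58-year-old woman with hypertension and diabetes mellitus ..."]

text = " ".join(notes).lower()
WordCloud(width=800, height=400, background_color="white").generate(text).to_file(
    "plots/wordcloud_topics.png"
)

# Word-frequency table: simple whitespace tokenization, top 20 words.
for word, count in Counter(text.split()).most_common(20):
    print(f"{word:<20}{count}")
```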
**Wordcloud for free-text admission notes**



**Word-Frequency Table for free-text admission notes**



**Wordcloud for medical conditions**



**Word-Frequency Table for medical conditions**


## Metrics

For this task (named entity recognition), we want both high precision and high recall. Precision measures the percentage of the model's predicted entities that are correct. High precision indicates a low false-positive rate, which is important in medical applications as a means to avoid false alarms. Recall, on the other hand, measures the percentage of actual entities that the model correctly identified. High recall is important in medical NER to ensure that no critical information is missed.

The logical conclusion is to use the harmonic mean of both, which is the F1 score. It provides a balance between precision and recall. An ideal model has both high precision and high recall, leading to a high F1 score - this is often the primary metric for NER tasks, and it is the metric we will focus on during hyperparameter tuning.
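
As an illustration, entity-level precision, recall and F1 can be computed with the `seqeval` package; the tag sequences below are made-up examples:

```python
from seqeval.metrics import precision_score, recall_score, f1_score

# Made-up gold and predicted BIO tag sequences for two short notes.
y_true = [["O", "B-CONDITION", "I-CONDITION", "O"], ["B-CONDITION", "O"]]
y_pred = [["O", "B-CONDITION", "I-CONDITION", "O"], ["O", "O"]]

print("precision:", precision_score(y_true, y_pred))  # 1.0 - every predicted entity is correct
print("recall:   ", recall_score(y_true, y_pred))     # 0.5 - one of two gold entities was found
print("f1:       ", f1_score(y_true, y_pred))         # 2*P*R/(P+R) = 0.666...
```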
Even with arbitrarily chosen parameters, SGD already provided good results, with an F1 score of ~0.85. I believe that with proper hyperparameter tuning we can exceed 0.9. The goal is an F1 score of at least 0.92.
## Observations during Hyperparameter-Tuning

- A higher learning rate leads to worse results (lr: 0.1, f1: ~0.3).
- The Adam optimizer performs poorly in comparison to SGD with momentum.
- The F1 score no longer changes significantly (or gets worse) after ~5 epochs.
- Too low a learning rate also leads to worse results (lr: 0.0001, f1: ~0.3).
- A slightly higher learning rate leads to better results (lr: 0.001, f1: ~0.5).
- A higher batch size leads to a slightly worse F1 score ({batch size: 16, f1: 0.92} -> {batch size: 32, f1: 0.83}).

Currently best parameters (see the training sketch below):

`{'batch_size': 8, 'learning_rate': 0.01, 'epochs': 5, 'optimizer': 'SGD', 'max_tokens': 128}` -> 0.93 (with base BERT)
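
A condensed sketch of how these parameters could map onto a PyTorch fine-tuning loop; the dummy batch only stands in for the real tokenized admission notes:

```python
import torch
from torch.optim import SGD
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Currently best parameters from above.
best = {"batch_size": 8, "learning_rate": 0.01, "epochs": 5, "max_tokens": 128}

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=3)
optimizer = SGD(model.parameters(), lr=best["learning_rate"], momentum=0.9)

# Dummy data: in the project this would be the annotated admission notes.
encoded = tokenizer(
    ["Patient with diabetes mellitus."] * best["batch_size"],
    padding="max_length", truncation=True, max_length=best["max_tokens"],
    return_tensors="pt",
)
labels = torch.zeros_like(encoded["input_ids"])  # all "O" labels, for illustration only
dataset = TensorDataset(encoded["input_ids"], encoded["attention_mask"], labels)
loader = DataLoader(dataset, batch_size=best["batch_size"])

for epoch in range(best["epochs"]):
    model.train()
    for input_ids, attention_mask, batch_labels in loader:
        optimizer.zero_grad()
        output = model(input_ids=input_ids, attention_mask=attention_mask, labels=batch_labels)
        output.loss.backward()  # cross-entropy over the token labels
        optimizer.step()
```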
## Actual Amount of Time Spent

- **requirements engineering**
  *Time Planned*: 5h
  *Time Spent*: 4h
  *Notes*: Thanks to a colleague, finding appropriate tools for data annotation was easy.
- **capturing and annotating data**
  *Time Planned*: 25h
  *Time Spent*: 35h
  *Notes*: Data annotation was much harder than I anticipated, largely due to the very complicated medical terminology and the unclearly defined difference between symptoms and medical conditions.
- **describing data**
  *Time Planned*: 5h
  *Time Spent*: 3.5h
  *Notes*: The insights were very useful for finding a good value for the maximum number of tokens.
- **implementing BERT**
  *Time Planned*: 15h
  *Time Spent*: 22h
  *Notes*: Unexpected difficulties utilizing BioBERT for transfer learning.
- **tuning BERT**
  *Time Planned*: 10h
  *Time Spent*: 19h (active)
  *Notes*: Setting up the environment for hyperparameter tuning was not as hard as expected, but the tuning itself was far more computationally intensive than anticipated. Furthermore, there was initially strange behaviour in the error metrics (an exploding error rate and rapid forgetting) - fixing the bug was rather time-consuming.