# Disease Detection in Biomedical Free-Text

## 1. Topic

The project revolves around a natural language processing challenge whose primary goal is to recognize specific named entities, particularly medical conditions and diagnoses. The project uses **Python 3.11.6**.

## 2. Type of Project

A key issue of this project is gathering and annotating data, since publicly available datasets for this particular problem are scarce. Nonetheless, existing medical lexicons, such as SNOMED CT and ICD-10, could potentially enhance the model's vocabulary, although the feasibility of this idea is still unclear to me.

While there are pre-trained models on Huggingface suitable for similar NER tasks, their training datasets have not been publicly disclosed.

## 3. Summary

### a. Idea and Approach

The main idea is free-text processing for extracting diagnoses and diseases from medical notes. This type of named entity recognition might be useful, e.g., for converting unstructured data into structured, standardized formats suitable for deployment in Hospital Information Systems.

For this kind of NER project, BERT models are anticipated to be effective. As such, the plan is to take BERT base models and adapt them to this specialized task.
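
A minimal sketch of this adaptation (assuming the Hugging Face `transformers` library; the BIO label set below is illustrative, not final):

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Illustrative BIO label set for disease/diagnosis spans.
labels = ["O", "B-DISEASE", "I-DISEASE"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)
```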

### b. Dataset Description

**ESSENTIALS**

The primary dataset originates from the TREC CT topics, publicly accessible here: http://trec-cds.org/

Each topic has a similar structure, including several diagnoses in free-text format. The topics represent admission notes - the notes with the most important patient details, which a doctor takes as soon as a person is admitted to a hospital. These include personal information and demographics, such as gender and age, but also, most importantly, the current medical conditions, personal medical history and family medical history. For simplification purposes, the focus lies on detecting the diseases/diagnoses present in the text, covering conditions such as diabetes mellitus or high blood pressure.

The dataset comprises a total of 255 entries (topics):

- **topics2016.xml** - valuable information in *note*, *description* and *summary*. 30 topics in total. The fields can, however, be processed individually, yielding a total of 90 topics.
- **topics2021.xml** - 75 topics in total.
- **topics2022.xml** - 50 topics in total.
- **topics2023.xml** - preprocessed into free text in admission-note style via an LLM - 40 topics in total.
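
A minimal sketch of loading such a topics file (assuming the standard library's `xml.etree.ElementTree`; the `topic` tag name and the free text sitting in the element body follow the newer TREC CT files, so treat both as assumptions to verify against each file):

```python
import xml.etree.ElementTree as ET

def load_topics(path: str) -> list[str]:
    """Extract the free-text body of every <topic> element in a TREC CT file."""
    root = ET.parse(path).getroot()
    # Newer files keep the note directly in the element body; older ones
    # (e.g. topics2016.xml) split it into sub-fields instead.
    return [topic.text.strip() for topic in root.iter("topic") if topic.text]

topics = load_topics("topics2021.xml")
print(f"{len(topics)} topics loaded")
```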

**ADDITIONALS**

Should the topic dataset prove insufficient (e.g. error metrics at inference too high), I will include more data from the [*ClinicalTrials.gov*](https://clinicaltrials.gov/) database. It contains information on clinical trials, including free-text descriptions of said trials. This may be useful to further enhance the model's performance - given the complexity of annotating this kind of data, I would consider it only if the model's vocabulary does not suffice.

Since vocabulary in the medical world is complex and diverse, it might be incredibly useful to enhance the model's vocabulary with existing medical thesauri, some of which (such as ICD-10) are publicly available and continuously updated by medical professionals. However, I am still uncertain how to incorporate such a thesaurus into a BERT model's vocabulary.
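
One conceivable route (a sketch only, not a settled design) is to register thesaurus terms as additional tokens via the Hugging Face `transformers` API, reusing `tokenizer` and `model` from the sketch above; the term list here is illustrative:

```python
# Hypothetical list of thesaurus terms, e.g. extracted from ICD-10 labels.
new_terms = ["hypertension", "dyspnea", "myocardial infarction"]

num_added = tokenizer.add_tokens(new_terms)
# The embedding matrix must grow to match the enlarged vocabulary;
# the new rows are randomly initialized and learned during fine-tuning.
model.resize_token_embeddings(len(tokenizer))
print(f"{num_added} tokens added")
```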

**Language**: All data (text) being used in this project will be in English.

### c. Work-Breakdown Structure

- **requirements engineering**
*Goal*: Study and collect requirements for both the project and the BERT architecture. Possibly find tools for annotating data efficiently. Find error metrics suitable for this task.
*Time*: 5h
*Deadline*: 29th Oct.
- **capturing and annotating data**
*Goal*: Collect all necessary data and make it ready for training the model.
*Time*: 25h
*Deadline*: 12th Nov.
- **describing data**
*Goal*: Describe the data and make plots for data visualization (e.g. wordclouds) to better understand the data we are working with.
*Time*: 5h
*Deadline*: 14th Nov.
- **implementing BERT**
*Goal*: Implement and train a working BERT model.
*Time*: 15h
*Deadline*: 5th Dec.
- **tuning BERT**
*Goal*: Perform hyperparameter tuning and detect possible defects.
*Time*: 10h
*Deadline*: 19th Dec.
- **report, presentation and application**
*Goal*: Write a finished report and make a visually appealing presentation. Include a small Angular webapp.
*Time*: 10h
*Deadline*: 16th Jan.

## 4. Related Papers and Models

- Named entity recognition and normalization in biomedical literature: a practical case in SARS-CoV-2 literature (https://oa.upm.es/67933/)

This work resulted in a BioBERT model fine-tuned for disease recognition in biomedical texts (a usage sketch follows this list). Available at: https://huggingface.co/alvaroalon2/biobert_diseases_ner

- Launching into clinical space with medspaCy: a new clinical text processing toolkit in Python (https://arxiv.org/abs/2106.07799)

The result is a Python package used by medical experts for processing biomedical texts. Interestingly enough, many of the entities and fields this tool captures are derived via rudimentary regular expressions, as seen in its source code. Available at: https://github.com/medspacy/medspacy
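
A minimal usage sketch for the BioBERT disease-NER model linked above (assuming the Hugging Face `pipeline` API; the example sentence is illustrative):

```python
from transformers import pipeline

# Load the fine-tuned disease-NER model and merge sub-word predictions
# into whole entity spans.
ner = pipeline("token-classification",
               model="alvaroalon2/biobert_diseases_ner",
               aggregation_strategy="simple")

print(ner("Patient presents with diabetes mellitus and high blood pressure."))
```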

# Notes during Development

## Data Labelling

Data Labelling has been done via doccano.

I encountered several interesting issues while labelling data as 'medical conditions', since the definition of a medical condition is not clear and is subject to interpretation. For instance, it is uncertain whether 'fever' should be classified as a medical condition (i.e. disease) or a symptom. The same goes for a fracture of a bone, etc. For the purpose of this exercise, I looked up several medical ontologies and websites in order to see how specific medical lingo is classified. As an example, 'fever' and 'dyspnea' have, in fact, not been listed as medical conditions, but rather as symptoms.

Since doctors use many abbreviations in admission notes (e.g. 'CAD' for 'Coronary Artery Disease'), this website was very helpful: https://www.allacronyms.com/
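
For reference, a minimal sketch of converting a doccano JSONL export (character-span annotations) into token-level BIO tags; the `text` and `label` field names match doccano's default sequence-labelling export, but the whitespace tokenization and the file name are simplifying assumptions:

```python
import json

def spans_to_bio(text: str, spans: list[list]) -> list[tuple[str, str]]:
    """Map doccano character spans [start, end, label] onto whitespace tokens."""
    tagged = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        pos = end
        tag = "O"
        for s, e, label in spans:
            if start >= s and end <= e:
                # First token of a span gets B-, every later token I-.
                tag = ("B-" if start == s else "I-") + label
                break
        tagged.append((token, tag))
    return tagged

with open("admission_notes.jsonl") as f:  # hypothetical export file name
    for line in f:
        entry = json.loads(line)
        print(spans_to_bio(entry["text"], entry["label"]))
```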

## Data Visualization

I decided on wordclouds and word-frequency tables to get an overview of the text data as well as the labelled data.
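
A minimal sketch of how such a wordcloud can be generated (assuming the `wordcloud` and `matplotlib` packages; `notes`, the list of admission-note strings, is a placeholder):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

notes = ["..."]  # placeholder: list of free-text admission notes

# Concatenate all notes and let WordCloud handle tokenization and counts.
cloud = WordCloud(width=800, height=400, stopwords=STOPWORDS,
                  background_color="white").generate(" ".join(notes))

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```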

**Wordcloud for free-text admission notes**

![wordcloud_text](https://github.com/Padraig20/Applied-Deep-Learning-VU/assets/111874815/839b1ac2-4050-4118-8280-c526c8a7d525)

**Word-Frequency Table for free-text admission notes**

![wordfreq_text](https://github.com/Padraig20/Applied-Deep-Learning-VU/assets/111874815/103ff040-7beb-4611-a0de-14b00f0d79ed)

**Wordcloud for medical conditions**

![wordcloud_entities](https://github.com/Padraig20/Applied-Deep-Learning-VU/assets/111874815/d7e82fb8-530a-4aa6-b585-3e719a07def9)

**Word-Frequency Table for medical conditions**

![wordfreq_entities](https://github.com/Padraig20/Applied-Deep-Learning-VU/assets/111874815/f19901a2-bbc4-4dbb-895d-3609709da594)

## Metrics

For this task (named entity recognition), we want both high precision and high recall. Precision measures the percentage of the model's predicted entities that are correct; high precision indicates a low false-positive rate, which is important in medical applications as a means to avoid false alarms. Recall, on the other hand, measures the percentage of actual entities that the model correctly identified; high recall is important in medical NER to ensure that no critical information is missed.

The logical conclusion is to use the harmonic mean of both, which is the f1-score. It provides a balance between precision and recall: an ideal model will have both high precision and high recall, leading to a high f1-score. This is often the primary metric for NER tasks, and it is the metric we will focus on during hyperparameter tuning.

Even with roughly chosen parameters, SGD already provided great results, with an f1-score of ~0.85. I believe that with proper hyperparameter tuning we can exceed 0.9. The goal would be an f1-score of at least 0.92.
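
A minimal sketch of computing these metrics at the entity level (assuming the `seqeval` package, which scores whole spans rather than single tokens; the gold and predicted tag sequences are illustrative):

```python
from seqeval.metrics import precision_score, recall_score, f1_score

# Illustrative gold and predicted BIO sequences for two sentences.
y_true = [["O", "B-DISEASE", "I-DISEASE", "O"], ["B-DISEASE", "O"]]
y_pred = [["O", "B-DISEASE", "I-DISEASE", "O"], ["O", "O"]]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
```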

## Observations during Hyperparameter-Tuning

- A higher learning rate leads to worse results (lr: 0.1, f1: ~0.3).
- The Adam optimizer performs poorly in comparison to SGD with momentum.
- The f1-score no longer significantly changes/gets worse after ~5 epochs.
- Too low a learning rate leads to worse results (lr: 0.0001, f1: ~0.3).
- A slightly higher rate leads to better results (lr: 0.001, f1: ~0.5).
- A higher batch size leads to a slightly worse f1-score ({batch size: 16, f1: 0.92} -> {batch size: 32, f1: 0.83}).

Currently best parameters:

{'batch_size': 8, 'learning_rate': 0.01, 'epochs': 5, 'optimizer': 'SGD', 'max_tokens': 128} -> 0.93 (with base BERT)
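
A minimal sketch of the kind of grid search behind these observations (assuming PyTorch; `train_and_evaluate` is a hypothetical helper that trains the model with the given settings and returns the validation f1-score):

```python
import itertools

import torch

def make_optimizer(params, name: str, lr: float):
    """SGD with momentum vs. Adam, matching the comparison in the notes above."""
    if name == "SGD":
        return torch.optim.SGD(params, lr=lr, momentum=0.9)
    return torch.optim.Adam(params, lr=lr)

grid = {
    "batch_size": [8, 16, 32],
    "learning_rate": [0.0001, 0.001, 0.01, 0.1],
    "optimizer": ["SGD", "Adam"],
}

best_f1, best = 0.0, None
for bs, lr, opt in itertools.product(*grid.values()):
    # train_and_evaluate is a hypothetical helper: it builds the model,
    # calls make_optimizer, trains for 5 epochs and returns the f1-score.
    f1 = train_and_evaluate(batch_size=bs, learning_rate=lr, optimizer=opt,
                            epochs=5, max_tokens=128)
    if f1 > best_f1:
        best_f1 = f1
        best = {"batch_size": bs, "learning_rate": lr, "optimizer": opt}

print(best, "->", best_f1)
```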

## Actual Amount of Time Spent

- **requirements engineering**
*Time Planned*: 5h
*Time Spent*: 4h
*Notes*: Thanks to a colleague, finding appropriate tools for data annotation was easy.
- **capturing and annotating data**
*Time Planned*: 25h
*Time Spent*: 35h
*Notes*: Data annotation was much harder than I anticipated, largely due to the very complicated medical lingo and the unclearly defined difference between symptoms and medical conditions.
- **describing data**
*Time Planned*: 5h
*Time Spent*: 3.5h
*Notes*: The insights were incredibly useful for finding a good value for the maximum number of tokens.
- **implementing BERT**
*Time Planned*: 15h
*Time Spent*: 22h
*Notes*: Unexpected difficulties utilizing BioBERT for transfer learning.
- **tuning BERT**
*Time Planned*: 10h
*Time Spent*: 19h (active)
*Notes*: Setting up the environment for hyperparameter tuning was not as hard as expected, but the tuning itself was far more computationally intensive than anticipated. Furthermore, there was initially strange behaviour in the error metrics (an exploding error rate and rapid forgetting) - fixing the bug was rather expensive.