# Active Learning for Relation Extraction Task from Medical Text

Master Thesis <br>
M.Sc. in Data Science and Artificial Intelligence <br>
Saarland University <br>

Author: **Pablo Valdunciel Sánchez**

## Table of Contents

- [Overview](#overview)
- [Publication](#publication)
- [Motivation](#motivation)
- [Method and Results](#method-and-results)
- [Repository Structure](#repository-structure)
- [Reproducibility](#reproducibility)
- [Resources](#resources)
- [About](#about)

## Overview

This repository contains the code and resources for my master's thesis titled "Active Learning for Relation Extraction Task from Medical Text." The thesis explores methodologies to optimize Relation Extraction (RE) from medical texts using active learning techniques.

![Demonstration of entities and their annotated relations in the n2c2 corpus (Henry et al., 2020): Each instance may feature multiple entities, and the annotations indicate the presence or absence of a relation between any two entities.](https://github.com/user-attachments/assets/dc3ae018-26f9-4963-ba0b-7eb9b1818a08?raw=true)

<figcaption style="text-align:center; padding: 100px">
  Figure: Demonstration of entities and their annotated relations in the n2c2 corpus (Henry et al., 2020): Each instance may feature multiple entities, and the annotations indicate the presence or absence of a relation between any two entities.
</figcaption>

## Publication

The research and findings from this master's thesis formed the basis for a workshop paper published in the Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024).

---

### 📄 Optimizing Relation Extraction in Medical Texts through Active Learning: A Comparative Analysis of Trade-offs

**Authors:** Siting Liang, Pablo Valdunciel Sánchez, Daniel Sonntag  
**Published:** March 2024  
**Journal/Conference:** Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024)  
**DOI:** [10.18653/v1/2024.uncertainlp-1.3](https://aclanthology.org/2024.uncertainlp-1.3/)

---

## Motivation
The implementation of Health Information Technology (HIT) has substantially increased in healthcare systems worldwide, with a primary emphasis on the digitisation of medical records into Electronic Health Records (EHRs). EHRs incorporate a vast amount of useful health-related information, including relationships between biomedical entities such as drug-drug interactions, adverse drug events, and treatment efficacy. Biomedical publications also provide essential information regarding newly discovered protein-protein interactions, drug-drug interactions, and other types of biomedical relationships. However, given the vast amount of information available within biomedical and clinical documents, it is impractical for healthcare professionals to process this information manually. Therefore, automatic techniques, such as Biomedical and Clinical Relation Extraction, are necessary. Machine learning techniques have been developed for use in relation extraction tasks in the biomedical and clinical domains. Nevertheless, the annotation process required for these medical corpora is time-consuming and expensive. Active learning (AL) is a cost-effective method that labels only the most informative instances for model learning. This research aims to investigate the annotation costs associated with AL when used for relation extraction from biomedical and clinical texts using various supervised learning methods.

## Method and Results
This work explores the applicability of three distinct supervised learning methods from different ML families for relation extraction from biomedical (SemEval-2013, Task 9.2, also known as DDI corpus) and clinical (n2c2 2018, Track 2) texts within an active learning framework. The methods under consideration are Random Forest (a traditional ML method), a BiLSTM-based method (a deep learning method), and a Clinical BERT-based method (a language model-based method) with two input variations, which yields four method configurations in total. The AL framework employs random sampling as a baseline query strategy, along with Least Confidence (LC) and two batch-based strategies: BatchLC for Random Forest and BatchBALD for the other methods. The evaluation considers not only the performance achieved with significantly fewer annotated samples than a standard supervised learning approach but also the overall cost of the annotation process. This includes measuring the duration of the AL steps (i.e., the time required to query and retrain the model in each AL step) as well as the annotation rates of tokens and characters. An empirical study is conducted to identify the most suitable method for relation extraction in this context.
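As a rough illustration of the Least Confidence strategy named above, the sketch below scores each unlabelled instance by one minus its highest predicted class probability and queries the most uncertain ones. The function name `least_confidence_query` and the toy probabilities are assumptions made for this example; it is not the implementation used in the experiments.

```python
def least_confidence_query(probabilities, n_queries):
    """Return the indices of the n_queries instances the model is least sure about.

    probabilities: one list of class probabilities per unlabelled instance.
    """
    # Least Confidence score: 1 - probability of the most likely class.
    uncertainty = [1.0 - max(class_probs) for class_probs in probabilities]
    # Rank instance indices from most to least uncertain.
    ranked = sorted(range(len(probabilities)), key=lambda i: uncertainty[i], reverse=True)
    return ranked[:n_queries]


# Toy pool of four candidate relations with (no relation, relation) probabilities.
pool_probs = [
    [0.95, 0.05],
    [0.55, 0.45],
    [0.70, 0.30],
    [0.51, 0.49],
]
selected = least_confidence_query(pool_probs, n_queries=2)
print(selected)
```

The instances whose top-class probability is closest to chance are selected first, which is why LC tends to pick examples near the decision boundary.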

The findings indicate that AL can achieve comparable performance to traditional supervised approaches in relation extraction from biomedical and clinical documents while utilising significantly less annotated data. This reduction in annotated data lowers the cost of annotating a medical text corpus. The LC strategy outperformed the random baseline when applied with the Random Forest and Clinical BERT methods (One-sided Wilcoxon Signed-rank Test, p-values $< 10^{-9}$), whereas the batch-based strategies yielded poor results. We propose that LM-based methods are advantageous for interactive annotation processes due to their effective generalisation across diverse corpora, requiring minimal or no adjustment of input features.
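To make the statistical comparison more concrete, here is a minimal sketch of a one-sided Wilcoxon signed-rank test on paired scores. It computes an exact p-value by enumerating every sign assignment, which is only feasible for small samples. The function name `wilcoxon_signed_rank_greater` and the scores are made up for illustration and are not results from the thesis. In practice a library routine such as `scipy.stats.wilcoxon` with `alternative="greater"` would typically be used instead.

```python
from itertools import product


def wilcoxon_signed_rank_greater(x, y):
    """Exact one-sided test of H1: paired values in x tend to exceed those in y."""
    # Paired differences; zero differences carry no sign information and are dropped.
    diffs = [a - b for a, b in zip(x, y) if a != b]
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    # Test statistic: sum of the ranks that belong to positive differences.
    w_plus = sum(rank for rank, diff in zip(ranks, diffs) if diff > 0)
    # Under H0 every sign pattern is equally likely; count those at least as extreme.
    as_extreme = sum(
        1
        for signs in product((False, True), repeat=len(ranks))
        if sum(rank for rank, positive in zip(ranks, signs) if positive) >= w_plus
    )
    return w_plus, as_extreme / 2 ** len(ranks)


# Hypothetical paired scores (in percent) at six matching AL steps.
lc_scores = [71, 74, 78, 80, 83, 85]
random_scores = [70, 71, 73, 76, 77, 78]
w_plus, p_value = wilcoxon_signed_rank_greater(lc_scores, random_scores)
print(f"W+ = {w_plus}, p = {p_value}")
```

A small p-value indicates that the first strategy scores consistently higher than the second across the paired measurements.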
## Repository Structure

- `./data/`: Contains the two corpora used in the evaluation, the pre-trained word embeddings and the pre-trained Clinical BERT model.

- `./doc/`: Holds documentation related to the project, such as license and created figures.

- `./results/`: Stores the results of experiments, including .csv files with metrics and generated plots.

- `./src/`: The central location for all source files for the project.

- `./tests/`: Contains PyTest tests for the feature generation and the data models.

- `./pyproject.toml`: The project configuration file.

- `./requirements.txt`: Lists the packages required to run the dataset preprocessing and active learning experiments.

A more detailed `README` can be found in each of the subdirectories.

## Reproducibility

### 1. Python Setup

First, create a virtual environment and activate it.
```bash
python -m venv .venv
source .venv/bin/activate  # Linux/macOS; on Windows run: .venv\Scripts\activate
```

Second, install the dependencies.

```bash
pip install -r requirements.txt
```

### 2. Data download
Go to [data/ddi](data/ddi), [data/n2c2](data/n2c2), [data/bioword2vec](data/bioword2vec/) and [data/ml_models](data/ml_models/) and follow the instructions in the `README` files.

### 3. Data preprocessing

The following code preprocesses the desired corpus:

```Python
from preprocessing import split_sentences, generate_relations, generate_statistics

corpus = "n2c2" # or "ddi"

# Split documents into sentences (only does something for n2c2)
split_sentences(corpus)

# Generate relation collections
collections = generate_relations(corpus, save_to_disk=True)

# Generate statistics
generate_statistics(corpus, collections) # will be printed to console
```

For more information on how the preprocessing is implemented, please refer to the `README` in the [preprocessing](src/preprocessing/README.md) module.

### 4. Generation of training datasets

To run the experiments with the BiLSTM and BERT-based methods, it is necessary to generate HF Datasets with the precomputed input features and representations. These datasets are stored in the `./data/` folder and loaded at runtime, because computing them at runtime would be too slow.

The following code generates the HF Datasets for the desired corpus:

```Python
# Local Dependencies
# ------------------
from vocab import Vocabulary
from models import RelationCollection
from re_datasets import BilstmDatasetFactory, BertDatasetFactory

corpus = "n2c2" # or "ddi"

# Load collections
collections = RelationCollection.load_collections(corpus)

# Generate vocabulary
train_collection = collections["train"]
vocab = Vocabulary.create_vocabulary(corpus, train_collection, save_to_disk=True)

# Generate HF Datasets for BiLSTM and store on disk
BilstmDatasetFactory.create_datasets(corpus, collections, vocab)

# Generate HF Datasets for BERT and store on disk
BertDatasetFactory.create_datasets(corpus, collections, vocab)
```

### 5. Running experiments

To run the experiments, it is necessary to have the HF Datasets generated in the previous step.

The `./src/experiments` folder contains the code for running the different experiments. It is enough to adjust certain variables (e.g. number of repetitions, with/without logging, training configuration) before running the corresponding function. For example:

```Python
# Local Dependencies
from experiments import *

# run experiments with BiLSTM on n2c2
bilstm_passive_learning_n2c2()
bilstm_active_learning_n2c2()

# or run experiments with Paired Clinical BERT on DDI
bert_passive_learning_ddi(pairs=True)
bert_active_learning_ddi(pairs=True)
```

For more information on how the experimental setting is implemented, please refer to the `README` in the [experiments](src/experiments/README.md) module.

### 6. Tracking experiments

To track the experiments, we use [neptune.ai](https://docs.neptune.ai/). Neptune offers experiment tracking and a model registry for machine learning projects, storing all the data online and making it accessible through a web interface.

To use Neptune, you need to create an account, a new project, and an API token. Then assign the name of the project and the API token to the `NEPTUNE_PROJECT` and `NEPTUNE_API_TOKEN` variables, respectively, in the [config.py](./src/config.py) file.

```Python
# config.py
NEPTUNE_PROJECT = "your_username/your_project_name"
NEPTUNE_API_TOKEN = "your_api_token"
```
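If you would rather not store the API token in a tracked file, one option is to read both values from environment variables. The following is a minimal sketch, assuming variables named `NEPTUNE_PROJECT` and `NEPTUNE_API_TOKEN` have been exported in your shell. The helper `load_neptune_settings` is hypothetical and not part of this repository.

```python
import os


def load_neptune_settings(env=os.environ):
    """Read the Neptune settings from a mapping, falling back to placeholders."""
    return {
        # Placeholder defaults mirror the template values used in config.py.
        "project": env.get("NEPTUNE_PROJECT", "your_username/your_project_name"),
        "api_token": env.get("NEPTUNE_API_TOKEN", "your_api_token"),
    }


settings = load_neptune_settings()
NEPTUNE_PROJECT = settings["project"]
NEPTUNE_API_TOKEN = settings["api_token"]
```

Passing the environment as a parameter keeps the helper easy to test with a plain dictionary.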

If you don't want to use Neptune, you can set the `logging` parameter to `False` when running an experiment.

```Python
bilstm_passive_learning_n2c2(logging=False)
```

## Resources

### Corpora
 - [2018 n2c2 challenge](https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/)
 - [DDI Extraction corpus](https://github.com/isegura/DDICorpus)

### Word Embeddings
 - [BioWord2vec](https://github.com/ncbi-nlp/BioWordVec)

### Pre-trained Models
 - [Clinical BERT](https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT?text=He+was+administered+10+mg+of+%5BMASK%5D.)

### Libraries

- NLP:
    - [spaCy](https://spacy.io/)
    - [negspaCy](https://github.com/jenojp/negspacy)
    - [ScispaCy](https://allenai.github.io/scispacy/)
    - [gensim](https://radimrehurek.com/gensim/)
    - [PyRuSH](https://github.com/jianlins/PyRuSH)

<br>

- Machine Learning:
    - [sklearn](https://scikit-learn.org/stable/)
    - [imbalanced-learn](https://imbalanced-learn.org/stable/)

<br>

- Deep Learning:
    - [PyTorch](https://pytorch.org/)
    - [TorchMetrics](https://torchmetrics.readthedocs.io/en/latest/)
    - [HuggingFace Transformers](https://huggingface.co/transformers/)
    - [HuggingFace Datasets](https://huggingface.co/docs/datasets/)
    - [Joey NMT](https://github.com/joeynmt/joeynmt)

<br>

- Active Learning:
    - [Baal](https://baal.readthedocs.io/en/latest/)
    - [modAL](https://modal-python.readthedocs.io/en/latest/)

<br>

- Visualisations:
    - [seaborn](https://seaborn.pydata.org/)
    - [bertviz](https://github.com/jessevig/bertviz)
    - [dtreeviz](https://github.com/parrt/dtreeviz)

<br>

- Experiments metadata store:
    - [neptune.ai](https://neptune.ai/)

## About

This project was developed as part of **Pablo Valdunciel Sánchez**'s master's thesis in the *Data Science and Artificial Intelligence* master's programme at Saarland University (Germany). The work was carried out in collaboration with the German Research Centre for Artificial Intelligence (DFKI), supervised by **Prof. Dr. Antonio Krüger**, CEO and scientific director of the DFKI, and **Prof. Dr. Daniel Sonntag**, director of the Interactive Machine Learning (IML) department at DFKI, and advised by **Siting Liang**, researcher in the IML department.