# Active Learning for Relation Extraction Task from Medical Text

Master Thesis <br>
M.Sc. in Data Science and Artificial Intelligence <br>
Saarland University <br>

Author: **Pablo Valdunciel Sánchez**


## Table of Contents

- [Overview](#overview)
- [Publication](#publication)
- [Motivation](#motivation)
- [Method and Results](#method-and-results)
- [Repository Structure](#repository-structure)
- [Reproducibility](#reproducibility)
- [Resources](#resources)
- [About](#about)

## Overview

This repository contains the code and resources for my master's thesis titled "Active Learning for Relation Extraction Task from Medical Text." The thesis explores methodologies to optimize Relation Extraction (RE) from medical texts using active learning techniques.

![Demonstration of entities and their annotated relations in the n2c2 corpus (Henry et al., 2020): Each instance may feature multiple entities, and the annotations indicate the presence or absence of a relation between any two entities.](https://github.com/user-attachments/assets/dc3ae018-26f9-4963-ba0b-7eb9b1818a08)

<figcaption style="text-align:center; padding: 100px">
  Figure: Demonstration of entities and their annotated relations in the n2c2 corpus (Henry et al., 2020): Each instance may feature multiple entities, and the annotations indicate the presence or absence of a relation between any two entities.
</figcaption>

## Publication

The research and findings from this master's thesis formed the basis for a workshop paper published in the Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024).

---

### 📄 Optimizing Relation Extraction in Medical Texts through Active Learning: A Comparative Analysis of Trade-offs

**Authors:** Siting Liang, Pablo Valdunciel Sánchez, Daniel Sonntag <br>
**Published:** March 2024 <br>
**Journal/Conference:** Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024) <br>
**DOI:** [10.18653/v1/2024.uncertainlp-1.3](https://aclanthology.org/2024.uncertainlp-1.3/)

---

## Motivation

The implementation of Health Information Technology (HIT) has substantially increased in healthcare systems worldwide, with a primary emphasis on the digitisation of medical records into Electronic Health Records (EHRs). EHRs incorporate a vast amount of useful health-related information, including relationships between biomedical entities such as drug-drug interactions, adverse drug events, and treatment efficacy. Biomedical publications also provide essential information regarding newly discovered protein-protein interactions, drug-drug interactions, and other types of biomedical relationships. However, given the vast amount of information available within biomedical and clinical documents, it is impractical for healthcare professionals to process this information manually. Therefore, automatic techniques, such as Biomedical and Clinical Relation Extraction, are necessary.

Machine learning techniques have been developed for relation extraction in the biomedical and clinical domains. Nevertheless, the annotation process required for these medical corpora is time-consuming and expensive. Active learning (AL) reduces this cost by selecting only the most informative instances for annotation and model learning. This research investigates the annotation costs associated with AL when used for relation extraction from biomedical and clinical texts with various supervised learning methods.

## Method and Results

This work explores the applicability of three distinct supervised learning methods from different ML families for relation extraction from biomedical (SemEval-2013, Task 9.2, also known as the DDI corpus) and clinical (n2c2 2018, Track 2) texts within an active learning framework. The three methods under consideration are Random Forest (a traditional ML method), a BiLSTM-based method (a deep learning method), and a Clinical BERT-based method (a language model-based method), the last with two variations of input. The AL framework employs random sampling as a baseline query strategy, along with Least Confidence (LC) and two batch-based strategies: BatchLC for Random Forest and BatchBALD for the other methods. The evaluation considers not only the performance achieved with significantly fewer annotated samples than a standard supervised learning approach, but also the overall cost of the annotation process. This includes measuring the AL step times (i.e., the time required to query and retrain the model in each AL step) as well as the annotation rates of tokens and characters. An empirical study is conducted to identify the most suitable method for relation extraction in this context.

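As an illustration of the Least Confidence strategy mentioned above, the following sketch scores each unlabelled instance by one minus its highest predicted class probability and queries the highest-scoring instances. It is a minimal, library-free example under stated assumptions, not the implementation used in this repository: the function names `least_confidence` and `query_least_confident` and the toy probabilities are hypothetical.

```python
from typing import List, Sequence


def least_confidence(probabilities: Sequence[Sequence[float]]) -> List[float]:
    """Uncertainty score per instance: 1 minus the highest class probability."""
    return [1.0 - max(p) for p in probabilities]


def query_least_confident(
    probabilities: Sequence[Sequence[float]], query_size: int
) -> List[int]:
    """Return the indices of the `query_size` most uncertain pool instances."""
    scores = least_confidence(probabilities)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:query_size]


# Made-up class probabilities for four unlabelled instances (illustration only)
pool_probabilities = [
    [0.95, 0.05],
    [0.55, 0.45],
    [0.70, 0.30],
    [0.51, 0.49],
]

selected = query_least_confident(pool_probabilities, query_size=2)
print(selected)  # [3, 1]
```

In a pool-based AL loop, the selected indices would be sent for annotation and moved from the unlabelled pool to the training set before the model is retrained.
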
The findings indicate that AL can achieve performance comparable to traditional supervised approaches in relation extraction from biomedical and clinical documents while using significantly less annotated data. This reduction in annotated data lowers the cost of annotating a medical text corpus. The LC strategy outperformed the random baseline when applied with the Random Forest and Clinical BERT methods (one-sided Wilcoxon signed-rank test, p-values $< 10^{-9}$), whereas the batch-based strategies yielded poor results. We propose that LM-based methods are advantageous for interactive annotation processes due to their effective generalisation across diverse corpora, requiring minimal or no adjustment of input features.

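The AL step time used in the cost evaluation (the time required to query and retrain the model in each AL step) could be measured along the following lines. This is only a sketch: `timed_al_step` and the stand-in `query` and `retrain` callables are hypothetical and do not correspond to functions in this repository.

```python
import time
from typing import Callable, Dict, List


def timed_al_step(
    query: Callable[[], List[int]],
    retrain: Callable[[List[int]], None],
) -> Dict[str, float]:
    """Run one AL step and report query, retraining and total time in seconds."""
    start = time.perf_counter()
    queried_indices = query()  # select the instances to annotate
    after_query = time.perf_counter()
    retrain(queried_indices)  # update the model with the newly labelled data
    end = time.perf_counter()
    return {
        "query_time": after_query - start,
        "train_time": end - after_query,
        "step_time": end - start,
    }


# Stand-in callables (hypothetical) so that the sketch runs on its own
labelled: List[int] = []
timings = timed_al_step(query=lambda: [3, 1], retrain=labelled.extend)
print(timings)
```
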
## Repository Structure

- `./data/`: Contains the two corpora used in the evaluation, the pre-trained word embeddings and the pre-trained Clinical BERT model.

- `./doc/`: Holds documentation related to the project, such as the license and the created figures.

- `./results/`: Stores the results of the experiments, including `.csv` files with metrics and the generated plots.

- `./src/`: The central location for all source files of the project.

- `./tests/`: Contains pytest tests for the feature generation and the data models.

- `./pyproject.toml`: The project configuration file.

- `./requirements.txt`: Lists the packages required to run the dataset preprocessing and the active learning experiments.

A more detailed `README` can be found in each of the subdirectories.


## Reproducibility

### 1. Python Setup

First, create a virtual environment and activate it.

```bash
python -m venv .venv
source .venv/bin/activate  # Linux/macOS; on Windows run: .venv\Scripts\activate
```

Second, install the dependencies.

```bash
pip install -r requirements.txt
```

### 2. Data download

Go to [data/ddi](data/ddi), [data/n2c2](data/n2c2), [data/bioword2vec](data/bioword2vec/) and [data/ml_models](data/ml_models/) and follow the instructions in the `README` files.


### 3. Data preprocessing

The following code preprocesses the desired corpus:

```Python
from preprocessing import generate_relations, generate_statistics, split_sentences

corpus = "n2c2"  # or "ddi"

# Split documents into sentences (only has an effect for n2c2)
split_sentences(corpus)

# Generate the relation collections and save them to disk
collections = generate_relations(corpus, save_to_disk=True)

# Generate corpus statistics (printed to the console)
generate_statistics(corpus, collections)
```

For more information on how the preprocessing is implemented, please refer to the `README` in the [preprocessing](src/preprocessing/README.md) module.


### 4. Generation of training datasets

To run the experiments with the BiLSTM and BERT-based methods, it is necessary to generate Hugging Face (HF) Datasets with the precomputed input features and representations. These datasets are stored in the `./data/` folder and loaded at runtime, because computing them at runtime would be too slow.

The following code generates the HF Datasets for the desired corpus:

```Python
# Local Dependencies
# ------------------
from vocab import Vocabulary
from models import RelationCollection
from re_datasets import BilstmDatasetFactory, BertDatasetFactory

corpus = "n2c2"  # or "ddi"

# Load the relation collections generated during preprocessing
collections = RelationCollection.load_collections(corpus)

# Generate the vocabulary from the training split
train_collection = collections["train"]
vocab = Vocabulary.create_vocabulary(corpus, train_collection, save_to_disk=True)

# Generate HF Datasets for BiLSTM and store them on disk
BilstmDatasetFactory.create_datasets(corpus, collections, vocab)

# Generate HF Datasets for BERT and store them on disk
BertDatasetFactory.create_datasets(corpus, collections, vocab)
```


### 5. Running experiments

To run the experiments, it is necessary to have the HF Datasets generated in the previous step.

The `./src/experiments` folder contains the code for running the different experiments. It is enough to adjust certain variables (e.g. number of repetitions, with/without logging, training configuration) before running the corresponding function. For example:

```Python
# Local Dependencies
from experiments import (
    bert_active_learning_ddi,
    bert_passive_learning_ddi,
    bilstm_active_learning_n2c2,
    bilstm_passive_learning_n2c2,
)

# Run experiments with BiLSTM on n2c2
bilstm_passive_learning_n2c2()
bilstm_active_learning_n2c2()

# Or run experiments with Paired Clinical BERT on DDI
bert_passive_learning_ddi(pairs=True)
bert_active_learning_ddi(pairs=True)
```

For more information on how the experimental setting is implemented, please refer to the `README` in the [experiments](src/experiments/README.md) module.

### 6. Tracking experiments

To track the experiments, we have chosen [neptune.ai](https://docs.neptune.ai/). Neptune offers experiment tracking and a model registry for machine learning projects, storing all the data online and making it accessible through a simple web interface.

To use Neptune, create an account, a new project and a new API token. Then assign the project name and the API token to the `NEPTUNE_PROJECT` and `NEPTUNE_API_TOKEN` variables, respectively, in the [config.py](./src/config.py) file.

```Python
# config.py
NEPTUNE_PROJECT = "your_username/your_project_name"
NEPTUNE_API_TOKEN = "your_api_token"
```

If you don't want to use Neptune, you can set the `logging` parameter to `False` when running an experiment.

```Python
bilstm_passive_learning_n2c2(logging=False)
```


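To avoid committing the API token to version control, one option is to read both values from environment variables instead of hard-coding them. The following is a sketch of that alternative and not the repository's actual `config.py`: it assumes that `config.py` only needs to expose these two names, and the environment variable names are chosen here simply to mirror the variable names.

```python
# config.py (sketch): read the Neptune settings from the environment
import os

# The fallbacks are the placeholders from the example above; export real values
# in your shell (or replace the placeholders) before enabling logging.
NEPTUNE_PROJECT = os.environ.get("NEPTUNE_PROJECT", "your_username/your_project_name")
NEPTUNE_API_TOKEN = os.environ.get("NEPTUNE_API_TOKEN", "your_api_token")
```
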
## Resources

### Corpora

- [2018 n2c2 challenge](https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/)
- [DDI Extraction corpus](https://github.com/isegura/DDICorpus)

### Word Embeddings

- [BioWordVec](https://github.com/ncbi-nlp/BioWordVec)

### Pre-trained Models

- [Clinical BERT](https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT)


### Libraries

- NLP:
    - [spaCy](https://spacy.io/)
    - [negspaCy](https://github.com/jenojp/negspacy)
    - [ScispaCy](https://allenai.github.io/scispacy/)
    - [gensim](https://radimrehurek.com/gensim/)
    - [PyRuSH](https://github.com/jianlins/PyRuSH)

<br>

- Machine Learning:
    - [scikit-learn](https://scikit-learn.org/stable/)
    - [imbalanced-learn](https://imbalanced-learn.org/stable/)

<br>

- Deep Learning:
    - [PyTorch](https://pytorch.org/)
    - [TorchMetrics](https://torchmetrics.readthedocs.io/en/latest/)
    - [HuggingFace Transformers](https://huggingface.co/transformers/)
    - [HuggingFace Datasets](https://huggingface.co/docs/datasets/)
    - [Joey NMT](https://github.com/joeynmt/joeynmt)

<br>

- Active Learning:
    - [Baal](https://baal.readthedocs.io/en/latest/)
    - [modAL](https://modal-python.readthedocs.io/en/latest/)

<br>

- Visualisations:
    - [seaborn](https://seaborn.pydata.org/)
    - [bertviz](https://github.com/jessevig/bertviz)
    - [dtreeviz](https://github.com/parrt/dtreeviz)

<br>

- Experiments metadata store:
    - [neptune.ai](https://neptune.ai/)


## About

This project was developed as part of **Pablo Valdunciel Sánchez**'s master's thesis in the *Data Science and Artificial Intelligence* master's programme at Saarland University (Germany). The work was carried out in collaboration with the German Research Centre for Artificial Intelligence (DFKI). It was supervised by **Prof. Dr. Antonio Krüger**, CEO and scientific director of DFKI, and **Prof. Dr. Daniel Sonntag**, director of the Interactive Machine Learning (IML) department at DFKI, and advised by **Siting Liang**, researcher in the IML department.