# Active Learning for Relation Extraction Task from Medical Text

Master Thesis <br>
M.Sc. in Data Science and Artificial Intelligence <br>
Saarland University <br>

Author: **Pablo Valdunciel Sánchez**


## Table of Contents

- [Overview](#overview)
- [Publication](#publication)
- [Motivation](#motivation)
- [Method and Results](#method-and-results)
- [Repository Structure](#repository-structure)
- [Reproducibility](#reproducibility)
- [Resources](#resources)
- [About](#about)

## Overview

This repository contains the code and resources for my master's thesis, "Active Learning for Relation Extraction Task from Medical Text." The thesis explores methodologies to optimize Relation Extraction (RE) from medical texts using active learning techniques.



<figcaption style="text-align:center; padding: 100px">
Figure: Entities and their annotated relations in the n2c2 corpus (Henry et al., 2020). Each instance may feature multiple entities, and the annotations indicate the presence or absence of a relation between any two entities.
</figcaption>

## Publication

The research and findings from this master's thesis formed the basis for a workshop paper published in the Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024).

---

### 📄 Optimizing Relation Extraction in Medical Texts through Active Learning: A Comparative Analysis of Trade-offs

**Authors:** Siting Liang, Pablo Valdunciel Sánchez, Daniel Sonntag <br>
**Published:** March 2024 <br>
**Journal/Conference:** Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024) <br>
**DOI:** [10.18653/v1/2024.uncertainlp-1.3](https://aclanthology.org/2024.uncertainlp-1.3/)

---

## Motivation

The implementation of Health Information Technology (HIT) has substantially increased in healthcare systems worldwide, with a primary emphasis on the digitisation of medical records into Electronic Health Records (EHRs). EHRs incorporate a vast amount of useful health-related information, including relationships between biomedical entities such as drug-drug interactions, adverse drug events, and treatment efficacy. Biomedical publications also provide essential information regarding newly discovered protein-protein interactions, drug-drug interactions, and other types of biomedical relationships. However, given the vast amount of information available within biomedical and clinical documents, it is impractical for healthcare professionals to process this information manually. Therefore, automatic techniques, such as Biomedical and Clinical Relation Extraction, are necessary. Machine learning techniques have been developed for use in relation extraction tasks in the biomedical and clinical domains. Nevertheless, the annotation process required for these medical corpora is time-consuming and expensive. Active learning (AL) is a cost-effective method that labels only the most informative instances for model learning. This research aims to investigate the annotation costs associated with AL when used for relation extraction from biomedical and clinical texts using various supervised learning methods.

## Method and Results

This work explores the applicability of three distinct supervised learning methods from different ML families for relation extraction from biomedical (SemEval-2013 Task 9.2, also known as the DDI corpus) and clinical (n2c2 2018, Track 2) texts within an active learning framework. The methods under consideration are a Random Forest (a traditional ML method), a BiLSTM-based method (a deep learning method), and a Clinical BERT-based method (a language model-based method) with two variations of input. The AL framework employs random sampling as a baseline query strategy, along with Least Confidence (LC) and two batch-based strategies: BatchLC for the Random Forest and BatchBALD for the other methods. The evaluation considers not only the performance achieved with significantly fewer annotated samples compared to a standard supervised learning approach, but also the overall cost of the annotation process. This includes measuring the duration of the AL steps (i.e., the time required to query and retrain the model in each AL step) as well as the annotation rates of tokens and characters. An empirical study is conducted to identify the most suitable method for relation extraction in this context.

The findings indicate that AL can achieve performance comparable to traditional supervised approaches in relation extraction from biomedical and clinical documents while utilising significantly less annotated data. This reduction leads to lower costs for annotating a medical text corpus. The LC strategy outperformed the random baseline when applied with the Random Forest and Clinical BERT methods (one-sided Wilcoxon signed-rank test, p-values $< 10^{-9}$), whereas the batch-based strategies yielded poor results. We propose that LM-based methods are advantageous for interactive annotation processes due to their effective generalisation across diverse corpora, requiring minimal or no adjustment of input features.
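
The Least Confidence strategy can be illustrated with a short sketch. This is not the repository's implementation: `least_confidence_query` and the toy probability matrix below are hypothetical, assuming only that the model exposes class-probability estimates for each unlabelled instance.

```python
import numpy as np

def least_confidence_query(probs: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k pool instances whose most likely
    label has the lowest predicted probability (Least Confidence)."""
    confidence = probs.max(axis=1)     # P(most likely class | x) per instance
    return np.argsort(confidence)[:k]  # least confident first

# Toy pool of 4 unlabelled instances over 3 relation classes
probs = np.array([
    [0.90, 0.05, 0.05],  # confident prediction
    [0.40, 0.35, 0.25],  # uncertain
    [0.34, 0.33, 0.33],  # most uncertain
    [0.70, 0.20, 0.10],
])
print(least_confidence_query(probs, k=2))  # → [2 1]
```

In each AL step, the selected instances would be sent to the annotator, added to the labelled set, and the model retrained before the next query.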

## Repository Structure

- `./data/`: Contains the two corpora used in the evaluation, the pre-trained word embeddings and the pre-trained Clinical BERT model.
- `./doc/`: Holds documentation related to the project, such as the license and created figures.
- `./results/`: Stores the results of experiments, including `.csv` files with metrics and generated plots.
- `./src/`: The central location for all source files for the project.
- `./tests/`: Contains PyTest tests for the feature generation and the data models.
- `./pyproject.toml`: The project configuration file.
- `./requirements.txt`: Lists the packages required to run the dataset preprocessing and active learning experiments.

A more detailed `README` can be found in each of the subdirectories.

## Reproducibility

### 1. Python Setup

First, create a virtual environment and activate it:

```bash
python -m venv .venv
source .venv/bin/activate   # Linux/macOS; on Windows: .venv\Scripts\activate
```

Second, install the dependencies:

```bash
pip install -r requirements.txt
```

### 2. Data download

Go to [data/ddi](data/ddi), [data/n2c2](data/n2c2), [data/bioword2vec](data/bioword2vec/) and [data/ml_models](data/ml_models/) and follow the instructions in the `README` files.

### 3. Data preprocessing

The following code preprocesses the desired corpus:

```Python
from preprocessing import *

corpus = "n2c2"  # or "ddi"

# Split documents into sentences (only does something for n2c2)
split_sentences(corpus)

# Generate relation collections
collections = generate_relations(corpus, save_to_disk=True)

# Generate statistics (printed to the console)
generate_statistics(corpus, collections)
```

For more information on how the preprocessing is implemented, please refer to the `README` in the [preprocessing](src/preprocessing/README.md) module.

### 4. Generation of training datasets

To run the experiments with the BiLSTM and BERT-based methods, it is necessary to generate HF Datasets with the precomputed input features and representations. These datasets are stored in the `./data/` folder and can be loaded at runtime; computing them at runtime would be too slow.

The following code generates the HF Datasets for the desired corpus:

```Python
# Local Dependencies
# ------------------
from vocab import Vocabulary
from models import RelationCollection
from re_datasets import BilstmDatasetFactory, BertDatasetFactory

corpus = "n2c2"  # or "ddi"

# Load collections
collections = RelationCollection.load_collections(corpus)

# Generate vocabulary
train_collection = collections["train"]
vocab = Vocabulary.create_vocabulary(corpus, train_collection, save_to_disk=True)

# Generate HF Datasets for BiLSTM and store them on disk
BilstmDatasetFactory.create_datasets(corpus, collections, vocab)

# Generate HF Datasets for BERT and store them on disk
BertDatasetFactory.create_datasets(corpus, collections, vocab)
```

### 5. Running experiments

To run the experiments, it is necessary to have the HF Datasets generated in the previous step.

The `./src/experiments` folder contains the code for running the different experiments. It is enough to adjust certain variables (e.g. number of repetitions, with/without logging, training configuration) before running the corresponding function. For example:

```Python
# Local Dependencies
from experiments import *

# run experiments with BiLSTM on n2c2
bilstm_passive_learning_n2c2()
bilstm_active_learning_n2c2()

# or run experiments with Paired Clinical BERT on DDI
bert_passive_learning_ddi(pairs=True)
bert_active_learning_ddi(pairs=True)
```

For more information on how the experimental setting is implemented, please refer to the `README` in the [experiments](src/experiments/README.md) module.

### 6. Tracking experiments

To track the experiments, we have chosen [neptune.ai](https://docs.neptune.ai/). Neptune offers experiment tracking and a model registry for machine learning projects, storing all the data online and making it accessible through a simple web interface.

To use Neptune, you need to create an account, create a new project and generate an API token. Then assign the name of the project and the API token to the `NEPTUNE_PROJECT` and `NEPTUNE_API_TOKEN` variables in the [config.py](./src/config.py) file:

```Python
# config.py
NEPTUNE_PROJECT = "your_username/your_project_name"
NEPTUNE_API_TOKEN = "your_api_token"
```

If you don't want to use Neptune, you can set the `logging` parameter to `False` when running an experiment:

```Python
bilstm_passive_learning_n2c2(logging=False)
```

## Resources

### Corpora

- [2018 n2c2 challenge](https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/)
- [DDI Extraction corpus](https://github.com/isegura/DDICorpus)

### Word Embeddings

- [BioWordVec](https://github.com/ncbi-nlp/BioWordVec)

### Pre-trained Models

- [Clinical BERT](https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT?text=He+was+administered+10+mg+of+%5BMASK%5D.)

### Libraries

- NLP:
  - [spaCy](https://spacy.io/)
  - [negspaCy](https://github.com/jenojp/negspacy)
  - [ScispaCy](https://allenai.github.io/scispacy/)
  - [gensim](https://radimrehurek.com/gensim/)
  - [PyRuSH](https://github.com/jianlins/PyRuSH)

<br>

- Machine Learning:
  - [sklearn](https://scikit-learn.org/stable/)
  - [imbalanced-learn](https://imbalanced-learn.org/stable/)

<br>

- Deep Learning:
  - [PyTorch](https://pytorch.org/)
  - [TorchMetrics](https://torchmetrics.readthedocs.io/en/latest/)
  - [HuggingFace Transformers](https://huggingface.co/transformers/)
  - [HuggingFace Datasets](https://huggingface.co/docs/datasets/)
  - [Joey NMT](https://github.com/joeynmt/joeynmt)

<br>

- Active Learning:
  - [Baal](https://baal.readthedocs.io/en/latest/)
  - [modAL](https://modal-python.readthedocs.io/en/latest/)

<br>

- Visualisations:
  - [seaborn](https://seaborn.pydata.org/)
  - [bertviz](https://github.com/jessevig/bertviz)
  - [dtreeviz](https://github.com/parrt/dtreeviz)

<br>

- Experiments metadata store:
  - [neptune.ai](https://neptune.ai/)

## About

This project was developed as part of **Pablo Valdunciel Sánchez**'s master's thesis in the *Data Science and Artificial Intelligence* master's programme at Saarland University (Germany). The work was carried out in collaboration with the German Research Centre for Artificial Intelligence (DFKI), supervised by **Prof. Dr. Antonio Krüger**, CEO and scientific director of the DFKI, and **Prof. Dr. Daniel Sonntag**, director of the Interactive Machine Learning (IML) department at DFKI, and advised by **Siting Liang**, researcher in the IML department.
|