# Disease-NER
Given a medical diagnosis, this project identifies medical conditions in the text and maps them to standardized medical codes.

## Data
The data directory contains:
- `entities.tsv`: the disease mentions extracted from the text files.
- `text` directory: the text files containing the medical text.

The data is taken from the English version of the multilingual resources of the DisTEMIST 2022 task: https://zenodo.org/record/6532684
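
As a rough illustration of how the mentions file could be read, the sketch below groups mentions by source file. The column names (`filename`, `mark`, `label`, `off0`, `off1`, `span`) are an assumption based on the DisTEMIST release format and may not match `entities.tsv` in this repository, so check the file header first. The sample rows and the `load_mentions` helper are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; the column names are an assumption,
# not confirmed against this repository's entities.tsv.
SAMPLE_TSV = (
    "filename\tmark\tlabel\toff0\toff1\tspan\n"
    "case_001\tT1\tENFERMEDAD\t10\t18\tdiabetes\n"
    "case_001\tT2\tENFERMEDAD\t23\t35\thypertension\n"
)


def load_mentions(handle):
    """Group mentions by source file as (start, end, text) tuples."""
    mentions = defaultdict(list)
    # QUOTE_NONE keeps quotation marks inside mention text untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        mentions[row["filename"]].append(
            (int(row["off0"]), int(row["off1"]), row["span"])
        )
    return dict(mentions)


mentions = load_mentions(io.StringIO(SAMPLE_TSV))
```

For the real file, the same function can be given a handle from `open(path, newline="", encoding="utf-8")`.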

## Pre-processing
The pre-processing stage involves:
- Splitting the medical text in each file into sentences.
- Tokenizing the sentences into words/tokens.
- Computing IOB tags for the tokens for the named entity recognition (NER) task.

- Code: [Pre-processing.ipynb](Pre-processing.ipynb)
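
The three steps can be sketched in plain Python as follows. This is a simplified illustration, not the notebook's implementation: the regex-based sentence splitter and tokenizer, the `DISEASE` label, and all function names are assumptions made for the example. Mentions are given as character offsets, and a token is tagged only if it lies fully inside a mention.

```python
import re

TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def split_sentences(text):
    """Return (start, end) character spans, splitting naively after . ! ?"""
    spans, start = [], 0
    for match in re.finditer(r"[.!?]+\s+", text):
        # End the sentence after the punctuation, before the whitespace.
        spans.append((start, match.start() + len(match.group().rstrip())))
        start = match.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans


def tokenize(text, start, end):
    """Return (token, start, end) triples with offsets into the full text."""
    return [(m.group(), m.start(), m.end())
            for m in TOKEN_RE.finditer(text, start, end)]


def iob_tags(tokens, mentions, label="DISEASE"):
    """Tag each token B-/I-/O against (start, end) mention character spans."""
    tags = []
    for _, tok_start, tok_end in tokens:
        tag = "O"
        for m_start, m_end in mentions:
            if tok_start >= m_start and tok_end <= m_end:
                tag = ("B-" if tok_start == m_start else "I-") + label
                break
        tags.append(tag)
    return tags


text = "The patient has type 2 diabetes. No fever was reported."
mentions = [(16, 31)]  # character span of "type 2 diabetes"

tagged = []
for s_start, s_end in split_sentences(text):
    tokens = tokenize(text, s_start, s_end)
    tags = iob_tags(tokens, mentions)
    tagged.append([(tok, tag) for (tok, _, _), tag in zip(tokens, tags)])
```

Keeping offsets relative to the full document means the mention spans can be matched against tokens without re-aligning them for each sentence.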

## NER Task
- Two types of models are built:
  - The entire clinical case (document) is given as input.
  - The text is split into sentences, and each sentence is given as input.

- The base models used are:
  - https://huggingface.co/d4data/biomedical-ner-all
  - https://huggingface.co/pucpr/clinicalnerpt-medical

- Disease mention identification is framed as a token classification problem.

- Code: [Entities_NER.ipynb](Entities_NER.ipynb)
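
Because the task is framed as token classification, the per-token IOB predictions have to be merged back into disease mentions. A minimal sketch of that decoding step is shown below; the function name, the `DISEASE` label, and the lenient handling of a stray `I-` tag (treated as the start of a new mention) are assumptions for illustration and may differ from what the notebook does.

```python
def iob_to_mentions(tokens, tags):
    """Merge per-token IOB tags into (label, mention_text) pairs."""
    spans, current = [], None  # current = [label, start_index]
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        continues = (prefix == "I" and current is not None
                     and current[0] == label)
        if current is not None and not continues:
            # The open mention ends just before this token.
            spans.append((current[0], current[1], i))
            current = None
        if prefix == "B" or (prefix == "I" and current is None):
            current = [label, i]
    if current is not None:
        spans.append((current[0], current[1], len(tags)))
    return [(label, " ".join(tokens[a:b])) for label, a, b in spans]


tokens = ["The", "patient", "has", "type", "2", "diabetes",
          "and", "asthma", "."]
tags = ["O", "O", "O", "B-DISEASE", "I-DISEASE", "I-DISEASE",
        "O", "B-DISEASE", "O"]
found = iob_to_mentions(tokens, tags)
```

With sub-word tokenizers, the tags would first need to be aligned to whole words before this step.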

## Entity Linking Task
- The disease mentions are linked to SNOMED CT codes.

- The models used are:
  - SapBERT: https://huggingface.co/cambridgeltl/SapBERT-from-PubMedBERT-fulltext
  - RoBERTa-Large: https://huggingface.co/raynardj/pmc-med-bio-mlm-roberta-large
  - PubMedBERT: https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract

- Code: [EL.ipynb](EL.ipynb) (SapBERT), [EL_roberta.ipynb](EL_roberta.ipynb) (RoBERTa-Large), [EL_pubmedbert.ipynb](EL_pubmedbert.ipynb) (PubMedBERT)
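
This README does not describe how a mention is matched to a code. One common approach with encoders such as SapBERT is to embed both the mention and the candidate concept names, then pick the concept with the highest cosine similarity. The sketch below shows only that nearest-neighbour step, under the assumption that embeddings are already available. The vectors are toy values and the codes are placeholders, not real SNOMED CT identifiers.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def link(mention_vec, concept_vecs):
    """Return (code, similarity) of the closest concept embedding."""
    scored = ((code, cosine(mention_vec, vec))
              for code, vec in concept_vecs.items())
    return max(scored, key=lambda pair: pair[1])


# Toy 3-dimensional vectors standing in for encoder outputs.
# "CODE_A" and "CODE_B" are placeholders, not real SNOMED CT codes.
concept_vecs = {
    "CODE_A": [0.9, 0.1, 0.0],
    "CODE_B": [0.0, 0.8, 0.6],
}
mention_vec = [0.8, 0.2, 0.1]

best_code, score = link(mention_vec, concept_vecs)
```

For a full terminology, the linear scan would typically be replaced by a vector index, but the ranking criterion stays the same.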