# Disease-NER
Given a medical diagnosis text, this project identifies the medical conditions mentioned in it and maps them to standardized medical codes.

## Data
The data directory contains:
- `entities.tsv`, which stores the disease mentions found in the text files.
- The `text` directory, which contains the text files with the medical text.

The data is taken from the English version of the multilingual resources of the DisTEMIST 2022 task: https://zenodo.org/record/6532684
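
As a rough illustration of how the mentions file could be read, here is a minimal sketch. The column names (`filename`, `off0`, `off1`, `span`) and the sample rows are assumptions made up for this example; check the header of the actual `entities.tsv` before relying on them.

```python
import csv
import io

# Made-up sample in the ASSUMED layout; the real entities.tsv may use
# different column names and contains real annotations.
SAMPLE_TSV = (
    "filename\toff0\toff1\tspan\n"
    "case_001\t24\t32\tdiabetes\n"
    "case_001\t55\t67\thypertension\n"
)


def read_mentions(handle):
    """Group disease mentions by source file as (start, end, text) tuples."""
    mentions = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        span = (int(row["off0"]), int(row["off1"]), row["span"])
        mentions.setdefault(row["filename"], []).append(span)
    return mentions


mentions = read_mentions(io.StringIO(SAMPLE_TSV))
print(mentions["case_001"])
```

With the real file, the same function would take an open file handle instead of the in-memory sample.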

## Pre-processing
The pre-processing stage involves:
- Splitting the medical text in each file into sentences.
- Tokenizing the sentences into words/tokens.
- Computing IOB tags for the tokens for the named entity recognition (NER) task.

- Code: [Pre-processing.ipynb](Pre-processing.ipynb)
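
The three steps above can be sketched in plain Python as follows. This is a simplified illustration, not the notebook's implementation: the regex-based sentence splitter and tokenizer, the `DISEASE` label name, and the example text are all assumptions.

```python
import re


def split_sentences(text):
    """Naive splitter: yield (start_offset, sentence), breaking after . ! or ?"""
    for match in re.finditer(r"[^.!?]+[.!?]*", text):
        sentence = match.group()
        stripped = sentence.strip()
        if stripped:
            leading = len(sentence) - len(sentence.lstrip())
            yield match.start() + leading, stripped


def tokenize(sentence, offset=0):
    """Yield (token, start, end) with character offsets into the full text."""
    for match in re.finditer(r"\w+|[^\w\s]", sentence):
        yield match.group(), offset + match.start(), offset + match.end()


def iob_tags(tokens, mentions, label="DISEASE"):
    """Assign B-/I-/O tags from (start, end) character spans of mentions.

    Simplification: a token is tagged only if it lies fully inside a mention.
    """
    tags = []
    for _, start, end in tokens:
        tag = "O"
        for m_start, m_end in mentions:
            if start >= m_start and end <= m_end:
                tag = ("B-" if start == m_start else "I-") + label
                break
        tags.append(tag)
    return tags


text = "The patient has type 2 diabetes. No fever was reported."
mentions = [(16, 31)]  # character span of "type 2 diabetes"

tagged = []
for offset, sentence in split_sentences(text):
    tokens = list(tokenize(sentence, offset))
    tags = iob_tags(tokens, mentions)
    tagged.append(list(zip([tok for tok, _, _ in tokens], tags)))

for sentence_tags in tagged:
    print(sentence_tags)
```

Keeping character offsets relative to the full document lets the mention spans be matched against tokens even after the text has been split into sentences.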

## NER Task
- Two types of models are built:
  - One takes the entire clinical case/document as input.
  - One takes individual sentences as input, after sentence-based tokenization.

- The base models used are:
  - https://huggingface.co/d4data/biomedical-ner-all
  - https://huggingface.co/pucpr/clinicalnerpt-medical

- Disease mention identification is framed as a token classification problem.

- Code: [Entities_NER.ipynb](Entities_NER.ipynb)
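
Because mentions are predicted as one label per token, the predicted tags have to be merged back into mention spans. The sketch below shows one way to do that; the `DISEASE` label name and the example tokens and tags are assumptions, and the actual label set depends on the model used.

```python
def decode_iob(tokens, tags):
    """Merge per-token IOB tags into (label, mention_text) pairs.

    Tokens are joined with single spaces, so the original spacing is not
    preserved. A stray I- tag with no preceding B- tag starts a new mention.
    """
    mentions, current, current_label = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            if current:
                mentions.append((current_label, " ".join(current)))
            current, current_label = [token], label
        elif prefix == "I":
            current.append(token)
        else:  # "O" closes any open mention
            if current:
                mentions.append((current_label, " ".join(current)))
            current, current_label = [], None
    if current:
        mentions.append((current_label, " ".join(current)))
    return mentions


tokens = ["The", "patient", "has", "type", "2", "diabetes", "and", "asthma", "."]
tags = ["O", "O", "O", "B-DISEASE", "I-DISEASE", "I-DISEASE", "O", "B-DISEASE", "O"]
decoded = decode_iob(tokens, tags)
print(decoded)
```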

## Entity Linking Task
- The disease mentions are linked to SNOMED CT codes.

- The models used are:
  - SapBERT: https://huggingface.co/cambridgeltl/SapBERT-from-PubMedBERT-fulltext
  - RoBERTa-Large: https://huggingface.co/raynardj/pmc-med-bio-mlm-roberta-large
  - PubMedBERT: https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract

- Code: [EL.ipynb](EL.ipynb) (SapBERT), [EL_roberta.ipynb](EL_roberta.ipynb) (RoBERTa-Large), [EL_pubmedbert.ipynb](EL_pubmedbert.ipynb) (PubMedBERT)
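
This README does not spell out how the linking is performed. One common way to use encoders like the ones listed above is to embed both the mention and the candidate concept names, then pick the concept whose embedding is most similar. The sketch below illustrates only that nearest-neighbour step, under the assumption that embeddings are already available; the vectors are toy values and `CODE_A`/`CODE_B` are placeholders, not real SNOMED CT codes.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def link(mention_vec, concept_vecs):
    """Return (code, score) of the concept most similar to the mention."""
    return max(
        ((code, cosine(mention_vec, vec)) for code, vec in concept_vecs.items()),
        key=lambda pair: pair[1],
    )


# Toy 3-dimensional vectors and placeholder codes (hypothetical values).
# In practice the vectors would come from an encoder model.
concept_vecs = {
    "CODE_A": [0.9, 0.1, 0.0],
    "CODE_B": [0.0, 0.8, 0.6],
}
mention_vec = [0.8, 0.2, 0.1]

best_code, score = link(mention_vec, concept_vecs)
print(best_code, round(score, 3))
```

For a large concept inventory, the linear scan would normally be replaced by a batched matrix product or an approximate nearest-neighbour index.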