Disease-NER

Given a medical diagnosis, identifying medical conditions within the text and mapping them to standardized medical encodings.

Data

The data directory contains:
- The disease mentions from the text files stored in entities.tsv.
- Text files containing the medical textual data in the text directory.

The data is taken from the English version of multilingual resources of the DisTEMIST 2022 task: https://zenodo.org/record/6532684

Pre-processing

The pre-processing stage involves:
- Splitting medical text in each file into sentences.
- Tokenizing the sentences into words/tokens.
- Calculating IOB tags for the tokens for named entity recognition (NER) task.

Code: Pre-processing.ipynb

NER Task

Two Types of Models are built:
The entire clinical case / document is given as input
Sentence based Tokenization and the sentences are given as input
The basic models used are :
https://huggingface.co/d4data/biomedical-ner-all
https://huggingface.co/pucpr/clinicalnerpt-medical
Disease mentions identification is built as a Token classification problem.
Code: Entities_NER.ipynb

Entity Linking Task

The disease mentions are linked to SNOMED CT codes.
The models used are:
SapBERT: https://huggingface.co/cambridgeltl/SapBERT-from-PubMedBERT-fulltext
Roberta-Large: https://huggingface.co/raynardj/pmc-med-bio-mlm-roberta-large
PubMedBERT: https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract
Code: EL.ipynb (SapBERT), EL_roberta.ipynb (Roberta-Large), EL_pubmedbert.ipynb (PubMedBERT)