Given the text of a medical diagnosis, this project identifies the medical conditions mentioned in it and maps them to standardized medical codes.
The data directory contains:
- The disease mentions from the text files, stored in entities.tsv.
- Text files containing the medical text, stored in the text directory.
The data is taken from the English version of the multilingual resources of the DisTEMIST 2022 task: https://zenodo.org/record/6532684
The pre-processing stage involves:
- Splitting medical text in each file into sentences.
- Tokenizing the sentences into words/tokens.
- Calculating IOB tags for the tokens for the named entity recognition (NER) task.
Tokenization is sentence-based, and the resulting sentences are given as input.
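The pre-processing steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebooks: the regex-based sentence splitter and tokenizer, the function names, the `DISEASE` label, and the sample text are all assumptions made for the example, and mentions are assumed to be available as character offsets.

```python
import re

def split_sentences(text):
    # Naive rule (assumption): a sentence ends at '.', '!' or '?'.
    # Returns (sentence, start_offset) pairs.
    return [(m.group(), m.start())
            for m in re.finditer(r"[^.!?]+[.!?]?", text) if m.group().strip()]

def tokenize(sentence, offset=0):
    # Words and single punctuation marks, with document-level character offsets.
    return [(m.group(), offset + m.start(), offset + m.end())
            for m in re.finditer(r"\w+|[^\w\s]", sentence)]

def iob_tags(tokens, spans, label="DISEASE"):
    # B- for the first token of a mention, I- for the following ones, O otherwise.
    tags = []
    for _, start, end in tokens:
        tag = "O"
        for s, e in spans:
            if start < e and end > s:  # token overlaps the mention span
                tag = ("B-" if start <= s else "I-") + label
                break
        tags.append(tag)
    return tags

# Hypothetical sample text and mention, for illustration only.
text = "The patient has chronic kidney disease. No fever was reported."
mention = "chronic kidney disease"
spans = [(text.index(mention), text.index(mention) + len(mention))]

for sentence, offset in split_sentences(text):
    tokens = tokenize(sentence, offset)
    print(list(zip([t[0] for t in tokens], iob_tags(tokens, spans))))
```

A dedicated sentence splitter and tokenizer would normally replace the regexes here; only the tagging logic is the point of the sketch.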
The basic models used are:
Disease mention identification is framed as a token classification problem.
Code: Entities_NER.ipynb
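When token classification is done with a BERT-style subword tokenizer, the word-level IOB tags have to be aligned to subword tokens. The sketch below shows one common convention, which is an assumption here and not necessarily what Entities_NER.ipynb does: the first subword of each word keeps the word's label, and the remaining subwords and special tokens get -100, the default index ignored by PyTorch's cross-entropy loss. The `word_ids` list mimics what a Hugging Face fast tokenizer returns from `word_ids()`; the label map and the example split are hypothetical.

```python
LABEL2ID = {"O": 0, "B-DISEASE": 1, "I-DISEASE": 2}  # hypothetical label map
IGNORE = -100  # positions with this label are skipped by the loss

def align_labels(word_ids, word_tags):
    """Map word-level IOB tags onto subword positions."""
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:          # special token such as [CLS] or [SEP]
            aligned.append(IGNORE)
        elif wid != previous:    # first subword of a word keeps the label
            aligned.append(LABEL2ID[word_tags[wid]])
        else:                    # later subwords of the same word are ignored
            aligned.append(IGNORE)
        previous = wid
    return aligned

# Hypothetical example: three words, the last one split into three subwords.
word_tags = ["O", "B-DISEASE", "I-DISEASE"]
word_ids = [None, 0, 1, 2, 2, 2, None]
print(align_labels(word_ids, word_tags))
```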
The disease mentions are linked to SNOMED CT codes.
The models used are:
PubMedBERT: https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract
Code: EL.ipynb (SapBERT), EL_roberta.ipynb (RoBERTa-Large), EL_pubmedbert.ipynb (PubMedBERT)
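One common way to link mentions with encoder models like these is to embed both the mention and the candidate concept names, then pick the concept whose embedding is closest. The sketch below illustrates only that nearest-neighbour step. It is an assumption about the general approach, not a description of the notebooks, and the vectors and concept identifiers are made-up placeholders rather than real model outputs or SNOMED CT codes.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def link(mention_vec, concept_vecs):
    """Return the concept identifier whose embedding is most similar to the mention."""
    return max(concept_vecs, key=lambda code: cosine(mention_vec, concept_vecs[code]))

# Placeholder embeddings; in practice these would come from an encoder model.
concept_vecs = {
    "CONCEPT_A": [0.9, 0.1, 0.0],
    "CONCEPT_B": [0.0, 0.2, 0.9],
}
mention_vec = [0.8, 0.2, 0.1]
print(link(mention_vec, concept_vecs))
```

With a full terminology, the exhaustive `max` would typically be replaced by an approximate nearest-neighbour index.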