Our exploration into Social Determinants of Health (SDOH) classification using AI models has led to several insightful findings:
Our research involves the application of two primary models for the classification tasks:
The detailed annotation guideline is available under resource.
The BERT model serves as our baseline in this project. For comparative analysis, we use advanced sequence-to-sequence models. The script for training and evaluation is bert_train_predict.py.
Our sequence-to-sequence models for the classification tasks are fine-tuned from the FLAN-T5 model family, ranging from base to XXL sizes. The training and prediction scripts are t5_train.py and t5_predict.py, respectively. The main libraries used for these tasks are transformers and peft.
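As a rough illustration of how a FLAN-T5 checkpoint can be prepared for parameter-efficient fine-tuning with transformers and peft, here is a minimal sketch. It is not the code in t5_train.py. The use of LoRA, the hyperparameter values, the prompt wording, the label name, and the helper names to_seq2seq_example and build_peft_model are all assumptions made for illustration.

```python
def to_seq2seq_example(sentence: str, label: str) -> dict:
    """Frame one classification example as text-to-text.

    The prompt wording is hypothetical, not taken from the repository.
    """
    return {
        "input_text": f"Classify the social determinant of health: {sentence}",
        "target_text": label,
    }


def build_peft_model(model_name: str = "google/flan-t5-base"):
    """Wrap a FLAN-T5 checkpoint with LoRA adapters (LoRA is an assumption)."""
    # Imported lazily so the helper above works without these packages installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    from peft import LoraConfig, TaskType, get_peft_model

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    lora_config = LoraConfig(
        task_type=TaskType.SEQ_2_SEQ_LM,
        r=8,                        # illustrative rank, not from the repository
        lora_alpha=32,              # illustrative scaling factor
        lora_dropout=0.1,           # illustrative dropout
        target_modules=["q", "v"],  # T5 attention query/value projections
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # reports how few weights are trainable
    return tokenizer, model


# The label "EMPLOYMENT" is a made-up placeholder for illustration.
example = to_seq2seq_example("Patient lives alone and is unemployed.", "EMPLOYMENT")
print(example)
```

Calling build_peft_model() downloads the checkpoint, so the larger sizes in the family need correspondingly more memory. The actual training configuration is in t5_train.py.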
Our research uses synthetic data for model training. The synthetic data, available in CSV format, was developed through several iterations with the script under synthetic_data.
The figure below demonstrates the creation process of the synthetic SDoH Human Annotated Demographic Robustness (SHADR) dataset, Partial_Iteration_2_demographic_annotated.csv.
To evaluate your model on this dataset, first run inference on the original sentences, then run the same model on the demographic-modified sentences and compare the two sets of predictions for robustness, as shown in the figure below.
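The comparison described above can be sketched as follows. This is a minimal illustration, not the repository's evaluation code: robustness_report, toy_predict, the label names, and the example sentences are hypothetical, and the "flip rate" (the share of sentence pairs whose prediction changes) is one simple way to summarize robustness rather than the metric reported in the paper.

```python
from typing import Callable, Sequence


def robustness_report(
    predict: Callable[[str], str],
    originals: Sequence[str],
    modified: Sequence[str],
) -> dict:
    """Compare predictions on original vs. demographic-modified sentences."""
    if len(originals) != len(modified):
        raise ValueError("originals and modified must be aligned pairwise")
    # Run the SAME model on both versions of every sentence.
    orig_preds = [predict(s) for s in originals]
    mod_preds = [predict(s) for s in modified]
    # A pair "flips" when the demographic edit alone changes the prediction.
    flipped = [i for i, (a, b) in enumerate(zip(orig_preds, mod_preds)) if a != b]
    n = len(originals)
    return {
        "n_pairs": n,
        "n_flipped": len(flipped),
        "flip_rate": len(flipped) / n if n else 0.0,
        "flipped_indices": flipped,
    }


def toy_predict(sentence: str) -> str:
    """Deliberately brittle stand-in for a trained classifier (hypothetical)."""
    return "HOUSING" if "patient is homeless" in sentence.lower() else "NO_SDOH"


report = robustness_report(
    toy_predict,
    ["Patient is homeless.", "Patient lives with spouse."],
    ["Patient, a Hispanic man, is homeless.", "Patient, a Black woman, lives with spouse."],
)
print(report)
```

In practice, predict would wrap your trained model's inference call, and the two sentence lists would come from the paired original and modified columns of the SHADR CSV.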
The synthetic data generation notebook is synthetic_data_generation_GPT.ipynb. A comparison of model performance on a human-validated subset of the synthetic data is provided in the gpt_vs_ftt5.ipynb notebook.
How to Cite:
@misc{guevara&chen2024large,
title={Large Language Models to Identify Social Determinants of Health in Electronic Health Records},
author={Marco Guevara and Shan Chen and Spencer Thomas and Tafadzwa L. Chaunzwa and Idalid Franco and Benjamin Kann and Shalini Moningi and Jack Qian and Madeleine Goldstein and Susan Harper and Hugo JWL Aerts and Guergana K. Savova and Raymond H. Mak and Danielle S. Bitterman},
year={2024},
eprint={npj Digit. Med. 7, 6},
doi={https://doi.org/10.1038/s41746-023-00970-0}
}
For further information or queries, please contact our lab at https://aim.hms.harvard.edu/contact.