--- a
+++ b/README.md
@@ -0,0 +1,125 @@
+# FEMR
+### Framework for Electronic Medical Records
+
+**FEMR** is a Python package for manipulating longitudinal EHR data for machine learning, with a focus on supporting the creation of foundation models and verifying their [presumed benefits](https://hai.stanford.edu/news/how-foundation-models-can-advance-ai-healthcare) in healthcare. Such a framework is needed given the [current state of large language models in healthcare](https://hai.stanford.edu/news/shaky-foundations-foundation-models-healthcare) and the need for better evaluation frameworks.
+
+The currently supported foundation models are [CLMBR](https://arxiv.org/pdf/2001.05295.pdf) and [MOTOR](https://arxiv.org/abs/2301.03150).
+
+**FEMR** works with data that has been converted to the [MEDS](https://github.com/Medical-Event-Data-Standard/) schema, a simple schema that supports a wide variety of EHR / claims datasets. Please see the MEDS documentation, and in particular its [provided ETLs](https://github.com/Medical-Event-Data-Standard/meds_etl), for help converting your data to MEDS.
+
+**FEMR** helps users:
+1. [Use ontologies to better understand / featurize medical codes](http://github.com/som-shahlab/femr/blob/main/tutorials/1_Ontology.ipynb)
+2. [Algorithmically label patient records based on structured data](https://github.com/som-shahlab/femr/blob/main/tutorials/2_Labeling.ipynb)
+3. [Generate tabular features from patient timelines for use with traditional gradient boosted tree models](https://github.com/som-shahlab/femr/blob/main/tutorials/3_Count%20Featurization%20And%20Modeling.ipynb)
+4. [Train](https://github.com/som-shahlab/femr/blob/main/tutorials/4_Train%20CLMBR.ipynb) and [finetune](https://github.com/som-shahlab/femr/blob/main/tutorials/5_CLMBR%20Featurization%20And%20Modeling.ipynb) CLMBR-derived models for binary classification and prediction tasks.
+5. [Train](https://github.com/som-shahlab/femr/blob/main/tutorials/6_Train%20MOTOR.ipynb) and [finetune](https://github.com/som-shahlab/femr/blob/main/tutorials/7_MOTOR%20Featurization%20And%20Modeling.ipynb) MOTOR-derived models for binary classification and prediction tasks.
+
+We recommend that users start with our [tutorial folder](https://github.com/som-shahlab/femr/tree/main/tutorials).
+
+# Installation
+
+```bash
+pip install femr
+
+# If you are using deep learning, you also need to install xformers
+#
+# Note that xformers has some known issues with macOS.
+# If you are using macOS you might also need to install llvm. See https://stackoverflow.com/questions/60005176/how-to-deal-with-clang-error-unsupported-option-fopenmp-on-travis
+pip install xformers
+```
+
+# Getting Started
+
+The first step in using **FEMR** is to convert your patient data into [MEDS](https://github.com/Medical-Event-Data-Standard), the standard input format expected by the **FEMR** codebase.
+
+**Note: FEMR currently supports only MEDS v1, so you will need to install the MEDS v1 versions of packages, for example `pip install meds-etl==0.1.3`.**
+
+The best way to do this is with the [ETLs provided by MEDS](https://github.com/Medical-Event-Data-Standard/meds_etl).
+
+
+## OMOP Data
+
+If you have OMOP CDM formatted data, follow these instructions:
+
+1. Download your OMOP dataset to `[PATH_TO_SOURCE_OMOP]`.
+2. Convert OMOP => MEDS using the following:
+```bash
+# Convert OMOP => MEDS data format
+meds_etl_omop [PATH_TO_SOURCE_OMOP] [PATH_TO_OUTPUT_MEDS]
+```
+
+3. Use Hugging Face's Datasets library to load the dataset in Python (here `PATH_TO_OUTPUT_MEDS` is a string holding the output path from step 2, ending in `/`):
+```python
+import datasets
+dataset = datasets.Dataset.from_parquet(PATH_TO_OUTPUT_MEDS + 'data/*')
+
+# Print dataset stats
+print(dataset)
+# Dataset({
+#     features: ['patient_id', 'events'],
+#     num_rows: 6732
+# })
+
+# Print number of events in first patient in dataset
+print(len(dataset[0]['events']))
+# 2287
+```
+
+## Stanford STARR-OMOP Data
+
+If you are using the STARR-OMOP dataset from Stanford (which uses the OMOP CDM), we add an initial Stanford-specific preprocessing step. Otherwise, the process is identical to the **OMOP Data** section. Follow these instructions:
+
+1. Download your STARR-OMOP dataset to `[PATH_TO_SOURCE_OMOP]`.
+2. Convert STARR-OMOP => MEDS using the following:
+```bash
+# Convert OMOP => MEDS data format
+meds_etl_omop [PATH_TO_SOURCE_OMOP] [PATH_TO_OUTPUT_MEDS]_raw
+
+# Apply Stanford fixes
+femr_stanford_omop_fixer [PATH_TO_OUTPUT_MEDS]_raw [PATH_TO_OUTPUT_MEDS]
+```
+
+3. Use Hugging Face's Datasets library to load the dataset in Python (here `PATH_TO_OUTPUT_MEDS` is a string holding the output path from step 2, ending in `/`):
+```python
+import datasets
+dataset = datasets.Dataset.from_parquet(PATH_TO_OUTPUT_MEDS + 'data/*')
+
+# Print dataset stats
+print(dataset)
+# Dataset({
+#     features: ['patient_id', 'events'],
+#     num_rows: 6732
+# })
+
+# Print number of events in first patient in dataset
+print(len(dataset[0]['events']))
+# 2287
+```
+
+# Development
+
+The following guides are for developers who want to contribute to **FEMR**.
+
+## Precommit checks
+
+Before committing, please run the following commands to ensure that your code is formatted correctly and passes all tests.
+
+### Installation
+```bash
+conda install pre-commit pytest -y
+pre-commit install
+```
+
+### Running
+
+#### Test Functions
+
+```bash
+pytest tests
+```
+
+#### Formatting Checks
+
+```bash
+pre-commit run --all-files
+```