# FEMR

### Framework for Electronic Medical Records

**FEMR** is a Python package for manipulating longitudinal EHR data for machine learning, with a focus on supporting the creation of foundation models and verifying their [presumed benefits](https://hai.stanford.edu/news/how-foundation-models-can-advance-ai-healthcare) in healthcare. Such a framework is needed given the [current state of large language models in healthcare](https://hai.stanford.edu/news/shaky-foundations-foundation-models-healthcare) and the need for better evaluation frameworks.

The currently supported foundation models are [CLMBR](https://arxiv.org/pdf/2001.05295.pdf) and [MOTOR](https://arxiv.org/abs/2301.03150).

**FEMR** works with data that has been converted to the [MEDS](https://github.com/Medical-Event-Data-Standard/) schema, a simple schema that supports a wide variety of EHR / claims datasets. Please see the MEDS documentation, and in particular its [provided ETLs](https://github.com/Medical-Event-Data-Standard/meds_etl), for help converting your data to MEDS.
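
As the dataset printouts later in this README show, a MEDS dataset exposes one row per patient with `patient_id` and `events` fields. Here is a rough sketch of what one record might look like; the per-event field names are illustrative assumptions, and the MEDS schema definition is the authoritative reference:

```python
import datetime

# Hypothetical sketch of a single MEDS-style patient record. Only
# 'patient_id' and 'events' are confirmed by the dataset printouts in this
# README; the per-event fields shown are illustrative assumptions.
patient = {
    "patient_id": 42,
    "events": [
        # timestamped, coded observations, ordered by time
        {"time": datetime.datetime(2015, 3, 1, 9, 30), "code": "ICD10CM/E11.9"},
        {"time": datetime.datetime(2015, 3, 2, 8, 0), "code": "LOINC/4548-4", "numeric_value": 6.9},
    ],
}
```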

**FEMR** helps users:
1. [Use ontologies to better understand / featurize medical codes](http://github.com/som-shahlab/femr/blob/main/tutorials/1_Ontology.ipynb)
2. [Algorithmically label patient records based on structured data](https://github.com/som-shahlab/femr/blob/main/tutorials/2_Labeling.ipynb)
3. [Generate tabular features from patient timelines for use with traditional gradient boosted tree models](https://github.com/som-shahlab/femr/blob/main/tutorials/3_Count%20Featurization%20And%20Modeling.ipynb) (see the sketch after this list)
4. [Train](https://github.com/som-shahlab/femr/blob/main/tutorials/4_Train%20CLMBR.ipynb) and [finetune](https://github.com/som-shahlab/femr/blob/main/tutorials/5_CLMBR%20Featurization%20And%20Modeling.ipynb) CLMBR-derived models for binary classification and prediction tasks.
5. [Train](https://github.com/som-shahlab/femr/blob/main/tutorials/6_Train%20MOTOR.ipynb) and [finetune](https://github.com/som-shahlab/femr/blob/main/tutorials/7_MOTOR%20Featurization%20And%20Modeling.ipynb) MOTOR-derived models for binary classification and prediction tasks.
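
To make item 3 above concrete, here is a simplified, hypothetical sketch of count featurization over a MEDS-style record in plain Python. It assumes each event is a dict with `time` and `code` keys (an assumption about the record layout, not the exact schema), and it is not the FEMR featurization API, which the tutorial covers:

```python
import collections

def count_features(patient, prediction_time):
    """Count how often each code occurs strictly before prediction_time.

    Hypothetical illustration only: assumes each event is a dict with
    'time' and 'code' keys. This is not the FEMR featurization API.
    """
    counts = collections.Counter()
    for event in patient["events"]:
        if event["time"] < prediction_time:
            counts[event["code"]] += 1
    return counts
```

Rows of such code counts, stacked across patients into a sparse matrix, are the kind of tabular features a gradient boosted tree model consumes.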

We recommend users start with our [tutorial folder](https://github.com/som-shahlab/femr/tree/main/tutorials).

# Installation

```bash
pip install femr

# If you are using deep learning, you also need to install xformers
#
# Note that xformers has some known issues with macOS.
# If you are using macOS, you might also need to install llvm. See https://stackoverflow.com/questions/60005176/how-to-deal-with-clang-error-unsupported-option-fopenmp-on-travis
pip install xformers
```

# Getting Started

The first step of using **FEMR** is to convert your patient data into [MEDS](https://github.com/Medical-Event-Data-Standard), the standard input format expected by the **FEMR** codebase.

**Note: FEMR currently only supports MEDS v1, so you will need to install MEDS v1 versions of packages, i.e. `pip install meds-etl==0.1.3`.**

The best way to perform this conversion is with the [ETLs provided by MEDS](https://github.com/Medical-Event-Data-Standard/meds_etl).

## OMOP Data

If you have OMOP CDM formatted data, follow these instructions:

1. Download your OMOP dataset to `[PATH_TO_SOURCE_OMOP]`.
2. Convert OMOP => MEDS using the following:
```bash
# Convert OMOP => MEDS data format
meds_etl_omop [PATH_TO_SOURCE_OMOP] [PATH_TO_OUTPUT_MEDS]
```

3. Use HuggingFace's Datasets library to load our dataset in Python.
```python
import datasets

# PATH_TO_OUTPUT_MEDS is the output directory from step 2
dataset = datasets.Dataset.from_parquet(PATH_TO_OUTPUT_MEDS + '/data/*')

# Print dataset stats
print(dataset)
>>> Dataset({
>>>     features: ['patient_id', 'events'],
>>>     num_rows: 6732
>>> })

# Print number of events in first patient in dataset
print(len(dataset[0]['events']))
>>> 2287
```
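
To peek at an individual record, index into the `events` list loaded above; the per-event fields follow the MEDS schema, so the printed structure reflects whichever MEDS version produced the dataset:

```python
# Inspect the first event of the first patient; its fields are determined
# by the MEDS schema version used to produce the dataset.
print(dataset[0]['events'][0])
```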

## Stanford STARR-OMOP Data

If you are using the STARR-OMOP dataset from Stanford (which uses the OMOP CDM), we add an initial Stanford-specific preprocessing step; otherwise the process is identical to the **OMOP Data** section. Follow these instructions:

1. Download your STARR-OMOP dataset to `[PATH_TO_SOURCE_OMOP]`.
2. Convert STARR-OMOP => MEDS using the following:
```bash
# Convert OMOP => MEDS data format
meds_etl_omop [PATH_TO_SOURCE_OMOP] [PATH_TO_OUTPUT_MEDS]_raw

# Apply Stanford fixes
femr_stanford_omop_fixer [PATH_TO_OUTPUT_MEDS]_raw [PATH_TO_OUTPUT_MEDS]
```

3. Use HuggingFace's Datasets library to load our dataset in Python.
```python
import datasets

# PATH_TO_OUTPUT_MEDS is the output directory from step 2
dataset = datasets.Dataset.from_parquet(PATH_TO_OUTPUT_MEDS + '/data/*')

# Print dataset stats
print(dataset)
>>> Dataset({
>>>     features: ['patient_id', 'events'],
>>>     num_rows: 6732
>>> })

# Print number of events in first patient in dataset
print(len(dataset[0]['events']))
>>> 2287
```

# Development

The following guides are for developers who want to contribute to **FEMR**.

## Precommit checks

Before committing, please run the following commands to ensure that your code is formatted correctly and passes all tests.

### Installation
```bash
conda install pre-commit pytest -y
pre-commit install
```

### Running

#### Test Functions
```bash
pytest tests
```

#### Formatting Checks
```bash
pre-commit run --all-files
```