# FEMR
### Framework for Electronic Medical Records

**FEMR** is a Python package for manipulating longitudinal EHR data for machine learning, with a focus on supporting the creation of foundation models and verifying their [presumed benefits](https://hai.stanford.edu/news/how-foundation-models-can-advance-ai-healthcare) in healthcare. Such a framework is needed given the [current state of large language models in healthcare](https://hai.stanford.edu/news/shaky-foundations-foundation-models-healthcare) and the need for better evaluation frameworks.

The currently supported foundation models are [CLMBR](https://arxiv.org/pdf/2001.05295.pdf) and [MOTOR](https://arxiv.org/abs/2301.03150).

**FEMR** works with data that has been converted to the [MEDS](https://github.com/Medical-Event-Data-Standard/) schema, a simple schema that supports a wide variety of EHR / claims datasets. Please see the MEDS documentation, and in particular its [provided ETLs](https://github.com/Medical-Event-Data-Standard/meds_etl), for help converting your data to MEDS.

**FEMR** helps users:
1. [Use ontologies to better understand / featurize medical codes](https://github.com/som-shahlab/femr/blob/main/tutorials/1_Ontology.ipynb)
2. [Algorithmically label patient records based on structured data](https://github.com/som-shahlab/femr/blob/main/tutorials/2_Labeling.ipynb)
3. [Generate tabular features from patient timelines for use with traditional gradient-boosted tree models](https://github.com/som-shahlab/femr/blob/main/tutorials/3_Count%20Featurization%20And%20Modeling.ipynb)
4. [Train](https://github.com/som-shahlab/femr/blob/main/tutorials/4_Train%20CLMBR.ipynb) and [finetune](https://github.com/som-shahlab/femr/blob/main/tutorials/5_CLMBR%20Featurization%20And%20Modeling.ipynb) CLMBR-derived models for binary classification and prediction tasks.
5. [Train](https://github.com/som-shahlab/femr/blob/main/tutorials/6_Train%20MOTOR.ipynb) and [finetune](https://github.com/som-shahlab/femr/blob/main/tutorials/7_MOTOR%20Featurization%20And%20Modeling.ipynb) MOTOR-derived models for binary classification and prediction tasks.

We recommend that users start with our [tutorial folder](https://github.com/som-shahlab/femr/tree/main/tutorials).

# Installation

```bash
pip install femr

# If you are using deep learning, you also need to install xformers.
#
# Note that xformers has some known issues with macOS.
# If you are using macOS, you might also need to install llvm. See https://stackoverflow.com/questions/60005176/how-to-deal-with-clang-error-unsupported-option-fopenmp-on-travis
pip install xformers
```
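
After running the commands above, it can be handy to confirm that the packages are actually visible to your Python environment. The following is a minimal sketch, not part of **FEMR** itself: the helper name `missing_packages` is hypothetical, and it only checks that a module can be located, not that it works correctly.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level module names that cannot be located in this environment."""
    return [name for name in names if find_spec(name) is None]


# "femr" is always required; "xformers" is only needed for deep learning.
missing = missing_packages(["femr", "xformers"])
print("Missing packages:", missing or "none")
```

If `xformers` is reported missing and you only use the count-based featurizers, that is expected.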

# Getting Started

The first step in using **FEMR** is to convert your patient data into [MEDS](https://github.com/Medical-Event-Data-Standard), the standard input format expected by the **FEMR** codebase.

**Note: FEMR currently supports only MEDS v1, so you will need to install the MEDS v1 versions of the related packages, for example `pip install meds-etl==0.1.3`.**

The best way to convert your data is with the [ETLs provided by MEDS](https://github.com/Medical-Event-Data-Standard/meds_etl).
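
Because the MEDS version matters, you may want to check which `meds-etl` version is installed before running an ETL. The sketch below uses only the standard library. It assumes the pinned version `0.1.3` from the note above; the helper names `installed_version` and `matches_pin` are hypothetical and not part of **FEMR** or MEDS.

```python
from importlib import metadata

PINNED_MEDS_ETL = "0.1.3"  # version named in the note above


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(version, pinned=PINNED_MEDS_ETL):
    """True only when a version string was found and equals the pinned one."""
    return version is not None and version == pinned


found = installed_version("meds-etl")
print(f"meds-etl installed: {found}, matches pin: {matches_pin(found)}")
```

Other MEDS v1 releases may also work; this check is deliberately strict and only compares against the single version mentioned above.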

## OMOP Data

If you have OMOP CDM-formatted data, follow these instructions:

1. Download your OMOP dataset to `[PATH_TO_SOURCE_OMOP]`.
2. Convert OMOP => MEDS using the following:
```bash
# Convert OMOP => MEDS data format
meds_etl_omop [PATH_TO_SOURCE_OMOP] [PATH_TO_OUTPUT_MEDS]
```

3. Use HuggingFace's Datasets library to load the dataset in Python:
```python
import os

import datasets

# Placeholder: replace with your [PATH_TO_OUTPUT_MEDS] from step 2
PATH_TO_OUTPUT_MEDS = "path/to/output_meds"

# Join path components so a missing trailing slash does not break the glob
dataset = datasets.Dataset.from_parquet(os.path.join(PATH_TO_OUTPUT_MEDS, "data", "*"))

# Print dataset stats
print(dataset)
# Example output:
# Dataset({
#   features: ['patient_id', 'events'],
#   num_rows: 6732
# })

# Print number of events in first patient in dataset
print(len(dataset[0]['events']))
# Example output: 2287
```
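
Once the dataset is loaded, each row is one patient with a list of events. The sketch below shows one way to walk that structure and count codes. It runs on a small hand-written record rather than real data, and it assumes the nested MEDS v1 layout in which each event has a `time` and a list of `measurements`, each carrying a `code`. Check your own data before relying on these field names. The helper `count_codes` and the example codes are illustrative only.

```python
from collections import Counter

# Hypothetical patient record mirroring the layout assumed here for MEDS v1:
# patient -> events (each with a time) -> measurements (each with a code).
example_patient = {
    "patient_id": 1,
    "events": [
        {
            "time": "2020-01-01T00:00:00",
            "measurements": [{"code": "ICD10CM/E11.9"}, {"code": "LOINC/4548-4"}],
        },
        {
            "time": "2020-03-15T00:00:00",
            "measurements": [{"code": "ICD10CM/E11.9"}],
        },
    ],
}


def count_codes(patient):
    """Count how often each code appears across all events of one patient."""
    counts = Counter()
    for event in patient["events"]:
        for measurement in event["measurements"]:
            counts[measurement["code"]] += 1
    return counts


counts = count_codes(example_patient)
print(counts.most_common(1))  # [('ICD10CM/E11.9', 2)]
```

If the assumption about the layout holds for your data, `count_codes(dataset[0])` should work the same way on a loaded patient.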

## Stanford STARR-OMOP Data

If you are using the STARR-OMOP dataset from Stanford (which uses the OMOP CDM), an extra Stanford-specific fix-up step is applied after the MEDS conversion. Otherwise the process is identical to the **OMOP Data** section. Follow these instructions:

1. Download your STARR-OMOP dataset to `[PATH_TO_SOURCE_OMOP]`.
2. Convert STARR-OMOP => MEDS using the following:
```bash
# Convert STARR-OMOP => MEDS data format (raw output, before Stanford fixes)
meds_etl_omop [PATH_TO_SOURCE_OMOP] [PATH_TO_OUTPUT_MEDS]_raw

# Apply Stanford fixes
femr_stanford_omop_fixer [PATH_TO_OUTPUT_MEDS]_raw [PATH_TO_OUTPUT_MEDS]
```

3. Use HuggingFace's Datasets library to load the dataset in Python:
```python
import os

import datasets

# Placeholder: replace with your [PATH_TO_OUTPUT_MEDS] from step 2
PATH_TO_OUTPUT_MEDS = "path/to/output_meds"

# Join path components so a missing trailing slash does not break the glob
dataset = datasets.Dataset.from_parquet(os.path.join(PATH_TO_OUTPUT_MEDS, "data", "*"))

# Print dataset stats
print(dataset)
# Example output:
# Dataset({
#   features: ['patient_id', 'events'],
#   num_rows: 6732
# })

# Print number of events in first patient in dataset
print(len(dataset[0]['events']))
# Example output: 2287
```
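
The two-command STARR-OMOP conversion in step 2 can also be scripted. The following sketch only builds the command lines shown above; it does not run them. The function name `starr_omop_commands` and the example directory names are hypothetical. The `_raw` suffix follows the convention used in step 2.

```python
import shlex


def starr_omop_commands(source_omop, output_meds):
    """Build the two commands from step 2: convert to raw MEDS, then apply Stanford fixes."""
    # Strip a trailing slash so the intermediate directory is a sibling named "<output>_raw".
    raw_output = output_meds.rstrip("/") + "_raw"
    return [
        ["meds_etl_omop", source_omop, raw_output],
        ["femr_stanford_omop_fixer", raw_output, output_meds],
    ]


for command in starr_omop_commands("starr_omop", "starr_meds"):
    print(shlex.join(command))
```

To execute the steps, each list could be passed to `subprocess.run(command, check=True)`, so that the fixer only runs if the conversion succeeds.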

# Development

The following guides are for developers who want to contribute to **FEMR**.

## Pre-commit checks

Before committing, please run the following commands to ensure that your code is formatted correctly and passes all tests.

### Installation
```bash
conda install pre-commit pytest -y
pre-commit install
```

### Running

#### Test Functions

```bash
pytest tests
```

#### Formatting Checks

```bash
pre-commit run --all-files
```