# Count Featurization And Models

FEMR contains several utilities to implement common tabular featurization strategies.

[CountFeaturizer](https://github.com/som-shahlab/femr/blob/main/src/femr/featurizers/featurizers.py#L180) is the main class and it documents the various supported options.

To use the featurizer, you construct a featurizer list, preprocess the featurizers, and then featurize.

In [1]:
import pickle
import femr.featurizers
import femr.labelers
import meds
import pyarrow.csv
import datasets
import femr.index

# Load some labels
labels = pyarrow.csv.read_csv('input/labels.csv').to_pylist()

# Load our data
dataset = datasets.Dataset.from_parquet("input/meds/data/*")

# We need to create an index to allow us to find patients quickly
index = femr.index.PatientIndex(dataset)
    
# Define our featurizer

# Note that we are using both ages and counts here
age = femr.featurizers.AgeFeaturizer(is_normalize=False)
count = femr.featurizers.CountFeaturizer(string_value_combination=True)
featurizer_age_count = femr.featurizers.FeaturizerList([age, count])

# Preprocess the featurizers, which includes steps such as normalizing age
featurizer_age_count.preprocess_featurizers(dataset, index, labels)

# Actually do the featurization
features = featurizer_age_count.featurize(dataset, index, labels)



In [2]:
# Results consist of three components: the patient ids, the feature times, and the features themselves

for k, v in features.items():
    print(k, v.shape)

patient_ids (200,)
feature_times (200,)
features (200, 1884)
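
As a rough illustration of what a count-based feature matrix contains, the sketch below builds one row of code counts per label from a handful of toy events. It is not FEMR's implementation: the event tuples, the `count_features_sketch` helper, and the rule of counting only events at or before the prediction time are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy events as (patient_id, time, code). Not FEMR's data model.
toy_events = [
    (1, 1, "A"),
    (1, 2, "C"),
    (1, 5, "A"),  # happens after patient 1's prediction time, so it is ignored
    (2, 1, "B"),
]

# Hypothetical labels as (patient_id, prediction_time).
toy_label_times = [(1, 3), (2, 4)]


def count_features_sketch(events, label_times):
    """Return (vocabulary, rows) with one row of code counts per label.

    Assumption: only events at or before the prediction time are counted.
    """
    vocabulary = sorted({code for _, _, code in events})
    column = {code: i for i, code in enumerate(vocabulary)}
    rows = []
    for patient_id, prediction_time in label_times:
        counts = Counter(
            code
            for pid, time, code in events
            if pid == patient_id and time <= prediction_time
        )
        row = [0] * len(vocabulary)
        for code, n in counts.items():
            row[column[code]] = n
        rows.append(row)
    return vocabulary, rows


toy_vocabulary, toy_rows = count_features_sketch(toy_events, toy_label_times)
print(toy_vocabulary)
print(toy_rows)
```

Each distinct code becomes one column, which is why a real feature matrix such as the one above can have far more columns than rows and is mostly zeros.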


# Joining Features And Labels

Given a feature set, you usually need to join a set of labels to those features so that each feature row is paired with its label.

This can be done with femr.featurizers.join_labels.

In [3]:
features_and_labels = femr.featurizers.join_labels(features, labels)

for k, v in features_and_labels.items():
    print(k, v.shape)

boolean_values (200,)
patient_ids (200,)
times (200,)
features (200, 1884)
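
Conceptually, a join like this pairs every label with the feature row that belongs to it. The sketch below shows one way to do that with plain Python containers. It assumes that a label matches a feature row when both the patient id and the time are identical, and the label field names (`patient_id`, `prediction_time`, `boolean_value`) and the `join_labels_sketch` helper are hypothetical; the real `join_labels` may match rows differently.

```python
# Hypothetical toy feature set with the same three keys as the featurizer output.
toy_features = {
    "patient_ids": [10, 11, 12],
    "feature_times": [100, 200, 300],
    "features": [[1, 0], [0, 2], [3, 1]],
}

# Hypothetical labels; the field names are assumptions for this sketch.
toy_labels = [
    {"patient_id": 12, "prediction_time": 300, "boolean_value": True},
    {"patient_id": 10, "prediction_time": 100, "boolean_value": False},
]


def join_labels_sketch(features, labels):
    """Pair each label with the feature row that has the same (patient id, time)."""
    row_of = {
        key: i
        for i, key in enumerate(zip(features["patient_ids"], features["feature_times"]))
    }
    joined = {"boolean_values": [], "patient_ids": [], "times": [], "features": []}
    for label in labels:
        # A KeyError here means the label has no matching feature row.
        i = row_of[(label["patient_id"], label["prediction_time"])]
        joined["boolean_values"].append(label["boolean_value"])
        joined["patient_ids"].append(label["patient_id"])
        joined["times"].append(label["prediction_time"])
        joined["features"].append(features["features"][i])
    return joined


toy_joined = join_labels_sketch(toy_features, toy_labels)
for key, value in toy_joined.items():
    print(key, value)
```

The output follows the order of the labels rather than the order of the feature rows, so every array in the result lines up index by index.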


# Data Splitting

FEMR contains utilities for hash-based patient splitting, where each patient's split is determined by a hash of the patient id.

This approach is deterministic, stable, and scalable, but approximate: the realized split sizes only approximately match the requested fractions.
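
The idea behind hash-based splitting can be sketched in a few lines. The version below is an illustration only and is not how `femr.splits` is implemented: the `hash_split_sketch` helper, the use of SHA-256, and the way the seed is mixed into the hash are all assumptions.

```python
import hashlib


def hash_split_sketch(patient_ids, seed, frac_test):
    """Assign each patient to train or test using only a hash of (seed, patient id)."""
    train, test = [], []
    for pid in patient_ids:
        digest = hashlib.sha256(f"{seed}:{pid}".encode()).digest()
        # Map the first 8 bytes of the digest to a number in [0, 1).
        u = int.from_bytes(digest[:8], "big") / 2**64
        if u < frac_test:
            test.append(pid)
        else:
            train.append(pid)
    return train, test


train_ids, test_ids = hash_split_sketch(range(1000), seed=87, frac_test=0.3)
print(len(train_ids), len(test_ids))
```

Because the assignment of a patient depends only on the seed and that patient's id, rerunning the split gives the same answer, and adding or removing other patients never moves an existing patient to the other side. The cost is that the test fraction is only approximately `frac_test`.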

In [4]:
import femr.splits
import numpy as np

# We split into a global training and test set
split = femr.splits.generate_hash_split(set(features_and_labels['patient_ids']), seed=87, frac_test=0.3)

train_mask = np.isin(features_and_labels['patient_ids'], split.train_patient_ids)
test_mask = np.isin(features_and_labels['patient_ids'], split.test_patient_ids)

X_train, y_train = (
    features_and_labels['features'][train_mask],
    features_and_labels['boolean_values'][train_mask],
)
X_test, y_test = (
    features_and_labels['features'][test_mask],
    features_and_labels['boolean_values'][test_mask],
)

# Building Models

The generated features can then be used to train standard models. In this case, we fit both a logistic regression model and an XGBoost model and evaluate them.

Performance is perfect because the task (predicting gender) is fully determined by the features.

In [5]:
import xgboost as xgb
import sklearn.linear_model
import sklearn.metrics
import sklearn.preprocessing

def run_analysis(title: str, y_train, y_train_proba, y_test, y_test_proba):
    print(f"---- {title} ----")
    print("Train:")
    print_metrics(y_train, y_train_proba)
    print("Test:")
    print_metrics(y_test, y_test_proba)

def print_metrics(y_true, y_proba):
    y_pred = y_proba > 0.5
    auroc = sklearn.metrics.roc_auc_score(y_true, y_proba)
    aps = sklearn.metrics.average_precision_score(y_true, y_proba)
    accuracy = sklearn.metrics.accuracy_score(y_true, y_pred)
    f1 = sklearn.metrics.f1_score(y_true, y_pred)
    print("\tAUROC:", auroc)
    print("\tAPS:", aps)
    print("\tAccuracy:", accuracy)
    print("\tF1 Score:", f1)


# Logistic regression
# MaxAbsScaler keeps sparse data sparse: see https://scikit-learn.org/stable/modules/preprocessing.html#scaling-sparse-data
# Fit the scaler on the training set only, then reuse it for the test set
scaler = sklearn.preprocessing.MaxAbsScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = sklearn.linear_model.LogisticRegressionCV(penalty="l2", solver="liblinear").fit(X_train_scaled, y_train)
y_train_proba = model.predict_proba(X_train_scaled)[:, 1]
y_test_proba = model.predict_proba(X_test_scaled)[:, 1]
run_analysis("Logistic Regression", y_train, y_train_proba, y_test, y_test_proba)


# XGBoost
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
y_train_proba = model.predict_proba(X_train)[:, 1]
y_test_proba = model.predict_proba(X_test)[:, 1]
run_analysis("XGBoost", y_train, y_train_proba, y_test, y_test_proba)

---- Logistic Regression ----
Train:
	AUROC: 1.0
	APS: 1.0
	Accuracy: 1.0
	F1 Score: 1.0
Test:
	AUROC: 1.0
	APS: 1.0
	Accuracy: 1.0
	F1 Score: 1.0
---- XGBoost ----
Train:
	AUROC: 1.0
	APS: 1.0
	Accuracy: 1.0
	F1 Score: 1.0
Test:
	AUROC: 1.0
	APS: 1.0
	Accuracy: 1.0
	F1 Score: 1.0
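
To make the perfect AUROC above concrete: AUROC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below computes it that way in plain Python. The `auroc_pairwise` helper and the toy scores are hypothetical and are meant only to illustrate the definition; use `sklearn.metrics.roc_auc_score` for real evaluations.

```python
def auroc_pairwise(y_true, y_score):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, y_score) if y]
    negatives = [s for y, s in zip(y_true, y_score) if not y]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(positives) * len(negatives))


# Every positive outranks every negative, so the ranking is perfect.
separable = auroc_pairwise([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])

# One positive (0.35) is scored below one negative (0.4), so 3 of 4 pairs are correct.
overlapping = auroc_pairwise([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

print(separable, overlapping)
```

An AUROC of 1.0 therefore means the model's scores separate the two classes completely, which is what one would expect when the label can be read directly from the features.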
