---
title: "6.caret_meta"
author: "Emily Yeo"
date: "`r Sys.Date()`"
output: html_document
---
#### Based on the caret tutorial: https://topepo.github.io/caret/
#### R Markdown guide: https://bookdown.org/yihui/rmarkdown/html-document.html

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
pacman::p_load(knitr, tidyverse, dplyr, ggplot2, stringr, tibble, data.table,
               gridExtra, AppliedPredictiveModeling, caret, mlbench, earth)
```

### Read in data

```{r echo=FALSE}
input <- "/Users/emily/projects/research/Stanislawski/comps/mutli-omic-predictions/data/"
met <- fread(paste0(input, "clinical/meta_metS.csv"))

# character version: outcome/ID columns as character, dropping rows with a missing outcome
met_c <- met %>% 
  mutate(clin.sig.wtloss.final = as.character(clin.sig.wtloss.final),
         record_id = as.character(record_id))
met_c <- met_c[!is.na(met_c$clin.sig.wtloss.final), ]

# completely numeric data set
met_n <- as.data.frame(met)
int_cols <- sapply(met_n, is.integer)
met_n[int_cols] <- lapply(met_n[int_cols], as.numeric)
met_n <- met_n %>% dplyr::select(-c("subject", "study_id", "subject_id", 
                                    "record_id", "rmr_kcald_3m", 
                                    "rmr_rq_3m", "Intervention")) 
met_n$metS <- ifelse(met_n$metS == "healthy", 0, 1)
```

# 2.2 Scatter plots

Change `regVar` below to whatever variables you are interested in looking at.
```{r echo=FALSE}
regVar <- c("age", "class_attend_pct_wk52", "final.wt")
subset_df <- met_n %>% select(all_of(regVar))
str(subset_df)

theme1 <- trellis.par.get()
theme1$plot.symbol$col = rgb(.2, .2, .2, .4)
theme1$plot.symbol$pch = 12
theme1$plot.line$col = rgb(1, 0, 0, .7)
theme1$plot.line$lwd <- 2
trellis.par.set(theme1)
featurePlot(x = subset_df, 
            y = met_n$outcome_BMI_fnl_BL, 
            plot = "scatter", 
            type = c("p", "smooth"),
            span = .5,
            layout = c(3, 1),
            labels = c("number", "BL BMI"))
```

# 3.1 Dummy Vars {.tabset}

### info
The function takes a formula and a data set and outputs an object that can be used to create the dummy variables using the predict method. Note there is no intercept and each factor has a dummy variable for each level, so this parameterization may not be useful for some model functions, such as lm.

```{r echo=TRUE}
dummies <- dummyVars(clin.sig.wtloss.final ~ ., data = met_n)
```

### demo
```{r}
head(predict(dummies, newdata = met_n))
```

# {-}

# 3.2 Zero & near zero predictors {.tabset}

## Code 

Code to identify predictors whose values are highly unbalanced:
```{r}
data.frame(table(met_n$maintain))
```

As in the caret tutorial (which uses the MDRR data), the nearZeroVar function can be used to identify near-zero-variance variables (the saveMetrics argument shows the details and defaults to FALSE). By default, nearZeroVar returns the positions of the variables that are flagged as problematic.

```{r}
nzv <- nearZeroVar(met_n, saveMetrics = TRUE)
head(nzv[nzv$nzv, ], 10)   # metrics for the flagged predictors

dim(met_n)

nzv <- nearZeroVar(met_n)
filteredDescr <- met_n[, -nzv]
dim(filteredDescr)
setdiff(colnames(met_n), colnames(filteredDescr))
```

So it seems the withdrawal variable is the one with highly unbalanced values.

## Background

In some situations, the data generating mechanism can create predictors that only have a single unique value (i.e. a “zero-variance predictor”). For many models (excluding tree-based models), this may cause the model to crash or the fit to be unstable. Similarly, predictors might have only a handful of unique values that occur with very low frequencies - also unbalanced. 

The concern here that these predictors may become zero-variance predictors when the data are split into cross-validation/bootstrap sub-samples or that a few samples may have an undue influence on the model. These “near-zero-variance” predictors may need to be identified and eliminated prior to modeling.

To identify these types of predictors, the following two metrics can be calculated:

- the frequency of the most prevalent value over the second most frequent value (the "frequency ratio"), which would be near one for well-behaved predictors and very large for highly unbalanced data, and
- the "percent of unique values", the number of unique values divided by the total number of samples (times 100), which approaches zero as the granularity of the data increases.

If the frequency ratio is greater than a pre-specified threshold and the unique value percentage is less than a threshold, we might consider a predictor to be near zero-variance.

We would not want to falsely identify data that have low granularity but are evenly distributed, such as data from a discrete uniform distribution. Using both criteria should not falsely detect such predictors.
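
As a sanity check on these definitions, here is a minimal hand-computed sketch of the two metrics for a single predictor (using the `maintain` column from above purely as an illustration):

```{r}
# Hand-computed versions of the two nearZeroVar metrics for one predictor
tab <- sort(table(met_n$maintain), decreasing = TRUE)
freq_ratio <- as.numeric(tab[1] / tab[2])   # count of most prevalent value / second most prevalent
pct_unique <- 100 * length(unique(met_n$maintain)) / nrow(met_n)
c(freqRatio = freq_ratio, percentUnique = pct_unique)
# nearZeroVar flags a predictor when freqRatio > freqCut (default 95/5)
# and percentUnique < uniqueCut (default 10)
```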

# {-}

# 3.3 Identifying Correlated Predictors

Given a correlation matrix, the findCorrelation function flags predictors for removal; the caret tutorial suggests removing descriptors with absolute correlations above 0.75.

```{r}
descrCor <- cor(filteredDescr, use = "pairwise.complete.obs")  # pairwise handling of missing values
highCorr <- sum(abs(descrCor[upper.tri(descrCor)]) > .999, na.rm = TRUE)
highCorr
summary(descrCor[upper.tri(descrCor)])

# wait for imputed data to do this
#highlyCorDescr <- findCorrelation(descrCor, cutoff = .75)
#filteredDescr <- filteredDescr[,-highlyCorDescr]
#descrCor2 <- cor(filteredDescr)
#summary(descrCor2[upper.tri(descrCor2)])
```

# 3.6 Centering and scaling {.tabset}

### Code

The chunk below centers and scales the predictors; the preProcess option "range" would instead scale the data to the interval between zero and one. Half of the data are used to estimate the location and scale of the predictors. The function preProcess doesn't actually pre-process the data; predict.preProcess is used to pre-process this and other data sets.

```{r}
# Extract the continuous target variable
myTarget <- met_n$outcome_BMI_fnl_12m

# Remove the target column from the features
myData <- met_n[, !names(met_n) %in% "outcome_BMI_fnl_12m"]

# remove column that had near zero-variance variables
myData <- myData %>% dplyr::select(-c(withdrawal))

# Split the data into training and test sets
set.seed(123)  # Set seed for reproducibility
inTrain <- sample(seq_len(nrow(myData)), size = floor(nrow(myData) / 2))
training <- myData[inTrain,]
test <- myData[-inTrain,]
dim(training)
dim(test)

trainTarget <- myTarget[inTrain]
testTarget <- myTarget[-inTrain]

# Preprocess the data
preProcValues <- preProcess(training, method = c("center", "scale"))
trainTransformed <- predict(preProcValues, training)
testTransformed <- predict(preProcValues, test)
```
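
For comparison, here is a minimal sketch of the "range" option mentioned above, applied to the same training set (`age` in the summary is just an assumed example column):

```{r}
# Rescale each predictor to [0, 1] using minima and maxima learned from the training set only
rangeValues <- preProcess(training, method = "range")
trainRange <- predict(rangeValues, training)
summary(trainRange$age)   # values should now lie between 0 and 1
```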


### Background

The preProcess class can be used for many operations on predictors, including centering and scaling. The transformation can be estimated from the training data and applied to any data set with the same variables. The function preProcess estimates the required parameters for each operation and predict.preProcess is used to apply them to specific data sets. This function can also be interfaced when calling the train function.

Several types of techniques are described in the next few sections and then another example is used to demonstrate how multiple methods can be used. Note that, in all cases, the preProcess function estimates whatever it requires from a specific data set (e.g. the training set) and then applies these transformations to any data set without recomputing the values.

# {-}

# Imputation {.tabset}

### Background
preProcess can be used to impute data sets based only on information in the training set. One method of doing this is with K-nearest neighbors. For an arbitrary sample, the K closest neighbors are found in the training set and the value for the predictor is imputed using these values (e.g. using the mean). Using this approach will automatically trigger preProcess to center and scale the data, regardless of what is in the method argument. Alternatively, bagged trees can also be used to impute. For each predictor in the data, a bagged tree is created using all of the other predictors in the training set. When a new sample has a missing predictor value, the bagged model is used to predict the value. While, in theory, this is a more powerful method of imputing, the computational costs are much higher than the nearest neighbor technique.

### Code 

TBD
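
In the meantime, a minimal sketch of KNN imputation with preProcess, assuming the training/test split from section 3.6 and all-numeric predictors:

```{r}
# KNN imputation learned from the training set only; requesting knnImpute also
# centers and scales the predictors, regardless of what else is in `method`
impValues <- preProcess(training, method = "knnImpute")
trainImputed <- predict(impValues, training)
testImputed <- predict(impValues, test)
sum(is.na(trainImputed))

# bagged-tree alternative (more flexible but much slower):
# preProcess(training, method = "bagImpute")
```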

# {-}

# Transforming Predictors {.tabset}

### Code 

In some cases, there is a need to use principal component analysis (PCA) to transform the data to a smaller sub-space where the new variables are uncorrelated with one another. 
The "spatial sign" transformation (Serneels et al, 2006) projects the data for a predictor to the unit circle in p dimensions, where p is the number of predictors. Essentially, a vector of data is divided by its norm. The two figures below show two centered and scaled predictors from this data set before and after the spatial sign transformation. The predictors should be centered and scaled before applying this transformation.

```{r}
transparentTheme(trans = .4)

# two centered and scaled weight variables from the test set, grouped by sex
plotSubset <- data.frame(scale(testTransformed[, c("wt_pctchange_12m_bl", "outcome_wt_fnl_BL")])) 
group <- as.factor(testTransformed$sex)
levels(group) <- c("Woman", "Man")
xyplot(wt_pctchange_12m_bl ~ outcome_wt_fnl_BL,
       data = plotSubset,
       groups = group, 
       auto.key = list(columns = 2))  
```

After the spatial sign transformation:

```{r}
transformed <- spatialSign(plotSubset)
transformed <- as.data.frame(transformed)
xyplot(wt_pctchange_12m_bl ~ outcome_wt_fnl_BL, 
       data = transformed, 
       groups = group, 
       auto.key = list(columns = 2)) 
```

Another option, "BoxCox", will estimate a Box-Cox transformation on the predictors if the data are greater than zero.

```{r}
preProcValues2 <- preProcess(training, method = "BoxCox")
trainBC <- predict(preProcValues2, training)
testBC <- predict(preProcValues2, test)
preProcValues2
```
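
The PCA transformation mentioned at the top of this tab is not demonstrated above; a minimal sketch of what it could look like, with knnImpute added so PCA sees complete data and an arbitrary pcaComp = 5 (both are assumptions, not final choices):

```{r}
# Impute, then project the predictors onto the first 5 principal components;
# predict.preProcess renames the transformed columns PC1, PC2, ...
prePCA <- preProcess(training, method = c("knnImpute", "pca"), pcaComp = 5)
trainPCA <- predict(prePCA, training)
head(colnames(trainPCA))
# ICA would be requested the same way with method = "ica" and n.comp set
```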


### Background 

In some cases, there is a need to use principal component analysis (PCA) to transform the data to a smaller sub-space where the new variables are uncorrelated with one another. The preProcess class can apply this transformation by including "pca" in the method argument. Doing this will also force scaling of the predictors. Note that when PCA is requested, predict.preProcess changes the column names to PC1, PC2 and so on.

Similarly, independent component analysis (ICA) can also be used to find new variables that are linear combinations of the original set such that the components are independent (as opposed to uncorrelated in PCA). The new variables will be labeled as IC1, IC2 and so on.

The "spatial sign" transformation (Serneels et al, 2006) projects the data for a predictor to the unit circle in p dimensions, where p is the number of predictors. Essentially, a vector of data is divided by its norm. The two figures in the code tab show two centered and scaled predictors from this data set before and after the spatial sign transformation. The predictors should be centered and scaled before applying this transformation.

# {-}

# Putting section 3 together {.tabset}

### info

In Applied Predictive Modeling there is a case study where the execution times of jobs in a high-performance computing environment are predicted; the caret tutorial walks through that example using the schedulingData set from the AppliedPredictiveModeling package. The same idea of chaining several pre-processing steps in one preProcess call can be applied to the met data.


### Transformations & scaling
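
Nothing is run here yet; below is a minimal sketch of chaining several of the earlier steps in one preProcess call on the training set from section 3.6 (this particular combination of methods is an assumption, not the final pipeline):

```{r}
# Several steps in one call; caret applies them in its own fixed internal order
# (filters first, then centering/scaling, then imputation), not the order written here.
# The tutorial's case study additionally requests a "YeoJohnson" transformation the same way.
ppAll <- preProcess(training, method = c("nzv", "center", "scale", "knnImpute"))
ppAll
trainFinal <- predict(ppAll, training)
testFinal <- predict(ppAll, test)
```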

### 

# {-}