[d5a14e]: / vignettes / BiocAIML.Rmd

Download this file

97 lines (77 with data), 2.6 kB

---
title: "BiocAIML: machine learning in genomics with Bioconductor"
author: "Vincent J. Carey, stvjc at channing.harvard.edu"
date: "`r format(Sys.time(), '%B %d, %Y')`"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{BiocAIML: machine learning in genomics with Bioconductor}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    highlight: pygments
    number_sections: yes
    theme: united
    toc: yes
    fig_crop: false
bibliography: biocaiml.bib
---

# Introduction

Machine learning methods have long been central to computational
biology.  The BiocAIML package aims to illustrate the use
of new approaches to machine learning with R in the
context of genome research.

# Taster: classification in glioblastoma multiforme (GBM)

## Data setup

An illustrative dataset derived from the Cancer Genome Atlas (TCGA)
is available with the BiocAIML package.  This dataset will
be retrieved from the cloud using `r BiocStyle::Biocpkg("curatedTCGAData")`,
massaged to include clinical data published in
@Brennan2013
and cached for future use.

```{r lktcg}
suppressPackageStartupMessages({
library(BiocAIML)
library(survival)
library(rpart)
library(SummarizedExperiment)
})
gbmse = build_gbm_se()
gbmse
```

## Sanity check

As a sanity check, we show that MGMT methylation status is
associated with longer survival times in this dataset.

```{r lksurv}
xm = gbmse[, gbmse$mgmt_status !="" & gbmse$vital_status !=""]
ss = Surv(xm$os_days, 1*(xm$vital_status=="DECEASED"))
plot(survfit(ss~xm$mgmt_status), lty=1:2)
legend(900, .95, lty=c(1,2), legend=c("MGMT methylated", "unmethylated"))
title("Time-on-study/vital status for 123 GBM patients\nanalyzed in Brennan et al. PMID 24120142")
```

## Classification with randomly chosen features

Let's pick a random sample of 100 genes and classify the
'expression-based' subtype of GBM using `r CRANpkg("survival")`'s
`rpart`.

```{r dotree1}
set.seed(1234)
xms = xm[sample(seq_len(nrow(xm)), size=100),]
xmsdf = data.frame(cl=xms$expression_subclass, t(assay(xms)))
rp1 = rpart(cl~., data=xmsdf)
tt = table(predicted=predict(rp1, type="class"), given=na.omit(xmsdf$cl))
tt
```

There are `r sum(tt)-sum(diag(tt))` "errors" in 122 predictions.
The associated tree is:

```{r lktr,fig.width=7, out.width="67%", fig.height=7, out.height="67%"}
plot(rp1)
text(rp1)
```

and the cross-validated error profile is
```{r lkcp}
plotcp(rp1)
```

This indicates that the best tree obtainable with these
features (100 randomly sampled genes) has 8 nodes and a relative
error (compared to declaring all patients to have the majority
class) of around 80%.  



# References