b/partyMod/man/cforest.Rd
\name{cforest}
\alias{cforest}
\alias{proximity}
\title{ Random Forest }
\description{
  An implementation of the random forest and bagging ensemble algorithms
  utilizing conditional inference trees as base learners.
}
\usage{
cforest(formula, data = list(), subset = NULL, weights = NULL,
        controls = cforest_unbiased(),
        xtrafo = ptrafo, ytrafo = ptrafo, scores = NULL)
proximity(object, newdata = NULL)
}
\arguments{
  \item{formula}{ a symbolic description of the model to be fit. Note
    that symbols like \code{:} and \code{-} will not work
    and the tree will make use of all variables listed on the
    rhs of \code{formula}.}
  \item{data}{ a data frame containing the variables in the model. }
  \item{subset}{ an optional vector specifying a subset of observations to be
    used in the fitting process.}
  \item{weights}{ an optional vector of weights to be used in the fitting
    process. Non-negative integer-valued weights are
    allowed as well as non-negative real weights.
    Observations are sampled (with or without replacement)
    according to probabilities \code{weights / sum(weights)}.
    The fraction of observations to be sampled (without replacement)
    is computed based on the sum of the weights if all weights
    are integer-valued, and based on the number of weights greater than zero
    otherwise. Alternatively, \code{weights} can be a double matrix defining
    case weights for all \code{ncol(weights)} trees in the forest directly.
    This requires more storage but gives the user more control.}
  \item{controls}{ an object of class \code{\link{ForestControl-class}}, which can be
    obtained using \code{\link{cforest_control}} (and its
    convenience interfaces \code{cforest_unbiased} and \code{cforest_classical}).}
  \item{xtrafo}{ a function to be applied to all input variables.
    By default, the \code{\link{ptrafo}} function is applied.}
  \item{ytrafo}{ a function to be applied to all response variables.
    By default, the \code{\link{ptrafo}} function is applied.}
  \item{scores}{ an optional named list of scores to be attached to ordered
    factors.}
  \item{object}{ an object as returned by \code{cforest}.}
  \item{newdata}{ an optional data frame containing test data.}
}
\details{

  This implementation of the random forest (and bagging) algorithm differs
  from the reference implementation in \code{\link[randomForest]{randomForest}}
  with respect to the base learners used and the aggregation scheme applied.

  Conditional inference trees, see \code{\link{ctree}}, are fitted to each
  of the \code{ntree} (defined via \code{\link{cforest_control}})
  bootstrap samples of the learning sample. Most of the hyperparameters in
  \code{\link{cforest_control}} regulate the construction of the conditional
  inference trees. Therefore, you MUST NOT change anything you don't
  understand completely.

  Hyperparameters you might want to change in \code{\link{cforest_control}} are:

  1. The number of randomly preselected variables \code{mtry}, which is fixed
  to the value 5 by default here for technical reasons, while in
  \code{\link[randomForest]{randomForest}} the default values for classification
  and regression vary with the number of input variables.

  2. The number of trees \code{ntree}. Use more trees if you have more variables.

  3. The depth of the trees, regulated by \code{mincriterion}. Usually unstopped
  and unpruned trees are used in random forests. To grow large trees, set
  \code{mincriterion} to a small value.
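
  For illustration only (the values below are arbitrary, not tuning
  recommendations), \code{ntree} and \code{mtry} can be passed through the
  \code{controls} argument, for example via \code{cforest_unbiased};
  \code{mincriterion} can be adjusted analogously via \code{\link{cforest_control}}:

\preformatted{
## illustrative settings only -- not recommendations
ctrl <- cforest_unbiased(ntree = 500,  # number of trees
                         mtry = 3)     # randomly preselected variables per split
cf <- cforest(Species ~ ., data = iris, controls = ctrl)
}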

  The aggregation scheme works by averaging observation weights extracted
  from each of the \code{ntree} trees and NOT by averaging predictions directly
  as in \code{\link[randomForest]{randomForest}}.
  See Hothorn et al. (2004) for a description.
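
  As a purely conceptual sketch of this aggregation idea (toy numbers, not the
  internal implementation): each tree contributes the learning-sample weights of
  the terminal node a new observation falls into, these weights are averaged,
  and the prediction is formed from the weighted learning sample.

\preformatted{
## conceptual toy example of weight-based aggregation
y <- c(1, 2, 3, 4)                # learning-sample responses
w_tree1 <- c(1, 1, 0, 0)          # terminal-node weights from tree 1
w_tree2 <- c(1, 0, 1, 0)          # terminal-node weights from tree 2
w <- (w_tree1 + w_tree2) / 2      # averaged observation weights
weighted.mean(y, w)               # aggregated prediction
}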

  Predictions can be computed using \code{\link{predict}}. For observations
  with zero weights, predictions are computed from the fitted tree
  when \code{newdata = NULL}. While \code{\link{predict}} returns predictions
  of the same type as the response in the data set by default (i.e., predicted
  class labels for factors), \code{\link{treeresponse}} returns the statistics
  of the conditional distribution of the response (i.e., predicted class
  probabilities for factors). The same is done by \code{predict(..., type = "prob")}.
  Note that for multivariate responses \code{predict} does not convert predictions
  to the type of the response, i.e., \code{type = "prob"} is used.
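
  A small sketch of these prediction types, assuming a factor response and
  mirroring the calls used in the examples below:

\preformatted{
## illustrative only: class labels vs. conditional class probabilities
cf <- cforest(Species ~ ., data = iris,
              controls = cforest_unbiased(ntree = 50, mtry = 2))
predict(cf)[1:3]                  # predicted class labels
predict(cf, type = "prob")[1:3]   # class probabilities, as for treeresponse
treeresponse(cf)[1:3]             # conditional distributions of the response
}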

  Ensembles of conditional inference trees have not yet been extensively
  tested, so this routine is meant for the expert user only and its current
  state is rather experimental. However, there are some things available
  in \code{\link{cforest}} that can't be done with \code{\link[randomForest]{randomForest}},
  for example fitting forests to censored response variables (see Hothorn et al., 2006a)
  or to multivariate and ordered responses.

  Moreover, when predictors vary in their scale of measurement or number
  of categories, variable selection and computation of variable importance is biased
  in favor of variables with many potential cutpoints in \code{\link[randomForest]{randomForest}},
  while in \code{\link{cforest}} unbiased trees and an adequate resampling scheme
  are used by default. See Hothorn et al. (2006b) and Strobl et al. (2007)
  as well as Strobl et al. (2009).

  The \code{proximity} matrix is an \eqn{n \times n} matrix \eqn{P} with \eqn{P_{ij}}
  equal to the fraction of trees in which observations \eqn{i} and \eqn{j}
  are elements of the same terminal node (when both \eqn{i} and \eqn{j}
  had non-zero weights in the same bootstrap sample).
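
  A brief sketch of what \code{proximity} returns (illustrative values only):

\preformatted{
## illustrative only: proximities are fractions of trees, so they lie in [0, 1]
cf <- cforest(Species ~ ., data = iris,
              controls = cforest_unbiased(ntree = 50, mtry = 2))
prox <- proximity(cf)
dim(prox)      # n x n matrix, here 150 x 150
range(prox)    # entries between 0 and 1
}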

}
\value{
  An object of class \code{\link{RandomForest-class}}.
}
\references{

  Leo Breiman (2001). Random Forests. \emph{Machine Learning}, \bold{45}(1), 5--32.

  Torsten Hothorn, Berthold Lausen, Axel Benner and Martin Radespiel-Troeger
  (2004). Bagging Survival Trees. \emph{Statistics in Medicine}, \bold{23}(1), 77--91.

  Torsten Hothorn, Peter Buhlmann, Sandrine Dudoit, Annette Molinaro
  and Mark J. van der Laan (2006a). Survival Ensembles. \emph{Biostatistics},
  \bold{7}(3), 355--373.

  Torsten Hothorn, Kurt Hornik and Achim Zeileis (2006b). Unbiased
  Recursive Partitioning: A Conditional Inference Framework.
  \emph{Journal of Computational and Graphical Statistics}, \bold{15}(3),
  651--674. Preprint available from
  \url{http://statmath.wu-wien.ac.at/~zeileis/papers/Hothorn+Hornik+Zeileis-2006.pdf}

  Carolin Strobl, Anne-Laure Boulesteix, Achim Zeileis and Torsten Hothorn (2007).
  Bias in Random Forest Variable Importance Measures: Illustrations, Sources and
  a Solution. \emph{BMC Bioinformatics}, \bold{8}, 25.
  \url{http://www.biomedcentral.com/1471-2105/8/25}

  Carolin Strobl, James Malley and Gerhard Tutz (2009).
  An Introduction to Recursive Partitioning: Rationale, Application, and
  Characteristics of Classification and Regression Trees, Bagging, and Random Forests.
  \emph{Psychological Methods}, \bold{14}(4), 323--348.

}
\examples{

set.seed(290875)

### honest (i.e., out-of-bag) cross-classification of
### true vs. predicted classes
data("mammoexp", package = "TH.data")
table(mammoexp$ME, predict(cforest(ME ~ ., data = mammoexp,
                           control = cforest_unbiased(ntree = 50)),
                           OOB = TRUE))

### fit forest to censored response
if (require("TH.data") && require("survival")) {

    data("GBSG2", package = "TH.data")
    bst <- cforest(Surv(time, cens) ~ ., data = GBSG2,
                   control = cforest_unbiased(ntree = 50))

    ### estimate conditional Kaplan-Meier curves
    treeresponse(bst, newdata = GBSG2[1:2,], OOB = TRUE)

    ### if you can't resist looking at individual trees ...
    party:::prettytree(bst@ensemble[[1]], names(bst@data@get("input")))
}

### proximity, see ?randomForest
iris.cf <- cforest(Species ~ ., data = iris,
                   control = cforest_unbiased(mtry = 2))
iris.mds <- cmdscale(1 - proximity(iris.cf), eig = TRUE)
op <- par(pty = "s")
pairs(cbind(iris[, 1:4], iris.mds$points), cex = 0.6, gap = 0,
      col = c("red", "green", "blue")[as.numeric(iris$Species)],
      main = "Iris Data: Predictors and MDS of Proximity Based on cforest")
par(op)

}
\keyword{tree} |