partyMod/man/cforest.Rd
\name{cforest}
\alias{cforest}
\alias{proximity}
\title{ Random Forest }
\description{
    An implementation of the random forest and bagging ensemble algorithms
    utilizing conditional inference trees as base learners.
}
\usage{
cforest(formula, data = list(), subset = NULL, weights = NULL,
        controls = cforest_unbiased(),
        xtrafo = ptrafo, ytrafo = ptrafo, scores = NULL)
proximity(object, newdata = NULL)
}
\arguments{
  \item{formula}{ a symbolic description of the model to be fit. Note
                  that symbols like \code{:} and \code{-} will not work
                  and the tree will make use of all variables listed on the
                  rhs of \code{formula}.}
  \item{data}{ a data frame containing the variables in the model. }
  \item{subset}{ an optional vector specifying a subset of observations to be
                 used in the fitting process.}
  \item{weights}{ an optional vector of weights to be used in the fitting
                  process. Non-negative integer-valued weights are
                  allowed as well as non-negative real weights.
                  Observations are sampled (with or without replacement)
                  according to probabilities \code{weights / sum(weights)}.
                  The fraction of observations to be sampled (without replacement)
                  is computed based on the sum of the weights if all weights
                  are integer-valued, and based on the number of weights greater
                  than zero otherwise. Alternatively, \code{weights} can be a double
                  matrix defining case weights for all \code{ncol(weights)} trees in
                  the forest directly. This requires more storage but gives the user
                  more control.}
  \item{controls}{an object of class \code{\link{ForestControl-class}}, which can be
                  obtained using \code{\link{cforest_control}} (and its
                  convenience interfaces \code{cforest_unbiased} and \code{cforest_classical}).}
  \item{xtrafo}{ a function to be applied to all input variables.
                 By default, the \code{\link{ptrafo}} function is applied.}
  \item{ytrafo}{ a function to be applied to all response variables.
                 By default, the \code{\link{ptrafo}} function is applied.}
  \item{scores}{ an optional named list of scores to be attached to ordered
                 factors.}
  \item{object}{ an object as returned by \code{cforest}.}
  \item{newdata}{ an optional data frame containing test data.}
}
\details{

  This implementation of the random forest (and bagging) algorithm differs
  from the reference implementation in \code{\link[randomForest]{randomForest}}
  with respect to the base learners used and the aggregation scheme applied.

  Conditional inference trees, see \code{\link{ctree}}, are fitted to each
  of the \code{ntree} (defined via \code{\link{cforest_control}})
  bootstrap samples of the learning sample. Most of the hyperparameters in
  \code{\link{cforest_control}} regulate the construction of the conditional inference trees.
  Therefore, you MUST NOT change anything you don't understand completely.

  Hyperparameters you might want to change in \code{\link{cforest_control}} are
  (an illustrative call is sketched after this list):

  1. The number of randomly preselected variables \code{mtry}, which is fixed
  to the value 5 by default here for technical reasons, while in
  \code{\link[randomForest]{randomForest}} the default values for classification and regression
  vary with the number of input variables.

  2. The number of trees \code{ntree}. Use more trees if you have more variables.

  3. The depth of the trees, regulated by \code{mincriterion}. Usually unstopped and unpruned
  trees are used in random forests. To grow large trees, set \code{mincriterion} to a small value.

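  A minimal sketch of passing such settings to \code{cforest} follows; the particular
  values (\code{ntree = 100}, \code{mtry = 3}, \code{mincriterion = 0}) are arbitrary
  illustrations, not recommendations:
\preformatted{
## illustrative settings only: 100 effectively unstopped trees with
## three randomly preselected variables per split
ctrl <- cforest_control(ntree = 100, mtry = 3, mincriterion = 0)
cf <- cforest(Species ~ ., data = iris, controls = ctrl)
}
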
  The aggregation scheme works by averaging observation weights extracted
  from each of the \code{ntree} trees and NOT by averaging predictions directly
  as in \code{\link[randomForest]{randomForest}}.
  See Hothorn et al. (2004) for a description.

  Predictions can be computed using \code{\link{predict}}. For observations
  with zero weights, predictions are computed from the fitted tree
  when \code{newdata = NULL}. While \code{\link{predict}} returns predictions
  of the same type as the response in the data set by default (i.e., predicted
  class labels for factors), \code{\link{treeresponse}} returns the statistics
  of the conditional distribution of the response (i.e., predicted class
  probabilities for factors). The same is done by \code{predict(..., type = "prob")}.
  Note that for multivariate responses \code{predict} does not convert predictions
  to the type of the response, i.e., \code{type = "prob"} is used.
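
  A brief illustrative sketch of these prediction types, using the iris data
  from the examples below:
\preformatted{
## predicted class labels vs. conditional class probabilities
cf <- cforest(Species ~ ., data = iris,
              controls = cforest_unbiased(ntree = 50))
predict(cf)                  ## class labels, the type of the response
predict(cf, type = "prob")   ## class probabilities for each observation
treeresponse(cf)             ## the same conditional distributions
}
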
  Ensembles of conditional inference trees have not yet been extensively
  tested, so this routine is meant for the expert user only and its current
  state is rather experimental. However, there are some things available
  in \code{\link{cforest}} that can't be done with \code{\link[randomForest]{randomForest}},
  for example fitting forests to censored response variables (see Hothorn et al., 2006a) or to
  multivariate and ordered responses.

  Moreover, when predictors vary in their scale of measurement or number
  of categories, variable selection and computation of variable importance are biased
  in favor of variables with many potential cutpoints in \code{\link[randomForest]{randomForest}},
  while in \code{\link{cforest}} unbiased trees and an adequate resampling scheme
  are used by default. See Hothorn et al. (2006b) and Strobl et al. (2007)
  as well as Strobl et al. (2009).

  The \code{proximity} matrix is an \eqn{n \times n} matrix \eqn{P} with \eqn{P_{ij}}
  equal to the fraction of trees in which observations \eqn{i} and \eqn{j}
  are elements of the same terminal node (when both \eqn{i} and \eqn{j}
  had non-zero weights in the same bootstrap sample).
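
  A small illustrative sketch, again using the iris data from the examples:
\preformatted{
## for each pair of observations, the fraction of trees in which
## both end up in the same terminal node
cf <- cforest(Species ~ ., data = iris,
              controls = cforest_unbiased(ntree = 50))
prox <- proximity(cf)   ## an n x n matrix, here 150 x 150
prox[1:3, 1:3]
}
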
}
\value{
  An object of class \code{\link{RandomForest-class}}.
}
\references{

    Leo Breiman (2001). Random Forests. \emph{Machine Learning}, \bold{45}(1), 5--32.

    Torsten Hothorn, Berthold Lausen, Axel Benner and Martin Radespiel-Troeger
    (2004). Bagging Survival Trees. \emph{Statistics in Medicine}, \bold{23}(1), 77--91.

    Torsten Hothorn, Peter Buhlmann, Sandrine Dudoit, Annette Molinaro
    and Mark J. van der Laan (2006a). Survival Ensembles. \emph{Biostatistics},
    \bold{7}(3), 355--373.

    Torsten Hothorn, Kurt Hornik and Achim Zeileis (2006b). Unbiased
    Recursive Partitioning: A Conditional Inference Framework.
    \emph{Journal of Computational and Graphical Statistics}, \bold{15}(3),
    651--674. Preprint available from
    \url{http://statmath.wu-wien.ac.at/~zeileis/papers/Hothorn+Hornik+Zeileis-2006.pdf}

    Carolin Strobl, Anne-Laure Boulesteix, Achim Zeileis and Torsten Hothorn (2007).
    Bias in Random Forest Variable Importance Measures: Illustrations, Sources and
    a Solution. \emph{BMC Bioinformatics}, \bold{8}, 25.
    \url{http://www.biomedcentral.com/1471-2105/8/25}

    Carolin Strobl, James Malley and Gerhard Tutz (2009).
    An Introduction to Recursive Partitioning: Rationale, Application, and Characteristics of
    Classification and Regression Trees, Bagging, and Random Forests.
    \emph{Psychological Methods}, \bold{14}(4), 323--348.

}
\examples{

    set.seed(290875)

    ### honest (i.e., out-of-bag) cross-classification of
    ### true vs. predicted classes
    data("mammoexp", package = "TH.data")
    table(mammoexp$ME, predict(cforest(ME ~ ., data = mammoexp, 
                               control = cforest_unbiased(ntree = 50)),
                               OOB = TRUE))

    ### fit forest to censored response
    if (require("TH.data") && require("survival")) {

        data("GBSG2", package = "TH.data")
        bst <- cforest(Surv(time, cens) ~ ., data = GBSG2, 
                       control = cforest_unbiased(ntree = 50))

        ### estimate conditional Kaplan-Meier curves
        treeresponse(bst, newdata = GBSG2[1:2,], OOB = TRUE)

        ### if you can't resist looking at individual trees ...
        party:::prettytree(bst@ensemble[[1]], names(bst@data@get("input")))
    }

    ### proximity, see ?randomForest
    iris.cf <- cforest(Species ~ ., data = iris, 
                       control = cforest_unbiased(mtry = 2))
    iris.mds <- cmdscale(1 - proximity(iris.cf), eig = TRUE)
    op <- par(pty = "s")
    pairs(cbind(iris[,1:4], iris.mds$points), cex = 0.6, gap = 0, 
          col = c("red", "green", "blue")[as.numeric(iris$Species)],
          main = "Iris Data: Predictors and MDS of Proximity Based on cforest")
    par(op)

}
\keyword{tree}