--- /dev/null
+++ b/partyMod/man/cforest.Rd
@@ -0,0 +1,171 @@
+\name{cforest}
+\alias{cforest}
+\alias{proximity}
+\title{ Random Forest }
+\description{
+    An implementation of the random forest and bagging ensemble algorithms
+    utilizing conditional inference trees as base learners.
+}
+\usage{
+cforest(formula, data = list(), subset = NULL, weights = NULL,
+        controls = cforest_unbiased(),
+        xtrafo = ptrafo, ytrafo = ptrafo, scores = NULL)
+proximity(object, newdata = NULL)
+}
+\arguments{
+  \item{formula}{ a symbolic description of the model to be fit. Note
+    that symbols like \code{:} and \code{-} will not work
+    and the tree will make use of all variables listed on the
+    rhs of \code{formula}.}
+  \item{data}{ a data frame containing the variables in the model. }
+  \item{subset}{ an optional vector specifying a subset of observations to be
+    used in the fitting process.}
+  \item{weights}{ an optional vector of weights to be used in the fitting
+    process. Both non-negative integer-valued and non-negative real-valued
+    weights are allowed.
+    Observations are sampled (with or without replacement)
+    according to probabilities \code{weights / sum(weights)}.
+    The fraction of observations to be sampled (without replacement)
+    is computed based on the sum of the weights if all weights
+    are integer-valued, and based on the number of weights greater than
+    zero otherwise. Alternatively, \code{weights} can be a double matrix
+    defining case weights for all \code{ncol(weights)} trees in the forest
+    directly. This requires more storage but gives the user more control.}
+  \item{controls}{an object of class \code{\link{ForestControl-class}}, which can be
+    obtained using \code{\link{cforest_control}} (and its
+    convenience interfaces \code{cforest_unbiased} and \code{cforest_classical}).}
+  \item{xtrafo}{ a function to be applied to all input variables.
+    By default, the \code{\link{ptrafo}} function is applied.}
+  \item{ytrafo}{ a function to be applied to all response variables.
+    By default, the \code{\link{ptrafo}} function is applied.}
+  \item{scores}{ an optional named list of scores to be attached to ordered
+    factors.}
+  \item{object}{ an object as returned by \code{cforest}.}
+  \item{newdata}{ an optional data frame containing test data.}
+}
+\details{
+
+  This implementation of the random forest (and bagging) algorithm differs
+  from the reference implementation in \code{\link[randomForest]{randomForest}}
+  with respect to the base learners used and the aggregation scheme applied.
+
+  Conditional inference trees, see \code{\link{ctree}}, are fitted to each
+  of the \code{ntree} (defined via \code{\link{cforest_control}})
+  bootstrap samples of the learning sample. Most of the hyperparameters in
+  \code{\link{cforest_control}} regulate the construction of the conditional
+  inference trees. Therefore, you MUST NOT change anything you don't
+  understand completely.
+
+  Hyperparameters you might want to change in \code{\link{cforest_control}} are:
+
+  1. The number of randomly preselected variables \code{mtry}, which is fixed
+  to the value 5 by default here for technical reasons, while in
+  \code{\link[randomForest]{randomForest}} the default values for classification
+  and regression vary with the number of input variables.
+
+  2. The number of trees \code{ntree}. Use more trees if you have more variables.
+
+  3. The depth of the trees, regulated by \code{mincriterion}. Usually unstopped
+  and unpruned trees are used in random forests. To grow large trees, set
+  \code{mincriterion} to a small value.
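+
+  For illustration, the following minimal sketch passes custom values for
+  \code{ntree} and \code{mtry} through the \code{cforest_unbiased} convenience
+  interface; the values shown are arbitrary, not recommendations:
+
+\preformatted{
+ctrl <- cforest_unbiased(ntree = 500,  ## item 2: more trees for more variables
+                         mtry = 3)     ## item 1: inputs preselected per node
+cf <- cforest(Species ~ ., data = iris, controls = ctrl)
+}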
+
+  The aggregation scheme works by averaging observation weights extracted
+  from each of the \code{ntree} trees and NOT by averaging predictions directly
+  as in \code{\link[randomForest]{randomForest}}.
+  See Hothorn et al. (2004) for a description.
+
+  Predictions can be computed using \code{\link{predict}}. For observations
+  with zero weights, predictions are computed from the fitted tree
+  when \code{newdata = NULL}. While \code{\link{predict}} returns predictions
+  of the same type as the response in the data set by default (i.e., predicted
+  class labels for factors), \code{\link{treeresponse}} returns the statistics
+  of the conditional distribution of the response (i.e., predicted class
+  probabilities for factors). The same is done by \code{predict(..., type = "prob")}.
+  Note that for multivariate responses \code{predict} does not convert
+  predictions to the type of the response, i.e., \code{type = "prob"} is used.
+
+  Ensembles of conditional inference trees have not yet been extensively
+  tested, so this routine is meant for the expert user only and its current
+  state is rather experimental. However, there are some things available
+  in \code{\link{cforest}} that can't be done with
+  \code{\link[randomForest]{randomForest}}, for example fitting forests to
+  censored response variables (see Hothorn et al., 2006a) or to
+  multivariate and ordered responses.
+
+  Moreover, when predictors vary in their scale of measurement or number
+  of categories, variable selection and computation of variable importance
+  are biased in favor of variables with many potential cutpoints in
+  \code{\link[randomForest]{randomForest}}, while in \code{\link{cforest}}
+  unbiased trees and an adequate resampling scheme are used by default.
+  See Hothorn et al. (2006b) and Strobl et al. (2007)
+  as well as Strobl et al. (2009).
+
+  The \code{proximity} matrix is an \eqn{n \times n} matrix \eqn{P} with
+  \eqn{P_{ij}} equal to the fraction of trees in which observations \eqn{i}
+  and \eqn{j} are elements of the same terminal node (when both \eqn{i} and
+  \eqn{j} had non-zero weights in the same bootstrap sample).
+
+}
+\value{
+  An object of class \code{\link{RandomForest-class}}.
+}
+\references{
+
+  Leo Breiman (2001). Random Forests. \emph{Machine Learning}, \bold{45}(1), 5--32.
+
+  Torsten Hothorn, Berthold Lausen, Axel Benner and Martin Radespiel-Troeger
+  (2004). Bagging Survival Trees. \emph{Statistics in Medicine}, \bold{23}(1), 77--91.
+
+  Torsten Hothorn, Peter Buhlmann, Sandrine Dudoit, Annette Molinaro
+  and Mark J. van der Laan (2006a). Survival Ensembles. \emph{Biostatistics},
+  \bold{7}(3), 355--373.
+
+  Torsten Hothorn, Kurt Hornik and Achim Zeileis (2006b). Unbiased
+  Recursive Partitioning: A Conditional Inference Framework.
+  \emph{Journal of Computational and Graphical Statistics}, \bold{15}(3),
+  651--674. Preprint available from
+  \url{http://statmath.wu-wien.ac.at/~zeileis/papers/Hothorn+Hornik+Zeileis-2006.pdf}
+
+  Carolin Strobl, Anne-Laure Boulesteix, Achim Zeileis and Torsten Hothorn (2007).
+  Bias in Random Forest Variable Importance Measures: Illustrations, Sources and
+  a Solution. \emph{BMC Bioinformatics}, \bold{8}, 25.
+  \url{http://www.biomedcentral.com/1471-2105/8/25}
+
+  Carolin Strobl, James Malley and Gerhard Tutz (2009).
+  An Introduction to Recursive Partitioning: Rationale, Application, and
+  Characteristics of Classification and Regression Trees, Bagging, and Random
+  Forests. \emph{Psychological Methods}, \bold{14}(4), 323--348.
+
+}
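+\seealso{
+
+  \code{\link{ctree}} for a single conditional inference tree,
+  \code{\link{treeresponse}} for predicted conditional response distributions,
+  and \code{\link{varimp}} for the variable importance measures discussed in
+  Strobl et al. (2007).
+
+}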
+\examples{
+
+    set.seed(290875)
+
+    ### honest (i.e., out-of-bag) cross-classification of
+    ### true vs. predicted classes
+    data("mammoexp", package = "TH.data")
+    table(mammoexp$ME, predict(cforest(ME ~ ., data = mammoexp,
+                               controls = cforest_unbiased(ntree = 50)),
+                               OOB = TRUE))
+
+    ### fit forest to censored response
+    if (require("TH.data") && require("survival")) {
+
+        data("GBSG2", package = "TH.data")
+        bst <- cforest(Surv(time, cens) ~ ., data = GBSG2,
+                       controls = cforest_unbiased(ntree = 50))
+
+        ### estimate conditional Kaplan-Meier curves
+        treeresponse(bst, newdata = GBSG2[1:2,], OOB = TRUE)
+
+        ### if you can't resist looking at individual trees ...
+        party:::prettytree(bst@ensemble[[1]], names(bst@data@get("input")))
+    }
+
+    ### proximity, see ?randomForest
+    iris.cf <- cforest(Species ~ ., data = iris,
+                       controls = cforest_unbiased(mtry = 2))
+    iris.mds <- cmdscale(1 - proximity(iris.cf), eig = TRUE)
+    op <- par(pty = "s")
+    pairs(cbind(iris[, 1:4], iris.mds$points), cex = 0.6, gap = 0,
+          col = c("red", "green", "blue")[as.numeric(iris$Species)],
+          main = "Iris Data: Predictors and MDS of Proximity Based on cforest")
+    par(op)
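+
+    ### a short illustrative sketch, reusing the iris forest fitted above:
+    ### treeresponse() and predict(type = "prob") both return the
+    ### conditional class distributions described in the details section
+    treeresponse(iris.cf, newdata = iris[1:2, ])
+    predict(iris.cf, newdata = iris[1:2, ], type = "prob")
+
+    ### variable importance (see Strobl et al., 2007)
+    varimp(iris.cf)
+
+}
+\keyword{tree}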