--- a
+++ b/partyMod/vignettes/MOB.Rnw
@@ -0,0 +1,488 @@
+\documentclass[nojss]{jss}
+
+%\VignetteIndexEntry{party with the mob}
+%\VignetteDepends{mlbench,colorspace}
+%\VignetteKeywords{parametric models, object-orientation, recursive partitioning}
+%\VignettePackage{party}
+
+%% packages
+\usepackage{amsmath}
+\SweaveOpts{engine=R}
+%% need no \usepackage{Sweave}
+
+<<setup, echo = FALSE, results = hide>>=
+require("party")
+options(useFancyQuotes = FALSE)
+@
+
+%% commands
+\newcommand{\ui}{\underline{i}}
+\newcommand{\oi}{\overline{\imath}}
+\newcommand{\argmin}{\operatorname{argmin}\displaylimits}
+
+%% author/title
+\author{Achim Zeileis\\Universit\"at Innsbruck \And
+  Torsten Hothorn\\Ludwig-Maximilians-\\Universit\"at M\"unchen \And
+  Kurt Hornik\\WU Wirtschafts-\\universit\"at Wien}
+\Plainauthor{Achim Zeileis, Torsten Hothorn, Kurt Hornik}
+
+\title{\pkg{party} with the \code{mob}: Model-Based Recursive Partitioning in \proglang{R}}
+\Plaintitle{party with the mob: Model-Based Recursive Partitioning in R}
+\Shorttitle{\pkg{party} with the \texttt{mob}}
+
+\Keywords{parametric models, object-orientation, recursive partitioning}
+
+%% abstract
+\Abstract{
+  The \pkg{party} package \citep{Hothorn+Hornik+Zeileis:2006a} provides
+  the function \code{mob()} implementing a recently suggested algorithm
+  for \underline{mo}del-\underline{b}ased recursive partitioning
+  \citep{Zeileis+Hothorn+Hornik:2005}. The basic steps are: (1)~fit a
+  parametric model to a data set, (2)~test for parameter instability over
+  a set of partitioning variables, (3)~if there is some overall parameter
+  instability, split the model with respect to the variable associated with
+  the highest instability, (4)~repeat the procedure in each of the child nodes.
+  It is discussed how these steps of the conceptual algorithm are translated
+  into computational tools in an object-oriented manner, allowing the user to
+  plug in various types of parametric models. The outcome is a tree where
+  each node is associated with a fitted parametric model that can be
+  effectively visualized and summarized.
+}
+
+\Address{
+  Achim Zeileis\\
+  Department of Statistics\\
+  Faculty of Economics and Statistics\\
+  Universit\"at Innsbruck\\
+  Universit\"atsstr.~15\\
+  6020 Innsbruck, Austria\\
+  E-mail: \email{Achim.Zeileis@R-project.org}\\
+  URL: \url{http://eeecon.uibk.ac.at/~zeileis/}\\
+
+  Torsten Hothorn\\
+  Institut f\"ur Statistik\\
+  Ludwig-Maximilians-Universit\"at M\"unchen\\
+  Ludwigstra{\ss}e 33\\
+  80539 M\"unchen, Germany\\
+  E-mail: \email{Torsten.Hothorn@R-project.org}\\
+  URL: \url{http://www.stat.uni-muenchen.de/~hothorn/}\\
+
+  Kurt Hornik\\
+  Institute for Statistics and Mathematics\\
+  WU Wirtschaftsuniversit\"at Wien\\
+  Augasse 2--6\\
+  1090 Wien, Austria\\
+  E-mail: \email{Kurt.Hornik@R-project.org}\\
+  URL: \url{http://statmath.wu.ac.at/~hornik/}\\
+
+}
+
+\begin{document}
+
+
+\section{Motivation}
+\label{sec:motivation}
+
+Consider a parametric model $\mathcal{M}(Y, \theta)$
+with (possibly vector-valued) observations $Y$ and a
+$k$-dimensional vector of parameters $\theta$. This model
+could be a (possibly multivariate) normal distribution for $Y$, or some
+kind of regression model when $Y = (y, x)$ can be split up into a dependent variable
+$y$ and regressors $x$. Examples of the latter include a linear regression
+model $y = x^\top \theta$, a generalized linear model (GLM) or a survival
+regression.
+
+Given $n$ observations $Y_i$ ($i = 1, \dots, n$) the model can be fitted
+by minimizing some objective function $\Psi(Y, \theta)$, e.g., a residual sum of squares
+or a negative log-likelihood, leading to ordinary least squares (OLS) or maximum
+likelihood (ML) estimation, respectively.
+
+If a global model for all $n$ observations does not fit well and further
+covariates $Z_1, \dots, Z_\ell$ are available, it might be possible to partition
+the $n$ observations with respect to these variables and find a well-fitting model
+in each cell of the partition. The algorithm described here tries to find
+such a partition adaptively using a greedy forward search. This procedure is
+implemented in the function \code{mob()} and described in more detail in the following
+section. However, we will state some goals and design principles in advance.
+
+To translate the model-based partitioning problem into \proglang{R}, we start with
+a formula description of the variables involved. This formula should be of type
+\verb:y ~ x1 + ... + xk | z1 + ... + zl: where the variables on the
+left of the \code{|} specify the data $Y$ and the variables on the right specify the
+partitioning variables $Z_j$. Classical regression trees usually have a univariate
+response $Y$ and various partitioning variables, i.e., could be specified as
+\verb:y ~ 1 | z1 + ... + zl:. Structural change models, on the other hand, are usually
+regression models that are segmented with respect to a single partitioning variable,
+typically time: \verb:y ~ x1 + ... + xk | z:.
+
+The type of models $\mathcal{M}$ to be used with \code{mob()} should not be
+confined (by the implementation), hence we have written an object-oriented
+implementation. The idea is that $\mathcal{M}$ is translated into software
+by a model of class ``\code{StatModel}'' as provided by the \pkg{modeltools}
+package. The algorithm then relies on various methods being available for these
+models. The ``\code{StatModel}'' objects \code{linearModel} and \code{glinearModel},
+implementing (generalized) linear regression models, are readily available in
+\pkg{modeltools}; others can easily be user-defined.
+
+
+\section{The model-based recursive partitioning algorithm}
+\label{sec:algorithm}
+
+The basic idea is to grow a tree in which every node is associated with a model
+of type $\mathcal{M}$. To assess whether splitting of the node is necessary, a
+fluctuation test for parameter instability is performed. If there is significant instability
+with respect to any of the partitioning variables $Z_j$, the node is split
+into $B$ locally optimal segments (currently only $B = 2$ is implemented)
+and then the procedure is repeated in each of the $B$ children.
+If no more significant instabilities can be found, the recursion stops.
+More precisely, the steps of the algorithm are
+
+\begin{enumerate}
+\item Fit the model once to all observations in the current node.
+\item Assess whether the parameter estimates are stable with respect to
+  every partitioning variable $Z_1, \dots, Z_\ell$. If there is some overall instability,
+  select the variable $Z_j$ associated with the highest parameter instability, otherwise
+  stop.
+\item Compute the split point(s) that locally optimize the objective function $\Psi$.
+\item Split the node into child nodes and repeat the procedure.
+\end{enumerate}
+
+The details for steps 1--3 are specified in the following.
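+
+Before turning to these details, the following sketch outlines in \proglang{R}
+code how the four steps fit together. It is purely conceptual: \code{fitModel},
+\code{instabilityTest} and \code{bestSplit} are hypothetical placeholders,
+not functions from \pkg{party}.
+
+<<algorithm-sketch, eval = FALSE>>=
+## Conceptual sketch of the recursion, not the actual mob() source.
+growNode <- function(data, weights, alpha = 0.05) {
+  model <- fitModel(data, weights)                 ## step 1: fit model in node
+  pvals <- instabilityTest(model, data, weights)   ## step 2: one p value per Z_j
+  if (min(pvals) >= alpha) return(model)           ## stop: no significant instability
+  zstar <- which.min(pvals)                        ## variable with minimal p value
+  split <- bestSplit(model, data, weights, zstar)  ## step 3: minimize objective Psi
+  lapply(split$childWeights,                       ## step 4: recurse into children
+    function(w) growNode(data, w, alpha))
+}
+@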
+
+
+\subsection{Parameter estimation}
+
+This step of the algorithm is common practice; the only additional
+requirement is (as previously noted) that the model has to be of class
+``\code{StatModel}'' as provided by \pkg{modeltools}. Looking at the source
+code for the \code{linearModel} provided by this package illustrates how
+a simple wrapper to existing \proglang{R} functionality can be written.
+In particular, a method for the generic function \code{reweight()} has to
+be available. The reason is that it is inefficient to fit a brand-new model
+\code{modelobj} (including formula parsing) in every node---much computation time
+is saved if simply \code{reweight(modelobj, weights)} is called in each of the
+child nodes. The \code{weights} argument controls which observations go into which
+of the child nodes.
+
+
+\subsection{Testing for parameter instability}
+
+The task in this step of the algorithm is to find out whether the parameters
+of the fitted model are stable over each particular ordering implied by
+the partitioning variables $Z_j$ or whether splitting the sample with respect
+to one of the $Z_j$ might capture instabilities in the parameters and thus improve the fit.
+The tests used in this step belong to the class of generalized M-fluctuation
+tests \citep{ZeileisHornik2003,Zeileis2005}. For numerical partitioning variables
+$Z_j$, the $\sup LM$~statistic is used, which is the maximum over all single-split
+$LM$~statistics. For categorical partitioning variables, a $\chi^2$~statistic is
+employed which captures the fluctuation within each of the categories of $Z_j$.
+
+For computing the test statistics and corresponding $p$~values $p_j$ for
+each of the partitioning variables $Z_j$ in \proglang{R}, the only requirement
+is that there are methods for the extractor functions \code{estfun()} and
+\code{weights()}. The \code{estfun()} method extracts the empirical estimating
+functions (model scores) from the fitted \code{modelobj}; these are the
+main ingredient for M-fluctuation tests. The \code{weights()} method is used
+to determine which observations are in the current node (i.e., have a weight
+greater than zero) and which are not (i.e., have zero weight).
+
+To determine whether there is some overall instability, it is checked whether
+the minimal $p$~value $p_{j^*} = \min_{j = 1, \dots, \ell} p_j$ falls below a
+pre-specified significance level $\alpha$ (by default $\alpha = 0.05$) or not.
+To adjust for multiple testing, the $p$~values can be Bonferroni adjusted
+(which is the default). If there is significant instability, the variable $Z_{j^*}$
+associated with the minimal $p$~value is used for splitting the node.
+
+\subsection{Splitting}
+
+In this step, the observations in the current node are split with respect
+to the chosen partitioning variable $Z_{j^*}$ into $B$ child nodes. Currently,
+the infrastructure in \pkg{party} only supports binary splits, i.e., $B = 2$.
+For determining the split point, an exhaustive search procedure is adopted:
+for each conceivable split point, the $B$ child node models are fitted and the split
+associated with the minimal value of the objective function $\Psi$ is chosen.
+
+Computationally, this means that the fitted model \code{modelobj} is \code{reweight()}ed
+for each of the child nodes. The observations entering a child node keep their
+current weight while those observations that go into different child nodes receive
+zero weight.
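+
+A minimal sketch of this reweighting for a candidate binary split is given
+below; \code{modelobj} and \code{weights} are as introduced above, whereas
+\code{z} (the selected partitioning variable) and the candidate split point
+\code{s} are hypothetical stand-ins.
+
+<<reweight-sketch, eval = FALSE>>=
+## Sketch (not run): case weights for the two children of a candidate
+## binary split of the (hypothetical) variable z at split point s.
+wleft  <- weights * as.numeric(z <= s)  ## observations entering the left child
+wright <- weights * as.numeric(z >  s)  ## all other observations get zero weight
+fmLeft  <- reweight(modelobj, wleft)    ## refit without new formula parsing
+fmRight <- reweight(modelobj, wright)
+@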
+
+To compare the objective function $\Psi$ across candidate splits, an extractor function
+is required to compute it from the fitted \code{modelobj}. This extractor function
+can be user-specified and set in \verb:mob_control():; it defaults to \code{deviance()}.
+
+
+This concludes one iteration of the recursive partitioning algorithm and steps~1--3
+are carried out again in each of the $B$ child nodes until no significant
+instability is detected in step~2.
+
+\section{Illustrations}
+\label{sec:illustration}
+
+\subsection{Boston housing data}
+
+Since the analysis by \cite{BreimanFriedman1985}, the Boston housing data are
+a popular and well-investigated empirical basis for illustrating non-linear
+regression methods both in machine learning and statistics
+\citep[see][for two recent examples]{Gama2004,Samarovetal2005} and we follow
+these examples by segmenting a bivariate linear regression model for the house
+values.
+
+Thus, the model $\mathcal{M}$ used is \code{linearModel} from the \pkg{modeltools}
+package which is automatically loaded together with \pkg{party}.
+
+<<>>=
+library("party")
+@
+
+The data set is available in package \pkg{mlbench} via
+
+<<>>=
+data("BostonHousing", package = "mlbench")
+@
+
+and provides $n = \Sexpr{NROW(BostonHousing)}$ observations of the median value of owner-occupied
+homes in Boston (in USD~1000) along with $\Sexpr{NCOL(BostonHousing) - 1}$ covariates, including in particular
+the number of rooms per dwelling (\code{rm}) and the percentage
+of lower status of the population (\code{lstat}). A segment-wise linear relationship between
+the value and these two variables is very intuitive, whereas the shape of the influence
+of the remaining covariates is rather unclear and hence should be learned from the data.
+Therefore, a linear regression model for median value explained by \verb:rm^2:
+and \verb:log(lstat): with $k = 3$ regression coefficients is employed and
+partitioned with respect to all $\ell = 11$ remaining variables. To facilitate subsequent
+commands, the transformations are explicitly stored in \code{BostonHousing}:
+
+<<>>=
+BostonHousing$lstat <- log(BostonHousing$lstat)
+BostonHousing$rm <- BostonHousing$rm^2
+@
+
+Choosing appropriate
+transformations of the dependent variable and the regressors that enter the linear
+regression model is important to obtain a well-fitting model in each segment and
+we follow in our choice the recommendations of \cite{BreimanFriedman1985}. Monotonic
+transformations of the partitioning variables do not affect the recursive partitioning
+algorithm and hence do not have to be performed. However, it is important to distinguish
+between numerical and categorical variables for choosing an appropriate parameter
+stability test. The variable \code{chas} is a dummy indicator variable (for tract bounds
+with the Charles River) and should thus be turned into a factor. Furthermore, the variable
+\code{rad} is an index of accessibility to radial highways and takes only 9 distinct
+values. Thus, it is most appropriately treated as an ordered factor.
+
+<<>>=
+BostonHousing$chas <- factor(BostonHousing$chas, levels = 0:1, labels = c("no", "yes"))
+BostonHousing$rad <- factor(BostonHousing$rad, ordered = TRUE)
+@
+
+Both transformations only affect the parameter stability test chosen (step~2), not the splitting
+procedure (step~3).
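+
+To make the ingredients from Section~\ref{sec:algorithm} concrete, the following
+sketch (not evaluated here) queries the building blocks that \code{mob()} relies
+on for a \code{linearModel} fitted via the \pkg{modeltools} functions
+\code{ModelEnvFormula()} and \code{fit()}:
+
+<<statmodel-sketch, eval = FALSE>>=
+## Sketch (not run): the generics a "StatModel" has to provide for mob().
+me <- ModelEnvFormula(medv ~ lstat + rm, data = BostonHousing)
+fm <- fit(linearModel, me)  ## fit the model once (step 1)
+head(estfun(fm))            ## scores entering the fluctuation tests (step 2)
+head(weights(fm))           ## case weights: > 0 in the node, 0 outside
+deviance(fm)                ## default objective function Psi (step 3)
+@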
+
+The model is estimated by OLS, the instability is assessed using a Bonferroni-corrected
+significance level of $\alpha = 0.05$ and the nodes are split with a required minimal
+segment size of $40$ observations. The control parameters are thus set to
+
+<<>>=
+ctrl <- mob_control(alpha = 0.05, bonferroni = TRUE, minsplit = 40,
+  objfun = deviance, verbose = TRUE)
+@
+
+Actually, all of these settings are the defaults except \code{minsplit = 40} and
+\code{verbose = TRUE}, which causes some information about the fitting process
+to be written to the screen. In this case, the objective function \code{deviance()}
+extracts the residual sum of squares from a fitted \code{linearModel} object.
+
+Having collected all building blocks, we can now call the function \code{mob()}
+that takes the model specification of the linear regression model \verb:medv ~ lstat + rm:
+plus all partitioning variables, along with the \code{data} set, the \code{control}
+settings and the \code{model} to be used.
+
+<<>>=
+fmBH <- mob(medv ~ lstat + rm | zn + indus + chas + nox + age + dis + rad + tax + crim + b + ptratio,
+  data = BostonHousing, control = ctrl, model = linearModel)
+@
+
+The result is the fitted model \code{fmBH} of class ``\code{mob}'' that contains
+the tree with a fitted linear regression associated with every node. Printing
+this object will show the splits, their $p$~values and call the \code{print()} method
+for the model in each terminal node (i.e., this simply relies on a \code{print()}
+method being available for the fitted model and re-uses it).
+
+<<>>=
+fmBH
+@
+
+Looking at the printed output is typically rather tedious; a visualization via
+the \code{plot()} method
+
+<<eval=FALSE>>=
+plot(fmBH)
+@
+
+is much easier to interpret. By default, this produces partial scatter plots of the
+variable $y$ against each of the regressors $x_i$ in the terminal nodes. Each scatter
+plot also shows the fitted values, i.e., a projection of the fitted hyperplane.
+
+\setkeys{Gin}{width=\textwidth}
+\begin{figure}[p]
+\begin{center}
+<<BostonHousing-plot,echo=FALSE,results=hide,fig=TRUE,height=8,width=12,include=FALSE>>=
+plot(fmBH)
+@
+\includegraphics[width=18cm,keepaspectratio,angle=90]{MOB-BostonHousing-plot}
+\caption{\label{fig:BostonHousing} Linear-regression-based tree for the Boston housing data.
+The plots in the leaves give partial scatter plots for \code{rm} (upper panel) and
+\code{lstat} (lower panel).}
+\end{center}
+\end{figure}
+
+From this visualization, it can be seen that in nodes~4, 6, 7 and 8 the increase of
+value with the number of rooms dominates the picture (upper panel) whereas in node~9 the
+decrease with the lower status population percentage (lower panel) is more pronounced.
+Splits are performed in the variables \code{tax} (property-tax rate) and
+\code{ptratio} (pupil-teacher ratio).
+
+Various quantities of interest can be computed, provided that the \code{model} used
+provides the corresponding methods, e.g., \code{predict()}, \code{residuals()}, \code{logLik()},
+\code{coef()} and \code{summary()}. The latter two by default try to extract information
+for the terminal nodes, but a \code{node} argument can be set to the node IDs of interest.
+As an example, the regression coefficients for the terminal node models can be easily
+extracted by
+
+<<>>=
+coef(fmBH)
+@
+
+reflecting the differences of the models that can also be seen in the associated
+\code{plot()}.
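+
+Similarly, predictions for (new) observations are available via the
+\code{predict()} method; a brief illustration (not evaluated here) might look
+as follows:
+
+<<predict-sketch, eval = FALSE>>=
+## Sketch (not run): fitted values for the first few observations, each
+## computed from the linear model in the corresponding terminal node.
+predict(fmBH, newdata = BostonHousing[1:5, ])
+@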
+
+Even more information is available in a \code{summary()}, e.g., for node 7:
+
+<<>>=
+summary(fmBH, node = 7)
+@
+
+The test statistics and $p$~values computed in each node can be extracted analogously
+by using the method for the function \code{sctest()} (for performing \underline{s}tructural
+\underline{c}hange \underline{test}s).
+
+<<>>=
+sctest(fmBH, node = 7)
+@
+
+For summarizing the quality of the fit, we could compute the mean squared error, log-likelihood
+or AIC:
+
+<<>>=
+mean(residuals(fmBH)^2)
+logLik(fmBH)
+AIC(fmBH)
+@
+
+<<echo=FALSE>>=
+nt <- NROW(coef(fmBH))
+nk <- NCOL(coef(fmBH))
+@
+
+As the \code{logLik()} method simply re-uses the method for \code{linearModel} objects,
+it reports not $\Sexpr{(nk+1)*nt-1}$ estimated parameters ($\Sexpr{nk}$ parameters in
+each of the $\Sexpr{nt}$ terminal nodes plus $\Sexpr{nt} - 1$ split points) but
+$\Sexpr{(nk+2)*nt-1}$ parameters, because each terminal node is additionally associated with a
+variance estimate. However, for the fitting process, the variance was treated as a nuisance
+parameter as we employed OLS estimation (rather than fully specified ML estimation).
+
+
+\subsection{Pima Indians diabetes data}
+
+Another popular benchmark data set for binary classification is the Pima Indians
+diabetes database which is also available from \pkg{mlbench}:
+
+<<>>=
+data("PimaIndiansDiabetes2", package = "mlbench")
+PimaIndiansDiabetes <- na.omit(PimaIndiansDiabetes2[,-c(4, 5)])
+@
+
+After omitting missing values (and the variables \verb:triceps: and \verb:insulin:, which
+are missing for most women), the data set provides diabetes test results for
+$n = \Sexpr{NROW(PimaIndiansDiabetes)}$ women
+along with $\Sexpr{NCOL(PimaIndiansDiabetes) - 1}$ covariates including in particular
+the plasma glucose concentration \code{glucose} as an important predictor for diabetes.
+Fitting a logistic regression model \verb:diabetes ~ glucose: is straightforward,
+whereas the influence of the remaining variables should again be learned by recursive
+partitioning. This will yield a model tree with $k = 2$ regression coefficients in each terminal
+node, partitioned with respect to the $\ell = 5$ remaining variables.
+
+The model is estimated by ML employing the \code{glinearModel},
+the instability is assessed using a Bonferroni-corrected
+significance level of $\alpha = 0.05$ and the nodes are split with a required minimal
+segment size of $20$ observations. Hence, all control parameters correspond to the
+default values in \verb:mob_control(): and do not have to be set explicitly in
+the \code{mob()} call:
+
+<<>>=
+fmPID <- mob(diabetes ~ glucose | pregnant + pressure + mass + pedigree + age,
+  data = PimaIndiansDiabetes, model = glinearModel, family = binomial())
+@
+
+To visualize this, we again simply call:
+
+<<eval=FALSE>>=
+plot(fmPID)
+@
+
+\setkeys{Gin}{width=\textwidth}
+\begin{figure}[bth]
+\begin{center}
+<<PimaIndiansDiabetes-plot,echo=FALSE,results=hide,fig=TRUE,height=6,width=9>>=
+plot(fmPID)
+@
+\caption{\label{fig:PimaIndiansDiabetes} Logistic-regression-based tree for the Pima Indians
+diabetes data. The plots in the leaves give spinograms for \code{diabetes} versus
+\code{glucose}.}
+\end{center}
+\end{figure}
+
+which again produces a plot of the dependent variable $y$ against the only regressor
+$x$ in the terminal nodes. As $y$ is \code{diabetes}, a binary variable, and $x$ is
+\code{glucose}, a numeric variable, a spinogram is chosen for visualization.
+The breaks
+in the spinogram are the five-point summary of \code{glucose} on the full data set. The
+fitted lines are the mean predicted probabilities in each group.
+
+The model tree distinguishes three different groups:
+\begin{itemize}
+  \item[\#2] Women with a low body mass index who have on average a low risk of
+    diabetes which, however, increases clearly with the glucose level.
+  \item[\#4] Women with an average or high body mass index, younger than 30 years,
+    who have a higher average risk that also increases with the glucose level.
+  \item[\#5] Women with an average or high body mass index, older than 30 years,
+    who have a high average risk that increases only slowly with the glucose level.
+\end{itemize}
+
+The same interpretation can also be drawn from the coefficient estimates
+and the corresponding odds ratios (with respect to glucose):
+
+<<>>=
+coef(fmPID)
+exp(coef(fmPID)[,2])
+@
+
+<<echo=FALSE>>=
+risk <- round(100 * (exp(coef(fmPID)[,2])-1), digits = 1)
+@
+
+i.e., the odds increase by \Sexpr{risk[1]}\%, \Sexpr{risk[2]}\% and \Sexpr{risk[3]}\%
+per unit increase of glucose in the three groups.
+
+
+
+\section{Conclusion}
+\label{sec:conclusion}
+
+The function \code{mob()} in the \pkg{party} package provides a flexible and object-oriented
+implementation of the general algorithm for model-based recursive partitioning.
+Models of class ``\code{StatModel}''---that employ a formula interface and are equipped with
+methods for the generic functions \code{reweight()}, \code{weights()} and \code{estfun()}, plus
+some function for extracting the value of the objective function---can be easily partitioned.
+The resulting ``\code{mob}'' tree can be flexibly summarized, both numerically and graphically,
+and used for predictions on new data.
+
+
+\bibliography{partyrefs}
+
+\end{document}