\documentclass[nojss]{jss}

%\VignetteIndexEntry{party with the mob}
%\VignetteDepends{mlbench,colorspace}
%\VignetteKeywords{parametric models, object-orientation, recursive partitioning}
%\VignettePackage{party}

%% packages
\usepackage{amsmath}
\SweaveOpts{engine=R}
%% need no \usepackage{Sweave}

<<setup, echo = FALSE, results = hide>>=
require("party")
options(useFancyQuotes = FALSE)
@
|
|
%% commands
\newcommand{\ui}{\underline{i}}
\newcommand{\oi}{\overline{\imath}}
\newcommand{\argmin}{\operatorname{argmin}\displaylimits}

%% author/title
\author{Achim Zeileis\\Universit\"at Innsbruck \And
Torsten Hothorn\\Ludwig-Maximilians-\\Universit\"at M\"unchen \And
Kurt Hornik\\WU Wirtschafts-\\universit\"at Wien}
\Plainauthor{Achim Zeileis, Torsten Hothorn, Kurt Hornik}

\title{\pkg{party} with the \code{mob}: Model-Based Recursive Partitioning in \proglang{R}}
\Plaintitle{party with the mob: Model-Based Recursive Partitioning in R}
\Shorttitle{\pkg{party} with the \texttt{mob}}

\Keywords{parametric models, object-orientation, recursive partitioning}

%% abstract
\Abstract{
The \pkg{party} package \citep{Hothorn+Hornik+Zeileis:2006a} provides
the function \code{mob()} implementing a recently suggested algorithm
for \underline{mo}del-\underline{b}ased recursive partitioning
\citep{Zeileis+Hothorn+Hornik:2005}. The basic steps are: (1)~fit a
parametric model to a data set, (2)~test for parameter instability over
a set of partitioning variables, (3)~if there is some overall parameter
instability, split the model with respect to the variable associated with
the highest instability, (4)~repeat the procedure in each of the child nodes.
It is discussed how these steps of the conceptual algorithm are translated
into computational tools in an object-oriented manner, allowing the user to
plug in various types of parametric models. The outcome is a tree where
each node is associated with a fitted parametric model that can be
effectively visualized and summarized.
}
|
|
\Address{
Achim Zeileis\\
Department of Statistics\\
Faculty of Economics and Statistics\\
Universit\"at Innsbruck\\
Universit\"atsstr.~15\\
6020 Innsbruck, Austria\\
E-mail: \email{Achim.Zeileis@R-project.org}\\
URL: \url{http://eeecon.uibk.ac.at/~zeileis/}\\

Torsten Hothorn\\
Institut f\"ur Statistik\\
Ludwig-Maximilians-Universit\"at M\"unchen\\
Ludwigstra{\ss}e 33\\
80539 M\"unchen, Germany\\
E-mail: \email{Torsten.Hothorn@R-project.org}\\
URL: \url{http://www.stat.uni-muenchen.de/~hothorn/}\\

Kurt Hornik\\
Institute for Statistics and Mathematics\\
WU Wirtschaftsuniversit\"at Wien\\
Augasse 2--6\\
1090 Wien, Austria\\
E-mail: \email{Kurt.Hornik@R-project.org}\\
URL: \url{http://statmath.wu.ac.at/~hornik/}\\
}

\begin{document}
|
|
\section{Motivation}
\label{sec:motivation}

Consider a parametric model $\mathcal{M}(Y, \theta)$
with (possibly vector-valued) observations $Y$ and a
$k$-dimensional vector of parameters $\theta$. This model
could be a (possibly multivariate) normal distribution for $Y$, or some
kind of regression model when $Y = (y, x)$ can be split up into a dependent variable
$y$ and regressors $x$. An example of the latter would be a linear regression
model $y = x^\top \theta$, a generalized linear model (GLM), or a survival
regression.

Given $n$ observations $Y_i$ ($i = 1, \dots, n$), the model can be fitted
by minimizing some objective function $\Psi(Y, \theta)$, e.g., a residual sum of squares
or a negative log-likelihood, leading to ordinary least squares (OLS) or maximum
likelihood (ML) estimation, respectively.

If a global model for all $n$ observations does not fit well and further
covariates $Z_1, \dots, Z_\ell$ are available, it might be possible to partition
the $n$ observations with respect to these variables and find a well-fitting model
in each cell of the partition. The algorithm described here tries to find
such a partition adaptively using a greedy forward search. This procedure is
implemented in the function \code{mob()} and described in more detail in the following
section. However, we will first state some goals and design principles.
|
|
To translate the model-based partitioning problem into \proglang{R}, we start with
a formula description of the variables involved. This formula should be of type
\verb:y ~ x1 + ... + xk | z1 + ... + zl: where the variables on the
left of the \code{|} specify the data $Y$ and the variables on the right specify the
partitioning variables $Z_j$. Classical regression trees usually have a univariate
response $Y$ and various partitioning variables, i.e., they could be specified as
\verb:y ~ 1 | z1 + ... + zl:. Structural change models, on the other hand, are usually
regression models that are segmented with respect to a single partitioning variable,
typically time: \verb:y ~ x1 + ... + xk | z:.
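
Both extremes can be expressed directly in a \code{mob()} call. The following
sketch is purely illustrative: the data set \code{mydata} and all variable
names are made up for the sake of the example.

<<formula-sketch, eval = FALSE>>=
## classical regression tree: intercept-only model, several partitioning variables
mob(y ~ 1 | z1 + z2 + z3, data = mydata, model = linearModel)
## structural change model: full regression, segmented along a single variable
mob(y ~ x1 + x2 | time, data = mydata, model = linearModel)
@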
|
|
The type of models $\mathcal{M}$ to be used with \code{mob()} should not be
confined (by the implementation); hence we have written an object-oriented
implementation. The idea is that $\mathcal{M}$ is translated into software
by a model of class ``\code{StatModel}'' as provided by the \pkg{modeltools}
package. The algorithm then relies on various methods being available for these
models. The ``\code{StatModel}'' objects \code{linearModel} and \code{glinearModel},
implementing (generalized) linear regression models, are readily available in
\pkg{modeltools}; others can easily be user-defined.
|
|
\section{The model-based recursive partitioning algorithm}
\label{sec:algorithm}

The basic idea is to grow a tree in which every node is associated with a model
of type $\mathcal{M}$. To assess whether splitting of the node is necessary, a
fluctuation test for parameter instability is performed. If there is significant instability
with respect to any of the partitioning variables $Z_j$, the node is split
into $B$ locally optimal segments (currently only $B = 2$ is implemented)
and the procedure is repeated in each of the $B$ children.
If no more significant instabilities can be found, the recursion stops.
More precisely, the steps of the algorithm are:

\begin{enumerate}
\item Fit the model once to all observations in the current node.
\item Assess whether the parameter estimates are stable with respect to
every partitioning variable $Z_1, \dots, Z_\ell$. If there is some overall instability,
select the variable $Z_j$ associated with the highest parameter instability; otherwise
stop.
\item Compute the split point(s) that locally optimize the objective function $\Psi$.
\item Split the node into child nodes and repeat the procedure.
\end{enumerate}

The details for steps 1--3 are specified in the following.
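
One recursion step can be sketched in stylized \proglang{R} pseudo-code. This is
illustrative only, not the actual implementation: \code{instabilityTest()},
\code{bestSplit()} and \code{terminalNode()} are hypothetical helpers, whereas
\code{reweight()} is the real generic discussed below.

<<mob-pseudo, eval = FALSE>>=
growNode <- function(modelobj, partvars, weights, alpha = 0.05) {
  pvals <- sapply(partvars, function(z) instabilityTest(modelobj, z)) ## step 2
  if (min(pvals) >= alpha) return(terminalNode(modelobj))             ## stop
  zstar <- partvars[[which.min(pvals)]]
  kidweights <- bestSplit(modelobj, zstar, weights)                   ## step 3
  lapply(kidweights, function(w)                                      ## step 4
    growNode(reweight(modelobj, w), partvars, w, alpha))              ## step 1
}
@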
|
|
\subsection{Parameter estimation}

This step of the algorithm is common practice; the only additional
requirement is (as previously noted) that the model has to be of class
``\code{StatModel}'' as provided by \pkg{modeltools}. Looking at the source
code for the \code{linearModel} provided by this package illustrates how
a simple wrapper to existing \proglang{R} functionality can be written.
In particular, a method for the generic function \code{reweight()} has to
be available. The reason is that it is inefficient to fit a brand-new model
\code{modelobj} (including formula parsing) in every node---much computation time
is saved if simply \code{reweight(modelobj, weights)} is called in each of the
child nodes. The \code{weights} argument controls which observations go into which
of the child nodes.
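
For illustration, deriving the model for one child node then amounts to a single
\code{reweight()} call. In this sketch, \code{modelobj}, \code{weights}, the
splitting variable \code{z} and the split point \code{s} are assumed to be given:

<<reweight-sketch, eval = FALSE>>=
## observations entering the left child keep their weight, all others get 0
leftweights <- weights * as.numeric(z <= s)
leftmodel <- reweight(modelobj, leftweights) ## refit without formula parsing
coef(leftmodel)
@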
|
|
\subsection{Testing for parameter instability}

The task in this step of the algorithm is to find out whether the parameters
of the fitted model are stable over each particular ordering implied by
the partitioning variables $Z_j$ or whether splitting the sample with respect
to one of the $Z_j$ might capture instabilities in the parameters and thus improve the fit.
The tests used in this step belong to the class of generalized M-fluctuation
tests \citep{ZeileisHornik2003,Zeileis2005}. For numerical partitioning variables
$Z_j$, the $\sup LM$~statistic is used, which is the maximum over all single-split
$LM$~statistics. For categorical partitioning variables, a $\chi^2$~statistic is
employed which captures the fluctuation within each of the categories of $Z_j$.

For computing the test statistics and corresponding $p$~values $p_j$ for
each of the partitioning variables $Z_j$ in \proglang{R}, the only requirement
is that there are methods for the extractor functions \code{estfun()} and
\code{weights()}. The \code{estfun()} method extracts the empirical estimating
functions (model scores) from the fitted \code{modelobj}; these are the
main ingredient for M-fluctuation tests. The \code{weights()} method is used
to determine which observations are in the current node (i.e., have a weight
greater than zero) and which are not (i.e., have zero weight).
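
The ingredients for the fluctuation tests can thus be obtained from any suitably
fitted model; a sketch, assuming a fitted \code{modelobj} is available:

<<estfun-sketch, eval = FALSE>>=
psi <- estfun(modelobj)         ## n x k matrix of empirical estimating functions
inNode <- weights(modelobj) > 0 ## observations belonging to the current node
colSums(psi[inNode, ])          ## scores sum to (roughly) zero at the estimate
@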
|
|
To determine whether there is some overall instability, it is checked whether
the minimal $p$~value $p_{j^*} = \min_{j = 1, \dots, \ell} p_j$ falls below a
pre-specified significance level $\alpha$ (by default $\alpha = 0.05$) or not.
To adjust for multiple testing, the $p$~values can be Bonferroni-adjusted
(which is the default). If there is significant instability, the variable $Z_{j^*}$
associated with the minimal $p$~value is used for splitting the node.
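
As a small numerical illustration of this decision rule, with made-up $p$~values
for $\ell = 3$ partitioning variables:

<<pval-sketch, eval = FALSE>>=
p <- c(0.002, 0.210, 0.048)    ## hypothetical unadjusted p values
padj <- pmin(length(p) * p, 1) ## Bonferroni adjustment
min(padj) < 0.05               ## overall instability?
which.min(padj)                ## index j* of the splitting variable
@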
|
|
\subsection{Splitting}

In this step, the observations in the current node are split with respect
to the chosen partitioning variable $Z_{j^*}$ into $B$ child nodes. Currently,
the infrastructure in \pkg{party} only supports binary splits, i.e., $B = 2$.
For determining the split point, an exhaustive search procedure is adopted:
for each conceivable split point, the $B$ child node models are fitted and the split
associated with the minimal value of the objective function $\Psi$ is chosen.

Computationally, this means that the fitted model \code{modelobj} is \code{reweight()}ed
for each of the child nodes. The observations entering a child node keep their
current weight while those observations that go into different child nodes receive
zero weight. To compare the objective function $\Psi$, an extractor function
is required to compute it from the fitted \code{modelobj}. This extractor function
can be user-specified and set in \verb:mob_control():; it defaults to \code{deviance()}.
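
Conceptually, the exhaustive search for a binary split in a numerical variable
can be sketched as follows, where \code{zstar}, \code{weights} and
\code{modelobj} are assumed to be given (the actual implementation additionally
respects the minimal segment size):

<<split-search, eval = FALSE>>=
candidates <- sort(unique(zstar))
obj <- sapply(candidates, function(s) {
  wleft  <- weights * as.numeric(zstar <= s)
  wright <- weights * as.numeric(zstar >  s)
  deviance(reweight(modelobj, wleft)) + deviance(reweight(modelobj, wright))
})
splitpoint <- candidates[which.min(obj)]
@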
|
|
This concludes one iteration of the recursive partitioning algorithm and steps~1--3
are carried out again in each of the $B$ child nodes until no significant
instability is detected in step~2.
|
|
\section{Illustrations}
\label{sec:illustration}

\subsection{Boston housing data}

Since the analysis by \cite{BreimanFriedman1985}, the Boston housing data have been
a popular and well-investigated empirical basis for illustrating non-linear
regression methods both in machine learning and statistics
\citep[see][for two recent examples]{Gama2004,Samarovetal2005}, and we follow
these examples by segmenting a bivariate linear regression model for the house
values.

Thus, the model $\mathcal{M}$ used is \code{linearModel} from the \pkg{modeltools}
package, which is automatically loaded together with \pkg{party}.
|
|
<<>>=
library("party")
@

The data set is available in package \pkg{mlbench} via

<<>>=
data("BostonHousing", package = "mlbench")
@

and provides $n = \Sexpr{NROW(BostonHousing)}$ observations of the median value of owner-occupied
homes in Boston (in USD~1000) along with $\Sexpr{NCOL(BostonHousing)}$ covariates, including in particular
the number of rooms per dwelling (\code{rm}) and the percentage
of lower-status population (\code{lstat}). A segment-wise linear relationship between
the value and these two variables is very intuitive, whereas the shape of the influence
of the remaining covariates is rather unclear and hence should be learned from the data.
Therefore, a linear regression model for median value explained by \verb:rm^2:
and \verb:log(lstat): with $k = 3$ regression coefficients is employed and
partitioned with respect to all $\ell = 11$ remaining variables. To facilitate subsequent
commands, the transformations are explicitly stored in \code{BostonHousing}:

<<>>=
BostonHousing$lstat <- log(BostonHousing$lstat)
BostonHousing$rm <- BostonHousing$rm^2
@
|
|
Choosing appropriate
transformations of the dependent variable and the regressors that enter the linear
regression model is important to obtain a well-fitting model in each segment, and
we follow in our choice the recommendations of \cite{BreimanFriedman1985}. Monotone
transformations of the partitioning variables do not affect the recursive partitioning
algorithm and hence do not have to be performed. However, it is important to distinguish
between numerical and categorical variables for choosing an appropriate parameter
stability test. The variable \code{chas} is a dummy indicator variable (for tract bounds
with the Charles River) and should thus be turned into a factor. Furthermore, the variable
\code{rad} is an index of accessibility to radial highways and takes only 9 distinct
values. Thus, it is most appropriately treated as an ordered factor.

<<>>=
BostonHousing$chas <- factor(BostonHousing$chas, levels = 0:1, labels = c("no", "yes"))
BostonHousing$rad <- factor(BostonHousing$rad, ordered = TRUE)
@
|
|
Both transformations only affect the parameter stability test chosen (step~2), not the splitting
procedure (step~3).

The model is estimated
by OLS, the instability is assessed using a Bonferroni-corrected
significance level of $\alpha = 0.05$, and the nodes are split with a required minimal
segment size of $40$ observations. The control parameters are thus set to

<<>>=
ctrl <- mob_control(alpha = 0.05, bonferroni = TRUE, minsplit = 40,
  objfun = deviance, verbose = TRUE)
@
|
|
Actually, all of these settings are the defaults except \code{minsplit = 40} and
\code{verbose = TRUE}, which causes some information about the fitting process
to be written to the screen. The objective function \code{deviance()} extracts in
this case the residual sum of squares from a fitted \code{linearModel} object.

Having collected all building blocks, we can now call the function \code{mob()}
that takes the model specification of the linear regression model \verb:medv ~ lstat + rm:
plus all partitioning variables, along with the \code{data} set, the \code{control}
settings and the \code{model} to be used.

<<>>=
fmBH <- mob(medv ~ lstat + rm | zn + indus + chas + nox + age + dis + rad + tax + crim + b + ptratio,
  data = BostonHousing, control = ctrl, model = linearModel)
@
|
|
The result is the fitted model \code{fmBH} of class ``\code{mob}'' that contains
the tree with a fitted linear regression associated with every node. Printing
this object will show the splits, their $p$~values, and call the \code{print()} method
for the model in each terminal node (i.e., this simply relies on a \code{print()}
method being available for the fitted model and re-uses it).

<<>>=
fmBH
@
|
|
Looking at the printed output is typically rather tedious; a visualization via
the \code{plot()} method

<<eval=FALSE>>=
plot(fmBH)
@

is much easier to interpret. By default, this produces partial scatter plots of the
variable $y$ against each of the regressors $x_i$ in the terminal nodes. Each scatter
plot also shows the fitted values, i.e., a projection of the fitted hyperplane.
|
|
\setkeys{Gin}{width=\textwidth}
\begin{figure}[p]
\begin{center}
<<BostonHousing-plot,echo=FALSE,results=hide,fig=TRUE,height=8,width=12,include=FALSE>>=
plot(fmBH)
@
\includegraphics[width=18cm,keepaspectratio,angle=90]{MOB-BostonHousing-plot}
\caption{\label{fig:BostonHousing} Linear-regression-based tree for the Boston housing data.
The plots in the leaves give partial scatter plots for \code{rm} (upper panel) and
\code{lstat} (lower panel).}
\end{center}
\end{figure}
|
|
From this visualization, it can be seen that in nodes~4, 6, 7 and 8 the increase of
value with the number of rooms dominates the picture (upper panel), whereas in node~9 the
decrease with the lower-status population percentage (lower panel) is more pronounced.
Splits are performed in the variables \code{tax} (property-tax rate) and
\code{ptratio} (pupil-teacher ratio).
|
|
Various quantities of interest can be computed, provided that the \code{model} used
provides the corresponding methods, e.g., \code{predict()}, \code{residuals()}, \code{logLik()},
\code{coef()} and \code{summary()}. The latter two by default try to extract information
for the terminal nodes, but a \code{node} argument can be set to the node IDs of interest.
As an example, the regression coefficients for the terminal node models can be easily
extracted by

<<>>=
coef(fmBH)
@
|
|
reflecting the differences of the models that can also be seen in the associated
\code{plot()}. Even more information is available in a \code{summary()}, e.g., for node~7:

<<>>=
summary(fmBH, node = 7)
@
|
|
The test statistics and $p$~values computed in each node can be extracted analogously
by using the method for the function \code{sctest()} (for performing \underline{s}tructural
\underline{c}hange \underline{test}s).

<<>>=
sctest(fmBH, node = 7)
@
|
|
For summarizing the quality of the fit, we could compute the mean squared error, log-likelihood
or AIC:

<<>>=
mean(residuals(fmBH)^2)
logLik(fmBH)
AIC(fmBH)
@

<<echo=FALSE>>=
nt <- NROW(coef(fmBH))
nk <- NCOL(coef(fmBH))
@
|
|
As the \code{logLik()} method simply re-uses the method for \code{linearModel} objects,
it does not report $\Sexpr{(nk+1)*nt-1}$ estimated parameters ($\Sexpr{nk}$ parameters in
each of the $\Sexpr{nt}$ terminal nodes plus $\Sexpr{nt} - 1$ split points) but
$\Sexpr{(nk+2)*nt-1}$ parameters, because each terminal node is additionally associated with a
variance estimate. However, for the fitting process, the variance was treated as a nuisance
parameter as we employed OLS estimation (rather than fully specified ML estimation).
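
This count can be reproduced by hand: for a tree with $m$ terminal nodes and $k$
regression coefficients per node, \code{logLik()} reports $m(k + 1) + (m - 1)$
parameters, i.e., the coefficients and one variance per node plus $m - 1$ split
points. A small sketch using the objects from above:

<<param-count, eval = FALSE>>=
m <- NROW(coef(fmBH)) ## number of terminal nodes
k <- NCOL(coef(fmBH)) ## regression coefficients per node
m * (k + 1) + (m - 1) ## parameters reported by logLik(fmBH)
@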
|
|
\subsection{Pima Indians diabetes data}

Another popular benchmark data set for binary classification is the Pima Indians
diabetes database, which is also available from \pkg{mlbench}:

<<>>=
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes <- na.omit(PimaIndiansDiabetes2[,-c(4, 5)])
@

After omitting missing values (and the variables \verb:triceps: and \verb:insulin:, which
are missing for most women), the data set provides diabetes test results for
$n = \Sexpr{NROW(PimaIndiansDiabetes)}$ women
along with $\Sexpr{NCOL(PimaIndiansDiabetes)}$ covariates, including in particular
the plasma glucose concentration \code{glucose} as an important predictor for diabetes.
Fitting a logistic regression model \verb:diabetes ~ glucose: seems to be straightforward,
whereas the influence of the remaining variables should again be learned by recursive
partitioning. This will yield a model tree with $k = 2$ regression coefficients in each terminal
node, partitioned with respect to the $\ell = 5$ remaining variables.
|
|
The model is estimated by ML employing the \code{glinearModel},
the instability is assessed using a Bonferroni-corrected
significance level of $\alpha = 0.05$, and the nodes are split with a required minimal
segment size of $20$ observations. Hence, all control parameters correspond to the
default values in \verb:mob_control(): and do not have to be set explicitly in
the \code{mob()} call:

<<>>=
fmPID <- mob(diabetes ~ glucose | pregnant + pressure + mass + pedigree + age,
  data = PimaIndiansDiabetes, model = glinearModel, family = binomial())
@
|
|
To visualize this, we simply call the \code{plot()} method again:

<<eval=FALSE>>=
plot(fmPID)
@
|
|
\setkeys{Gin}{width=\textwidth}
\begin{figure}[bth]
\begin{center}
<<PimaIndiansDiabetes-plot,echo=FALSE,results=hide,fig=TRUE,height=6,width=9>>=
plot(fmPID)
@
\caption{\label{fig:PimaIndiansDiabetes} Logistic-regression-based tree for the Pima Indians
diabetes data. The plots in the leaves give spinograms for \code{diabetes} versus
\code{glucose}.}
\end{center}
\end{figure}
|
|
This produces again a plot of the dependent variable $y$ against the single regressor
$x$ in the terminal nodes. As $y$ is \code{diabetes}, a binary variable, and $x$ is
\code{glucose}, a numeric variable, a spinogram is chosen for visualization. The breaks
in the spinogram are the five-point summary of \code{glucose} on the full data set. The
fitted lines are the mean predicted probabilities in each group.
|
|
The model tree distinguishes three different groups:
\begin{itemize}
\item[\#2] Women with a low body mass index, who have on average a low risk of
diabetes which, however, increases clearly with the glucose level.
\item[\#4] Women with an average or high body mass index, younger than 30 years,
who have a higher average risk that also increases with the glucose level.
\item[\#5] Women with an average or high body mass index, older than 30 years,
who have a high average risk that increases only slowly with the glucose level.
\end{itemize}
|
|
The same interpretation can also be drawn from the coefficient estimates
and the corresponding odds ratios (with respect to glucose):

<<>>=
coef(fmPID)
exp(coef(fmPID)[,2])
@

<<echo=FALSE>>=
risk <- round(100 * (exp(coef(fmPID)[,2])-1), digits = 1)
@

i.e., the odds increase by \Sexpr{risk[1]}\%, \Sexpr{risk[2]}\% and \Sexpr{risk[3]}\%
with respect to glucose in the three groups.
|
|
471 |
|
|
|
472 |
|
|
|
473 |
|
|
|
474 |
\section{Conclusion} |
|
|
475 |
\label{sec:conclusion} |
|
|
476 |
|
|
|
477 |
The function \code{mob()} in the \pkg{party} package provides a flexible and object-oriented |
|
|
478 |
implementation of the general algorithm for model-based recursive partitioning. |
|
|
479 |
Models of class ``\code{StatModel}''---that employ a formula interface and are equipped with |
|
|
480 |
methods for the generic functions \code{reweight()}, \code{weights()}, \code{estfun()} plus |
|
|
481 |
some function for extracting the value of the objective function---can be easily partitioned. |
|
|
482 |
The resulting ``\code{mob}'' tree can be flexibly summarized, both numerically and graphically, |
|
|
483 |
and used for predictions on new data. |
|
|
484 |
|
|
|
485 |
|
|
|
486 |
\bibliography{partyrefs} |
|
|
487 |
|
|
|
488 |
\end{document} |