\documentclass[nojss]{jss}

%\VignetteIndexEntry{party with the mob}
%\VignetteDepends{mlbench,colorspace}
%\VignetteKeywords{parametric models, object-orientation, recursive partitioning}
%\VignettePackage{party}

%% packages
\usepackage{amsmath}
\SweaveOpts{engine=R}
%% need no \usepackage{Sweave}

<<setup, echo = FALSE, results = hide>>=
require("party")
options(useFancyQuotes = FALSE)
@

%% commands
\newcommand{\ui}{\underline{i}}
\newcommand{\oi}{\overline{\imath}}
\newcommand{\argmin}{\operatorname{argmin}\displaylimits}

%% author/title
\author{Achim Zeileis\\Universit\"at Innsbruck \And
        Torsten Hothorn\\Ludwig-Maximilians-\\Universit\"at M\"unchen \And
        Kurt Hornik\\WU Wirtschafts-\\universit\"at Wien}
\Plainauthor{Achim Zeileis, Torsten Hothorn, Kurt Hornik}

\title{\pkg{party} with the \code{mob}: Model-Based Recursive Partitioning in \proglang{R}}
\Plaintitle{party with the mob: Model-Based Recursive Partitioning in R}
\Shorttitle{\pkg{party} with the \texttt{mob}}

\Keywords{parametric models, object-orientation, recursive partitioning}

%% abstract
\Abstract{
  The \pkg{party} package \citep{Hothorn+Hornik+Zeileis:2006a} provides
  the function \code{mob()} implementing a recently suggested algorithm
  for \underline{mo}del-\underline{b}ased recursive partitioning
  \citep{Zeileis+Hothorn+Hornik:2005}. The basic steps are: (1)~fit a
  parametric model to a data set, (2)~test for parameter instability over
  a set of partitioning variables, (3)~if there is some overall parameter
  instability, split the model with respect to the variable associated with
  the highest instability, (4)~repeat the procedure in each of the child nodes.
  It is discussed how these steps of the conceptual algorithm are translated
  into computational tools in an object-oriented manner, allowing the user to
  plug in various types of parametric models. The outcome is a tree where
  each node is associated with a fitted parametric model that can be
  effectively visualized and summarized.
}

\Address{
  Achim Zeileis\\
  Department of Statistics\\
  Faculty of Economics and Statistics\\
  Universit\"at Innsbruck\\
  Universit\"atsstr.~15\\
  6020 Innsbruck, Austria\\
  E-mail: \email{Achim.Zeileis@R-project.org}\\
  URL: \url{http://eeecon.uibk.ac.at/~zeileis/}\\

  Torsten Hothorn\\
  Institut f\"ur Statistik\\
  Ludwig-Maximilians-Universit\"at M\"unchen\\
  Ludwigstra{\ss}e 33\\
  80539 M\"unchen, Germany\\
  E-mail: \email{Torsten.Hothorn@R-project.org}\\
  URL: \url{http://www.stat.uni-muenchen.de/~hothorn/}\\

  Kurt Hornik\\
  Institute for Statistics and Mathematics\\
  WU Wirtschaftsuniversit\"at Wien\\
  Augasse 2--6\\
  1090 Wien, Austria\\
  E-mail: \email{Kurt.Hornik@R-project.org}\\
  URL: \url{http://statmath.wu.ac.at/~hornik/}\\

}

\begin{document}

\section{Motivation}
\label{sec:motivation}

Consider a parametric model $\mathcal{M}(Y, \theta)$
with (possibly vector-valued) observations $Y$ and a
$k$-dimensional vector of parameters $\theta$. This model
could be a (possibly multivariate) normal distribution for $Y$, or some
kind of regression model when $Y = (y, x)$ can be split up into a dependent variable
$y$ and regressors $x$. Examples of the latter are a linear regression
model $y = x^\top \theta$, a generalized linear model (GLM), or a survival
regression.

Given $n$ observations $Y_i$ ($i = 1, \dots, n$), the model can be fitted
by minimizing some objective function $\Psi(Y, \theta)$, e.g., a residual sum of squares
or a negative log-likelihood, leading to ordinary least squares (OLS) or maximum
likelihood (ML) estimation.

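For the linear regression model from above, e.g., OLS estimation corresponds to
%
\begin{equation*}
  \hat{\theta} = \argmin_{\theta} \Psi(Y, \theta), \qquad
  \Psi(Y, \theta) = \sum_{i = 1}^{n} (y_i - x_i^\top \theta)^2 .
\end{equation*}
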
If a global model for all $n$ observations does not fit well and further
covariates $Z_1, \dots, Z_\ell$ are available, it might be possible to partition
the $n$ observations with respect to these variables and find a well-fitting model
in each cell of the partition. The algorithm described here tries to find
such a partition adaptively using a greedy forward search. This procedure is
implemented in the function \code{mob()} and described in more detail in the following
section. However, we first state some goals and design principles.

To translate the model-based partitioning problem into \proglang{R}, we start with
a formula description of the variables involved. This formula should be of type
\verb:y ~ x1 + ... + xk | z1 + ... + zl: where the variables on the
left of the \code{|} specify the data $Y$ and the variables on the right specify the
partitioning variables $Z_j$. Classical regression trees usually have a univariate
response $Y$ and various partitioning variables, i.e., could be specified as
\verb:y ~ 1 | z1 + ... + zl:. Structural change models, on the other hand, are usually
regression models that are segmented with respect to a single partitioning variable,
typically time: \verb:y ~ x1 + ... + xk | z:.

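To give a flavor of the interface, these two extreme cases could be specified as
follows (a schematic sketch only; the data frame \code{d} and its variables are
hypothetical and the calls are not evaluated here):

<<eval=FALSE>>=
## hypothetical data frame 'd' with response y, regressors x1, x2,
## and partitioning variables z1, z2, z3, time
mob(y ~ 1 | z1 + z2 + z3, data = d, model = linearModel)  ## regression tree
mob(y ~ x1 + x2 | time, data = d, model = linearModel)    ## structural change model
@
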
The type of models $\mathcal{M}$ to be used with \code{mob()} should not be
confined (by the implementation), hence we have written an object-oriented
implementation. The idea is that $\mathcal{M}$ is translated into software
by a model of class ``\code{StatModel}'' as provided by the \pkg{modeltools}
package. The algorithm then relies on various methods being available for these
models. The ``\code{StatModel}'' objects \code{linearModel} and \code{glinearModel},
implementing (generalized) linear regression models, are readily available in
\pkg{modeltools}; others can easily be user-defined.

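As a starting point for user-defined models, the building blocks of the existing
``\code{StatModel}'' objects can be inspected directly (a pointer only, not
evaluated here; the slot names are those used by \pkg{modeltools}):

<<eval=FALSE>>=
library("modeltools")
slotNames(linearModel)  ## components of the "StatModel" object
linearModel@fit         ## the function that actually fits the model
@
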
\section{The model-based recursive partitioning algorithm}
\label{sec:algorithm}

The basic idea is to grow a tree in which every node is associated with a model
of type $\mathcal{M}$. To assess whether splitting of the node is necessary, a
fluctuation test for parameter instability is performed. If there is significant instability
with respect to any of the partitioning variables $Z_j$, the node is split
into $B$ locally optimal segments (currently only $B = 2$ is implemented)
and then the procedure is repeated in each of the $B$ children.
If no more significant instabilities can be found, the recursion stops.
More precisely, the steps of the algorithm are:

\begin{enumerate}
\item Fit the model once to all observations in the current node.
\item Assess whether the parameter estimates are stable with respect to
  every partitioning variable $Z_1, \dots, Z_\ell$. If there is some overall instability,
  select the variable $Z_j$ associated with the highest parameter instability, otherwise
  stop.
\item Compute the split point(s) that locally optimize the objective function $\Psi$.
\item Split the node into child nodes and repeat the procedure.
\end{enumerate}

The details for steps 1--3 are specified in the following.

\subsection{Parameter estimation}

This step of the algorithm is common practice; the only additional
requirement is (as previously noted) that the model has to be of class
``\code{StatModel}'' as provided by \pkg{modeltools}. Looking at the source
code for the \code{linearModel} provided by this package illustrates how
a simple wrapper to existing \proglang{R} functionality can be written.
In particular, a method for the generic function \code{reweight()} has to
be available. The reason is that it is inefficient to fit a brand-new model
\code{modelobj} (including formula parsing) in every node---much computation time
is saved if simply \code{reweight(modelobj, weights)} is called in each of the
child nodes. The \code{weights} argument controls which observations go into which
of the child nodes.
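
A minimal sketch of this mechanism (with a hypothetical data frame \code{d}
containing a response \code{y} and a regressor \code{x}; not evaluated here):

<<eval=FALSE>>=
mf <- dpp(linearModel, y ~ x, data = d)  ## data preprocessing, done only once
fm <- fit(linearModel, mf)               ## fit on all observations
w <- as.numeric(d$x > 0)                 ## e.g., weights for one child node
fm2 <- reweight(fm, w)                   ## refit without re-parsing the formula
coef(fm2)
@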

\subsection{Testing for parameter instability}

The task in this step of the algorithm is to find out whether the parameters
of the fitted model are stable over each particular ordering implied by
the partitioning variables $Z_j$ or whether splitting the sample with respect
to one of the $Z_j$ might capture instabilities in the parameters and thus improve the fit.
The tests used in this step belong to the class of generalized M-fluctuation
tests \citep{ZeileisHornik2003,Zeileis2005}. For numerical partitioning variables
$Z_j$, the $\sup LM$~statistic is used, which is the maximum over all single-split
$LM$~statistics. For categorical partitioning variables, a $\chi^2$~statistic is
employed which captures the fluctuation within each of the categories of $Z_j$.

For computing the test statistics and corresponding $p$~values $p_j$ for
each of the partitioning variables $Z_j$ in \proglang{R}, the only requirement
is that there are methods for the extractor functions \code{estfun()} and
\code{weights()}. The \code{estfun()} method extracts the empirical estimating
functions (model scores) from the fitted \code{modelobj}; these are the
main ingredient for M-fluctuation tests. The \code{weights()} method is used
to determine which observations are in the current node (i.e., have a weight
greater than zero) and which are not (i.e., have zero weight).

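As a sketch (with \code{fm} denoting a fitted model object as in the previous
section; not evaluated here), the two ingredients are simply:

<<eval=FALSE>>=
head(estfun(fm))        ## n x k matrix of empirical estimating functions
table(weights(fm) > 0)  ## observations inside/outside the current node
@
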
To determine whether there is some overall instability, it is checked whether
the minimal $p$~value $p_{j^*} = \min_{j = 1, \dots, \ell} p_j$ falls below a
pre-specified significance level $\alpha$ (by default $\alpha = 0.05$) or not.
To adjust for multiple testing, the $p$~values can be Bonferroni adjusted
(which is the default). If there is significant instability, the variable $Z_{j^*}$
associated with the minimal $p$~value is used for splitting the node.

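In \proglang{R} terms, this stopping rule essentially amounts to the following
sketch (with \code{p} denoting the hypothetical vector of $\ell$ unadjusted
$p$~values):

<<eval=FALSE>>=
p_adj <- pmin(1, length(p) * p)  ## Bonferroni adjustment
if (min(p_adj) < alpha) {
  jstar <- which.min(p_adj)      ## variable selected for splitting
} else {
  ## stop: no significant instability
}
@
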
\subsection{Splitting}

In this step, the observations in the current node are split with respect
to the chosen partitioning variable $Z_{j^*}$ into $B$ child nodes. Currently,
the infrastructure in \pkg{party} only supports binary splits, i.e., $B = 2$.
For determining the split point, an exhaustive search procedure is adopted:
for each conceivable split point, the $B$ child node models are fitted and the split
associated with the minimal value of the objective function $\Psi$ is chosen.

Computationally, this means that the fitted model \code{modelobj} is \code{reweight()}ed
for each of the child nodes. The observations entering a child node keep their
current weight while those observations that go into different child nodes receive
zero weight. To compare the objective function $\Psi$, an extractor function
is required to compute it from the fitted \code{modelobj}. This extractor function
can be user-specified and set in \verb:mob_control():; it defaults to \code{deviance()}.

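Schematically, the search over a numeric partitioning variable \code{z} could look
as follows (a sketch only, with \code{modelobj} and \code{w} denoting the current
node's fitted model and case weights; the actual implementation in \code{mob()}
differs in its details):

<<eval=FALSE>>=
## candidate split points: observed values of z in the current node,
## dropping the largest one so that both children are non-empty
candidates <- head(sort(unique(z[w > 0])), -1)
obj <- sapply(candidates, function(s) {
  wl <- w * (z <= s)  ## left child keeps its weight
  wr <- w * (z > s)   ## right child
  deviance(reweight(modelobj, wl)) + deviance(reweight(modelobj, wr))
})
candidates[which.min(obj)]  ## split point minimizing the objective function
@
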
This concludes one iteration of the recursive partitioning algorithm and steps~1--3
are carried out again in each of the $B$ daughter nodes until no significant
instability is detected in step~2.

\section{Illustrations}
\label{sec:illustration}

\subsection{Boston housing data}

Since the analysis by \cite{BreimanFriedman1985}, the Boston housing data have been
a popular and well-investigated empirical basis for illustrating non-linear
regression methods both in machine learning and statistics
\citep[see][for two recent examples]{Gama2004,Samarovetal2005}, and we follow
these examples by segmenting a bivariate linear regression model for the house
values.

Thus, the model $\mathcal{M}$ used is \code{linearModel} from the \pkg{modeltools}
package, which is automatically loaded together with \pkg{party}.

<<>>=
library("party")
@

The data set is available in package \pkg{mlbench} via

<<>>=
data("BostonHousing", package = "mlbench")
@

and provides $n = \Sexpr{NROW(BostonHousing)}$ observations of the median value of owner-occupied
homes in Boston (in USD~1000) along with $\Sexpr{NCOL(BostonHousing)}$ covariates, including in particular
the number of rooms per dwelling (\code{rm}) and the percentage
of lower-status population (\code{lstat}). A segment-wise linear relationship between
the value and these two variables is very intuitive, whereas the shape of the influence
of the remaining covariates is rather unclear and hence should be learned from the data.
Therefore, a linear regression model for median value explained by \verb:rm^2:
and \verb:log(lstat): with $k = 3$ regression coefficients is employed and
partitioned with respect to all $\ell = 11$ remaining variables. To facilitate subsequent
commands, the transformations are explicitly stored in \code{BostonHousing}:

<<>>=
BostonHousing$lstat <- log(BostonHousing$lstat)
BostonHousing$rm <- BostonHousing$rm^2
@

Choosing appropriate
transformations of the dependent variable and the regressors that enter the linear
regression model is important to obtain a well-fitting model in each segment, and
we follow in our choice the recommendations of \cite{BreimanFriedman1985}. Monotonic
transformations of the partitioning variables do not affect the recursive partitioning
algorithm and hence do not have to be performed. However, it is important to distinguish
between numerical and categorical variables for choosing an appropriate parameter
stability test. The variable \code{chas} is a dummy indicator variable (for tract bounds
with the Charles River) and should thus be turned into a factor. Furthermore, the variable
\code{rad} is an index of accessibility to radial highways and takes only 9 distinct
values. Thus, it is most appropriately treated as an ordered factor.

<<>>=
BostonHousing$chas <- factor(BostonHousing$chas, levels = 0:1, labels = c("no", "yes"))
BostonHousing$rad <- factor(BostonHousing$rad, ordered = TRUE)
@

Both transformations only affect the parameter stability test chosen (step~2), not the splitting
procedure (step~3).

The model is estimated
by OLS, the instability is assessed using a Bonferroni-corrected
significance level of $\alpha = 0.05$ and the nodes are split with a required minimal
segment size of $40$ observations. The control parameters are thus set to

<<>>=
ctrl <- mob_control(alpha = 0.05, bonferroni = TRUE, minsplit = 40,
  objfun = deviance, verbose = TRUE)
@

Actually, all of these settings are the defaults except \code{minsplit = 40} and
\code{verbose = TRUE}, which causes some information about the fitting process
to be written to the screen. The objective function \code{deviance()} extracts in
this case the residual sum of squares from a fitted \code{linearModel} object.

Having collected all building blocks, we can now call the function \code{mob()},
which takes the model specification of the linear regression model \verb:medv ~ lstat + rm:
plus all partitioning variables, along with the \code{data} set, the \code{control}
settings and the \code{model} to be used.

<<>>=
fmBH <- mob(medv ~ lstat + rm | zn + indus + chas + nox + age + dis + rad + tax + crim + b + ptratio,
  data = BostonHousing, control = ctrl, model = linearModel)
@

The result is the fitted model \code{fmBH} of class ``\code{mob}'' that contains
the tree with a fitted linear regression associated with every node. Printing
this object will show the splits, their $p$~values, and call the \code{print()} method
for the model in each terminal node (i.e., this simply relies on a \code{print()}
method being available for the fitted model and re-uses it).

<<>>=
fmBH
@

Looking at the printed output is typically rather tedious; a visualization via
the \code{plot()} method

<<eval=FALSE>>=
plot(fmBH)
@

is much easier to interpret. By default, this produces partial scatter plots of the
variable $y$ against each of the regressors $x_i$ in the terminal nodes. Each scatter
plot also shows the fitted values, i.e., a projection of the fitted hyperplane.

\setkeys{Gin}{width=\textwidth}
\begin{figure}[p]
\begin{center}
<<BostonHousing-plot,echo=FALSE,results=hide,fig=TRUE,height=8,width=12,include=FALSE>>=
plot(fmBH)
@
\includegraphics[width=18cm,keepaspectratio,angle=90]{MOB-BostonHousing-plot}
\caption{\label{fig:BostonHousing} Linear-regression-based tree for the Boston housing data.
The plots in the leaves give partial scatter plots for \code{rm} (upper panel) and
\code{lstat} (lower panel).}
\end{center}
\end{figure}

From this visualization, it can be seen that in nodes~4, 6, 7 and 8 the increase of
value with the number of rooms dominates the picture (upper panel), whereas in node~9 the
decrease with the lower-status population percentage (lower panel) is more pronounced.
Splits are performed in the variables \code{tax} (property-tax rate) and
\code{ptratio} (pupil-teacher ratio).

Various quantities of interest can be computed, provided that the \code{model} used
provides the corresponding methods, e.g., \code{predict()}, \code{residuals()}, \code{logLik()},
\code{coef()} and \code{summary()}. The latter two by default try to extract information
for the terminal nodes, but a \code{node} argument can be set to the node IDs of interest.
As an example, the regression coefficients for the terminal node models can be easily
extracted by

<<>>=
coef(fmBH)
@

reflecting the differences between the models that can also be seen in the associated
\code{plot()}. Even more information is available in a \code{summary()}, e.g., for node 7:

<<>>=
summary(fmBH, node = 7)
@

The test statistics and $p$~values computed in each node can be extracted analogously
by using the method for the function \code{sctest()} (for performing \underline{s}tructural
\underline{c}hange \underline{test}s).

<<>>=
sctest(fmBH, node = 7)
@

For summarizing the quality of the fit, we could compute the mean squared error, log-likelihood
or AIC:

<<>>=
mean(residuals(fmBH)^2)
logLik(fmBH)
AIC(fmBH)
@

<<echo=FALSE>>=
nt <- NROW(coef(fmBH))
nk <- NCOL(coef(fmBH))
@

As the \code{logLik()} method simply re-uses the method for \code{linearModel} objects,
this does not report $\Sexpr{(nk+1)*nt-1}$ estimated parameters ($\Sexpr{nk}$ parameters in
each of the $\Sexpr{nt}$ terminal nodes plus $\Sexpr{nt} - 1$ split points) but
$\Sexpr{(nk+2)*nt-1}$ parameters, because each terminal node is additionally associated with a
variance estimate. However, for the fitting process, the variance was treated as a nuisance
parameter as we employed OLS estimation (rather than fully-specified ML estimation).
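
The parameter count can thus be reconstructed by hand:

<<eval=FALSE>>=
nt <- NROW(coef(fmBH))   ## number of terminal nodes
nk <- NCOL(coef(fmBH))   ## regression coefficients per node
nk * nt + nt + (nt - 1)  ## coefficients + variances + split points
(nk + 2) * nt - 1        ## the same, as counted by logLik()
@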

\subsection{Pima Indians diabetes data}

Another popular benchmark data set for binary classification is the Pima Indians
diabetes database, which is also available from \pkg{mlbench}:

<<>>=
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes <- na.omit(PimaIndiansDiabetes2[,-c(4, 5)])
@

After omitting missing values (and the variables \verb:triceps: and \verb:insulin:, which
are missing for most women), the data set provides diabetes test results for
$n = \Sexpr{NROW(PimaIndiansDiabetes)}$ women
along with $\Sexpr{NCOL(PimaIndiansDiabetes)}$ covariates, including in particular
the plasma glucose concentration \code{glucose} as an important predictor for diabetes.
Fitting a logistic regression model \verb:diabetes ~ glucose: seems to be straightforward,
whereas the influence of the remaining variables should again be learned by recursive
partitioning. This will yield a model tree with $k = 2$ regression coefficients in each terminal
node, partitioned with respect to the $\ell = 5$ remaining variables.

The model is estimated by ML employing the \code{glinearModel},
the instability is assessed using a Bonferroni-corrected
significance level of $\alpha = 0.05$ and the nodes are split with a required minimal
segment size of $20$ observations. Hence, all control parameters correspond to the
default values in \verb:mob_control(): and do not have to be set explicitly in
the \code{mob()} call:

<<>>=
fmPID <- mob(diabetes ~ glucose | pregnant + pressure + mass + pedigree + age,
  data = PimaIndiansDiabetes, model = glinearModel, family = binomial())
@

To visualize this, we simply call again:

<<eval=FALSE>>=
plot(fmPID)
@

\setkeys{Gin}{width=\textwidth}
\begin{figure}[bth]
\begin{center}
<<PimaIndiansDiabetes-plot,echo=FALSE,results=hide,fig=TRUE,height=6,width=9>>=
plot(fmPID)
@
\caption{\label{fig:PimaIndiansDiabetes} Logistic-regression-based tree for the Pima Indians
diabetes data. The plots in the leaves give spinograms for \code{diabetes} versus
\code{glucose}.}
\end{center}
\end{figure}

which produces again a plot of the dependent variable $y$ against the only regressor
$x$ in the terminal nodes. As $y$ is \code{diabetes}, a binary variable, and $x$ is
\code{glucose}, a numeric variable, a spinogram is chosen for visualization. The breaks
in the spinogram are the five-point summary of \code{glucose} on the full data set. The
fitted lines are the mean predicted probabilities in each group.

The model tree distinguishes three different groups:
\begin{itemize}
  \item[\#2] Women with low body mass index who have on average a low risk of
    diabetes; however, this risk increases clearly with glucose level.
  \item[\#4] Women with average and high body mass index, younger than 30 years,
    who have a higher average risk that also increases with glucose level.
  \item[\#5] Women with average and high body mass index, older than 30 years,
    who have a high average risk that increases only slowly with glucose level.
\end{itemize}

The same interpretation can also be drawn from the coefficient estimates
and the corresponding odds ratios (with respect to glucose):

<<>>=
coef(fmPID)
exp(coef(fmPID)[,2])
@

<<echo=FALSE>>=
risk <- round(100 * (exp(coef(fmPID)[,2])-1), digits = 1)
@

i.e., the odds of diabetes increase by \Sexpr{risk[1]}\%, \Sexpr{risk[2]}\% and \Sexpr{risk[3]}\%
per unit of glucose in the three groups.

\section{Conclusion}
\label{sec:conclusion}

The function \code{mob()} in the \pkg{party} package provides a flexible and object-oriented
implementation of the general algorithm for model-based recursive partitioning.
Models of class ``\code{StatModel}''---that employ a formula interface and are equipped with
methods for the generic functions \code{reweight()}, \code{weights()}, \code{estfun()} plus
some function for extracting the value of the objective function---can be easily partitioned.
The resulting ``\code{mob}'' tree can be flexibly summarized, both numerically and graphically,
and used for predictions on new data.
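
For instance, predictions for new observations re-use the model fitted in the
corresponding terminal node; a quick sketch for the Boston housing tree from above:

<<eval=FALSE>>=
## predictions from the terminal node models, here for the first
## five observations of the learning data
predict(fmBH, newdata = BostonHousing[1:5, ])
@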

\bibliography{partyrefs}

\end{document}