\documentclass[11pt]{article}
\usepackage[round]{natbib}
\usepackage{bibentry}
\usepackage{amsfonts}
\usepackage{hyperref}
\renewcommand{\baselinestretch}{1.3}
\newcommand{\ipred}{\texttt{ipred }}
%\VignetteIndexEntry{Some more or less useful examples for illustration.}
%\VignetteDepends{ipred}
%\textwidth=6.2in
%\VignetteDepends{mvtnorm,TH.data,rpart,MASS}
\begin{document}

\title{\ipred: Improved Predictors}
\date{}

\SweaveOpts{engine=R,eps=TRUE,pdf=TRUE}

<<>>=
options(prompt=">", width=50)
set.seed(210477)
@

\maketitle

This short manual is heavily based on \cite{Rnews:Peters+Hothorn+Lausen:2002}
and needs some improvements.

\section{Introduction}

In classification problems, there are several approaches to constructing
rules which assign future observations to certain classes. Common methods
are, for example, linear discriminant analysis and classification trees.
Recent developments have led to substantial reductions of the
misclassification error in many applications. Bootstrap aggregation
\citep[``bagging'',][]{breiman:1996} combines classifiers trained on
bootstrap samples of the original data. Another approach is indirect
classification, which incorporates a priori knowledge into a classification
rule \citep{hand:2001}. Since the misclassification error is the criterion
used to assess these classification techniques, its estimation is of major
importance. A nearly unbiased but highly variable estimator can be
calculated by cross-validation. \cite{efron:1997} discuss bootstrap
estimates of the misclassification error. As a by-product of bagging,
\cite{out-of-bag:1996} proposes the out-of-bag estimator. \\
However, the calculation of the desired classification models and their
misclassification errors is often hampered by the different and specialized
interfaces of the various procedures. We propose the \ipred package as a
first attempt to create a unified interface for improved predictors and
various error rate estimators. In the following we demonstrate the
functionality of the package using the example of glaucoma classification.
We start with an overview of the disease and the data, and review the
implemented classification and estimation methods in the context of their
application to glaucoma diagnosis.

\section{Glaucoma}

Glaucoma is a slowly progressing and irreversible disease that affects the
optic nerve head. It is the second most common cause of blindness worldwide.
Glaucoma is usually diagnosed based on a reduced visual field, assessed by a
medical examination of perimetry, and a reduced number of intact nerve
fibers at the optic nerve head.
One way to examine the number of intact nerve fibers is the Heidelberg
Retina Tomograph (HRT), a confocal laser scanning tomograph, which performs
a three-dimensional topographical analysis of the optic nerve head
morphology. It produces a series of $32$ images, each of $256 \times 256$
pixels, which are converted to a single topographic image. A less complex,
but also less informative, examination tool is $2$-dimensional fundus
photography. However, in cooperation with clinicians and based on a priori
analyses, we derived a diagnosis of glaucoma from three variables only:
$w_{lora}$ represents the loss of nerve fibers and is obtained by
$2$-dimensional fundus photography; $w_{cs}$ and $w_{clv}$ describe the
visual field defect \citep{ifcs:2001}.
\begin{center}
\begin{figure}[h]
\begin{center}
{\small
\setlength{\unitlength}{0.6cm}
\begin{picture}(14.5,5)
\put(5, 4.5){\makebox(2, 0.5){$w_{clv}\geq 5.1$}}
\put(2.5, 3){\makebox(2, 0.5){$w_{lora}\geq 49.23$}}
\put(7.5, 3){\makebox(2, 0.5){$w_{lora} \geq 58.55$}}
\put(0, 1.5){\makebox(2, 0.5){$glaucoma$}}
\put(3.5, 1.5){\makebox(2, 0.5){$normal$}}
\put(6.5, 1.5){\makebox(2, 0.5){$w_{cs} < 1.405$}}
\put(10, 1.5){\makebox(2, 0.5){$normal$}}
\put(3.5, 0){\makebox(2, 0.5){$glaucoma$}}
\put(6.5, 0){\makebox(2, 0.5){$normal$}}
\put(6, 4.5){\vector(-3, -2){1.5}}
\put(6, 4.5){\vector(3, -2){1.5}}
\put(3.5, 3){\vector(3, -2){1.5}}
\put(3.5, 3){\vector(-3, -2){1.5}}
\put(8.5, 3){\vector(3, -2){1.5}}
\put(8.5, 3){\vector(-3, -2){1.5}}
\put(6.5, 1.5){\vector(3, -2){1.5}}
\put(6.5, 1.5){\vector(-3, -2){1.5}}
\end{picture}
}
\end{center}
\caption{Glaucoma diagnosis. \label{diag}}
\end{figure}
\end{center}

Figure \ref{diag} represents the diagnosis of glaucoma in terms of a medical
decision tree. A complication of the disease is that damage to the optic
nerve head morphology precedes a measurable visual field defect.
Furthermore, early detection is of major importance, since an adequate
therapy can only slow down the progression of the disease. Hence, a
classification rule for detecting early damage should be based on
morphological information, rather than on visual field data only.

Two example datasets are used in the following. The first one contains
measurements of the eye morphology only (\texttt{GlaucomaM}, available in
package \texttt{TH.data}), comprising $62$ variables for $196$ observations.
The second dataset (\texttt{GlaucomaMVF}, included in the \ipred package)
contains additional visual field measurements for a different set of
patients. In both example datasets, the observations in the two groups are
matched by age and sex to prevent any bias.

\section{Bagging}

Referring to the example of glaucoma diagnosis, we first demonstrate the
functionality of the \texttt{bagging} function. We fit \texttt{nbagg = 25}
(default) classification trees for bagging by
<<>>=
library("ipred")
library("rpart")
library("MASS")

data("GlaucomaM", package="TH.data")
gbag <- bagging(Class ~ ., data = GlaucomaM, coob=TRUE)
@
where \texttt{GlaucomaM} contains explanatory HRT variables and the response
of glaucoma diagnosis (\texttt{Class}), a factor with the two levels
\texttt{normal} and \texttt{glaucoma}. \texttt{print} returns information
about the returned object, i.e., the number of bootstrap replications used
and, as requested by \texttt{coob=TRUE}, the out-of-bag estimate of the
misclassification error \citep{out-of-bag:1996}.
<<>>=
print(gbag)
@
The out-of-bag estimate uses the observations which are left out in a
bootstrap sample to estimate the misclassification error at almost no
additional computational cost. \cite{double-bag:2002} propose to use the
out-of-bag samples for a combination of linear discriminant analysis and
classification trees, called ``Double-Bagging''. For example, a combination
of a stabilised linear discriminant analysis with classification trees can
be computed along the following lines
<<>>=
scomb <- list(list(model=slda, predict=function(object, newdata)
                   predict(object, newdata)$x))
gbagc <- bagging(Class ~ ., data = GlaucomaM, comb=scomb)
@
\texttt{predict} predicts future observations according to the fitted model.
<<>>=
predict(gbagc, newdata=GlaucomaM[c(1:3, 99:102), ])
@
Both \texttt{bagging} and \texttt{predict} rely on the \texttt{rpart}
routines. The \texttt{rpart} routine for each bootstrap sample can be
controlled in the usual way. By default, \texttt{rpart.control} is used with
\texttt{minsplit=2} and \texttt{cp=0}, and it is wise to turn
cross-validation off (\texttt{xval=0}). The function \texttt{prune} can be
used to prune each of the trees to an appropriate size.
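For instance, smaller trees can be grown in each bootstrap sample by passing
a modified \texttt{rpart.control} object, and the resulting ensemble can be
pruned afterwards. The following sketch is not evaluated here and uses
arbitrarily chosen control parameters for illustration:
<<eval=FALSE>>=
## grow smaller trees in each bootstrap sample by passing a modified
## rpart.control object (parameter values chosen for illustration only)
gbags <- bagging(Class ~ ., data = GlaucomaM, coob = TRUE,
                 control = rpart.control(xval = 0, cp = 0.01,
                                         minsplit = 20))
## prune each of the trees in the ensemble (default pruning parameters)
gbagsp <- prune(gbags)
@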
\section{Indirect Classification}

Especially in a medical context, a priori knowledge about a classifying
structure is often available. For example, it might be known that a disease
is assessed on a subgroup of the given variables or, moreover, that class
memberships are assigned by a deterministically known classifying function.
\cite{hand:2001} propose the framework of indirect classification which
incorporates this a priori knowledge into a classification rule. In this
framework we subdivide a given data set into three groups of variables:
those used for predicting the class membership (explanatory), those used
for defining the class membership (intermediate) and the class membership
variable itself (response). For future observations, an indirect classifier
predicts values for the appointed intermediate variables based on
explanatory variables only. The observation is then classified based on its
predicted intermediate variables and a fixed classifying function. This
indirect way of classification using the predicted intermediate variables
offers the possibility to incorporate a priori knowledge by the subdivision
of variables and by the construction of a fixed classifying function.

We apply indirect classification by using the function \texttt{inclass}.
Referring to the glaucoma example, the explanatory variables are HRT and
anamnestic variables only, the intermediate variables are $w_{lora}, \,
w_{cs}$ and $w_{clv}$. The response is the diagnosis of glaucoma which is
determined by a fixed classifying function and therefore not included in
the learning sample \texttt{GlaucomaMVF}. We assign the given variables to
explanatory and intermediate by specifying the input formula.
<<>>=
data("GlaucomaMVF", package="ipred")
GlaucomaMVF <- GlaucomaMVF[,-63]
formula.indirect <- Class~clv + lora + cs ~ .
@
The variables on the left-hand side represent the intermediate variables,
modeled by the explanatory variables on the right-hand side. Almost any
modeling technique can be used to predict the intermediate variables. We
choose a linear model by \texttt{pFUN = list(list(model = lm))}.
<<>>=
classify <- function (data) {
  attach(data)
  res <- ifelse((!is.na(clv) & !is.na(lora) & clv >= 5.1 &
                 lora >= 49.23372) |
                (!is.na(clv) & !is.na(lora) & !is.na(cs) &
                 clv < 5.1 & lora >= 58.55409 & cs < 1.405) |
                (is.na(clv) & !is.na(lora) & !is.na(cs) &
                 lora >= 58.55409 & cs < 1.405) |
                (!is.na(clv) & is.na(lora) & cs < 1.405), 0, 1)
  detach(data)
  factor(res, labels = c("glaucoma", "normal"))
}
fit <- inclass(formula.indirect, pFUN = list(list(model = lm)),
               cFUN = classify, data = GlaucomaMVF)
@
\texttt{print} displays the subdivision of variables and the chosen
modeling technique
<<>>=
print(fit)
@
Furthermore, indirect classification predicts the intermediate variables
based on the explanatory variables and, in a second step, classifies them
according to a fixed classifying function; that is, a deterministically
known function for the class membership has to be specified. In our example
this function is given in Figure \ref{diag} and implemented in the function
\texttt{classify}.
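For illustration, the fixed classifying function can also be applied
directly to the observed intermediate variables and cross-tabulated with
the recorded diagnosis (a small sketch, not evaluated here):
<<eval=FALSE>>=
## apply the deterministic classifying rule to the observed intermediate
## variables and compare the result with the recorded class membership
table(classify(GlaucomaMVF[, c("clv", "lora", "cs")]), GlaucomaMVF$Class)
@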
Prediction of future observations is now performed by
<<>>=
predict(object = fit, newdata = GlaucomaMVF[c(1:3, 86:88),])
@
We perform a bootstrap aggregated indirect classification approach with the
function \texttt{inbagg}, specifying the number of bootstrap samples via
\texttt{nbagg} \citep{ifcs:2001}. Regression or classification trees are
fitted for each bootstrap sample, depending on the measurement scale of the
specified intermediate variables
<<>>=
mypredict.rpart <- function(object, newdata) {
  RES <- predict(object, newdata)
  RET <- rep(NA, nrow(newdata))
  NAMES <- rownames(newdata)
  RET[NAMES %in% names(RES)] <- RES[NAMES[NAMES %in% names(RES)]]
  RET
}
fit <- inbagg(formula.indirect, pFUN = list(list(model = rpart,
              predict = mypredict.rpart)), cFUN = classify,
              nbagg = 25, data = GlaucomaMVF)
@
The call for the prediction of values remains unchanged.

\section{Error Rate Estimation}

Classification rules are usually assessed by their misclassification rate.
Hence, error rate estimation is of major importance. The function
\texttt{errorest} implements a unified interface to several
resampling-based estimators. Referring to the example, we apply a linear
discriminant analysis and specify the error rate estimator by
\texttt{estimator = "cv"}, \texttt{"boot"} or \texttt{"632plus"},
respectively. A 10-fold cross-validation is performed by choosing
\texttt{estimator = "cv"} and \texttt{est.para = control.errorest(k = 10)}.
The options \texttt{estimator = "boot"} or \texttt{estimator = "632plus"}
deliver a bootstrap estimator and its bias-corrected version {\sl .632+}
\citep[see][]{efron:1997}; the number of bootstrap samples to be drawn is
specified by \texttt{est.para = control.errorest(nboot = 50)}. Further
arguments are required to specify the classification technique. The
argument \texttt{predict} represents the chosen predictive function. For a
unified interface, \texttt{predict} has to be based on the arguments
\texttt{object} and \texttt{newdata} only; therefore, a wrapper function
\texttt{mypredict} is necessary for classifiers which require more than
these arguments or do not return the predicted classes by default. For a
linear discriminant analysis with \texttt{lda}, we need to specify
<<>>=
mypredict.lda <- function(object, newdata){
  predict(object, newdata = newdata)$class
}
@
and calculate a 10-fold cross-validated error rate estimator for a linear
discriminant analysis by calling
<<>>=
errorest(Class ~ ., data= GlaucomaM, model=lda, estimator = "cv",
         predict= mypredict.lda)
@
For the indirect approach the specification of the call becomes slightly
more complicated.
%Again for a unified interface a wrapper
%function has to be used, which incorporates the fixed classification rule
The bias-corrected estimator {\sl .632+} is computed by
<<>>=
errorest(formula.indirect, data = GlaucomaMVF, model = inclass,
         estimator = "632plus", pFUN = list(list(model = lm)),
         cFUN = classify)
@
Because of the subdivision of variables and a formula describing the
modeling of the intermediate variables by the explanatory variables only,
the class membership variable has to be supplied as well. Hence, in
contrast to the function \texttt{inclass}, the data set
\texttt{GlaucomaMVF} used in \texttt{errorest} must contain explanatory,
intermediate and response variables.
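Analogously, the bootstrap estimator of the misclassification error for the
linear discriminant analysis above is obtained by setting
\texttt{estimator = "boot"}; a sketch, not evaluated here, with
\texttt{nboot = 50} bootstrap samples:
<<eval=FALSE>>=
## bootstrap estimator of the misclassification error for lda,
## based on 50 bootstrap samples
errorest(Class ~ ., data = GlaucomaM, model = lda, estimator = "boot",
         est.para = control.errorest(nboot = 50),
         predict = mypredict.lda)
@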
Sometimes it may be necessary to reduce the number of predictors before
training a classifier. Estimating the error rate after variable selection
leads to biased estimates of the misclassification error, and therefore one
should estimate the error rate of the whole procedure. Within the
\texttt{errorest} framework, this can be done as follows. First, we define
a function which does both variable selection and training of the
classifier. For illustration purposes, we select the predictors by
comparing their univariate $P$-values from a two-sample $t$-test with a
prespecified level and train an LDA using the selected variables only.
<<>>=
mymod <- function(formula, data, level=0.05) {
  # select all predictors that are associated with a
  # univariate t.test p-value of less than level
  sel <- which(lapply(data, function(x) {
    if (!is.numeric(x)) return(1) else
      return(t.test(x ~ data$Class)$p.value)
  }) < level)
  # make sure that the response is still there
  sel <- c(which(colnames(data) %in% "Class"), sel)
  # compute a LDA using the selected predictors only
  mod <- lda(formula, data=data[,sel])
  # and return a function for prediction
  function(newdata) {
    predict(mod, newdata=newdata[,sel])$class
  }
}
@
Note that \texttt{mymod} does not return an object of class \texttt{lda}
but a function with argument \texttt{newdata} only. Thanks to lexical
scoping, this function is used for computing predicted classes instead of a
function \texttt{predict} passed to \texttt{errorest} as an argument.
Computing a $5$-fold cross-validated error rate estimator is now
approximately a one-liner.
<<>>=
errorest(Class ~ . , data=GlaucomaM, model=mymod, estimator = "cv",
         est.para=control.errorest(k=5))
@
%%To summarize the performance of the different classification
%%techniques in the considered example of glaucoma diagnosis, the 10-fold
%%cross-validated error estimator delivers the
%%results given in Table \ref{tenf}.
%%\begin{figure}
%%\begin{center}
%%\begin{tabular}{ rrr }
%%\hline
%%dataset & method & error estimate \\
%%\hline
%%\texttt{GlaucomaM} & {\sl slda} & 0.168 \\
%%\texttt{GlaucomaM} & {\sl bagging} & 0.158 \\
%%\texttt{GlaucomaM} & {\sl double-bagging} & 0.153 \\
%%\texttt{GlaucomaMVF} & {\sl inclass-bagging} & 0.206 \\
%%\texttt{GlaucomaMVF} & {\sl inclass-lm} & 0.229 \\
%%\hline
%%\end{tabular}
%%\caption{10-fold cross-validated error estimation of
%%the misclassification error for several classification
%%methods: {\sl slda} - stabilised linear discriminant analysis,
%%{\sl bagging} - bagging with 50 bootstrap samples,
%%{\sl double-bagging} - bagging with 50 bootstrap samples,
%%combined with sLDA, {\sl inclass-bagging} -
%%indirect classification using bagging,
%%{\sl inclass-lm} indirect classification using
%%linear modeling. \label{tenf}}
%%\end{center}
%%\end{figure}
%%Note that an estimator of the variance is available for the ordinary
%%bootstrap estimator (\texttt{estimator="boot"}) only, see \cite{efron:1997}.

\section{Summary}

\ipred tries to implement a unified interface to some recent developments
in classification and error rate estimation. It is by no means finished or
perfect, and we very much appreciate comments, suggestions and criticism.
Currently, the major drawback is speed. Calling \texttt{rpart} $50$ times
for each bootstrap sample is relatively inefficient, but the design of the
interfaces was our main focus rather than optimization. Besides the
examples shown, \texttt{bagging} can be used to compute bagged regression
trees and \texttt{errorest} computes estimators of the mean squared error
for regression models.

\bibliographystyle{plainnat}
\bibliography{ipred}

\end{document}