International Journal of Epidemiology 2002;31:699-700
© International Epidemiological Association 2002
Book Review
Regression Modelling Strategies with Applications to Linear Models, Logistic Regression, and Survival Analysis. Frank E Harrell Jr, New York: Springer 2001, pp. 568, $79.95. ISBN 0-387-95232-2.
Most statistical textbooks present techniques and give simple examples of their use. This book is different. It assumes that your well-stocked statistical toolbox, acquired in graduate school, already holds the basic tools of linear and logistic regression and of parametric and semi-parametric survival analysis. The question this book addresses is how to use those regression tools properly. The book succeeds in being both philosophical and intensely practical. It is about the art of data analysis and modelling strategies. It takes you through the whole process, starting with imputation of missing data, then dealing with non-linear relationships, estimating transformations, variable selection and model building, and finally validating the model using powerful bootstrap techniques.
Harrell takes a unifying approach to regression modelling strategies: he emphasises how the methods he presents may be used across many different types of regression model in a variety of subject areas, although his examples are biomedical. One of the main points of the book is that a widespread dishonesty exists in the way we treat inference from P-values, confidence intervals and other statistics as if the data had not been used to build the model. We need to recognise that it is usually not possible to pre-specify a multivariable regression model: for example, whether a survival model should be Weibull or lognormal, which transformations of variables are appropriate, and whether to include non-linear and interaction terms. Nevertheless, statistics are often computed as if the data had played no part in decisions about the form of the model and how predictors are represented in it. As a result, models overfit the data on which they are estimated and predict the responses of future observations poorly. Great emphasis is placed on addressing this fundamental problem of the modelling process. In particular, the author strongly recommends using bootstrap methods in many steps of the modelling strategy, including variable selection, derivation of distribution-free confidence intervals and estimation of optimism in model fit. For example, stepwise variable selection has been much criticised, but Harrell combines the procedure with bootstrapping. He shows that bootstrapped samples of the same dataset lead to the selection of different sets of variables, and that a better strategy is to use the set of variables which occurs most frequently across the bootstrapped samples. This will give a more reliable and useful set of prognostic factors in the model, which will predict responses from new data with greater precision and accuracy.
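To make the idea of "optimism in model fit" concrete, here is a minimal sketch in Python. It is not the S-Plus code from the book, and the function names (`fit_ols`, `r_squared`, `bootstrap_optimism`) and the simulated data are assumptions made purely for illustration. The sketch fits an ordinary least squares model, refits it on bootstrap resamples, and averages the drop in R² when each refitted model is applied back to the original data. Subtracting that average from the apparent R² gives an optimism-corrected estimate.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares coefficients for design matrix X (intercept column included)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def r_squared(X, y, beta):
    """R^2 of the predictions X @ beta against the observed y."""
    resid = y - X @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def bootstrap_optimism(X, y, n_boot=200, seed=0):
    """Return (apparent R^2, mean optimism, optimism-corrected R^2)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    apparent = r_squared(X, y, fit_ols(X, y))
    optimism = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample rows with replacement
        Xb, yb = X[idx], y[idx]
        beta_b = fit_ols(Xb, yb)              # refit on the bootstrap sample
        fit_on_boot = r_squared(Xb, yb, beta_b)
        fit_on_orig = r_squared(X, y, beta_b)  # same coefficients, original data
        optimism.append(fit_on_boot - fit_on_orig)
    mean_optimism = float(np.mean(optimism))
    return apparent, mean_optimism, apparent - mean_optimism


# Simulated data (an assumption for illustration): one real predictor
# and nine pure-noise predictors, so the fitted model overfits.
rng = np.random.default_rng(1)
n, p = 60, 10
Z = rng.normal(size=(n, p))
y = 1.0 * Z[:, 0] + rng.normal(size=n)
X = np.column_stack([np.ones(n), Z])          # add intercept column

apparent, optimism, corrected = bootstrap_optimism(X, y)
print(f"apparent R^2  = {apparent:.3f}")
print(f"mean optimism = {optimism:.3f}")
print(f"corrected R^2 = {corrected:.3f}")
```

In a full analysis every data-driven step, such as variable selection or the choice of transformations, would have to be repeated inside the bootstrap loop; this sketch refits only a fixed model.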
There are detailed case studies of real examples, analysed using S-Plus, with the code given explicitly. The web site of the book gives access to the datasets and to an S-Plus library of 200 functions for model fitting and testing, estimation, validation, prediction, graphics and typesetting. The book is particularly strong on graphical presentation of regression models and claims that a picture will often persuade a non-statistician of the need for a particular transformation of a predictor, rather than opting for a simple linear term which does not fit the data so well. In particular, cubic splines and non-parametric smoothers are recommended early on as a way of relaxing linearity assumptions and are used throughout the case studies.
This is an excellent book for its target audience: postgraduates who know the technical details of regression models, but not necessarily when and how to use them. It is also a worthwhile addition to the reference shelf of data analysts and statistical methodologists, who will appreciate the many recipes for successful modelling strategies and the tips on validation when the data have been used to inform the modelling process.