Common Mistakes in Multiple Regression
1. The response variable, Y, doesn’t need to be normal, as this is not an
assumption of multiple regression. In fact, Y will rarely be normal. What must
be true is that the errors around a prediction Ŷ are normal, which one can check
with a normal plot of the residuals. The errors must also have a constant
variance, which one can check with a predicted by residual plot. This plot
should have no pattern. One should watch for a fanning out in this plot, which
would suggest that a log transformation is needed. Doing a normal plot of Y
will cost one points.
2. Regression does not assume that the regressors have any particular
distribution. So checking whether they have a normal distribution with normal
plots is not required and will cost one points. One should, however, use box
plots to check for outliers in the regressors. (Just don’t do a normal plot.)
One should probably leave these points in the data and, once the multiple
regression is done, check for influential observations.
3. It is a waste to do bivariate regressions with least squares fits. The fits
give no useful information, so one should not do them if one doesn’t want to
lose points. It is good, however, to look at the scatter plots of each
regressor versus the response. They may give one insight into whether
transformations are needed.
4. It is a mistake to rely on stepwise regression exclusively to identify the
model. The model that stepwise regression comes up with should be taken as a
suggestion that one has to check using the tools taught in the class, such as
the Effects Table and residual analysis. When one has a small number of
regressors, one can select “all models” from the triangle menu on the stepwise
output. The model selected this way will be the best in terms of having the
highest R-Square for a given number of regressors. However, this is not the
only criterion for a model being good. One must check it just as one must check
the model selected by stepwise regression.
5. Confusing the errors with the residuals.
Common Conceptual Mistakes
1. Saying that a confidence interval contains a statistic, such as the sample
mean, with a given confidence, instead of a parameter, such as the population
mean.
2. Confusing the terms: predictor variable, regressor, independent variable,
dependent variable and response.
3. Confusing the concepts of outlier and influential observation.
4. Confusing “estimating” with “predicting”.
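
The first conceptual mistake can be made concrete with a short simulation. This
is a sketch under stated assumptions: the population mean and standard
deviation (MU and SIGMA) are hypothetical values chosen for illustration, and a
z-interval with known standard deviation is used only to keep the code within
the Python standard library. The sample mean sits at the center of its own
interval every time, so a statement about the interval "containing the sample
mean" says nothing. The 95% refers to how often the interval covers the
population mean over repeated samples.

```python
import random
from statistics import NormalDist, mean

random.seed(7)

# Hypothetical population parameters, known here only because we simulate.
MU, SIGMA = 50.0, 10.0
N, TRIALS = 25, 2000

# Half-width of a 95% z-interval for the mean when SIGMA is known.
z = NormalDist().inv_cdf(0.975)
half_width = z * SIGMA / N ** 0.5

covers_mu = 0    # intervals that contain the population mean (the parameter)
covers_xbar = 0  # intervals that contain their own sample mean (the statistic)
for _ in range(TRIALS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    xbar = mean(sample)
    lo, hi = xbar - half_width, xbar + half_width
    covers_mu += lo <= MU <= hi
    covers_xbar += lo <= xbar <= hi

coverage = covers_mu / TRIALS
print("share of intervals containing the population mean:", coverage)
print("share of intervals containing the sample mean:    ", covers_xbar / TRIALS)
```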