Multiple Regression in SPSS/PASW (The Simple Way)

Uploaded by bartonpoulson on 22.05.2009

Hi, my name's Bart Poulson and in this, uh, tutorial, I'm going to show how to calculate
a multiple regression in SPSS, uh, now known as PASW, uh, for "Predictive Analytic Software."
Uh, I'm using version 17, but the previous versions are essentially identical for what
I'm doing. I'm also doing this on my Mac, but the, uh, Windows version again is essentially
identical. Multiple regression is used to predict the values on a quantitative outcome
variable, uh, using several other predictor variables. They can be quantitative, uh, or
categorical, so on. Um, I'm using a data set that exists in SPSS that's called "world95.sav."
The ".sav" is the suffix for an SPSS data set. You can get it by opening up, uh, the data
sets and going to the end to this one, world95.sav. I have it open already.
And it's data for 109 countries from Afghanistan to Zambia, and there's the US and it has information
about population, life expectancy, literacy, uh, uh, predominant religions and so on. What
I'm going to be doing in this one is I'm gonna be taking a few variables to predict women's
average life expectancy for each one of these countries. Um, now in a previous one, I showed
how to calculate a correlation matrix and that's important because the same information
is used in calculating a multiple regression. In fact, I've kept the results from that one
so let me show them right here. Kay, what I have here are 5 variables. I'm going to
use this one, average female life expectancy, as my outcome. And then I'm gonna use these,
the literacy rate, the GDP, the daily caloric intake and the birth rate as predictor variables
of female life expectancy. This matrix has, uh, the variables listed down the side, has
the same variables across the top; it's, uh, symmetrical on the diagonal. The 1's right
here are each variable correlated with themselves, so that's, uh, gonna be one, and you see,
for instance, that the -.862 here is -.862 here. Now one important thing to note about this
is all of the correlation coefficients are very strong and they are all statistically
significant. Uh, you see the asterisks right there, that has to do with this one—the
significance. The—it says "Sig (2-tailed)"; this is the probability level from a null
hypothesis test and all of them, it's extremely small, it's less than .001. Um, so please remember
they are all statistically significant and all, uh, much less than .001. 'Cuz what I'm
gonna do now is I'm gonna use a multiple regression to look at the association of all four of
these together to predict life expectancy for women. So the way I'm gonna do this is
I come up to "Analyze" and down to "Regression." I'm gonna use "Linear"; I used this, uh, command
for a bivariate or simple regression earlier. Now we're just gonna get a little more sophisticated.
Now I already put the variables in there from an earlier one, but what I did is I came over
here and I just, uh, selected each of the variables by pressing them in here. I have
average female life expectancy, literacy rate, GDP, caloric intake, and birth rate per 1000
people. Now, multiple regression can be a very complicated, a very sophisticated thing.
There's an awful lot of decisions to make that can make a big difference in how things
work. However, for this one, I'm gonna use the simplest possible version where I simply
keep all of the--the defaults the way they are and this is not a bad method for most
regressions. So, I'm just gonna come down here, and I'm gonna hit "OK." Now this one
right here is, uh, what's called a syntax. It's a command, a written command for what
I just did. The nice thing is you can save that in a separate file, you can run it again
later, you can modify it. Uh, I may talk about that later. This says that I did a regression.
This one says which data set I used. See, it says "world95" right here. This says what
variables I used as the predictor. Birth rate, GDP, caloric intake, uh, people who read.
And that this one is the outcome variable, the dependent variable, which is average female
life expectancy for each of the 109 countries. This one right here tells me that they are
very highly correlated, that these variables predict life expectancy very well. This first
one here, the capital R, is called the multiple correlation coefficient 'cuz it's looking
at the association of all of the variables together. You know, the--the maximum value
is 1, and it can't be negative. This has a .912 which is extremely high. Uh, more frequently
people use the R squared, which is, you know, .912 squared is .832 which means that 83% of
the variance in average life expectancy can be predicted by the combination of these 4
variables. Now this one, "Adjusted R Square" takes into consideration the number of observations
and the number of predictor variables, um, to make sure that things aren't too inflated.
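
The arithmetic behind R squared and adjusted R squared can be sketched in a few lines of Python. This is only an illustration, not what SPSS runs internally: the function names are made up for this example, and the case count of 109 assumes every country has complete data. SPSS normally drops cases with missing values, so the adjusted figure on screen in the video may differ slightly.

```python
def r_squared(multiple_r):
    """Proportion of outcome variance explained, given the multiple correlation R."""
    return multiple_r ** 2


def adjusted_r_squared(r2, n_cases, n_predictors):
    """Shrink R squared to account for sample size and number of predictors."""
    return 1 - (1 - r2) * (n_cases - 1) / (n_cases - n_predictors - 1)


multiple_r = 0.912   # "R" as read from the Model Summary table
n_predictors = 4     # literacy, GDP, caloric intake, birth rate
n_cases = 109        # ASSUMPTION: all 109 countries have complete data

r2 = r_squared(multiple_r)
adj_r2 = adjusted_r_squared(r2, n_cases, n_predictors)

print(f"R squared:          {r2:.3f}")
print(f"Adjusted R squared: {adj_r2:.3f}")
```

With fewer cases or more predictors the adjustment grows, which is why the adjusted value is always a little below the plain R squared.
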
So it's generally smaller. Uh, the "Standard Error of the Estimate" is simply something
that goes into the, uh, hypoth—hypothesis test for this one. We're not gonna worry about
that. This next one is also an indication of how well the model fits. Um, all you need
to know is that it fits really well. This significance test right here is much, much
less than .05, uh, which means a 5%, uh, Type I error rate, or 5% false positive rate.
This is a very tight model. This is a good one. The important ones are down here under where
it says "Coefficients." Now, what we're gonna look at is for instance, this one right here,
the constant, this says when all of the predictor variables are zero, which actually isn't possible,
you would start with an average life expectancy for women of 43.778 years. Um, and that this
number is significantly different from zero and there's--there's no shock right there.
The more interesting ones are these coefficients right here. Uh, this says for the percent
of people who read, uh, for every percentage point of people who read, add .226 years to
the average life expectancy for women. This one, GDP, says add 0. Uh, caloric intake is
add .006. Now that means for each additional calorie, uh, which is why this is a very small
number because calories, you know, you have thousands of 'em. Um, and then birth rate
is negative. Now on the other hand, if you come over here to the "Sig" column, that means
the probability level for each one of these. And again, they generally need to be less
than .05 to be considered significant or meaningful or reliable. All of them are less than .05
except for this one right here, the .786 for the GDP. Now, here's the important lesson
about multiple regression. Let's go back up to the correlation matrix I showed at the
beginning. When you look right here, this is the outcome variable, "Average Female Life
Expectancy," every one of these variables including GDP are highly correlated, uh, on
their own with the outcome variable. Please note GDP has a correlation of .642, which is
really big, and has a probability level less than .001. All of them do. But when we come
back down here, GDP is no longer significantly associated. Now, the reason for that is because
multiple regression looks at the combination of these four variables to predict the outcome.
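
How a predictor can correlate strongly with the outcome on its own and still contribute nothing once another predictor is in the model can be shown with a small Python sketch. The numbers below are made up for illustration and are not the world95 data. The names x1, x2 and y are hypothetical, and x2 is built to be little more than a noisy copy of x1.

```python
def mean(values):
    return sum(values) / len(values)


def cross_products(a, b):
    """Sum of products of deviations from each variable's mean."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))


def pearson_r(a, b):
    """Ordinary bivariate correlation."""
    return cross_products(a, b) / (cross_products(a, a) * cross_products(b, b)) ** 0.5


def two_predictor_slopes(x1, x2, y):
    """Least-squares slopes for y on x1 and x2 together (intercept implied)."""
    s11, s22, s12 = cross_products(x1, x1), cross_products(x2, x2), cross_products(x1, x2)
    s1y, s2y = cross_products(x1, y), cross_products(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


# Made-up data: y is driven by x1 only; x2 merely tracks x1.
n = 20
wobble = [1.0, -1.0, -1.0, 1.0]   # repeating offset that separates x2 from x1
noise = [0.5, 0.5, -0.5, -0.5]    # repeating error added to the outcome
x1 = [float(i) for i in range(n)]
x2 = [x1[i] + wobble[i % 4] for i in range(n)]
y = [50.0 + 2.0 * x1[i] + noise[i % 4] for i in range(n)]

r_x2_y = pearson_r(x2, y)
b1, b2 = two_predictor_slopes(x1, x2, y)

print(f"x2 alone correlates with y: r = {r_x2_y:.3f}")
print(f"slopes with both in the model: b1 = {b1:.3f}, b2 = {b2:.3f}")
```

On its own x2 looks like a strong predictor, but once x1 is in the model the slope for x2 drops to essentially zero, because x2 carries no information about y that x1 does not already supply.
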
This is the—uh, contribution of each variable, but only in combination with each
other. Uh, and that's one of the reasons why you might wanna look at both the individual
or bivariate correlations which I did above and these ones down here. However, it should
be noted also that this is a much better prediction than any of the ones individually up here.
The highest correlation we have here is .865 and then .862 which are very close down there.
The entire model together has a multiple correlation of .912, so it's not a huge improvement, but
it is still there. And so, it might be better to use the entire model to try to predict,
uh, women's average life expectancy. Now, the fact also that the, uh, literacy rate
and the daily caloric intake are both positively associated and the birth rate is negatively
associated with life expectancy, again, may have more to do with the, uh, economic development
and health care, uh, available in countries, um, more than anything else. But that's, uh,
what we need to know about multiple regression for right now. Thanks.