Uploaded by bartonpoulson on 22.05.2009

Transcript:

Hi, my name's Bart Poulson and in this, uh, tutorial, I'm going to show how to calculate

a multiple regression in SPSS, uh, now known as PASW, uh, for "Predictive Analytic Software."

Uh, I'm using version 17, but the previous versions are essentially identical for what

I'm doing. I'm also doing this on my Mac, but the, uh, Windows version again is essentially

identical. Multiple regression is used to predict the values on a quantitative outcome

variable, uh, using several other predictor variables. They can be quantitative, uh, or

categorical, so on. Um, I'm using a data set that exists in SPSS that's called "world95.sav."

The ".sav" is the suffix for a SPSS data set. You can get it by opening up, uh, the data

sets and going to the end to this one, world dot save 95â€” I have it open already.

And it's data for 109 countries from Afghanistan to Zambia, and there's the US and it has information

about population, life expectancy, literacy, uh, uh, predominant religions and so on. What

I'm going to be doing in this one is I'm gonna be taking a few variables to predict women's

average life expectancy for each one of these countries. Um, now in a previous one, I showed

how to calculate a correlation matrix and that's important because the same information

is used in calculating a multiple regression. In fact, I've kept the results from that one

so let me show them right here. Kay, what I have here are 5 variables. I'm going to

use this one, average female life expectancy, as my outcome. And then I'm gonna use these,

the literacy rate, the GDP, the daily caloric intake and the birth rate as predictor variables

of female life expectancy. This matrix has, uh, the variables listed down the side, has

the same variables across the top; it's, uh, symmetrical on the diagonal. The 1's right

here are each variable correlated with themselves, so that's, uh, gonna be one, and you see,

for instance, that the -862 here is -862 here. Now one important thing to note about this

is all of the correlation coefficients are very strong and they are all statistically

significant. Uh, you see the asterisks right there, that has to do with this oneâ€”the

significance. Theâ€”it says "Sig (2-tailed)"; this is the probability level from a null

hypothesis test and all of them, it's extremely small, it's less than 001. Um, so please remember

they are all statistically significant and all, uh, much less than 001. 'Cuz what I'm

gonna do now is I'm gonna use a multiple regression to look at the association of all four of

these together to predict life expectancy for women. So the way I'm gonna do this is

I come up to "Analyze" and down to "Regression." I'm gonna use "Linear"; I used this, uh, command

for a bivariate or simple regression earlier. Now we're just gonna get a little more sophisticated.

Now I already put the variables in there from an earlier one, but what I did is I came over

here and I just, uh, selected each of the variables by pressing them in here. I have

average female life expectancy, literacy rate, GDP, caloric intake, and birth rate per 1000

people. Now, multiple regression can be a very complicated, a very sophisticated thing.

There's an awful lot of decisions to make that can make a big difference in how things

work. However, for this one, I'm gonna use the simplest possible version where I simply

keep all of the--the defaults the way they are and this is not a bad method for most

regressions. So, I'm just gonna come down here, and I'm gonna hit "OK." Now this one

right here is, uh, what's called a syntax. It's a command, a written command for what

I just did. The nice thing is you can save that in a separate file, you can run it again

later, you can modify it. Uh, I may talk about that later. This says that I did a regression.

This one says which data set I used. See, it says "world95" right here. This says what

variables I used as the predictor. Birth rate, GDP, caloric intake, uh, people who read.

And that this one is the outcome variable, the dependent variable, which is average female

life expectancy for each of the 109 countries. This one right here tells me that they are

very highly correlated, that these variables predict life expectancy very well. This first

one here, the capital R, is called the multiple correlation coefficient 'cuz it's looking

at the association of all of the variables together. You know, the--the maximum value

is 1, positive or negative. This has a .912 which is extremely high. Uh, more frequently

people use the R squared, which is, you know, 912 squared is 832 which means that 83% of

the variants in average life expectancy can be predicted by the combination of these 4

variables. Now this one, "Adjusted R Square" takes into consideration the number of observations

and the number of predictor variables, um, to make sure that things aren't too inflated.

So it's generally smaller. Uh, the "Standard Error of the Estimate" is simply something

that goes into the, uh, hypothâ€”hypothesis test for this one. We're not gonna worry about

that. This next one is also an indication of how well the model fits. Um, all you need

to know is that it fits really well. This significance test right here is much, much

less than 05, uh, which means a 5%, uh, type point error rate, or 5% false positive rate.

This is a very tight model. This is good one. The important ones are done here under where

it says "Coefficients." Now, what we're gonna look at is for instance, this one right here,

the constant, this says when all of the predictor variables are zero, which actually isn't possible,

you would start with an average life expectancy for women of 43.778 years. Um, and that this

number is significantly different from zero and there's--there's no shock right there.

The more interesting ones are these correlations right here. Uh, this says for the percent

of people who read, uh, for every percentage point of people who read, add .226 years to

the average life expectancy for women. This one, GDP, says add 0. Uh, caloric intake is

add 006. Now that means for each additional calorie, uh, which is why this is a very small

number because calories, you know, you have thousands of 'em. Um, and then birth rate

is negative. Now on the other hand, if you come over here to the "Sig" column, that means

the probability level for each one of these. And again, they generally need to be less

than 05 to be considered significant or meaningful or reliable. All of them are less than 05

except for this one right here, the 786 for the GDP. Now, here's the important lesson

about multiple regression. Let's go back up to the correlation matrix I showed at the

beginning. When you look right here, this is the outcome variable, "Average Female Life

Expectancy," every one of these variables including GDP are highly correlated, uh, on

their own with the outcome variable. Please note GDP has a correlation of 642, which is

really big, and has a probability level less than 001. All of them do. But when we come

back down here, GDP is no longer significantly associated. Now, the reason for that is because

multiple regression looks at the combination of these four variables to predict the outcome.

This is theâ€”uh, contribution of each variable, but only in combination with each

other. Uh, and that's one of the reasons why you might wanna look at both the individual

or bivariate correlations which I did above and these ones down here. However, it should

be noted also that this is a much better prediction than any of the ones individually up here.

The highest correlation we have here is 865 and then 862 which are very close down there.

The entire model together has a multiple correlation of .912, so it's not a huge improvement, but

it is still there. And so, it might be better to use the entire model to try to predict,

uh, women's average life expectancy. Now, the fact also that the, uh, literacy rate

and the daily caloric intake are both positively associated and the birth rate is negatively

associated with life expectancy, again, may have more to do with the, uh, economic development

in health care, uh, available in countries, um, more than anything else. But that's, uh,

what we need to know about multiple regression for right now. Thanks.

a multiple regression in SPSS, uh, now known as PASW, uh, for "Predictive Analytic Software."

Uh, I'm using version 17, but the previous versions are essentially identical for what

I'm doing. I'm also doing this on my Mac, but the, uh, Windows version again is essentially

identical. Multiple regression is used to predict the values on a quantitative outcome

variable, uh, using several other predictor variables. They can be quantitative, uh, or

categorical, so on. Um, I'm using a data set that exists in SPSS that's called "world95.sav."

The ".sav" is the suffix for a SPSS data set. You can get it by opening up, uh, the data

sets and going to the end to this one, world dot save 95â€” I have it open already.

And it's data for 109 countries from Afghanistan to Zambia, and there's the US and it has information

about population, life expectancy, literacy, uh, uh, predominant religions and so on. What

I'm going to be doing in this one is I'm gonna be taking a few variables to predict women's

average life expectancy for each one of these countries. Um, now in a previous one, I showed

how to calculate a correlation matrix and that's important because the same information

is used in calculating a multiple regression. In fact, I've kept the results from that one

so let me show them right here. Kay, what I have here are 5 variables. I'm going to

use this one, average female life expectancy, as my outcome. And then I'm gonna use these,

the literacy rate, the GDP, the daily caloric intake and the birth rate as predictor variables

of female life expectancy. This matrix has, uh, the variables listed down the side, has

the same variables across the top; it's, uh, symmetrical on the diagonal. The 1's right

here are each variable correlated with themselves, so that's, uh, gonna be one, and you see,

for instance, that the -862 here is -862 here. Now one important thing to note about this

is all of the correlation coefficients are very strong and they are all statistically

significant. Uh, you see the asterisks right there, that has to do with this oneâ€”the

significance. Theâ€”it says "Sig (2-tailed)"; this is the probability level from a null

hypothesis test and all of them, it's extremely small, it's less than 001. Um, so please remember

they are all statistically significant and all, uh, much less than 001. 'Cuz what I'm

gonna do now is I'm gonna use a multiple regression to look at the association of all four of

these together to predict life expectancy for women. So the way I'm gonna do this is

I come up to "Analyze" and down to "Regression." I'm gonna use "Linear"; I used this, uh, command

for a bivariate or simple regression earlier. Now we're just gonna get a little more sophisticated.

Now I already put the variables in there from an earlier one, but what I did is I came over

here and I just, uh, selected each of the variables by pressing them in here. I have

average female life expectancy, literacy rate, GDP, caloric intake, and birth rate per 1000

people. Now, multiple regression can be a very complicated, a very sophisticated thing.

There's an awful lot of decisions to make that can make a big difference in how things

work. However, for this one, I'm gonna use the simplest possible version where I simply

keep all of the--the defaults the way they are and this is not a bad method for most

regressions. So, I'm just gonna come down here, and I'm gonna hit "OK." Now this one

right here is, uh, what's called a syntax. It's a command, a written command for what

I just did. The nice thing is you can save that in a separate file, you can run it again

later, you can modify it. Uh, I may talk about that later. This says that I did a regression.

This one says which data set I used. See, it says "world95" right here. This says what

variables I used as the predictor. Birth rate, GDP, caloric intake, uh, people who read.

And that this one is the outcome variable, the dependent variable, which is average female

life expectancy for each of the 109 countries. This one right here tells me that they are

very highly correlated, that these variables predict life expectancy very well. This first

one here, the capital R, is called the multiple correlation coefficient 'cuz it's looking

at the association of all of the variables together. You know, the--the maximum value

is 1, positive or negative. This has a .912 which is extremely high. Uh, more frequently

people use the R squared, which is, you know, 912 squared is 832 which means that 83% of

the variants in average life expectancy can be predicted by the combination of these 4

variables. Now this one, "Adjusted R Square" takes into consideration the number of observations

and the number of predictor variables, um, to make sure that things aren't too inflated.

So it's generally smaller. Uh, the "Standard Error of the Estimate" is simply something

that goes into the, uh, hypothâ€”hypothesis test for this one. We're not gonna worry about

that. This next one is also an indication of how well the model fits. Um, all you need

to know is that it fits really well. This significance test right here is much, much

less than 05, uh, which means a 5%, uh, type point error rate, or 5% false positive rate.

This is a very tight model. This is good one. The important ones are done here under where

it says "Coefficients." Now, what we're gonna look at is for instance, this one right here,

the constant, this says when all of the predictor variables are zero, which actually isn't possible,

you would start with an average life expectancy for women of 43.778 years. Um, and that this

number is significantly different from zero and there's--there's no shock right there.

The more interesting ones are these correlations right here. Uh, this says for the percent

of people who read, uh, for every percentage point of people who read, add .226 years to

the average life expectancy for women. This one, GDP, says add 0. Uh, caloric intake is

add 006. Now that means for each additional calorie, uh, which is why this is a very small

number because calories, you know, you have thousands of 'em. Um, and then birth rate

is negative. Now on the other hand, if you come over here to the "Sig" column, that means

the probability level for each one of these. And again, they generally need to be less

than 05 to be considered significant or meaningful or reliable. All of them are less than 05

except for this one right here, the 786 for the GDP. Now, here's the important lesson

about multiple regression. Let's go back up to the correlation matrix I showed at the

beginning. When you look right here, this is the outcome variable, "Average Female Life

Expectancy," every one of these variables including GDP are highly correlated, uh, on

their own with the outcome variable. Please note GDP has a correlation of 642, which is

really big, and has a probability level less than 001. All of them do. But when we come

back down here, GDP is no longer significantly associated. Now, the reason for that is because

multiple regression looks at the combination of these four variables to predict the outcome.

This is theâ€”uh, contribution of each variable, but only in combination with each

other. Uh, and that's one of the reasons why you might wanna look at both the individual

or bivariate correlations which I did above and these ones down here. However, it should

be noted also that this is a much better prediction than any of the ones individually up here.

The highest correlation we have here is 865 and then 862 which are very close down there.

The entire model together has a multiple correlation of .912, so it's not a huge improvement, but

it is still there. And so, it might be better to use the entire model to try to predict,

uh, women's average life expectancy. Now, the fact also that the, uh, literacy rate

and the daily caloric intake are both positively associated and the birth rate is negatively

associated with life expectancy, again, may have more to do with the, uh, economic development

in health care, uh, available in countries, um, more than anything else. But that's, uh,

what we need to know about multiple regression for right now. Thanks.