Chapter 12: Multiple Linear Regression and Certain Nonlinear Regression Models
MINITAB Project
STATISTICS EXPLORATION #3: CORRELATION AND REGRESSION
PURPOSE - to use MINITAB to
BACKGROUND INFORMATION
Some terms and background information that are associated with correlation and regression are explained below.
Scatter Plot of Diameter versus Volume
Scatter Plot Displaying Little or No Correlation
Scatter Plot Displaying a superimposed Linear Regression Line
Scatter Plot Displaying a superimposed Quadratic Regression Line
PROCEDURES
First, load the MINITAB (windows version) software as described in Exploration #0.
NOTE: The procedures presented in these explorations may not be the only way to achieve the end results. Also, whenever graphs are presented, only the MINITAB graphics features will be used.
In this section we will present examples that will enable you to get an understanding of the concept correlation through scatter plots.
Example 1: Consider the following table, which contains measurements on two variables for ten people: the number of hours the person spent riding a bicycle in the past week and the number of months the person has owned the bicycle. Present a scatter plot for this information with the number of hours along the vertical axis and the number of months owned along the horizontal axis.
Person | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Hours Exercised | 5 | 2 | 8 | 3 | 8 | 5 | 5 | 7 | 10 | 3
Months Owned | 5 | 10 | 4 | 8 | 2 | 7 | 9 | 6 | 1 | 12
Enter the number of hours in column C1 and the number of months owned in column C2. Rename column C1 as HOURS and column C2 as MONTHS. Next, we will present a scatter plot with the values in C1 along the vertical axis and the values in C2 along the horizontal axis. To achieve this, select Graph > Plot and the Plot dialog box will be displayed. Enter the appropriate Y and X variables as shown in Figure 3.1. Note that the Display option that was selected is Symbol.
Figure 3.1: Display of the plot selections
Click on the OK button and the plot will be displayed. Figure 3.2 shows the resulting plot. Since this is a plot of the ordered pairs (MONTHS, HOURS), this will represent a scatter plot for the two variables.
Figure 3.2: Display of the Scatter plot of the number of hours vs. the number of months owned
From Figure 3.2, you can see a definite trend. The points appear to form a line that slopes from the upper left to the lower right of the screen. As you move along that (imaginary) line from left to right, the values on the vertical axis (hours riding) get smaller, while the values on the horizontal axis (months owned) get larger. Another way to express this is to say that the two variables are inversely related: the longer the bike was owned, the less the person tends to ride it.
We say that these two variables are correlated. More specifically, they are negatively correlated: as one variable increases, the other tends to decrease.
Example 2: Consider the following table, which contains the noise levels as measured by two different instruments. Present a scatter plot for this information with the variable NOISE1 along the vertical axis and NOISE2 along the horizontal axis.
Noise1 | Noise2
0.97299 | 0.98150
1.93680 | 2.13277
3.04045 | 3.13164
1.71018 | 2.06533
3.92119 | 4.46499
5.92306 | 6.20214
0.78743 | 1.24267
1.98965 | 2.18766
2.92915 | 3.35408
1.49930 | 2.10889
3.61674 | 4.47986
5.68941 | 5.72906
The plot is shown in Figure 3.3. The pattern for this scatter plot suggests a positive correlation.
Figure 3.3: Display of scatter plot with positive correlation
Sometimes there may be little or no relationship between the variables. For example, Figure 3.4 displays a scatter plot of a person's cholesterol after two days on a special diet and after two days on a control diet. Observe that the scatter plot displays no particular pattern. That is, there is little or no correlation between these two variables.
Figure 3.4: Display of scatter plot with little or no correlation
In this section, we will compute the strength of the association between variables. That is, we will compute the correlation coefficient. However, we will first observe the scatter plots before computing the correlation coefficient.
Example 3: The table below shows the average weight, by height, of American men between the ages of 20 and 24.
Height (inches) | Weight (pounds)
62 | 130
64 | 139
66 | 148
68 | 157
70 | 167
72 | 176
Source: Grossman, Stanley. Applied Calculus. 2nd ed. Wm. C. Brown Publishers.
Using the MINITAB procedures presented earlier in the exploration, the scatter plot is constructed and displayed in Figure 3.5.
Figure 3.5: Display of scatter plot with almost a perfect positive correlation
Observe from Figure 3.5 that the points lie almost on a straight line with positive slope. Hence, one would expect a correlation coefficient close to +1.
To compute the correlation between the two variables, select Stat > Basic Statistics > Correlation. The Correlation dialog box will appear. Select the two variables for the Variables box as shown in Figure 3.6.
Figure 3.6: Display of the correlation (coefficient) dialog box
Click on the OK button and the correlation coefficient will be computed and displayed in the Session window. Figure 3.7 shows the output in the Session window.
Figure 3.7: Correlation value for the variables Height and Weight
Observe that the computed correlation coefficient is reported as +1. We observed from the scatter plot that the points are almost, but not exactly, on a straight line with positive slope, so the reported value is a rounded one. For all practical purposes, there is a perfect positive correlation between these two variables; that is, r is approximately +1.
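The correlation coefficient reported by MINITAB can be checked by hand. The following is a minimal Python sketch, not part of the MINITAB procedure; the function name pearson_r is an illustrative choice, and the formula used is the usual sample (Pearson) correlation coefficient.

```python
from math import sqrt


def pearson_r(x, y):
    """Return the sample (Pearson) correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Example 3 data: average weight by height.
height = [62, 64, 66, 68, 70, 72]
weight = [130, 139, 148, 157, 167, 176]
r_hw = pearson_r(height, weight)
print(f"HEIGHT vs. WEIGHT: r = {r_hw:.5f}")

# Example 1 data: months owned (x) and hours ridden (y).
months = [5, 10, 4, 8, 2, 7, 9, 6, 1, 12]
hours = [5, 2, 8, 3, 8, 5, 5, 7, 10, 3]
r_bike = pearson_r(months, hours)
print(f"MONTHS vs. HOURS:  r = {r_bike:.5f}")
```

For the height and weight data the value is slightly below 1 and rounds to 1.000 at three decimal places. For the bicycle data of Example 1 the value is about -0.90, which agrees with the downward trend seen in Figure 3.2.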
Example 4: Determine a linear regression model for the data given in Example 3.
In order to get the equation for the linear regression model, select Stat > Regression > Regression and the Regression dialog box will appear. In the dialog box, the Response variable corresponds to the dependent (y) variable and the Predictors variable corresponds to the independent variables. Here we have only one independent (x) variable, which is HEIGHT, and the response variable is WEIGHT. Complete the dialog box as shown in Figure 3.8.
Figure 3.8: Regression dialog box for the dependent variable Weight and the independent variable Height
Click on the OK button and the analysis for the regression will be displayed in the Session window. Figure 3.9 displays the output. From the output we see that the regression equation (other terms are predictor equation, line of best fit, and least squares regression line) that relates WEIGHT to HEIGHT is given as WEIGHT = -156 + 4.61 × HEIGHT. Other information is also given in the Session window, but we will ignore it for the time being.
Example 5: What is the predicted WEIGHT for a person whose HEIGHT is 69 inches?
We can use the regression equation to predict WEIGHT values for a given HEIGHT value. For instance, in this example, HEIGHT = 69 and so by substituting this value into the regression equation, we have the predicted WEIGHT = -156 + 4.61 × 69 = 162.09 pounds. That is, based on this model, the predicted weight for a person who is 69 inches (5 feet 9 inches) tall is approximately 162 pounds.
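As a check on the arithmetic, the least squares line and the prediction can be reproduced outside MINITAB. The sketch below uses the standard closed-form formulas for the slope and intercept of a simple linear regression; the function name fit_line is an illustrative choice and not a MINITAB command.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Example 3 data.
height = [62, 64, 66, 68, 70, 72]
weight = [130, 139, 148, 157, 167, 176]

intercept, slope = fit_line(height, weight)
print(f"WEIGHT = {intercept:.3f} + {slope:.4f} * HEIGHT")

# Prediction at HEIGHT = 69 using the unrounded coefficients.
predicted = intercept + slope * 69
print(f"Predicted WEIGHT (unrounded coefficients): {predicted:.2f}")

# Prediction using the rounded coefficients shown in the Session window.
rounded_prediction = -156 + 4.61 * 69
print(f"Predicted WEIGHT (rounded coefficients):   {rounded_prediction:.2f}")
```

The two predictions differ by a few hundredths of a pound only because the Session window coefficients are rounded; both round to approximately 162 pounds.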
NOTE: This model will work well for independent values within the observed range of values. The range of values for the HEIGHT variable was from 62 inches to 72 inches. Thus, one should not rely on the model to make accurate predictions outside this range of values for the independent variable HEIGHT.
Figure 3.9: Regression Analysis session window output for Example 4
Example 6: What is the coefficient of determination for the model in Example 4?
Recall that the coefficient of determination, denoted by r^{2} or R^{2}, measures the proportion of the variation in the dependent variable that is explained by the regression line and the independent variable. This value lies between 0 (0%) and 1 (100%); the closer the value is to 100%, the better the model fits the data. From Figure 3.9, R^{2} = 100.0%. Thus, from a practical standpoint, the model has captured essentially all the variation in the dependent variable.
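The coefficient of determination can also be computed directly from its definition, R^{2} = 1 - SSE/SST, where SSE is the sum of squared residuals and SST is the total sum of squares about the mean. The following Python sketch does this for the height and weight data; the variable names are illustrative.

```python
# Example 3 data.
height = [62, 64, 66, 68, 70, 72]
weight = [130, 139, 148, 157, 167, 176]

n = len(height)
mean_x = sum(height) / n
mean_y = sum(weight) / n

# Least squares slope and intercept for WEIGHT on HEIGHT.
sxx = sum((x - mean_x) ** 2 for x in height)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(height, weight))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# SSE: variation left unexplained by the line; SST: total variation in WEIGHT.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(height, weight))
sst = sum((y - mean_y) ** 2 for y in weight)

r_squared = 1 - sse / sst
print(f"R-sq = {100 * r_squared:.1f}%")
print(f"R-sq (more digits) = {100 * r_squared:.3f}%")
```

At one decimal place the value rounds to 100.0%; with more digits it is slightly below 100%, because the points do not lie exactly on the line.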
NOTE: We can use MINITAB to superimpose the regression line onto the scatter plot. To achieve this, select Stat > Regression > Fitted Line Plot and the dialog box will be displayed. Fill out as in Figure 3.10 and select the OK button.
Figure 3.10: Fitted Line Plot dialog box for Example 4
The resulting plot is shown in Figure 3.11. Observe that the regression equation is given on the output as well as the coefficient of determination R-sq.
Figure 3.11: Fitted Line Plot output for Example 4
In this section we will investigate patterns and models that are non-linear in nature.
Example 7: Alcohol absorption and the risk of having an accident have been studied for years. Extensive research has provided the following data relating the risk of having an automobile accident to the blood alcohol level. Use MINITAB to present a scatter plot for the data. We will assume that the independent variable (x) is blood alcohol level and the dependent variable (y) is relative risk of accident.
Blood Alcohol Level (%) | Relative Risk of Accident (%)
0.00 | 1.00
0.05 | 2.90
0.10 | 8.50
0.15 | 24.8
0.20 | 72.2
0.21 | 89.5
First, we need to enter the data values into MINITAB. Follow the procedure in Example 3 to present a scatter plot. The scatter plot is presented in Figure 3.12.
Figure 3.12: Scatter Plot for Example 7
The plot indicates that the pattern is non-linear. The next example will allow us to determine a model for the data.
Example 8: Fit an appropriate model for the data in Example 7. The two other options we have in the Fitted Line Plot are Quadratic and Cubic. See Figure 3.10. Using the procedure for the NOTE in Example 6, select Quadratic for the Fitted Line Plot procedure. The quadratic model superimposed on the scatter plot is shown in Figure 3.13.
Figure 3.13: Quadratic Fitted Line Plot output for Example 8
The equation for the quadratic model is
Relative Risk = 4.34362 - 309.293 Blood Alcohol + 3295.02 Blood Alcohol**2
If we let y = Relative Risk and x = Blood Alcohol, then we can write the equation as
y = 4.34362 - 309.293x + 3295.02x^{2}
Observe that because of the square term in the equation, this will be a quadratic model. The R-Sq = 98.2%. Thus, the model explains 98.2% of the variability of the Relative Risk variable. Since this number is close to 100%, we can assume that the model is quite appropriate to describe the pattern of the scatter plot.
If we use the Cubic option, the fitted line plot as shown in Figure 3.14 will be generated.
Figure 3.14: Cubic Fitted Line Plot output for Example 8
The equation for the cubic model can be written in terms of x and y as
y = 0.688590 + 171.944x - 3119.78x^{2} + 20415.2x^{3}
Observe that because of the cubic term in the equation, this is a cubic model. The R-Sq = 99.9%. Thus, the model explains 99.9% of the variability of the Relative Risk variable. Since this number is closer to 100% than the 98.2% for the quadratic model, the cubic model fits the observed data more closely.
Note: One can use these models to predict the Relative Risk of an Accident for a given Blood Alcohol Level. Again these input values for the Blood Alcohol Levels should be within the range of observed values since the model was derived from this range of values.
NOTE: Here we have two models: the quadratic and the cubic. The cubic model is a better fit for the data, with an R-sq value (99.9%) that is slightly greater than that for the quadratic model (98.2%). From a practical standpoint, such a slight improvement in the R-sq value may not compensate for the increase in the complexity (the addition of the cubic term) of the model. Thus, when modeling data, one should look at all aspects and give a rationale for why the model was chosen.
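The quadratic and cubic fits can be compared outside MINITAB as well. The sketch below is one way to do it in plain Python, assuming ordinary least squares solved through the normal equations; the helper names solve, polyfit, and r_squared are illustrative and are not MINITAB commands. Coefficients are returned lowest power first.

```python
def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - tail) / m[i][i]
    return z


def polyfit(x, y, degree):
    """Least squares polynomial coefficients [b0, b1, ...] from the normal equations."""
    k = degree + 1
    a = [[sum(xi ** (i + j) for xi in x) for j in range(k)] for i in range(k)]
    b = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(k)]
    return solve(a, b)


def r_squared(x, y, coeffs):
    """Coefficient of determination, 1 - SSE/SST, for a fitted polynomial."""
    mean_y = sum(y) / len(y)
    fitted = [sum(c * xi ** p for p, c in enumerate(coeffs)) for xi in x]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst


# Example 7 data: blood alcohol level (x) and relative risk of accident (y).
bac = [0.00, 0.05, 0.10, 0.15, 0.20, 0.21]
risk = [1.00, 2.90, 8.50, 24.8, 72.2, 89.5]

quad = polyfit(bac, risk, 2)
cubic = polyfit(bac, risk, 3)
r2_quad = r_squared(bac, risk, quad)
r2_cubic = r_squared(bac, risk, cubic)

print("Quadratic coefficients:", [round(c, 3) for c in quad])
print(f"Quadratic R-sq = {100 * r2_quad:.1f}%")
print("Cubic coefficients:    ", [round(c, 3) for c in cubic])
print(f"Cubic R-sq     = {100 * r2_cubic:.1f}%")
```

Adding a higher-order term can never lower R-sq on the data used for fitting, so the comparison above always favors the cubic model numerically. This is one reason the choice between the two models should not rest on R-sq alone.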
In this section we will investigate multiple linear regression models.
Example 9: An experiment was conducted to determine if the weight of an animal can be predicted after a given period of time on the basis of the initial weight of the animal and the amount of feed that was eaten. The following data, measured in kilograms, were recorded:
Final weight, y | Initial weight, x_{1} | Feed weight, x_{2}
95 | 42 | 272
77 | 33 | 226
80 | 33 | 259
100 | 45 | 292
97 | 39 | 311
70 | 36 | 183
50 | 32 | 173
80 | 41 | 236
92 | 40 | 230
84 | 38 | 235
Use MINITAB to display scatter plots for the dependent variable versus the two independent variables. We will assume that the independent variables are x_{1} (initial weight) and x_{2} (feed weight) and the dependent variable is y (the final weight). First, we need to enter the data values into MINITAB. Follow the procedure in Example 3 to present a scatter plot for both independent variables. The scatter plots are presented in Figure 3.15 and Figure 3.16.
Figure 3.15: Scatter Plot for Example 9
Figure 3.16: Scatter Plot for Example 9
The plots indicate an approximate linear association between the dependent and the independent variables. The next example will allow us to determine a model for the data.
Example 10: Fit an appropriate model for the data in Example 9. Select Stat > Regression > Regression and fill in the dialog box as shown in Figure 3.17. Observe that in Figure 3.17, we have two Predictors or independent variables.
Figure 3.17: Regression Dialog box for the Multiple Regression model for Example 10
Select the OK button and the resulting Session window display will be shown as in Figure 3.18.
The equation for the multiple linear regression model is
Final weight y = -23.0 + 1.40 Initial weight x1 + 0.218 Feed weight x2
Observe that the equation contains two predictors, each entering as a first-degree term, so this is a multiple linear regression model. The R-Sq = 87.3%. Thus, the model explains 87.3% of the variability of the Final weight variable. Since this number is rather close to 100%, we can assume that the model is quite appropriate to describe the relationship between the variables.
Note: One can use the model to predict the Final weight of an animal for a given initial weight and feed weight. Again these input values for the independent variables should be within the range of observed values since the model was derived from these ranges.
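The coefficients and R-Sq value for the two-predictor model can be checked with a short computation. The following Python sketch assumes ordinary least squares solved through the normal equations (X'X)b = X'y; the helper name solve and the variable names are illustrative and are not MINITAB commands.

```python
def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - tail) / m[i][i]
    return z


# Example 9 data: (final weight y, initial weight x1, feed weight x2), in kilograms.
rows = [
    (95, 42, 272), (77, 33, 226), (80, 33, 259), (100, 45, 292), (97, 39, 311),
    (70, 36, 183), (50, 32, 173), (80, 41, 236), (92, 40, 230), (84, 38, 235),
]
y = [r[0] for r in rows]
design = [[1.0, r[1], r[2]] for r in rows]  # leading 1.0 is the intercept column

# Normal equations: (X'X) b = X'y.
k = 3
xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
b0, b1, b2 = solve(xtx, xty)

# Coefficient of determination, 1 - SSE/SST.
mean_y = sum(y) / len(y)
fitted = [b0 + b1 * row[1] + b2 * row[2] for row in design]
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
sst = sum((yi - mean_y) ** 2 for yi in y)
r2 = 1 - sse / sst

print(f"Final weight = {b0:.3f} + {b1:.4f} x1 + {b2:.4f} x2")
print(f"R-sq = {100 * r2:.1f}%")
```

Rounded to three significant figures, the coefficients agree with the equation reported in the Session window.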
Example 11: Select the Graphs option in the Regression dialog box as shown in Figure 3.17 to display the residual plots and the normality plot for the data given in Example 9. Click on the Graphs option and in the resulting dialog box, select the options as shown in Figure 3.18.
Figure 3.18: Regression-Graphs Dialog box for the Multiple Regression model for Example 9
Two of the resulting graphs are shown in Figure 3.19 and 3.20.
Figure 3.19 shows the normality plot for the residuals for the model. Observe that the plot displays a linear pattern, which indicates that the normality assumption for the regression model has not been violated.
Figure 3.20 shows the plot for the residuals versus time order for the model. This plot is usually used to help determine visually whether the independence assumption for the model has been violated. Observe that the plot displays no apparent pattern, which indicates that the independence assumption for the regression model has not been violated.
Figure 3.19: Normality Plot for the Residuals for the Multiple Regression model in Example 9
Note: Refer to your text for detailed discussions on the assumptions for a regression model.
Figure 3.20: Residual Plot versus Order of Observation for the Residuals for the Multiple Regression model in Example 9
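The quantities behind the two residual plots can be listed numerically. The sketch below computes the residuals in observation order from the rounded regression equation reported earlier, and pairs the sorted residuals with approximate normal scores. The plotting positions (i - 0.375)/(n + 0.25) are one common convention and are an assumption here; MINITAB's own choice may differ.

```python
from statistics import NormalDist

# Example 9 data: (final weight y, initial weight x1, feed weight x2), in kilograms.
rows = [
    (95, 42, 272), (77, 33, 226), (80, 33, 259), (100, 45, 292), (97, 39, 311),
    (70, 36, 183), (50, 32, 173), (80, 41, 236), (92, 40, 230), (84, 38, 235),
]

# Rounded coefficients from the reported regression equation.
B0, B1, B2 = -23.0, 1.40, 0.218

# Residual = observed final weight minus fitted final weight, in observation order.
residuals = [y - (B0 + B1 * x1 + B2 * x2) for y, x1, x2 in rows]

print("Residuals versus order of observation:")
for order, e in enumerate(residuals, start=1):
    print(f"  {order:2d}  {e:8.3f}")

# Normal scores: standard normal quantiles at the assumed plotting positions.
n = len(residuals)
normal_scores = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
sorted_residuals = sorted(residuals)

print("Sorted residuals versus normal scores:")
for z, e in zip(normal_scores, sorted_residuals):
    print(f"  {z:7.3f}  {e:8.3f}")
```

If the second listing is plotted and the points fall close to a straight line, that supports the normality assumption. If the first listing shows no systematic run of positive or negative values, that supports the independence assumption. Because the coefficients are rounded, the residuals here do not sum exactly to zero.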
NOTES
EXPLORATION #3: HOMEWORK ASSIGNMENT
Name: _____________________ Date: ______________________
Course #: ___________________ Instructor: _________________
Chirps / second | Temperature in degrees F
20.0 | 88.6
16.0 | 71.6
19.8 | 93.3
18.4 | 84.3
17.1 | 80.6
15.5 | 75.2
14.7 | 69.7
17.1 | 82.0
15.4 | 69.4
16.2 | 83.3
15.0 | 79.6
17.2 | 82.6
16.0 | 80.6
17.0 | 83.5
14.4 | 76.3
Source: http://ericir.syr.edu/Virtual/Lessons/Mathematics/Statistics/STA0002.html
Note: You may want to review for directions on how to produce a scatter plot.
Year (t) | Population (in thousands)
1960 (0) | 567
1961 (1) | 592
1962 (2) | 608
1963 (3) | 627
1964 (4) | 653
1965 (5) | 684
1966 (6) | 718
1967 (7) | 760
1968 (8) | 800
1969 (9) | 848
1970 (10) | 969
1971 (11) | 977
1972 (12) | 1020
1973 (13) | 1069
1974 (14) | 1141
1975 (15) | 1227
1976 (16) | 1290
1977 (17) | 1365
1978 (18) | 1443
1979 (19) | 1521
1980 (20) | 1559
1981 (21) | 1649
1982 (22) | 1720
1983 (23) | 1786
1984 (24) | 1848
1985 (25) | 1906
1986 (26) | 1963
1987 (27) | 2025
1988 (28) | 2075
1989 (29) | 2137
1990 (30) | 2180
1991 (31) | 2279
1992 (32) | 2347
1993 (33) | 2467
1994 (34) | 2542
1995 (35) | 2661
1996 (36) | 2692
Seconds since rocket launched | Height of rocket in feet
1 | 230
2 | 310
3 | 350
4 | 360
5 | 350
6 | 300
7 | 220
Note: To answer this question you may want to check several different model possibilities and choose the one with the best coefficient of determination. Discuss your reasoning.
Length (in inches) | Weight (in pounds)
94 | 130
74 | 51
147 | 640
58 | 28
86 | 80
94 | 110
63 | 33
86 | 90
69 | 36
72 | 38
128 | 366
85 | 84
82 | 80
86 | 83
88 | 70
72 | 61
74 | 54
61 | 44
90 | 106
89 | 84
68 | 39
76 | 42
114 | 197
90 | 102
78 | 57
Year (t) | Percent of Total Sold
1960 (0) | 96
1965 (5) | 84
1970 (10) | 65
1975 (15) | 57
1980 (20) | 34
1985 (25) | 23
1990 (30) | 7
Source: Prentice Hall, Algebra (1998)
Year (t) | Population (in millions) 15-19 year olds
1960 (0) | 6586
1961 (1) | 6794
1962 (2) | 7376
1963 (3) | 7647
1964 (4) | 8008
1965 (5) | 8386
1966 (6) | 8842
1967 (7) | 8836
1968 (8) | 9013
1969 (9) | 9234
1970 (10) | 9437
1971 (11) | 9740
1972 (12) | 9988
1973 (13) | 10193
1974 (14) | 10349
1975 (15) | 10465
1976 (16) | 10582
1977 (17) | 10581
1978 (18) | 10555
1979 (19) | 10498
1980 (20) | 10413
1981 (21) | 10096
1982 (22) | 9809
1983 (23) | 9515
1984 (24) | 9287
1985 (25) | 9174
1986 (26) | 9206
1987 (27) | 9139
1988 (28) | 9029
1989 (29) | 8840
1990 (30) | 8709
1991 (31) | 8371
1992 (32) | 8324
1993 (33) | 8410
1994 (34) | 8580
1995 (35) | 8779
1996 (36) | 9043
Month | 1988 | 1989 | 1990
Jan | 1.56 | 5.76 | 5.28
Feb | 4.68 | 5.28 | 7.20
Mar | 7.20 | 5.88 | 9.72
Apr | 11.40 | 11.50 | 12.50
May | 17.30 | 16.70 | 18.40
Jun | 21.80 | 22.50 | 21.20
Jul | 24.70 | 26.00 | 25.10
Aug | 22.80 | 25.10 | 26.30
Sep | 21.60 | 22.70 | 24.00
Oct | 18.00 | 18.20 | 21.10
Nov | 12.80 | 13.60 | 13.00
Dec | 9.72 | 7.70 | 8.64
Plot for 1988
Plot for 1989
Plot for 1990
Equation for 1988: ___________________________________________________
Equation for 1989: ___________________________________________________
Equation for 1990: ___________________________________________________
Year (t) | Budget for Support Technology (in millions)
1985 (1) | 748
1986 (2) | 1606
1987 (3) | 2025
1988 (4) | 2005
1989 (5) | 1865
1990 (6) | 1857
1991 (7) | 1431
1992 (8) | 1194
1993 (9) | 718
1994 (10) | 529
1995 (11) | 382
1996 (12) | 381
1997 (13) | 393
1998 (14) | 408
1999 (15) | 637
Source: BMDOFACTSHEETPO-99-02
:_______________________%
_____________________________________________________________
Estimated budget ($ millions) :___________________________________
Y | X_{1} | X_{2} | X_{3} | X_{4}
410 | 69 | 125 | 59.00 | 55.66
569 | 57 | 131 | 31.75 | 63.97
425 | 77 | 141 | 80.50 | 45.32
344 | 81 | 122 | 75.00 | 46.67
324 | 0 | 141 | 49.00 | 41.21
505 | 53 | 152 | 49.35 | 43.83
235 | 77 | 141 | 60.75 | 41.61
501 | 76 | 132 | 41.25 | 64.57
400 | 65 | 157 | 50.75 | 42.41
584 | 97 | 166 | 32.25 | 57.95
434 | 76 | 141 | 54.50 | 57.90
© 1995-2002 by Prentice-Hall, Inc. A Pearson Company