## 5 Steps to a 5 AP Statistics 2017 (2016)

### STEP __4__

### Review the Knowledge You Need to Score High

### CHAPTER 7

### Two-Variable Data Analysis

**IN THIS CHAPTER**

**Summary:** In the previous chapter we used *exploratory data analysis* to help us understand what a one-variable dataset was saying to us. In this chapter we extend those ideas to consider the relationships between two variables that might, or might not, be related. In this chapter, and in this course, we are primarily concerned with variables that have a *linear* relationship and, for the most part, leave other types of relationships to more advanced courses. We will spend some time considering nonlinear relationships that, through some sort of transformation, can be analyzed as though the relationship was linear. Finally, we'll consider a statistic that tells the proportion of variability in one variable that can be attributed to the linear relationship with another variable.

**Key Ideas**

Scatterplots

Lines of Best Fit

The Correlation Coefficient

Least Squares Regression Line

Coefficient of Determination

Residuals

Outliers and Influential Points

Transformations to Achieve Linearity

**Scatterplots**

In the previous chapter, we looked at several different ways to graph one-variable data. By choosing from dotplots, stemplots, histograms, or boxplots, we were able to visually examine patterns in the data. In this chapter, we consider techniques of data analysis for two-variable **(bivariate)** data. Specifically, our interest is whether or not two variables have a linear relationship and how changes in one variable can predict changes in the other variable.

**example:** For an AP Statistics class project, a statistics teacher had her students keep diaries of how many hours they studied before their midterm exam. The following are the data for 15 of the students.

The teacher wanted to know if additional studying is associated with higher grades and drew the following graph, called a **scatterplot.** It seemed pretty obvious to the teacher that students who studied more tended to have higher grades on the exam and, in fact, the pattern appears to be roughly linear.

In the previous example, we were interested in seeing whether studying is associated with test performance. To do this we drew a **scatterplot** , which is just a two-dimensional graph of ordered pairs. We put one variable on the horizontal axis and the other on the vertical axis. In the example, the horizontal axis is for “hours studied” and the vertical axis is for “score on test.” Each point on the graph represents the ordered pair for one student. If we have an **explanatory variable** , it should be on the horizontal axis, and the **response variable** should be on the vertical axis.

In the previous example, we observed a situation in which students with higher values on the vertical axis tend to have higher values on the horizontal axis. We say that two variables are **positively associated** if one of them increases as the other increases and **negatively associated** if one of them decreases as the other increases.

**Calculator Tip:** In order to draw a scatterplot on your calculator, first enter the data in two lists, say the horizontal-axis variable in L1 and the vertical-axis variable in L2 . Then go to STAT PLOT and choose the scatterplot icon from Type . Enter L1 for Xlist and L2 for Ylist . Choose whichever Mark pleases you. Be sure there are no equations active in the Y = list. Then do ZOOM ZoomStat (Zoom-9) and the calculator will draw the scatterplot for you. The calculator seems to do a much better job with scatterplots than it does with histograms but, if you wish, you can still go back and adjust the WINDOW in any way you want.

The scatterplot of the data in the example, drawn on the calculator, looks like this (the window used was [0, 6.5, 1, 40, 105, 5, 1]):

**example:** Which of the following statements best describes the scatterplot pictured?

I. A line might fit the data well.

II. The variables are positively associated.

III. The variables are negatively associated.

a. I only
b. II only
c. III only
d. I and II only
e. I and III only

**solution:** (e) is correct. The data look as though a line might be a good model, and the *y* -variable decreases as the *x* -variable increases, so the variables are negatively associated.

**Correlation**

We have seen how to graph two-variable data in a scatterplot. Following the pattern we set in the previous chapter, we now want to do some numerical analysis of the data in an attempt to understand the relationship between the variables better.

In AP Statistics, we are primarily interested in determining the extent to which two variables are *linearly* associated. Two variables are *linearly related* to the extent that their relationship can be modeled by a line. Sometimes, and we will look at this more closely later in this chapter, variables may not be linearly related but can be transformed in such a way that the transformed data are linear. Sometimes the data are related but not linearly (e.g., the height of a thrown ball, *t* seconds after it is thrown, is a quadratic function of the number of seconds elapsed since it was released).

The first statistic we use to quantify a linear relationship is the Pearson product moment correlation, or more simply, the **correlation coefficient** , denoted by the letter *r* . The correlation coefficient is a measure of the *strength* of the linear relationship between two variables as well as an indicator of the *direction* of the linear relationship (whether the variables are positively or negatively associated).

If we have a sample of size *n* of paired data, say (*x, y* ), and assuming that we have computed summary statistics for *x* and *y* (means and standard deviations), the **correlation coefficient r **is defined as follows:

*r* = [1/(*n* – 1)] Σ[(*x* – x̄)/*s*_{x}][(*y* – ȳ)/*s*_{y}]

Because the terms after the summation symbol are nothing more than the *z* -scores of the individual *x* and *y* values, an easy way to remember this definition is:

*r* = [1/(*n* – 1)] Σ *z*_{x}*z*_{y}

**example:** Earlier in the section, we saw some data for hours studied and the corresponding scores on an exam. It can be shown that, for these data, *r* = 0.864 and the scatterplot appears roughly linear. Together, this indicates a strong positive linear relationship between hours studied and exam score. That is, the more hours studied, the higher the exam score.
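The z-score definition of *r* can be checked numerically. The sketch below uses made-up paired data (not the study data from the chapter) and computes *r* exactly as the definition reads, averaging the products of the *z* -scores with a divisor of *n* – 1:

```python
from statistics import mean, stdev  # stdev is the sample standard deviation

# Hypothetical paired data, for illustration only
x = [1, 2, 3, 4, 5, 6]
y = [52, 58, 57, 66, 70, 75]

n = len(x)
xbar, ybar = mean(x), mean(y)
sx, sy = stdev(x), stdev(y)

# Definition from the text: r = [1/(n-1)] * sum of z_x * z_y
r = sum(((xi - xbar) / sx) * ((yi - ybar) / sy)
        for xi, yi in zip(x, y)) / (n - 1)

print(round(r, 3))
```

For these made-up points, *r* comes out close to 1, which matches the strongly linear, increasing pattern they were constructed to have.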

The correlation coefficient *r* has a number of properties you should be familiar with:

- –1 ≤ *r* ≤ 1. If *r* = –1 or *r* = 1, the points all lie on a line.
- Although there are no hard-and-fast rules about how strong a correlation is based on its numerical value, the closer |*r* | is to 1, the stronger the linear association.
- If *r* > 0, it indicates that the variables are positively associated. If *r* < 0, it indicates that the variables are negatively associated.
- If *r* = 0, it indicates that there is no linear association that would allow us to predict *y* from *x* . It *does not* mean that there is no relationship, just not a linear one.
- It does not matter which variable you call *x* and which variable you call *y* ; *r* will be the same. In other words, *r* depends only on the paired points, not the *ordered* pairs.
- *r* does not depend on the units of measurement. In the previous example, convert “hours studied” to “minutes studied” and *r* would still equal 0.864.
- *r* is not resistant to extreme values because it is based on the mean. A single extreme value can have a powerful impact on *r* and may cause us to overinterpret the relationship. You must look at the scatterplot of the data as well as *r* .

**example:** To illustrate that *r* is not resistant, consider the following two graphs. The graph on the left, with 12 points, has a marked negative linear association between *x* and *y* . The graph on the right has the same basic visual pattern but, as you can see, the addition of the one outlier has a dramatic effect on *r* —making what is generally a negative association between two variables appear to have a moderate, positive association.
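The effect described above is easy to reproduce. In this sketch (the points are made up to mimic the two graphs, not taken from them), twelve points with a clear negative association give *r* near –1; adding a single extreme point far to the upper right flips *r* positive:

```python
from statistics import mean, stdev

def corr(x, y):
    """Sample correlation coefficient via the covariance form."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    return sum((xi - xbar) * (yi - ybar)
               for xi, yi in zip(x, y)) / ((n - 1) * sx * sy)

# 12 made-up points with a marked negative association
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
y = [24, 23, 21, 20, 18, 17, 15, 14, 12, 11, 9, 8]

r_before = corr(x, y)

# One extreme point far out in the upper right, as in the right-hand graph
r_after = corr(x + [40], y + [60])

print(round(r_before, 2), round(r_after, 2))
```

A single point out of thirteen is enough to reverse the sign of *r* , which is why the scatterplot must always be examined alongside the statistic.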

**example:** The following computer output, again for the hours studied versus exam score data, indicates R-sq, which is the square of *r* . Accordingly, *r* = √(R-sq) = 0.864. There is a lot of other stuff in the box that doesn't concern us just yet. We will learn about other important parts of the output as we proceed through the rest of this book. Note that we cannot determine the sign of *r* from R-sq alone. We need additional information.

(“R-sq” is called the “coefficient of determination” and has a lot of meaning in its own right in regression. It is difficult to show that R-sq is actually the square of *r* . We will consider the coefficient of determination later in this chapter.)

**Calculator Tip:** In order to find *r* on your calculator, you will first need to change a setting from the factory default. Enter CATALOG and scroll down to DiagnosticOn . Press ENTER twice. Now you are ready to find *r* .

Assuming you have entered the *x* - and *y* -values in L1 and L2 , enter STAT CALC LinReg(a+bx) [that's STAT CALC 8 on the TI-83/84] and press ENTER . Then enter L1, L2 and press ENTER . You should get a screen that looks like this (using the data from the Study Time vs. Score on Test study):

(Note that reversing L1 and L2 in this operation—entering STAT CALC LinReg(a+bx) L2, L1 —will change *a* and *b* but will not change *r* since order doesn't matter in correlation.) If you compare this with the computer output above, you will see that it contains some of the same data, including both *r* and *r* ^{2} . At the moment, all you care about in this printout is the value of *r* .

**Correlation and Causation**

Two variables, *x* and *y* , may have a strong correlation, but you need to take care not to interpret that as causation. That is, just because two things seem to go together does not mean that one caused the other—some third variable may be influencing them both. Seeing a fire truck at almost every fire doesn't mean that fire trucks cause fires.

**example:** Consider the following dataset that shows the increase in the number of Methodist ministers and the increase in the amount of imported Cuban rum from 1860 to 1940.

For these data, it turns out that *r* = 0.999986.

Is the increase in number of ministers responsible for the increase in imported rum? Some cynics might want to believe so, but the real reason is that the population was increasing from 1860 to 1940, so the area needed more ministers, and more people drank more rum.

In this example, there was a *lurking variable* , increasing population—one we didn't consider when we did the correlation—that caused both of these variables to change the way they did. We will look more at lurking variables in the next chapter, but in the meantime remember, always remember, that *correlation does not imply causation* .

**Lines of Best Fit**

When we discussed correlation, we learned that it didn't matter which variable we called *x* and which variable we called *y* —the correlation *r* is the same. That is, there is no explanatory and response variable, just two variables that may or may not vary linearly. In this section we will be more interested in predicting, once we've determined the strength of the linear relationship between the two variables, the value of one variable (the response) based on the value of the other variable (the explanatory). In this situation, called linear regression, it matters greatly which variable we call *x* and which one we call *y* .

**Least-Squares Regression Line**

Recall again the data from the study that looked at hours studied versus score on test:

The scatterplot (see __page 94__ ) leads us to believe that the form of this relationship is linear. This, together with the fact that *r* = 0.864 for these data, leads us to say that we have a strong, positive, linear association between the variables. Suppose we wanted to predict the score of a person who studied for 2.75 hours. If we knew we were working with a linear model—a line that seemed to fit the data well—we would feel confident about using the equation of the line to make such a prediction. We are looking for a **line of best fit** . We want to find a **regression line** —a line that can be used for predicting response values from explanatory values. In this situation, we would use the regression line to predict the exam score for a person who studied 2.75 hours.

The line we are looking for is called the **least-squares regression line** . We could draw a variety of lines on our scatterplot trying to determine which has the best fit. Let *ŷ* be the predicted value of *y* for a given value of *x* . Then *y* – *ŷ* represents the error in prediction. We want our line to minimize errors in prediction, so we might first think that Σ(*y* – *ŷ* ) would be a good measure (*y* – *ŷ* is the *actual value* minus the *predicted value* ). However, because our line is going to average out the errors in some fashion, we find that Σ(*y* – *ŷ* ) = 0. To get around this problem, we use Σ(*y* – *ŷ* )^{2} . This expression will vary with different lines and is sensitive to the fit of the line. That is, Σ(*y* – *ŷ* )^{2} is small when the linear fit is good and large when it is not.

The **least-squares regression line** (LSRL) is the line that minimizes the sum of squared errors. If *ŷ* = *a* + *bx* is the LSRL, then *ŷ* minimizes Σ(*y* – *ŷ* )^{2} .
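The minimizing property can be verified numerically. This sketch uses made-up data (not from the chapter), computes the LSRL from the usual closed-form slope, and checks that nearby lines all have a larger Σ(*y* – *ŷ* )^{2} :

```python
from statistics import mean

# Made-up data for illustration (not from the chapter)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

xbar, ybar = mean(x), mean(y)

# Closed-form least-squares slope and intercept
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar

def sse(a0, b0):
    """Sum of squared prediction errors for the line yhat = a0 + b0*x."""
    return sum((yi - (a0 + b0 * xi)) ** 2 for xi, yi in zip(x, y))

# The LSRL's SSE is no larger than that of nearby lines
print(sse(a, b) <= sse(a + 0.1, b), sse(a, b) <= sse(a, b + 0.1))
```

Note also that the LSRL's residuals sum to zero, which is exactly the Σ(*y* – *ŷ* ) = 0 fact mentioned above.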

Digression for calculus students only: It should be clear that trying to find *a* and *b* for the line *ŷ* = *a* + *bx* that minimizes Σ(*y* – *ŷ* )^{2} is a typical calculus problem. The difference is that, since the sum of squared errors is a function of the two variables *a* and *b* , it requires multivariable calculus to derive it. That is, you need to be beyond first-year calculus to derive the results that follow.

For *n* ordered pairs (*x, y* ), we calculate: x̄, ȳ, *s*_{x} , *s*_{y} , and *r* . Then we have:

*b* = *r* (*s*_{y} /*s*_{x} ) and *a* = ȳ – *b* x̄

**example:** For the hours studied (*x* ) versus score (*y* ) study, the LSRL is *ŷ* = 59.03 + 6.77*x* . We asked earlier what score we would predict for someone who studied 2.75 hours. Plugging this value into the LSRL, we have *ŷ* = 59.03 + 6.77(2.75) = 77.63. It's important to understand that this is the __predicted__ value, not the exact value such a person will necessarily get.
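The slope and intercept formulas can be applied directly from summary statistics. In the sketch below, the means and standard deviations are hypothetical stand-ins (the chapter gives *r* = 0.864 but not the study's summary statistics), so the resulting line is for illustration of the mechanics only:

```python
# r is taken from the chapter; the summary statistics below are
# hypothetical stand-ins, NOT the study's actual values
r = 0.864
x_bar, s_x = 3.0, 1.5     # hypothetical mean/SD of hours studied
y_bar, s_y = 79.3, 11.8   # hypothetical mean/SD of exam scores

b = r * s_y / s_x         # slope: b = r * (s_y / s_x)
a = y_bar - b * x_bar     # intercept: a = y_bar - b * x_bar

y_hat = a + b * 2.75      # predicted score after 2.75 hours of study
print(round(b, 2), round(a, 2), round(y_hat, 2))
```

One consequence of the intercept formula is that the LSRL always passes through the point (x̄, ȳ).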

**example:** Consider once again the computer printout for the data of the preceding example:

**Exam Tip:** Questions asking you to determine the regression equation from a printout have been common on the AP Exam. Be sure you know where the intercept and slope of the regression line are located in the printout (they are under “Coef”).

The regression equation is given as “Score = 59 + 6.77 Hours.” The *y* -intercept, which is the predicted score when the number of hours studied is zero, and the slope of the regression line are listed in the table under the column “Coef.”

**example:** We saw earlier that the calculator output for these data was

The values of *a* and *b* are given as part of the output. Remember that these values were obtained by putting the “Hours Studied” data in L1 , the “Test Score” data in L2 , and doing LinReg(ax+b)L1,L2 . When using LinReg(ax+b) , the explanatory variable *must* come first and the response variable second.

**example:** An experiment is conducted on the effects of having convicted criminals provide restitution to their victims rather than serving time. The following table gives the data for 10 criminals. The monthly salaries (*X* ) and monthly restitution payments (*Y* ) were as follows:

(a) Find the correlation between *X* and *Y* and the regression equation that can be used to predict monthly restitution payments from monthly salaries.

(b) Draw a scatterplot of the data and put the LSRL on the graph.

(c) Interpret the slope of the regression line in the context of the problem.

(d) How much would a criminal earning $1400 per month be expected to pay in restitution?

**solution:** Put the monthly salaries (*x* ) in L1 and the monthly restitution payments (*y* ) in L2 . Then enter STAT CALC LinReg(a+bx)L1,L2,Y1 .

(a) *r* = 0.97; the regression equation is predicted *Payment* = –56.22 + 0.46(*Salary* ). (If you answered *ŷ* = –56.22 + 0.46*x* , you must define *x* and *y* so that the regression equation can be understood in the context of the problem. An algebra equation, without a contextual definition of the variables, will not receive full credit.)

(b)

(c) The slope of the regression line is 0.46. This tells us that, for each $1 increase in the criminal's salary, the amount of restitution is predicted to increase by $0.46. Or you could say that the average increase is $0.46.

(d) *Payment* = –56.22 + 0.46 (1400) = $587.78.

**Calculator Tip:** The fastest, and most accurate, way to perform the computation above, assuming you have stored the LSRL in Y1 (or some “Y=” location), is to do Y1(1400) on the home screen. To paste Y1 to the home screen, remember that you enter VARS Y-VARS Function Y1 . If you do this, you will get an answer of $594.64; the difference from part (d) is due to rounding, since the TI-83/84 carries the coefficients to many more decimal places.

**Residuals**

When we developed the LSRL, we referred to *y – ŷ* (the *actual value* – the *predicted value* ) as an error in prediction. The formal name for *y – ŷ* is the **residual** . Note that the order is always “actual” – “predicted” so that a positive residual means that the prediction was too small and a negative residual means that the prediction was too large.

**example:** In the previous example, a criminal earning $1560/month paid restitution of $800/month. The predicted restitution for this amount would be *ŷ* = –56.22 + 0.46(1560) = $661.38. Thus, the residual for this case is $800 – $661.38 = $138.62.
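The arithmetic of this residual takes only a couple of lines of Python, using the (rounded) LSRL coefficients from the restitution example above:

```python
# LSRL coefficients from the restitution example (rounded values)
a, b = -56.22, 0.46
salary, actual = 1560, 800

predicted = a + b * salary          # yhat = -56.22 + 0.46(1560)
residual = actual - predicted       # residual = actual - predicted
print(round(predicted, 2), round(residual, 2))
```

The positive residual tells us the line's prediction ($661.38) was too small for this criminal, consistent with the sign convention in the text.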

**Calculator Tip:** The TI-83/84 will generate a complete set of residuals when you perform a LinReg . They are stored in a list called RESID which can be found in the LIST menu. RESID stores only the current set of residuals. That is, a new set of residuals is stored in RESID each time you perform a new regression.

Residuals can be useful to us in determining the extent to which a linear model is appropriate for a dataset. If a line is an appropriate model, we would expect to find the residuals more or less randomly scattered about the average residual (which is, of course, 0). In fact, we expect to find them approximately normally distributed about 0. A pattern of residuals that does not appear to be more or less randomly distributed about 0 (that is, there is a systematic nature to the graph of the residuals) is evidence that a line is not a good model for the data. If the residuals are small, the line may predict well even though it isn't a good theoretical model for the data. The usual method of determining if a line is a good model is to examine a plot of the residuals against the explanatory variable.

**Calculator Tip:** In order to draw a residual plot on the TI-83/84, and assuming that your *x* -data are in L1 and your *y* -data are in L2 , first do LinReg(a+bx) L1,L2 . Next, you create a STAT PLOT scatterplot, where Xlist is set to L1 and Ylist is set to RESID . RESID can be retrieved from the LIST menu (remember that only the residuals for the most recent regression are stored in RESID ). ZOOM ZoomStat will then draw the residual plot for the current list of residuals. It's a good idea to turn off any equations you may have in the Y = list before doing a residual plot or you may get an unwanted line on your plot.

**example:** The data given below show the height (in cm) at various ages (in months) for a group of children.

(a) Does a line seem to be a good model for the data? Explain.

(b) What is the value of the residual for a child of 19 months?

**solution:**

(a) Using the calculator (LinReg(a+bx) L1, L2, Y1) , we find predicted *height* = 64.94 + 0.634(*age* ), *r* = 0.993. The large value of *r* tells us that the points are close to a line. The scatterplot and LSRL are shown below on the graph at the left.

From the graph on the left, a line appears to be a good fit for the data (the points lie close to the line). The residual plot on the right shows no readily obvious pattern, so we have good evidence that a line is a good model for the data and we can feel good about using the LSRL to predict height from age.

(b) The residual (actual minus predicted) for *age* = 19 months is 77.1 – (64.94 + 0.634 **·** 19) = 0.114. Note that 77.1–Y1(19) = 0.112 .

(Note that you can generate a complete set of residuals, which will match what is stored in RESID , in a list. Assuming your data are in L1 and L2 and that you have found the LSRL and stored it in Y1 , let L3 = L2–Y1(L1) . The residuals for each value will then appear in L3 . You might want to let L4 = RESID (by pasting RESID from the LIST menu) and observe that L3 and L4 are the same.)

**Digression:** Whenever we have graphed a residual plot in this section, the vertical axis has been the residuals and the horizontal axis has been the *x* -variable. On some computer printouts, you may see the horizontal axis labeled “Fits” (as in the graph below) or “Predicted Value.”

What you are interested in is the visual image given by the residual plot, and it doesn't matter if the residuals are plotted against the *x* -variable or something else, like “FITS2”—the scatter of the points above and below 0 stays the same. All that changes are the horizontal distances between points. This is the way it must be done in multiple regression, since there is more than one independent variable and, as you can see, it can be done in simple linear regression.

If we are trying to predict a value of *y* from a value of *x* , it is called **interpolation** if we are predicting from an *x* -value within the range of *x* -values. It is called **extrapolation** if we are predicting from a value of *x* outside the range of *x* -values.

**example:** Using the age/height data from the previous example, we are __interpolating__ if we attempt to predict height from an age between 18 and 29 months. It is interpolation if we try to predict the height of a 20.5-month-old baby. We are __extrapolating__ if we try to predict the height of a child less than 18 months old or more than 29 months old.

If a line has been shown to be a good model for the data and if the data fit the line well (i.e., we have a strong *r* and a more or less random distribution of residuals), we can have confidence in interpolated predictions. We can rarely have confidence in extrapolated values. In the example above, we might be willing to go slightly beyond the ages given because of the high correlation and the good linear model, but it's good practice not to extrapolate beyond the data given. If we were to extrapolate the data in the example to a child of 12 years of age (144 months), we would predict a height of 64.94 + 0.634(144) ≈ 156.2 cm. Even when such a number happens to look plausible, we have no evidence that growth remains linear far outside the observed ages, so the prediction cannot be trusted.

**Coefficient of Determination**

In the absence of a better way to predict *y* -values from *x* -values, our best guess for any given *x* might well be ȳ, the mean value of *y* .

**example:** Suppose you had access to the heights and weights of each of the students in your statistics class. You compute the average weight of all the students. You write the heights of each student on a slip of paper, put the slips in a hat, and then draw out one slip. You are asked to predict the weight of the student whose height is on the slip of paper you have drawn. What is your best guess as to the weight of the student?

**solution:** In the absence of any known relationship between height and weight, your best guess would have to be the average weight of all the students. You know the weights vary about the average and that is about the best you could do.

If we guessed at the weight of each student using the average, we would be wrong most of the time. If we took each of those errors and squared them, we would have what is called the *sum of squares total* (SST). It's the total squared error of our guesses when our best guess is simply the mean of the weights of all students, and represents the total variability of *y* .

Now suppose we have a least-squares regression line that we want to use as a model for predicting weight from height. It is, of course, the LSRL we discussed in detail earlier in this chapter, and our hope is that there will be less error in prediction than by using ȳ. Now, we still have errors from the regression line (called *residuals* , remember?). We call the sum of the squares of *those* errors the **sum of squared errors** (SSE). So, SST represents the total error from using ȳ as the basis for predicting weight from height, and SSE represents the total error from using the LSRL. SST – SSE represents the benefit of using the regression line rather than ȳ for prediction. That is, by using the LSRL rather than ȳ, we have *explained* a certain proportion of the total variability by regression.

The proportion of the total variability in *y* that is explained by the regression of *y* on *x* is called the **coefficient of determination** . The *coefficient of determination* is symbolized by *r* ^{2} . Based on the above discussion, we note that

*r* ^{2} = (SST – SSE)/SST

It can be shown algebraically, although it isn't easy to do so, that this *r* ^{2} is actually the square of the familiar *r* , the correlation coefficient. Many computer programs will report the value of *r* ^{2} only (usually as “R-sq”), which means that we must take the square root of *r* ^{2} if we only want to know *r* (remember that *r* and *b* , the slope of the regression line, are either both positive or both negative, so you can check the sign of *b* to determine the sign of *r* if all you are given is *r* ^{2} ). The TI-83/84 calculator will report both *r* and *r* ^{2} , as well as the regression coefficients, when you do LinReg(a+bx) .
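Although the algebraic proof is messy, the identity is easy to confirm numerically. This sketch uses made-up data, computes SST and SSE directly, and checks that (SST – SSE)/SST matches the square of the correlation coefficient:

```python
from statistics import mean

# Made-up data to illustrate r^2 = (SST - SSE)/SST
x = [1, 2, 3, 4, 5, 6]
y = [3.0, 4.8, 7.1, 8.0, 11.2, 12.9]

xbar, ybar = mean(x), mean(y)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar

# SST: total squared error when predicting every y with the mean
sst = sum((yi - ybar) ** 2 for yi in y)
# SSE: total squared error when predicting with the LSRL
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

r_sq = (sst - sse) / sst
print(round(r_sq, 3))
```

Since SSE can never exceed SST for the least-squares line, this ratio always lands between 0 and 1, as a proportion of explained variability should.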

**example:** Consider the following output for a linear regression:

We can see that the LSRL for these data is *ŷ* = –1.95 + 0.8863*x* , and *r* ^{2} = 53.2% = 0.532. This means that 53.2% of the total variability in *y* can be explained by the regression of *y* on *x* . Further, *r* = √0.532 ≈ 0.729 (*r* is positive since *b* = 0.8863 is positive). We will learn more about the other items in the printout later.

You might note that there are two standard errors (estimates of population standard deviations) in the computer printout above. The first is the “StDev” of the slope (0.2772 in this example). This is the standard error of the slope of the regression line, *s*_{b} , the estimate of the standard deviation of the slope (for information, although you don't need to know this, *s*_{b} = *s* /√Σ(*x* – x̄)^{2} ). The second standard error given is the standard error of the residuals, the “*s* ” (*s* = 16.57) at the lower left corner of the table. This is the estimate of the standard deviation of the residuals (again, although you don't need to know this, *s* = √[Σ(*y* – *ŷ* )^{2} /(*n* – 2)]).

**Outliers and Influential Observations**

Some observations can have a large impact on correlation and regression. We defined an outlier when we were dealing with one-variable data (remember the 1.5(IQR) rule?). There is no analogous numerical definition when dealing with two-variable data, but it is the same basic idea: an **outlier** lies outside of the general pattern of the data. An outlier can certainly influence a correlation and, depending on where it is located, may also exert an influence on the slope of the regression line.

An **influential observation** is often an outlier in the *x* -direction. Its influence, if it doesn't line up with the rest of the data, is on the slope of the regression line. More generally, an influential observation is a data point that exerts a strong influence on a calculated measure.

**example:** Graphs I, II, and III are the same except for the point symbolized by the box in graphs II and III. Graph I below has no outliers or influential points. Graph II has an outlier that is an influential point that has an effect on the correlation. Graph III has an outlier that is an influential point that has an effect on the regression slope. Compare the correlation coefficients and regression lines for each graph. Note that the outlier in Graph II has some effect on the slope and a significant effect on the correlation coefficient. The influential point in Graph III has about the same effect on the correlation coefficient as the outlier in Graph II, but a major influence on the slope of the regression line.
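The Graph III situation can be imitated numerically. In this sketch the points are made up (they are not taken from the graphs): five points lying near a line of slope about 2, plus one point far out in the *x* -direction that does not follow the pattern. The influential point drags the slope down dramatically:

```python
from statistics import mean

def lsrl(x, y):
    """Return (intercept, slope) of the least-squares regression line."""
    xbar, ybar = mean(x), mean(y)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
    return ybar - b * xbar, b

# Made-up points lying close to the line y = 2x + 1
x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 8.8, 11.1]

a1, b1 = lsrl(x, y)

# One point far out in the x-direction that does not fit the pattern
a2, b2 = lsrl(x + [20], y + [5])

print(round(b1, 2), round(b2, 2))
```

Because the extreme point sits far from x̄, it gets enormous leverage in the slope computation, which is exactly why outliers in the *x* -direction tend to be influential.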

**Transformations to Achieve Linearity**

Until now, we have been concerned with data that can be modeled with a line. Of course, there are many two-variable relationships that are nonlinear. The path of an object thrown in the air is parabolic (quadratic). Population tends to grow exponentially, at least for a while. Even though you could find an LSRL for nonlinear data, it makes no sense to do so. The AP Statistics course deals only with two-variable data that can be modeled by a line *OR* nonlinear two-variable data that can be *transformed* in such a way that the transformed data can be modeled by a line.

**example:** The number of a certain type of bacteria present (in thousands) after a certain number of hours is given in the following chart:

What would be the predicted quantity of bacteria after 3.75 hours?

**solution:** A scatterplot of the data and a residual plot [for predicted *Number* = *a* + *b* (*Hours* )] show that a line is not a good model for these data:

Now, take *ln* (*Number* ) to produce the following data:

The scatterplot of *Hours* versus *ln* (*Number* ) and the residual plot for predicted *ln* (*Number* ) = –0.0047 + 0.586(*Hours* ) are as follows:

The scatterplot looks much more linear, and the residual plot no longer has the distinctive pattern of the raw data. We have transformed the original data in such a way that the transformed data are well modeled by a line. The regression equation for the transformed data is: predicted *ln* (*Number* ) = –0.0047 + 0.586(*Hours* ).

The question asked how many bacteria are predicted to be present after 3.75 hours. Plugging 3.75 into the regression equation, we have predicted *ln* (*Number* ) = –0.0047 + 0.586(3.75) = 2.19. But that is the predicted value of *ln* (*Number* ), not of *Number* itself. We must back-transform this answer to the original units. Doing so, we have predicted *Number* = *e* ^{2.19} = 8.94 thousand bacteria.
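The transform-fit-back-transform workflow looks like this in Python. The counts below are made-up exponential-growth data (not the chapter's bacteria table), so the fitted coefficients and the prediction are for illustration only:

```python
import math
from statistics import mean

# Hypothetical exponential-growth data (in thousands), for illustration
hours = [1, 2, 3, 4, 5]
number = [1.8, 3.2, 5.9, 10.4, 18.6]

# Transform: regress ln(number) on hours
ln_n = [math.log(v) for v in number]
xbar, ybar = mean(hours), mean(ln_n)
b = sum((h - xbar) * (l - ybar) for h, l in zip(hours, ln_n)) / \
    sum((h - xbar) ** 2 for h in hours)
a = ybar - b * xbar

# Predict at 3.75 hours on the log scale, then back-transform with e^x
ln_pred = a + b * 3.75
pred = math.exp(ln_pred)
print(round(pred, 2))
```

The back-transform step (exponentiating the predicted logarithm) is the part most often forgotten; the regression's output is a predicted *ln* (*Number* ), not a count.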

**Calculator Tip:** You do not need to take logarithms by hand in the above example—your calculator is happy to do it for you. Simply put the Hours data in L1 and the Number data in L2 . Then let L3 = LN(L2) . The LSRL for the transformed data is then found by LinReg(a+bx) L1,L3,Y1 .

Remember that the easiest way to find the value of a number substituted into the regression equation is to simply find Y1(#) . Y1 is found by entering VARS Y-VARS Function Y1 .

**Interesting Diversion:** You will find a number of different regression expressions in the STAT CALC menu: LinReg(ax+b), QuadReg, CubicReg, QuartReg, LinReg(a+bx), LnReg, ExpReg, PwrReg, Logistic , and SinReg . While each of these has its use, only LinReg(a+bx) needs to be used in this course (well, LinReg(ax+b) gives the same line, with the *a* and *b* values reversed, in standard algebraic form rather than in the usual statistical form).

**Exam Tip:** Also remember, when taking the AP exam, NO calculatorspeak. If you do a linear regression on your calculator, simply report the result. The person reading your exam will know that you used a calculator and is NOT interested in seeing something like LinReg L1,L2,Y1 written on your exam.

It may be worth your while to try several different transformations to see if you can achieve linearity. Some possible transformations are: take the log of both variables, raise one or both variables to a power, take the square root of one of the variables, take the reciprocal of one or both variables, etc.

**Rapid Review**

- The correlation between two variables *x* and *y* is 0.85. Interpret this statement.

*Answer:* There is a strong, positive, linear association between *x* and *y* . That is, as one of the variables increases, the other variable increases as well.

- The following is a residual plot of a least-squares regression. Does it appear that a line is a good model for the data? Explain.

*Answer:* The residual plot shows a definite pattern. If a line was a good model, we would expect to see a more or less random pattern of points about 0. A line is unlikely to be a good model for this data.

3. Consider the following scatterplot. Is point A an outlier, an influential observation, or both? What effect would its removal have on the slope of the regression line?

*Answer:* A is an *outlier* because it is removed from the general pattern of the rest of the points. It is an *influential observation* since its removal would have an effect on a calculation, specifically the slope of the regression line. Removing A would increase the slope of the LSRL.

4. A researcher finds that the LSRL for predicting GPA based on average hours studied per week is *ŷ* = 1.75 + 0.11(*hours studied*). Interpret the slope of the regression line in the context of the problem.

*Answer:* For each additional hour studied, the GPA is predicted to increase by 0.11. Alternatively, you could say that the GPA will increase 0.11 on average for each additional hour studied.

5. One of the variables that is related to college success (as measured by GPA) is socioeconomic status. In one study of the relationship, *r* ^{2} = 0.45. Explain what this means in the context of the problem.

*Answer: r* ^{2} = 0.45 means that 45% of the variability in college GPA is explained by the regression of GPA on socioeconomic status.

6. Each year of Governor Jones's tenure, the crime rate has decreased in a linear fashion. In fact, *r* = –0.8. It appears that the governor has been effective in reducing the crime rate. Comment.

*Answer:* Correlation does not necessarily imply causation. The crime rate could have gone down for a number of reasons besides Governor Jones's efforts.

7. What is the regression equation for predicting weight from height in the following computer printout, and what is the correlation between height and weight?

*Answer:* *ŷ* = –104.64 + 3.4715(*Height*). *r* is positive since the slope of the regression line is positive and both must have the same sign.

8. In the computer output for Exercise #7 above, identify the standard error of the slope of the regression line and the standard error of the residuals. Briefly explain the meaning of each.

*Answer:* The standard error of the slope of the regression line is 0.5990. It is an estimate of how much the estimated slope (the change in the mean response *y* per unit change in *x*) would vary from sample to sample. The standard error of the residuals is *s* = 7.936 and is an estimate of the variability of the response variable about the LSRL.
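If you have the raw data rather than a printout, both standard errors can be computed directly from the residuals. The height/weight pairs below are hypothetical, not the data behind the printout in Exercise #7; the formulas are the standard ones, *s* = √[Σ(residual)^{2}/(*n* – 2)] and SE(*b*) = *s*/√[Σ(*x* – *x̄*)^{2}].

```python
import numpy as np

# Hypothetical height (in) / weight (lb) data -- not the printout's data.
x = np.array([60, 62, 65, 68, 70, 72, 74], dtype=float)
y = np.array([105, 120, 130, 138, 152, 160, 155], dtype=float)

n = len(x)
b, a = np.polyfit(x, y, 1)          # LSRL slope and intercept
resid = y - (a + b * x)

# Standard error of the residuals: variability of y about the LSRL.
s = np.sqrt(np.sum(resid**2) / (n - 2))

# Standard error of the slope: sample-to-sample variability of b.
se_b = s / np.sqrt(np.sum((x - x.mean())**2))
print(round(b, 3), round(s, 3), round(se_b, 3))
```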

**Practice Problems**

**Multiple-Choice**

1. Given a set of ordered pairs (*x*, *y*) such that *s* _{x} = 1.6, *s* _{y} = 0.75, and *r* = 0.55, what is the slope of the least-squares regression line for these data?

(a) 1.82
(b) 1.17
(c) 2.18
(d) 0.26
(e) 0.78

2. The regression line for the two-variable dataset given above is *ŷ* = 2.35 + 0.86*x*. What is the value of the residual for the point whose *x*-value is 29?

(a) 1.71
(b) –1.71
(c) 2.29
(d) 5.15
(e) –2.29
3. A study found a correlation of *r* = –0.58 between hours per week spent watching television and hours per week spent exercising. That is, the more hours spent watching television, the fewer hours spent exercising per week. Which of the following statements is most accurate?

(a) About one-third of the variation in hours spent exercising can be explained by hours spent watching television.
(b) A person who watches less television will exercise more.
(c) For each hour spent watching television, the predicted decrease in hours spent exercising is 0.58 hrs.
(d) There is a cause-and-effect relationship between hours spent watching television and a decline in hours spent exercising.
(e) 58% of the hours spent exercising can be explained by the number of hours watching television.
4. A response variable appears to be exponentially related to the explanatory variable. The natural logarithm of each *y*-value is taken and the least-squares regression line is found to be *ln* (*ŷ*) = 1.64 – 0.88*x*. Rounded to two decimal places, what is the predicted value of *y* when *x* = 3.1?

(a) –1.09
(b) –0.34
(c) 0.34
(d) 0.082
(e) 1.09
5. Consider the following residual plot:

Which of the following statements is (are) true?

I. The residual plot indicates that a line is a reasonable model for the data.
II. The residual plot indicates that there is no relationship between the data.
III. The correlation between the variables is probably non-zero.

(a) I only
(b) II only
(c) I and III only
(d) II and III only
(e) I and II only
6. Suppose the LSRL for predicting weight (in pounds) from height (in inches) is given by *ŷ* = –115 + 3.6(*Height*). Which of the following statements is correct?

I. A person who is 61 inches tall will weigh 104.6 pounds.
II. For each additional inch of height, weight will increase on average by 3.6 pounds.
III. There is a strong positive linear relationship between height and weight.

(a) I only
(b) II only
(c) III only
(d) II and III only
(e) I and II only
7. A least-squares regression line for predicting performance on a college entrance exam based on high school grade point average (GPA) is determined to be *ŷ* = 273.5 + 91.2(*GPA*). One student in the study had a high school GPA of 3.0 and an exam score of 510. What is the residual for this student?

(a) 26.2
(b) 43.9
(c) –37.1
(d) –26.2
(e) 37.1
8. The correlation between two variables *x* and *y* is –0.26. A new set of scores, *x** and *y**, is constructed by letting *x** = –*x* and *y** = *y* + 12. The correlation between *x** and *y** is

(a) –0.26
(b) 0.26
(c) 0
(d) 0.52
(e) –0.52
9. A study was done on the relationship between high school grade point average (GPA) and scores on the SAT. The following 8 scores were from a random sample of students taking the exam:

What percent of the variation in SAT scores is explained by the regression of SAT score on GPA?

(a) 62.1%
(b) 72.3%
(c) 88.8%
(d) 94.2%
(e) 78.8%
10. A study of mileage found that the least-squares regression line for predicting mileage (in miles per gallon) from the weight of the vehicle (in hundreds of pounds) was *ŷ* = 32.50 – 0.45(*weight*). The mean weight for the vehicles in the study was 2980 pounds. What was the mean miles per gallon in the study?

(a) 19.09
(b) 15.27
(c) –1308.5
(d) 18.65
(e) 20.33

**Free-Response**

1. Given a two-variable dataset such that *x̄* = 14.5, *ȳ* = 20, *s* _{x} = 4, *s* _{y} = 11, and *r* = 0.80, find the least-squares regression line of *y* on *x*.

2. The data below give the first and second exam scores of 10 students in a calculus class.

(a) Draw a scatterplot of these data.

(b) To what extent do the scores on the two tests seem related?

3. The following is a residual plot of a linear regression. A line would not be a good fit for these data. Why not? Is the regression equation likely to underestimate or overestimate the *y*-value of the point in the graph marked with the square?

4. The regional champion in 10-and-under 100 m backstroke has had the following winning times (in seconds) over the past 8 years:

How many years until you expect the winning time to be one minute or less? What's wrong with this estimate?

5. Measurements are made of the number of cockroaches present, on average, every 3 days, beginning on the second day, after apartments in one part of town are vacated. The data are as follows:

How many cockroaches would you expect to be present after 9 days?

6. A study found a strongly positive relationship between number of miles walked per week and overall health. A local news commentator, after reporting on the results of the study, advised everyone to walk more during the coming year because walking more results in better health. Comment on the reporter's advice.

7. Carla, a young sociologist, is excitedly reporting on the results of her first professional study. The finding she is reporting is that 72% of the variation in math grades for girls can be explained by the girls' socioeconomic status. What does this mean, and is it indicative of a strong linear relationship between math grades and socioeconomic status for girls?

8. Which of the following statements are true of a least-squares regression equation?

(a) It is the unique line that minimizes the sum of the residuals.

(b) The average residual is 0.

(c) It minimizes the sum of the squared residuals.

(d) The slope of the regression line is a constant multiple of the correlation coefficient.

(e) The slope of the regression line tells you how much the response variable will change for each unit change in the explanatory variable.

9. Consider the following dataset:

Given that the LSRL for these data is *ŷ* = 26.211 – 0.25*x* , what is the value of the residual for *x* = 73? Is the point (73, 7.9) above or below the regression line?

10. Suppose the correlation between two variables is *r* = –0.75. What is true of the correlation coefficient and the slope of the regression line if

(a) each of the *y* values is multiplied by –1?

(b) the *x* and *y* variables are reversed?

(c) the *x* and *y* variables are each multiplied by –1?

11. Suppose the regression equation for predicting success on a dexterity task (*y*) from the number of training sessions (*x*) is *ŷ* = 45 + 2.7*x* and that *s* _{y}/*s* _{x} = 3.33.

What percentage of the variation in *y* is not explained by the regression on *x* ?

12. Consider the following scatterplot. The highlighted point is both an outlier and an influential point. Describe what will happen to the correlation and the slope of the regression line if that point is removed.

13. The computer printout below gives the regression output for predicting *crime rate* (in crimes per 1000 population) from the *number of casino employees* (in 1000s).

Based on the output,

(a) give the equation of the LSRL for predicting *crime rate* from *number* .

(b) give the value of *r* , the correlation coefficient.

(c) give the predicted *crime rate* for 20,000 casino employees.

14. A study was conducted in a mid-size U.S. city to investigate the relationship between the number of homes built in a year and the mean percentage appreciation for that year. The data for a 5-year period are as follows:

(a) Obtain the LSRL for predicting appreciation from number of new homes built in a year.

(b) The following year, 85 new homes are built. What is the predicted appreciation?

(c) How strong is the linear relationship between number of new homes built and percentage appreciation? Explain.

(d) Suppose you didn't know the number of new homes built in a given year. How would you predict appreciation?

15. A set of bivariate data has *r* ^{2} = 0.81.

(a) *x* and *y* are both standardized, and a regression line is fitted to the standardized data. What is the slope of the regression line for the standardized data?

(b) Describe the scatterplot of the original data.

16. Estimate *r*, the correlation coefficient, for each of the following graphs:

17. The least-squares regression equation for the given data is *ŷ* = 3 + *x*. Calculate the sum of the squared residuals for the LSRL.

18. Many schools require teachers to have evaluations done by students. A study investigated the extent to which student evaluations are related to grades. Teacher evaluations and grades are both given on a scale of 100. The results for Prof. Socrates (*y*) for 10 of his students are given below together with the average for each student (*x*).

(a) Do you think student grades and the evaluations students give their teachers are related? Explain.

(b) What evaluation score do you think a student who averaged 80 would give Prof. Socrates?

19. Which of the following statements are true?

(a) The correlation coefficient, *r* , and the slope of the regression line, *b* , always have the same sign.

(b) The correlation coefficient is the same no matter which variable is considered to be the explanatory variable and which is considered to be the response variable.

(c) The correlation coefficient is resistant to outliers.

(d) *x* and *y* are measured in inches, and *r* is computed. Now, *x* and *y* are converted to feet, and a new *r* is computed. The two computed values of *r* depend on the units of measurement and will be different.

(e) The idea of a correlation between height and gender is not meaningful because gender is not numerical.

20. A study of right-handed people found that the regression equation for predicting left-hand strength (measured in kg) from right-hand strength is *left-hand strength* = 7.1 + 0.35(*right-hand strength*).

(a) What is the predicted left-hand strength for a right-handed person whose right-hand strength is 12 kg?

(b) Interpret the intercept and the slope of the regression line in the context of the problem.

**Cumulative Review Problems**

1. Explain the difference between a statistic and a parameter.

2. True–False. The area under a normal curve between *z* = 0.1 and *z* = 0.5 is the same as the area between *z* = 0.3 and *z* = 0.7.

3. The following scores were achieved by students on a statistics test: 82, 93, 26, 56, 75, 73, 80, 61, 79, 90, 94, 93, 100, 71, 100, 60. Compute the mean and median for these data and explain why they are different.

4. Is it possible for the standard deviation of a set of data to be negative? Zero? Explain.

5. For the test scores of problem #3, compute the five-number summary and draw a boxplot of the data.

**Solutions to Practice Problems**

**Multiple-Choice**

1. The correct answer is (d).

2. The correct answer is (e). The value of a residual = actual value – predicted value = 25 – [2.35 + 0.86(29)] = –2.29.

3. The correct answer is (a). *r* ^{2} = (–0.58)^{2} = 0.3364. This is the *coefficient of determination*, which is the proportion of the variation in the response variable that is explained by the regression on the independent variable. Thus, about one-third (33.6%) of the variation in hours spent exercising can be explained by hours spent watching television. (b) is incorrect since correlation does not imply causation. (c) would be correct if *b* = –0.58, but there is no obvious way to predict the response value from the explanatory value just by knowing *r*. (d) is incorrect for the same reason (b) is incorrect. (e) is incorrect since *r*, not *r* ^{2}, is given. In this case *r* ^{2} = 0.3364, which makes (a) correct.

4. The correct answer is (c). *ln* (*ŷ*) = 1.64 – 0.88(3.1) = –1.088 ⇒ *ŷ* = *e* ^{–1.088} = 0.337.

5. The correct answer is (c). The pattern is more or less random about 0, which indicates that a line would be a good model for the data. If the data are linearly related, we would expect them to have a non-zero correlation.

6. The correct answer is (b). I is incorrect—the *predicted* weight of a person 61 inches tall is 104.6 pounds. II is a correct interpretation of the slope of the regression line (you could also say, "For each additional inch of height, weight *is predicted to* increase by 3.6 pounds"). III is incorrect: it may well be true, but we have no way of knowing that from the information given.

7. The correct answer is (c). The predicted score for the student is 273.5 + (91.2)(3) = 547.1. The residual is the actual score minus the predicted score, which equals 510 – 547.1 = –37.1.

8. The correct answer is (b). Consider the expression for *r*:

*r* = [1/(*n* – 1)] Σ [(*x* – *x̄*)/*s* _{x}][(*y* – *ȳ*)/*s* _{y}]

Adding 12 to each *y*-value would not change *s* _{y}. Although the average would be 12 larger, the differences *y* – *ȳ* would stay the same since each *y*-value is also 12 larger. By taking the negative of each *x*-value, each term would reverse sign (the mean also reverses sign) but the absolute value of each term would be the same. The net effect is to leave unchanged the absolute value of *r* but to reverse the sign.
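This sign-flip can be confirmed numerically. The sketch below builds a randomly generated dataset with a negative association (invented for illustration), applies *x** = –*x* and *y** = *y* + 12, and checks that the correlation simply reverses sign.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = -x + rng.normal(scale=0.5, size=50)  # clearly negative association

r = np.corrcoef(x, y)[0, 1]
r_star = np.corrcoef(-x, y + 12)[0, 1]   # x* = -x, y* = y + 12

# Negating x flips the sign of r; adding a constant to y leaves |r| unchanged.
print(round(r, 4), round(r_star, 4))
```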

9. The correct answer is (e). The question is asking for the coefficient of determination, *r* ^{2} (R-sq on many computer printouts). In this case, *r* = 0.8877 and *r* ^{2} = 0.7881, or 78.8%. This can be found on your calculator by entering the GPA scores in L1, the SAT scores in L2, and doing STAT CALC LinReg(a+bx) L1,L2.

10. The correct answer is (a). The point (*x̄*, *ȳ*) always lies on the LSRL. Hence, *ȳ* can be found by simply substituting *x̄* into the LSRL and solving for *ŷ*. Thus, *ȳ* = 32.5 – 0.45(29.8) = 19.09 miles per gallon. Be careful: you are told that the equation uses the weights in hundreds of pounds. You must then substitute 29.8 into the regression equation, not 2980, which would get you answer (c).

**Free-Response**

1. *b* = *r*(*s* _{y}/*s* _{x}) = (0.80)(11/4) = 2.2; *a* = *ȳ* – *b x̄* = 20 – (2.2)(14.5) = –11.9.

Thus, *ŷ* = –11.9 + 2.2*x*.

2. (a)

(b) There seems to be a moderate positive relationship between the scores: students who did better on the first test tend to do better on the second, but the relationship isn't very strong; *r* = 0.55.

3. A line is not a good model for the data because the residual plot shows a definite pattern: the first 8 points have negative residuals and the last 8 points have positive residuals. The point marked with the square is in a cluster of points with positive residuals. We know that, for any given point, the residual equals actual value minus predicted value. Because actual – predicted > 0, we have actual > predicted, so the regression equation is likely to underestimate the actual value.

4. The regression equation for predicting time from year is *ŷ* = 79.21 – 0.61(*year*). We need *time* = 60. Solving 60 = 79.21 – 0.61(*year*), we get *year* = 31.5. So, we would predict that times will drop under one minute in about 31 or 32 years. The problem with this is that we are extrapolating far beyond the data. Extrapolation is dangerous in any circumstance, and especially so 24 years beyond the last known time. It's likely that the rate of improvement will decrease over time.

5. A scatterplot of the data (graph on the left) appears to be exponential. Taking the natural logarithm of each *y*-value, the scatterplot (graph on the right) appears to be more linear.

Taking the natural logarithm of each *y*-value and finding the LSRL, we have *ln* (*ŷ*) = 0.914 + 0.108(*days*) = 0.914 + 0.108(9) = 1.89. Then *ŷ* = *e* ^{1.89} = 6.62.

6. The correlation between walking more and better health may or may not be causal. It may be that people who are healthier walk more. It may be that some other variable, such as general health consciousness, results in walking more and in better health. There may be a causal association, but in general, correlation does not imply causation.

7. Carla has reported the value of *r* ^{2}, the coefficient of determination. If she had predicted each girl's grade based on the average grade only, there would have been a large amount of variability. But, by considering the regression of grades on socioeconomic status, she has reduced the total amount of variability by 72%. Because *r* ^{2} = 0.72, *r* = 0.85, which is indicative of a strong positive linear relationship between grades and socioeconomic status. Carla has reason to be happy.

8. (a) is false. Σ(*y* – *ŷ*) = 0 for the LSRL, but there is no unique line for which this is true.

(b) is true.

(c) is true. In fact, this is the definition of the LSRL—it is the line that minimizes the sum of the squared residuals.

(d) is true since *b* = *r*(*s* _{y}/*s* _{x}) and *s* _{y}/*s* _{x} is constant.

(e) is false. The slope of the regression line tells you by how much the response variable changes *on average* for each unit change in the explanatory variable.
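Statements (b) and (d) are easy to verify numerically on any dataset. The sketch below uses randomly generated data (not from any exercise here) and checks that the LSRL's residuals average to 0 and that the fitted slope equals *r*(*s* _{y}/*s* _{x}).

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(10, 2, size=40)
y = 3 + 1.5 * x + rng.normal(size=40)

b, a = np.polyfit(x, y, 1)
resid = y - (a + b * x)

# (b): the residuals of the LSRL average to 0.
print(resid.mean())

# (d): the slope is a constant multiple of r, namely b = r * (s_y / s_x).
r = np.corrcoef(x, y)[0, 1]
print(b, r * y.std(ddof=1) / x.std(ddof=1))
```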

9. *ŷ* = 26.211 – 0.25*x* = 26.211 – 0.25(73) = 7.961. The residual for *x* = 73 is the actual value at 73 minus the predicted value at 73, or *y* – *ŷ* = 7.9 – 7.961 = –0.061. (73, 7.9) is below the LSRL since *y* – *ŷ* < 0 ⇒ *y* < *ŷ*.

10. (a) *r* = +0.75; the slope is positive and is the negative of the original slope.

(b) *r* = –0.75. It doesn”t matter which variable is called *x* and which is called *y* .

(c) *r* = –0.75; the slope is the same as the original slope.

11. We know that *b* = *r*(*s* _{y}/*s* _{x}), so that 2.7 = *r*(3.33) → *r* = 0.81 and *r* ^{2} = 0.66. The proportion of the variability that is *not* explained by the regression of *y* on *x* is 1 – *r* ^{2} = 1 – 0.66 = 0.34.

12. Because the linear pattern will be stronger, the correlation coefficient will increase. The influential point pulls up on the regression line, so its removal would cause the slope of the regression line to decrease.

13. (a) *ŷ* = –0.3980 + 0.1183(*number*).

(b) *r* is positive since the slope is positive.

(c) *ŷ* = –0.3980 + 0.1183(20) = 1.97 crimes per 1000 population. Be sure to use 20 (the number of employees in 1000s), not 20,000.

14. (a) *ŷ* = 1.897 + 0.115(*number*)

(b) *ŷ* = 1.897 + 0.115(85) = 11.67%.

(c) *r* = 0.82, which indicates a strong linear relationship between the number of new homes built and percent appreciation.

(d) If the number of new homes built was unknown, your best estimate would be the average percentage appreciation for the 5 years. In this case, the average percentage appreciation is 11.3%. [For what it's worth, the average error (absolute value) using the mean to estimate appreciation is 2.3; for the regression line, it's 1.3.]

15. (a) If *r* ^{2} = 0.81, then *r* = ±0.9. For standardized data, the slope of the regression line equals *r*, so the slope is either 0.9 or –0.9.

(b) If *r* = +0.9, the scatterplot shows a strong positive linear pattern between the variables. Values above the mean on one variable tend to be above the mean on the other, and values below the mean on one variable tend to be below the mean on the other. If *r* = –0.9, there is a strong negative linear pattern to the data. Values above the mean on one variable are associated with values below the mean on the other.
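Part (a) can be checked numerically: after standardizing both variables, the fitted slope equals *r* exactly. The data below are randomly generated so that *r* lands near 0.9; they are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=60)
y = 0.9 * x + rng.normal(scale=0.48, size=60)  # engineered so r is near 0.9

# Standardize both variables (convert to z-scores).
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

b, a = np.polyfit(zx, zy, 1)       # LSRL for the standardized data
r = np.corrcoef(x, y)[0, 1]
print(b, r)                        # slope of the standardized fit equals r
```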

16. (a) *r* = 0.8

(b) *r* = 0.0

(c) *r* = –1.0

(d) *r* = –0.5

17. Each of the points lies on the regression line → every residual is 0 → the sum of the squared residuals is 0.

18. (a) *r* = 0.90 for these data, indicating that there is a strong positive linear relationship between student averages and evaluations of Prof. Socrates. Furthermore, *r* ^{2} = 0.82, which means that most of the variability in student evaluations can be explained by the regression of student evaluations on student average.

(b) If *y* is the evaluation score of Prof. Socrates and *x* is the corresponding average for the student who gave the evaluation, then *ŷ* = –29.3 + 1.34*x* . If *x* = 80, then *ŷ* = –29.3 + 1.34(80) = 77.9, or 78.

19. (a) True, because *b* = *r*(*s* _{y}/*s* _{x}), and *s* _{y}/*s* _{x} is positive.

(b) True. *r* is the same if explanatory and response variables are reversed. This is not true, however, for the slope of the regression line.

(c) False. Because *r* is defined in terms of the means of the *x* and *y* variables, it is not resistant.

(d) False. *r* does not depend on the units of measurement.

(e) True. The definition of *r* requires that the variables be numerical, not categorical.

20. (a) *ŷ* = 7.1 + 0.35(12) = 11.3 kg.

(b) **Intercept:** The predicted left-hand strength of a person who has zero right-hand strength is 7.1 kg. **Slope:** On average, left-hand strength increases by 0.35 kg for each 1-kg increase in right-hand strength. Or: left-hand strength is predicted to increase by 0.35 kg for each 1-kg increase in right-hand strength.

**Solutions to Cumulative Review Problems**

1. A *statistic* is a measurement that describes a sample. A *parameter* is a value that describes a population.

2. False. For an interval of fixed length, there will be a greater proportion of the area under the normal curve if the interval is closer to the center than if it is farther from the center. This is because the normal distribution is mound-shaped, so values tend to group more in the center of the distribution than away from the center.

3. The mean is 77.1, and the median is 79.5. The mean is lower than the median because the mean is not resistant to extreme values (here, the low score of 26), while the median is resistant.

4. By definition, *s* = √[Σ(*x* – *x̄*)^{2}/(*n* – 1)], which is a positive square root. Since *n* > 1 and Σ(*x* – *x̄*)^{2} ≥ 0, *s* cannot be negative. It *can* be zero, but only if *x* = *x̄* for all values of *x*.

5. The five-number summary is [26, 66, 79.5, 93, 100].

**CHAPTER 7**

**Two-Variable Data Analysis**

1. In the scatterplot, the dashed lines represent the mean values of the two variables. Removing point A would have which of the following effects?

(A) The slope of the regression line would increase, and the correlation would decrease.

(B) The slope of the regression line would decrease, and the correlation would increase.

(C) The slope of the regression line would not change, and the correlation would increase.

(D) The slope of the regression line would not change, and the correlation would decrease.

(E) Neither the slope of the regression line nor the correlation would change.

2. A statistics professor has collected data over the years on the midterm exam and final exam scores for his students. There is a positive linear association between the two sets of scores, and the correlation is 0.73. The standard deviation of midterm scores is 15 points, and the standard deviation of final exam scores is 25 points. If a student scores 18 points above the mean score on her midterm exam, about how far above the mean final exam score is her predicted final exam score?

(A) 18 points

(B) 22 points

(C) 30 points

(D) She would be predicted to get 0 points on the final exam.

(E) It is impossible to calculate this without knowing the mean scores.

3. The Department of Transportation regularly collects and publishes data on airline performance. To see if there is a relationship between flight problems, such as delays or cancellations, and rates of mishandled baggage, a linear regression was computed for U.S. airlines from January to June 2014. The regression output is displayed below.

Dependent variable is baggage

No selector

R squared = 78.9% R squared (adjusted) = 78.2%

s = 0.2505 with 32 – 2 = 30 degrees of freedom (df)

Which of the following statements is true, based on the computer output?

(A) An airline with one more mishandled baggage complaint than another tends to have about 10.5234 fewer flight problems on average.

(B) An airline with one more mishandled baggage complaint than another tends to have about 0.5101 more flight problems on average.

(C) An airline with one more flight problem than another tends to have about 10.5234 fewer baggage complaints on average.

(D) An airline with one more flight problem than another tends to have about 0.5101 more baggage complaints on average.

(E) An airline with one more flight problem than another tends to have about 5.6564 fewer baggage complaints on average.

4. For 2015 midsize cars, the regression equation for predicting combined fuel economy (miles per gallon) from the engine size (in liters) is *ŷ* = 36 – 3.57*x*, where *y* is the combined fuel economy and *x* is the engine size. One car with a 2 L engine has a combined fuel economy of 40 mpg. Which statement about this car's residual is true?

(A) The residual is –11.14, which means this car gets 11.14 mpg less than the model predicts.

(B) The residual is –11.14, which means this car gets 11.14 mpg more than the model predicts.

(C) The residual is 11.14, which means this car gets 11.14 mpg less than the model predicts.

(D) The residual is 11.14, which means this car gets 11.14 mpg more than the model predicts.

(E) The residual is 28.86, which means this car gets 28.86 mpg more than the model predicts.

5. The fertility rate (average number of children per woman) in a country can be used to predict the life expectancy of people in that country. The regression equation in 2012 was *ŷ* = 85 – 5.14*x*, where *y* is the life expectancy in years and *x* is the number of children per woman. The correlation is –0.806. What percent of the variation in life expectancy is explained by the regression model with fertility rate as the predictor?

(A) 5.14%

(B) 65.0%

(C) 80.6%

(D) 85.0%

(E) There is not enough information to answer this question.

**Answers**

1. C
2. B
3. D
4. D
5. B