One-Variable Data Analysis - Review the Knowledge You Need to Score High - 5 Steps to a 5 AP Statistics 2017 (2016)


STEP 4

Review the Knowledge You Need to Score High

CHAPTER 6

One-Variable Data Analysis

IN THIS CHAPTER

Summary: We begin our study of statistics by considering distributions of data collected on a single variable. This will have both graphical and analytical components and is often referred to as exploratory data analysis (EDA). “Seeing” the data can often help us understand it and, to that end, we will look at a variety of graphs that display the data. Following that, we will consider a range of numerical measures for center and spread that help describe the dataset. When we are finished, we will be able to describe a dataset in terms of its shape, its center, and its spread (variability). We will consider both ways to describe the whole dataset and ways to describe individual terms within the dataset. A very important one-variable distribution, the normal distribution, will be considered in some detail.

Key Ideas

Shape of a Distribution

Dotplot

Stemplot

Histogram

Measures of Center

Measures of Spread

Five-Number Summary

Boxplot

z-Score

Density Curve

Normal Distribution

The 68-95-99.7 Rule

If you are given an instruction to “describe” a set of data, be sure you discuss the shape of the data (including gaps and clusters in the data), the center of the data (mean, median, mode), and the spread of the data (range, interquartile range, standard deviation).

Graphical Analysis

Our purpose in drawing a graph of data is to get a visual sense of it. We are interested in the shape of the data as well as gaps in the data, clusters of datapoints, and outliers (which are datapoints that lie well outside of the general pattern of the data).

Shape

When we describe shape, what we are primarily interested in is the extent to which the graph appears to be symmetric (has symmetry around some axis), mound-shaped (bell-shaped), skewed (data are skewed to the left if the tail is to the left; to the right if the tail is to the right), bimodal (has more than one location with many scores), or uniform (frequencies of the various values are more-or-less constant).

This graph could be described as symmetric and mound-shaped (or bell-shaped). Note that it doesn’t have to be perfectly symmetrical to be classified as symmetric.

This graph is of a uniform distribution. Again, note that it does not have to be perfectly uniform to be described as uniform.

This distribution is skewed left because the tail is to the left. If the tail were to the right, the graph would be described as skewed right.

There are four types of graph we want to look at in order to help us understand the shape of a distribution: dotplot, stemplot, histogram, and boxplot. We use the following 31 scores from a 50-point quiz given to a community college statistics class to illustrate the first three plots (we will look at a boxplot in a few pages):

Dotplot

A dotplot is a very simple type of graph that involves plotting the data values, with dots, above the corresponding values on a number line. A dotplot of the scores on the statistics quiz, drawn by a statistics computer package, looks like this:

[Calculator note: Most calculators do not have a built-in function for drawing dotplots. There are work-arounds that will allow you to draw a dotplot on a calculator, but they involve more effort than they are worth.]

Stemplot (Stem and Leaf Plot)

A stemplot is a bit more complicated than a dotplot. Each data value has a stem and a leaf. There are no mathematical rules for what constitutes the stem and what constitutes the leaf. Rather, the nature of the data will suggest reasonable choices for the stem and leaves. With the given score data, we might choose the first digit to be the stem and the second digit to be the leaf. So, the number 42 in a stem and leaf plot would show up as 4|2. All the leaves for a common stem are often on the same line. Often, these are listed in increasing order, so the line with stem 4 could be written as: 4 | 0112236. The complete stemplot of the quiz data looks like this:

Using the 10’s digit for the stem and the units digit for the leaf made good sense with this data set; other choices make sense depending on the type of data. For example, suppose we had a set of gas mileage tests on a particular car (e.g., 28.3, 27.5, 28.1, …). In this case, it might make sense to make the stems the integer part of the number and the leaf the decimal part. As another example, consider measurements on a microscopic computer part (0.0018, 0.0023, 0.0021, …). Here you’d probably want to ignore the 0.00 (since that doesn’t help distinguish between the values) and use the first nonzero digit as the stem and the second nonzero digit as the leaf.

Some data lend themselves to breaking the stem into two or more parts. For these data, the stem “4” could be shown with leaves broken up 0–4 and 5–9. Done this way, the stemplot for the scores data would look like this (there is a single “1” because there are no leaves with the values 0–4 for a stem of 1; similarly, there is only one “5” since there are no values in the 55–59 range):

The visual image is of data that are slightly skewed to the right (that is, toward the higher scores). We do notice a cluster of scores in the high 20s that was not obvious when we used an increment of 10 rather than 5. There is no hard-and-fast rule about how to break up the stems; it’s easy to try different arrangements on most computer packages.

Sometimes plotting more than one stemplot, side-by-side or back-to-back, can provide us with comparative information. The following stemplot shows the results of two quizzes given for this class (one of them the one discussed above):

It can be seen from this comparison that the scores on Quiz #1 (on the left) were generally higher than for those on Quiz #2—there are a lot more scores at the upper end. Although both distributions are reasonably symmetric, the one on the left is skewed somewhat toward the smaller scores, and the one on the right is skewed somewhat toward the larger numbers.

[Note: Most calculators do not have a built-in function for drawing stemplots. However, most computer programs do have this ability, and it’s quite easy to experiment with various stem increments.]

Histogram

A bar graph is used to illustrate qualitative data, and a histogram is used to illustrate quantitative data. The horizontal axis in a bar graph contains the categories, and the vertical axis contains the frequencies, or relative frequencies, of each category. The horizontal axis in a histogram contains numerical values, and the vertical axis contains the frequencies, or relative frequencies, of the values (often intervals of values).

example: Twenty people were asked to state their preferences for candidates in an upcoming election. The candidates were Arnold, Betty, Chuck, Dee, and Edward. Five preferred Arnold, three preferred Betty, six preferred Chuck, two preferred Dee, and four preferred Edward. A bar graph of their preferences is shown below:

A histogram is composed of bars of equal width, usually with common edges. When you choose the intervals, be sure that each of the datapoints fits into a category (it must be clear in which class any given value is).

Consider again the quiz scores we looked at when we discussed dotplots:

Because the data are integral and range from 15 to 50, reasonable intervals might be of size 10 or of size 5. The graphs below show what happens with the two choices:

Typically, the interval with midpoint 15 would have class boundaries 12.5 ≤ x < 17.5; the interval with midpoint 20 would have class boundaries 17.5 ≤ x < 22.5, etc.

There are no hard-and-fast rules for how wide to make the bars (called “class intervals”). You should use computer or calculator software to help you find a picture to which your eye reacts. In this case, intervals of size 5 give us a better sense of the data. Note that some computer programs will label the boundaries of each interval rather than the midpoint.

The following is the histogram of the same data (with bar width 5) produced by the TI-83/84 calculator:

Calculator Tip: If you are using your calculator to draw a histogram of data in a list, you might be tempted to use the ZoomStat command in the ZOOM menu. This command causes the calculator to choose a window for your graph. This might be a good idea for other types of graphs, but the TI-83/84 does a poor job getting the window right for a histogram. The bar width (or bin width) of a histogram is determined by XScl and you should choose this to fit your data. The graph above has XScl=5 and, therefore, a bar width of 5. Xmin and Xmax represent the lower and upper boundaries on the screen. In this example, they are set at 12.5 and 53.5, respectively.

example: For the histogram below, identify the boundaries for the class intervals.

solution: The midpoints of the intervals begin at 15 and go by increments of 5. So the boundaries of each interval are 2.5 above and below the midpoints. Hence, the boundaries for the class intervals are 12.5, 17.5, 22.5, 27.5, 32.5, 37.5, 42.5, 47.5, and 52.5.

example: For the histogram given in the previous example, what proportion of the scores are less than 27.5?

solution: From the graph we see that there is 1 value in the first bar, 2 in the second, 6 in the third, etc., for a total of 31 altogether. Of these, 1 + 2 + 6 = 9 are less than 27.5. 9/31 = 0.29.

example: The following are the heights of 100 college-age women, followed by a histogram and stemplot of the data. Describe the graph of the data using either the histogram, the stemplot, or both.

Height

solution: Both the stemplot and the histogram show symmetric, bell-shaped distributions. The graph is symmetric and centered about 66 inches. In the histogram, the boundaries of the bars are 59 ≤ x < 61, 61 ≤ x < 63, 63 ≤ x < 65, …, 71 ≤ x < 73. Note that for each interval, the lower bound is contained in the interval, while the upper bound is part of the next larger interval. Also note that the stemplot and the histogram convey the same visual image for the shape of the data.

Measures of Center

In the last example of the previous section, we said that the graph appeared to be centered about a height of 66″. In this section, we talk about ways to describe the center of a distribution. There are two primary measures of center: the mean and the median . There is a third measure, the mode , but it tells where the most frequent values occur more than it describes the center. In some distributions, the mean, median, and mode will be close in value, but the mode can appear at any point in the distribution.

Mean

Let $x_i$ represent any value in a set of n values (i = 1, 2, …, n). The mean of the set is defined as the sum of the x’s divided by n. Symbolically, $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$. Usually, the indices on the summation symbol in the numerator are left out and the expression is simplified to $\bar{x} = \frac{\sum x}{n}$. Σx means “the sum of x” and is defined as follows: $\sum x = \sum_{i=1}^{n} x_i = x_1 + x_2 + \dots + x_n$. Think of it as the “add-’em-up” symbol to help remember what it means. $\bar{x}$ is used for a mean based on a sample (a statistic). In the event that you have access to an entire distribution (such as in Chapters 9 and 10), its mean is symbolized by the Greek letter μ.

(Note: In the previous chapter, we made a distinction between statistics , which are values that describe sample data, and parameters , which are values that describe populations. Unless we are clear that we have access to an entire population, or that we are discussing a distribution, we use the statistics rather than parameters.)

example: During his major league career, Babe Ruth hit the following number of home runs (1914–1935): 0, 4, 3, 2, 11, 29, 54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22, 6. What was the mean number of home runs per year for his major league career?
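The definition of the mean can be applied directly to these data. The following is a minimal Python sketch (Python is not used in the original text, and the variable names are illustrative) that adds up the 22 season totals and divides by n:

```python
# Babe Ruth's home runs per season, 1914-1935 (data from the example above)
home_runs = [0, 4, 3, 2, 11, 29, 54, 59, 35, 41, 46,
             25, 47, 60, 54, 46, 49, 46, 41, 34, 22, 6]

n = len(home_runs)        # number of seasons
total = sum(home_runs)    # the "add-'em-up" step: sum of x
x_bar = total / n         # sample mean: (sum of x) / n

print(n, total, round(x_bar, 2))   # 22 714 32.45
```

So the mean is about 32.45 home runs per year.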

Calculator Tip: You should use a calculator to do examples like the above; it’s a waste of time to do it by hand, or even to use a calculator to add up the numbers and divide by n. To use the TI-83/84, press STAT; select EDIT; enter the data in a list, say L1 (you can clear the list, if needed, by moving the cursor on top of the L1, pressing CLEAR and ENTER). Once the data are in L1, press STAT, select CALC, select 1-Var Stats and press ENTER. 1-Var Stats will appear on the home screen followed by a blinking cursor: the calculator wants to know where your data are. Enter L1 (it’s above the 1; enter 2ND 1 to get it). The calculator will return $\bar{x}$ and a lot more. Note that, if you fail to enter a list name after 1-Var Stats (that is, you press ENTER at this point), the calculator will assume you mean L1. It’s a good idea to get used to entering the list name where you’ve put your data, even if it is L1.

Median

The median of an ordered dataset is the “middle” value in the set. If the dataset has an odd number of values, the median is a member of the set and is the middle value. If there are 3 values, the median is the second value. If there are 5, it is the third, etc. If the dataset has an even number of values, the median is the mean of the two middle numbers. If there are 4 values, the median is the mean of the second and third values. In general, if there are n values in the ordered dataset, the median is at the $\frac{n+1}{2}$ position. If you have 28 terms in order, you will find the median at the $\frac{28+1}{2} = 14.5$th position (that is, between the 14th and 15th terms). Be careful not to interpret $\frac{n+1}{2}$ as the value of the median rather than as the location of the median.

example: Consider once again the data in the previous example from Babe Ruth’s career. What was the median number of home runs per year he hit during his major league career?

solution: First, put the numbers in order from smallest to largest: 0, 2, 3, 4, 6, 11, 22, 25, 29, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60. There are 22 scores, so the median is found at the 11.5th position, between the 11th and 12th scores (35 and 41). So the median is $\frac{35 + 41}{2} = 38$.
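The distinction between the location of the median and its value can be made concrete in code. The sketch below is an illustration only (the function name `median_of` is hypothetical, not from the text): it sorts the data, then takes the middle value for odd n or the mean of the two middle values for even n.

```python
def median_of(values):
    """Median of a dataset: the value at position (n + 1) / 2 of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: mean of the two middle values

home_runs = [0, 4, 3, 2, 11, 29, 54, 59, 35, 41, 46,
             25, 47, 60, 54, 46, 49, 46, 41, 34, 22, 6]

position = (len(home_runs) + 1) / 2    # where the median sits, not its value
print(position, median_of(home_runs))  # 11.5 38.0
```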

The 1-Var Stats procedure, described in the previous Calculator Tip box, will, if you scroll down to the second screen of output, give you the median (as part of the entire five-number summary of the data: minimum, lower quartile, median, upper quartile, maximum).

Resistant

Although the mean and median are both measures of center, the choice of which to use depends on the shape of the distribution. If the distribution is symmetric and mound shaped, the mean and median will be close. However, if the distribution has outliers or is strongly skewed, the median is probably the better choice to describe the center. This is because it is a resistant statistic, one whose numerical value is not dramatically affected by extreme values, while the mean is not resistant.

example: A group of five teachers in a small school have salaries of $32,700, $32,700, $38,500, $41,600, and $44,500. The mean and median salaries for these teachers are $38,000 and $38,500, respectively. Suppose the highest paid teacher gets sick, and the school superintendent volunteers to substitute for her. The superintendent’s salary is $174,300. If you replace the $44,500 salary with the $174,300 one, the median doesn’t change at all (it’s still $38,500), but the new mean is $63,960; almost everybody is below average if, by “average,” you mean mean.

example: For the graph given below, would you expect the mean or median to be larger? Why?

solution: You would expect the median to be larger than the mean. Because the graph is skewed to the left, and the mean is not resistant, you would expect the mean to be pulled to the left (in fact, the dataset from which this graph was drawn has a mean of 5.4 and a median of 6, as expected, given the skewness).

Measures of Spread

Simply knowing about the center of a distribution doesn’t tell you all you might want to know about the distribution. One group of 20 people earning $20,000 each will have the same mean and median as a group of 20 where 10 people earn $10,000 and 10 people earn $30,000. These two sets of 20 numbers differ not in terms of their center but in terms of their spread, or variability. Just as there are measures of center based on the mean and the median, we also have measures of spread based on the mean and the median.
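The two income groups can be compared numerically. The sketch below is illustrative only; it uses Python's standard `statistics` module and, as a stand-in for the spread measures developed in this section, the simple difference between the largest and smallest values.

```python
from statistics import mean, median

group_a = [20_000] * 20                   # 20 people earning $20,000 each
group_b = [10_000] * 10 + [30_000] * 10   # 10 earning $10,000, 10 earning $30,000

for name, incomes in (("A", group_a), ("B", group_b)):
    spread = max(incomes) - min(incomes)  # maximum minus minimum
    print(name, mean(incomes), median(incomes), spread)
```

Both groups report a mean and median of 20,000, but the spread is 0 for the first group and 20,000 for the second.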

Variance and Standard Deviation

One measure of spread based on the mean is the variance . By definition, the variance is the average squared deviation from the mean. That is, it is a measure of spread because the more distant a value is from the mean, the larger will be the square of the difference between it and the mean.

Symbolically, the variance is defined by $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$.

Note that we average by dividing by n – 1 rather than n as you might expect. This is because there are only n – 1 independent datapoints, not n, if you know $\bar{x}$. That is, if you know n – 1 of the values and you also know $\bar{x}$, then the nth datapoint is determined.

One problem using the variance as a measure of spread is that the units for the variance won’t match the units of the original data because each difference is squared. For example, if you find the variance of a set of measurements made in inches, the variance will be in square inches. To correct this, we often take the square root of the variance as our measure of spread.

The square root of the variance is known as the standard deviation. Symbolically, $s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$.

As discussed earlier, it is common to leave off the indices and write: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}}$.

In practice, you will rarely have to do this calculation by hand because it is one of the values returned when you use your calculator to do 1-Var Stats on a list (it’s the Sx near the bottom of the first screen).

Calculator Tip: When you use 1-Var Stats, the calculator will, in addition to Sx, return σx, which is the standard deviation of a distribution. Its formal definition is $\sigma_x = \sqrt{\frac{\sum (x - \mu)^2}{n}}$. Note that this assumes you know μ, the population mean, which you rarely do in practice unless you are dealing with a probability distribution (see Chapter 9). Most of the time in statistics, you are dealing with sample data and not a distribution. Thus, with the exception of the type of probability material found in Chapters 9 and 10, you should use only s and not σ.

The definition of standard deviation has three useful qualities when it comes to describing the spread of a distribution:

  • It is independent of the mean. Because it depends on how far datapoints are from the mean, it doesn’t matter where the mean is.
  • It is sensitive to the spread. The greater the spread, the larger will be the standard deviation. For two datasets with the same mean, the one with the larger standard deviation has more variability.
  • It is independent of n. Because we are averaging squared distances from the mean, the standard deviation will not get larger just because we add more terms.

example: Find the standard deviation of the following 6 numbers: 3, 4, 6, 6, 7, 10.

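solution: $\bar{x} = \frac{3 + 4 + 6 + 6 + 7 + 10}{6} = 6$, so $s = \sqrt{\frac{(3-6)^2 + (4-6)^2 + (6-6)^2 + (6-6)^2 + (7-6)^2 + (10-6)^2}{6-1}} = \sqrt{\frac{30}{5}} = \sqrt{6} \approx 2.449$.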

Note that the standard deviation, like the mean, is not resistant to extreme values. Because it depends upon distances from the mean, it should be clear that extreme values will have a major impact on the numerical value of the standard deviation. Note also that, in practice, you will never have to do the calculation above by hand—you will rely on your calculator.
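For readers who want to see each step of the calculation, here is a short Python sketch (an illustration, not part of the original text) that follows the definition term by term for the six numbers in the example: find the mean, square each deviation, divide the total by n – 1, and take the square root.

```python
from math import sqrt

data = [3, 4, 6, 6, 7, 10]

n = len(data)
x_bar = sum(data) / n                              # 36 / 6 = 6.0
squared_devs = [(x - x_bar) ** 2 for x in data]    # 9, 4, 0, 0, 1, 16
variance = sum(squared_devs) / (n - 1)             # 30 / 5 = 6.0 (note n - 1, not n)
s = sqrt(variance)                                 # sample standard deviation

print(x_bar, variance, round(s, 3))                # 6.0 6.0 2.449
```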

Interquartile Range

Although the standard deviation works well in situations where the mean works well (reasonably symmetric distributions), we need a measure of spread that works well when a mean-based measure is not appropriate. That measure is called the interquartile range.

Remember that the median of a distribution divides the distribution in two: it is the middle of the distribution. The medians of the upper and lower halves of the distribution, not including the median itself in either half, are called quartiles. The median of the lower half is called the lower quartile, or the first quartile (which is the 25th percentile, Q1 on the calculator). The median of the upper half is called the upper quartile, or the third quartile (which is the 75th percentile, Q3 on the calculator). The median itself can be thought of as the second quartile or Q2 (although we usually don’t).

The interquartile range (IQR) is the difference between Q3 and Q1. That is, IQR = Q3 – Q1. When you do 1-Var Stats, the calculator will return Q1 and Q3 along with a lot of other stuff. You have to compute the IQR from Q1 and Q3. Note that the IQR is the width of the interval that contains the middle 50% of the data.

example: Find Q1, Q3, and the IQR for the following dataset: 5, 5, 6, 7, 8, 9, 11, 13, 17.

solution: Because the data are in order, and there is an odd number of values (9), the median is 8. The bottom half of the data comprises 5, 5, 6, 7. The median of the bottom half is the average of 5 and 6, or 5.5 which is Q1. Similarly, Q3 is the median of the top half, which is the mean of 11 and 13, or 12. The IQR = 12 – 5.5 = 6.5.
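The quartile procedure just described (medians of the lower and upper halves, leaving the overall median out of both halves when n is odd) can be sketched in Python. The helper name `quartiles` is hypothetical; note also that other software may use different quartile conventions and return slightly different values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves of the data.

    When n is odd, the overall median is excluded from both halves.
    Assumes at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]     # values below the median position
    upper = ordered[-half:]    # values above the median position
    return median(lower), median(upper)

data = [5, 5, 6, 7, 8, 9, 11, 13, 17]
q1, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q3, iqr)   # 5.5 12.0 6.5
```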

example: Find the standard deviation and IQR for the number of home runs hit by Babe Ruth in his major league career. The number of home runs was: 0, 4, 3, 2, 11, 29, 54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22, 6.

solution: We put these numbers into a TI-83/84 list and do 1-Var Stats on that list. The calculator returns Sx = 20.21, Q1 = 11, and Q3 = 47. Hence the IQR = Q3 – Q1 = 47 – 11 = 36.

The range of the distribution is the difference between the maximum and minimum scores in the distribution. For the home run data, the range equals 60 – 0 = 60. Although this is sometimes used as a measure of spread, it is not very useful because we are usually interested in how the data spread out from the center of the distribution, not in just how far it is from the minimum to the maximum values.

Outliers

We have a pretty good intuitive sense of what an outlier is: it’s a value far removed from the others. There is no rigorous mathematical formula for determining whether or not something is an outlier, but there are a few conventions that people seem to agree on. Not surprisingly, some of them are based on the mean and some are based on the median!

A commonly agreed-upon way to think of outliers based on the mean is to consider how many standard deviations away from the mean a term is. Some texts identify a potential outlier as a datapoint that is more than two or three standard deviations from the mean.

In a mound-shaped, symmetric distribution, this is a value that has only about a 5% chance (for two standard deviations) or a 0.3% chance (for three standard deviations) of being as far removed from the center of the distribution as it is. Think of it as a value that is way out in one of the tails of the distribution.

Most texts now use a median-based measure and identify potential outliers in terms of how far a datapoint is above or below the quartiles in a distribution. To find if a distribution has any outliers, do the following (this is known as the “1.5 (IQR) rule”):

  • Find the IQR.
  • Multiply the IQR by 1.5.
  • Find Q1 – 1.5(IQR) and Q3 + 1.5(IQR).
  • Any value below Q1 – 1.5(IQR) or above Q3 + 1.5(IQR) is a potential outlier.

Some texts call an outlier defined as above a mild outlier. An extreme outlier would then be one that lies more than 3 IQRs beyond Q1 or Q3.

example: The following data represent the amount of money, in British pounds, spent weekly on tobacco for 11 regions in Britain: 4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51, 4.56. Do any of the regions seem to be spending a lot more or less than the other regions? That is, are there any outliers in the data?

solution: Using a calculator, we find $\bar{x}$ = 3.62, Sx = s = 0.59, Q1 = 3.2, Q3 = 4.03.

  • Using means: 3.62 ± 2(0.59) = (2.44, 4.80). There are no values in the dataset less than 2.44 or greater than 4.80, so there are no outliers by this method. We don’t need to check ± 3s since there were no outliers using ± 2s.
  • Using the 1.5(IQR) rule: Q1 – 1.5(IQR) = 3.2 – 1.5(4.03 – 3.2) = 1.96, Q3 + 1.5(IQR) = 4.03 + 1.5(4.03 – 3.2) = 5.28. Because there are no values in the data less than 1.96 or greater than 5.28, there are no outliers by this method either.
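Both outlier conventions applied above can be checked with a few lines of code. This is a sketch under stated assumptions: the helper `quartiles` is hypothetical and uses the median-of-halves convention from this chapter, and the two-standard-deviation cutoff is only one of the conventions mentioned.

```python
from statistics import mean, median, stdev

def quartiles(values):
    """(Q1, Q3) as medians of the lower and upper halves (median excluded if n is odd)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])

# Weekly tobacco spending (pounds) for 11 regions, from the example above
spending = [4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51, 4.56]

# Mean-based convention: more than 2 standard deviations from the mean
x_bar, s = mean(spending), stdev(spending)
sd_low, sd_high = x_bar - 2 * s, x_bar + 2 * s

# Median-based convention: the 1.5(IQR) rule
q1, q3 = quartiles(spending)
iqr = q3 - q1
fence_low, fence_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

sd_outliers = [x for x in spending if x < sd_low or x > sd_high]
iqr_outliers = [x for x in spending if x < fence_low or x > fence_high]
print(sd_outliers, iqr_outliers)   # [] []
```

Both lists come back empty, in agreement with the worked solution.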

Outliers are important because they will often tell us that something unusual or unexpected is going on with the data that we need to know about. A manufacturing process that produces products so far out of spec that they are outliers often indicates that something is wrong with the process. Sometimes outliers are just a natural, but rare, variation. Often, however, an outlier can indicate that the process generating the data is out of control in some fashion.

Position of a Term in a Distribution

Up until now, we have concentrated on the nature of a distribution as a whole. We have been concerned with the shape, center, and spread of the entire distribution. Now we look briefly at individual cases in the distribution.

Five-Number Summary

There are positions in a dataset that give us valuable information about the dataset. The five-number summary of a dataset is composed of the minimum value, the lower quartile, the median, the upper quartile, and the maximum value.

On the TI-83/84, these are reported on the second screen of data when you do 1-Var Stats as: minX , Q1 , Med , Q3 , and maxX .

example: The following data are standard of living indices for 20 cities: 2.8, 3.9, 4.6, 5.3, 10.2, 9.8, 7.7, 13, 2.1, 0.3, 9.8, 5.3, 9.8, 2.7, 3.9, 7.7, 7.6, 10.1, 8.4, 8.3. Find the 5-number summary for the data.

solution: Put the 20 values into a list on your calculator and do 1-Var Stats . We find: minX=0.3, Q1=3.9, Med=7.65, Q3=9.8 , and maxX=13 .
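Without a TI-83/84, the same five numbers can be found with a short program. The Python sketch below is illustrative only: `five_number_summary` is a hypothetical helper, and it assumes the median-of-halves quartile convention described earlier in the chapter.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using median-of-halves quartiles."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])     # median of the lower half
    q3 = median(ordered[-half:])    # median of the upper half
    return ordered[0], q1, median(ordered), q3, ordered[-1]

indices = [2.8, 3.9, 4.6, 5.3, 10.2, 9.8, 7.7, 13, 2.1, 0.3,
           9.8, 5.3, 9.8, 2.7, 3.9, 7.7, 7.6, 10.1, 8.4, 8.3]

summary = five_number_summary(indices)
print(tuple(round(v, 2) for v in summary))   # (0.3, 3.9, 7.65, 9.8, 13)
```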

Boxplots (Outliers Revisited)

In the first part of this chapter, we discussed three types of graphs: dotplot, stemplot, and histogram. Using the five-number summary, we can add a fourth type of one-variable graph to this group: the boxplot. A boxplot is simply a graphical version of the five-number summary. A box is drawn that contains the middle 50% of the data (from Q1 to Q3) and “whiskers” extend from the lines at the ends of the box (the lower and upper quartiles) to the minimum and maximum values of the data if there are no outliers. If there are outliers, each “whisker” extends only to the most extreme value on its side that is not an outlier. The outliers themselves are marked with a special symbol, such as a point, a box, or a plus sign.

The boxplot is sometimes referred to as a box and whisker plot.

example: Consider again the data from the previous example: 2.8, 3.9, 4.6, 5.3, 10.2, 9.8, 7.7, 13, 2.1, 0.3, 9.8, 5.3, 9.8, 2.7, 3.9, 7.7, 7.6, 10.1, 8.4, 8.3. A boxplot of this data, done on the TI-83/84, looks like this (the five-number summary was [0.3, 3.9, 7.65, 9.8, 13]):

Calculator Tip: To get the graph above on your calculator, go to the STAT PLOTS menu, open one of the plots, say Plot1 , and you will see a screen something like this:

Note that there are two boxplots available. The one that is highlighted is the one that will show outliers. The calculator determines outliers by the 1.5(IQR) rule. Note that the data are in L2 for this example. Once this screen is set up correctly, press ZOOM → 9:STAT to display the boxplot.

example: Using the same dataset as the previous example, but replacing the 10.2 with 20, which would be an outlier in this dataset (the largest possible non-outlier for these data would be 9.8 + 1.5(9.8 – 3.9) = 18.65), we get the following graph on the calculator:

Note that the “whisker” ends at the largest value in the dataset that is not an outlier, 13.
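The whisker endpoints for the modified dataset can be worked out with the 1.5(IQR) rule. The following Python sketch is an illustration (the variable names are assumptions, and quartiles are computed by the median-of-halves convention used in this chapter); it does not draw the boxplot, it only finds the values a boxplot would display.

```python
from statistics import median

# The previous dataset with 10.2 replaced by 20
data = [2.8, 3.9, 4.6, 5.3, 20, 9.8, 7.7, 13, 2.1, 0.3,
        9.8, 5.3, 9.8, 2.7, 3.9, 7.7, 7.6, 10.1, 8.4, 8.3]

ordered = sorted(data)
half = len(ordered) // 2
q1, q3 = median(ordered[:half]), median(ordered[-half:])
iqr = q3 - q1

low_fence = q1 - 1.5 * iqr     # anything below this is a potential outlier
high_fence = q3 + 1.5 * iqr    # anything above this is a potential outlier

outliers = [x for x in ordered if x < low_fence or x > high_fence]
inside = [x for x in ordered if low_fence <= x <= high_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(outliers, whisker_low, whisker_high)   # [20] 0.3 13
```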

Percentile Rank of a Term

The percentile rank of a term in a distribution equals the proportion of terms in the distribution less than the term. A term that is at the 75th percentile is larger than 75% of the terms in a distribution. If we know the five-number summary for a set of data, then Q1 is at the 25th percentile, the median is at the 50th percentile, and Q3 is at the 75th percentile. Some texts define the percentile rank of a term to be the proportion of terms less than or equal to the term. By this definition, being at the 100th percentile is possible.
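The two definitions of percentile rank differ only in whether ties with the term itself are counted. A minimal Python sketch (the function name and the `inclusive` flag are hypothetical, chosen for illustration), applied to the standard of living indices used earlier:

```python
def percentile_rank(data, term, inclusive=False):
    """Percent of values less than `term` (or less than or equal to it if inclusive)."""
    if inclusive:
        count = sum(1 for x in data if x <= term)
    else:
        count = sum(1 for x in data if x < term)
    return 100 * count / len(data)

indices = [2.8, 3.9, 4.6, 5.3, 10.2, 9.8, 7.7, 13, 2.1, 0.3,
           9.8, 5.3, 9.8, 2.7, 3.9, 7.7, 7.6, 10.1, 8.4, 8.3]

print(percentile_rank(indices, 13))                  # 95.0
print(percentile_rank(indices, 13, inclusive=True))  # 100.0
```

The largest value, 13, is at the 95th percentile under the first definition and at the 100th under the second.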

z-Scores

One way to identify the position of a term in a distribution is to note how many standard deviations the term is above or below the mean. The statistic that does this is the z-score: $z = \frac{x - \bar{x}}{s}$.

The z-score is positive when x is above the mean and negative when it is below the mean.

example: $z_3 = 1.5$ tells us that the value 3 is 1.5 standard deviations above the mean. $z_3 = -2$ tells us that the value 3 is two standard deviations below the mean.

example: For the first test of the year, Harvey got a 68. The class average (mean) was 73, and the standard deviation was 3. What was Harvey’s z-score on this test?

solution: $z = \frac{68 - 73}{3} = -1.67$.

Thus, Harvey was 1.67 standard deviations below the mean.

Suppose we have a set of data with mean $\bar{x}$ and standard deviation s. If we subtract $\bar{x}$ from every term in the distribution, it can be shown that the new distribution will have a mean of $\bar{x} - \bar{x} = 0$. If we divide every term by s, then the new distribution will have a standard deviation of s/s = 1. Conclusion: If you compute the z-score for every term in a distribution, the distribution of z-scores will have a mean of 0 and a standard deviation of 1.
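This conclusion is easy to check numerically. The sketch below is illustrative only: `z_score` is a hypothetical helper, and the six-number dataset is borrowed from the standard deviation example earlier in the chapter.

```python
from statistics import mean, stdev

def z_score(x, center, spread):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - center) / spread

# Harvey's test: score 68, class mean 73, standard deviation 3
print(round(z_score(68, 73, 3), 2))   # -1.67

# Standardize every term of a small dataset
data = [3, 4, 6, 6, 7, 10]
x_bar, s = mean(data), stdev(data)
z_scores = [z_score(x, x_bar, s) for x in data]

# The mean of the z-scores is 0 and their standard deviation is 1,
# up to floating-point rounding error
print(mean(z_scores), stdev(z_scores))
```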

Calculator Tip: We have used 1-Var Stats a number of times so far. Each of the statistics generated by that command is stored as a variable in the VARS menu. To find, say, $\bar{x}$, after having done 1-Var Stats on L1, press VARS and scroll down to STATISTICS. Once you press ENTER to access the STATISTICS menu, you will see several lists. $\bar{x}$ is in the XY column (as is Sx). Scroll through the other menus to see what they contain. (The EQ and TEST menus contain saved variables from procedures studied later in the course.)

To demonstrate the truth of the assertion about a distribution of z-scores in the previous paragraph, do 1-Var Stats on, say, data in L1. Then move the cursor to the top of L2 and enter (L1 – $\bar{x}$)/Sx, getting the $\bar{x}$ and Sx from the VARS menu. This will give you the z-score for each value in L1. Now do 1-Var Stats L2 to verify that $\bar{x}$ = 0 and Sx = 1. (Well, for $\bar{x}$, you might get something like 5.127273E–14. That’s the calculator’s quaint way of saying $5.127 \times 10^{-14}$, which is 0.00000000000005127. That’s basically 0.)

You need to be aware that only the most recently calculated set of statistics will be displayed in the VARS Statistics menu—it changes each time you perform an operation on a set of data.

Normal Distribution

We have been discussing characteristics of distributions (shape, center, spread) and of the individual terms (percentiles, z -scores) that make up those distributions. Certain distributions have particular interest for us in statistics, in particular those that are known to be symmetric and mound shaped. The following histogram represents the heights of 100 males whose average height is 70″ and whose standard deviation is 3″.

This is clearly approximately symmetric and mound shaped. We are going to model this with a curve that idealizes what we see in this sample of 100. That is, we will model this with a continuous curve that “describes” the shape of the distribution for very large samples. That curve is the graph of the normal distribution . A normal curve , when superimposed on the above histogram, looks like this:

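The function that yields the normal curve is defined completely in terms of its mean and standard deviation. Although you are not required to know it, you might be interested to know that the function that defines the normal curve is: $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$.

One consequence of this definition is that the total area under the curve, and above the x-axis, is 1 (for you calculus students, this is because $\int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\,dx = 1$).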

This fact will be of great use to us later when we consider areas under the normal curve as probabilities.
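The claim that the total area equals 1 can be checked numerically without calculus. The Python sketch below is an illustration under stated assumptions: it uses the standard normal density formula, takes the mean 70 and standard deviation 3 from the heights example as the curve's parameters, and approximates the area with the trapezoid rule over 8 standard deviations on either side of the mean (the area beyond that is negligible).

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Height of the normal curve with mean mu and standard deviation sigma at x."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def area(mu, sigma, lo, hi, steps=10_000):
    """Trapezoid-rule approximation of the area under the curve between lo and hi."""
    width = (hi - lo) / steps
    total = 0.5 * (normal_pdf(lo, mu, sigma) + normal_pdf(hi, mu, sigma))
    for i in range(1, steps):
        total += normal_pdf(lo + i * width, mu, sigma)
    return total * width

mu, sigma = 70, 3   # heights example: mean 70 inches, standard deviation 3 inches
total_area = area(mu, sigma, mu - 8 * sigma, mu + 8 * sigma)
print(round(total_area, 6))   # 1.0
```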

68-95-99.7 Rule

The 68-95-99.7 rule, or the empirical rule, states that approximately 68% of the terms in a normal distribution are within one standard deviation of the mean, 95% are within two standard deviations of the mean, and 99.7% are within three standard deviations of the mean. The following three graphs illustrate the 68-95-99.7 rule.
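The three percentages in the rule are rounded values. For any normal distribution, the proportion of terms within k standard deviations of the mean equals erf(k/√2), where erf is the error function available in Python's `math` module. The sketch below (the helper name `within` is hypothetical) shows the more precise figures:

```python
from math import erf, sqrt

def within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```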

Standard Normal Distribution

Because we are dealing with a theoretical distribution, we will use μ and σ, rather than x̄ and s, when referring to the normal curve. If X is a variable that has a normal distribution with mean μ and standard deviation σ (we say “X has N(μ, σ)”), there is a related distribution we obtain by standardizing the data in the distribution to produce the standard normal distribution. To do this, we convert the data to a set of z-scores, using the formula

$$z = \frac{x - \mu}{\sigma}$$

The algebraic effect of this, as we saw earlier, is to produce a distribution of z-scores with mean 0 and standard deviation 1. Computing z-scores is just a linear transformation of the original data, which means that the transformed data will have the same shape as the original distribution. In this case then, the distribution of z-scores is normal. We say z has N(0,1). This simplifies the defining density function to

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^{2}/2}$$

For the standardized normal curve, the 68-95-99.7 rule says that approximately 68% of the terms lie between z = –1 and z = 1, 95% between z = –2 and z = 2, and 99.7% between z = –3 and z = 3. (Trivia for calculus students: the normal curve has a point of inflection one standard deviation on either side of the mean.)

Because many naturally occurring distributions are approximately normal (heights, SAT scores, for example), we are often interested in knowing what proportion of terms lie in a given interval under the normal curve. Problems of this sort can be solved either by use of a calculator or a table of Standard Normal Probabilities (Table A in the appendix to this book). In a typical table, the marginal entries are z -scores, and the table entries are the areas under the curve to the left of a given z -score. All statistics texts have such tables.

example: What proportion of the area under a normal curve lies to the left of z = –1.37?

solution: There are two ways to do this problem, and you should be able to do it either way.

(i) The first way is to use the table of Standard Normal Probabilities. To read the table, move down the left column (titled “z ”) until you come to the row whose entry is –1.3. The third digit, the 0.07 part, is found by reading across the top row until you come to the column whose entry is 0.07. The entry at the intersection of the row containing –1.3 and the column containing 0.07 is the area under the curve to the left of z = –1.37. That value is 0.0853.

(ii) The second way is to use your calculator. It is the more accurate and more efficient way. In the DISTR menu, the second entry is normalcdf (see the next Calculator Tip for a full explanation of the normalpdf and normalcdf functions). The calculator syntax for a standard normal distribution is normalcdf(lower bound, upper bound). In this example, the lower bound can be any large negative number, say –100: normalcdf(-100,-1.37) = 0.0853435081.

Calculator Tip: Part (ii) of the previous example explained how to use normalcdf for standard normal probabilities. If you are given a nonstandard normal distribution, the full syntax is normalcdf(lower bound, upper bound, mean, standard deviation). If only the first two parameters are given, the calculator assumes a standard normal distribution (that is, μ = 0 and σ = 1). You will note that there is also a normalpdf function, but it really doesn't do much for you in this course. normalpdf(X) returns the y-value on the normal curve at X. You are almost always interested in the area under the curve between two points on the number line, so normalcdf is the function you will use most in this course.
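For readers working away from a TI calculator, here is a rough Python analogue of normalcdf. It is only a sketch: the function name and argument order are chosen to imitate the calculator syntax, and it relies on the standard-library `NormalDist` class rather than anything specific to this course.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mean=0.0, sd=1.0):
    """Area under the N(mean, sd) curve between lower and upper.

    With only two arguments, a standard normal curve is assumed,
    matching the calculator's default behavior described above.
    """
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(upper) - dist.cdf(lower)

# The earlier example: area to the left of z = -1.37.
print(round(normalcdf(-100, -1.37), 4))
```

As on the calculator, any sufficiently large negative lower bound (here –100) serves as a stand-in for negative infinity.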

For older versions of the TI-83/84 calculators, it can be difficult to remember the parameters that go with the various functions on your calculator: knowing, for example, that for normalcdf, you put lower bound, upper bound, mean, standard deviation in the parentheses. The APP “CtlgHelp” can remember for you. It comes on the older versions of the TI-83/84 and is activated by choosing CtlgHelp from the APPS menu and pressing ENTER twice. Newer versions have menus with prompts, making this APP unnecessary.

To use CtlgHelp, move the cursor to the desired function on the DISTR menu and press +. The function syntax will be displayed. Then press ENTER to use the function on the home screen. Note that, at this writing, CtlgHelp does not work for invT or χ² GOF. The following is a screen capture of using CtlgHelp that displays the parameters for normalcdf:

example: What proportion of the area under a normal curve lies between z = –1.2 and z = 0.58?

solution: (i) Reading from Table A, the area to the left of z = –1.2 is 0.1151, and the area to the left of z = 0.58 is 0.7190. The geometry of the situation (see below) tells us that the area between the two values is 0.7190 – 0.1151 = 0.6039.

(ii) Using the calculator, we have normalcdf(-1.2, 0.58) = 0.603973005. Round to 0.6040 (difference from the answer in part (i) caused by rounding).

example: In an earlier example, we saw that heights of men are approximately normally distributed with a mean of 70″ and a standard deviation of 3″. What proportion of men are more than 6′ (72″) tall? Be sure to include a sketch of the situation.

solution: (i) Another way to state this is to ask what proportion of terms in a normal distribution with mean 70 and standard deviation 3 are greater than 72. In order to use the table of Standard Normal Probabilities, we must first convert to z-scores. The z-score corresponding to a height of 72″ is

$$z = \frac{72 - 70}{3} = 0.67$$

The area to the left of z = 0.67 is 0.7486. However, we want the area to the right of 0.67, and that is 1 – 0.7486 = 0.2514.

(ii) Using the calculator, we have normalcdf(0.67,100) = 0.2514. We could get the answer from the raw data as follows: normalcdf(72,1000,70,3) = 0.2525, with the difference being due to rounding the z-score to 0.67. (As explained in the last Calculator Tip, simply add the mean and standard deviation of a nonstandard normal curve to the list of parameters for normalcdf.)
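The small gap between 0.2514 and 0.2525 comes entirely from rounding z to two decimal places before looking it up. A short Python sketch, assuming the standard-library `NormalDist` class and variable names invented for this illustration, makes the comparison explicit:

```python
from statistics import NormalDist

heights = NormalDist(mu=70, sigma=3)

# Working from the raw height: area to the right of 72 inches, z not rounded.
exact = 1 - heights.cdf(72)

# Table-style: round z to two decimals first, then use the standard normal curve.
z = round((72 - 70) / 3, 2)
table_style = 1 - NormalDist().cdf(z)

print(round(exact, 4), round(table_style, 4))
```

Either answer is acceptable on the exam; the point is only that the two methods are the same calculation carried out at different precision.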

example: For the population of men in the previous example, how tall must a man be to be in the top 10% of all men in terms of height?

solution: This type of problem has a standard approach. The idea is to express $z_x$ in two different ways (which are, of course, equal since they are different ways of writing the z-score for the same point): (i) as a numerical value obtained from Table A or from your calculator, and (ii) in terms of the definition of a z-score.

(i) We are looking for the value of x in the drawing. Look at Table A to find the nearest table entry equal to 0.90 (because we know an area, we need to read the table from the inside out to the margins). It is 0.8997 and corresponds to a z-score of 1.28.

So $z_x = 1.28$. But also,

$$z_x = \frac{x - 70}{3}$$

So

$$\frac{x - 70}{3} = 1.28 \Rightarrow x = 70 + 1.28(3) = 73.84$$

A man would have to be at least 73.84′′ tall to be in the top 10% of all men.

(ii) Using the calculator, the z -score corresponding to an area of 90% to the left of x is given by invNorm(0.90) = 1.28. Otherwise, the solution is the same as is given in part (i). See the following Calculator Tip for a full explanation of the invNorm function.

Calculator Tip: invNorm essentially reverses normalcdf. That is, rather than reading from the margins in, it reads from the table out (as in the example above). invNorm(A) returns the z-score that corresponds to an area equal to A lying to the left of z. invNorm(A, μ, σ) returns the value of x that has area A to the left of x if x has N(μ, σ).
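A comparable inverse lookup is available in Python. The sketch below wraps the standard-library `NormalDist.inv_cdf` method in a function named `inv_norm`; that name and its argument order are assumptions made here to resemble the calculator syntax.

```python
from statistics import NormalDist

def inv_norm(area, mean=0.0, sd=1.0):
    """Value with the given area to its left under the N(mean, sd) curve."""
    return NormalDist(mu=mean, sigma=sd).inv_cdf(area)

z = inv_norm(0.90)           # z-score with 90% of the area to its left
x = inv_norm(0.90, 70, 3)    # height marking the top 10% of men

print(round(z, 2), round(x, 2))
```

The second call does the standardizing and un-standardizing in one step, so it agrees with 70 + 3z up to rounding.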

Chebyshev's Rule (Optional, not part of the AP Curriculum)

The 68-95-99.7 rule works fine as long as the distribution is approximately normal. But what do you do if the shape of the distribution is unknown or distinctly nonnormal (as, say, skewed strongly to the right)? Remember that the 68-95-99.7 rule told you that, in a normal distribution, approximately 68% of the data are within one standard deviation of the mean, approximately 95% are within two standard deviations, and approximately 99.7% are within three standard deviations. Chebyshev's rule isn't as strong as the empirical rule, but it does provide information about the percent of terms contained in an interval about the mean for any distribution.

Let k be a number of standard deviations. Then, according to Chebyshev's rule, for k > 1, at least $\left(1 - \frac{1}{k^2}\right) \times 100\%$ of the data lie within k standard deviations of the mean. For example, if k = 2.5, then Chebyshev's rule says that at least $1 - \frac{1}{2.5^2} = 0.84$, or 84%, of the data lie within 2.5 standard deviations of the mean. If k = 3, note the difference between the 68-95-99.7 rule and Chebyshev's rule. The 68-95-99.7 rule says that approximately 99.7% of the data are within three standard deviations of x̄. Chebyshev's rule says that at least $1 - \frac{1}{3^2} = \frac{8}{9} \approx 88.9\%$ of the data are within three standard deviations of x̄. This also illustrates what was said in the previous paragraph about the empirical rule being stronger than Chebyshev's. Note that, if at least $1 - \frac{1}{k^2}$ of the data are within k standard deviations of x̄, it follows (algebraically) that at most $\frac{1}{k^2}$ lie more than k standard deviations from x̄.

Knowledge of Chebyshev's rule is not required in the AP Exam, but its use is certainly okay and is common enough that it will be recognized by AP readers.
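The arithmetic behind Chebyshev's rule is simple enough to put in two short functions. This is a minimal sketch; the function names are invented for this illustration.

```python
def chebyshev_min_within(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's rule is informative only for k > 1")
    return 1 - 1 / k**2

def chebyshev_max_outside(k):
    """Maximum proportion of data more than k standard deviations from the mean."""
    return 1 - chebyshev_min_within(k)

print(chebyshev_min_within(2.5))            # the k = 2.5 example
print(round(chebyshev_min_within(3), 4))    # the k = 3 example
```

These are guaranteed bounds for any distribution, not estimates, which is why they are so much weaker than the empirical rule's figures for a normal curve.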

Rapid Review

  1. Describe the shape of the histogram below:

Answer: Bimodal, somewhat skewed to the left.

  2. For the graph of problem #1, would you expect the mean to be larger than the median or the median to be larger than the mean? Why?

Answer: It is difficult to predict. The general guideline that the mean is lower than the median for distributions that are skewed left applies to smooth unimodal distributions. Bimodal distributions don't necessarily follow that pattern.

  3. The first quartile (Q1) of a dataset is 12 and the third quartile (Q3) is 18. What is the largest value above Q3 in the dataset that would not be a potential outlier?

Answer: Outliers lie more than 1.5 IQRs below Q1 or above Q3. Q3 + 1.5(IQR) = 18 + 1.5(18 – 12) = 27. Any value greater than 27 would be an outlier. 27 is the largest value that would not be a potential outlier.

  4. A distribution of quiz scores has x̄ = 35 and s = 4. Sara got 40. What was her z-score? What information does that give you if the distribution is approximately normal?

Answer:

$$z = \frac{40 - 35}{4} = 1.25$$

This means that Sara's score was 1.25 standard deviations above the mean, which puts it at the 89.4th percentile (normalcdf(-100,1.25)).

  5. In a normal distribution with mean 25 and standard deviation 7, what proportion of terms are less than 20?

Answer: $z_{20} = \frac{20 - 25}{7} = -0.71 \Rightarrow$ Area = 0.2389.

(By calculator: normalcdf(-100, 20, 25, 7) = 0.2375.)

  6. What are the mean, median, mode, and standard deviation of a standard normal curve?

Answer: Mean = median = mode = 0. Standard deviation = 1.

  7. Find the five-number summary and draw the modified boxplot for the following set of data: 12, 13, 13, 14, 16, 17, 20, 28.

Answer: The five-number summary is [12, 13, 15, 18.5, 28]. 28 is an outlier (anything larger than 18.5 + 1.5(18.5 – 13) = 26.75 is an outlier by the 1.5(IQR) rule). Since 20 is the largest nonoutlier in the dataset, it is the end of the upper whisker, as shown in the following diagram:

  8. A distribution is strongly skewed to the right. Would you prefer to use the mean and standard deviation, or the median and interquartile range, to describe the center and spread of the distribution?

Answer: Because the mean is not resistant and is pulled toward the tail of the skewed distribution, you would prefer to use the median and IQR.

  9. A distribution is strongly skewed to the left (like a set of scores on an easy quiz) with a mean of 48 and a standard deviation of 6. What can you say about the proportion of scores that are between 40 and 56?

Answer: Since the distribution is skewed to the left, we can't say much. We could get a rough estimate if we use Chebyshev's rule. We note that the interval given is the same distance (8) above and below x̄ = 48. Solving 48 + k(6) = 56 gives k = 1.33. Hence, there are at least $1 - \frac{1}{1.33^2} \approx 0.435$, or about 43.5%, of the scores between 40 and 56.
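The five-number summary and outlier check in Rapid Review problem 7 can be reproduced with a few lines of Python. This is a sketch under one stated assumption: quartiles are taken as the medians of the lower and upper halves of the sorted data, which is the convention that yields the answer given in that problem (other software may use slightly different quartile rules). The function name is invented for this illustration.

```python
from statistics import median

def five_number_summary(data):
    """Min, Q1, median, Q3, max, with quartiles as medians of each half."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]             # values below the median position
    upper = xs[half + n % 2:]     # values above it (middle value skipped if n is odd)
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])

data = [12, 13, 13, 14, 16, 17, 20, 28]
minimum, q1, med, q3, maximum = five_number_summary(data)

iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]
upper_whisker = max(x for x in data if x <= high_fence)

print((minimum, q1, med, q3, maximum), high_fence, outliers, upper_whisker)
```

The upper whisker of the modified boxplot ends at the largest value that is not beyond the fence, not at the fence itself.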

Practice Problems

Multiple-Choice

  1. The following list is ordered from smallest to largest: 25, 26, 26, 30, y, y, y, 33, 150. Which of the following statements is (are) true?

I. The mean is greater than the median
II. The mode is 26
III. There are no outliers in the data

    a. I only
    b. I and II only
    c. III only
    d. I and III only
    e. II and III only

  2. Jenny is 5′10″ tall and is wondering about her height. The heights of girls in the school are approximately normally distributed with a mean of 5′5″ and a standard deviation of 2.6″. What is the percentile rank of Jenny's height?

    a. 59
    b. 65
    c. 74
    d. 92
    e. 97

  3. The mean and standard deviation of a normally distributed dataset are 19 and 4, respectively. 19 is subtracted from every term in the dataset and then the result is divided by 4. Which of the following best describes the resulting distribution?

    a. It has a mean of 0 and a standard deviation of 1.
    b. It has a mean of 0, a standard deviation of 4, and its shape is normal.
    c. It has a mean of 1 and a standard deviation of 0.
    d. It has a mean of 0, a standard deviation of 1, and its shape is normal.
    e. It has a mean of 0, a standard deviation of 4, and its shape is unknown.

  4. The five-number summary for a one-variable dataset is {5, 18, 20, 40, 75}. If you wanted to construct a modified boxplot for the dataset (that is, one that would show outliers if there are any), what would be the maximum possible length of the right side “whisker”?

    a. 35
    b. 33
    c. 5
    d. 55
    e. 53

  5. A set of 5,000 scores on a college readiness exam are known to be approximately normally distributed with mean 72 and standard deviation 6. To the nearest integer value, how many scores are there between 63 and 75?

    a. 0.6247
    b. 4,115
    c. 3,650
    d. 3,123
    e. 3,227

  6. For the data given in #5 above, suppose you were not told that the scores were approximately normally distributed. What can be said about the number of scores that are less than 58 (to the nearest integer)?

    a. There are at least 919 scores less than 58.
    b. There are at most 919 scores less than 58.
    c. There are approximately 919 scores less than 58.
    d. There are at most 459 scores less than 58.
    e. There are at least 459 scores less than 58.

  7. The following histogram pictures the number of students who visited the Career Center each week during the school year.

The shape of this graph could best be described as

    a. Mound-shaped and symmetric
    b. Bimodal
    c. Skewed to the left
    d. Uniform
    e. Skewed to the right

  8. Which of the following statements is (are) true?

I. The median is resistant to extreme values.
II. The mean is resistant to extreme values.
III. The standard deviation is resistant to extreme values.

    a. I only
    b. II only
    c. III only
    d. II and III only
    e. I and III only

  9. One of the values in a normal distribution is 43 and its z-score is 1.65. If the mean of the distribution is 40, what is the standard deviation of the distribution?

    a. 3
    b. –1.82
    c. 0.55
    d. 1.82
    e. –0.55

  10. Free-response questions on the AP Statistics Exam are graded on a 4, 3, 2, 1, or 0 basis. Question #2 on the exam was of moderate difficulty. The average score on question #2 was 2.05 with a standard deviation of 1. To the nearest tenth, what score was achieved by a student who was at the 90th percentile of all students on the test? You may assume that the scores on the question were approximately normally distributed.

    a. 3.5
    b. 3.3
    c. 2.9
    d. 3.7
    e. 3.1

Free-Response

  1. Mickey Mantle played with the New York Yankees from 1951 through 1968. He had the following number of home runs for those years: 13, 23, 21, 27, 37, 52, 34, 42, 31, 40, 54, 30, 15, 35, 19, 23, 22, 18. Were any of these years outliers? Explain.

  2. Which of the following are properties of the normal distribution? Explain your answers.

    a. It has a mean of 0 and a standard deviation of 1.
    b. Its mean = median = mode.
    c. All terms in the distribution lie within four standard deviations of the mean.
    d. It is bell-shaped.
    e. The total area under the curve and above the horizontal axis is 1.

  3. Make a stemplot for the number of home runs hit by Mickey Mantle during his career (from question #1, the numbers are: 13, 23, 21, 27, 37, 52, 34, 42, 31, 40, 54, 30, 15, 35, 19, 23, 22, 18). Do it first using an increment of 10, then do it again using an increment of 5. What can you see in the second graph that was not obvious in the first?

  4. A group of 15 students were identified as needing supplemental help in basic arithmetic skills. Two of the students were put through a pilot program and achieved scores of 84 and 89 on a test of basic skills after the program was finished. The other 13 students received scores of 66, 82, 76, 79, 72, 98, 75, 80, 76, 55, 77, 68, and 69. Find the z-scores for the students in the pilot program and comment on the success of the program.

  5. For the 15 students whose scores were given in question #4, find the five-number summary and construct a boxplot of the data. What are the distinguishing features of the graph?

  6. Assuming that the batting averages in major league baseball over the years have been approximately normally distributed with a mean of 0.265 and a standard deviation of 0.032, what would be the percentile rank of a player who bats 0.370 (as Barry Bonds did in the 2002 season)?

  7. In problem #1, we considered the home runs hit by Mickey Mantle during his career. The following is a stemplot of the number of doubles hit by Mantle during his career. What is the interquartile range (IQR) of this data? (Hint: n = 18.)

Note: The column of numbers to the left of the stemplot gives the cumulative frequencies from each end of the stemplot (e.g., there are 5 values, reading from the top, when you finish the second row). The (5) identifies the location of the row that contains the median of the distribution. It is standard for computer packages to draw stemplots in this manner.

  8. For the histogram pictured below, what proportion of the terms are less than 3.5?

  9. The following graph shows boxplots for the number of career home runs for Hank Aaron and Barry Bonds. Comment on the graphs. Which player would you rather have on your team most seasons? A season in which you needed a lot of home runs?

  10. Suppose that being in the top 20% of people with high blood cholesterol level is considered dangerous. Assume that cholesterol levels are approximately normally distributed with mean 185 and standard deviation 25. What is the maximum cholesterol level you can have and not be in the top 20%?

  11. The following are the salaries, in millions of dollars, for members of the 2001–2002 Golden State Warriors: 6.2, 5.4, 5.4, 4.9, 4.4, 4.4, 3.4, 3.3, 3.0, 2.4, 2.3, 1.3, .3, .3. Which gives a better “picture” of these salaries, mean-based or median-based statistics? Explain.

  12. The following table gives the results of an experiment in which the ages of 525 pennies from current change were recorded. “0” represents the current year, “1” represents pennies one year old, etc.

Describe the distribution of ages of pennies (remember that the instruction “describe” means to discuss center, spread, and shape). Justify your answer.

  13. A wealthy woman is trying to decide whether or not to buy a coin collection that contains 1450 coins. She will buy the collection only if at least 225 of the coins are worth more than $170. The present owner of the collection reports that the average coin in the collection is worth $130 with a standard deviation of $15. Should the woman buy the collection?

  14. The mean of a set of 150 values is 35, its median is 33, its standard deviation is 6, and its IQR is 12. A new set is created by first subtracting 10 from every term and then multiplying by 5. What are the mean, median, variance, standard deviation, and IQR of the new set?

  15. The following graph shows the distribution of the heights of 300 women whose average height is 65″ and whose standard deviation is 2.5″. Assume that the heights of women are approximately normally distributed. How many of the women would you expect to be less than 5′2″ tall?

  16. Which of the following are properties of the standard deviation? Explain your answer.

    a. It's the square root of the average squared deviation from the mean.
    b. It's resistant to extreme values.
    c. It's independent of the number of terms in the distribution.
    d. If you added 25 to every value in the dataset, the standard deviation wouldn't change.
    e. The interval x̄ ± 2s contains 50% of the data in the distribution.

  17. Look again at the salaries of the Golden State Warriors in problem 11 (in millions, 6.2, 5.4, 5.4, 4.9, 4.4, 4.4, 3.4, 3.3, 3.0, 2.4, 2.3, 1.3, .3, .3). Erick Dampier was the highest paid player at $6.2 million. What sort of raise would he need so that his salary would be an outlier among these salaries?

  18. Given the histogram below, draw, as best you can, the boxplot for the same data.

  19. On the first test of the semester, the class average was 72 with a standard deviation of 6. On the second test, the class average was 65 with a standard deviation of 8. Nathan scored 80 on the first test and 76 on the second. Compared to the rest of the class, on which test did Nathan do better?

  20. What is the mean of a set of data where s = 20, Σx = 245, and Σ(x – x̄)² = 13,600?

Cumulative Review Problems

  1. Which of the following are examples of quantitative data?

    a. The number of years each of your teachers has taught
    b. Classifying a statistic as quantitative or qualitative
    c. The length of time spent by the typical teenager watching television in a month
    d. The daily amount of money lost by the airlines in the 15 months after the 9/11 attacks
    e. The colors of the rainbow

  2. Which of the following are discrete and which are continuous?

    a. The weights of a sample of dieters from a weight-loss program
    b. The SAT scores for students who have taken the test over the past 10 years
    c. The AP Statistics exam scores for the almost 50,000 students who took the exam in 2009
    d. The number of square miles in each of the 20 largest states
    e. The distance between any two points on the number line

  3. Just exactly what is statistics and what are its two main divisions?

  4. What are the main differences between the goals of a survey and an experiment?

  5. Why do we need to understand the concept of a random variable in order to do inferential statistics?

Solutions to Practice Problems

Multiple-Choice

  1. The correct answer is (a). I is correct since the mean is pulled in the direction of the large maximum value, 150 (well, large compared to the rest of the numbers in the set). II is not correct because the mode is y: there are three y's and only two 26s. III is not correct because 150 is an outlier (you can't actually compute the upper boundary for an outlier since the third quartile is y, but even if you use a larger value, 33, in place of y, 150 is still an outlier).

  2. The correct answer is (e).

$$z = \frac{70 - 65}{2.6} = 1.92 \Rightarrow \text{percentile} = 0.9726$$

(On the TI-83/84, normalcdf(–100, 1.92) = normalcdf(–1000, 70, 65, 2.6) = 0.9726 up to rounding error.)

  3. The correct answer is (d). The effect on the mean of a dataset of subtracting the same value is to reduce the old mean by that amount (that is, $\mu_{x-k} = \mu_x - k$). Because the original mean was 19, and 19 has been subtracted from every term, the new mean is 0. The effect on the standard deviation of a dataset of dividing each term by the same value is to divide the standard deviation by that value, that is,

$$\sigma_{x/k} = \frac{\sigma_x}{k}$$

Because the old standard deviation was 4, dividing every term by 4 yields a new standard deviation of 1. Note that the process of subtracting the mean from each term and dividing by the standard deviation creates a set of z-scores

$$z = \frac{x - \bar{x}}{s}$$

so that any complete set of z-scores has a mean of 0 and a standard deviation of 1. The shape is normal since any linear transformation of a normal distribution will still be normal.

  4. The correct answer is (b). The maximum length of a “whisker” in a modified boxplot is 1.5(IQR) = 1.5(40 – 18) = 33.

  5. The correct (best) answer is (d). Using Table A, the area under a normal curve between 63 and 75 is 0.6247 ($z_{63} = -1.5 \Rightarrow A_1 = 0.0668$, $z_{75} = 0.5 \Rightarrow A_2 = 0.6915 \Rightarrow A_2 - A_1 = 0.6247$). Then (0.6247)(5,000) = 3123.5. Using the TI-83/84, normalcdf(63,75,72,6) × 5000 = 3123.3.

  6. The correct answer is (b). Since we do not know that the 68-95-99.7 rule applies, we must use Chebyshev's rule.

Since 72 – k(6) = 58, we find k = 2.333. Hence, there are at most $\frac{1}{2.333^2} \approx 0.1837$ of the scores less than 58. Since there are 5,000 scores, there are at most (0.1837)(5,000) = 919 scores less than 58. Note that it is unlikely that there are this many scores below 58 (since some of the 919 scores could be more than 2.333 standard deviations above the mean); it's just the strongest statement we can make.

  7. The correct answer is (e). The graph is clearly not symmetric, bimodal, or uniform. It is skewed to the right since that's the direction of the “tail” of the graph.

  8. The correct answer is (a). The median is resistant to extreme values, and the mean is not (that is, extreme values will exert a strong influence on the numerical value of the mean but not on the median). II and III involve statistics equal to or dependent upon the mean, so neither of them is resistant.

  9. The correct answer is (d).

$$z = 1.65 = \frac{43 - 40}{\sigma} \Rightarrow \sigma = \frac{3}{1.65} = 1.82$$

  10. The correct answer is (b). A score at the 90th percentile has a z-score of 1.28. Thus,

$$z_x = 1.28 = \frac{x - 2.05}{1} \Rightarrow x = 3.33$$

Free-Response

  1. Using the calculator, we find that = 29.78, s = 11.94, Q1 = 21, Q3 = 37. Using the 1.5(IQR) rule, outliers are values that are less than 21 – 1.5(37 – 21) = –3 or greater than 37 + 1.5(37 – 21) = 61. Because no values lie outside of those boundaries, there are no outliers by this rule.

Using the ± 2s rule, we have ± 2s = 29.78 ± 2(11.94) = (5.9, 53.66). By this standard, the year he hit 54 home runs would be considered an outlier.

  1. (a) is a property of the standard normal distribution , not a property of normal distributions in general. (b) is a property of the normal distribution. (c) is not a property of the normal distribution—almost all of the terms are within four standard deviations of the mean but, at least in theory, there are terms at any given distance from the mean. (d) is a property of the normal distribution—the normal curve is the perfect bell-shaped curve. (e) is a property of the normal distribution and is the property that makes this curve useful as a probability density curve.

What shows up when done by 5 rather than 10 is the gap between 42 and 52. In 16 out of 18 years, Mantle hit 42 or fewer home runs. He hit more than 50 only twice.

  1. = 76.4 and s = 10.17.

Using the Standard Normal Probability table, a score of 84 corresponds to the 77.34th percentile, and a score of 89 corresponds to the 89.25th percentile. Both students were in the top quartile of scores after the program and performed better than all but one of the other students. We don”t know that there is a cause-and-effect relationship between the pilot program and the high scores (that would require comparisons with a pretest), but it”s reasonable to assume that the program had a positive impact. You might wonder how the student who got the 98 did so well!

The most distinguishing feature is that the range (43) is quite large compared to the middle 50% of the scores (13). That is, we can see from the graph that the scores are packed somewhat closely about the median. The shape of a histogram of the data would be symmetric and mound shaped.

  1. Area to the left of 3.28 is 0.9995.

That is, Bond”s average in 2002 would have placed him in the 99.95th percentile of batters.

  1. There are 18 values in the stemplot. The median is 17 (actually between the last two 7s in the row marked by the (5) in the count column of the plot —it”s still 17). Because there are 9 values in each half of the stemplot, the median of the lower half of the data, Q1, is the 5th score from the top. So, Q1 = 14. Q3 = the 5th score counting from the bottom = 24. Thus, IQR = 24 – 14 = 10.
  2. There are 3 values in the first bar, 6 in the second, 2 in the third, 9 in the fourth, and 5 in the fifth for a total of 25 values in the dataset. Of these, 3 + 6 + 2 = 11 are less than 3.5. There are 25 terms altogether, so the proportion of terms less than 3.5 is 11/25 = 0.44.
  3. With the exception of the one outlier for Bonds, the most obvious thing about these two is just how similar the two are. The medians of the two are almost identical and the IQRs are very similar. The data do not show it, but with the exception of 2001, the year Bonds hit 73 home runs, neither batter ever hit 50 or more home runs in a season. So, for any given season, you should be overjoyed to have either on your team, but there is no good reason to choose one over the other. However, if you based your decision on who had the most home runs in a single season, you would certainly choose Bonds.
  4. Let x be the value in question. Because we do not want to be in the top 20%, the area to the left of x is 0.8. Hence z x = 0.84 (found by locating the nearest table entry to 0.8, which is 0.7995 and reading the corresponding z -score as 0.84). Then

[Using the calculator, the solution to this problem is given by invNorm (0.8,185,25). ]

  1. = $3.36 million, s = $1.88 million, Med = $3.35 million, IQR = $2.6 million. A boxplot of the data looks like this:

The fact that the mean and median are virtually the same, and that the boxplot shows that the data are more or less symmetric, indicates that either set of measures would be appropriate.

  1. The easiest way to do this is to use the calculator. Put the age data in L1 and the frequencies in L2 . Then do 1-Var Stats L1,L2 (the calculator will read the second list as frequencies for the first list).
  • The mean is 2.48 years, and the median is 2 years. This indicates that the mean is being pulled to the right—and that the distribution is skewed to the right or has outliers in the direction of the larger values.
  • The standard deviation is 2.61 years. Because one standard deviation to left would yield a negative value, this also indicates that the distribution extends further to the right than the left.
  • A histogram of the data, drawn on the TI–83/84, is shown below. This definitely indicates that the ages of these pennies is skewed to the right.
  1. Since we don”t know the shape of the distribution of coin values, we must use Chebyshev”s rule to help us solve this problem. Let k = the number of standard deviations that 170 is above the mean. Then 130 + k ·(15) = 170. So, k ≈ 2.67. Thus, at most , or 14%, of the coins are valued at more than $170. Her requirement was that , or 15.5%, of the coins must be valued at more than $170. Since at most 14% can be valued that highly, she should not buy the collection.
  2. The new mean is 5(35 – 10) = 125.

The new median is 5(33 – 10) = 115.

The new variance is 52 (62 ) = 900.

The new standard deviation is 5(6) = 30.

The new IQR is 5(12) = 60.

  1. First we need to find the proportion of women who would be less than 62′′ tall:

So 0.1151 of the terms in the distribution would be less than 62′′. This means that 0.1151(300) = 34.53, so you would expect that 34 or 35 of the women would be less than 62′′ tall.

  1. a, c, and d are properties of the standard deviation. (a) serves as a definition of the standard deviation. It is independent of the number of terms in the distribution in the sense that simply adding more terms will not necessarily increase or decrease s. (d) is another way of saying that the standard deviation is independent of the mean—it”s a measure of spread, not a measure of center.

The standard deviation is not resistant to extreme values (b) because it is based on the mean, not the median. (e) is a statement about the interquartile range. In general, unless we know something about the curve, we don”t know what proportion of terms are within 2 standard deviations of the mean.

  1. For these data, Q1 = $2.3 million, Q3 = $4.9 million. To be an outlier, Erick would need to make at least 4.9 + 1.5(4.9 – 2.3) = $8.8 million. In other words, he would need a $2.6 million raise in order for his salary to be an outlier.
  2. You need to estimate the median and the quartiles. Note that the histogram is skewed to the left, so that the scores tend to pack to the right. This means that the median is to the right of center and that the boxplot would have a long whisker to the left. The boxplot looks like this:
  3. If you standardize both scores, you can compare them on the same scale. Accordingly,

Nathan did slightly, but only slightly, better on the second test.
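Returning to the salary solution above, the outlier boundary from the 1.5(IQR) rule can be sketched in a few lines of Python. The quartiles come from that solution; the variable names are illustrative only.

```python
q1 = 2.3   # first quartile, in millions of dollars
q3 = 4.9   # third quartile, in millions of dollars

iqr = q3 - q1                   # interquartile range
upper_fence = q3 + 1.5 * iqr    # boundary for high outliers
lower_fence = q1 - 1.5 * iqr    # boundary for low outliers

print(f"upper fence = {upper_fence:.1f} million, lower fence = {lower_fence:.1f} million")
```

The lower boundary is negative here, so only unusually high salaries could be flagged as outliers for these data.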

Solutions to Cumulative Review Problems

  1. a, c, and d are quantitative.
  2. a, d, and e are continuous; b and c are discrete. Note that (d) could be considered discrete if what we meant by “number of square miles” was the integer number of square miles.
  3. Statistics is the science of data. Its two main divisions are data analysis and inference . Data analysis (EDA) utilizes graphical and analytical methods to try to see what the data “say.” That is, EDA looks at data in a variety of ways in order to understand them. Inference involves using information from samples to make statements or predictions about the population from which the sample was drawn.
  4. A survey, based on a sample from some population, is usually given in order to be able to make statements or predictions about the population. An experiment, on the other hand, usually has as its goal studying the differential effects of some treatment on two or more samples, which are often composed of volunteers.
  5. Statistical inference is based on being able to determine the probability of getting a particular sample statistic from a population with a hypothesized parameter. For example, we might ask how likely it is to get 55 heads on 100 flips of a fair coin. If it seems unlikely, we might reject the notion that the coin we actually flipped is fair. The probabilistic underpinnings of inference can be understood through the language of random variables. In other words, we need random variables to bridge the gap between simple data analysis and inference.

CHAPTER 6

One-Variable Data Analysis

  1. The stemplot shows the 2012 median household income for all 50 U.S. states to the nearest thousand dollars. Which statement below is NOT true about the distribution of median incomes?

(A) The majority of states had a median income of at least $50,000.

(B) The median of the median household incomes is less than the mean of the median household incomes.

(C) The distribution is skewed toward the smaller values.

(D) The 20th percentile is a median income of about $44,000.

(E) Three states had a median household income of about $64,000.

  2. Eggs are sorted into sizes (peewee, small, medium, large, extra large, and jumbo) by their mass. In Canada, eggs between 56 and 62 grams are classified as large. Hillside Farm produces eggs whose masses are approximately normally distributed with a mean mass of 59 grams and a standard deviation of 5 grams. The eggs from Blue Sky Farm are also approximately normally distributed, with a mean of 59 grams. A larger proportion of eggs from Blue Sky Farm are classified as large than are those from Hillside Farm. Which statement is true about the distributions of the masses of eggs from the two farms?

(A) The masses of eggs from Blue Sky Farm have a standard deviation that is more than 5 grams.

(B) The masses of eggs from Blue Sky Farm have a standard deviation that is less than 5 grams.

(C) The masses of eggs from Blue Sky Farm have a standard deviation that is about 5 grams.

(D) The masses of eggs from Blue Sky Farm weigh more, on average, than those from Hillside Farm.

(E) There is not enough information to compare the distributions of the masses of eggs from the two farms.

  3. The following histogram shows the poverty rates by household for the 50 states and the District of Columbia.

Which boxplot shows the same distribution?

(A)

(B)

(C)

(D)

(E)

  4. The heights of women ages 30 to 39 years are approximately normally distributed, with a mean of 66 inches and a standard deviation of 2.6 inches. If a particular woman is 61.25 inches tall, her z-score is about

(A) –4.75

(B) –1.83

(C) 0.027

(D) 1.83

(E) 4.75

  5. States provide data on average ACT scores by school. One state reported the average scores for 450 schools. The average scores were approximately normally distributed, with a mean of 21.9 and a standard deviation of 2.5. The shortest interval of average scores that contains 90 schools is

(A) [20.58, 23.22]

(B) [18.69, 21.90]

(C) [21.90, 25.11]

(D) [21.27, 22.53]

(E) [18.69, 25.11]

Answers

  1. C
  2. B
  3. C
  4. B
  5. D
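The answers to the egg-mass, z-score, and ACT-interval questions can be checked numerically. The following is a minimal sketch using Python's standard library; the Blue Sky standard deviation of 4 grams is a hypothetical value chosen only to illustrate a spread smaller than 5 grams, and the variable names are likewise assumptions.

```python
from statistics import NormalDist

# Egg masses: proportion of "large" eggs (56 to 62 grams) at each farm.
hillside = NormalDist(mu=59, sigma=5)
blue_sky = NormalDist(mu=59, sigma=4)   # hypothetical SD, smaller than 5 grams
p_hillside = hillside.cdf(62) - hillside.cdf(56)
p_blue_sky = blue_sky.cdf(62) - blue_sky.cdf(56)

# Heights: z-score of a woman 61.25 inches tall when heights are N(66, 2.6).
z = (61.25 - 66) / 2.6

# ACT scores: 90 of 450 schools is 20% of the distribution. For a symmetric,
# mound-shaped curve, the shortest interval holding 20% is centered on the
# mean, so it runs from the 40th to the 60th percentile.
scores = NormalDist(mu=21.9, sigma=2.5)
share = 90 / 450
low = scores.inv_cdf(0.5 - share / 2)
high = scores.inv_cdf(0.5 + share / 2)

print(round(p_hillside, 3), round(p_blue_sky, 3))
print(round(z, 2), round(low, 2), round(high, 2))
```

With the same mean, the smaller standard deviation concentrates more of the distribution between 56 and 62 grams, which is consistent with answer B for the egg question.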