## 5 Steps to a 5 AP Statistics 2017 (2016)

### STEP 4

### Review the Knowledge You Need to Score High

**CHAPTER 5** Overview of Statistics/Basic Vocabulary

**CHAPTER 6** One-Variable Data Analysis

**CHAPTER 7** Two-Variable Data Analysis

**CHAPTER 8** Design of a Study: Sampling, Surveys, and Experiments

**CHAPTER 9** Probability and Random Variables

**CHAPTER 10** Binomial Distributions, Geometric Distributions, and Sampling Distributions

**CHAPTER 11** Confidence Intervals and Introduction to Inference

**CHAPTER 12** Inference for Means and Proportions

**CHAPTER 13** Inference for Regression

**CHAPTER 14** Inference for Categorical Data: Chi-Square

### CHAPTER 5

### Overview of Statistics/Basic Vocabulary

**IN THIS CHAPTER**

**Summary:** Statistics is the science of data analysis. This involves activities such as the collection of data, the organization and analysis of data, and drawing inferences from data. *Statistical methods* and *statistical thinking* can be thought of as using common sense and statistical tools to analyze and draw conclusions from data.

Statistics has been developing as a field of study since the sixteenth century. Historical figures, some of whom you may have heard of, such as Isaac Newton, Abraham De Moivre, Carl Friedrich Gauss, Adolphe Quetelet, Florence Nightingale (yes, Florence Nightingale!), Sir Francis Galton, Karl Pearson, Sir Ronald Fisher, and John Tukey have been major contributors to what we know today as the science of statistics.

Statistics is one of the most practical subjects studied in school. A mathematics teacher may have some trouble justifying the everyday use of algebra for the average citizen, but no statistics teacher ever has that problem. We are bombarded constantly with statistical arguments in the media and, through real-life examples, we can develop the skills to become intelligent consumers of numerical-based knowledge claims.

**Key Ideas**

- The Meaning of Statistics
- Quantitative versus Qualitative Data
- Descriptive versus Inferential Statistics
- Collecting Data
- Experiments versus Observational Studies
- Random Variables

**Quantitative versus Qualitative Data**

**Quantitative data** or **numerical data** are data measured or identified on a numerical scale. **Qualitative data** or **categorical data** are data that can be classified into a group.

**Examples of Quantitative (Numerical) Data:** The heights of students in an AP Statistics class; the number of freckles on the face of a redhead; the average speed on a busy expressway; the scores on a final exam; the concentration of DDT in a creek; the daily temperatures in Death Valley; the number of people jailed for marijuana possession each year

**Examples of Qualitative (Categorical) Data:** Gender; political party preference; eye color; ethnicity; level of education; socioeconomic level; birth order of a person (first-born, second-born, etc.)

There are times that the distinction between quantitative and qualitative data is somewhat less clear than in the examples above. For example, we could view the variable “family size” as a categorical variable if we were labeling a person based on the size of his or her family. That is, a woman would go in category “TWO” if she was married but there were no children. Another woman would be in category “FOUR” if she was married and had two children. On the other hand, “family size” would be a quantitative variable if we were observing families and recording the number of people in each family (2, 4, …). In situations like this, the context will make it clear whether we are dealing with quantitative or qualitative data.

**Discrete and Continuous Data**

Quantitative data can be either **discrete** or **continuous**. **Discrete data** are data that can be counted or listed. **Continuous data** can be measured, or take on values in an interval. The number of heads we get on 20 flips of a coin is discrete; the time of day is continuous. We will see more about discrete and continuous data later on.

**Descriptive versus Inferential Statistics**

Statistics has two primary functions: to *describe* data and to make *inferences* from data. **Descriptive statistics** is often referred to as **exploratory data analysis (EDA)**. The components of EDA are *analytical* and *graphical*. When we have collected some one-variable data, we can examine these data in a variety of ways: look at measures of center for the distribution (such as the mean and median); look at measures of spread (variance, standard deviation, range, interquartile range); graph the data to identify features such as shape and whether or not there are clusters or gaps (using dotplots, boxplots, histograms, and stemplots).
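As a rough illustration of the analytical side of EDA, the sketch below computes the measures of center and spread just listed using Python's standard `statistics` module. The exam scores are made-up data, and the quartile method shown is only one of several conventions, so its interquartile range may differ slightly from the one your calculator or textbook reports.

```python
import statistics

# Hypothetical exam scores for a small class (made-up data for illustration).
scores = [62, 70, 71, 75, 78, 80, 84, 85, 91, 94]

# Measures of center
mean_score = statistics.mean(scores)
median_score = statistics.median(scores)

# Measures of spread
sample_variance = statistics.variance(scores)  # sample variance (divides by n - 1)
sample_sd = statistics.stdev(scores)           # square root of the sample variance
data_range = max(scores) - min(scores)

# Quartiles: statistics.quantiles uses the "exclusive" method by default,
# which is an assumption here; other quartile conventions exist.
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

print("mean:", mean_score, "median:", median_score)
print("variance:", sample_variance, "standard deviation:", round(sample_sd, 2))
print("range:", data_range, "IQR:", iqr)
```

The graphical side of EDA (dotplots, boxplots, histograms, stemplots) is not shown here; it would normally accompany these numbers.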

With two-variable data, we look for relationships between variables and ask questions like: “Are these variables related to each other, and, if so, what is the nature of that relationship?” Here we consider such analytical ideas as correlation and regression, and graphical techniques such as scatterplots. Chapters 6 and 7 of this book are primarily concerned with exploratory data analysis.

Procedures for collecting data are discussed in Chapter 8. Chapters 9 and 10 are concerned with the probabilistic underpinnings of inference.

**Inferential statistics** involves using data from samples to make *inferences* about the population from which the sample was drawn. If we are interested in the average height of students at a local community college, we could select a random sample of the students and measure their heights. Then we could use the average height of the students in our sample to *estimate* the true average height of the population from which the sample was drawn. In the real world, we often are interested in some characteristic of a population (e.g., what percentage of the voting public favors the outlawing of handguns?), but it is often too difficult or too expensive to do a census of the entire population. The common technique is to select a *random sample* from the population and, based on an analysis of the data, make *inferences* about the population from which the sample was drawn. Chapters 11–14 of this book are primarily concerned with inferential statistics.

**Parameters versus Statistics**

Values that describe a sample are called **statistics**, and values that describe a population are called **parameters**. In *inferential statistics*, we use *statistics* to estimate *parameters*. For example, if we draw a sample of 35 students from a large university and compute their mean GPA (that is, the grade point average, usually on a 4-point scale, for each student), we have a *statistic*. If we could compute the mean GPA for *all* students in the university, we would have a *parameter*.
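The GPA example can be sketched in code. The population below is simulated, not real data; the university size, the center of 3.0, and the spread of 0.5 are all assumptions chosen only to make the contrast between a parameter and a statistic concrete.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: GPAs for 20,000 students, clipped to the 0-4 scale.
population = [min(4.0, max(0.0, random.gauss(3.0, 0.5))) for _ in range(20_000)]

# Parameter: describes the whole population (usually unknown in practice).
mu = statistics.mean(population)

# Statistic: describes one random sample of 35 students.
sample = random.sample(population, 35)
x_bar = statistics.mean(sample)

print("population mean GPA (parameter):", round(mu, 3))
print("sample mean GPA (statistic):    ", round(x_bar, 3))
```

In practice we would see only the sample mean; we can compare it with the population mean here only because the whole population was simulated.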

**Collecting Data: Surveys, Experiments, Observational Studies**

In the preceding section, we discussed data analysis and inferential statistics. A question not considered in many introductory statistics courses (but considered in detail in AP Statistics) is how the data are collected. Oftentimes we are interested in collecting data in order to make generalizations about a population. One way to do this is to conduct a **survey**. In a well-designed survey, you take a random sample of the population of interest, compute statistics of interest (like the proportion of baseball fans in the sample who think Pete Rose should be in the Hall of Fame), and use those to make predictions about the population.

We are often more interested in seeing the reactions of persons or things to certain stimuli. If so, we are likely to conduct an **experiment** or an **observational study**. We discuss the differences between these two types of studies in Chapter 8, but both basically involve collecting comparative data on groups (called **treatment** and **control**) constructed in such a way that the only difference between the groups (we hope) is the focus of the study. Because experiments and observational studies are usually done on volunteers, rather than on random samples from some population of interest (it's been said that most experiments are done on graduate students in psychology), the results of such studies may lack generalizability to larger populations. Our ability to generalize involves the degree to which we are convinced that the *only* difference between our groups is the variable we are studying (otherwise some other variable could be producing the responses).

It is *extremely* important to understand that data must be gathered correctly in order to have analysis and inference be meaningful. You can do all the number crunching you want with bad data, but the results will be meaningless.

In 1936, the magazine *The Literary Digest* surveyed some 10 million people in an effort to predict the winner of that year's presidential election. It predicted that Alf Landon would defeat Franklin Roosevelt in a landslide, but the election turned out just the opposite. The *Digest* had correctly predicted the outcome of the preceding five presidential elections using similar procedures, so this was definitely unexpected. Its problem was not the size of the sample on which it based its conclusions. Its problem was the way it collected its data: the *Digest* simply failed to gather a random sample of the population. It turns out that its sampling frame (the list of people from which it drew its sample) was composed of a majority of Republicans. The data were extensive (some 2.4 million ballots were returned), but they weren't representative of the voting population. In part because of the fallout from this fiasco, the *Digest* went bankrupt and out of business the following year. If you are wondering why the *Digest* was wrong this time with essentially the same techniques used in earlier years, understand that 1936 was the heart of the Depression. In earlier years the lists used to select the sample may have been more reflective of the voting public, but in 1936 only the well-to-do, Republicans generally, were in the *Digest*'s sample taken from its own subscriber lists, telephone books, etc.

We look more carefully at sources of bias in data collection in Chapter 8, but the point you need to remember as you progress through the next couple of chapters is that conclusions based on data are only meaningful to the extent that the data are representative of the population being studied.
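A small simulation can make the *Digest* lesson concrete: a huge sample drawn from an unrepresentative frame can miss badly, while a much smaller random sample lands close to the truth. Everything in this sketch is hypothetical; the population size, the share of well-to-do voters, and the support rates are invented for illustration and are not the 1936 figures.

```python
import random

random.seed(1936)  # fixed seed so the sketch is reproducible

def make_voter():
    """Return (well_to_do, supports_r) for one hypothetical voter."""
    well_to_do = random.random() < 0.30
    p_support_r = 0.70 if well_to_do else 0.25  # assumed support rates
    return well_to_do, random.random() < p_support_r

population = [make_voter() for _ in range(200_000)]
true_share = sum(supports_r for _, supports_r in population) / len(population)

# Biased frame: only well-to-do voters can be reached, but the sample is huge.
frame = [voter for voter in population if voter[0]]
big_biased = random.sample(frame, 50_000)
biased_estimate = sum(supports_r for _, supports_r in big_biased) / len(big_biased)

# Simple random sample from the whole population, but far smaller.
small_random = random.sample(population, 1_000)
random_estimate = sum(supports_r for _, supports_r in small_random) / len(small_random)

print("true share supporting R:      ", round(true_share, 3))
print("biased sample of 50,000:      ", round(biased_estimate, 3))
print("random sample of 1,000:       ", round(random_estimate, 3))
```

The biased estimate stays far from the true share no matter how many ballots come back, because a larger sample from the wrong frame only measures the wrong group more precisely.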

In an experiment or an observational study, the analogous issue to a biased sample in a survey is the danger of treatment and control groups being somehow systematically different. For example, suppose we wish to study the effects of exercise on stress reduction. We let 100 volunteers for the study *decide* if they want to be in the group that exercises or in the group that doesn't. There are many reasons why one group might be systematically different from the other, but the point is that any comparison between these two groups is confounded by the fact that the two groups could be different in substantive ways.
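One common remedy for self-selection, taken up with study design in Chapter 8, is to let chance decide who goes into which group. The sketch below randomly assigns 100 volunteers to two groups of 50; the volunteer IDs are hypothetical placeholders.

```python
import random

random.seed(8)  # fixed seed so the sketch is reproducible

# Hypothetical IDs for the 100 volunteers.
volunteers = [f"volunteer_{i:03d}" for i in range(1, 101)]

# Shuffle a copy of the list, then split it in half.
shuffled = volunteers[:]
random.shuffle(shuffled)
exercise_group = shuffled[:50]
control_group = shuffled[50:]

print("first five in exercise group:", exercise_group[:5])
print("first five in control group: ", control_group[:5])
```

Because chance rather than personal preference determines the groups, there is no systematic reason for people who already handle stress well to end up in one group rather than the other.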

**Random Variables**

We consider random variables in detail in Chapter 9, but it is important at the beginning to understand the role they play in statistics. A **random variable** can be thought of as a numerical outcome of a random phenomenon or an experiment. As an example of a *discrete* random variable, we can toss three fair coins, and let *X* be the count of heads; we then note that *X* can take on the values 0, 1, 2, or 3. An example of a *continuous* random variable might be the number of centimeters a child grows from age 5 to age 6.
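For the three-coin example, the values of *X* and their probabilities can be found by listing all eight equally likely outcomes. The sketch below does that enumeration exactly, using fractions; the variable names are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2 * 2 * 2 = 8 equally likely outcomes of tossing three fair coins.
outcomes = list(product("HT", repeat=3))

# X = number of heads in each outcome; count how often each value occurs.
counts = Counter(outcome.count("H") for outcome in outcomes)

# Probability of each value of X as an exact fraction.
distribution = {x: Fraction(counts[x], len(outcomes)) for x in sorted(counts)}

for x, p in distribution.items():
    print(f"P(X = {x}) = {p}")
```

The probabilities come out to 1/8, 3/8, 3/8, and 1/8 for 0, 1, 2, and 3 heads, and they sum to 1.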

An understanding of random variables is what will allow us to use our knowledge of probability (Chapter 9) in statistical inference. Random variables give rise to **probability distributions** (a way of matching outcomes with their probabilities), which in turn give rise to our ability to make probabilistic statements about **sampling distributions** (distributions of sample statistics such as means and proportions). This language, in turn, allows us to talk about the probability of a given sample being as different from expected as it is. This is the basis for inference. All of this will be examined in detail later in this book, but it's important to remember that random variables are the foundation for inferential statistics.
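To give a feel for a sampling distribution, the sketch below simulates many samples of 25 observations of *X* (the number of heads in three fair coin tosses) and records each sample mean. The sample size, the number of repetitions, and the cutoff of 1.9 are arbitrary choices made for illustration.

```python
import math
import random
import statistics

random.seed(9)  # fixed seed so repeated runs match

def count_heads():
    """One observation of X: the number of heads in three fair coin tosses."""
    return sum(random.random() < 0.5 for _ in range(3))

def sample_mean(n):
    """Mean of n independent observations of X (one sample statistic)."""
    return statistics.mean([count_heads() for _ in range(n)])

n = 25
sample_means = [sample_mean(n) for _ in range(5_000)]

center = statistics.mean(sample_means)   # should be near E(X) = 1.5
spread = statistics.stdev(sample_means)  # should be near sqrt(0.75 / n)
theory_spread = math.sqrt(0.75 / n)      # Var(X) = 0.75 for three fair coins

# How unusual is a sample mean of 1.9 or more?
tail_share = sum(m >= 1.9 for m in sample_means) / len(sample_means)

print("mean of sample means:", round(center, 3))
print("SD of sample means:  ", round(spread, 3), "theory:", round(theory_spread, 3))
print("share of sample means >= 1.9:", tail_share)
```

The small share of simulated sample means at or above 1.9 is the kind of probability statement inference relies on: a sample that far from the expected 1.5 would be unusual if the coins really were fair.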

There are a number of definitions in this chapter and many more throughout the book (summarized in the Glossary). Although you may not be asked specific definitions on the AP Exam, you are expected to have the working vocabulary needed to understand any statistical situation you might be presented with. In other words, you need to know and understand the vocabulary presented in the course in order to do your best on the AP Exam.

**Rapid Review**

- True or False: A study is done in which the data collected are the number of cars a person has owned in his or her lifetime. This is an example of *qualitative* data.

*Answer:* False. The data are measured on a numerical, not categorical, scale.

- True or False: The data in the study from the previous question are *discrete*.

*Answer:* True. The data are countable (e.g., Leroy has owned 8 cars).

- What are the names given to values that describe *samples* and values that describe *populations*?

*Answer:* Values that describe samples are called *statistics*, and values that describe populations are called *parameters*. (A mnemonic for remembering this is that both *statistic* and *sample* begin with “s,” and both *parameter* and *population* begin with “p.”)

- What is a *random variable*?

*Answer:* A numerical outcome of an experiment or random phenomenon.

- Why do we need to *sample*?

*Answer:* Because it is usually too difficult or too expensive to observe every member of the population. Our purpose is to make inferences about the unknown, and probably unknowable, parameters of a population.

- Why do we need to take care with data collection?

*Answer:* In order to avoid bias. It makes no sense to try to predict the outcome of the presidential election if we survey *only* Republicans.

- Which of the following are examples of *qualitative* data?

(a) The airline on which a person chooses to book a flight

(b) The average number of women in chapters of the Gamma Goo sorority

(c) The race (African American, Asian, Hispanic, Pacific Islander, White) of survey respondents

(d) The closing Dow Jones average on 50 consecutive market days

(e) The number of people earning each possible grade on a statistics test

(f) The scores on a given examination

*Answer:* a, c, and e are qualitative. f could be either depending on how the data are reported: qualitative if letter grades were given, quantitative if number scores were given.

- Which of the following are *discrete* and which are *continuous*?

(a) The number of jelly beans in a jar

(b) The ages of a group of students

(c) The humidity in Atlanta

(d) The number of ways to select a committee of three from a group of ten

(e) The number of people who watched the Super Bowl in 2002

(f) The lengths of fish caught on a sport fishing trip

*Answer:* Discrete: a, d, e

Continuous: b, c, f

Note that (b) could be considered discrete if by age we mean the integer part of the age—a person is considered to be 17, for example, regardless of where that person is after his/her 17th birthday and before his/her 18th birthday.

- Which of the following are *statistics* and which are *parameters*?

(a) The proportion of all voters who will vote Democrat in the next election

(b) The proportion of voters in a Gallup Poll who say they will vote Democrat in the next election

(c) The mean score of the home team in all NFL games in 2005

(d) The proportion of Asian students among all students attending high school in the state of California

(e) The mean difference score between a randomly selected class taught statistics by a new method and another class taught by an old method

(f) The speed of a car

*Answer:* Statistics: b, e, f

Parameters: a, c, d

### CHAPTER 5

### Overview

- Which of the following would not be considered a quantitative variable?

(A) Heights of students in inches

(B) Intensity of sounds measured in decibels

(C) Numbers of students in various classrooms during the first class period

(D) Zip codes for various locations in your state

(E) Proportions of students in each class that have blue eyes

- Which of the following is a parameter?

(A) The proportion of voters surveyed who favor Candidate A in the upcoming election

(B) The mean amount of television watched per day by all U.S. teens

(C) The proportion of subjects in a medical study that were cured in the group that received the medication

(D) The difference in the proportion of subjects cured who received the medication and the proportion of subjects cured who received the placebo

(E) The number of people in a sample who say they always wear their seatbelt when in a car

- Of the following, which is NOT a random variable?

(A) The average number of children per family in the United States

(B) The number of children in a randomly selected U.S. household

(C) The speed of the next car to drive past your house

(D) The average gas mileage of the next car to drive past your house

(E) The sum of the numbers showing on the next roll of two dice

- A cable news show asks viewers to call in to indicate whether they agree with a particular policy. The proportion of Americans who agree will be reported at the end of the show. Which of these is NOT true about this situation?

(A) The fact that viewers of the cable news show may have similar opinions introduces bias into the estimate of the proportion of Americans who agree.

(B) The fact that those with strong opinions are more likely to call in introduces bias into the estimate of the proportion of Americans who agree.

(C) The sample is unlikely to be representative of the American population.

(D) A random sample would be more likely to give a representative sample.

(E) If enough people call in, the large sample size will compensate for any other biases in the sampling procedure.

- Which of these would be a sampling technique for a well-designed survey of students in a high school?

(A) Number each student on a list of all students' names. Use a random number generator on a calculator to randomly select 50 different numbers. The corresponding students will be in the sample.

(B) Put up posters in the hallways and on classroom doors instructing students to text their responses to a particular number. Those who respond will be in the sample.

(C) Have student council members ask the students in their third-hour class. The students in those third-hour classes will be in the sample.

(D) Put a flyer on each student's locker with the survey questions on it. Those who return their responses to the office will be in the sample.

(E) Put a flyer with the survey questions on it next to the line in the cafeteria. Students can fill them out and drop them in a box in the cafeteria. Those who return the surveys will be in the sample.

**Answers**

**D**, **B**, **A**, **E**, **A** (in question order)