High School Algebra II Unlocked (2016)

Chapter 8. Inferences and Conclusions from Data

GOALS

By the end of this chapter, you will be able to:


•Understand the difference between sample surveys, experiments, and observational studies, and identify methods that provide representative sample populations for each of these

•Find the standard deviation or estimated standard deviation, as appropriate, for a data set, and understand the relationship between standard deviations, mean, and the normal distribution curve

•Use proportional reasoning, given ratios in a large representative sample population, to make inferences about the target population

•Understand the difference between discrete uniform, continuous uniform, normal, skewed, and bimodal probability distributions

•Use area under a continuous probability distribution curve, either uniform or normal, to calculate probabilities, when appropriate to the situation

•Use simulations to assess how well experimental results match a given model and to determine when results are statistically significant

•Calculate the confidence interval and margin of error for a given sample proportion at various confidence levels and for various sample sizes

•Use statistics and probability concepts to interpret and evaluate given statements and to analyze options in real-world situations

Lesson 8.1. Data Collection

REVIEW

An experiment is a repeated test under controlled conditions, with a known set of possible outcomes, to determine the effect of a treatment or action, in terms of probability.

Experimental probability is probability based on the results of an experiment, or empirical results. Experimental probability is the ratio of the actual number of favorable outcomes to the total number of outcomes in the experiment.

Probability expresses the likelihood of an event, as a ratio of the number of favorable outcomes to the total number of outcomes.

Theoretical probability is an established, generally accepted probability based in theory. Theoretical probability is the ratio of the number of favorable outcomes to the total number of possible outcomes in the sample space.

Statistics is the collection, analysis, and interpretation of numerical data, typically for the purpose of making inferences and conclusions about a larger population through use of a sample subset of that population. It would be very time-consuming, if not impossible, to count every individual soccer player in the United States. However, using statistics, you can estimate the total number of Americans who play soccer based on a smaller sample of the general population.

Probability is an important concept behind statistical analysis, because the true parameters of the larger population (the target population) are typically unknown. We cannot make absolute statements about those parameters, but we can make conclusions in terms of how likely they are. (A parameter is some numerical characteristic of the population, such as its mean or range.)

The selection of the sample population for a study is very important. To obtain the most accurate results in a study of some treatment, or factor that we want to study in terms of its effect, we must attempt to equalize other factors as much as possible. For a fair test of the treatment, we must choose a representative sample of the target population, to minimize introduction of other factors that might influence results. The only way to extract a representative sample is through a random sampling method, in which each member of the target population has as equal a chance of being chosen as possible.

In Heather’s family, one person is chosen each night to take out the trash. What are some examples of fair selection processes, in which each member of the family has an equal probability of being chosen?

Each member of the family could write his or her name on a piece of paper and put the paper into a hat. Someone could then draw a name from the hat, without looking.

Each person could draw a straw from a group of straws (as many straws as there are members of the family), one of which is shorter than the others (but with the bases hidden from view so that no one can tell which is shorter). The person with the short straw takes out the trash.

Each person could write down a number between 1 and 100, and the person whose number first appears in a list of randomly generated numbers between 1 and 100 has to take out the trash. One way to randomly choose a number between 1 and 100, or a series of such numbers, is by using a random number generator (generally a computer program).

An option of choosing the person whose number is closest to a single randomly generated number would not be fair, because certain numbers, such as 1, 2, 99, and 100, are generally less likely to be closest to the randomly generated number.
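The unfairness of the closest-number method can be checked by direct counting. The following is a minimal sketch in Python, not a method from the text: the four names and the numbers they pick are hypothetical, and a tie is assumed to be split evenly among the tied people.

```python
from fractions import Fraction

# Hypothetical picks (not from the text): each person chooses a number from 1 to 100.
picks = {"Heather": 1, "Dad": 40, "Mom": 60, "Sam": 100}


def closest_number_odds(picks, low=1, high=100):
    """Return each person's probability of being chosen when the person
    closest to a single random integer in [low, high] is selected.
    Assumption: a tie is split evenly among the tied people."""
    total = high - low + 1
    odds = {name: Fraction(0) for name in picks}
    for target in range(low, high + 1):
        # Smallest distance from this target to any chosen number
        best = min(abs(target - n) for n in picks.values())
        winners = [name for name, n in picks.items() if abs(target - n) == best]
        for name in winners:
            odds[name] += Fraction(1, total * len(winners))
    return odds


odds = closest_number_odds(picks)
for name, p in odds.items():
    print(f"{name}: {float(p):.3f}")
```

With these particular picks, the people who chose 1 and 100 are selected about 20% of the time, while the people who chose 40 and 60 are selected about 30% of the time. A fair method would give each of the four people a 25% chance.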

Such methods work for small populations, in everyday choices for fairness. For very large populations, you must use other methods, but the goal is still to make probabilities of being chosen as similar as possible across the entire population, for a random sample. If a sample is not chosen randomly, the method of choosing members of the sample introduces bias. Bias is the tendency of a measurement process to favor one group or outcome over others, producing a lopsided view of the population as a whole.

Three ways of obtaining a sample are sample surveys, experiments, and observational studies.

SAMPLE SURVEYS

A sample survey is a survey of members of a smaller sample group from the target population, with the goal of learning something about the target population. For example, you might poll randomly collected samples of Americans to ask if they play soccer, in the hopes of extrapolating to find the total number of soccer players in the country.

The sampling method should be random, so that the sample subset is representative of the target population. If a survey to determine American opinions on the importance of education is limited to a sample group of only people who have college degrees, then the results of the survey will be biased; they will not reflect the opinions of all Americans. Even when the specific correlation is not obvious, a certain subset of the population, sharing some characteristic, may also tend to share other characteristics, affecting how they will respond to the given survey.

Bias in the sample selection can also be viewed as a method of exclusion. Surveying only those Americans who have college degrees excludes all of the Americans who do not have college degrees. This means the sample is not representative of the entire target population.

Principal Ruiz wants to use student opinion to choose one of two options to become the school’s new mascot. Which of the following sample methods will give him the most accurate reflection of overall student opinion on the issue?

A) Poll the 80 students that belong to sports teams at the school.

B) Poll the 150 students in the freshman class.

C) Poll the first 30 students who arrive at school the next morning.

D) Poll every fifth student from an alphabetized list of all 640 students at the school.

A group of just the sports team members or just the freshmen does not accurately represent the entire student body. Students within one of those groups may vote differently, on the whole, from other students in the school, for one reason or another.

The first 30 students who arrive at school the next morning provide more of a random sample, as does the group made up of every fifth student on an alphabetized list of all students. To compare these two options, look at the number of people being surveyed. Every fifth student from a list of 640 students means 1/5 of 640, or 128. This is a much larger sample size than 30, so it is much more likely to provide an accurate representation of the opinions of the full student body. Also, the students arriving earliest at the school may share certain characteristics, making them a less representative sample. Principal Ruiz should poll every fifth student from an alphabetized list of all students at the school.

Even though the 150 students in the freshman class constitute a larger sample population, they are not representative of the entire student body and may skew the data. The 128 students chosen randomly are a better sample population for the survey, especially considering that the sample size is not substantially different.
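Choosing every fifth student from a list is a form of systematic sampling, and the count of 128 can be confirmed with a short sketch. The roster here is a stand-in list of hypothetical ID numbers 1 through 640, since the text gives no actual names, and starting the count at the fifth student is an assumption.

```python
# Stand-in for the alphabetized roster: hypothetical IDs 1..640 in list order.
roster = list(range(1, 641))


def every_kth(items, k):
    """Systematic sample: take every k-th item, starting with the k-th."""
    return items[k - 1::k]


sample = every_kth(roster, 5)
print(len(sample))   # number of students polled
print(sample[:3])    # first few selected positions on the list
```

The helper name `every_kth` is illustrative only. Any starting position from the first to the fifth student would give the same sample size of 128 for a list of 640.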

The most important factors determining the accuracy of a sample survey, in terms of sampling methods, are the randomization of the selection process and the sample size. A sample should ideally be both randomly selected and large.

There are other ways in which a survey or its sample selection may be biased. If a survey is open to anyone who wants to take it, then there is a voluntary bias (often called voluntary response bias), because the types of people who choose to take the survey may share similar characteristics and not represent the opinions of the entire population. The ways in which survey questions are worded can also skew the data, by encouraging certain answers.

Anna wants to determine how town residents feel about a recent proposal to replace Maplewood Park with a mall. In her survey, she included the following question: “Maplewood Park has been a valued part of our town for the past 175 years. Would you rather preserve this public space or allow the proposed Maplewood Mall to take it over?” How is this question biased? How might Anna reword the question so that it is unbiased?

This question shows the park in a positive light, as both a “valued part of our town for the past 175 years” and as a “public space” to be “preserve(d).” In this way, Anna is encouraging respondents to choose the preservation of the park over the building of the mall.

An unbiased version of the question should present both options on an equal level, without endorsing, directly or indirectly, either option. Anna might simply rewrite the question as “Would you prefer to keep Maplewood Park or replace it with Maplewood Mall?”

Here is how you may see sampling errors on the SAT.

Freewheeling Four is a company that designed a new wheelchair. They want to determine whether this wheelchair will be popular among senior citizens, so they decided to do product testing at five different community centers around the country. They invited all interested community members to attend a demonstration of the wheelchair in the backyard of each community center. Through surveys, Freewheeling Four found that 58% of the senior citizens who attended the demonstrations were interested in purchasing the wheelchair. Based on this finding, would Freewheeling Four be justified in claiming that the majority of senior citizens would likely purchase the new wheelchair?

A) Yes; Freewheeling Four’s study used a representative sample of the target population, indicating that the conclusion is accurate.

B) Yes; the majority of senior citizens who attended the demonstrations indicated that they would likely purchase the new wheelchair.

C) No; the senior citizens were just one part of the group of all community members who attended the demonstrations, so they do not accurately represent the target population.

D) No; while 58% of the senior citizens who attended the demonstrations indicated that they would likely purchase the new wheelchair, the sample failed to accurately represent the target population.

EXPERIMENTS

Another form of data collection is through experiments. If Michael tosses a coin repeatedly, recording each result as heads or tails, this is an experiment that can be used to determine the probability that a coin tossed in the air will land heads up. The sample size, in this case, is the number of times Michael tosses the coin. This kind of experiment is by its nature random, because the person running the experiment has no way of affecting the outcomes of the tosses. Because this experiment is already random, the sample size becomes very important for determining the accuracy of the results.

In his first experiment, Michael tossed a penny 5 times, with the result {T, T, T, H, T}, where H represents heads and T represents tails. In his second experiment, he tossed a penny 20 times, with the results {H, T, T, T, T, H, H, T, H, T, H, H, H, T, T, H, H, H, H, H}. What was the experimental probability of tails in the first experiment? What was the experimental probability of tails in the second experiment? If Michael combines all the data from both series of coin tosses, as one complete data set, what is the probability of tails in the complete experiment?

In the first experiment, the penny landed on tails 4 out of the 5 times it was tossed, so the probability of tails was 4/5, or 0.8. In the second experiment, the penny landed on tails 8 out of the 20 times it was tossed, so the probability of tails was 8/20, or 0.4. If we combine all the coin toss data, the penny landed on tails a total of 4 + 8 = 12 times, out of a total of 5 + 20 = 25 tosses. For Michael’s complete experiment, the probability of the penny landing on tails was 12/25, or 0.48.
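The same tallies can be reproduced in a few lines of Python. This is only an illustrative sketch; the function name `experimental_probability` is our own label, and exact fractions are used so that the results match the arithmetic above.

```python
from fractions import Fraction

# Michael's recorded tosses, copied from the problem statement.
first = list("TTTHT")
second = list("HTTTTHHTHTHHHTTHHHHH")


def experimental_probability(tosses, outcome="T"):
    """Ratio of tosses showing `outcome` to the total number of tosses."""
    return Fraction(tosses.count(outcome), len(tosses))


p_first = experimental_probability(first)
p_second = experimental_probability(second)
p_combined = experimental_probability(first + second)

print(float(p_first), float(p_second), float(p_combined))
```

Note that the combined probability is computed from the pooled tosses (12 tails out of 25), not by averaging 0.8 and 0.4, which would give each experiment equal weight despite their different sizes.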

Notice that the experimental results get closer to the theoretical probability of a penny landing on tails, 1/2 or 0.5, as the sample size increases. For a very small sample size, it is actually unlikely that a coin will land on tails exactly half the time. On the other hand, for a very large sample size, the portion of the time that the penny lands on tails will be very close to 0.5.

It’s also very unlikely that a coin will land on tails exactly half the time in a very large number of tosses. However, the ratio of outcomes of tails to total number of outcomes becomes very close to 0.5.

The average of the results of performing the same experiment a large number of times will be close to the expected value, or theoretical probability, and it will tend to move closer to that expected value as more trials are performed.
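A simulation can illustrate this tendency. The sketch below uses Python's standard `random` module as a stand-in for a physical coin; the seed value and the list of sample sizes are arbitrary assumptions, chosen only so that the run is repeatable.

```python
import random


def tails_proportion(num_tosses, rng):
    """Simulate num_tosses fair coin tosses; return the fraction that are tails."""
    tails = sum(1 for _ in range(num_tosses) if rng.random() < 0.5)
    return tails / num_tosses


rng = random.Random(2016)  # arbitrary seed, only for repeatable output
for n in (10, 100, 10_000, 100_000):
    print(n, tails_proportion(n, rng))
```

In a typical run, the proportions for the small sample sizes wander noticeably from 0.5, while those for the large sample sizes sit very near it. Individual runs vary, so a single small sample may happen to land closer to 0.5 than a larger one.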

Experiments are not always related to processes with known (theoretical) probabilities. When a researcher tests a new medication on a sample group of people, the method is again called an experiment. As with sample surveys, it is important to randomly select the sample group from the target population. In addition, a controlled experiment, such as a test of a new medication, must include a control group, which does not receive the treatment being tested, so that the researcher can compare results. The effects of the treatment are only measurable in reference to a baseline of how the subjects would fare without the treatment.

Returning to the Freewheeling Four question: the company visited only five locations nationwide, which is a very small sample size for the entire country. The locations visited were community centers, but not all senior citizens visit community centers. (It’s possible, for example, that senior citizens with more money are more likely to purchase the new wheelchair, while being less likely to visit community centers.) Finally, by having their demonstrations in the backyard, with only those curious to watch it attending, they introduced a voluntary bias.

So, even though the majority of senior citizens who attended the demonstrations would be interested in purchasing the new wheelchair, the sample does not accurately represent the target population of all senior citizens in the United States. The correct answer is (D).

Neal tested his vitamin formula on a randomly selected group of 50 subjects who had colds. The cold symptoms went away within a week for 34 of the members of this sample population. Neal advertises his formula by saying, “68% of people with colds who take my formula recover within a week.” If a group of 60 different subjects with colds, also randomly selected from the same population but not given Neal’s formula, included 39 people whose symptoms went away within a week, how might this affect someone’s view of Neal’s results?

Neal’s statement may be true, because 68% of his sample population recovered within a week after taking his vitamin formula. However, in the second group, 39/60 = 65% of subjects with colds who did not take Neal’s formula also recovered within a week. These percentages are not very far apart, so there doesn’t appear to be much of an impact of the vitamin formula. Without a point of reference for comparison, someone might be impressed by Neal’s statement. With the control results, he or she would likely find little if any correlation between consumption of the vitamin formula and recovery rate.
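The comparison between the two groups comes down to two ratios, which can be computed as follows. This is a minimal sketch with variable names of our own choosing; it only computes the two recovery rates and their gap, and it does not test whether that gap is statistically significant.

```python
from fractions import Fraction

treated = Fraction(34, 50)   # recovered within a week after taking the formula
control = Fraction(39, 60)   # recovered within a week without the formula

difference = treated - control

print(f"treated group: {float(treated):.0%}")
print(f"control group: {float(control):.0%}")
print(f"difference:    {float(difference):.0%}")
```

Each rate divides the number of recoveries by the size of its own group (50 for the treated group, 60 for the control group), which is what makes the two percentages comparable even though the groups differ in size. The gap between them is 3 percentage points.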

Notice that in the case of Neal’s experiment, the target population is people with colds, so any sample, whether for the test group or for the control group, must come from the population of people with colds. It would not make sense to give healthy test subjects the vitamin formula, nor to include already healthy people in the control group. In a study of how a formula affects cold recovery rates, the only meaningful comparison is of people with colds who take the formula and people with colds who do not take the formula, with all other characteristics or factors equalized as much as possible between the two groups (achieved largely through random sample selection).

OBSERVATIONAL STUDIES

An observational study is similar to an experiment in that the researcher is studying the effect of some variable or treatment on some other variable, but is different in that it only involves observations. While an experiment is a method of applying a treatment (or treatments) to one or more groups and comparing the results with a control group, an observational study is a method of collecting and interpreting observations, without interfering with the subjects or variables in any way. The researcher is still responsible for randomly selecting a representative sample population from the target population, but the separation of subjects into a treatment group and a control group is beyond his or her control.

Lucille hypothesizes that a certain pesticide has negative impacts on the health of those who come into regular contact with it. To test this theory, she uses an observational study with a sample group of fruit pickers who regularly handle fruit at farms using this pesticide, as well as a control group of people who do not come in direct contact with the pesticide. What are the advantages of using an observational study instead of a controlled experiment in this case? How might using this observational study create bias that skews the results?

In a controlled experiment, Lucille would “treat” certain randomly selected subjects with the pesticide. Considering that there are potential health risks associated with exposure to the pesticide, it is unethical to expose people to the pesticide who would not otherwise be exposed. By using an observational study, Lucille can compare the health of people who are in regular contact with the pesticide to the health of those who are not, without subjecting anyone who is not normally in contact with it to the pesticide.

Lucille has no control over the composition of the population of fruit pickers at these farms, so she cannot ensure that they are a representative sample of the target population, or that they are similar in composition to the randomly selected control group. The fruit pickers at the farms using the pesticide may share certain characteristics, such as ethnicity or economic level, that can make it harder to draw a direct correlation between health and pesticide exposure, because there may be other variables or factors influencing the results.