MCAT Physics and Math Review
Chapter 12: Data-Based and Statistical Reasoning
Academic papers are extremely predictable. They generally begin with an abstract that reflects the major points of the rest of the paper. The authors then provide an expanded introduction, materials and methods, data, and discussion. The key to a high-quality research paper is making this discussion unnecessary: any scientist, given the prior sections, should be led to the same conclusions as those drawn by the authors. The testmakers are keenly aware of this fact. On Test Day, you may be presented with research in the form of an experiment-based passage, and part of your task will be inferring the important conclusions that can be supported by the findings of the study.
This chapter covers the last of the Strategic Inquiry and Reasoning Skills tested on the MCAT: the statistical analysis of raw data, interpretation of visual representations of this data, and application of data to answer research questions. We’ll begin by examining basic statistical principles like distribution types, measures of central tendency, and measures of distribution. We’ll also discuss probability, and the semantics of this branch of mathematics. We’ll conclude our discussion of probability and statistics with an exploration of statistical significance in basic hypothesis testing and confidence intervals. Then, we’ll move on to the interpretation of charts and graphs. Finally, we’ll link all of this information with the skills we gained in the last chapter and assess the future use and validity of studies.
12.1 Measures of Central Tendency
Measures of central tendency are those that describe the middle of a sample. How we define middle can vary. Is it the mathematical average of the numbers in the data set? Is it the value that divides the data set in two, with half the sample values above it and half the sample values below? Both of these values can be important, and the difference between them can also provide useful information on the shape of a distribution.
The mean or average of a set of data (more accurately, the arithmetic mean) is calculated by adding up all of the individual values within the data set and dividing the result by the number of values:
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
where \(x_1\) to \(x_n\) are the values of all of the data points in the set and n is the number of data points in the set. As we discussed in the last chapter, the mean may be a parameter or a statistic (as is true of all of the measures of central tendency) depending on whether we are discussing a population or a sample. Mean values are a good indicator of central tendency when all of the values tend to be fairly close to one another. Having an outlier (an extremely large or extremely small value compared to the other data values) can shift the mean toward one end of the range. For example, the average income in the United States is about $70,000, but half of the population makes less than $50,000. In this case, the small number of extremely high-income individuals in the distribution shifts the mean to the high end of the range.
The following data were collected on the ages of attendees at Ray’s birthday party:
23, 22, 25, 22, 22, 24, 36, 20
What is the mean age of the attendees? Is this an appropriate measure for this data?
The mean is the sum of the data points divided by the number of data points:
\[ \bar{x} = \frac{23 + 22 + 25 + 22 + 22 + 24 + 36 + 20}{8} = \frac{194}{8} = 24.25 \]
Because the mean is relatively near most of the values collected for this data set, it may be appropriate. Keep in mind, though, that the presence of an outlier (36) and the fact that the mean is greater than all but two of the values collected indicate that the mean has been shifted toward the high end of the range. The presence of a single outlier does not invalidate the mean, but it does make interpretation in context necessary.
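As a quick check on this example, the calculation can be sketched in Python. This is only an illustration: the helper name `arithmetic_mean` and the variable names are assumptions made here, not part of the original text.

```python
from statistics import mean

# Ages of the attendees, as listed in the example above.
ages = [23, 22, 25, 22, 22, 24, 36, 20]

def arithmetic_mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)

party_mean = arithmetic_mean(ages)

# Values that lie above the mean; a short list here suggests the mean
# has been pulled toward the high end of the range.
above_mean = sorted(age for age in ages if age > party_mean)

print(party_mean)   # 24.25
print(mean(ages))   # same value from the standard library
print(above_mean)   # [25, 36]
```

Only two of the eight values sit above the mean, which is the pattern the explanation describes.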
The median value for a set of data is its midpoint, where half of the data points are greater than the value and half are smaller. In data sets with an odd number of values, the median will actually be one of the data points. In data sets with an even number of values, the median will be the mean of the two central data points. To calculate the median, a data set must first be listed in increasing order. The position of the median can be calculated as follows:
\[ \text{median position} = \frac{n + 1}{2} \]
where n is the number of data values. In a data set with an even number of data points, this equation will yield a noninteger number; for example, in a data set with 18 points, it will be \(\frac{18 + 1}{2} = 9.5\). The median in this case will be the arithmetic mean of the ninth and tenth items in the data set when sorted in ascending order.
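The position rule can be sketched in Python, assuming the standard formula (n + 1)/2 for the 1-based position of the median in a sorted list, which is consistent with the 18-point example. The function names and the two short sample lists are hypothetical, chosen only for illustration.

```python
def median_position(n):
    """1-based position of the median in a sorted list of n values."""
    return (n + 1) / 2

def median_by_position(values):
    """Median found by sorting and then applying the position rule."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("median of an empty data set is undefined")
    pos = median_position(len(ordered))
    lower = int(pos)  # whole-number part of the position
    if pos == lower:
        # Odd n: the position is an integer, so the median is a data point.
        return ordered[lower - 1]
    # Even n: average the two data points on either side of the position.
    return (ordered[lower - 1] + ordered[lower]) / 2

print(median_position(18))               # 9.5
print(median_by_position([3, 1, 2]))     # 2
print(median_by_position([4, 1, 3, 2]))  # 2.5
```

A position of 9.5 is read as "halfway between the ninth and tenth sorted values," which is why the even case averages the two neighbors.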
The median tends to be the least susceptible to outliers, but may not be useful for data sets with very large ranges (the distance between the largest and smallest data point, as discussed later in this chapter) or multiple modes.
Using the same data from the last question, find the median age of the attendees. Comparing this value to the mean, is the median a better or worse indicator of central tendency in this sample?
The first step in finding the median is to order the data from smallest to largest. Our original data was:
23, 22, 25, 22, 22, 24, 36, 20
Reordered, this becomes:
20, 22, 22, 22, 23, 24, 25, 36
n, the number of data points, is 8, so the median will be the average of the fourth and fifth data points. The median is therefore \(\frac{22 + 23}{2} = 22.5\). The median is a better indicator of central tendency for this data than the mean of 24.25. The median is unaffected by the outlier and lies close to most of the values in the data set. One could improve the representativeness of the mean by excluding 36 from the data set, in which case the mean would be approximately 22.57 while the median would be 22.
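The comparison between the two measures can be reproduced with Python's standard `statistics` module. This is a minimal sketch; the variable names are assumptions introduced here.

```python
from statistics import mean, median

# Ages of the attendees, as listed in the example above.
ages = [23, 22, 25, 22, 22, 24, 36, 20]

# The same data with the outlier (36) excluded.
without_outlier = [age for age in ages if age != 36]

full_mean = mean(ages)
full_median = median(ages)
trimmed_mean = round(mean(without_outlier), 2)
trimmed_median = median(without_outlier)

print(full_mean, full_median)        # 24.25 22.5
print(trimmed_mean, trimmed_median)  # 22.57 22
```

Removing the single outlier moves the mean by about 1.7 years but moves the median by only 0.5, which illustrates why the median is the more robust measure here.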
If the mean and the median are far from each other, this implies the presence of outliers or a skewed distribution, as discussed later in this chapter. If the mean and median are very close, this implies a symmetrical distribution.
The median divides the data set into two groups with 50% of values higher than the median and 50% of values lower than it.
The mode, quite simply, is the number that appears the most often in a set of data. There may be multiple modes in a data set, or, if all numbers appear equally often, there may be no mode at all. When we examine distributions, the peaks represent modes. The mode is not typically used as a measure of central tendency for a set of data, but the number of modes, and their distance from one another, is often informative. If a data set has two modes with a small number of values between them, it may be useful to analyze these portions separately or to look for confounding variables that may be responsible for dividing the distribution into two parts.
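One way to sketch mode-finding in Python is shown below. It follows the convention stated above that a data set in which every value appears equally often has no mode. The function name `modes` and the `two_peaks` and `all_unique` lists are hypothetical, made up for illustration; only `ages` comes from the chapter's earlier example.

```python
from collections import Counter

def modes(values):
    """Return a sorted list of the most frequent values.

    Returns an empty list when several distinct values all appear
    equally often, matching the "no mode" convention used in the text.
    """
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

ages = [23, 22, 25, 22, 22, 24, 36, 20]    # from the party example
two_peaks = [1, 2, 2, 2, 3, 7, 8, 8, 8, 9]  # hypothetical bimodal data
all_unique = [4, 5, 6]                      # hypothetical, no repeats

print(modes(ages))        # [22]
print(modes(two_peaks))   # [2, 8]
print(modes(all_unique))  # []
```

In the hypothetical bimodal list, the two modes are far apart with few values between them, the situation in which analyzing the two clusters separately may be worthwhile.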
MCAT Concept Check 12.1:
Before you move on, assess your understanding of the material with these questions.
1. What types of data sets are best analyzed using the mean as a measure of central tendency?
2. Calculate the mean, median, and mode of the following data set:
18, 23, 23, 6, 9, 21, 4, 4, 2
3. True or False: The mean of a sample is considered a parameter.