MCAT Physics and Math Review

Chapter 12: Data-Based and Statistical Reasoning

Conclusion

Congratulations on completing MCAT Physics and Math Review! While it has been a challenging journey, you are now equipped with all of the physics content knowledge and Strategic Inquiry and Reasoning Skills (SIRS) you need to perform well on Test Day. We completed our discussion of the MCAT SIRS by covering the transformation of raw data to actionable information. When taking the MCAT, these concepts may present themselves as the opportunity to use statistical methods and interpretation to draw conclusions, as well as the analysis of figures used as adjuncts to passages and discrete questions. We also briefly reviewed the connections between the real world and research by determining when and how our newfound data can be applied. Ultimately, this will be your role as a physician: constructing a foundation of content knowledge, seeking out new research, and drawing conclusions from that research to improve your patients’ lives and well-being. Good luck as you continue preparing for your MCAT—and your future as an excellent physician.

Concept Summary

Measures of Central Tendency

·        Measures of central tendency provide a single-value representation for the middle of a group of data.

·        The arithmetic mean or average is a measure of central tendency that weights all values equally; of the three measures, it is the one most affected by outliers.

·        The median is the value that lies in the middle of the data set. Half of the data points lie above the median and half lie below it.

·        The mode is the data point that appears most often; there may be multiple (or zero) modes in a data set.
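The sensitivity of the mean to outliers can be illustrated with a minimal sketch. The data values below are hypothetical and invented purely for illustration; only Python's standard `statistics` module is assumed.

```python
from statistics import mean, median, multimode

# Hypothetical data, invented for illustration only.
values = [2, 3, 3, 4, 5]
with_outlier = values + [100]  # same data plus one extreme value

mean_before, median_before = mean(values), median(values)
mean_after, median_after = mean(with_outlier), median(with_outlier)

# The mean shifts substantially; the median barely moves.
print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)

# multimode returns every most-common value, so a tie yields several modes.
tied_modes = multimode([1, 1, 2, 2, 3])
print("modes of a tied data set:", tied_modes)
```

In this sketch the single extreme value pulls the mean far from the bulk of the data, while the median changes only slightly.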

Distributions

·        Distributions have characteristic features that are exemplified by their shape. Distributions can be classified by measures of central tendency and measures of distribution.

·        The normal distribution is symmetrical. The mean, median, and mode are all the same in the normal distribution.

o   The standard normal distribution is a normal distribution with a mean of zero and a standard deviation of one; it is used for most calculations.

o   About 68% of data points occur within one standard deviation of the mean, 95% within two, and 99.7% (roughly 99%) within three.

·        Skewed distributions have differences in their mean, median, and mode; the skew direction is the direction of the tail of the distribution.

·        Bimodal distributions have multiple peaks, although not necessarily multiple modes, strictly speaking. It may be useful to perform data analysis on the two groups separately.
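The percentages quoted for the normal distribution can be checked numerically. This is a small sketch using `statistics.NormalDist` from the Python standard library; the helper name `fraction_within` is an assumption introduced here for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean of zero, standard deviation of one.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def fraction_within(k: float) -> float:
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {fraction_within(k):.1%}")
```

Because any normal distribution can be rescaled to the standard normal, the same fractions apply regardless of the particular mean and standard deviation.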

Measures of Distribution

·        Range is the difference between the largest and smallest values in a data set.

·        Interquartile range is the difference between the value of the third quartile and first quartile; interquartile range can be used to determine outliers.

·        Standard deviation is a measurement of variability about the mean; standard deviation can also be used to determine outliers.

·        Outliers may be a result of true population variability, measurement error, or a non-normal distribution.

·        Procedures for handling outliers should be formulated before the beginning of a study.
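The measures of distribution above can be sketched on a small data set. The values are hypothetical, and the quartile convention used by `statistics.quantiles` (its default "exclusive" method) is an assumption; other quartile conventions can give slightly different cutoffs.

```python
from statistics import mean, quantiles, stdev

# Hypothetical data, invented for illustration only.
data = [10, 12, 13, 14, 15, 16, 17, 18, 40]

data_range = max(data) - min(data)

# Quartiles (default "exclusive" method; other conventions exist).
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

# IQR rule: flag points more than 1.5 x IQR below Q1 or above Q3.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [x for x in data if x < low_fence or x > high_fence]

# Standard deviation rule: flag points more than 3 sd from the mean.
center, sd = mean(data), stdev(data)  # stdev is the sample (n - 1) form
sigma_outliers = [x for x in data if abs(x - center) > 3 * sd]

print("range:", data_range, "IQR:", iqr)
print("outliers by IQR rule:", iqr_outliers)
print("outliers by 3-sd rule:", sigma_outliers)
```

For this particular data set the two rules disagree about the largest value, which is one reason the outlier procedure should be fixed before a study begins.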

Probability

·        The probability of independent events does not change based on the outcomes of other events.

·        The probability of a dependent event changes depending on the outcomes of other events.

·        Mutually exclusive outcomes cannot occur simultaneously.
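As a rough illustration of these definitions, the sketch below enumerates the 36 equally likely outcomes of rolling two fair dice. The dice scenario and the helper name `probability` are assumptions introduced here, not part of the chapter.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first die, second die) outcomes.
outcomes = list(product(range(1, 7), repeat=2))

def probability(event) -> Fraction:
    """Probability of an event given as a true/false function of an outcome."""
    favorable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favorable, len(outcomes))

def first_is_six(outcome):
    return outcome[0] == 6

def second_is_six(outcome):
    return outcome[1] == 6

p_a = probability(first_is_six)
p_b = probability(second_is_six)
p_both = probability(lambda o: first_is_six(o) and second_is_six(o))
p_either = probability(lambda o: first_is_six(o) or second_is_six(o))

# Independent events: P(A and B) equals P(A) x P(B).
print(p_both, "=", p_a * p_b)
# At least one event: P(A or B) = P(A) + P(B) - P(A and B).
print(p_either, "=", p_a + p_b - p_both)

# Mutually exclusive outcomes: one die cannot show 1 and 2 at once.
p_exclusive = probability(lambda o: o[0] == 1 and o[0] == 2)
print(p_exclusive)
```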

Statistical Testing

·        Hypothesis tests use a known distribution to determine whether a hypothesis of no difference (the null hypothesis) can be rejected.

·        Whether or not a finding is statistically significant is determined by the comparison of a p-value to the selected significance level (α). A significance level of 0.05 is commonly used.

·        A confidence interval is a range of values about a sample mean that is used to estimate the population mean. A higher confidence level (95% is common) requires a wider interval.
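The comparison of a p-value to α, and the widening of confidence intervals with confidence level, can be sketched as follows. This is only one possible setup: it assumes a two-tailed z-test with a known population standard deviation, and every number below is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical numbers, invented for illustration only.
sample_mean = 103.0
null_mean = 100.0   # value claimed by the null hypothesis
sigma = 15.0        # population standard deviation, assumed known
n = 100             # sample size
alpha = 0.05        # chosen significance level

z_dist = NormalDist()  # standard normal distribution
standard_error = sigma / sqrt(n)
z = (sample_mean - null_mean) / standard_error

# Two-tailed p-value: probability of a result at least this extreme
# if the null hypothesis were true.
p_value = 2 * (1 - z_dist.cdf(abs(z)))
reject_null = p_value < alpha
print(f"z = {z:.2f}, p = {p_value:.4f}, reject null: {reject_null}")

def confidence_interval(level: float) -> tuple[float, float]:
    """Interval about the sample mean for a given confidence level."""
    z_crit = z_dist.inv_cdf(0.5 + level / 2)
    margin = z_crit * standard_error
    return sample_mean - margin, sample_mean + margin

for level in (0.90, 0.95, 0.99):
    low, high = confidence_interval(level)
    print(f"{level:.0%} CI: ({low:.2f}, {high:.2f})")
```

If the p-value had been greater than α, the conclusion would be a failure to reject the null hypothesis rather than proof that it is true.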

Charts, Graphs, and Tables

·        Pie charts (circle charts) and bar charts are both used to compare categorical data.

·        Histograms and box plots (box-and-whisker plots) are both used to compare numerical data.

·        Maps are used to compare up to two demographic indicators.

·        Linear, semilog, and log–log plots can be distinguished by their axes.

·        Slope can be calculated from linear plots.

·        Tables may contain related or unrelated categorical data.
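The slope calculation, and the reason a semilog plot is useful, can be sketched briefly. The exponential relationship below is hypothetical and chosen only so that the arithmetic is easy to follow.

```python
from math import log10

def slope(p1, p2):
    """Slope between two (x, y) points: change in y over change in x."""
    (x1, y1), (x2, y2) = p1, p2
    return (y2 - y1) / (x2 - x1)

# Hypothetical exponential data: y = 5 * 10^(0.5 x).
xs = [0, 1, 2, 3, 4]
ys = [5 * 10 ** (0.5 * x) for x in xs]

# On linear axes the slope between neighboring points keeps growing.
linear_slopes = [
    slope((xs[i], ys[i]), (xs[i + 1], ys[i + 1])) for i in range(len(xs) - 1)
]

# On a semilog plot (log10 of y against x) the same data fall on a line.
semilog_slopes = [
    slope((xs[i], log10(ys[i])), (xs[i + 1], log10(ys[i + 1])))
    for i in range(len(xs) - 1)
]

print("linear-axis slopes:", [round(s, 2) for s in linear_slopes])
print("semilog slopes:    ", [round(s, 2) for s in semilog_slopes])
```

A constant slope on the semilog axes is the signature of an exponential relationship; a power-law relationship would instead become linear on a log–log plot.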

Applying Data

·        Correlation and causation are separate concepts that are linked by Hill’s criteria.

·        Data must be interpreted in the context of the current hypothesis and existing scientific knowledge.

·        Statistical and practical significance are distinct.

Answers to Concept Checks

·        12.1

1.    The mean is the best measure of central tendency for a data set with a relatively normal distribution. The mean performs poorly in data sets with outliers.

2.     

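§  Mean: $\bar{x} = \frac{2 + 4 + 4 + 6 + 9 + 18 + 21 + 23 + 23}{9} = \frac{110}{9} \approx 12.2$

§  Median: The median position is $\frac{n + 1}{2} = \frac{9 + 1}{2} = 5$. The fifth value of 2, 4, 4, 6, 9, 18, 21, 23, 23 is 9.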

§  Mode: There are two numbers that each appear twice: 4 and 23. These are both modes of this data set.

3.    False. The mean of a sample is a statistic; the mean of a population is a parameter.
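The answers to question 2 can be verified with a short sketch that uses the data set from the answer and Python's standard `statistics` module.

```python
from statistics import mean, median, multimode

# Data set from question 2 of Concept Check 12.1.
data = [2, 4, 4, 6, 9, 18, 21, 23, 23]
n = len(data)

data_mean = mean(data)            # sum of the values divided by n
median_position = (n + 1) / 2     # position of the median in the sorted list
data_median = median(data)
data_modes = multimode(data)      # every value tied for most frequent

print(f"mean = {data_mean:.1f}")
print(f"median position = {median_position:.0f}, median = {data_median}")
print(f"modes = {data_modes}")
```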

·        12.2

1.    The mean of a right (positively) skewed distribution is to the right of the median, which is to the right of the mode.

2.    Any distribution can be mathematically or procedurally transformed to follow a normal distribution by virtue of the central limit theorem, which is beyond the scope of the MCAT. Regardless, a distribution that is not normal may still be analyzed with these measures.

3.    Bimodal distributions have two peaks whereas normal or skewed distributions have only one.
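The ordering described in answer 1 can be seen in a small right-skewed sample. The values are hypothetical and were chosen so that the long tail lies on the right.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed data: most values are small, with a long right tail.
skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]

skew_mode = mode(skewed)
skew_median = median(skewed)
skew_mean = mean(skewed)

# For this sample: mode < median < mean, with the mean pulled toward the tail.
print(skew_mode, skew_median, skew_mean)
```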

·        12.3

1.    Outliers can be defined as data points more than 1.5 × IQR below Q1 or above Q3. They can also be defined as data points more than 3σ above or below the mean. The cutoff values calculated through the two methods are likely to be different, and the selection of one method over the other is one of preference and study design. In general, the use of the standard deviation method is superior.

2.    Where the data are not available, the range can be approximated as four times the standard deviation. For this data set, the relationship fails. The range is 9, which is only a little more than twice the standard deviation. This is because the data set does not fall in a normal distribution.

3.    The average distance from the mean will always be zero. This is why, in calculations of standard deviation, we always square the distance from the mean and then take the square root at the end—it forces all of the values to be positive numbers, which will not cancel out to zero.
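The point made in answer 3 can be checked directly. The data below are hypothetical, and the sketch assumes the sample form of the standard deviation (dividing by n − 1).

```python
from math import isclose, sqrt
from statistics import stdev

# Hypothetical data, invented for illustration only.
data = [3, 5, 7, 9, 11]
n = len(data)
data_mean = sum(data) / n

# Signed distances from the mean cancel, so their average is zero.
deviations = [x - data_mean for x in data]
average_deviation = sum(deviations) / n

# Squaring makes every term non-negative before summing; the square root
# at the end restores the original units.
sample_sd = sqrt(sum(d ** 2 for d in deviations) / (n - 1))

print("average deviation:", average_deviation)
print("standard deviation:", round(sample_sd, 3))
print("matches statistics.stdev:", isclose(sample_sd, stdev(data)))
```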

·        12.4

1.    Simplify this question by rewording it as the probability of not having all girls. Having at least one boy and having all girls are mutually exclusive events, and no other possibilities can occur. Thus, the probability of having all girls is $(0.5)^{10}$ and the probability of having at least one boy is $1 - (0.5)^{10}$, or 99.90%.

2.    Independence is a condition of events wherein the outcome of one event has no effect on the outcome of the other. Mutual exclusivity is a condition wherein two outcomes cannot occur simultaneously.
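The complement approach in answer 1 can be confirmed by brute force. This sketch assumes, as the answer does, ten children with boys and girls equally likely, and it enumerates every possible family.

```python
from itertools import product

n_children = 10
p_girl = 0.5

# Complement rule: P(at least one boy) = 1 - P(all girls).
p_all_girls = p_girl ** n_children
p_at_least_one_boy = 1 - p_all_girls

# Brute-force check: list all 2^10 equally likely families and count
# those containing at least one boy ("B").
families = list(product("BG", repeat=n_children))
families_with_boy = sum(1 for family in families if "B" in family)
enumerated = families_with_boy / len(families)

print(f"complement rule: {p_at_least_one_boy:.2%}")
print(f"enumeration:     {enumerated:.2%}")
```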

·        12.5

1.    Hypothesis tests are used to validate or invalidate a claim that two populations are different, or that one population differs from a given parameter. In a hypothesis test, we calculate a p-value and compare it to a chosen significance level (α) to conclude if an observed difference between two populations (or between a population and the parameter) is significant or not. Confidence intervals are used to determine a potential range of values for the true mean of a population.

2.    If the p-value is greater than α, then we fail to reject the null hypothesis.

3.    After the test statistic is calculated, a computer program or table is consulted to determine the p-value of the statistic.

4.    True. Power is the probability that the individual rejects the null hypothesis when the alternative hypothesis is true for the population.

·        12.6

1.    Linear relationships can be analyzed without any data or axis transformation into semilog or log–log plots.

2.     

| Type of Visual Aid | Pros | Cons |
|---|---|---|
| Pie Chart | Easily constructed; useful for categorical data with a small number of categories. | Easily overwhelmed with multiple categories. Difficult to estimate values with circles. |
| Bar Graph | Multiple organization strategies. Good for large categorical data sets. | Axes are often misleading because of sizeable breaks. |
| Box Plot | Information-dense; can be useful for comparison. | May not highlight outliers or mean value of a data set. Only useful for numerical data. |
| Map | Provides relevant and integrated geographic and demographic information. | May only be used to represent at most two variables coherently. |
| Graph | Provides information about relationships. Useful for estimation. | Axis labels and logarithmic scales require careful interpretation. |
| Table | Categorical data can be presented without comparison. Does not require estimation for calculations. | Disorganized or unrelated data may be presented together. |

3.    Exponential and parabolic curves both have a steep component; however, exponential curves have horizontal asymptotes and become flat on one side while parabolic curves are symmetrical and have steep components on both sides of a center point.

·        12.7

1.    False. As discussed in the last chapter, there must be practical (clinical) as well as statistical significance for a conclusion to be useful.

2.    True. While two variables that are correlated are not necessarily causally related, all variables that are causally related must be correlated in some way (direct relationship, inverse relationship, or otherwise).

Equations to Remember

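(12.1) Arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

(12.2) Median position: $\text{median position} = \frac{n + 1}{2}$

(12.3) Range: $\text{range} = x_{\max} - x_{\min}$

(12.4) Interquartile range: $\text{IQR} = Q_3 - Q_1$

(12.5) Standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$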

(12.6) Probability of two independent events co-occurring:

P(A ∩ B) = P(A and B) = P(A) × P(B)

(12.7) Probability of at least one event occurring:

P(A ∪ B) = P(A or B) = P(A) + P(B) − P(A and B)

(12.8) Slope: $m = \frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1}$

Shared Concepts

·        Behavioral Sciences Chapter 11

o   Social Structure and Demographics

·        Biology Chapter 12

o   Genetics and Evolution

·        General Chemistry Chapter 5

o   Chemical Kinetics

·        Physics and Math Chapter 1

o   Kinematics and Dynamics

·        Physics and Math Chapter 10

o   Mathematics

·        Physics and Math Chapter 11

o   Reasoning About the Design and Execution of Research