Population, sample, distributions, hypothesis testing, p-values, Central Limit Theorem
Four terms (population, parameter, sample, statistic) form the foundation of statistics. Confusing them is a common mistake.
| Term | Definition | Example |
|---|---|---|
| Population | The ENTIRE group you want to study | All 1,000 employees at a company |
| Parameter | A number that describes the whole population | The actual drug use rate across all 1,000 = 10% |
| Sample | A SUBSET of the population selected for study | 200 employees selected for a survey |
| Statistic | A number calculated from the sample; estimates the parameter | Drug use rate in the sample of 200 = 13% |
Probability goes: Population → Sample (given a known population, it describes which samples the sampling process is likely to produce)
Descriptive statistics: Summarizes and describes the SAMPLE itself
Inferential statistics: Uses the sample to draw conclusions about the POPULATION
We study a sample because measuring the entire population is usually impractical or impossible (too expensive, too large, or destructive).
| Measure | Formula | What it tells you | Symbol |
|---|---|---|---|
| Mean | Sum of all values / count | The average; center of the data | Population: μ Sample: x̄ |
| Median | Middle value when sorted | Robust to extreme values (outliers) | — |
| Variance | Average squared distance from mean (the sample version s² divides by n − 1 instead of n) | How spread out the data is | Population: σ² Sample: s² |
| Std Dev | Square root of variance | Spread in original units | Population: σ Sample: s |
| CV | σ/μ (or s/x̄) | Relative variability; compare across different scales | CV |
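As a rough illustration of the measures in the table, the sketch below uses Python's standard `statistics` module. The `scores` list is hypothetical data invented for this example; it includes one high value so the mean and median visibly differ.

```python
import statistics

# Hypothetical data, invented purely for illustration (105 acts as an outlier).
scores = [60, 65, 70, 70, 75, 80, 105]

mean = statistics.mean(scores)          # sum of all values / count
median = statistics.median(scores)      # middle value when sorted
pop_var = statistics.pvariance(scores)  # population variance: divide by n
samp_var = statistics.variance(scores)  # sample variance: divide by n - 1
samp_sd = statistics.stdev(scores)      # square root of the sample variance
cv = samp_sd / mean                     # coefficient of variation (s / x-bar)

print(f"mean={mean}, median={median}")
print(f"population variance={pop_var:.2f}, sample variance={samp_var:.2f}")
print(f"sample SD={samp_sd:.2f}, CV={cv:.1%}")
```

The single high score pulls the mean above the median, which is what "robust to outliers" means in practice.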
Example: Exam mean = 70, SD = 10. Student scores 90. z = (90-70)/10 = 2. They are 2 standard deviations above average.
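The same calculation can be sketched in a few lines of Python. The helper name `z_score` is just an illustrative choice, not part of any library.

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Values from the exam example: mean = 70, SD = 10, student score = 90.
z = z_score(90, 70, 10)
print(z)  # 2.0
```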
Used when: n identical, independent trials; each trial has exactly 2 outcomes (success/failure); probability of success p is constant.
Defined by exactly 2 parameters: n and p. NOT by mean and standard deviation.
Example: Family has 5 children, probability of blood type O = 25% (p=0.25). What is probability exactly 3 have type O?
P(X=3) = C(5,3) × 0.25³ × 0.75² = 10 × 0.0156 × 0.5625 = 0.0879 (about 8.8%)
E(X) = 5 × 0.25 = 1.25 (expect about 1 or 2 children with type O blood)
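A minimal sketch of the blood-type calculation, using only `math.comb` from the standard library. The function name `binomial_pmf` is an illustrative choice.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial distribution with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 5, 0.25                     # 5 children, P(type O) = 0.25
prob_three = binomial_pmf(3, n, p)
expected = n * p                   # E(X) = np
variance = n * p * (1 - p)         # V(X) = np(1 - p)

print(round(prob_three, 4))  # 0.0879
print(expected, variance)    # 1.25 0.9375
```

Summing `binomial_pmf(k, n, p)` over k = 0..5 gives 1, a quick sanity check that the probabilities cover every possible outcome.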
The classic bell-shaped curve. Many natural and measured quantities (heights, weights, IQ scores) are approximately normal.
| Range | % of data within this range |
|---|---|
| Mean ± 1 standard deviation (μ ± 1σ) | 68% |
| Mean ± 2 standard deviations (μ ± 2σ) | 95% |
| Mean ± 3 standard deviations (μ ± 3σ) | 99.7% |
Example: Exam scores have mean = 84, SD = 6. Then:
• 68% of students scored between 78 and 90 (84 ± 6)
• 95% scored between 72 and 96 (84 ± 12)
• 99.7% scored between 66 and 102 (84 ± 18)
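The percentages in the empirical rule are rounded. As a quick check, the sketch below uses `statistics.NormalDist` from the Python standard library with the exam's mean and SD, assuming the scores follow an exactly normal model.

```python
from statistics import NormalDist

exam = NormalDist(mu=84, sigma=6)  # mean and SD from the exam example

def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    low = exam.mean - k * exam.stdev
    high = exam.mean + k * exam.stdev
    return exam.cdf(high) - exam.cdf(low)

for k in (1, 2, 3):
    low, high = exam.mean - k * exam.stdev, exam.mean + k * exam.stdev
    print(f"{low:.0f} to {high:.0f}: {share_within(k):.2%}")
```

The computed shares are close to 68%, 95%, and 99.7%; the 2σ figure is slightly above 95% because the rule rounds for easy recall.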
Measures the strength and direction of the linear relationship between two variables.
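To make "strength and direction of the linear relationship" concrete, here is a small sketch that computes Pearson's correlation coefficient by hand. The function name `pearson_r` and all three data sets are hypothetical, chosen so the results are easy to predict.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical data, invented for illustration.
hours = [1, 2, 3, 4, 5]
rising = [2 * h + 1 for h in hours]   # perfectly linear, increasing
falling = [10 - h for h in hours]     # perfectly linear, decreasing
xs = [-2, -1, 0, 1, 2]
curved = [x * x for x in xs]          # strongly related, but not linearly

r_rising = pearson_r(hours, rising)
r_falling = pearson_r(hours, falling)
r_curved = pearson_r(xs, curved)
print(r_rising, r_falling, r_curved)
```

The third case is the useful one: the points follow y = x² exactly, yet r is 0, because the coefficient only measures linear association.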
If you take many samples from a population of almost ANY shape (the population only needs a finite variance) and compute the mean of each sample, those sample means will form a distribution that approaches normal as the sample size n increases.
Why this matters: It allows us to use normal-distribution-based statistical tests even when the underlying population is NOT normally distributed, provided our sample is large enough (n ≥ 30 is the common rule of thumb).
Also: As sample size increases, the sample mean tends to get closer to the true population mean. Strictly, this is the Law of Large Numbers, a separate result from the CLT.
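A small simulation can make the theorem tangible. The sketch below is only an illustration under stated assumptions: the population is an Exponential(1) distribution (strongly right-skewed, with mean 1 and SD 1), the seed and the number of samples are arbitrary, and `sample_means` and `skewness` are hypothetical helper names.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

def sample_means(n, num_samples=2000):
    """Means of num_samples samples of size n from a skewed Exponential(1) population."""
    return [
        statistics.mean([random.expovariate(1.0) for _ in range(n)])
        for _ in range(num_samples)
    ]

def skewness(values):
    """Rough skewness measure: near 0 for a symmetric shape such as the normal curve."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

results = {}
for n in (2, 30):
    means = sample_means(n)
    results[n] = (statistics.mean(means), statistics.stdev(means), skewness(means))
    print(n, [round(v, 3) for v in results[n]])
```

With n = 30 the sample means cluster around the population mean of 1, their spread is close to σ/√n, and their skewness is much smaller than with n = 2, which is the "approaches normal" behavior the theorem describes.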
Hypothesis testing is a formal procedure for deciding whether evidence from a sample is strong enough to support a claim about the population.
| Test | Used for | Key assumptions |
|---|---|---|
| One-sample t-test | Continuous variable: does sample differ from population mean? | n ≥ 30 or normally distributed; population ≥ 10n |
| Chi-square goodness of fit | Categorical: does sample distribution match expected? | All expected counts ≥ 5; population ≥ 10n |
| Chi-square independence | Categorical: are two variables associated with each other? | H₀ = no association; same count assumptions |
| Kolmogorov-Smirnov (KS) | Non-parametric: are two samples from the same distribution? | Compares cumulative distribution functions (CDFs) |
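As one possible illustration of the chi-square goodness-of-fit row, the sketch below tests whether a die looks fair. The observed counts are hypothetical, and the critical value 11.070 (α = 0.05, 5 degrees of freedom) is hard-coded from a standard chi-square table rather than computed.

```python
# Hypothetical counts from 60 rolls of a die, invented for illustration.
observed = [8, 9, 12, 11, 6, 14]

total = sum(observed)
expected = [total / len(observed)] * len(observed)  # a fair die: 10 per face

# Assumption check from the table: all expected counts should be at least 5.
counts_ok = all(e >= 5 for e in expected)

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

CHI_SQ_CRIT = 11.070  # table value for alpha = 0.05, df = 5
reject_h0 = chi_sq > CHI_SQ_CRIT

print(round(chi_sq, 2), df, counts_ok, reject_h0)  # 4.2 5 True False
```

Here the statistic falls well below the critical value, so these made-up counts give no grounds to reject H₀ (the die is fair).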
Claim: batteries last > 40 hours. Sample: n=15, mean=44.9hr, SD=8.9hr.
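The source gives only the claim and the sample figures, so the following is a sketch under assumptions: H₀: μ = 40 versus Hₐ: μ > 40 (one-tailed), α = 0.05, and battery life roughly normal (needed because n = 15 is below 30). The critical value 1.761 is taken from a standard t-table for 14 degrees of freedom, not computed.

```python
from math import sqrt

# Figures from the battery example.
mu0 = 40.0          # hypothesized mean under H0 (assumed: H0 is mu = 40)
n = 15
sample_mean = 44.9
sample_sd = 8.9

standard_error = sample_sd / sqrt(n)            # SD of the sample mean
t_stat = (sample_mean - mu0) / standard_error   # one-sample t statistic
df = n - 1                                      # degrees of freedom

T_CRIT = 1.761  # one-tailed t-table value for alpha = 0.05, df = 14 (assumed alpha)
reject_h0 = t_stat > T_CRIT

print(round(t_stat, 2), df, reject_h0)  # 2.13 14 True
```

Under these assumptions the t statistic exceeds the critical value, so the sample would support the claim that mean battery life is above 40 hours at the 5% level.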
• Parameter = population; statistic = sample.
• Binomial: 2 parameters only (n, p); E(X) = np; V(X) = np(1-p).
• Normal: 2 parameters (μ, σ); bell-shaped; continuous.
• Empirical rule: 68/95/99.7% within 1/2/3 standard deviations.
• z-score: standardizes a value; |z| > 3 is a common outlier cutoff.
• Correlation: ranges from -1 to +1; 0 = no linear relationship; correlation ≠ causation.
• CLT: large samples → sample means become approximately normally distributed.
• H₀ = no difference (default). Reject H₀ when p < α (commonly α = 0.05).
• t-test for continuous; chi-square for categorical.
Q1. The binomial distribution depends on which two parameters?
Answer: B
The binomial distribution is completely defined by n (number of independent trials) and p (probability of success on each trial). NOT by mean and standard deviation — those describe the normal distribution. Mean and SD are derived FROM n and p: E(X) = np, V(X) = np(1-p).
Q2. For a normal distribution, approximately what percentage of data falls within one standard deviation of the mean?
Answer: D
The empirical rule (68-95-99.7): 68% of data falls within 1 standard deviation (μ ± σ), 95% within 2 standard deviations, and 99.7% within 3 standard deviations. Memorize these three numbers.
Q3. What is the difference between a parameter and a statistic?
Answer: A
A parameter is a numerical measurement describing a characteristic of a POPULATION (e.g., 10% drug use rate among all 1,000 employees). A statistic is a numerical measurement describing a characteristic of a SAMPLE (e.g., 13% drug use rate found in a sample of 200 employees). Memory tip: Parameter = Population, Statistic = Sample.
Q4. The null hypothesis in a statistical test is:
Answer: B
The null hypothesis (H₀) states there is NO difference or NO relationship — it is the default "nothing interesting is happening" assumption. A hypothesis test defaults to H₀ and only rejects it when evidence is strong enough. Importantly, we never "accept" H₀ — we only "fail to reject" it.
Q5. A test gives a p-value of 0.03 and the significance level is 0.05. The correct conclusion is:
Answer: B
Decision rule: if p-value < α, reject H₀. Here 0.03 < 0.05, so we reject H₀. The p-value means: "if H₀ were true, there would be only a 3% chance of seeing data at least this extreme." That's unlikely enough to reject H₀.
Q6. The Central Limit Theorem states that:
Answer: C
The CLT says: take many samples of size n from ANY population. Compute the mean of each sample. The distribution of those sample means will approach normal as n increases — even if the original population is not normally distributed (e.g., it could be bimodal, skewed, or uniform). This is why we can use t-tests with n ≥ 30 without needing a normal population.
Q7. Which test is appropriate for determining whether a continuous variable from a sample differs significantly from a known population mean?
Answer: D
The one-sample t-test is used for continuous (numerical) variables to determine whether a sample mean differs significantly from a known population mean. Chi-square tests are for categorical variables. The KS test is non-parametric and compares two sample distributions. Use t-test for continuous; chi-square for categorical.
Q8. An exam has mean 84 and standard deviation 6. Using the empirical rule, approximately what percentage of students scored between 72 and 96?
Answer: B
72 = 84 - 12 = 84 - 2(6), so 72 is 2 standard deviations below the mean. 96 = 84 + 12 = 84 + 2(6), so 96 is 2 standard deviations above. The empirical rule says 95% of data falls within 2 standard deviations of the mean.
Q9. Correlation coefficients have what range of possible values?
Answer: C
Correlation coefficients range from -1 to +1. +1 = perfect positive linear relationship. -1 = perfect negative linear relationship. 0 = no linear relationship. The closer the absolute value is to 1, the stronger the relationship. Weakest = 0. Strongest = -1 or +1.
Q10. A common misconception about hypothesis tests is that:
Answer: B
A very common misconception is that hypothesis tests select the more likely hypothesis. This is INCORRECT. A hypothesis test defaults to H₀ (no difference) and ONLY rejects it when there is sufficient evidence against it. Even if Hₐ seems more likely based on prior knowledge, the test won't reject H₀ unless the p-value falls below alpha.
Q11. The chi-square test for independence tests:
Answer: B
The chi-square test for independence determines whether two categorical variables are associated (dependent) or independent. H₀ = no association / variables are independent. Example: testing whether website version (A or B) is associated with whether a visitor signs up (yes or no). Both variables are categorical.
Q12. A z-score of -2.5 means:
Answer: B
A z-score of -2.5 means the value is 2.5 standard deviations below the mean. Negative z = below mean; positive z = above mean. The magnitude tells you how far. |z| = 2.5 is not extreme enough to call an outlier (the typical cutoff is |z| > 3). A z-score simply standardizes: z = (x - mean)/SD.
Q13. Which statement about the coefficient of variation (CV) is correct?
Answer: B
CV = standard deviation / mean. It expresses variability as a proportion of the mean, which allows comparison across groups with different scales. Example: comparing spread in mailroom salaries ($25k mean, $2k SD, CV=8%) vs executive salaries ($124k mean, $42k SD, CV=33.9%). Even though executives have a larger SD in dollars, the CV reveals executives have much more relative variability.
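The two CV figures in this explanation can be checked with a short sketch. The function name `coefficient_of_variation` is an illustrative choice; the salary figures are the ones quoted above.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a proportion of the mean."""
    return sd / mean

mailroom_cv = coefficient_of_variation(2_000, 25_000)
executive_cv = coefficient_of_variation(42_000, 124_000)

print(f"{mailroom_cv:.1%}  {executive_cv:.1%}")  # 8.0%  33.9%
```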
Q14. A family has 5 children. The probability of each child having blood type O is 25%. Using the binomial distribution, what is the expected number of children with type O blood?
Answer: C
For a binomial distribution, E(X) = n × p = 5 × 0.25 = 1.25. So on average, 1.25 children in this family would have type O blood. This means the family can expect about 1 or 2 children with type O blood. The variance = np(1-p) = 5 × 0.25 × 0.75 = 0.9375.