Lecture 3: Statistics

Heavy exam topic. Binomial distribution parameters, empirical rule percentages, parameter vs statistic, null hypothesis definition, and p-value interpretation all appeared as sample exam questions in the review slides.

Population, Parameter, Sample, Statistic

These four terms form the foundation of statistics. Confusing them is a common mistake.

Term	Definition	Example
Population	The ENTIRE group you want to study	All 1,000 employees at a company
Parameter	A number that describes the whole population	The actual drug use rate across all 1,000 = 10%
Sample	A SUBSET of the population selected for study	200 employees selected for a survey
Statistic	A number calculated from the sample; estimates the parameter	Drug use rate in the sample of 200 = 13%

Memory rule: Parameter goes with Population; Statistic goes with Sample. Each pair shares its first letter.
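
The table's example can be sketched in a few lines of Python. This is only an illustration: the population list is simulated (100 users out of 1,000 to match the 10% parameter), and the seed is arbitrary, so the sample rate will not necessarily be the 13% quoted above.

```python
import random

random.seed(0)  # arbitrary seed so the sketch is repeatable

# Simulated population: 1,000 employees, 100 of whom (10%) use drugs.
population = [1] * 100 + [0] * 900

# Parameter: computed from the WHOLE population.
parameter = sum(population) / len(population)

# Statistic: computed from a random SAMPLE of 200 employees.
sample = random.sample(population, 200)
statistic = sum(sample) / len(sample)

print(f"parameter (population rate): {parameter:.3f}")
print(f"statistic (sample rate):     {statistic:.3f}")
```

The parameter is fixed at 0.10, while the statistic changes each time a different sample is drawn.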

Central Dogma of Statistics

Probability goes: Population → Sample (the sampling process creates the sample)

Descriptive statistics: Summarizes and describes the SAMPLE itself

Inferential statistics: Uses the sample to draw conclusions about the POPULATION

We study a sample because measuring the entire population is usually impossible (too expensive, too large, or destructive).

Sampling Methods

Random Sampling — Every member of the population has an EQUAL chance of being selected. The gold standard for avoiding bias.

Probability Sampling — Every person has a KNOWN (but not necessarily equal) probability of being selected.

Unequal Probability Sampling — Intentionally over-samples minority groups so their voices are not drowned out. Example: if 75% of employees are men and 25% women, a simple random sample of 20 would contain only about 5 women, so women are deliberately sampled at a higher rate.

Sampling Bias — A flaw in the sampling process that systematically favors some outcome. Makes your sample unrepresentative of the population.

Confounding Factor — A hidden variable that connects the variables you are measuring, making it look like one causes the other. Example: ice cream sales and drowning rates are correlated — the confounding factor is hot weather (people buy ice cream AND swim more in summer).
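
A rough sketch of the difference between simple random sampling and unequal probability sampling, using the 75% / 25% workforce from the example above. The group labels, the workforce size of 1,000, and the helper name `sample_by_group` are assumptions made for illustration.

```python
import random

random.seed(1)  # arbitrary seed so the sketch is repeatable

# Hypothetical workforce: 750 men ("M") and 250 women ("W").
employees = [("M", i) for i in range(750)] + [("W", i) for i in range(250)]

def sample_by_group(people, per_group):
    """Draw the same number of people from each group.

    This over-samples the smaller group: every person's selection
    probability is known, but the probabilities are not equal.
    """
    groups = {}
    for person in people:
        groups.setdefault(person[0], []).append(person)
    chosen = []
    for members in groups.values():
        chosen.extend(random.sample(members, per_group))
    return chosen

simple = random.sample(employees, 100)     # equal chance for everyone
balanced = sample_by_group(employees, 50)  # 50 from each group

women_simple = sum(1 for group, _ in simple if group == "W")
women_balanced = sum(1 for group, _ in balanced if group == "W")
print(women_simple, women_balanced)
```

In the balanced sample each woman has a 50/250 = 20% chance of selection and each man a 50/750 ≈ 6.7% chance. Both probabilities are known, which is what makes this a probability sample.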

Measures of Center and Spread

Measure	Formula	What it tells you	Symbol
Mean	Sum of all values / count	The average; center of the data	Population: μ; Sample: x̄
Median	Middle value when sorted	Robust to extreme values (outliers)	—
Variance	Average squared distance from the mean (divide by n for a population, by n - 1 for a sample)	How spread out the data is	Population: σ²; Sample: s²
Std Dev	Square root of variance	Spread in original units	Population: σ; Sample: s
CV	σ/μ (or s/x̄)	Relative variability; compare across different scales	CV
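
The measures in the table can be computed with Python's standard `statistics` module. The data values below are made up for illustration.

```python
import statistics

data = [4, 8, 6, 5, 3, 7, 9]  # made-up values

mean = statistics.mean(data)            # sum of values / count
median = statistics.median(data)        # middle value when sorted
pop_var = statistics.pvariance(data)    # population variance: divides by n
sample_var = statistics.variance(data)  # sample variance: divides by n - 1
sample_sd = statistics.stdev(data)      # square root of the sample variance
cv = sample_sd / mean                   # coefficient of variation, s / x̄

print(mean, median, pop_var, round(sample_var, 3), round(sample_sd, 3), round(cv, 3))
```

Note that `pvariance` and `variance` give different answers on the same data (4 versus about 4.667 here), which is the population versus sample distinction from the Symbol column.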

Z-Score

z = (x - x̄) / s

x = data point | x̄ = sample mean | s = sample standard deviation

Tells you how many standard deviations a value is from the mean. Standardizes all variables to the same scale. Values with |z| > 3 are outlier candidates.

Example: Exam mean = 70, SD = 10. Student scores 90. z = (90-70)/10 = 2. They are 2 standard deviations above average.
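
The same calculation as a small Python function. The function name `z_score` is just a label chosen for this sketch.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

z = z_score(90, 70, 10)            # the exam example above
is_outlier_candidate = abs(z) > 3  # rule of thumb from these notes

print(z, is_outlier_candidate)
```

A score of 90 gives z = 2.0, which is above average but not an outlier candidate under the |z| > 3 rule.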

Binomial Distribution EXAM HOT

Used when: n identical, independent trials; each trial has exactly 2 outcomes (success/failure); probability of success p is constant.

P(X = x) = C(n,x) × p^x × (1-p)^(n-x)

n = number of trials | p = probability of success | x = number of successes desired

Expected value: E(X) = n × p

Variance: V(X) = n × p × (1-p)

Defined by exactly 2 parameters: n and p. NOT by mean and standard deviation.

Example: Family has 5 children, probability of blood type O = 25% (p=0.25). What is probability exactly 3 have type O?
P(X=3) = C(5,3) × 0.25³ × 0.75² = 10 × 0.0156 × 0.5625 = 0.0879 (about 8.8%)
E(X) = 5 × 0.25 = 1.25 (expect about 1 or 2 children with type O blood)
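
The blood type example can be checked with a short Python sketch that applies the formula directly. The function name `binomial_pmf` is a label chosen here, not something from the lecture.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial distribution with n trials and success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 5, 0.25

p_three = binomial_pmf(3, n, p)  # probability exactly 3 of 5 children have type O
expected = n * p                 # E(X) = np
variance = n * p * (1 - p)       # V(X) = np(1-p)

# Sanity check: the probabilities for x = 0..n should add up to 1.
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))

print(round(p_three, 4), expected, variance, round(total, 6))
```

Only `n` and `p` are passed in; the mean and variance fall out of them, which is the point of "defined by exactly 2 parameters".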

Normal Distribution EXAM HOT

The classic bell-shaped curve. Many natural measurements (heights, weights, IQ scores) are approximately normally distributed.

Defined by exactly 2 parameters: mean (μ) and standard deviation (σ)
Continuous: x can be any real number (not just integers)
Symmetric around the mean
Is the limiting shape of the binomial as n → ∞ (a binomial with large n looks approximately normal)
Not all bell-shaped distributions are normal, but it is a reasonable starting assumption

Empirical Rule (68-95-99.7) EXAM HOT

Range	% of data within this range
Mean ± 1 standard deviation (μ ± 1σ)	68%
Mean ± 2 standard deviations (μ ± 2σ)	95%
Mean ± 3 standard deviations (μ ± 3σ)	99.7%

Example: Exam scores have mean = 84, SD = 6. Then:
• 68% of students scored between 78 and 90 (84 ± 6)
• 95% scored between 72 and 96 (84 ± 12)
• 99.7% scored between 66 and 102 (84 ± 18); if the exam is capped at 100, the upper bound shows the normal model is only an approximation
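
The three percentages can be reproduced from the normal curve itself. The sketch below uses `statistics.NormalDist` from the Python standard library with the exam's mean and SD; the helper name `share_within` is an assumption of this example.

```python
from statistics import NormalDist

scores = NormalDist(mu=84, sigma=6)  # the exam example above

def share_within(dist, k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    lo = dist.mean - k * dist.stdev
    hi = dist.mean + k * dist.stdev
    return dist.cdf(hi) - dist.cdf(lo)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(scores, k):.4f}")
```

The exact values are about 0.6827, 0.9545 and 0.9973, which the empirical rule rounds to 68%, 95% and 99.7%.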

Correlation Coefficient

Measures the strength and direction of the linear relationship between two variables.

Range: -1 to +1
+1 = perfect positive (as X increases, Y increases proportionally)
-1 = perfect negative (as X increases, Y decreases proportionally)
0 = no linear relationship
Closer to ±1 = stronger relationship

Critical caveat: Correlation does NOT imply causation. Ice cream sales and drowning rates are positively correlated, but eating ice cream does not cause drowning (hot weather causes both). The correlation coefficient is also not appropriate for nominal or dichotomous variables.
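
A minimal sketch of how a correlation coefficient is computed, assuming the usual Pearson formula (the notes above do not name a specific formula). The data are made up to show the three landmark values.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y divided by the product of their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co / sqrt(ss_x * ss_y)

xs = [1, 2, 3, 4, 5]
r_pos = pearson_r(xs, [2, 4, 6, 8, 10])  # y rises in step with x
r_neg = pearson_r(xs, [10, 8, 6, 4, 2])  # y falls in step with x
r_curve = pearson_r(xs, [4, 1, 0, 1, 4])  # y = (x - 3)^2: related, but not linearly

print(r_pos, r_neg, r_curve)
```

The third case is worth noting: y depends entirely on x, yet r is 0, because the coefficient only measures LINEAR relationships.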

Point Estimates and Central Limit Theorem

Point Estimate — An estimate of a population parameter calculated from sample data. Example: the sample mean x̄ is a point estimate of the population mean μ.

Central Limit Theorem (CLT)

If you take many samples from ANY population (regardless of the population's shape) and compute the mean of each sample, those sample means will form a distribution that approaches normal as the sample size n increases.

Why this matters: It allows us to use normal-distribution-based statistical tests even when the underlying population is NOT normally distributed, provided our sample is large enough (n ≥ 30 is the common rule of thumb).

Also: As sample size increases, the sample mean tends to get closer to the true population mean (this is the law of large numbers, a separate result from the CLT).
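
The CLT can be seen in a small simulation. The sketch below assumes an exponential population (strongly right-skewed, mean 1, SD 1) purely as an example of a non-normal shape; the seed, the number of repeats, and the helper names are arbitrary choices.

```python
import random
import statistics

random.seed(42)  # arbitrary seed so the sketch is repeatable

def sample_means(n, repeats=2000):
    """Means of `repeats` samples of size n drawn from an exponential population."""
    return [sum(random.expovariate(1.0) for _ in range(n)) / n
            for _ in range(repeats)]

def skewness(values):
    """Average cubed z-score: 0 for a symmetric shape, positive for a right skew."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

means_1 = sample_means(1)    # n = 1: just the skewed population itself
means_30 = sample_means(30)  # n = 30: the common rule-of-thumb size

print("skewness, n=1: ", round(skewness(means_1), 2))
print("skewness, n=30:", round(skewness(means_30), 2))
print("SD of means, n=30:", round(statistics.pstdev(means_30), 3))
```

With n = 30 the skewness of the sample means drops well below that of the raw population, and their spread shrinks to roughly σ/√n = 1/√30 ≈ 0.18.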

Hypothesis Testing EXAM HOT

Hypothesis testing is a formal procedure for deciding whether evidence from a sample is strong enough to support a claim about the population.

Null Hypothesis (H₀) — The default statement being tested: "there is NO difference / NO relationship." A test defaults to H₀ and only rejects it if there is strong enough evidence against it.

Alternative Hypothesis (Hₐ) — The statement that there IS a difference or relationship. What you are trying to find evidence for.

p-value — The probability of observing data this extreme (or more extreme) purely by chance, ASSUMING the null hypothesis is true. Small p-value = such data would be unlikely if H₀ were true = evidence against H₀. It is NOT the probability that H₀ is true.

Significance level (α) — Your threshold for decision-making. Usually α = 0.05. If p < α, reject H₀. If p ≥ α, fail to reject H₀.

Critical misconception: Hypothesis tests do NOT select the more likely hypothesis. They start at H₀ (no difference) and only abandon it when evidence is strong enough. Failing to reject H₀ does NOT prove H₀ is true.

5 Steps of Hypothesis Testing

1. Specify hypotheses — Write out H₀ (no difference) and Hₐ (there is a difference)

2. Determine sample size — Large enough to use CLT (usually n ≥ 30); population must be at least 10× sample size

3. Choose significance level — Usually α = 0.05 (meaning a 5% chance of rejecting H₀ when it is actually true)

4. Collect data and calculate test statistic

5. Decision — If p-value < α: reject H₀. If p-value ≥ α: fail to reject H₀
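
The decision step is mechanical enough to write as a tiny function. This is only a sketch of the rule stated above; the function name `decide` and the returned strings are choices made for this example.

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 only when p < alpha."""
    if p_value < alpha:
        return "reject H0"
    # There is deliberately no "accept H0" outcome.
    return "fail to reject H0"

print(decide(0.03))        # below alpha
print(decide(0.20))        # above alpha
print(decide(0.03, 0.01))  # same p-value, stricter alpha
```

The same p-value can lead to different decisions under different α, which is why α is chosen before the data are collected.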

Statistical Tests

Test	Used for	Key assumptions
One-sample t-test	Continuous variable: does sample differ from population mean?	n ≥ 30 or normally distributed; population ≥ 10n
Chi-square goodness of fit	Categorical: does sample distribution match expected?	All expected counts ≥ 5; population ≥ 10n
Chi-square independence	Categorical: are two variables associated with each other?	H₀ = no association; same count assumptions
Kolmogorov-Smirnov (KS)	Non-parametric: are two samples from the same distribution?	Compares cumulative distribution functions (CDFs)
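
As one illustration of the table, here is a sketch of the chi-square goodness of fit statistic, assuming the standard formula (sum of (observed - expected)² / expected). The die-roll counts are made up, and the critical value 11.070 is the usual table value for df = 5 at α = 0.05.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    if any(e < 5 for e in expected):
        raise ValueError("all expected counts should be at least 5")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Made-up data: a die rolled 60 times. H0: the die is fair (10 per face).
observed = [8, 12, 9, 11, 10, 10]
expected = [10] * 6

chi2 = chi_square_statistic(observed, expected)
df = len(observed) - 1             # categories - 1 = 5
CRITICAL_DF5_ALPHA05 = 11.070      # chi-square table value, df = 5, alpha = 0.05
reject_h0 = chi2 > CRITICAL_DF5_ALPHA05

print(chi2, df, reject_h0)
```

Here the statistic is 1.0, far below the critical value, so we fail to reject H₀: the counts are consistent with a fair die.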

T-test Worked Example

Claim: batteries last > 40 hours. Sample: n = 15, mean = 44.9 hr, SD = 8.9 hr. Because n < 30, the test assumes battery life is approximately normally distributed.

H₀: μ = 40 | Hₐ: μ > 40 (one-tailed, upper)

t = (x̄ - μ) / (s / √n) = (44.9 - 40) / (8.9 / √15) = 4.9 / 2.298 = 2.13

df = n - 1 = 14 | p-value = 0.026

Since p=0.026 < α=0.05, we REJECT H₀. Evidence supports that batteries last more than 40 hours.
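
The battery example can be reproduced with the standard library alone. This is a sketch, not how the lecture computed it: the p-value is obtained by numerically integrating the t-distribution's density (Simpson's rule), and the function names, the cutoff of 60, and the step count are choices made for this example. In practice a statistics package would supply the p-value directly.

```python
from math import gamma, pi, sqrt

def t_statistic(sample_mean, mu0, s, n):
    """One-sample t statistic: (x̄ - μ) / (s / √n)."""
    return (sample_mean - mu0) / (s / sqrt(n))

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def upper_tail_p(t, df, upper=60.0, steps=20000):
    """P(T > t) by Simpson's rule on [t, upper]; the area beyond `upper` is negligible."""
    h = (upper - t) / steps  # steps must be even for Simpson's rule
    total = t_pdf(t, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(t + i * h, df)
    return total * h / 3

n = 15
t = t_statistic(44.9, 40, 8.9, n)
p_value = upper_tail_p(t, df=n - 1)  # one-tailed (upper), matching Hₐ: μ > 40

print(f"t = {t:.2f}, df = {n - 1}, p = {p_value:.3f}")
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```

The result agrees with the worked example: t ≈ 2.13 and a one-tailed p-value of about 0.026, so H₀ is rejected at α = 0.05.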

Lecture 3 Summary — 5 Minute Revision

• Parameter = population; statistic = sample.
• Binomial: 2 parameters only (n, p); E(X) = np; V(X) = np(1-p).
• Normal: 2 parameters (μ, σ); bell-shaped; continuous.
• Empirical rule: 68/95/99.7% within 1/2/3 standard deviations.
• z-score: standardizes; |z| > 3 = outlier candidate.
• Correlation: -1 to +1; 0 = no linear relationship; correlation ≠ causation.
• CLT: large samples → sample means become approximately normally distributed.
• H₀ = no difference (default). Reject H₀ when p < α (usually 0.05).
• t-test for continuous; chi-square for categorical.

Practice Questions

Q1. The binomial distribution depends on which two parameters?

A. Mean and standard deviation
B. Number of trials (n) and probability of success (p)
C. Standard deviation and number of successes
D. Mean and probability of success

Answer: B

The binomial distribution is completely defined by n (number of independent trials) and p (probability of success on each trial), NOT by mean and standard deviation; those describe the normal distribution. The mean and SD are derived FROM n and p: E(X) = np, V(X) = np(1-p), SD = √V(X).

Q2. For a normal distribution, approximately what percentage of data falls within one standard deviation of the mean?

A. 99.7%
B. 95%
C. 50%
D. 68%

Answer: D

The empirical rule (68-95-99.7): 68% of data falls within 1 standard deviation (μ ± σ), 95% within 2 standard deviations, and 99.7% within 3 standard deviations. Memorize these three numbers.

Q3. What is the difference between a parameter and a statistic?

A. A parameter describes a population; a statistic describes a sample
B. A parameter describes a sample; a statistic describes a population
C. A parameter measures central tendency; a statistic measures variability
D. Parameters are always larger than statistics

Answer: A

A parameter is a numerical measurement describing a characteristic of a POPULATION (e.g., 10% drug use rate among all 1,000 employees). A statistic is a numerical measurement describing a characteristic of a SAMPLE (e.g., 13% drug use rate found in a sample of 200 employees). Memory tip: Parameter = Population, Statistic = Sample.

Q4. The null hypothesis in a statistical test is:

A. The statement of what the researcher hopes to find
B. A statement that there is no difference or no relationship between the variables
C. Always proven correct if the p-value is large
D. A statement based on prior research findings

Answer: B

The null hypothesis (H₀) states there is NO difference or NO relationship — it is the default "nothing interesting is happening" assumption. A hypothesis test defaults to H₀ and only rejects it when evidence is strong enough. Importantly, we never "accept" H₀ — we only "fail to reject" it.

Q5. A test gives a p-value of 0.03 and the significance level is 0.05. The correct conclusion is:

A. Fail to reject H₀ because p is small
B. Reject H₀ because p (0.03) < α (0.05)
C. Accept the null hypothesis as proven
D. The test is inconclusive

Answer: B

Decision rule: if p-value < α, reject H₀. Here 0.03 < 0.05, so we reject H₀. The p-value means: "if H₀ were true, there would be only a 3% chance of seeing data this extreme or more extreme." That's unlikely enough to reject H₀.

Q6. The Central Limit Theorem states that:

A. All datasets follow a normal distribution
B. You need a normally distributed population to use any statistical test
C. The sampling distribution of sample means approaches normal as n increases, regardless of population shape
D. Larger samples always reduce measurement error

Answer: C

The CLT says: take many samples of size n from ANY population. Compute the mean of each sample. The distribution of those sample means will approach normal as n increases — even if the original population is not normally distributed (e.g., it could be bimodal, skewed, or uniform). This is why we can use t-tests with n ≥ 30 without needing a normal population.

Q7. Which test is appropriate for determining whether a continuous variable from a sample differs significantly from a known population mean?

A. Chi-square goodness of fit
B. Chi-square independence
C. Kolmogorov-Smirnov test
D. One-sample t-test

Answer: D

The one-sample t-test is used for continuous (numerical) variables to determine whether a sample mean differs significantly from a known population mean. Chi-square tests are for categorical variables. The KS test is non-parametric and compares two sample distributions. Use t-test for continuous; chi-square for categorical.

Q8. An exam has mean 84 and standard deviation 6. Using the empirical rule, approximately what percentage of students scored between 72 and 96?

A. 68%
B. 95%
C. 99.7%
D. 50%

Answer: B

72 = 84 - 12 = 84 - 2(6), so 72 is 2 standard deviations below the mean. 96 = 84 + 12 = 84 + 2(6), so 96 is 2 standard deviations above. The empirical rule says 95% of data falls within 2 standard deviations of the mean.

Q9. Correlation coefficients have what range of possible values?

A. 0 to 1
B. 0 to 100
C. -1 to 1
D. -100 to 100

Answer: C

Correlation coefficients range from -1 to +1. +1 = perfect positive linear relationship. -1 = perfect negative linear relationship. 0 = no linear relationship. The closer the absolute value is to 1, the stronger the relationship. Weakest = 0. Strongest = -1 or +1.

Q10. A common misconception about hypothesis tests is that:

A. They are used only for continuous data
B. They select the more likely of the two hypotheses
C. They always require normally distributed data
D. They produce exact probabilities

Answer: B

A very common misconception is that hypothesis tests select the more likely hypothesis. This is INCORRECT. A hypothesis test defaults to H₀ (no difference) and ONLY rejects it when there is sufficient evidence against it. Even if Hₐ seems more likely based on prior knowledge, the test won't reject H₀ unless the p-value falls below alpha.

Q11. The chi-square test for independence tests:

A. Whether a continuous variable equals a specific value
B. Whether two categorical variables are associated with each other
C. Whether a sample comes from a normal distribution
D. Whether the variance of two groups is equal

Answer: B

The chi-square test for independence determines whether two categorical variables are associated (dependent) or independent. H₀ = no association / variables are independent. Example: testing whether website version (A or B) is associated with whether a visitor signs up (yes or no). Both variables are categorical.

Q12. A z-score of -2.5 means:

A. The value is 2.5 times larger than the mean
B. The value is 2.5 standard deviations BELOW the mean
C. The value is 2.5% away from the mean
D. The value is an outlier and should be deleted

Answer: B

A z-score of -2.5 means the value is 2.5 standard deviations below the mean. Negative z = below mean; positive z = above mean. The magnitude tells you how far. |z| = 2.5 is not extreme enough to call an outlier (the typical cutoff is |z| > 3). A z-score simply standardizes: z = (x - mean)/SD.

Q13. Which statement about the coefficient of variation (CV) is correct?

A. CV = mean / standard deviation
B. CV allows comparison of variability across groups with different units or scales
C. CV is always between -1 and 1
D. CV is used instead of the p-value in hypothesis tests

Answer: B

CV = standard deviation / mean. It expresses variability as a proportion of the mean, which allows comparison across groups with different scales. Example: comparing spread in mailroom salaries ($25k mean, $2k SD, CV=8%) vs executive salaries ($124k mean, $42k SD, CV=33.9%). Even though executives have a larger SD in dollars, the CV reveals executives have much more relative variability.

Q14. A family has 5 children. The probability of each child having blood type O is 25%. Using the binomial distribution, what is the expected number of children with type O blood?

A. 0.25
B. 0.9375
C. 1.25
D. 2.5

Answer: C

For a binomial distribution, E(X) = n × p = 5 × 0.25 = 1.25. So on average, 1.25 children in this family would have type O blood. This means the family can expect about 1 or 2 children with type O blood. The variance = np(1-p) = 5 × 0.25 × 0.75 = 0.9375.
