Focus on understanding what each formula computes and when to apply it. Memory tips are included to help retention.

Statistics

Binomial Distribution EXAM HOT

P(X = x) = C(n,x) * p^x * (1-p)^(n-x)
n = number of trials  |  p = probability of success  |  x = number of successes
Two parameters only: p and n. NOT mean and standard deviation.
E(X) = np
V(X) = np(1-p)
Shortcut formulas for expected value and variance of a binomial

Memory tip: "B for Binary trials — only needs n (trials) and p (probability)."
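
A minimal sketch of the binomial formulas above, using only the Python standard library. The function name binomial_pmf and the example values n = 10, p = 0.5 are illustrative choices, not part of the formula sheet.

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 10, 0.5                        # example parameters (assumed)
prob_three = binomial_pmf(3, n, p)    # C(10,3) * 0.5^10
mean = n * p                          # E(X) = np
variance = n * p * (1 - p)            # V(X) = np(1-p)
total = sum(binomial_pmf(x, n, p) for x in range(n + 1))  # should be ~1
```

Summing the PMF over x = 0..n is a quick sanity check that the probabilities add up to 1.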

Normal Distribution EXAM HOT

f(x; mu, sigma^2) = (1 / (sigma * sqrt(2*pi))) * e^( -(x-mu)^2 / (2*sigma^2) )
Bell-shaped. Defined by exactly two parameters: mu (mean) and sigma (standard deviation).

Memory tip: "Normal = mu + sigma. Bell curve."
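
A short sketch of the normal density, written with the standard normalizing constant 1 / (sigma * sqrt(2*pi)). The function name normal_pdf and the default arguments are assumptions made for illustration.

```python
from math import exp, pi, sqrt

def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * sqrt(2.0 * pi))
    return coeff * exp(-((x - mu) ** 2) / (2.0 * sigma**2))

peak = normal_pdf(0.0)                  # height of the standard normal at its mean
left = normal_pdf(-1.5, mu=0.0, sigma=2.0)
right = normal_pdf(1.5, mu=0.0, sigma=2.0)   # same height as left: the curve is symmetric
```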

Empirical Rule EXAM HOT

mu +/- 1*sigma ==> 68% of data
mu +/- 2*sigma ==> 95% of data
mu +/- 3*sigma ==> 99.7% of data
Only applies to normal (bell-shaped) distributions

Memory tip: "68-95-99.7 — memorize as three numbers."
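
The three percentages can be checked numerically. For a normal distribution, the probability of falling within k standard deviations of the mean equals erf(k / sqrt(2)). The sketch below uses that identity; the helper name within_k_sigma is hypothetical.

```python
from math import erf, sqrt

def within_k_sigma(k: float) -> float:
    """P(mu - k*sigma < X < mu + k*sigma) for a normally distributed X."""
    return erf(k / sqrt(2))

# Roughly 0.68, 0.95 and 0.997 for k = 1, 2, 3
coverage = {k: within_k_sigma(k) for k in (1, 2, 3)}
```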

Z-Score (Standardization)

z = (x - x_bar) / s
x = data point  |  x_bar = sample mean  |  s = sample standard deviation
How many standard deviations is x from the mean? |z| > 3 = outlier candidate.

Memory tip: "Subtract the mean, divide by SD."
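
A small sketch of standardization with the sample mean and sample standard deviation. The data values and the name z_score are made up for illustration.

```python
from statistics import mean, stdev

def z_score(x: float, sample: list) -> float:
    """Number of sample standard deviations between x and the sample mean."""
    return (x - mean(sample)) / stdev(sample)   # stdev divides by n - 1

sample = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data, mean = 5
z_high = z_score(9, sample)          # positive: above the mean
z_mean = z_score(5, sample)          # zero: exactly at the mean
```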

One-Sample T-Test

t = (x_bar - mu) / (s / sqrt(n))
x_bar = sample mean  |  mu = population mean  |  s = sample SD  |  n = sample size
df = n - 1
Reject H0 when p-value < alpha (usually 0.05)
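
A sketch of the test statistic and degrees of freedom. The sample data and the hypothesized mean mu0 = 4 are invented. Turning t into a p-value needs the t-distribution (a table or a statistics library), which this stdlib-only example leaves out.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample: list, mu0: float):
    """Return (t, df) for a one-sample t-test against the hypothesized mean mu0."""
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)                     # sample SD (divides by n - 1)
    t = (x_bar - mu0) / (s / sqrt(n))
    return t, n - 1

sample = [2, 4, 4, 4, 5, 5, 7, 9]         # made-up data
t_stat, df = one_sample_t(sample, mu0=4)
```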

Coefficient of Variation (CV)

CV = sigma / mu (or s / x_bar)
Compares variability across groups with different scales.

Linear Regression

Linear Regression Model

h(x) = theta_0 + theta_1 * x
Single variable. theta_0 = intercept, theta_1 = slope.
f(x) = w_0 + w_1*x_1 + w_2*x_2 + ... + w_(m-1)*x_(m-1)
Multi-variable linear regression

Residual Error EXAM HOT

r_i = y_i - f(x_i)
y_i = actual value  |  f(x_i) = predicted value
Geometrically = VERTICAL distance from data point to regression line

Memory tip: "Residual = actual MINUS predicted. Always vertical."

Cost Function (MSE / Least Squares)

J(w0, w1) = (1/2n) * SUM[ (y_i - f(x_i))^2 ]
Minimize this to find optimal parameters. The 1/2 is for mathematical convenience (cancels with derivative).
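
A minimal sketch of the cost for the single-variable model f(x) = w0 + w1*x. It builds the residuals r_i = y_i - f(x_i) first, then applies the 1/(2n) scaling. The data points and the name mse_cost are illustrative assumptions.

```python
def mse_cost(w0: float, w1: float, xs: list, ys: list) -> float:
    """J(w0, w1) = (1/2n) * sum of squared residuals."""
    n = len(xs)
    residuals = [y - (w0 + w1 * x) for x, y in zip(xs, ys)]   # actual minus predicted
    return sum(r**2 for r in residuals) / (2 * n)

xs, ys = [1, 2, 3], [2, 4, 7]            # made-up data
cost_line = mse_cost(0.0, 2.0, xs, ys)   # residuals are 0, 0, 1
cost_flat = mse_cost(0.0, 0.0, xs, ys)   # a much worse fit
```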

Closed Form Solution

w = (A^T * A)^(-1) * A^T * b
A = feature matrix (n x m)  |  b = target vector (n x 1)
Exact solution. Requires matrix inversion — slow for very large systems.
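
For one feature plus an intercept, A has two columns (all ones, and x), so A^T * A is a 2 x 2 matrix that can be inverted by hand. The sketch below works through that special case in plain Python. It illustrates the formula and is not a general solver; the function name and data are assumptions.

```python
def closed_form_fit(xs: list, ys: list):
    """Solve w = (A^T A)^(-1) A^T b for A = [1, x] (intercept and one feature)."""
    n = len(xs)
    sx, sxx = sum(xs), sum(x * x for x in xs)
    sy, sxy = sum(ys), sum(x * y for x, y in zip(xs, ys))
    # A^T A = [[n, sx], [sx, sxx]] and A^T b = [sy, sxy]
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("A^T A is singular (all x values are identical)")
    w0 = (sxx * sy - sx * sxy) / det
    w1 = (n * sxy - sx * sy) / det
    return w0, w1

w0, w1 = closed_form_fit([0, 1, 2, 3], [1, 3, 5, 7])   # data lie on y = 1 + 2x
```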

Gradient Descent EXAM HOT

theta_j := theta_j - alpha * d/d(theta_j) J(theta)
alpha = learning rate (step size)
Repeat for all j simultaneously until convergence.
alpha  |  Effect
Too small  |  Slow convergence
Too large  |  Overshoots, fails to converge

Memory tip: "Walk downhill — step in direction of steepest descent."
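
A sketch of batch gradient descent for the single-variable model f(x) = w0 + w1*x with the (1/2n) squared-error cost. The starting point (0, 0), the step count and the data are assumptions chosen so the example converges quickly.

```python
def gradient_descent(xs: list, ys: list, alpha: float, steps: int):
    """Minimize J(w0, w1) = (1/2n) * sum((f(x_i) - y_i)^2) by repeated downhill steps."""
    n = len(xs)
    w0, w1 = 0.0, 0.0
    for _ in range(steps):
        errors = [(w0 + w1 * x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / n                              # dJ/dw0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / n   # dJ/dw1
        # Both parameters are updated from the same set of errors (simultaneous update)
        w0, w1 = w0 - alpha * grad0, w1 - alpha * grad1
    return w0, w1

xs, ys = [0, 1, 2, 3], [1, 3, 5, 7]                 # data lie on y = 1 + 2x
w0, w1 = gradient_descent(xs, ys, alpha=0.1, steps=5000)
```

On this data, rerunning with a much larger step such as alpha = 0.6 makes the parameters grow without bound instead of settling, which is the overshoot case.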

Regularization (Ridge) EXAM HOT

J(theta) = (1/2n) * [ SUM (h(x_i) - y_i)^2 + lambda * SUM(theta_j^2) ]
n = number of training examples  |  lambda = regularization parameter
Adds penalty for large coefficients. Large lambda = simpler model.
Lambda  |  Effect  |  Risk
Large  |  Shrinks parameters toward zero, simpler model  |  Underfitting
Small  |  Minimal penalty, uses all parameters  |  Overfitting

Memory tip: "Lambda = how much do I punish large parameters?"
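
A sketch of the ridge cost for the single-variable hypothesis h(x) = theta0 + theta1*x. It assumes the intercept theta0 is left out of the penalty, a common convention that the formula above does not spell out. The function name and data are illustrative.

```python
def ridge_cost(theta0: float, theta1: float, xs: list, ys: list, lam: float) -> float:
    """Squared error plus lambda * theta1^2, scaled by 1 / (2 * number of examples)."""
    count = len(xs)
    squared_error = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    penalty = lam * theta1**2          # assumption: theta0 is not penalized
    return (squared_error + penalty) / (2 * count)

xs, ys = [1, 2, 3], [2, 4, 6]                          # data lie on y = 2x
no_penalty = ridge_cost(0.0, 2.0, xs, ys, lam=0.0)     # perfect fit, no penalty
with_penalty = ridge_cost(0.0, 2.0, xs, ys, lam=3.0)   # same fit now costs more
```

With lambda > 0 the perfectly fitting slope is no longer free, so the minimizer trades some fit for a smaller coefficient.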

Logistic Regression

Sigmoid Function EXAM HOT

f(x) = 1 / (1 + e^(-cx))
c = steepness parameter. Output always between 0 and 1.
x  |  f(x)
0  |  0.5 (decision boundary)
+infinity  |  1
-infinity  |  0

Memory tip: "S-curve squashes everything into (0, 1). At zero input, output is 0.5."
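
A minimal sketch of the sigmoid with the steepness parameter c. The default c = 1 is an assumption. Very large negative inputs would overflow exp in this naive form, so the example sticks to moderate values.

```python
from math import exp

def sigmoid(x: float, c: float = 1.0) -> float:
    """f(x) = 1 / (1 + e^(-c*x)); the output lies strictly between 0 and 1."""
    return 1.0 / (1.0 + exp(-c * x))

at_zero = sigmoid(0.0)          # 0.5, the decision boundary
far_right = sigmoid(10.0)       # close to 1
far_left = sigmoid(-10.0)       # close to 0
steeper = sigmoid(1.0, c=5.0)   # larger c pushes the output toward 1 faster
```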

Cross-Entropy Loss (Log Loss) EXAM HOT

For y=1: cost = -log( f(x) )
For y=0: cost = -log( 1 - f(x) )
J(w) = -(1/n) * SUM[ y_i*log(f(x_i)) + (1-y_i)*log(1-f(x_i)) ]
Convex in w, so gradient descent can reach the global minimum. Use this loss instead of MSE, which is non-convex when combined with the sigmoid.

Memory tip: "Log loss penalizes confident wrong predictions most heavily."
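
A sketch of the averaged log loss. The predicted probabilities are made-up values chosen to contrast a confident correct model with a confident wrong one. Probabilities of exactly 0 or 1 would make log fail, so real code usually clips them first.

```python
from math import log

def log_loss(ys: list, probs: list) -> float:
    """J = -(1/n) * sum( y*log(p) + (1-y)*log(1-p) ), where p is the predicted P(y=1)."""
    n = len(ys)
    total = sum(y * log(p) + (1 - y) * log(1 - p) for y, p in zip(ys, probs))
    return -total / n

labels = [1, 0]
confident_right = log_loss(labels, [0.9, 0.1])   # small loss
confident_wrong = log_loss(labels, [0.1, 0.9])   # much larger loss
```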

Machine Learning

Bayes' Theorem

P(B|A) = P(A|B) * P(B) / P(A)
Naive Bayes: C(X) = argmax over classes Ci of P(Ci) * PRODUCT over features j [ P(xj | Ci) ]
Naive assumption: features are conditionally independent given the class
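
A worked numeric sketch of Bayes' theorem. The denominator P(A) is expanded with the law of total probability, which is not written out above. All probabilities are invented for illustration, and the function name posterior is hypothetical.

```python
def posterior(prior_b: float, p_a_given_b: float, p_a_given_not_b: float) -> float:
    """P(B|A) = P(A|B) * P(B) / P(A), with P(A) from the law of total probability."""
    p_a = p_a_given_b * prior_b + p_a_given_not_b * (1 - prior_b)
    return p_a_given_b * prior_b / p_a

# B = "has condition" (1% prior), A = "test is positive" (made-up rates)
p_b_given_a = posterior(prior_b=0.01, p_a_given_b=0.9, p_a_given_not_b=0.05)
```

Even with a 90% hit rate, the posterior stays low here because the prior P(B) is small.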

Entropy (Decision Trees)

H(S) = -SUM[ fi * log2(fi) ]
fi = fraction of items in class i
H = 0 when all same class (pure). Lower H = purer = better for classification.
IG(S) = H(S) - SUM[ (|Si|/|S|) * H(Si) ]
Information gain = entropy reduction from a split. Choose max IG.

Memory tip: "Entropy = messiness. Lower is cleaner = better split."
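
A small sketch of entropy and information gain over lists of class labels. The labels and both candidate splits are made up, and the function names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels: list) -> float:
    """H(S) = -sum( f_i * log2(f_i) ) over the classes present in labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent: list, subsets: list) -> float:
    """IG = H(parent) - weighted average of H over the subsets of a split."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]
perfect = information_gain(parent, [["yes", "yes"], ["no", "no"]])   # pure children
useless = information_gain(parent, [["yes", "no"], ["yes", "no"]])   # nothing learned
```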

Distance and Clustering

Lk Distance EXAM HOT

d(x,y) = ( SUM |xi - yi|^k )^(1/k)
k = 1: Manhattan, d = SUM |xi - yi|
k = 2: Euclidean, d = SQRT( SUM (xi-yi)^2 )

Memory tip: "k=1 is Manhattan (grid), k=2 is Euclidean (straight line, most common)."
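
A minimal sketch of the Lk distance for two equal-length points. The function name lk_distance and the example points are assumptions.

```python
def lk_distance(x: list, y: list, k: float) -> float:
    """d(x, y) = ( sum |x_i - y_i|^k )^(1/k)."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    return sum(abs(a - b) ** k for a, b in zip(x, y)) ** (1 / k)

origin, point = [0, 0], [3, 4]
manhattan = lk_distance(origin, point, k=1)   # 3 + 4
euclidean = lk_distance(origin, point, k=2)   # sqrt(9 + 16)
```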

Quick Reference Table

Concept  |  Key numbers / facts
Binomial parameters  |  p and n (NOT mean/SD)
Normal parameters  |  mu and sigma
Empirical Rule  |  68 / 95 / 99.7 %
Significance level alpha  |  0.05 (common default)
Reject H0 when  |  p-value < alpha
H0 definition  |  No difference / no relationship
Sigmoid at x = 0  |  f(0) = 0.5
Logistic regression loss  |  Cross-Entropy (Log Loss)
Euclidean distance k  |  k = 2
Manhattan distance k  |  k = 1
Large lambda effect  |  Simpler model → underfitting risk
High learning rate  |  Overshoots, fails to converge
High entropy node  |  Impure (bad split)
Residual interpretation  |  Vertical distance from point to line
Support vectors  |  Points closest to decision boundary
AdaBoost weight increase  |  When misclassified by weak classifier in round t
Occam's Razor  |  Simplest model that fits is preferable
Primary goal of modeling  |  Generalize well to unseen data