Formula Sheet | ICT583

Focus on understanding what each formula computes and when to apply it. Memory tips are included to help retention.

Statistics

Binomial Distribution EXAM HOT

P(X = x) = C(n,x) * p^x * (1-p)^(n-x)

n = number of trials | p = probability of success | x = number of successes

Two parameters only: p and n. NOT mean and standard deviation.

E(X) = np

V(X) = np(1-p)

Shortcut formulas for expected value and variance of a binomial

Memory tip: "B for Binary trials — only needs n (trials) and p (probability)."
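
Worked example (a minimal sketch; n = 10 and p = 0.5 are made-up values, and binomial_pmf is a hypothetical helper name):

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.5                    # illustrative values, not from the sheet
prob_3 = binomial_pmf(3, n, p)    # C(10,3) * 0.5^10 = 120/1024
mean = n * p                      # E(X) = np
variance = n * p * (1 - p)        # V(X) = np(1-p)
print(prob_3, mean, variance)
```

The probabilities over x = 0..n should sum to 1, which is a quick sanity check on any hand calculation.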

Normal Distribution EXAM HOT

f(x; mu, sigma^2) = (1 / sqrt(2*pi*sigma^2)) * e^( -(x-mu)^2 / (2*sigma^2) )

Bell-shaped. Defined by exactly two parameters: mu (mean) and sigma (standard deviation).

Memory tip: "Normal = mu + sigma. Bell curve."
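
Worked example (a minimal sketch of the density formula; normal_pdf is a hypothetical helper name, and mu = 0, sigma = 1 are illustrative):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return (1 / sqrt(2 * pi * sigma**2)) * exp(-(x - mu)**2 / (2 * sigma**2))

peak = normal_pdf(0.0, 0.0, 1.0)   # height at the mean: 1/sqrt(2*pi), about 0.3989
left = normal_pdf(-1.5, 0.0, 1.0)
right = normal_pdf(1.5, 0.0, 1.0)  # same as left: the curve is symmetric about mu
print(peak, left, right)
```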

Empirical Rule EXAM HOT

mu +/- 1*sigma ==> 68% of data

mu +/- 2*sigma ==> 95% of data

mu +/- 3*sigma ==> 99.7% of data

Only applies to normal (bell-shaped) distributions

Memory tip: "68-95-99.7 — memorize as three numbers."
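
Where the three numbers come from (a sketch using the standard identity P(|Z| < k) = erf(k / sqrt(2)) for a normal variable; prob_within is a hypothetical helper name):

```python
from math import erf, sqrt

def prob_within(k):
    """P(mu - k*sigma < X < mu + k*sigma) for a normally distributed X."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))   # about 0.6827, 0.9545, 0.9973
```

The rule rounds these to 68%, 95% and 99.7%.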

Z-Score (Standardization)

z = (x - x_bar) / s

x = data point | x_bar = sample mean | s = sample standard deviation

How many standard deviations is x from the mean? |z| > 3 = outlier candidate.

Memory tip: "Subtract the mean, divide by SD."
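
Worked example (a minimal sketch on made-up data; statistics.stdev is the sample SD with the n - 1 denominator, matching s above):

```python
from statistics import mean, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up sample
x_bar = mean(data)                # 5
s = stdev(data)                   # sample SD = sqrt(32/7)
z_scores = [(x - x_bar) / s for x in data]

# Flag outlier candidates with the |z| > 3 rule of thumb
outliers = [x for x, z in zip(data, z_scores) if abs(z) > 3]
print(z_scores, outliers)
```

Here the largest z-score (for x = 9) is about 1.87, so nothing is flagged.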

One-Sample T-Test

t = (x_bar - mu) / (s / sqrt(n))

x_bar = sample mean | mu = population mean | s = sample SD | n = sample size

df = n - 1

Reject H0 when p-value < alpha (usually 0.05)
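
Worked example (a minimal sketch on made-up measurements; one_sample_t is a hypothetical helper name, and mu = 5.0 is an assumed H0 value):

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for H0: population mean = mu0."""
    n = len(sample)
    t = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
    return t, n - 1

sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.2]   # made-up data
t_stat, df = one_sample_t(sample, 5.0)
print(t_stat, df)                          # t is about 2.46 with df = 5
```

Turning t into a p-value needs the t distribution with df degrees of freedom (a t-table or a statistics library); the sketch stops at the statistic itself.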

Coefficient of Variation (CV)

CV = sigma / mu (or s / x_bar)

Compares variability across groups with different scales.
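
Worked example (a minimal sketch; the two groups are made-up and measured in different units, which is exactly when CV is useful):

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample CV = s / x_bar (unitless)."""
    return stdev(values) / mean(values)

heights_cm = [160, 170, 180]   # made-up: x_bar = 170, s = 10
weights_kg = [50, 70, 90]      # made-up: x_bar = 70,  s = 20

cv_height = coefficient_of_variation(heights_cm)   # 10/170, about 0.059
cv_weight = coefficient_of_variation(weights_kg)   # 20/70,  about 0.286
print(cv_height, cv_weight)
```

Weights vary more relative to their mean, even though the raw SDs are not directly comparable.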

Linear Regression

Linear Regression Model

h(x) = theta_0 + theta_1 * x

Single variable. theta_0 = intercept, theta_1 = slope.

f(x) = w_0 + w_1*x_1 + w_2*x_2 + ... + w_(m-1)*x_(m-1)

Multi-variable linear regression

Residual Error EXAM HOT

r_i = y_i - f(x_i)

y_i = actual value | f(x_i) = predicted value

Geometrically = VERTICAL distance from data point to regression line

Memory tip: "Residual = actual MINUS predicted. Always vertical."
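
Worked example (a minimal sketch; the line f(x) = 1 + 2x and the data points are made-up):

```python
def f(x, w0=1.0, w1=2.0):
    """Hypothetical fitted line: f(x) = w0 + w1 * x."""
    return w0 + w1 * x

xs = [1.0, 2.0, 3.0]
ys = [3.5, 4.5, 7.0]   # actual values

# residual = actual MINUS predicted
residuals = [y - f(x) for x, y in zip(xs, ys)]
print(residuals)       # [0.5, -0.5, 0.0]
```

A positive residual means the point sits above the line, a negative one below it.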

Cost Function (MSE / Least Squares)

J(w0, w1) = (1/2n) * SUM[ (y_i - f(x_i))^2 ]

Minimize this to find optimal parameters. The 1/2 is for mathematical convenience (cancels with derivative).
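
Worked example (a minimal sketch of the cost for a single-variable model; the data and the candidate parameters are made-up):

```python
def cost(w0, w1, xs, ys):
    """J(w0, w1) = (1/2n) * SUM (y_i - f(x_i))^2 with f(x) = w0 + w1 * x."""
    n = len(xs)
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys)) / (2 * n)

xs = [1.0, 2.0, 3.0]
ys = [3.5, 4.5, 7.0]

j_good = cost(1.0, 2.0, xs, ys)   # (0.25 + 0.25 + 0) / 6, about 0.0833
j_bad = cost(0.0, 0.0, xs, ys)    # predicting 0 everywhere costs far more
print(j_good, j_bad)
```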

Closed Form Solution

w = (A^T * A)^(-1) * A^T * b

A = feature matrix (n x m, including a column of ones for the intercept) | b = target vector (n x 1)

Exact solution. Requires matrix inversion — slow for very large systems.
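
Worked example (a sketch of the closed form written out by hand for the two-parameter case, so A has rows [1, x_i] and A^T A is 2 x 2; fit_line_closed_form is a hypothetical helper name, and a linear-algebra library would normally do the inversion):

```python
def fit_line_closed_form(xs, ys):
    """Solve w = (A^T A)^(-1) A^T b where each row of A is [1, x_i] and b = ys."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # A^T A = [[n, sx], [sx, sxx]] and A^T b = [sy, sxy]
    det = n * sxx - sx * sx               # determinant of A^T A
    if det == 0:
        raise ValueError("A^T A is singular (all x values are equal)")
    w0 = (sxx * sy - sx * sxy) / det      # intercept
    w1 = (n * sxy - sx * sy) / det        # slope
    return w0, w1

# Made-up points lying exactly on y = 1 + 2x
w0, w1 = fit_line_closed_form([0, 1, 2, 3], [1, 3, 5, 7])
print(w0, w1)   # 1.0 2.0
```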

Gradient Descent EXAM HOT

theta_j := theta_j - alpha * d/d(theta_j) J(theta)

alpha = learning rate (step size)

Repeat for all j simultaneously until convergence.

alpha	Effect
Too small	Slow convergence
Too large	Overshoots, fails to converge

Memory tip: "Walk downhill — step in direction of steepest descent."
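
Worked example (a minimal sketch of batch gradient descent on the (1/2n) squared-error cost for a single-variable model; the data, alpha values and step counts are made-up, and gradient_descent is a hypothetical helper name):

```python
def gradient_descent(xs, ys, alpha, steps):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    theta0, theta1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / n                              # dJ/d(theta0)
        grad1 = sum(e * x for e, x in zip(errors, xs)) / n   # dJ/d(theta1)
        # Simultaneous update: both gradients use the old parameter values
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

xs, ys = [0, 1, 2, 3], [1, 3, 5, 7]          # points on y = 1 + 2x
good = gradient_descent(xs, ys, alpha=0.1, steps=2000)   # approaches (1, 2)
bad = gradient_descent(xs, ys, alpha=0.6, steps=50)      # overshoots and blows up
print(good, bad)
```

On this data alpha = 0.1 settles on the true line, while alpha = 0.6 makes each step larger than the last.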

Regularization (Ridge) EXAM HOT

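J(theta) = (1/2n) * [ SUM_i (h(x_i) - y_i)^2 + lambda * SUM_j (theta_j^2) ]

lambda = regularization parameter | n = number of training examples | the penalty sum usually skips the intercept theta_0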

Adds penalty for large coefficients. Large lambda = simpler model.

Lambda	Effect	Risk
Large	Shrinks parameters toward zero, simpler model	Underfitting
Small	Minimal penalty, uses all parameters	Overfitting

Memory tip: "Lambda = how much do I punish large parameters?"
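
Worked example (a sketch under a simplifying assumption: one slope parameter theta and no intercept, so the ridge cost has the exact minimizer theta = SUM(x*y) / (SUM(x^2) + lambda); the data and lambda values are made-up):

```python
def ridge_slope(xs, ys, lam):
    """Minimizer of SUM (theta*x - y)^2 + lam * theta^2 for a no-intercept model."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # points on y = 2x

slopes = {lam: ridge_slope(xs, ys, lam) for lam in (0.0, 14.0, 1000.0)}
print(slopes)          # slope shrinks from 2.0 toward 0 as lambda grows
```

With lambda = 0 the true slope 2.0 is recovered; lambda = 14 halves it; lambda = 1000 pushes it close to zero, which is the underfitting end of the table.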

Logistic Regression

Sigmoid Function EXAM HOT

f(x) = 1 / (1 + e^(-cx))

c = steepness parameter. Output always between 0 and 1.

x	f(x)
0	0.5 (decision boundary)
+infinity	1
-infinity	0

Memory tip: "S-curve squashes everything into (0, 1). At zero input, output is 0.5."
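
Worked example (a minimal sketch; sigmoid is a hypothetical helper name, and very large negative inputs are avoided because math.exp would overflow):

```python
from math import exp

def sigmoid(x, c=1.0):
    """f(x) = 1 / (1 + e^(-c*x)); c controls the steepness."""
    return 1.0 / (1.0 + exp(-c * x))

at_zero = sigmoid(0.0)        # 0.5, the decision boundary
high = sigmoid(10.0)          # close to 1
low = sigmoid(-10.0)          # close to 0
steeper = sigmoid(1.0, c=5.0) # larger c moves the output toward 1 faster
print(at_zero, high, low, steeper)
```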

Cross-Entropy Loss (Log Loss) EXAM HOT

For y=1: cost = -log( f(x) )

For y=0: cost = -log( 1 - f(x) )

J(w) = -(1/n) * SUM[ y_i*log(f(x_i)) + (1-y_i)*log(1-f(x_i)) ]

Convex in w, so gradient descent can reach the global minimum. Logistic regression uses this loss, NOT MSE.

Memory tip: "Log loss penalizes confident wrong predictions most heavily."
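
Worked example (a minimal sketch; log_loss is a hypothetical helper name, the predicted probabilities are made-up, and they must lie strictly between 0 and 1 so that log is defined):

```python
from math import log

def log_loss(ys, probs):
    """J = -(1/n) * SUM [ y*log(p) + (1-y)*log(1-p) ] for labels ys in {0, 1}."""
    n = len(ys)
    return -sum(y * log(p) + (1 - y) * log(1 - p) for y, p in zip(ys, probs)) / n

confident_right = log_loss([1], [0.99])   # -log(0.99), about 0.01
confident_wrong = log_loss([1], [0.01])   # -log(0.01), about 4.6
print(confident_right, confident_wrong)
```

The wrong-but-confident prediction costs several hundred times more than the right one.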

Machine Learning

Bayes' Theorem

P(B|A) = P(A|B) * P(B) / P(A)

Naive Bayes: C(X) = argmax over classes Ci of P(Ci) * PRODUCT_j[ P(xj | Ci) ]

Naive assumption: features are conditionally independent given the class
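
Worked example (a minimal sketch of the argmax rule; the classes, priors and per-feature likelihoods are all made-up numbers, and classify is a hypothetical helper name):

```python
from math import prod

# Made-up probabilities for illustration only
priors = {"spam": 0.4, "ham": 0.6}                 # P(Ci)
likelihoods = {                                    # P(xj | Ci)
    "spam": {"free": 0.5, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.4},
}

def classify(features):
    """Pick the class maximizing P(Ci) * PRODUCT P(xj | Ci)."""
    scores = {c: priors[c] * prod(likelihoods[c][f] for f in features)
              for c in priors}
    return max(scores, key=scores.get), scores

label_a, _ = classify(["free"])              # spam: 0.20 vs ham: 0.06
label_b, _ = classify(["free", "meeting"])   # spam: 0.020 vs ham: 0.024
print(label_a, label_b)                      # spam ham
```

The scores are not normalized probabilities; dividing by P(X) would not change which class wins.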

Entropy (Decision Trees)

H(S) = -SUM[ fi * log2(fi) ]

fi = fraction of items in class i

H = 0 when all same class (pure). Lower H = purer = better for classification.

IG(S) = H(S) - SUM[ (|Si|/|S|) * H(Si) ]

Information gain = entropy reduction from a split. Choose max IG.

Memory tip: "Entropy = messiness. Lower is cleaner = better split."
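
Worked example (a minimal sketch on made-up class labels; entropy and information_gain are hypothetical helper names):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -SUM fi * log2(fi) over the classes present in labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """IG = H(parent) - weighted average of the children's entropies."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

parent = ["a", "a", "b", "b"]                                        # 50/50 mix: H = 1
perfect = information_gain(parent, [["a", "a"], ["b", "b"]])       # pure children
useless = information_gain(parent, [["a", "b"], ["a", "b"]])       # still 50/50
print(entropy(parent), perfect, useless)                             # 1.0 1.0 0.0
```

A decision tree would choose the first split, since it has the larger information gain.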

Distance and Clustering

Lk Distance EXAM HOT

d(x,y) = ( SUM |xi - yi|^k )^(1/k)

k = 1	Manhattan: d = SUM |xi - yi|
k = 2	Euclidean: d = SQRT( SUM (xi-yi)^2 )

Memory tip: "k=1 is Manhattan (grid), k=2 is Euclidean (straight line, most common)."
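
Worked example (a minimal sketch; lk_distance is a hypothetical helper name and the two points are made-up):

```python
def lk_distance(x, y, k):
    """d(x, y) = ( SUM |xi - yi|^k )^(1/k)."""
    return sum(abs(a - b) ** k for a, b in zip(x, y)) ** (1 / k)

p, q = (0, 0), (3, 4)
manhattan = lk_distance(p, q, 1)   # 3 + 4 = 7
euclidean = lk_distance(p, q, 2)   # sqrt(9 + 16) = 5
print(manhattan, euclidean)
```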

Quick Reference Table

Concept	Key numbers / facts
Binomial parameters	p and n (NOT mean/SD)
Normal parameters	mu and sigma
Empirical Rule	68 / 95 / 99.7 %
Significance level alpha	0.05 (common default)
Reject H0 when	p-value < alpha
H0 definition	No difference / no relationship
Sigmoid at x = 0	f(0) = 0.5
Logistic regression loss	Cross-Entropy (Log Loss)
Euclidean distance k	k = 2
Manhattan distance k	k = 1
Large lambda effect	Simpler model → underfitting risk
High learning rate	Overshoots, fails to converge
High entropy node	Impure — bad split
Residual interpretation	Vertical distance from point to line
Support vectors	Points closest to decision boundary
AdaBoost weight increase	When misclassified by weak classifier in round t
Occam's Razor	Simplest model that fits is preferable
Primary goal of modeling	Generalize well to unseen data