L7: Linear Regression

Linear regression finds the straight line (or hyperplane) that best fits a set of data points and uses it to predict continuous numerical output values. It is one of the most fundamental and widely used models in data science.

What is Linear Regression?

Given a collection of n data points, find the line y = f(x) which best approximates those points. Once found, you can plug in any x value to get a predicted y value.

Why Linear Regression?

Linear relationships are common and easy to understand (income grows linearly with hours worked, weight increases with food eaten)
Simplification: replaces a large noisy dataset with a simple line that captures the main pattern
Forecasting: use the fitted line to estimate y for any new x value
It produces continuous numerical output (not discrete labels — use logistic regression for classification)

Linear Regression Model Formula

h(x) = θ₀ + θ₁ × x

θ₀ = intercept (bias term) | θ₁ = slope (coefficient)

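The formula above covers a single input variable; these notes write the model either as h(x) with parameters θ or as f(x) with weights w, and both mean the same thing. For multiple variables: f(x) = w₀ + w₁x₁ + w₂x₂ + ... + wₘ₋₁xₘ₋₁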
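The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function name `predict` and the example weights are made up for the sketch, and the first weight is assumed to be the intercept.

```python
def predict(weights, features):
    """Return w0 + w1*x1 + w2*x2 + ... for one example.

    Assumption for this sketch: weights[0] is the intercept and the
    remaining weights line up one-to-one with the feature values.
    """
    intercept, coefficients = weights[0], weights[1:]
    return intercept + sum(w * x for w, x in zip(coefficients, features))


# Single variable: h(x) = 2 + 3x evaluated at x = 4
single = predict([2.0, 3.0], [4.0])

# Two variables: f(x) = 1 + 0.5*x1 - 2*x2 evaluated at x1 = 10, x2 = 3
multi = predict([1.0, 0.5, -2.0], [10.0, 3.0])

print(single, multi)  # 14.0 0.0
```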

Residual Error (EXAM HOT)

The residual for a data point is the difference between the actual (observed) value and the predicted (fitted) value:

rᵢ = yᵢ - f(xᵢ)

yᵢ = actual value | f(xᵢ) = value predicted by the model

Geometrically: the VERTICAL distance from the data point to the regression line.

Positive residual: point is ABOVE the line
Negative residual: point is BELOW the line
Zero residual: point lies exactly on the line

Exam answer: "Geometric interpretation of a residual?" → Vertical distance from a data point to the regression line (NOT perpendicular, NOT horizontal)
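A small worked example of the residual definition, using made-up data points and an assumed line y = 1 + 2x (the helper name `residuals` is hypothetical):

```python
def residuals(xs, ys, theta0, theta1):
    """Residual r_i = y_i - f(x_i) for the line f(x) = theta0 + theta1 * x."""
    return [y - (theta0 + theta1 * x) for x, y in zip(xs, ys)]


xs = [1.0, 2.0, 3.0]
ys = [3.5, 5.0, 6.0]

# The line predicts 3, 5 and 7 at these x values.
r = residuals(xs, ys, theta0=1.0, theta1=2.0)
print(r)  # [0.5, 0.0, -1.0]
```

The first point sits above the line (positive residual), the second lies on it, and the third sits below it (negative residual).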

Least Squares — Finding the Best Fit

Least squares regression finds the line that minimizes the sum of squared residuals across all training points. We square the residuals because:

Squaring makes all errors positive (we don't want positive and negative errors to cancel)
Squaring penalizes large errors more than small ones
The squared-error cost is smooth (differentiable everywhere) and its minimizer has a closed-form solution

J(w₀, w₁) = (1/2n) × ∑(yᵢ - f(xᵢ))²

Minimize this cost function J to find the best parameters w₀ and w₁

Exam answer: "Goal of linear regression?" → Minimize the sum of squared differences between actual and predicted values
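The sketch below evaluates the cost J from the formula above on two made-up points, and shows why raw residuals are not summed directly: here they cancel to zero even though the line misses both points. The function name `cost` is illustrative only.

```python
def cost(xs, ys, w0, w1):
    """J(w0, w1) = (1/2n) * sum of squared residuals."""
    n = len(xs)
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys)) / (2 * n)


xs = [1.0, 2.0]
ys = [3.0, 1.0]

# Flat line y = 2: residuals are +1 and -1.
raw_sum = sum(y - 2.0 for y in ys)
j_flat = cost(xs, ys, w0=2.0, w1=0.0)

# Line through both points, y = 5 - 2x: residuals are both 0.
j_exact = cost(xs, ys, w0=5.0, w1=-2.0)

print(raw_sum, j_flat, j_exact)  # 0.0 0.5 0.0
```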

Two Ways to Solve Linear Regression

Method 1: Closed Form Solution (Normal Equation)

w = (AᵀA)⁻¹ Aᵀb

A = feature matrix (n rows × m columns) | b = target values vector

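Gives the exact solution in one step. However, it requires inverting the m × m matrix AᵀA, which costs roughly O(m³) and becomes slow when there are many features; forming AᵀA also needs a full pass over all n rows.

Use when: small datasets with few features. Avoid when: large problems (inversion scales poorly).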
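As a minimal sketch of the normal equation, the code below handles only the single-feature case, where A has two columns (a column of ones and the x values), so AᵀA is 2 × 2 and can be inverted by hand. The function name `fit_normal_equation` is hypothetical; in practice a linear algebra library would normally be used instead of hand-written inversion.

```python
def fit_normal_equation(xs, ys):
    """Solve w = (A^T A)^-1 A^T b where each row of A is [1, x]."""
    n = len(xs)
    # Entries of A^T A = [[n, sx], [sx, sxx]] and A^T b = [sy, sxy].
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))

    det = n * sxx - sx * sx  # determinant of A^T A
    if det == 0:
        raise ValueError("A^T A is singular (all x values are equal)")

    # Inverse of a 2x2 matrix applied to A^T b.
    w0 = (sxx * sy - sx * sxy) / det
    w1 = (n * sxy - sx * sy) / det
    return w0, w1


# Made-up points that lie exactly on y = 1 + 2x.
w0, w1 = fit_normal_equation([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(w0, w1)  # 1.0 2.0
```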

Method 2: Gradient Descent (EXAM HOT)

An iterative optimization algorithm that finds the minimum of the cost function by taking repeated steps "downhill" in the direction that reduces the cost fastest.

θⱼ := θⱼ - α × ∂J/∂θⱼ

α = learning rate (step size) | Apply this update to every parameter θⱼ simultaneously

Repeat until convergence (when updates become tiny)

How it works:

Start with initial parameter values (often random)

Calculate the gradient (slope) of the cost function at current position

Step in the opposite direction of the gradient (downhill)

Repeat until the cost function stops decreasing significantly
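The four steps above can be sketched for the single-variable model as follows. This is a minimal illustration, not a reference implementation: the function name, the default α, the tolerance, and the zero starting point are all assumptions chosen for the example. The gradients come from differentiating J = (1/2n) ∑(f(xᵢ) - yᵢ)².

```python
def gradient_descent(xs, ys, alpha=0.1, tolerance=1e-10, max_iterations=10_000):
    """Fit f(x) = theta0 + theta1 * x by batch gradient descent."""
    n = len(xs)
    theta0, theta1 = 0.0, 0.0  # step 1: initial values (zeros in this sketch)
    for _ in range(max_iterations):
        # Step 2: gradient of J at the current position.
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / n                              # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / n   # dJ/dtheta1

        # Step 3: move against the gradient, updating both parameters together.
        step0, step1 = alpha * grad0, alpha * grad1
        theta0, theta1 = theta0 - step0, theta1 - step1

        # Step 4: stop once the updates become tiny.
        if max(abs(step0), abs(step1)) < tolerance:
            break
    return theta0, theta1


# Made-up points that lie exactly on y = 1 + 2x.
theta0, theta1 = gradient_descent([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(round(theta0, 4), round(theta1, 4))  # 1.0 2.0
```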

Learning Rate (α): Critical! (EXAM HOT)

α value	Effect	Problem
Too small	Takes very tiny steps downhill	Slow convergence — takes forever to reach minimum
Too large	Takes huge steps, overshoots the minimum	Bounces back and forth, may fail to converge
Just right	Efficient steps toward minimum	None

Exam answer: "If learning rate is too high?" → The algorithm may overshoot and fail to converge

Strategy: Start with a larger α and decrease it as you approach the minimum. Monitor the value of J over iterations: if J is increasing, your α is too large.
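To see the monitoring advice in action, the sketch below records J after every iteration for three learning rates on the same made-up data. The specific α values (0.001, 0.1, 0.6) are assumptions picked for this tiny dataset; what counts as too small or too large depends on the data.

```python
def cost_history(xs, ys, alpha, iterations=50):
    """Run gradient descent and return J at the start and after each step."""
    n = len(xs)
    theta0, theta1 = 0.0, 0.0
    history = []
    for _ in range(iterations + 1):
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        history.append(sum(e * e for e in errors) / (2 * n))
        grad0 = sum(errors) / n
        grad1 = sum(e * x for e, x in zip(errors, xs)) / n
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return history


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

too_small = cost_history(xs, ys, alpha=0.001)
good = cost_history(xs, ys, alpha=0.1)
too_large = cost_history(xs, ys, alpha=0.6)

for name, h in [("too small", too_small), ("good", good), ("too large", too_large)]:
    print(f"{name:9s}  J start = {h[0]:.3g}  J end = {h[-1]:.3g}")
```

With the small α the cost barely moves in 50 iterations, with the middle value it drops close to zero, and with the large α the cost grows at every step, which is the warning sign described above.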

Convex vs Non-Convex Functions

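The cost function for linear regression is convex: it has no false bottoms, so every local minimum is a global minimum (and that minimum is unique as long as the feature columns are linearly independent). Imagine a bowl shape: no matter where you start, rolling downhill always leads to the same bottom.

For convex functions: gradient descent with a suitably small learning rate is guaranteed to converge to the global optimum.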

For non-convex functions: gradient descent can get stuck in local minima (false bottoms). Solution: run from multiple starting points and take the best result.
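The multiple-starting-points idea can be illustrated on a simple non-convex function of one variable. The function f(x) = x⁴ - 3x² + x, the four starting points, and the step size are all assumptions chosen for this sketch; it has a global minimum near x ≈ -1.30 and a shallower local minimum near x ≈ 1.13.

```python
def f(x):
    return x**4 - 3 * x**2 + x


def f_prime(x):
    return 4 * x**3 - 6 * x + 1


def descend(start, alpha=0.01, iterations=2000):
    """Plain gradient descent on f from a given starting point."""
    x = start
    for _ in range(iterations):
        x -= alpha * f_prime(x)
    return x


starts = [-2.0, -0.5, 0.5, 2.0]
results = [descend(s) for s in starts]

# Keep whichever run ended at the lowest function value.
best = min(results, key=f)

for s, x in zip(starts, results):
    print(f"start {s:+.1f} -> x = {x:+.3f}, f(x) = {f(x):+.3f}")
print(f"best x = {best:+.3f}")
```

Runs that start on the right-hand side settle in the local minimum; only by comparing several runs does the lower valley on the left show up as the best result.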

Stochastic Gradient Descent (SGD)

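Standard (batch) gradient descent uses ALL n training points to compute each update step, which is expensive for large datasets. SGD instead estimates the gradient from a single random example or, in the mini-batch variant described here, from a small random batch of examples.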

Much faster per update step
Can be used for very large datasets that don't fit in memory
Slightly noisier than full gradient descent, but usually converges well in practice
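A minimal mini-batch sketch of the idea follows. Everything specific here is an assumption for illustration: the function name `sgd`, the batch size of 10, α = 0.5, the fixed seed, and the synthetic noise-free data on y = 1 + 2x (chosen so the true answer is known).

```python
import random


def sgd(xs, ys, alpha=0.5, batch_size=10, steps=2000, seed=0):
    """Fit f(x) = theta0 + theta1 * x using random mini-batches."""
    rng = random.Random(seed)
    n = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(steps):
        # Estimate the gradient from a small random subset, not all n points.
        batch = rng.sample(range(n), batch_size)
        errors = [(theta0 + theta1 * xs[i]) - ys[i] for i in batch]
        grad0 = sum(errors) / batch_size
        grad1 = sum(e * xs[i] for e, i in zip(errors, batch)) / batch_size
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# 100 synthetic points on y = 1 + 2x with x in [0, 1).
xs = [i / 100 for i in range(100)]
ys = [1.0 + 2.0 * x for x in xs]

theta0, theta1 = sgd(xs, ys)
print(round(theta0, 3), round(theta1, 3))
```

Each update touches only 10 of the 100 points. On real, noisy data the estimates would jitter around the minimum instead of settling exactly, which is the noise mentioned above.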

Improving Regression Models

Feature Scaling (Z-scores)

When features have very different numerical ranges (e.g., age 0-100 vs income 0-1,000,000), the coefficients become extreme and numerically unstable. Scale all features to z-scores BEFORE regression.

For features with a very skewed distribution (power-law, like income), use log(x) or sqrt(x) transformation to compress the range before z-scoring.
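A short sketch of both steps, using made-up age and income values. The helper name `z_scores` is hypothetical, and it uses the population standard deviation; using the sample standard deviation instead is an equally common choice.

```python
import math
import statistics


def z_scores(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]


ages = [20, 30, 40, 50, 60]
incomes = [20_000, 40_000, 80_000, 160_000, 1_000_000]  # heavily skewed

scaled_ages = z_scores(ages)

# Compress the skewed feature with log(x) first, then z-score it.
scaled_log_incomes = z_scores([math.log(v) for v in incomes])

print([round(z, 2) for z in scaled_ages])
print([round(z, 2) for z in scaled_log_incomes])
```

After scaling, both features are on a comparable scale (mean 0, standard deviation 1) even though the raw values differ by several orders of magnitude.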

Avoid Perfectly Correlated Features

If two features measure the same thing (e.g., height in feet AND height in meters), they are perfectly correlated. Adding the second feature:

Provides NO new information to the model
Causes the matrix AᵀA in the closed-form solution to be singular (cannot be inverted)
Makes numerical methods unreliable

Use a correlation matrix to identify highly correlated features and remove one from each pair.
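One way to run this check is sketched below with made-up data. The feature names, the values, and the 0.999 cutoff are assumptions for illustration; a real project would pick its own threshold and would typically use a library routine for the correlation matrix.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


heights_m = [1.60, 1.70, 1.80, 1.90]
features = {
    "height_m": heights_m,
    "height_ft": [m * 3.28084 for m in heights_m],  # same information, new unit
    "weight_kg": [55.0, 72.0, 70.0, 95.0],
}

# Flag every pair whose correlation is (almost) exactly +1 or -1.
redundant = [
    (a, b)
    for a, b in combinations(features, 2)
    if abs(pearson(features[a], features[b])) > 0.999
]
print(redundant)  # [('height_m', 'height_ft')]
```

Weight is strongly but not perfectly correlated with height here, so it is kept; only the duplicated height feature is flagged for removal.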

Regularization (EXAM HOT)

Regularization adds a penalty term to the cost function that discourages large parameter values. This is how Occam's Razor is applied mathematically — it pushes the model toward simplicity.

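J(θ) = (1/2n) [ ∑ᵢ (f(xᵢ) - yᵢ)² + λ × ∑ⱼ θⱼ² ]

λ (lambda) = regularization parameter. Controls how much we penalize complexity. The penalty sum runs over the coefficients θ₁, θ₂, ...; the intercept θ₀ is conventionally left unpenalized.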

λ value	Effect on model	Risk
Large λ	Strongly penalizes large parameters → forces them toward zero → simpler model with fewer effective features	Underfitting
Small λ	Minimal penalty → model freely uses all parameters → complex model	Overfitting
Optimal λ	Balance between complexity and fit	None

Exam answer: "Effect of increasing λ?" → Emphasizes small parameters, avoiding overfitting issues
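The shrinking effect of λ can be seen in a single-feature sketch. Assuming the squared-parameter penalty above with an unpenalized intercept, setting the derivatives of J to zero gives the slope θ₁ = Sxy / (Sxx + λ), where Sxy and Sxx are sums over the centered data. The function name `ridge_slope` and the data are made up for this example.

```python
def ridge_slope(xs, ys, lam):
    """Slope minimizing (1/2n) * [sum of squared errors + lam * theta1**2].

    Assumes one feature and an unpenalized intercept, so the slope is
    Sxy / (Sxx + lam) computed on centered data.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / (sxx + lam)


# Made-up points that lie exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

for lam in [0.0, 5.0, 95.0]:
    print(f"lambda = {lam:5.1f}  slope = {ridge_slope(xs, ys, lam):.2f}")
# lambda =   0.0  slope = 2.00
# lambda =   5.0  slope = 1.00
# lambda =  95.0  slope = 0.10
```

With λ = 0 the fit recovers the true slope of 2; as λ grows the slope is pulled toward zero, and at λ = 95 the model is nearly a flat line, which is the underfitting risk in the table above.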

Lecture 7 Summary — 5 Minute Revision

Linear regression fits a line to predict continuous output. Residual = actual - predicted = vertical distance from point to line. Least squares minimizes the sum of squared residuals. Gradient descent iteratively minimizes the cost function: α too small = slow; α too large = overshoots and fails to converge. Convex function = one global minimum = gradient descent works. Large λ = simpler model (underfitting risk). Small λ = complex (overfitting risk). Scale features to z-scores. Remove perfectly correlated features.

Practice Questions

Q1. Linear regression produces what type of output?

A. Discrete class labels (e.g., cat/dog)
B. Probabilities between 0 and 1
C. Continuous numerical values (e.g., house price)
D. Binary outcomes (0 or 1 only)

Answer: C

Linear regression is a regression algorithm — it predicts continuous numerical values (house price, temperature, sales volume). For classification tasks (predicting discrete class labels like spam/not spam), you use logistic regression or other classification algorithms.

Q2. The goal of least squares regression is to:

A. Minimize the sum of absolute differences between actual and predicted values
B. Minimize the sum of squared differences between actual and predicted values
C. Maximize the correlation between features and the target
D. Maximize the sum of squared differences

Answer: B

Least squares minimizes the sum of squared residuals (differences between actual and predicted values). We square because it (1) eliminates the sign, so positive and negative errors don't cancel, (2) penalizes large errors more than small ones, and (3) yields a smooth cost with a closed-form solution.

Q3. A residual in linear regression is geometrically:

A. Perpendicular distance from a data point to the regression line
B. Horizontal distance from a data point to the regression line
C. Vertical distance from a data point to the regression line (actual minus predicted)
D. The slope of the line at that data point

Answer: C

The residual r = y - f(x) is the vertical (y-direction) distance from the data point to the regression line. If the point is above the line, the residual is positive. If below, negative. It is NOT perpendicular distance (minimizing that would define a different method, orthogonal or total least squares regression) and NOT horizontal distance.

Q4. Gradient descent helps in:

A. Minimizing the cost function
B. Maximizing the cost function
C. Keeping the cost function constant
D. Finding the maximum of the cost function

Answer: A

Gradient descent is an iterative optimization algorithm that minimizes the cost function. It works by repeatedly stepping in the direction of the negative gradient (the direction of steepest descent) until it reaches a minimum. The "descent" in the name refers to descending down the cost surface.

Q5. If the learning rate in gradient descent is too high, what happens?

A. The algorithm converges to the optimal solution faster
B. The algorithm converges more slowly than with a small learning rate
C. The algorithm may overshoot the minimum and fail to converge
D. The algorithm always finds the exact global optimum

Answer: C

With too large a learning rate, each step is so big that the algorithm overshoots the minimum of the cost function. It bounces back and forth across the minimum and may actually diverge (cost increases instead of decreasing). Too small a learning rate causes slow convergence but will eventually reach the minimum.

Q6. The cost function for linear regression is convex. This means:

A. It has many local minima that gradient descent can get stuck in
B. It has exactly one global minimum, so gradient descent always finds the optimal solution
C. The cost function is always zero when the model is perfect
D. We must use the closed-form solution instead of gradient descent

Answer: B

The convex linear regression cost is shaped like a bowl: it has no false bottoms, so any minimum gradient descent reaches is the global minimum. No matter where gradient descent starts, rolling downhill (with a suitable learning rate) leads to the same optimal cost. This guarantees gradient descent will find the best solution. Non-convex functions can have multiple local minima where gradient descent can get stuck.

Q7. What is the effect of a very large regularization parameter lambda?

A. Emphasizes large parameters, preventing underfitting
B. Emphasizes small parameters, pushing coefficients toward zero — creates a simpler model, avoids overfitting
C. Has no effect on the model's complexity
D. Increases the learning rate in gradient descent

Answer: B

Large lambda heavily penalizes large parameter values, forcing them toward zero. With the squared penalty used here, this shrinks the influence of unimportant features (their coefficients become small, though usually not exactly zero), creating a simpler, more general model. However, if lambda is too large, it may force ALL parameters toward zero, causing underfitting. Small lambda allows all parameters to grow freely, risking overfitting.

Q8. True or False: In linear regression, the relationship between independent and dependent variables is always non-linear.

A. True
B. False

Answer: B — False

Linear regression models a LINEAR relationship — that's the definition. The model assumes y = w0 + w1*x1 + w2*x2... which is a linear combination of features. While you can add non-linear features (like x^2) to the feature matrix and still use linear regression, the model itself is linear in its parameters.

Q9. Why should features be scaled to z-scores before running linear regression?

A. It makes the model run faster on any computer
B. Features with different numerical scales cause extreme, unstable coefficients and make gradient descent inefficient
C. Z-scaling makes all features normally distributed
D. It prevents the model from overfitting on its own

Answer: B

When features have very different scales (age: 0-100 vs income: 0-1,000,000), the coefficients must also vary widely to compensate. This causes numerical instability, hard-to-interpret coefficients, and gradient descent that moves very slowly (the cost surface is stretched into a long, narrow valley). Z-scoring all features to the same scale makes the cost surface more circular and gradient descent more efficient.

Q10. Stochastic gradient descent (SGD) differs from standard gradient descent in that:

A. SGD uses the entire training dataset for each update
B. SGD uses a small random batch of examples per update, making it much faster for large datasets
C. SGD always finds a better solution than standard gradient descent
D. SGD only works for logistic regression, not linear regression

Answer: B

Standard gradient descent computes the gradient using ALL n training examples for each update step — expensive for large datasets. SGD uses a small random batch (mini-batch) to estimate the gradient. This makes each update much faster, at the cost of some noise in the gradient estimate. For large datasets, SGD is far more practical and often converges just as well.

Q11. What happens if you include two perfectly correlated features (e.g., height in both feet and meters) in linear regression?

A. The model becomes more accurate because it has more information
B. The model runs twice as fast
C. The matrix A^T A becomes singular (non-invertible), causing the closed-form solution to fail; no additional information is gained
D. The regularization term automatically removes one of them

Answer: C

Perfectly correlated features provide no additional information (they carry the same measurement, just scaled differently). More critically, they make the feature matrix A rank-deficient, so A^T A is singular and the inversion (A^T A)^-1 in the closed-form solution is impossible. Numerical methods will also become unreliable. Always remove one of any pair of perfectly correlated features.

Q12. When does gradient descent get trapped in local minima?

A. Always, for any type of function
B. Only for convex functions with a single global minimum
C. Only for non-convex functions that have multiple local minima
D. Only when the learning rate is very small

Answer: C

Gradient descent can only get trapped in local minima for NON-CONVEX functions (functions with multiple valleys). For convex functions (such as the linear regression cost function), gradient descent is guaranteed to find the global optimum. For non-convex functions (like the loss surfaces of neural networks), run from multiple starting points and take the best result found.

Q13. The closed-form solution for linear regression is w = (A^T A)^-1 A^T b. What is a key limitation of this approach?

A. It only works for binary classification problems
B. Matrix inversion is computationally slow for large feature matrices
C. It cannot be used when the data has any missing values
D. It requires the data to be perfectly normally distributed

Answer: B

The closed-form solution requires inverting the matrix (A^T A), which has complexity O(m^3) where m is the number of features. With few features this is fine, but with many features it becomes very slow. This is why gradient descent is often preferred for large problems: it scales well to large datasets and many features.

Q14. A very low learning rate in gradient descent causes:

A. The algorithm to overshoot the minimum
B. The algorithm to fail to converge
C. Very slow convergence — many iterations needed to reach the minimum
D. The algorithm to find local rather than global minima

Answer: C

A very small learning rate means each step moves only a tiny amount toward the minimum. The algorithm will eventually converge (it won't overshoot), but it requires a very large number of iterations. In contrast, too large a learning rate causes overshooting. The optimal learning rate is neither too small nor too large.