Finding the best-fit line, residuals, gradient descent, regularization, feature engineering
Given a collection of n data points, find the line y = f(x) which best approximates those points. Once found, you can plug in any x value to get a predicted y value.
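The residual for a data point is the difference between the actual (observed) value and the predicted (fitted) value: r = y - f(x).
Least squares regression finds the line that minimizes the sum of squared residuals across all training points. We square the residuals because squaring (1) removes the sign, so positive and negative errors do not cancel; (2) penalizes large errors more than small ones; and (3) leads to a closed-form solution.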
Closed-form solution, w = (A^T A)^-1 A^T b. Use when: small datasets with few features. Avoid when: large datasets or many features (forming and inverting A^T A scales poorly).
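For a single feature, the least-squares line can be written without any matrix code: the slope is the covariance of x and y divided by the variance of x. The sketch below assumes that one-feature case; the names `fit_line` and `residuals` and the sample data are made up for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (intercept, slope) of the line minimizing the sum of squared residuals."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def residuals(xs, ys, intercept, slope):
    """Residual = actual - predicted, one value per data point."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Illustrative data (not from the notes)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

w0, w1 = fit_line(xs, ys)           # w0 ≈ 1.15, w1 ≈ 1.94
res = residuals(xs, ys, w0, w1)
sse = sum(r * r for r in res)       # the quantity least squares minimizes
```

With an intercept in the model, the residuals of the fitted line sum to zero, which is a quick sanity check on any implementation.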
Gradient descent is an iterative optimization algorithm that finds the minimum of the cost function by taking repeated steps "downhill" in the direction that reduces the cost fastest (the negative gradient). The size of each step is set by the learning rate α.
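A minimal sketch of these downhill steps for a one-feature linear model follows. It assumes a mean-squared-error cost and a fixed learning rate; the function name `gradient_descent`, the default values, and the data are illustrative choices, not something the notes prescribe.

```python
def gradient_descent(xs, ys, alpha=0.05, steps=2000):
    """Fit y ≈ w0 + w1 * x by gradient descent on the mean squared error."""
    w0, w1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every training point (full-batch update)
        errors = [(w0 + w1 * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error
        grad_w0 = (2.0 / n) * sum(errors)
        grad_w1 = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        # Step against the gradient, scaled by the learning rate
        w0 -= alpha * grad_w0
        w1 -= alpha * grad_w1
    return w0, w1

# Noise-free data generated from y = 1 + 2x, so the minimum is known
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w0, w1 = gradient_descent(xs, ys)
```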
| α value | Effect | Problem |
|---|---|---|
| Too small | Takes very tiny steps downhill | Slow convergence; needs many iterations to reach the minimum |
| Too large | Takes huge steps, overshoots the minimum | Bounces back and forth, may fail to converge |
| Just right | Efficient steps toward minimum | None |
Strategy: start with a larger α and decrease it as you get closer to the minimum. Monitor the cost J over iterations; if J is increasing, your α is too large.
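The three regimes can be seen on a toy one-parameter cost, J(w) = (w - 3)^2, whose minimum is at w = 3. This is an illustrative stand-in for a real regression cost, and `run` is a hypothetical helper that records J at every iteration so it can be monitored.

```python
def run(alpha, steps=20, w=0.0):
    """Gradient descent on J(w) = (w - 3)^2; returns final w and the cost history."""
    costs = []
    for _ in range(steps):
        costs.append((w - 3.0) ** 2)      # monitor J at each iteration
        w -= alpha * 2.0 * (w - 3.0)      # dJ/dw = 2 (w - 3)
    return w, costs

w_small, costs_small = run(alpha=0.01)   # still far from 3 after 20 steps
w_good, costs_good = run(alpha=0.4)      # essentially at 3
w_large, costs_large = run(alpha=1.1)    # cost grows every step: α too large
```

For this particular cost, any α above 1.0 makes each step overshoot by more than the previous error, so J rises instead of falling.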
The cost function for linear regression is convex: it has no local minima other than the global minimum. Imagine a bowl shape: no matter where you start, rolling downhill always leads to the same bottom.
For convex functions: gradient descent with a suitably small learning rate converges to the global optimum.
For non-convex functions: gradient descent can get stuck in local minima (false bottoms). A common mitigation: run from multiple starting points and take the best result.
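The multiple-starting-points idea can be sketched on a small non-convex function. The function f(x) = x^4 - 3x^2 + x is an arbitrary example with two valleys, and the fixed list of starting points is an assumption made to keep the run reproducible; random starts work the same way.

```python
def f(x):
    """Non-convex example: global minimum near x = -1.30, local minimum near x = 1.13."""
    return x ** 4 - 3 * x ** 2 + x

def grad_f(x):
    return 4 * x ** 3 - 6 * x + 1

def descend(x, alpha=0.01, steps=500):
    """Plain gradient descent from a single starting point."""
    for _ in range(steps):
        x -= alpha * grad_f(x)
    return x

starts = [-2.0, -1.0, 0.5, 1.0, 2.0]
candidates = [descend(s) for s in starts]   # some land in the false bottom
best_x = min(candidates, key=f)             # keep the lowest-cost result
```

Starts on the right-hand side settle in the shallower valley; only by comparing the final costs across runs does the deeper one get picked.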
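Standard gradient descent uses ALL n training points to compute each update step, which is expensive for large datasets. Stochastic gradient descent (SGD) instead estimates the gradient from a single random example or, in its common mini-batch form, from a small random batch of examples.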
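A rough sketch of the mini-batch variant for a one-feature linear model is shown here. The batch size, learning rate, epoch count, and fixed seed are all illustrative assumptions, and `minibatch_sgd` is a hypothetical name.

```python
import random

def minibatch_sgd(xs, ys, alpha=0.1, epochs=300, batch_size=4, seed=0):
    """Fit y ≈ w0 + w1 * x, updating from one small random batch at a time."""
    rng = random.Random(seed)          # fixed seed only for reproducibility
    w0, w1 = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)           # new random batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [(w0 + w1 * xs[i]) - ys[i] for i in batch]
            # Gradient estimated from the batch only, not all n points
            grad_w0 = 2.0 * sum(errors) / len(batch)
            grad_w1 = 2.0 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            w0 -= alpha * grad_w0
            w1 -= alpha * grad_w1
    return w0, w1

xs = [i / 10 for i in range(20)]       # 0.0, 0.1, ..., 1.9
ys = [1.0 + 2.0 * x for x in xs]       # noise-free line y = 1 + 2x
w0, w1 = minibatch_sgd(xs, ys)
```

On noisy data the estimates would jitter around the minimum rather than settle exactly, which is the noise cost of the cheaper updates.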
When features have very different numerical ranges (e.g., age 0-100 vs income 0-1,000,000), the coefficients differ by orders of magnitude and the fit can become numerically unstable. Scale all features to z-scores, z = (x - mean) / standard deviation, BEFORE regression.
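As a small sketch, z-scoring one feature column needs only its mean and standard deviation. The helper name `z_scores` and the sample values are illustrative; the population standard deviation is used here, which is one common convention.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Rescale a feature to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)   # population std; would be 0 for a constant feature
    return [(v - mu) / sigma for v in values]

# Illustrative features on very different scales
ages = [20, 30, 40, 50, 60]
incomes = [20_000, 35_000, 50_000, 120_000, 900_000]

ages_z = z_scores(ages)
incomes_z = z_scores(incomes)   # both columns now live on a comparable scale
```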
For features with a very skewed distribution (power-law, like income), apply a log(x) or sqrt(x) transformation to compress the range before z-scoring.
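The following sketch applies the log step and then z-scores the result. The income values are invented, and the natural log is assumed; it requires strictly positive inputs, so data containing zeros would need a variant such as `math.log1p`.

```python
import math
from statistics import mean, pstdev

def z_scores(values):
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

incomes = [20_000, 35_000, 50_000, 120_000, 900_000]   # heavily right-skewed

# Compress the range first (log needs strictly positive values)
log_incomes = [math.log(v) for v in incomes]
scaled = z_scores(log_incomes)

raw_ratio = max(incomes) / min(incomes)                # 45x between extremes
log_spread = max(log_incomes) - min(log_incomes)       # log(45), under 4 units
```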
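If two features measure the same thing (e.g., height in feet AND height in meters), they are perfectly correlated. Adding the second feature provides no new information and makes A^T A singular, so the closed-form solution breaks down and the coefficients are no longer uniquely determined.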
Use a correlation matrix to identify highly correlated features and remove one from each pair.
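One way to sketch this check is to compute the Pearson correlation for every pair of features and flag pairs above a threshold. The 0.95 cutoff, the helper names, and the sample data are assumptions for illustration.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def correlated_pairs(features, threshold=0.95):
    """Return feature-name pairs whose absolute correlation meets the threshold."""
    return [
        (p, q)
        for p, q in combinations(features, 2)
        if abs(pearson(features[p], features[q])) >= threshold
    ]

heights_ft = [5.0, 5.5, 6.0, 6.2, 5.8]
features = {
    "height_ft": heights_ft,
    "height_m": [h * 0.3048 for h in heights_ft],   # same quantity, rescaled
    "age": [23, 35, 31, 52, 44],
}

pairs = correlated_pairs(features)   # one feature from each flagged pair can be dropped
```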
Regularization adds a penalty term to the cost function that discourages large parameter values. This is how Occam's Razor is applied mathematically — it pushes the model toward simplicity.
| λ value | Effect on model | Risk |
|---|---|---|
| Large λ | Strongly penalizes large parameters → forces them toward zero → simpler model with fewer effective features | Underfitting |
| Small λ | Minimal penalty → model freely uses all parameters → complex model | Overfitting |
| Optimal λ | Balance between complexity and fit | None |
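The effect of λ in the table can be seen in the simplest possible case. The sketch assumes an L2 (ridge) penalty, a single feature, and no intercept, so the penalized cost is sum((y - w*x)^2) + λ*w^2 and its minimizer has a one-line closed form. The notes do not specify which penalty is meant, so treat this as one concrete instance.

```python
def ridge_weight(xs, ys, lam):
    """Minimizer of sum((y - w*x)^2) + lam * w^2 for a one-feature, no-intercept model."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]                    # exactly y = 2x

w_none = ridge_weight(xs, ys, 0.0)      # no penalty: recovers w = 2
w_mid = ridge_weight(xs, ys, 14.0)      # penalty equal to sum(x^2): w halves to 1
w_huge = ridge_weight(xs, ys, 1e6)      # weight pushed almost to zero (underfits)
```

Because λ only appears in the denominator, increasing it can only shrink the weight, which is the "forces parameters toward zero" behavior in the first table row.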
Linear regression fits a line to predict continuous output. Residual = actual - predicted = vertical distance from point to line. Least squares minimizes the sum of squared residuals. Gradient descent iteratively minimizes the cost function: α too small = slow; α too large = overshoots and fails to converge. Convex function = one global minimum = gradient descent works. Large λ = simpler model (underfitting risk). Small λ = complex (overfitting risk). Scale features to z-scores. Remove perfectly correlated features.
Q1. Linear regression produces what type of output?
Answer: C
Linear regression is a regression algorithm — it predicts continuous numerical values (house price, temperature, sales volume). For classification tasks (predicting discrete class labels like spam/not spam), you use logistic regression or other classification algorithms.
Q2. The goal of least squares regression is to:
Answer: B
Least squares minimizes the sum of squared residuals (differences between actual and predicted values). We square because squaring (1) eliminates the sign, so positive and negative errors don't cancel, (2) penalizes large errors more than small ones, and (3) leads to a closed-form solution.
Q3. A residual in linear regression is geometrically:
Answer: C
The residual r = y - f(x) is the vertical (y-direction) distance from the data point to the regression line. If the point is above the line, the residual is positive. If below, negative. It is NOT perpendicular distance (that would define a different kind of regression) and NOT horizontal distance.
Q4. Gradient descent helps in:
Answer: A
Gradient descent is an iterative optimization algorithm that minimizes the cost function. It works by repeatedly stepping in the direction of the negative gradient (the direction of steepest descent) until it reaches a minimum. The "descent" in the name refers to descending down the cost surface.
Q5. If the learning rate in gradient descent is too high, what happens?
Answer: C
With too large a learning rate, each step is so big that the algorithm overshoots the minimum of the cost function. It bounces back and forth across the minimum and may actually diverge (cost increases instead of decreasing). Too small a learning rate causes slow convergence but will eventually reach the minimum.
Q6. The cost function for linear regression is convex. This means:
Answer: B
A convex function has no local minima other than the global minimum, like the bottom of a bowl. No matter where gradient descent starts, rolling downhill always leads to the same optimal cost. With a suitable learning rate, this guarantees gradient descent will find the best solution. Non-convex functions can have multiple local minima where gradient descent can get stuck.
Q7. What is the effect of a very large regularization parameter lambda?
Answer: B
Large lambda heavily penalizes large parameter values, forcing them toward zero. This shrinks the influence of less important features, creating a simpler, more general model. However, if lambda is too large, it forces ALL parameters toward zero, causing underfitting. Small lambda allows all parameters to grow freely, risking overfitting.
Q8. True or False: In linear regression, the relationship between independent and dependent variables is always non-linear.
Answer: B — False
Linear regression models a LINEAR relationship — that's the definition. The model assumes y = w0 + w1*x1 + w2*x2... which is a linear combination of features. While you can add non-linear features (like x^2) to the feature matrix and still use linear regression, the model itself is linear in its parameters.
Q9. Why should features be scaled to z-scores before running linear regression?
Answer: B
When features have very different scales (age: 0-100 vs income: 0-1,000,000), the coefficients must also vary widely to compensate. This causes numerical instability, coefficients that are hard to compare, and gradient descent that moves very slowly (the cost function's contours are elongated). Z-scoring all features to the same scale makes the contours more circular and gradient descent more efficient.
Q10. Stochastic gradient descent (SGD) differs from standard gradient descent in that:
Answer: B
Standard gradient descent computes the gradient using ALL n training examples for each update step, which is expensive for large datasets. SGD uses a single random example or a small random batch (mini-batch) to estimate the gradient. This makes each update much faster, at the cost of some noise in the gradient estimate. For large datasets, SGD is far more practical and often converges just as well.
Q11. What happens if you include two perfectly correlated features (e.g., height in both feet and meters) in linear regression?
Answer: C
Perfectly correlated features provide no additional information (one is just a rescaled copy of the other). More critically, they make the feature matrix A rank-deficient, so A^T A is singular and the inverse (A^T A)^-1 in the closed-form solution does not exist. Numerical methods also become unreliable. Always remove one of any pair of perfectly correlated features.
Q12. When does gradient descent get trapped in local minima?
Answer: C
Gradient descent can only get trapped in local minima for NON-CONVEX functions (functions with multiple valleys). For convex functions (one global minimum, like the linear regression cost function), gradient descent is guaranteed to find the global optimum. For non-convex functions (like neural networks), run from multiple starting points and take the best result found.
Q13. The closed-form solution for linear regression is w = (A^T A)^-1 A^T b. What is a key limitation of this approach?
Answer: B
The closed-form solution requires inverting the matrix (A^T A), which has complexity O(m^3) where m is the number of features; forming A^T A from n data points adds O(n m^2). With few features this is fine, but for large feature sets it becomes very slow. This is why gradient descent is often preferred in practice: it scales well to large datasets and many features.
Q14. A very low learning rate in gradient descent causes:
Answer: C
A very small learning rate means each step moves only a tiny amount toward the minimum. The algorithm will eventually converge (it won't overshoot), but it requires a very large number of iterations. In contrast, too large a learning rate causes overshooting. The optimal learning rate is neither too small nor too large.