Lecture 8: Logistic Regression

Classification is the task of assigning an input to one of several predefined categories (classes). Unlike regression, which predicts a number, classification predicts a discrete label: spam/not spam, cancer/benign, male/female. Logistic regression is a foundational classification algorithm.

Classification vs Regression

Aspect	Regression	Classification
Output	Continuous number (any value)	Discrete class label
Examples	House price, temperature, revenue	Spam/ham, cancer/benign, cat/dog
Loss function	Mean Squared Error (MSE)	Cross-Entropy Loss (Log Loss)
Output range	Any real number	Probability 0 to 1 (via sigmoid)
Algorithm	Linear regression	Logistic regression

Why Not Use Linear Regression for Classification?

Linear regression output is unbounded: it can produce values below 0 or above 1, which make no sense as probabilities. The 0.5 decision threshold is also fragile: adding extreme outliers to the dataset shifts the fitted line, so the point where the line crosses 0.5 moves and examples that were classified correctly can end up on the wrong side.

Solution: Use the sigmoid function to convert any real-valued score into a probability between 0 and 1.
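To make both problems concrete, here is a minimal sketch in plain Python. The six-point dataset, the outlier at x = 50, and the helper names `fit_line` and `crossing_point` are all hypothetical, chosen only for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def crossing_point(w0, w1, threshold=0.5):
    """x value at which the fitted line equals the threshold."""
    return (threshold - w0) / w1


# Hypothetical 1-D data: feature value and a 0/1 class label.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 0, 1, 1, 1]

w0, w1 = fit_line(xs, ys)

# Problem 1: scores outside [0, 1] cannot be read as probabilities.
low_score = w0 + w1 * (-2.0)
high_score = w0 + w1 * 10.0
print(low_score, high_score)  # one value below 0, one value above 1

# Problem 2: add one extreme (correctly labelled) class-1 point and refit.
w0_out, w1_out = fit_line(xs + [50.0], ys + [1])
print(crossing_point(w0, w1), crossing_point(w0_out, w1_out))

# x = 4 is a class-1 training example; after the refit its score drops below 0.5.
score_at_4 = w0_out + w1_out * 4.0
print(score_at_4)
```

In this toy setup the 0.5 crossing moves from between the two classes to the right of x = 4, so a class-1 example is now predicted as class 0 even though the added point was labelled correctly.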

The Sigmoid (Logistic) Function EXAM HOT

The sigmoid function converts ANY real number into a probability between 0 and 1:

f(x) = 1 / (1 + e^(-cx))

c = steepness parameter (controls how sharp the transition is)

Input x	Output f(x)	Interpretation
x = 0	0.5	Decision boundary — equal probability of either class
x → +∞	→ 1	Almost certain it's class 1
x → -∞	→ 0	Almost certain it's class 0

The sigmoid function creates an S-shaped curve that smoothly transitions from 0 to 1.
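The table can be checked numerically with a short sketch. The function below follows the formula above with the steepness parameter c; splitting into two branches is an implementation choice made here (not part of the lecture) so that very negative inputs do not overflow `math.exp`.

```python
import math


def sigmoid(x, c=1.0):
    """Sigmoid with steepness c: 1 / (1 + e^(-c*x)), computed without overflow."""
    z = c * x
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z).
    ez = math.exp(z)
    return ez / (1.0 + ez)


for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(x, round(sigmoid(x), 4))

# A larger c makes the transition around x = 0 sharper.
print(sigmoid(1.0, c=1.0), sigmoid(1.0, c=5.0))
```

The output at x = 0 is exactly 0.5 for any value of c, since e^0 = 1.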

Exam answer: "Sigmoid function is used for?" → To model the probability that a binary outcome variable belongs to a particular class

Logistic Regression Model

Step 1: Fit a linear function to the features (same as linear regression):

h(x, w) = w₀ + w₁x₁ + w₂x₂ + ... + wₑ₋₁xₑ₋₁

Step 2: Pass through the sigmoid to get a probability:

f(x) = 1 / (1 + e^(-h(x,w)))

Output = P(y=1 | x) = probability that the input belongs to class 1

If f(x) ≥ 0.5, predict class 1. If f(x) < 0.5, predict class 0. (Threshold can be adjusted depending on the problem.)
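The two steps and the threshold rule can be sketched as follows. The weights and the input are made-up numbers, and the function names are illustrative rather than taken from any library.

```python
import math


def linear_score(x, w):
    """Step 1: h(x, w) = w[0] + w[1]*x[0] + w[2]*x[1] + ... (w[0] is the intercept)."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


def predict_proba(x, w):
    """Step 2: pass the linear score through the sigmoid to get P(y=1 | x)."""
    return 1.0 / (1.0 + math.exp(-linear_score(x, w)))


def predict(x, w, threshold=0.5):
    """Predict class 1 when the probability reaches the threshold, else class 0."""
    return 1 if predict_proba(x, w) >= threshold else 0


w = [-4.0, 1.0, 0.5]  # hypothetical weights: intercept, w1, w2
x = [3.0, 4.0]        # hypothetical input; h = -4 + 3 + 2 = 1

print(linear_score(x, w))            # the raw score h(x, w)
print(predict_proba(x, w))           # sigmoid(1), roughly 0.73
print(predict(x, w))                 # default threshold 0.5
print(predict(x, w, threshold=0.9))  # a stricter threshold changes the decision
```

Raising the threshold trades more missed positives for fewer false alarms, which is the kind of problem-dependent adjustment mentioned above.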

Decision Boundaries

Logistic regression finds the best separating line (or hyperplane in multiple dimensions) between two classes in feature space. Points on one side are predicted class 0; points on the other side are class 1.

The decision boundary is where f(x) = 0.5, which is where h(x,w) = 0
Linear logistic regression: the boundary is a straight line
Non-linear boundaries: possible by adding non-linear features (x², x₁×x₂)
Perfect separation through complex boundaries usually means overfitting, not genuine insight
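As an illustration of a non-linear boundary obtained from added features, the sketch below appends x₁² and x₂² to a two-feature input. The weights are hand-picked (not learned) so that h = 0 is exactly the unit circle; all names are hypothetical.

```python
def expand_features(x1, x2):
    """Original features plus their squares: [x1, x2, x1^2, x2^2]."""
    return [x1, x2, x1 * x1, x2 * x2]


# Hand-picked weights: h = -1 + x1^2 + x2^2, so h = 0 on the unit circle.
w = [-1.0, 0.0, 0.0, 1.0, 1.0]


def score(x1, x2):
    """Linear score h in the expanded feature space."""
    feats = expand_features(x1, x2)
    return w[0] + sum(wi * fi for wi, fi in zip(w[1:], feats))


def predict(x1, x2):
    """h >= 0 is equivalent to sigmoid(h) >= 0.5, so the sigmoid can be skipped here."""
    return 1 if score(x1, x2) >= 0 else 0


print(predict(0.0, 0.0))  # inside the circle
print(predict(2.0, 0.0))  # outside the circle
print(score(1.0, 0.0))    # on the boundary, h = 0
```

The model is still linear in the expanded features; only the boundary drawn in the original (x₁, x₂) plane is curved.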

Cross-Entropy Loss (Log Loss) EXAM HOT

Logistic regression uses cross-entropy loss, NOT mean squared error. Why? Because MSE creates a non-convex cost function for logistic regression, making gradient descent unreliable. Cross-entropy creates a convex cost function.

For positive class (y=1): cost = -log( f(x) )

For negative class (y=0): cost = -log( 1 - f(x) )

Intuition: if y=1 and prediction f(x)→1, cost→0. If prediction f(x)→0 (wrong!), cost→∞. Penalizes confident wrong predictions most.
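A quick numerical check of this intuition, as a sketch using only the two per-example formulas above (the function name `example_cost` is made up for this example):

```python
import math


def example_cost(y, p):
    """Cost of predicting probability p for one example with true label y (0 or 1)."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


# True label is 1: the cost grows quickly as the prediction moves toward 0.
for p in (0.99, 0.9, 0.5, 0.1, 0.01):
    print(p, round(example_cost(1, p), 3))
```

A prediction of 0.5 costs log 2 (about 0.693), while a confident wrong prediction of 0.01 costs about 4.6. Note that p must lie strictly between 0 and 1, since log(0) is undefined.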

J(w) = -(1/n) ∑[ yᵢ log(f(xᵢ)) + (1-yᵢ) log(1-f(xᵢ)) ]

Combined log loss for entire dataset. This function is convex → gradient descent finds global minimum.
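The following sketch computes J(w) and minimizes it with plain batch gradient descent. It relies on the standard gradient of the log loss, (1/n) ∑ (f(xᵢ) - yᵢ) xᵢ, which is not derived in these notes. The toy data, learning rate, and iteration count are arbitrary assumptions, and probabilities are clipped by a small `eps` purely to avoid log(0).

```python
import math


def sigmoid(z):
    """Overflow-safe sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(x, w):
    """P(y=1 | x) with w[0] as the intercept."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


def log_loss(X, y, w, eps=1e-12):
    """J(w): average cross-entropy over the dataset."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict_proba(xi, w), eps), 1.0 - eps)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return -total / len(X)


def gradient_step(X, y, w, lr):
    """One batch gradient descent update using grad = (1/n) * sum((f(x) - y) * x)."""
    n = len(X)
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        err = predict_proba(xi, w) - yi
        grad[0] += err  # intercept term (its feature is the constant 1)
        for j, xij in enumerate(xi):
            grad[j + 1] += err * xij
    return [wj - lr * gj / n for wj, gj in zip(w, grad)]


# Hypothetical one-feature data; the classes overlap, so the minimum is finite.
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y = [0, 0, 1, 0, 1, 0, 1, 1]

w = [0.0, 0.0]
initial_loss = log_loss(X, y, w)  # all predictions are 0.5, so J = log 2
for _ in range(2000):
    w = gradient_step(X, y, w, lr=0.1)
final_loss = log_loss(X, y, w)

print(initial_loss, final_loss, w)
```

Because J(w) is convex, the loss in this sketch decreases from its starting value toward a single best value instead of settling in a worse local minimum, provided the learning rate is small enough.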

Exam answer: "Primary loss function in logistic regression?" → Cross-Entropy Loss (Log Loss) — NOT MSE, NOT RMSE, NOT MAE

Class Imbalance in Logistic Regression

When training with 10 positive examples and 1,000,000 negative examples, the best-scoring decision boundary gets pushed far away from the massive negative cluster, toward or even past the positive examples, rather than being placed at the midpoint between the classes.

Fix: Use equal numbers of positive and negative training examples.

Ways to balance: find more minority examples, discard majority examples, weight minority class more heavily, replicate minority examples with random perturbation.
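Two of these options can be sketched in a few lines of plain Python: discarding majority examples (random undersampling) and weighting the minority class more heavily. The weighting rule n / (2 × class count) is one common heuristic, used here as an assumption rather than something prescribed by the lecture; the function names and the 10-vs-1000 dataset are hypothetical.

```python
import random


def undersample_majority(examples, labels, seed=0):
    """Keep every minority example and an equally sized random sample of the majority."""
    rng = random.Random(seed)
    pos = [i for i, lab in enumerate(labels) if lab == 1]
    neg = [i for i, lab in enumerate(labels) if lab == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [examples[i] for i in kept], [labels[i] for i in kept]


def balanced_class_weights(labels):
    """Weight each class by n / (2 * count); assumes both classes are present."""
    n = len(labels)
    counts = {0: labels.count(0), 1: labels.count(1)}
    return {c: n / (2 * counts[c]) for c in counts}


# Hypothetical imbalanced dataset: 10 positives, 1000 negatives.
labels = [1] * 10 + [0] * 1000
examples = list(range(len(labels)))  # stand-ins for feature vectors

bal_x, bal_y = undersample_majority(examples, labels)
print(len(bal_y), sum(bal_y))  # 20 examples, 10 of them positive

weights = balanced_class_weights(labels)
print(weights)  # each positive example counts about 100 times as much as a negative one
```

With these weights, each class contributes the same total weight to the loss (10 × 50.5 = 1000 × 0.505), which has a similar effect to training on equal numbers of examples without discarding data.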

Multi-Class Classification

Logistic regression is binary by default. For more than 2 classes:

One-vs-All approach:

Build one binary classifier for each class: "Is this class 1 vs all others?"

Run all classifiers on the new input

Predict the class whose classifier gives the highest probability
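The three steps above can be sketched as follows. The per-class weights are invented for illustration (in practice each classifier would be trained on relabelled targets such as those produced by `binary_targets`), and all function names are hypothetical.

```python
import math


def sigmoid(z):
    """Overflow-safe sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def binary_targets(labels, positive_class):
    """Relabel a multi-class target as 'this class' (1) vs 'all others' (0)."""
    return [1 if lab == positive_class else 0 for lab in labels]


def predict_one_vs_all(x, classifiers):
    """Run every binary classifier and return the class with the highest probability."""
    probs = {
        label: sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
        for label, w in classifiers.items()
    }
    best = max(probs, key=probs.get)
    return best, probs


# Targets used to train the "drama" classifier.
print(binary_targets(["comedy", "drama", "action", "drama"], "drama"))

# Hypothetical weights [w0, w1, w2], one binary classifier per class.
classifiers = {
    "comedy": [0.0, 2.0, -1.0],
    "drama": [0.0, -1.0, 2.0],
    "action": [-1.0, 0.5, 0.5],
}

best_a, probs_a = predict_one_vs_all([1.0, 0.0], classifiers)
best_b, probs_b = predict_one_vs_all([0.0, 1.0], classifiers)
print(best_a, best_b)
```

The k probabilities come from independent classifiers, so they generally do not sum to 1; only their ranking is used for the prediction.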

Do NOT encode nominal multi-class as integers (1, 2, 3). This creates a false ordering — the model will treat class 3 as mathematically "more than" class 1, which is meaningless for nominal categories like "comedy, drama, action."

Exception: ordinal classes (like star ratings 1-5) have a genuine order, so they can be encoded numerically and modeled with ordinal logistic regression.
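One order-free way to represent nominal categories is one-hot encoding, with a separate 0/1 indicator per category. The helper below is a minimal sketch with a hypothetical name and category list.

```python
def one_hot(value, categories):
    """Return a list with a 1 in the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


categories = ["comedy", "drama", "action"]  # nominal: the list order carries no meaning

for genre in categories:
    print(genre, one_hot(genre, categories))
```

Every category gets its own position, so no category is numerically larger than, or between, any others.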

Lecture 8 Summary — 5 Minute Revision

Classification predicts discrete labels (vs regression, which predicts continuous values).
Sigmoid: f(0) = 0.5, output always between 0 and 1, converts a score to a probability.
Loss function = Cross-Entropy (Log Loss), NOT MSE.
Log loss is convex → gradient descent finds the global minimum.
For imbalanced data: balance the training classes.
Multi-class: use one-vs-all.
Don't encode nominal multi-class as integers.
Complex decision boundaries = usually overfitting.

Practice Questions

Q1. The primary loss function used in logistic regression for binary classification is:

A. Mean Squared Error (MSE)
B. Root Mean Squared Error (RMSE)
C. Cross-Entropy Loss (Log Loss)
D. Mean Absolute Error (MAE)

Answer: C

Logistic regression uses Cross-Entropy Loss (Log Loss), not MSE. MSE creates a non-convex optimization problem for logistic regression, meaning gradient descent could get stuck in local minima. Cross-entropy creates a convex cost function, guaranteeing gradient descent finds the global minimum. The formula penalizes confident wrong predictions most severely.

Q2. In logistic regression, what is the sigmoid function used for?

A. To calculate the mean and standard deviation of input variables
B. To model the probability that a binary outcome belongs to a particular class
C. To measure the accuracy of the model on training data
D. To identify and remove outliers from the dataset

Answer: B

The sigmoid function converts any real-valued score (the linear combination of features) into a probability between 0 and 1. This probability represents the likelihood that the input belongs to class 1 (or the positive class). For example, P(email is spam | its features) = sigmoid(linear score).

Q3. What does the sigmoid function output when its input is exactly 0?

A. 0
B. 1
C. -1
D. 0.5

Answer: D

f(0) = 1/(1+e^0) = 1/(1+1) = 1/2 = 0.5. The sigmoid always equals 0.5 at x=0. This is the decision boundary: if the linear score h(x,w) = 0, the model is exactly 50-50 between the two classes. Positive scores give probability above 0.5 (predict class 1), negative scores give below 0.5 (predict class 0).

Q4. Logistic regression can be extended to multi-class problems using:

A. Multiple linear regression equations combined
B. The one-vs-all approach: one binary classifier per class, predict the class with highest probability
C. Encoding classes as sequential integers (1, 2, 3, ...) in the target variable
D. Training a single model with a different activation function

Answer: B

One-vs-all (one-vs-rest): for k classes, train k separate binary classifiers. Each classifier answers "Is this class i, or is it something else?" For a new input, run all k classifiers and predict the class whose classifier returns the highest probability. This correctly handles multi-class without creating false orderings.

Q5. Why is it problematic to encode nominal multi-class categories as integers (blond=0, brown=1, red=2)?

A. The model will require more memory to store larger integer values
B. This creates a false mathematical ordering — the model incorrectly treats "red" (2) as greater than "blond" (0)
C. Integer encoding is not supported in logistic regression implementations
D. It makes the model train much slower

Answer: B

Encoding nominal (unordered) categories as integers implies mathematical relationships that don't exist. The model would treat "red" (2) as twice "brown" (1), or treat "brown" (1) as lying between the other two. For nominal categories, use one-hot encoding (a separate binary feature for each category). Ordinal encoding is only appropriate when the categories have a genuine, meaningful order.

Q6. True or False: Logistic regression can be used for both binary and multi-class classification problems.

A. True
B. False — logistic regression is strictly binary only

Answer: A — True

While logistic regression is inherently binary (outputs a probability between 0 and 1 for two classes), it can handle multi-class problems using the one-vs-all strategy. For k classes, you build k binary logistic regression models. At prediction time, each model gives a probability and you select the class with the highest probability.

Q7. The cross-entropy cost function for logistic regression is convex. Why does this matter?

A. It means the function is easy to visualize
B. It guarantees that gradient descent will find the global minimum (the best possible parameters)
C. It means the model will always achieve 100% training accuracy
D. It means no regularization is needed

Answer: B

For a convex function, every local minimum is also a global minimum. When the cost function is convex, gradient descent (with a suitable learning rate) therefore cannot get stuck in a worse local minimum and converges toward the global minimum. This is why cross-entropy is used instead of MSE for logistic regression; MSE creates a non-convex surface with potential local minima.

Q8. When training logistic regression with severely imbalanced classes (10 positive vs 1,000,000 negative), without fixing the imbalance the model will tend to:

A. Place the decision boundary exactly between the two class clusters
B. Push the decision boundary far from the large cluster, potentially misclassifying most of the minority class
C. Automatically balance the classes internally
D. Refuse to train and return an error

Answer: B

With 10 positive vs 1,000,000 negative examples, the optimal decision boundary from the logistic regression perspective is pushed far away from the massive negative cluster (to minimize total error), rather than sitting between the two groups. This means many positive (minority) examples will be on the wrong side. Fix: use equal numbers of positive and negative examples.

Q9. The key difference between logistic regression and linear regression is:

A. Linear regression uses gradient descent; logistic regression does not
B. Logistic regression uses a sigmoid to produce probabilities for classification; linear regression produces continuous numerical outputs
C. Logistic regression is always more accurate than linear regression
D. Linear regression requires normally distributed data; logistic regression does not

Answer: B

The core difference: linear regression predicts continuous values (no bound on output). Logistic regression passes the linear score through a sigmoid function to produce a probability (0 to 1) for classification. Both can be trained with gradient descent. Linear regression uses MSE loss. Logistic regression uses cross-entropy loss. Use logistic regression when your output is a discrete class label.

Q10. A logistic regression model achieves a perfect decision boundary that separates all training examples with zero error. This most likely indicates:

A. An excellent, highly generalizable model
B. The model has found the true underlying pattern in the data
C. Overfitting — perfect separation via a complex boundary reflects memorization of training data, not genuine insight
D. The sigmoid function is working correctly

Answer: C

Perfect separation on training data is a red flag for overfitting. Complex decision boundaries that perfectly classify all training examples usually memorize the specific noise and quirks of the training set. On new, unseen data, such models typically perform poorly. A simpler decision boundary that makes a few training errors often generalizes much better.

Q11. Non-linear decision boundaries in logistic regression can be achieved by:

A. Increasing the learning rate above 1.0
B. Using a different loss function
C. Explicitly adding non-linear features (like x^2 or x1*x2) to the feature set
D. Logistic regression cannot produce non-linear boundaries under any circumstances

Answer: C

Logistic regression by itself produces linear decision boundaries. To get non-linear boundaries (like circles, curves, or complex shapes), you explicitly add non-linear features to the input, for example x1^2, x2^2, or x1*x2. The model is then "linear in the features" but the boundary is non-linear in the original input space. However, adding too many non-linear features risks overfitting.

Q12. For binary classification (class 0 and class 1), the target variable must be encoded as:

A. -1 and +1
B. 0 and 1
C. Any two different numbers
D. The string labels "negative" and "positive"

Answer: B

The log loss formula used here requires the target variable to be encoded as 0 (negative class) and 1 (positive class). The formula uses these values directly as indicators: the y_i and (1 - y_i) terms switch between the two cost terms. Examples: spam=1, ham=0; cancer=1, benign=0; male=0, female=1.

Q13. The decision boundary of a logistic regression model is located where:

A. The sigmoid function equals 0
B. The sigmoid function equals 0.5, which is where the linear score h(x,w) = 0
C. The cost function equals 0
D. The accuracy of the model is maximized

Answer: B

The decision boundary is the set of points where the model is exactly 50/50 (probability = 0.5) between the two classes. Since sigmoid(0) = 0.5, the decision boundary is where h(x,w) = w0 + w1*x1 + ... = 0. Points where h(x,w) > 0 have probability > 0.5 (predict class 1). Points where h(x,w) < 0 have probability < 0.5 (predict class 0).
