L9: Topics in Machine Learning

This lecture surveys the major machine learning methods beyond logistic regression. You need to understand what each algorithm does and its key properties, strengths, and weaknesses, especially for AdaBoost and SVMs, which had confirmed exam questions.

Dimensions of ML Performance

Dimension	What it means	Example
Power/Expressibility	How complex a pattern it can learn	Deep learning: very powerful but risks overfitting
Interpretability	Can you explain WHY it made a prediction?	Decision trees: explainable. Deep learning: "black box."
Ease of use	How many decisions you need to make to use it	Naive Bayes: few parameters, works "out of the box"
Training speed	How long it takes to learn from data	Naive Bayes: very fast. Deep learning: very slow
Prediction speed	How fast it makes predictions on new data	KNN: slow (compares to all training points)

Naive Bayes Classifier

Based on Bayes' Theorem: given an input, compute the probability of each class and predict the most probable one.

P(B|A) = P(A|B) × P(B) / P(A)

For classification: P(class | input) ∝ P(class) × P(input | class)
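
As a small worked illustration of the proportionality above, the sketch below turns prior × likelihood into normalized posteriors. The spam/ham classes and all numbers are invented for illustration and do not come from the lecture.

```python
# Hypothetical numbers: class priors and the likelihood of seeing the word "free".
priors = {"spam": 0.3, "ham": 0.7}
likelihood = {"spam": 0.6, "ham": 0.05}   # P("free" | class)

# Unnormalized scores: P(class) * P(input | class)
scores = {c: priors[c] * likelihood[c] for c in priors}

# Dividing by the sum of the scores (which equals P(input)) gives P(class | input).
evidence = sum(scores.values())
posterior = {c: s / evidence for c, s in scores.items()}

# The classifier predicts the most probable class.
prediction = max(posterior, key=posterior.get)
```

Note that the division by P(input) does not change which class wins, which is why the classifier can work with the unnormalized scores.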

The "naive" assumption: all features are conditionally independent given the class. This means knowing one feature tells us nothing about another feature, once we know the class. This is rarely true in reality, but the classifier still often works well in practice.

C(X) = argmax over Cᵢ of P(Cᵢ) × ∏ⱼ P(xⱼ|Cᵢ)

Predict the class Cᵢ that maximizes: prior probability × product of likelihoods of each feature

Log form (to avoid numerical underflow with very small probabilities):

C(X) = argmax over Cᵢ of [ log P(Cᵢ) + ∑ⱼ log P(xⱼ|Cᵢ) ]
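
The log form can be sketched in a few lines of Python. This is a minimal illustration, assuming the per-class likelihood tables are already known; the function name, the toy spam/ham model, and its probabilities are hypothetical.

```python
import math

def predict_log_naive_bayes(priors, likelihoods, features):
    """Return the class maximizing log P(C) + sum_j log P(x_j | C).

    priors:      {class: P(class)}
    likelihoods: {class: {feature_value: P(feature_value | class)}}
    features:    list of observed feature values
    """
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for x in features:
            score += math.log(likelihoods[c][x])
        if score > best_score:
            best_class, best_score = c, score
    return best_class, best_score

# Hypothetical toy model (values invented for illustration).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.30, "meeting": 0.02},
    "ham":  {"free": 0.03, "meeting": 0.20},
}
label, score = predict_log_naive_bayes(priors, likelihoods, ["free", "free", "meeting"])

# Why logs: multiplying many small probabilities underflows to 0.0,
# while the equivalent sum of logs stays finite.
tiny = 1e-5
product = 1.0
for _ in range(100):
    product *= tiny
log_sum = 100 * math.log(tiny)
```

A real implementation would also need smoothing so that no likelihood is exactly zero, since log(0) is undefined.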

Strengths: Fast, simple, works well with many features, handles missing data well. Weakness: Naive independence assumption is often violated.

Decision Tree Classifiers

A decision tree is a binary branching structure. Each internal node asks a yes/no question about a feature ("Is age > 9.5?"). Each training example follows a path from root to leaf, and the leaf assigns the class label.

Advantages of Decision Trees:

Non-linearity: can represent highly complex decision boundaries
Categorical variables: naturally handles "if hair color = red"
Interpretability: you can read the tree and explain every decision
Robustness: can build ensembles (random forests) for better performance

Key disadvantage: Prone to overfitting — if each training example has its own leaf, the tree memorizes training data perfectly but generalizes poorly.

Entropy and Information Gain — How to Choose Splits

Entropy measures the "impurity" or "messiness" of a node — how mixed the classes are:

H(S) = -∑ fᵢ × log₂(fᵢ)

fᵢ = fraction of examples belonging to class i

Entropy value	Meaning	Good for classification?
H = 0	All examples belong to ONE class — pure node	Best possible (a leaf)
H = log₂(m)	m classes distributed equally — maximum confusion	Worst possible

Rule: Lower entropy = purer node = better split.
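
The entropy formula and the two table rows can be checked with a short sketch. The helper name and the example labels are illustrative choices, not part of the lecture.

```python
import math

def entropy(labels):
    """H(S) = -sum_i f_i * log2(f_i), where f_i is the fraction of class i."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    # Classes with zero count never appear in `counts`,
    # so 0 * log2(0) is never evaluated.
    return sum(-(k / n) * math.log2(k / n) for k in counts.values())

pure = entropy(["A", "A", "A", "A"])     # one class only
mixed = entropy(["A", "B", "C", "D"])    # m = 4 classes, equally mixed
```

Here `pure` evaluates to 0 and `mixed` to log₂(4) = 2, matching the best and worst cases in the table.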

Information Gain: how much a split reduces entropy:

IG(S) = H(S) - ∑(|Sᵢ|/|S|) × H(Sᵢ)

Choose the split with the HIGHEST information gain at each node.
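
A minimal sketch of comparing two candidate splits by information gain follows. The eight yes/no examples and both splits are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_i f_i * log2(f_i)."""
    n = len(labels)
    return sum(-(k / n) * math.log2(k / n) for k in Counter(labels).values())

def information_gain(parent, children):
    """IG = H(parent) - sum_i (|S_i| / |S|) * H(S_i)."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 4 + ["no"] * 4

# Hypothetical candidate splits of the same eight examples.
split_a = [["yes"] * 4, ["no"] * 4]                                   # separates the classes
split_b = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]    # children as mixed as parent

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
best = "a" if gain_a > gain_b else "b"
```

Split a removes all impurity (gain of 1 bit) while split b removes none (gain of 0), so the tree would choose split a.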

Stopping: Stop splitting when information gain falls below a threshold, or prune the tree after building it fully.

Ensemble Methods: Bagging and Boosting

Bagging (Random Forests)

Build many decision trees, each trained on a bootstrap sample of the training data (drawn at random with replacement) and, in a random forest, on a random subset of features. Combine predictions by voting.

Voting across many trees increases robustness and reduces variance (overfitting)
Also lets you measure uncertainty (how many trees agree?)

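Bagging = Bootstrap AGGregatING: train each tree on a random resample of the training examples, then aggregate the trees' predictions. Random forests additionally pick a random subset of features for each tree.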
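
The two halves of the name can be sketched separately: "bootstrap" means resampling training examples with replacement, and "aggregating" means combining the trees' votes. The sketch below leaves out the tree training itself; the per-tree predictions are hypothetical stand-ins.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) examples with replacement (some repeat, some are left out)."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Return the most common label and the fraction of voters that agree."""
    label, count = Counter(predictions).most_common(1)[0]
    return label, count / len(predictions)

rng = random.Random(0)          # fixed seed, for repeatability only
data = list(range(10))          # stand-in for ten training examples
samples = [bootstrap_sample(data, rng) for _ in range(5)]   # one sample per tree

# Hypothetical predictions from five trees for one test point.
label, agreement = majority_vote(["cat", "cat", "dog", "cat", "dog"])
```

The `agreement` fraction (3 of 5 here) is one simple way to read off how uncertain the ensemble is.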

AdaBoost (Adaptive Boosting) EXAM HOT

Combines many weak classifiers (individually only slightly better than random guessing, >50% accuracy) into one strong classifier.

How AdaBoost works:

Initialize: all training examples have equal weight (1/n each)

Train a weak classifier on the current weights

Compute the classifier's weight α based on its weighted accuracy: better classifiers get higher weight

Increase the weight of misclassified examples (and decrease the weight of correctly classified ones), so the next classifier focuses on the mistakes

Repeat for T rounds; final prediction = weighted vote of all classifiers
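
One round of the reweighting described above can be sketched as follows. The notes do not give formulas, so the α and update rules here are the standard discrete AdaBoost formulation, stated as an assumption; the sketch also assumes labels of +1/-1 and a weighted error strictly between 0 and 0.5. The predictions are hypothetical.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One reweighting step. Labels are +1/-1; weights sum to 1."""
    # Weighted error of the weak classifier on the current distribution.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Classifier weight: larger when the error is smaller.
    alpha = 0.5 * math.log((1 - err) / err)
    # Multiply by e^{+alpha} if misclassified, e^{-alpha} if correct, then renormalize.
    new = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new)
    return alpha, [w / total for w in new]

y_true = [+1, +1, -1, -1, +1]
y_pred = [+1, +1, -1, -1, -1]    # hypothetical weak classifier: only the last example is wrong
weights = [1 / 5] * 5            # initialization: equal weights 1/n
alpha, weights = adaboost_round(weights, y_true, y_pred)
```

After this round the single misclassified example carries half of the total weight, so the next weak classifier is strongly pushed to get it right.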

Exam answer: "In AdaBoost, weight is increased for examples that were..." → Classified incorrectly by the weak classifier trained in round t (Answer A)

Key properties:

Pro: Effectively uses weak classifiers. Gradient-boosted decision trees (a related boosting method) are frequent winners of Kaggle competitions on structured/tabular data.
Con: Tries to fit every example → overfits noisy data. Handle noise with regularization.

Support Vector Machines (SVMs) EXAM HOT

SVMs find the maximum margin hyperplane — the decision boundary that is as far as possible from both classes.

Support Vectors — The data points that are closest to the decision boundary. They "support" (define) the position of the margin boundary. Only these points affect the hyperplane position — all other points are irrelevant.

Exam answer: "Role of support vectors?" → Data points closest to the decision boundary (Answer A)

SVM vs Logistic Regression:

Logistic regression: considers ALL training points in determining the boundary (weighted by their distance)
SVM: ONLY the support vectors (boundary points) determine the hyperplane; points far away are irrelevant
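
One way to see this difference is through the loss each model typically minimizes. The notes do not name these losses, so treat the hinge loss below as an assumption about the usual soft-margin SVM formulation; the margin values are arbitrary examples.

```python
import math

def hinge_loss(margin):
    """Soft-margin SVM loss: exactly zero once y * f(x) >= 1."""
    return max(0.0, 1.0 - margin)

def logistic_loss(margin):
    """Logistic regression loss: small for distant points, but never exactly zero."""
    return math.log(1.0 + math.exp(-margin))

# margin = y * f(x); a large positive value means far away on the correct side.
margins = [0.5, 1.0, 3.0, 10.0]
hinge = [hinge_loss(m) for m in margins]
logistic = [logistic_loss(m) for m in margins]
```

Points at or beyond the margin contribute nothing to the hinge loss, so moving them around does not change the SVM solution, while every point contributes a little to the logistic loss.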

Kernel Trick

What if the data is not linearly separable? The kernel trick transforms the data into a higher-dimensional space where linear separation BECOMES possible.

Key insight: The kernel function computes the dot product in the high-dimensional space WITHOUT explicitly transforming the data. This is computationally efficient.
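
The claim can be checked numerically for a simple case. The sketch below uses a degree-2 polynomial kernel on 2D points; the feature map `phi` and the sample points are illustrative choices, not taken from the lecture.

```python
import math

def phi(v):
    """Explicit degree-2 feature map for a 2D point (x1, x2)."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def poly2_kernel(a, b):
    """k(a, b) = (a . b)^2, computed entirely in the original 2D space."""
    return dot(a, b) ** 2

a, b = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(a), phi(b))   # transform first, then take the dot product in 3D
implicit = poly2_kernel(a, b)    # same value without ever building phi
```

Both routes give the same number; the kernel route never constructs the higher-dimensional vectors, which matters when that space is very large (or infinite, as with the Gaussian kernel).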

Example: Data arranged in circles (inner class, outer class) cannot be separated by a line in 2D. By adding a dimension z = x² + y², the inner circle sits at low z and the outer circle sits at high z — now a horizontal plane separates them.
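
The circle example can be reproduced directly. The radii (1 and 3), the number of points, and the threshold are assumptions chosen for illustration.

```python
import math

def ring(radius, n):
    """n points evenly spaced on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

def lift(point):
    """Add the third coordinate z = x^2 + y^2."""
    x, y = point
    return (x, y, x * x + y * y)

inner = [lift(p) for p in ring(1.0, 8)]   # hypothetical inner class, radius 1
outer = [lift(p) for p in ring(3.0, 8)]   # hypothetical outer class, radius 3

# Any horizontal plane z = c with 1 < c < 9 separates the classes; c = 5 is one choice.
threshold = 5.0
inner_below = all(z < threshold for _, _, z in inner)
outer_above = all(z > threshold for _, _, z in outer)
```

In the lifted space the inner points sit near z = 1 and the outer points near z = 9, so a single threshold on z does what no straight line could do in 2D.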

Common kernels: polynomial and Gaussian (radial basis function, RBF)

Supervised vs Unsupervised Learning EXAM HOT

Supervised Learning

Training data includes labels (y) for each example. The algorithm learns to map inputs to correct outputs.

Classification: predicts discrete class labels
Regression: predicts continuous values
Algorithms: logistic regression, decision trees, SVMs, Naive Bayes, linear regression

Unsupervised Learning

Training data has NO labels. The algorithm discovers structure or patterns in the data on its own.

Clustering: finds natural groupings (K-means)
Dimensionality reduction: PCA, t-SNE
Best for exploration — making sense of data no human has labeled

Feature Engineering

Feature engineering = applying domain knowledge to create or transform features to help ML algorithms work better.

Z-scores and normalization
Imputing missing values
Dimension reduction (PCA)
Non-linear transformations: products, ratios, log(x)
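
Three of the transformations listed above can be sketched on a single feature. The income values are hypothetical, mean imputation is just one possible imputation strategy, and the z-score helper assumes the values are not all identical.

```python
import math

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]

def z_scores(values):
    """Rescale values to mean 0 and standard deviation 1 (population std)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

incomes = [20_000, 40_000, None, 60_000]      # hypothetical raw feature with a gap
filled = impute_mean(incomes)                 # imputing missing values
scaled = z_scores(filled)                     # z-score normalization
log_incomes = [math.log(v) for v in filled]   # a non-linear transformation
```

The order matters in this sketch: the missing value has to be filled before the mean and standard deviation can be computed.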

Deep learning avoids most feature engineering by learning representations automatically, but requires much more data and computation.

Deep Learning (Overview)

Large neural networks with many hidden layers. Each layer learns increasingly abstract representations (e.g., pixels → edges → shapes → faces).

Trained via backpropagation (which computes gradients through all layers) combined with stochastic gradient descent
Requires non-linear activation functions (sigmoid, ReLU) at each node — linear activations collapse all layers into one
Generally avoids manual feature engineering — learns features automatically
Non-convex optimization problem → can get stuck in local minima, but usually produces good results in practice
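
The point about linear activations collapsing can be verified on a tiny network. The weights and input below are hypothetical, and biases are left out to keep the sketch short.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    return [max(0.0, t) for t in v]

# Hypothetical weights for a tiny two-layer network.
W1 = [[1.0, -1.0], [0.5, 2.0]]
W2 = [[1.0, 1.0], [-1.0, 3.0]]
x = [1.0, 2.0]

# Linear activations: two layers equal one layer with weights W2 @ W1.
two_linear_layers = matvec(W2, matvec(W1, x))
one_merged_layer = matvec(matmul(W2, W1), x)

# With ReLU in between, the merged single layer no longer reproduces the output.
with_relu = matvec(W2, relu(matvec(W1, x)))
```

The first two results are identical, which is the collapse; the ReLU version differs because the negative hidden value is clipped to zero before the second layer sees it.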

Algorithm Comparison Table

Algorithm	Type	Key strength	Key weakness
Naive Bayes	Probabilistic	Fast, simple, handles many features	Independence assumption often wrong
Decision Tree	Tree-based	Interpretable, handles categorical	Overfits without pruning
Random Forest	Ensemble (bagging)	Robust, reduces variance	Less interpretable than single tree
AdaBoost	Ensemble (boosting)	Very effective, uses weak learners	Overfits noisy data
SVM	Margin-based	Effective in high dimensions, kernel trick	Slow for large datasets
Deep Learning	Neural network	Very powerful, auto feature learning	Needs huge data, slow, "black box"

Lecture 9 Summary — 5 Minute Revision

Naive Bayes: Bayes' theorem + conditional independence assumption.
Decision trees: split by max information gain; H=0 = pure node.
AdaBoost: increase weights of misclassified examples each round. Exam answer = "misclassified by weak classifier in round t" (Answer A).
SVMs: max margin; support vectors = points closest to boundary; kernel trick enables non-linear separation.
Supervised = labeled data (classification/regression).
Unsupervised = no labels (find structure).
Deep learning: backpropagation + non-linear activations.

Practice Questions

Q1. The "naive" assumption in Naive Bayes means:

A. The model is simpler than all other classifiers
B. All features are assumed to be conditionally independent given the class label
C. The model requires no training data to make predictions
D. All classes are assumed to have equal prior probability

Answer: B

Naive Bayes assumes all features are conditionally independent given the class. This means: once you know the class, knowing one feature tells you nothing about other features. In reality this is rarely true (e.g., word frequencies in emails are correlated), but the classifier often performs surprisingly well despite this violation.

Q2. In a decision tree, entropy H = 0 at a node means:

A. The node is maximally impure — all classes equally represented
B. The split provided no information gain
C. All examples at this node belong to the same class (perfectly pure)
D. The tree cannot be split further

Answer: C

Entropy H = 0 means all items in the set belong to one class, so the node is perfectly pure. H(S) = -∑ fᵢ × log₂(fᵢ). If all examples are class A (f_A = 1, f_B = 0), then H = -(1 × log₂(1) + 0 × log₂(0)) = -(0 + 0) = 0, using the convention that 0 × log₂(0) = 0. This is the best possible situation for a leaf node in a decision tree.

Q3. In AdaBoost, a training example's weight is increased when it was:

A. Classified incorrectly by the weak classifier trained in round t
B. Classified correctly by all weak classifiers trained so far
C. A randomly selected example in round t
D. The easiest example in the training set

Answer: A

In AdaBoost, after training the weak classifier in round t, the weights of misclassified examples are INCREASED (and correctly classified examples get lower weight). This forces the next weak classifier to focus more attention on the hard-to-classify examples. Over many rounds, the ensemble learns to handle all examples, including the difficult ones.

Q4. In Support Vector Machines, support vectors are:

A. The data points closest to the decision boundary that define the margin
B. The data points farthest from the decision boundary
C. Randomly selected data points used to train the model
D. All data points used in the training set

Answer: A

Support vectors are the training points that lie on the margin boundary — the data points closest to the decision hyperplane. Only these points determine where the hyperplane is positioned. Points further away have no influence. The SVM objective is to maximize the margin (distance between the two support vector boundaries), which is why only boundary points matter.

Q5. The kernel trick in SVMs allows:

A. Simplifying the model by removing the least important features
B. Implicitly transforming data into a higher-dimensional space to achieve linear separation of non-linearly separable data
C. Reducing the number of support vectors required
D. Applying L2 regularization to all SVM parameters

Answer: B

The kernel trick allows SVMs to handle non-linearly separable data by implicitly mapping data to a higher-dimensional space where a linear hyperplane can separate the classes. The "trick" is that the kernel function computes dot products in this high-dimensional space without explicitly performing the transformation — making it computationally feasible.

Q6. Which of the following correctly describes supervised learning?

A. Finding structure in data without any class labels or target values
B. Training with input examples that each have an associated class label or target value
C. Clustering data points into groups based on similarity
D. Building models with no human annotation whatsoever

Answer: B

Supervised learning requires labeled training data: each input example x_i has an associated label or target y_i. The "supervision" comes from these labels. Tasks: classification (y is a discrete class) and regression (y is a continuous value). Unsupervised learning has no y values — it discovers structure in unlabeled data.

Q7. The information gain of a split is calculated as:

A. The sum of entropies of all child nodes
B. The parent node's entropy MINUS the weighted average entropy of child nodes
C. The number of correct classifications after the split
D. The maximum entropy across all possible splits

Answer: B

Information Gain = H(parent) - ∑(|Sᵢ|/|S|) × H(Sᵢ). It measures how much the split reduces entropy (impurity). The best split is the one with the highest information gain, because it produces the purest child nodes. We always choose the split that gives us the most "information" about the class labels.

Q8. How does AdaBoost avoid the weakness of any single weak classifier?

A. By using only the most accurate weak classifiers
B. By combining many weak classifiers (each >50% accuracy), weighted by their individual accuracy, into a strong ensemble
C. By applying kernel tricks to each weak classifier
D. By training each weak classifier on the complete dataset with no reweighting

Answer: B

AdaBoost takes many weak classifiers (each only slightly better than random guessing, with accuracy just above 50%) and combines them into a strong final classifier. Each weak classifier gets a weight that increases with its accuracy. The final prediction is a weighted vote. The key is the sequential reweighting: each round focuses on the errors of previous rounds.

Q9. Bagging (as used in Random Forests) primarily addresses which problem?

A. Underfitting due to too few features
B. Overfitting (high variance) by training many trees on random subsets of the data and features, then voting
C. Missing data imputation
D. The computational cost of training a single large tree

Answer: B

Bagging addresses high variance (overfitting). A single decision tree is very sensitive to training data: slightly different data can produce a completely different tree. By training many trees on bootstrap samples of the data (and, in random forests, random subsets of features), then voting across them, bagging averages out individual tree variance. Each tree overfits somewhat, but the trees' errors point in different directions and largely cancel out.

Q10. SVM values points differently from logistic regression in that:

A. SVM uses all training points equally; logistic regression ignores distant points
B. Logistic regression values all training points; SVM only values the support vectors (boundary points)
C. SVM requires no training data after kernel transformation
D. Both methods value exactly the same set of points

Answer: B

Logistic regression's cost function sums a probability-based loss over ALL training points, so every point has some influence on the decision boundary. SVM's solution depends only on the support vectors (the points closest to the boundary). Correctly classified points far from the margin have no influence on SVM's decision boundary whatsoever. This makes SVM more robust to outliers that are far from the boundary.

Q11. Why do deep learning neural networks need non-linear activation functions?

A. To make the training process faster
B. Without non-linearity, multiple hidden layers collapse into a single linear transformation — no benefit from depth
C. To prevent the vanishing gradient problem in all cases
D. Non-linear activations allow using smaller datasets

Answer: B

If all nodes use linear activation functions (or no activation), then no matter how many layers you stack, the entire network computes a linear function of the input. Multiple linear layers collapse mathematically into one linear layer. Non-linear activations (sigmoid, ReLU) prevent this collapse and allow the network to learn complex, non-linear patterns — which is the whole point of depth.

Q12. The main disadvantage of AdaBoost is:

A. It can only be used with decision tree weak classifiers
B. It tries to fit every training example, making it susceptible to overfitting noisy data
C. It is much slower than training a single decision tree
D. It cannot handle multi-class classification problems

Answer: B

AdaBoost keeps increasing the weight of misclassified examples and keeps adding classifiers that target them, round after round. If the training data contains noise (mislabeled examples, outliers), AdaBoost will try to fit those noisy examples too, leading to overfitting. Solutions: regularization, limiting the number of boosting rounds, or using methods that handle noise explicitly.

Q13. What is the primary advantage of decision trees over many other ML methods?

A. They always achieve the highest accuracy
B. They are interpretable — you can trace any prediction back through the tree and explain every decision
C. They require no training data at all
D. They never overfit the training data

Answer: B

Decision trees are highly interpretable — you can follow the exact path through the tree that led to any prediction and explain each decision in plain language. This is a critical advantage in domains like medicine, law, and finance where you need to justify predictions. Deep learning models, by contrast, are "black boxes" — highly accurate but difficult to explain.

Q14. Gradient boosted decision trees are particularly notable because:

A. They are the theoretical foundation of all ML algorithms
B. They dominate competitions like Kaggle for structured data problems
C. They were the first ML algorithm ever invented
D. They work exclusively on image data

Answer: B

Gradient boosted decision trees (e.g., XGBoost, LightGBM) are among the most common winning methods in Kaggle machine learning competitions for structured/tabular data. They combine the expressiveness of decision trees with the power of boosting, and handle many real-world datasets very well. For unstructured data (images, text, audio), deep learning dominates.

Q15. The dimension of "interpretability" when comparing ML methods refers to:

A. How many dimensions (features) the algorithm can handle
B. Whether you can explain WHY the model made a specific prediction
C. How easy the algorithm is to implement in code
D. How many hyperparameters need to be tuned

Answer: B

Interpretability refers to whether you can understand and explain the model's decisions. Decision trees: highly interpretable (you can read the tree and explain each branch). Linear/logistic regression: interpretable (largest coefficients identify most influential features). Deep learning: low interpretability ("black box") — highly accurate but you can't easily explain why it made a specific prediction. In high-stakes domains, interpretability is essential.