Naive Bayes, decision trees, ensembles, AdaBoost, SVMs, supervised vs unsupervised, deep learning
| Dimension | What it means | Example |
|---|---|---|
| Power/Expressiveness | How complex a pattern it can learn | Deep learning: very powerful but risks overfitting |
| Interpretability | Can you explain WHY it made a prediction? | Decision trees: explainable. Deep learning: "black box." |
| Ease of use | How many decisions you need to make to use it | Naive Bayes: few parameters, works "out of the box" |
| Training speed | How long it takes to learn from data | Naive Bayes: very fast. Deep learning: very slow |
| Prediction speed | How fast it makes predictions on new data | k-nearest neighbors (KNN): slow (compares each new point to all training points) |
Naive Bayes is based on Bayes' theorem: given an input, compute the probability of each class and predict the most probable one.
The "naive" assumption: all features are conditionally independent given the class. This means knowing one feature tells us nothing about another feature, once we know the class. This is rarely true in reality, but the classifier still often works well in practice.
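Log form (to avoid numerical underflow with very small probabilities): predict the class c that maximizes log P(c) + Σ_i log P(x_i | c), i.e. sum log-probabilities instead of multiplying probabilities.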
Strengths: Fast, simple, works well with many features, handles missing data well. Weakness: Naive independence assumption is often violated.
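The log-form prediction rule can be sketched in a few lines. This is a minimal illustration, not a full classifier: the priors, the per-class word likelihoods, and the names `log_scores` and `predict` are hypothetical values chosen for the example, and there is no smoothing for unseen features.

```python
import math

# Hypothetical toy model (assumed numbers, not from the notes):
# class priors P(c) and per-class feature likelihoods P(x_i | c).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.30, "meeting": 0.05},
    "ham": {"free": 0.05, "meeting": 0.25},
}

def log_scores(features):
    """Return log P(c) + sum_i log P(x_i | c) for every class c."""
    scores = {}
    for c, prior in priors.items():
        scores[c] = math.log(prior) + sum(
            math.log(likelihoods[c][f]) for f in features
        )
    return scores

def predict(features):
    """Pick the class with the highest log score."""
    scores = log_scores(features)
    return max(scores, key=scores.get)

prediction = predict(["free", "free"])

# Why logs: multiplying 500 factors of 0.01 underflows to 0.0 in
# floating point, while the equivalent sum of logs stays finite.
tiny_product = 0.01 ** 500
log_sum = 500 * math.log(0.01)
```

Because the logarithm is monotonic, the class with the largest log score is also the class with the largest probability, so the prediction is unchanged.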
A decision tree is a binary branching structure. Each internal node asks a yes/no question about a feature ("Is age > 9.5?"). Each training example follows a path from root to leaf, and the leaf assigns the class label.
Key disadvantage: Prone to overfitting — if each training example has its own leaf, the tree memorizes training data perfectly but generalizes poorly.
Entropy measures the "impurity" or "messiness" of a node (how mixed the classes are): H(S) = -Σ_i f_i log₂(f_i), where f_i is the fraction of examples in S that belong to class i.
| Entropy value | Meaning | Good for classification? |
|---|---|---|
| H = 0 | All examples belong to ONE class — pure node | Best possible (a leaf) |
| H = log₂(m) | m classes distributed equally — maximum confusion | Worst possible |
Rule: Lower entropy = purer node = better split.
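The two rows of the entropy table can be checked numerically. The sketch below assumes the usual definition H(S) = -Σ f_i log₂(f_i); the function name `entropy` and the toy label lists are illustrative only.

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_i f_i * log2(f_i), with f_i the fraction of class i.

    Classes with zero examples never appear in the Counter, which matches
    the convention that 0 * log2(0) contributes 0.
    """
    total = len(labels)
    h = 0.0
    for count in Counter(labels).values():
        f = count / total
        h -= f * math.log2(f)
    return h

pure = entropy(["A", "A", "A", "A"])       # one class only
even_two = entropy(["A", "A", "B", "B"])   # 2 classes, equal shares
even_four = entropy(["A", "B", "C", "D"])  # 4 classes, equal shares
```

With m equally represented classes the value reaches log₂(m), so `even_two` is 1 bit and `even_four` is 2 bits, while the pure node scores 0.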
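Information Gain measures how much a split reduces entropy: IG = H(parent) - Σ_i (|S_i| / |S|) · H(S_i), where the S_i are the child subsets produced by splitting S.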
Stopping: Stop splitting when information gain falls below a threshold, or prune the tree after building it fully.
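As a rough illustration of how information gain ranks candidate splits, the sketch below compares a split that separates the classes perfectly with one that leaves them fully mixed. The helper names and the yes/no toy labels are assumptions made for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_i f_i * log2(f_i)."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )

def information_gain(parent, children):
    """IG = H(parent) - sum_i (|S_i| / |S|) * H(S_i)."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 4 + ["no"] * 4

# Candidate split 1: each child contains a single class.
perfect_split = [["yes"] * 4, ["no"] * 4]
# Candidate split 2: each child is as mixed as the parent.
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

gain_perfect = information_gain(parent, perfect_split)
gain_useless = information_gain(parent, useless_split)
```

A tree builder would pick the first split, and a threshold-based stopping rule would refuse the second because its gain is zero.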
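A random forest builds many decision trees, each trained on a random subset of features (and often a random subset of training data), and combines their predictions by voting.
Bagging = Bootstrap AGGregatING: train each tree on a bootstrap sample (training examples drawn at random with replacement), then aggregate the trees' predictions. Random forests add the random feature subsets on top of bagging.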
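The two halves of bootstrap aggregating, resampling the training examples with replacement and then voting, can be sketched without any tree code. The function names and the stand-in data are hypothetical; a real random forest would train one tree per sample and also restrict the features each split may use.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) examples uniformly at random WITH replacement."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Aggregate one prediction per model by taking the most common label."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)        # fixed seed so the sketch is repeatable
data = list(range(10))        # stand-in for 10 training examples

# One bootstrap sample per (hypothetical) tree in the ensemble.
samples = [bootstrap_sample(data, rng) for _ in range(3)]

# Pretend three trees predicted these labels for one test point.
vote = majority_vote(["cat", "dog", "cat"])
```

Each sample has the same size as the original data but typically repeats some examples and omits others, which is what makes the individual trees differ.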
AdaBoost combines many weak classifiers (each only slightly better than random guessing, i.e. just over 50% accuracy) into one strong classifier.
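One boosting round can be sketched as follows. The notes do not spell out the update rule, so this assumes the standard discrete AdaBoost formulas with labels in {-1, +1}: classifier weight α = ½ ln((1 − ε)/ε) for weighted error ε, and example weights multiplied by exp(−α · y · h) before renormalizing. The function name `adaboost_round` and the toy labels are illustrative.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One reweighting step of discrete AdaBoost (labels in {-1, +1}).

    Assumes the weighted error is strictly between 0 and 1.
    """
    # Weighted error: total weight of the misclassified examples.
    eps = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    # Vote weight of this weak classifier.
    alpha = 0.5 * math.log((1 - eps) / eps)
    # y * h is +1 when correct (weight shrinks), -1 when wrong (weight grows).
    updated = [
        w * math.exp(-alpha * y * h)
        for w, y, h in zip(weights, y_true, y_pred)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]

y_true = [1, 1, -1, -1]
y_pred = [1, -1, -1, -1]          # the second example is misclassified
weights = [0.25, 0.25, 0.25, 0.25]

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
```

After the update the single misclassified example carries half of the total weight, so the next weak classifier is pushed to get it right.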
SVMs find the maximum margin hyperplane — the decision boundary that is as far as possible from both classes.
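To make "margin" concrete, the sketch below takes a hand-picked separating hyperplane w·x + b = 0 for four toy points and measures it. It assumes the common canonical scaling in which the closest points satisfy y(w·x + b) = 1, giving a margin width of 2/‖w‖; the points, `w`, and `b` are invented for illustration, and no optimization is performed.

```python
import math

# Hand-picked hyperplane x1 = 1, written as w.x + b = 0 (assumed values).
w, b = (1.0, 0.0), -1.0

# Toy training set: (point, label) with labels in {-1, +1}.
points = [
    ((2.0, 0.0), 1),
    ((3.0, 1.0), 1),
    ((0.0, 0.0), -1),
    ((-1.0, 1.0), -1),
]

def functional_margin(x, y):
    """y * (w.x + b): at least 1 for every point under canonical scaling."""
    return y * (w[0] * x[0] + w[1] * x[1] + b)

# Distance between the two margin boundaries.
margin_width = 2 / math.hypot(*w)

# Points that sit exactly on a margin boundary are the support vectors.
support_vectors = [
    x for x, y in points if abs(functional_margin(x, y) - 1.0) < 1e-9
]
```

Only (2, 0) and (0, 0) touch the margin here; moving the other two points further away would leave this hyperplane unchanged.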
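SVM vs Logistic Regression: logistic regression's cost function uses every training point, while the SVM decision boundary depends only on the support vectors (the points closest to the boundary).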
What if the data is not linearly separable? The kernel trick transforms the data into a higher-dimensional space where linear separation BECOMES possible.
Key insight: The kernel function computes the dot product in the high-dimensional space WITHOUT explicitly transforming the data. This is computationally efficient.
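A small check of this claim, using the degree-2 polynomial kernel K(a, b) = (a·b)² as an assumed example: for 2D inputs it equals the dot product of the explicit feature map φ(x) = (x₁², √2·x₁x₂, x₂²), yet it never builds φ. The function names below are illustrative.

```python
import math

def dot(a, b):
    """Ordinary dot product of two equal-length tuples."""
    return sum(p * q for p, q in zip(a, b))

def phi(v):
    """Explicit map from 2D to 3D for the degree-2 polynomial kernel."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(a, b):
    """K(a, b) = (a . b)^2, computed entirely in the original 2D space."""
    return dot(a, b) ** 2

a, b = (1.0, 2.0), (3.0, 4.0)

explicit = dot(phi(a), phi(b))   # transform first, then dot product in 3D
implicit = poly_kernel(a, b)     # kernel shortcut, no transformation
```

Both routes give 121 for these inputs (up to floating-point rounding). For higher degrees or the Gaussian kernel the explicit space becomes very large or infinite-dimensional, which is where the shortcut matters.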
Example: Data arranged in circles (inner class, outer class) cannot be separated by a line in 2D. By adding a dimension z = x² + y², the inner circle sits at low z and the outer circle sits at high z — now a horizontal plane separates them.
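The circles example can be reproduced with a few generated points. The radii (1 and 3), the number of points, and the threshold z = 5 are assumptions picked for this sketch.

```python
import math

def ring(radius, n):
    """n points evenly spaced on a circle of the given radius."""
    return [
        (radius * math.cos(2 * math.pi * k / n),
         radius * math.sin(2 * math.pi * k / n))
        for k in range(n)
    ]

inner = ring(1.0, 8)   # inner class
outer = ring(3.0, 8)   # outer class

def lift(p):
    """The added dimension z = x^2 + y^2."""
    x, y = p
    return x * x + y * y

inner_z = [lift(p) for p in inner]   # all close to 1
outer_z = [lift(p) for p in outer]   # all close to 9

# Any horizontal plane between the two z levels separates the classes.
threshold = 5.0
separable = max(inner_z) < threshold < min(outer_z)
```

In the original (x, y) plane no straight line can do this, because the outer ring surrounds the inner one.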
Common kernels: polynomial, Gaussian (also called the radial basis function, RBF)
Supervised learning: training data includes labels (y) for each example. The algorithm learns to map inputs to correct outputs.
Unsupervised learning: training data has NO labels. The algorithm discovers structure or patterns in the data on its own.
Feature engineering = applying domain knowledge to create or transform features to help ML algorithms work better.
Deep learning avoids most feature engineering by learning representations automatically, but requires much more data and computation.
Deep learning uses large neural networks with many hidden layers. Each layer learns increasingly abstract representations (e.g., pixels → edges → shapes → faces).
| Algorithm | Type | Key strength | Key weakness |
|---|---|---|---|
| Naive Bayes | Probabilistic | Fast, simple, handles many features | Independence assumption often wrong |
| Decision Tree | Tree-based | Interpretable, handles categorical | Overfits without pruning |
| Random Forest | Ensemble (bagging) | Robust, reduces variance | Less interpretable than single tree |
| AdaBoost | Ensemble (boosting) | Very effective, uses weak learners | Overfits noisy data |
| SVM | Margin-based | Effective in high dimensions, kernel trick | Slow for large datasets |
| Deep Learning | Neural network | Very powerful, auto feature learning | Needs huge data, slow, "black box" |
Naive Bayes: Bayes' theorem + conditional independence assumption.
Decision trees: split by max information gain; H = 0 means a pure node.
AdaBoost: increase weights of misclassified examples each round (exam answer: "misclassified by weak classifier in round t", Answer A).
SVMs: max margin; support vectors = points closest to boundary; kernel trick enables non-linear separation.
Supervised = labeled data (classification/regression). Unsupervised = no labels (find structure).
Deep learning: backpropagation + non-linear activations.
Q1. The "naive" assumption in Naive Bayes means:
Answer: B
Naive Bayes assumes all features are conditionally independent given the class. This means: once you know the class, knowing one feature tells you nothing about other features. In reality this is rarely true (e.g., word frequencies in emails are correlated), but the classifier often performs surprisingly well despite this violation.
Q2. In a decision tree, entropy H = 0 at a node means:
Answer: C
Entropy H = 0 means all items in the set belong to one class, so the node is perfectly pure. H(S) = -sum(f_i * log2(f_i)). If all examples are class A (f_A = 1, f_B = 0), then H = -(1*log2(1) + 0*log2(0)) = -(0 + 0) = 0, using the convention that 0*log2(0) = 0. This is the best possible situation for a leaf node in a decision tree.
Q3. In AdaBoost, a training example's weight is increased when it was:
Answer: A
In AdaBoost, after training the weak classifier in round t, the weights of misclassified examples are INCREASED (and correctly classified examples get lower weight). This forces the next weak classifier to focus more attention on the hard-to-classify examples. Over many rounds, the ensemble learns to handle all examples, including the difficult ones.
Q4. In Support Vector Machines, support vectors are:
Answer: A
Support vectors are the training points that lie on the margin boundary — the data points closest to the decision hyperplane. Only these points determine where the hyperplane is positioned. Points further away have no influence. The SVM objective is to maximize the margin (distance between the two support vector boundaries), which is why only boundary points matter.
Q5. The kernel trick in SVMs allows:
Answer: B
The kernel trick allows SVMs to handle non-linearly separable data by implicitly mapping data to a higher-dimensional space where a linear hyperplane can separate the classes. The "trick" is that the kernel function computes dot products in this high-dimensional space without explicitly performing the transformation — making it computationally feasible.
Q6. Which of the following correctly describes supervised learning?
Answer: B
Supervised learning requires labeled training data: each input example x_i has an associated label or target y_i. The "supervision" comes from these labels. Tasks: classification (y is a discrete class) and regression (y is a continuous value). Unsupervised learning has no y values — it discovers structure in unlabeled data.
Q7. The information gain of a split is calculated as:
Answer: B
Information Gain = H(parent) - sum(|S_i|/|S| * H(S_i)). It measures how much the split reduces entropy (impurity). The best split is the one with the highest information gain — it produces the purest child nodes. We always choose the split that gives us the most "information" about the class labels.
Q8. How does AdaBoost avoid the weakness of any single weak classifier?
Answer: B
AdaBoost takes many weak classifiers (each only slightly better than random guessing, with accuracy just above 50%) and combines them into a strong final classifier. Each weak classifier gets a vote weight that grows with its accuracy; in standard AdaBoost this is α_t = ½ ln((1 − ε_t)/ε_t), where ε_t is the classifier's weighted error. The final prediction is a weighted vote. The key is the sequential reweighting: each round focuses on the errors of previous rounds.
Q9. Bagging (as used in Random Forests) primarily addresses which problem?
Answer: B
Bagging addresses high variance (overfitting). A single decision tree is very sensitive to its training data: slightly different data can produce a completely different tree. By training many trees on bootstrap samples of the data (Random Forests also give each tree random subsets of features) and then voting across them, bagging averages out individual tree variance. Each tree overfits somewhat, but because the trees err in different directions, their errors tend to cancel out.
Q10. SVM values points differently from logistic regression in that:
Answer: B
Logistic regression uses the predicted probability of ALL training points in computing the cost function and thus the decision boundary. SVM's solution depends only on the support vectors (the points closest to the boundary). Correctly classified points far from the margin have no influence on SVM's decision boundary whatsoever. This makes SVM more robust to outliers that lie far from the boundary on the correct side.
Q11. Why do deep learning neural networks need non-linear activation functions?
Answer: B
If all nodes use linear activation functions (or no activation), then no matter how many layers you stack, the entire network computes a linear function of the input. Multiple linear layers collapse mathematically into one linear layer. Non-linear activations (sigmoid, ReLU) prevent this collapse and allow the network to learn complex, non-linear patterns — which is the whole point of depth.
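The collapse of stacked linear layers can be verified directly. The sketch below uses two small weight matrices with made-up values and no bias terms: applying W1 then W2 gives the same output as the single matrix W2·W1, while inserting a ReLU between the layers breaks that equivalence.

```python
def matmul(a, b):
    """Matrix product of two lists-of-rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b)))
         for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def matvec(m, v):
    """Matrix-vector product."""
    return [sum(r * x for r, x in zip(row, v)) for row in m]

def relu(v):
    """Element-wise ReLU: max(0, t)."""
    return [max(0.0, t) for t in v]

# Hypothetical weights for a 2-layer network (values chosen arbitrarily).
W1 = [[1.0, 2.0], [0.0, -1.0]]
W2 = [[3.0, 1.0], [-2.0, 0.5]]
x = [1.0, -2.0]

two_layers = matvec(W2, matvec(W1, x))        # linear layer after linear layer
collapsed = matvec(matmul(W2, W1), x)         # one equivalent linear layer
with_relu = matvec(W2, relu(matvec(W1, x)))   # non-linearity in between
```

Here `two_layers` and `collapsed` both come out as [-7.0, 7.0], whereas `with_relu` is [2.0, 1.0], an output the single collapsed matrix does not produce for this input.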
Q12. The main disadvantage of AdaBoost is:
Answer: B
AdaBoost keeps increasing the weight of misclassified examples and can keep adding classifiers until the training set is fit almost perfectly. If the training data contains noise (mislabeled examples, outliers), AdaBoost will try to fit those noisy examples too, leading to overfitting. Solutions: regularization, limiting the number of boosting rounds, or using methods that handle noise explicitly.
Q13. What is the primary advantage of decision trees over many other ML methods?
Answer: B
Decision trees are highly interpretable — you can follow the exact path through the tree that led to any prediction and explain each decision in plain language. This is a critical advantage in domains like medicine, law, and finance where you need to justify predictions. Deep learning models, by contrast, are "black boxes" — highly accurate but difficult to explain.
Q14. Gradient boosted decision trees are particularly notable because:
Answer: B
Gradient boosted decision trees (e.g., XGBoost, LightGBM) are the most common winning method in Kaggle machine learning competitions for structured/tabular data. They combine the expressiveness of decision trees with the power of boosting, and handle many real-world datasets very well. For unstructured data (images, text, audio), deep learning dominates.
Q15. The dimension of "interpretability" when comparing ML methods refers to:
Answer: B
Interpretability refers to whether you can understand and explain the model's decisions. Decision trees: highly interpretable (you can read the tree and explain each branch). Linear/logistic regression: interpretable (largest coefficients identify most influential features). Deep learning: low interpretability ("black box") — highly accurate but you can't easily explain why it made a specific prediction. In high-stakes domains, interpretability is essential.