Distance metrics, KNN classification, K-means clustering, hierarchical clustering, unsupervised learning
A distance measure is formally called a metric only if it satisfies all four properties below. A measure that violates any one of them (a similarity score, for example) is not a distance metric.

| Property | Rule | Plain English |
|---|---|---|
| Positivity | d(x,y) ≥ 0 | Distance is never negative |
| Identity | d(x,y) = 0 iff x = y | Zero distance means same point |
| Symmetry | d(x,y) = d(y,x) | Distance from A to B equals B to A |
| Triangle inequality | d(x,y) ≤ d(x,z) + d(z,y) | Direct path is never longer than detour through z |
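
The four properties can be spot-checked numerically. The sketch below uses a hypothetical helper, `check_metric` (not from the source), that tests a candidate distance function on a small sample of points. Passing the check does not prove a function is a metric, but any failure is a genuine counterexample. Squared Euclidean distance is used here only as an illustrative non-metric.

```python
import itertools
import math

def check_metric(dist, points, tol=1e-9):
    """Return the names of metric properties that fail on the sample points."""
    failed = set()
    for x, y in itertools.product(points, repeat=2):
        d = dist(x, y)
        if d < -tol:
            failed.add("positivity")
        if (abs(d) <= tol) != (x == y):        # d == 0 must hold exactly when x == y
            failed.add("identity")
        if abs(d - dist(y, x)) > tol:
            failed.add("symmetry")
    for x, y, z in itertools.product(points, repeat=3):
        if dist(x, y) > dist(x, z) + dist(z, y) + tol:
            failed.add("triangle inequality")
    return failed

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

sample = [(0, 0), (1, 0), (2, 0), (3, 4)]
print(check_metric(euclidean, sample))          # set(): no failures found
print(check_metric(squared_euclidean, sample))  # {'triangle inequality'}
```

The squared version fails because going from (0,0) to (2,0) directly costs 4, while the detour through (1,0) costs only 1 + 1 = 2.
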

Measures that fail to be metrics, and the property each one violates:

| Measure | Range | Which property fails |
|---|---|---|
| Correlation coefficient | -1 to 1 | Positivity (can be negative) and Identity (0 means uncorrelated, not identical; identical variables score 1) |
| Cosine similarity | -1 to 1 | Positivity (can be negative) and Identity (0 means orthogonal, not identical) |
| Directed travel time | ≥ 0 | Symmetry (one-way streets: d(A,B) ≠ d(B,A)) |
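
A small illustration of why cosine similarity is not a metric. The function name `cosine_similarity` and the sample vectors are assumptions chosen for this sketch.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))

print(cosine_similarity((1, 0), (-1, 0)))  # -1.0: negative, so positivity fails
print(cosine_similarity((1, 0), (0, 1)))   # 0.0: zero, yet the points differ
print(cosine_similarity((1, 0), (1, 0)))   # 1.0: identical points do not score 0
```
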
The L_k (Minkowski) family of distance metrics is controlled by a single parameter k, with general form d = (∑|xᵢ - yᵢ|ᵏ)^(1/k):

| k value | Name | Formula | Intuition |
|---|---|---|---|
| k = 1 | Manhattan (L1) | d = ∑|xᵢ - yᵢ| | "Taxi cab" distance — travel only on a grid (horizontal + vertical). Sum of absolute differences. |
| k = 2 | Euclidean (L2) | d = √(∑(xᵢ-yᵢ)²) | Straight-line distance. Most popular. Weighs all dimensions equally. |
| k = ∞ | Chebyshev | d = max|xᵢ - yᵢ| | Only the largest dimensional difference matters |
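
The three rows above can be computed with one function. This is a minimal sketch; the name `minkowski` and the use of `math.inf` to select the Chebyshev case are choices made for this example.

```python
import math

def minkowski(x, y, k):
    """L_k distance between equal-length sequences x and y (k >= 1 or math.inf)."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(k):
        return max(diffs)                     # Chebyshev: largest single difference
    return sum(d ** k for d in diffs) ** (1 / k)

p, q = (1, 2), (4, 6)
print(minkowski(p, q, 1))         # Manhattan: 3 + 4 = 7
print(minkowski(p, q, 2))         # Euclidean: sqrt(9 + 16) = 5
print(minkowski(p, q, math.inf))  # Chebyshev: max(3, 4) = 4
```
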

K-nearest neighbors (KNN) is one of the simplest classification algorithms: find the K most similar (nearest) training examples to a new input, then predict by majority vote.
For regression, replace the vote with the average (or distance-weighted average) of the K neighbors' target values.

| K value | Decision boundary | Risk |
|---|---|---|
| K = 1 | Very complex, jagged — follows every individual training example | Overfitting — sensitive to noise |
| Larger K | Smoother — averages over many neighbors | More robust, but may oversmooth (underfit) |

Disadvantage: prediction is slow. A brute-force query computes the distance to ALL n training points, costing O(nd) per query, where n = training size and d = number of dimensions. Index structures such as kd-trees, grid indices, and locality-sensitive hashing reduce the number of points that must be examined.
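
A brute-force sketch of both KNN variants, assuming Euclidean distance and unweighted neighbors. The function names and toy data are hypothetical. Sorting adds O(n log n) on top of the O(nd) distance computations.

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """Return the k (features, target) pairs closest to query."""
    # Brute force: one distance per training point.
    return sorted(train, key=lambda item: math.dist(item[0], query))[:k]

def knn_classify(train, query, k):
    """Majority vote among the k nearest labels (ties go to the label seen first)."""
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k):
    """Plain average of the k nearest target values."""
    neighbors = k_nearest(train, query, k)
    return sum(target for _, target in neighbors) / len(neighbors)

labeled = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
           ((6, 6), "b"), ((6, 7), "b"), ((7, 6), "b")]
print(knn_classify(labeled, (2, 2), k=3))   # a

valued = [((1,), 10.0), ((2,), 12.0), ((3,), 14.0), ((10,), 50.0)]
print(knn_regress(valued, (2,), k=3))       # 12.0
```
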
Clustering = grouping data points by similarity, WITHOUT any labels. This is unsupervised — the algorithm discovers natural groupings on its own.
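K-means is among the most widely used clustering algorithms. It partitions data into K groups by iteratively assigning each point to the nearest centroid and then moving each centroid to the mean of its assigned points, stopping when the assignments no longer change.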
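
The assign-and-update loop can be sketched as follows. This version assumes Euclidean distance and picks the initial centroids by sampling K data points; both are choices made for this example, not requirements stated in the source.

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Iterative K-means. Returns (centroids, labels, mse)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # initial guesses: k sampled data points
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:             # nothing changed: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                      # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    # Within-cluster mean squared distance from each point to its centroid.
    mse = sum(math.dist(p, centroids[lab]) ** 2
              for p, lab in zip(points, labels)) / len(points)
    return centroids, labels, mse

data = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 8.5), (8.5, 9)]
centroids, labels, mse = kmeans(data, k=2)
print(labels)   # the first three points should share one label, the last three the other
print(mse)
```

Calling `kmeans` for several values of `k` and comparing the returned `mse` gives the numbers needed to choose K. On harder data, changing `seed` can change the result.
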
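The right value of K is usually unknown in advance. The elbow method plots the within-cluster MSE (mean squared error from centroid) against K and picks the K at the bend of the curve, where adding more clusters stops producing a large drop in MSE.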
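Hierarchical agglomerative clustering is a bottom-up approach that does NOT require specifying K in advance: start with every point as its own cluster, then repeatedly merge the two nearest clusters until only one remains.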
Result: a dendrogram — a tree diagram showing the order and distance of merges. You can cut the dendrogram at any height to get any number of clusters.
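Single-link clustering (merge the two clusters whose closest individual points are nearest) is closely tied to the minimum spanning tree: its merges follow the MST edges in order of increasing length.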
Advantage over K-means: No need to specify K; produces a hierarchy of clusters at all scales.
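
A naive single-link sketch of the bottom-up procedure, assuming Euclidean distance. The function name `single_link` and the `num_clusters` stopping parameter are hypothetical. Stopping early at `num_clusters` corresponds to cutting the dendrogram, and the returned merge list holds the order and distance information a dendrogram would draw. This version rescans all pairs on every merge, so it only suits small inputs.

```python
import math

def single_link(points, num_clusters=1):
    """Bottom-up single-link clustering.

    Returns (clusters, merges), where merges lists
    (distance, cluster_a, cluster_b) in merge order.
    """
    clusters = [[p] for p in points]          # start: every point is its own cluster
    merges = []
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: distance between the closest pair of members.
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((d, list(clusters[i]), list(clusters[j])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]                       # j > i, so index i is unaffected
    return clusters, merges

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merges = single_link(pts, num_clusters=3)
print(clusters)  # [[(0, 0), (0, 1)], [(5, 5), (5, 6)], [(10, 0)]]
```
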
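K-means is a special case of the broader expectation-maximization (EM) algorithm: assigning points to the nearest centroid plays the role of the E-step, and recomputing the centroids plays the role of the M-step.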
EM is also used for semi-supervised learning: amplify a small amount of labeled data using many unlabeled examples.
4 metric properties: positivity, identity, symmetry, triangle inequality. Correlation fails positivity and identity. L1 = Manhattan (sum of absolute differences). L2 = Euclidean (straight-line, most popular). Always normalize before computing distance. KNN: supervised; K neighbors vote; larger K = smoother boundary. K-means: unsupervised; all 3 steps required (K value, initial centroids, distance metric). K-means limitation: only round clusters. Use elbow method for K. Agglomerative: bottom-up merging, produces dendrogram. Unsupervised = no labels, find structure.
Q1. Which of the following is NOT a required property of a distance metric?
Answer: B
The four required properties of a distance metric are: Positivity (d ≥ 0), Identity (d=0 iff x=y), Symmetry (d(x,y)=d(y,x)), and Triangle Inequality (d(x,y) ≤ d(x,z)+d(z,y)). "Linearity" is not a required property of distance metrics.
Q2. The correlation coefficient is NOT a distance metric because:
Answer: B
The correlation coefficient ranges from -1 to +1. Negative values violate positivity (a metric must be ≥ 0). Also, a correlation of 0 doesn't mean the two variables are "the same point," violating the identity property. Despite not being a metric, correlation is still a useful similarity measure.
Q3. The Manhattan distance (L1) between points (1,2) and (4,6) is:
Answer: B
Manhattan (L1) distance = sum of absolute differences across all dimensions. |4-1| + |6-2| = 3 + 4 = 7. Think of it as traveling on a grid (like Manhattan streets) — you must go 3 blocks right and 4 blocks up, total 7 blocks. Euclidean (L2) would be sqrt(3^2 + 4^2) = sqrt(9+16) = sqrt(25) = 5 (straight-line diagonal).
Q4. Why should features be normalized before computing distances?
Answer: B
Without normalization, a feature with values 0-1,000,000 (like income) will dominate the distance calculation over a feature with values 0-1 (like a percentage). The income difference between any two people would swamp all other feature differences. Normalizing to z-scores puts all features on the same scale, giving each dimension comparable influence.
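
A small sketch of the effect, using made-up data (income in dollars and a fraction between 0 and 1). The helper `zscore_columns` is hypothetical and assumes no column is constant, since a constant column has a standard deviation of zero.

```python
import math
import statistics

def zscore_columns(rows):
    """Rescale each column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds))
            for row in rows]

# Hypothetical people: (income, fraction)
people = [(30_000, 0.10), (31_000, 0.90), (90_000, 0.12)]
scaled = zscore_columns(people)

# Raw distances: income dominates, so person 1 looks far closer to person 0.
print(math.dist(people[0], people[1]), math.dist(people[0], people[2]))
# Scaled distances: the large gap in the fraction now counts as well.
print(math.dist(scaled[0], scaled[1]), math.dist(scaled[0], scaled[2]))
```
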
Q5. In KNN classification, what is the effect of increasing K (the number of neighbors)?
Answer: B
K=1: the boundary follows each individual training example exactly — very complex, jagged, prone to overfitting (sensitive to noise). Larger K: prediction is based on a vote of many neighbors, smoothing out noise. The boundary becomes smoother and more generalizable. Too large a K may oversmooth, but the main benefit is reduced sensitivity to individual noisy training examples.
Q6. Which step is NOT required by K-means clustering?
Answer: D
ALL three steps are required for K-means: (1) You must determine K (how many clusters to find). (2) You must make initial guesses for the cluster centroids (starting positions). (3) You must determine the distance metric to measure how similar points are to centroids. Without any one of these, the algorithm cannot run.
Q7. K-means clustering is a type of:
Answer: C
K-means is unsupervised learning — it groups data points by similarity WITHOUT any class labels or target values. The algorithm discovers natural groupings in the data on its own. This is the defining characteristic of unsupervised learning: finding structure in unlabeled data.
Q8. The main limitation of K-means clustering is:
Answer: B
K-means minimizes the sum of squared distances from points to their centroid, which naturally produces round/spherical clusters. It fails on: elongated clusters (banana-shaped), nested clusters (rings inside rings), and clusters of very different densities. Additionally, K-means may converge to local optima — different initial centroid guesses can produce different results.
Q9. The elbow method is used in K-means to:
Answer: B
The elbow method plots within-cluster MSE (how far points are from their centroid) against K. MSE decreases as K increases (more clusters = closer centroids). Once K exceeds the true number of natural clusters, additional clusters give little improvement — the MSE curve flattens out. The "elbow" (bend point) in the curve suggests the appropriate K.
Q10. An unsupervised learning problem is one where:
Answer: B
Unsupervised learning tries to find structure in data without any predefined labels or target values. Clustering is the primary unsupervised task — grouping observations based on similarity. Options A and C describe supervised learning (predicting a labeled target from features). Option D describes data wrangling, not learning.
Q11. In hierarchical agglomerative clustering, the algorithm:
Answer: B
Hierarchical agglomerative clustering is bottom-up: it starts with every data point as its own cluster (n clusters) and repeatedly merges the two nearest clusters until one cluster remains. The result is a dendrogram. Unlike K-means, you do NOT need to specify K in advance — you can cut the dendrogram at any level to get any number of clusters.
Q12. A customer reviews dataset needs to be categorized into "Positive," "Neutral," or "Negative" based on predefined sentiment rules. The appropriate approach is:
Answer: A
When the categories are PREDEFINED (Positive, Neutral, Negative) and known in advance, this is a classification problem. You train a model with labeled examples and predict the predefined category for new reviews. Clustering would be used if you wanted to DISCOVER natural groupings without knowing the categories beforehand.
Q13. The key difference between KNN classification and K-means clustering is:
Answer: B
This is the fundamental distinction: KNN is supervised classification — it requires labeled training data and K refers to the number of neighbors to vote for a class label. K-means is unsupervised clustering — it requires no labels and K refers to the number of clusters to find. Both use distance, but for completely different purposes.
Q14. Why does K-means potentially converge to different solutions on different runs?
Answer: C
K-means is sensitive to initialization. The starting positions of centroids (usually random) affect which local optimum the algorithm converges to. Different random starts can produce very different final clusterings. Solution: run K-means multiple times with different random initializations and select the run with the lowest total within-cluster MSE (best solution found).
Q15. The Euclidean distance (L2) between points (0,0) and (3,4) is:
Answer: B
Euclidean distance (L2) = sqrt(sum of squared differences) = sqrt((3-0)^2 + (4-0)^2) = sqrt(9+16) = sqrt(25) = 5. This is the classic 3-4-5 right triangle. Manhattan (L1) would be 3+4=7. Euclidean gives the straight-line ("as the crow flies") distance between the points.