L10: Distance & Clustering

Distance is the foundation of both classification (KNN) and clustering (K-means). Understanding how to measure similarity between data points — and knowing which measure to use — is essential. This lecture also covers clustering as the primary unsupervised learning technique.

Distance Metric Properties

A distance measure is formally called a metric only if it satisfies all four properties below. If any property fails, the measure is not a metric, although it may still be useful as a similarity measure.

Property	Rule	Plain English
Positivity	d(x,y) ≥ 0	Distance is never negative
Identity	d(x,y) = 0 iff x = y	Zero distance means same point
Symmetry	d(x,y) = d(y,x)	Distance from A to B equals B to A
Triangle inequality	d(x,y) ≤ d(x,z) + d(z,y)	Direct path is never longer than detour through z

What fails to be a metric:

Measure	Range	Which property fails
Correlation coefficient	-1 to 1	Positivity (can be negative) and Identity (d=0 doesn't always mean same point)
Cosine similarity	-1 to 1	Positivity and Identity
Directed travel time	≥ 0	Symmetry (one-way streets: travel time A→B ≠ B→A)
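The four properties can be spot-checked numerically on a handful of points. The sketch below is illustrative only: `violated_properties` and `cosine_similarity` are hypothetical helper names, and passing the check on a small sample does not prove that a measure is a metric, it can only expose violations.

```python
import math
from itertools import product

def cosine_similarity(x, y):
    """Cosine of the angle between x and y (a similarity, range -1 to 1)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))

def violated_properties(d, points, tol=1e-9):
    """Return the metric properties that d breaks on this sample of points."""
    failed = set()
    for x, y in product(points, repeat=2):
        if d(x, y) < -tol:
            failed.add("positivity")
        # Identity: d is zero exactly when the two points are the same.
        if (abs(d(x, y)) <= tol) != (x == y):
            failed.add("identity")
        if abs(d(x, y) - d(y, x)) > tol:
            failed.add("symmetry")
    for x, y, z in product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:
            failed.add("triangle inequality")
    return failed

sample = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (2.0, 0.0)]
print(violated_properties(math.dist, sample))          # Euclidean distance
print(violated_properties(cosine_similarity, sample))  # cosine similarity
```

On this sample, Euclidean distance shows no violations, while cosine similarity breaks positivity and identity (among others) but stays symmetric.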

Lk Distance Metrics EXAM HOT

A family of distance metrics controlled by a single parameter k:

d(x,y) = ( ∑ |xᵢ - yᵢ|^k )^(1/k)

The sum runs over all dimensions i of the data. x and y are two data points.

k value	Name	Formula	Intuition
k = 1	Manhattan (L1)	d = ∑|xᵢ - yᵢ|	"Taxi cab" distance: travel only on a grid (horizontal + vertical). Sum of absolute differences.
k = 2	Euclidean (L2)	d = √(∑(xᵢ-yᵢ)²)	Straight-line distance. Most popular. Weighs all dimensions equally.
k = ∞	Chebyshev	d = max|xᵢ - yᵢ|	Only the largest dimensional difference matters
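The whole family fits in one small function. This is a minimal sketch, assuming points are given as equal-length tuples; `lk_distance` is a hypothetical name, and `math.inf` stands in for k = ∞.

```python
import math

def lk_distance(x, y, k):
    """Lk distance between two points; k = math.inf gives Chebyshev."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(k):
        return max(diffs)                      # only the largest difference counts
    return sum(diff ** k for diff in diffs) ** (1.0 / k)

x, y = (1, 2), (4, 6)                          # differences are 3 and 4
print(lk_distance(x, y, 1))                    # Manhattan: 3 + 4 = 7
print(lk_distance(x, y, 2))                    # Euclidean: sqrt(9 + 16) = 5
print(lk_distance(x, y, math.inf))             # Chebyshev: max(3, 4) = 4
```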

Always normalize features to z-scores before computing distances. If one feature has range 0-1 and another 0-1,000,000, the large-scale feature will dominate the distance calculation regardless of actual relevance.
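A z-score subtracts each feature's mean and divides by its standard deviation. The sketch below assumes the data is a list of rows with one numeric value per feature; `zscore_columns` is a hypothetical helper, and the sample rows (a 0-1 fraction next to an income-like value) are made up for illustration.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Rescale every feature (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant feature has standard deviation 0; divide by 1 instead.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]

# Illustrative data: (fraction between 0 and 1, income-like value)
rows = [(0.10, 20_000), (0.90, 21_000), (0.15, 95_000), (0.80, 60_000)]
scaled = zscore_columns(rows)
for row in scaled:
    print(row)
```

After scaling, both columns vary over a comparable range, so neither one dominates a distance computed on `scaled`.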

K-Nearest Neighbors (KNN) Classification

KNN is one of the simplest classification algorithms: find the K most similar (nearest) training examples to a new input, then predict by majority vote.

Algorithm:

For each training example, compute the distance to the new input point

Find the K training examples with the smallest distances (the K nearest neighbors)

Predict by majority vote among those K neighbors
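The three steps above can be sketched in a few lines. This assumes Euclidean distance and features that are already normalized; `knn_classify` and the toy data are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k):
    # Step 1: distance from every training example to the query point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 2: keep the k examples with the smallest distances
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote (on a tied vote, the label seen first among
    # the nearest neighbors wins)
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["a", "a", "a", "b", "b", "b"]
print(knn_classify(train_points, train_labels, (2, 2), k=3))  # a
print(knn_classify(train_points, train_labels, (6, 5), k=3))  # b
```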

For regression:

Instead of voting, assign the average (or weighted average) of the K neighbors' target values.
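A regression variant only changes the last step. The sketch below is one possible reading: the notes do not say how the weighted average is weighted, so inverse-distance weights are an assumption here, and `knn_regress` is a hypothetical name.

```python
import math

def knn_regress(train_points, train_targets, query, k, weighted=False):
    """Predict a number as the (optionally weighted) mean of the k nearest targets."""
    pairs = [
        (math.dist(point, query), target)
        for point, target in zip(train_points, train_targets)
    ]
    nearest = sorted(pairs, key=lambda pair: pair[0])[:k]
    if not weighted:
        return sum(target for _, target in nearest) / len(nearest)
    # Assumption: weight each neighbor by 1 / distance, so closer neighbors
    # count more. The small constant avoids division by zero.
    weights = [1.0 / (dist + 1e-9) for dist, _ in nearest]
    total = sum(w * target for w, (_, target) in zip(weights, nearest))
    return total / sum(weights)

xs = [(1,), (2,), (3,), (10,)]
ys = [10, 20, 30, 100]
print(knn_regress(xs, ys, (2.25,), k=2))                 # plain mean of 20 and 30
print(knn_regress(xs, ys, (2.25,), k=2, weighted=True))  # pulled toward 20
```

With the query at 2.25, the neighbor at 2 is three times closer than the neighbor at 3, so the weighted prediction lands near 22.5 instead of 25.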

Effect of K:

K value	Decision boundary	Risk
K = 1	Very complex, jagged — follows every individual training example	Overfitting — sensitive to noise
Larger K	Smoother — averages over many neighbors	More robust, but may oversmooth (underfit)
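A single mislabeled training example is enough to show the difference. In this made-up data set, one "b" point sits inside the "a" group; the helper `knn_predict` is a hypothetical minimal KNN using Euclidean distance.

```python
import math
from collections import Counter

def knn_predict(points, labels, query, k):
    """Majority vote among the k training points nearest to query."""
    ranked = sorted(zip(points, labels), key=lambda pl: math.dist(pl[0], query))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]   # (0.5, 0.5) is the noisy point

query = (0.6, 0.6)
print(knn_predict(points, labels, query, k=1))  # b: follows the single noisy neighbor
print(knn_predict(points, labels, query, k=5))  # a: four "a" votes outweigh one "b"
```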

Key advantages of KNN:

Simplicity: no training needed — just store all training examples
Interpretability: you can see WHICH training examples influenced the prediction
Non-linearity: naturally creates non-linear decision boundaries

Disadvantage: Prediction is slow. A brute-force search must compute the distance to ALL training points for each new query, which costs O(nd) per query, where n = training size, d = dimensions. Solutions: kd-trees, grid indices, locality-sensitive hashing.

Clustering (Unsupervised Learning)

Clustering = grouping data points by similarity, WITHOUT any labels. This is unsupervised — the algorithm discovers natural groupings on its own.

Why cluster?

Hypothesis development: how many distinct populations exist in this data?
Data reduction: represent each cluster by its centroid instead of all points
Outlier detection: points far from any cluster center are suspicious
Modeling over groups: build separate models for each cluster (personalization)

K-Means Clustering EXAM HOT

The most widely used clustering algorithm. Partitions data into K groups by iteratively assigning points to the nearest centroid and updating centroids.

Algorithm — ALL steps are required:

1. Determine the value of K (number of clusters you want)

2. Make an initial guess of K cluster centroids (usually random starting positions)

3. Determine the distance metric to use (e.g., Euclidean)

4. Assign each data point to its nearest centroid

5. Recalculate each centroid as the mean of all points assigned to it

6. Repeat steps 4-5 until no point changes its cluster assignment (convergence)
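The six steps map onto a short loop. This is a minimal sketch, not a production implementation: it assumes Euclidean distance, points given as a list of tuples, and a hypothetical `kmeans` signature in which the initial centroids can be passed in or drawn at random from the data.

```python
import math
import random

def kmeans(points, k, init=None, seed=0, max_iter=100):
    # Steps 1-2: K is given; the initial guess is either supplied or
    # k random data points.
    if init is not None:
        centroids = list(init)
    else:
        centroids = random.Random(seed).sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Steps 3-4: assign every point to its nearest centroid (Euclidean)
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 6: stop when no point changed its cluster
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 5: move each centroid to the mean of its assigned points
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, k=2, init=[(0, 0), (0, 1)])
print(assignment)   # [0, 0, 0, 1, 1, 1]
print(centroids)
```

Even though both starting centroids sit in the lower-left group, the assign/update loop separates the two groups after a couple of iterations on this well-separated toy data.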

Exam answer: "Which step is NOT required by K-means?" → None of the above — ALL steps are required (Answer D)

Choosing K — The Elbow Method

The right value of K is usually unknown in advance. The elbow method plots the within-cluster MSE (mean squared error from centroid) against K:

As K increases, the best achievable MSE keeps decreasing (more clusters = points closer to their centroid)
Once K exceeds the true number of natural clusters, adding more clusters gives little improvement
Look for the "elbow" — the point where the curve bends and the rate of improvement slows dramatically
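The curve behind the elbow plot can be computed by running K-means for several values of K and recording the within-cluster MSE. The sketch below is self-contained and rests on assumptions: Euclidean distance, random data points as starting centroids, and the best of a few restarts per K; `kmeans_mse` and `elbow_curve` are hypothetical names, and the data is a made-up set of three groups.

```python
import math
import random

def kmeans_mse(points, k, rng, max_iter=100):
    """Run one K-means fit and return its within-cluster MSE."""
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[j].append(p)
        updated = [
            tuple(sum(c) / len(m) for c in zip(*m)) if m else centroids[j]
            for j, m in enumerate(clusters)
        ]
        if updated == centroids:   # centroids stopped moving
            break
        centroids = updated
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points) / len(points)

def elbow_curve(points, k_values, restarts=10, seed=0):
    """Best MSE over several random starts, for each candidate K."""
    rng = random.Random(seed)
    return {
        k: min(kmeans_mse(points, k, rng) for _ in range(restarts))
        for k in k_values
    }

# Three made-up groups of four points each
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (5, 9), (5, 10), (6, 9), (6, 10)]
for k, mse in elbow_curve(points, range(1, 7)).items():
    print(k, round(mse, 3))
```

Plotting these values against K and looking for the bend is the elbow method; on this data the drop should flatten once K passes 3.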

Limitations of K-Means

Round clusters only: K-means minimizes distance to centroid, naturally creating circular/spherical clusters. Fails on elongated clusters (like bananas) or nested/ring shapes.
Sensitive to initialization: Different starting centroids can produce different final clusters (local optima). Solution: run multiple times with different random starts, keep best result.
Must specify K in advance: The true number of clusters is usually unknown.
Sensitive to outliers: Outliers can pull centroids away from the true cluster centers.

Hierarchical Agglomerative Clustering

A bottom-up approach that does NOT require specifying K in advance:

Start: each data point is its own cluster (n clusters)

Find the two closest clusters and merge them into one

Repeat until only one cluster remains

Result: a dendrogram — a tree diagram showing the order and distance of merges. You can cut the dendrogram at any height to get any number of clusters.
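The merge loop can be sketched directly. This version assumes single-link distance (the closest pair of points between two clusters) and Euclidean distance, and uses a brute-force search that is only suitable for small data; `single_link_merges` is a hypothetical name. The returned list of merges is the information a dendrogram draws.

```python
import math

def single_link_merges(points):
    """Return the merges as (cluster_a, cluster_b, distance), in merge order."""
    clusters = [[i] for i in range(len(points))]   # start: one cluster per point
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: distance between the closest pair of members
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]    # merge b into a
        del clusters[b]
    return merges

points = [(0,), (1,), (5,), (6,), (20,)]
merges = single_link_merges(points)
for left, right, height in merges:
    print(left, right, height)

# Cutting at a height keeps only the merges at or below it.
cut_height = 2
n_clusters = len(points) - sum(h <= cut_height for _, _, h in merges)
print(n_clusters)   # 3 clusters: {0, 1}, {5, 6}, {20}
```

Counting merges below the cut works here because single-link merge distances never decrease from one merge to the next.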

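Minimum Spanning Tree = single-link clustering. When the distance between two clusters is the distance between their closest pair of points, the merges add the edges of the minimum spanning tree in order of increasing length.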

Advantage over K-means: No need to specify K; produces a hierarchy of clusters at all scales.

EM Algorithm (Expectation-Maximization)

K-means can be viewed as a special case (hard assignments) of the broader EM algorithm:

E-step (Expectation): Assign each point to the most probable cluster (estimate cluster memberships)
M-step (Maximization): Use the assignments to update cluster parameters (recalculate centroids)

EM is also used for semi-supervised learning: amplify a small amount of labeled data using many unlabeled examples.

KNN vs K-Means — Important Distinction!

KNN (K-Nearest Neighbors)

Type: Supervised
Needs: labeled training data
K = number of neighbors to vote
Output: class label or value
Task: classification or regression
No "training" — stores data and searches at prediction time

K-Means

Type: Unsupervised
Needs: no labels
K = number of clusters to find
Output: cluster assignment for each point
Task: clustering / finding structure
Has a training phase (iteratively updates centroids)

Lecture 10 Summary — 5 Minute Revision

4 metric properties: positivity, identity, symmetry, triangle inequality. Correlation fails these.
L1 = Manhattan (sum of absolute differences). L2 = Euclidean (straight-line, most popular).
Always normalize before computing distance.
KNN: supervised; k neighbors vote; larger k = smoother boundary.
K-means: unsupervised; all 3 steps required (K value, initial centroids, distance metric).
K-means limitation: only round clusters. Use elbow method for K.
Agglomerative: bottom-up merging, produces dendrogram.
Unsupervised = no labels, find structure.

Practice Questions

Q1. Which of the following is NOT a required property of a distance metric?

A. Positivity: d(x,y) ≥ 0
B. Linearity: d(x,y) must increase linearly with the number of dimensions
C. Symmetry: d(x,y) = d(y,x)
D. Triangle inequality: d(x,y) ≤ d(x,z) + d(z,y)

Answer: B

The four required properties of a distance metric are: Positivity (d ≥ 0), Identity (d=0 iff x=y), Symmetry (d(x,y)=d(y,x)), and Triangle Inequality (d(x,y) ≤ d(x,z)+d(z,y)). "Linearity" is not a required property of distance metrics.

Q2. The correlation coefficient is NOT a distance metric because:

A. It is too expensive to compute
B. It ranges from -1 to 1, violating positivity and identity properties
C. It is asymmetric: corr(x,y) ≠ corr(y,x)
D. It only works for discrete variables

Answer: B

The correlation coefficient ranges from -1 to +1. Negative values violate positivity (a metric must be ≥ 0). Also, a correlation of 0 doesn't mean the two variables are "the same point," violating the identity property. Despite not being a metric, correlation is still a useful similarity measure.

Q3. The Manhattan distance (L1) between points (1,2) and (4,6) is:

A. 5 (Euclidean)
B. 7 (sum of absolute differences: |4-1| + |6-2| = 3+4)
C. 4 (maximum difference)
D. 25 (sum of squared differences)

Answer: B

Manhattan (L1) distance = sum of absolute differences across all dimensions. |4-1| + |6-2| = 3 + 4 = 7. Think of it as traveling on a grid (like Manhattan streets) — you must go 3 blocks right and 4 blocks up, total 7 blocks. Euclidean (L2) would be sqrt(3^2 + 4^2) = sqrt(9+16) = sqrt(25) = 5 (straight-line diagonal).

Q4. Why should features be normalized before computing distances?

A. Normalization always improves model accuracy
B. Without normalization, features with large numeric ranges dominate the distance calculation, making other features irrelevant
C. Most distance computation libraries require normalized inputs
D. Normalization converts qualitative data to quantitative data

Answer: B

Without normalization, a feature with values 0-1,000,000 (like income) will dominate the distance calculation over a feature with values 0-1 (like a percentage). The income difference between any two people would swamp all other feature differences. Normalizing to z-scores puts all features on the same scale, giving each dimension comparable influence.

Q5. In KNN classification, what is the effect of increasing K (the number of neighbors)?

A. The decision boundary becomes more complex and jagged
B. The decision boundary becomes smoother and more robust
C. Prediction becomes faster
D. The algorithm requires labeled training data for the first time

Answer: B

K=1: the boundary follows each individual training example exactly — very complex, jagged, prone to overfitting (sensitive to noise). Larger K: prediction is based on a vote of many neighbors, smoothing out noise. The boundary becomes smoother and more generalizable. Too large a K may oversmooth, but the main benefit is reduced sensitivity to individual noisy training examples.

Q6. Which step is NOT required by K-means clustering?

A. Determine the distance metric
B. Determine the value of K
C. Make an initial guess of the cluster centroids
D. None of the above — all three steps are required

Answer: D

ALL three steps are required for K-means: (1) You must determine K (how many clusters to find). (2) You must make initial guesses for the cluster centroids (starting positions). (3) You must determine the distance metric to measure how similar points are to centroids. Without any one of these, the algorithm cannot run.

Q7. K-means clustering is a type of:

A. Supervised learning — it requires labeled data to assign clusters
B. Semi-supervised learning — it uses a mix of labeled and unlabeled data
C. Unsupervised learning — it groups points without any labels
D. Reinforcement learning — it learns through rewards and penalties

Answer: C

K-means is unsupervised learning — it groups data points by similarity WITHOUT any class labels or target values. The algorithm discovers natural groupings in the data on its own. This is the defining characteristic of unsupervised learning: finding structure in unlabeled data.

Q8. The main limitation of K-means clustering is:

A. It requires labeled training data
B. It assumes clusters are round/circular and struggles with elongated, nested, or non-convex cluster shapes
C. It always finds the globally optimal solution
D. It cannot handle datasets with more than two features

Answer: B

K-means minimizes the sum of squared distances from points to their centroid, which naturally produces round/spherical clusters. It fails on: elongated clusters (banana-shaped), nested clusters (rings inside rings), and clusters of very different densities. Additionally, K-means may converge to local optima — different initial centroid guesses can produce different results.

Q9. The elbow method is used in K-means to:

A. Detect outliers in the dataset before clustering
B. Determine the appropriate number of clusters K by finding where adding more clusters gives diminishing improvement
C. Select the best distance metric for the clustering task
D. Initialize cluster centroids in an optimal way

Answer: B

The elbow method plots within-cluster MSE (how far points are from their centroid) against K. MSE decreases as K increases (more clusters = closer centroids). Once K exceeds the true number of natural clusters, additional clusters give little improvement — the MSE curve flattens out. The "elbow" (bend point) in the curve suggests the appropriate K.

Q10. An unsupervised learning problem is one where:

A. We predict a categorical variable as a function of input features
B. We seek to understand whether observations fit into distinct groups based on their similarities, with no predefined labels
C. We fit a regression model as a function of predictor variables
D. We import and clean data to make it tidy

Answer: B

Unsupervised learning tries to find structure in data without any predefined labels or target values. Clustering is the primary unsupervised task — grouping observations based on similarity. Options A and C describe supervised learning (predicting a labeled target from features). Option D describes data wrangling, not learning.

Q11. In hierarchical agglomerative clustering, the algorithm:

A. Starts with K pre-defined clusters and iteratively merges them
B. Starts with each point as its own cluster, then repeatedly merges the two nearest clusters
C. Starts with one large cluster and recursively splits it
D. Requires specifying K before running

Answer: B

Hierarchical agglomerative clustering is bottom-up: it starts with every data point as its own cluster (n clusters) and repeatedly merges the two nearest clusters until one cluster remains. The result is a dendrogram. Unlike K-means, you do NOT need to specify K in advance — you can cut the dendrogram at any level to get any number of clusters.

Q12. A customer reviews dataset needs to be categorized into "Positive," "Neutral," or "Negative" based on predefined sentiment rules. The appropriate approach is:

A. Classification — the categories are predefined and known in advance
B. Clustering — to discover natural sentiment groups
C. Regression — sentiment is a continuous variable
D. Agglomerative clustering — to merge similar reviews hierarchically

Answer: A

When the categories are PREDEFINED (Positive, Neutral, Negative) and known in advance, this is a classification problem. You train a model with labeled examples and predict the predefined category for new reviews. Clustering would be used if you wanted to DISCOVER natural groupings without knowing the categories beforehand.

Q13. The key difference between KNN classification and K-means clustering is:

A. KNN uses Euclidean distance; K-means uses Manhattan distance
B. KNN is supervised (uses labeled data to classify); K-means is unsupervised (finds clusters without labels)
C. KNN requires specifying K; K-means does not
D. K-means can only be used for 2D data; KNN handles any dimension

Answer: B

This is the fundamental distinction: KNN is supervised classification — it requires labeled training data and K refers to the number of neighbors to vote for a class label. K-means is unsupervised clustering — it requires no labels and K refers to the number of clusters to find. Both use distance, but for completely different purposes.

Q14. Why does K-means potentially converge to different solutions on different runs?

A. Because the distance metric changes each run
B. Because the algorithm is non-deterministic and changes K each run
C. Because different random initial centroid positions lead to different local optima
D. Because the training data is randomly shuffled each run

Answer: C

K-means is sensitive to initialization. The starting positions of centroids (usually random) affect which local optimum the algorithm converges to. Different random starts can produce very different final clusterings. Solution: run K-means multiple times with different random initializations and select the run with the lowest total within-cluster MSE (best solution found).

Q15. The Euclidean distance (L2) between points (0,0) and (3,4) is:

A. 7 (Manhattan: 3+4)
B. 5 (Euclidean: sqrt(3^2 + 4^2) = sqrt(25))
C. 25 (sum of squares without square root)
D. 4 (maximum dimension: max(3,4))

Answer: B

Euclidean distance (L2) = sqrt(sum of squared differences) = sqrt((3-0)^2 + (4-0)^2) = sqrt(9+16) = sqrt(25) = 5. This is the classic 3-4-5 right triangle. Manhattan (L1) would be 3+4=7. Euclidean gives the straight-line ("as the crow flies") distance between the points.
