Distance is the foundation of both classification (KNN) and clustering (K-means). Understanding how to measure similarity between data points — and knowing which measure to use — is essential. This lecture also covers clustering as the primary unsupervised learning technique.

Distance Metric Properties

A distance measure is formally called a metric only if it satisfies all four properties below. A measure that fails any of them may still be a useful similarity or dissimilarity measure, but it is not a distance metric.

Property | Rule | Plain English
Positivity | d(x,y) ≥ 0 | Distance is never negative
Identity | d(x,y) = 0 iff x = y | Zero distance means same point
Symmetry | d(x,y) = d(y,x) | Distance from A to B equals B to A
Triangle inequality | d(x,y) ≤ d(x,z) + d(z,y) | Direct path is never longer than detour through z

What fails to be a metric:

Measure | Range | Which property fails
Correlation coefficient | -1 to 1 | Positivity (can be negative) and Identity (d=0 doesn't always mean same point)
Cosine similarity | -1 to 1 | Positivity and Identity
Directed travel time | Asymmetric | Symmetry (one-way streets: distance A→B ≠ B→A)
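
The four properties can be spot-checked numerically. The sketch below is a brute-force check over a small, made-up set of points; the helper name `failed_properties` and the sample points are illustrative assumptions, not part of the lecture. A finite sample can only expose violations, it cannot prove that a measure is a metric, and it may flag more properties than the table lists.

```python
import math
from itertools import product

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def failed_properties(d, points, tol=1e-9):
    """Return the names of the metric properties that d violates on these points."""
    failed = set()
    for x, y in product(points, repeat=2):
        if d(x, y) < -tol:
            failed.add("positivity")
        # Identity: d is zero exactly when the two points are the same.
        if (abs(d(x, y)) <= tol) != (x == y):
            failed.add("identity")
        if abs(d(x, y) - d(y, x)) > tol:
            failed.add("symmetry")
    for x, y, z in product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:
            failed.add("triangle inequality")
    return failed

# Hypothetical sample points (no zero vectors, so cosine is always defined).
sample = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (2.0, 2.0)]

euclid_failed = failed_properties(euclidean, sample)
cosine_failed = failed_properties(cosine_similarity, sample)
print("Euclidean fails:", euclid_failed)
print("Cosine similarity fails:", cosine_failed)
```

On this sample, Euclidean distance passes every check, while cosine similarity is flagged for positivity and identity (among others) but not for symmetry.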

Lk Distance Metrics EXAM HOT

A family of distance metrics controlled by a single parameter k (with k ≥ 1; for smaller k the triangle inequality fails):

d(x,y) = ( ∑ |xᵢ - yᵢ|^k )^(1/k)
Sum over all d dimensions. x and y are two data points.

k value | Name | Formula | Intuition
k = 1 | Manhattan (L1) | d = ∑|xᵢ - yᵢ| | "Taxi cab" distance: travel only on a grid (horizontal + vertical). Sum of absolute differences.
k = 2 | Euclidean (L2) | d = √(∑(xᵢ-yᵢ)²) | Straight-line distance. Most popular. Weighs all dimensions equally.
k = ∞ | Chebyshev | d = max|xᵢ - yᵢ| | Only the largest dimensional difference matters
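
As a minimal sketch, the whole Lk family fits in one function. The name `lk_distance` and the convention of passing `math.inf` for the Chebyshev case are assumptions made for this example.

```python
import math

def lk_distance(x, y, k):
    """L_k distance between two equal-length points; k = math.inf gives Chebyshev."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(k):
        return max(diffs)  # limit of the formula as k grows without bound
    return sum(diff ** k for diff in diffs) ** (1.0 / k)

p, q = (1, 2), (4, 6)
manhattan = lk_distance(p, q, 1)         # |4-1| + |6-2|
euclid = lk_distance(p, q, 2)            # sqrt(3^2 + 4^2)
chebyshev = lk_distance(p, q, math.inf)  # max(3, 4)
print(manhattan, euclid, chebyshev)
```

For the points (1,2) and (4,6) the three values come out as 7, 5 and 4, which matches the hand calculation of 3 + 4, √25 and max(3, 4).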
Always normalize features to z-scores before computing distances. If one feature has range 0-1 and another 0-1,000,000, the large-scale feature will dominate the distance calculation regardless of actual relevance.
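
A small sketch of why scaling matters, using made-up data in which one feature is a fraction between 0 and 1 and the other is an income. The function name `zscore_columns`, the use of the population standard deviation, and the three sample rows are assumptions for illustration.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def zscore_columns(rows):
    """Rescale every column to mean 0 and (population) standard deviation 1."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / n)
        for col, m in zip(columns, means)
    ]
    # A constant column has std 0; map it to 0.0 rather than dividing by zero.
    return [
        tuple((v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds))
        for row in rows
    ]

# Hypothetical rows: (fraction in 0-1, income)
people = [(0.10, 30_000.0), (0.90, 31_000.0), (0.15, 90_000.0)]
scaled = zscore_columns(people)

raw_01 = euclidean(people[0], people[1])
raw_02 = euclidean(people[0], people[2])
scaled_01 = euclidean(scaled[0], scaled[1])
scaled_02 = euclidean(scaled[0], scaled[2])
print("raw:   ", raw_01, raw_02)
print("scaled:", scaled_01, scaled_02)
```

Before scaling, the raw distances are driven almost entirely by income, so the large gap in the first feature between person 0 and person 1 barely registers. After scaling, both features contribute on a comparable footing.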

K-Nearest Neighbors (KNN) Classification

KNN is one of the simplest classification algorithms: find the K most similar (nearest) training examples to a new input, then predict by majority vote.

Algorithm:

1. For each training example, compute the distance to the new input point
2. Find the K training examples with the smallest distances (the K nearest neighbors)
3. Predict by majority vote among those K neighbors
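
The three steps can be sketched directly in a few lines. This is a brute-force illustration under assumed names (`knn_classify`, the toy `train` list); it uses Euclidean distance, and ties in the vote are resolved in favor of the label seen first among the nearest neighbors, which is a choice made here and not something the lecture specifies.

```python
import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(train, query, k, distance=euclidean):
    """train is a list of (point, label) pairs; returns the majority label."""
    # Step 1: distance from every training example to the query point.
    scored = [(distance(point, query), label) for point, label in train]
    # Step 2: keep the k examples with the smallest distances.
    nearest = sorted(scored, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote among those k neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled training data: two well-separated groups.
train = [
    ((1.0, 1.0), "red"), ((1.2, 0.8), "red"), ((0.9, 1.3), "red"),
    ((4.0, 4.2), "blue"), ((4.1, 3.9), "blue"), ((3.8, 4.0), "blue"),
]

print(knn_classify(train, (1.1, 1.0), k=3))
print(knn_classify(train, (4.0, 4.0), k=3))
```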

For regression:

Instead of voting, assign the average (or weighted average) of the K neighbors' target values.
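
A minimal sketch of the regression variant follows. The plain average is as described above; the weighted version uses inverse-distance weights, which is one common scheme assumed here for illustration. The function name `knn_regress` and the one-dimensional toy data are also assumptions.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_regress(train, query, k, weighted=False):
    """train is a list of (point, target) pairs; returns the predicted value."""
    scored = [(euclidean(point, query), target) for point, target in train]
    nearest = sorted(scored, key=lambda pair: pair[0])[:k]
    if not weighted:
        return sum(target for _, target in nearest) / len(nearest)
    # Assumed weighting scheme: weight = 1 / distance.
    for dist, target in nearest:
        if dist == 0:
            return target  # exact match: use its target directly
    weights = [1.0 / dist for dist, _ in nearest]
    total = sum(w * target for w, (_, target) in zip(weights, nearest))
    return total / sum(weights)

# Hypothetical 1-D training data: target is ten times the input.
train = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

plain = knn_regress(train, (2.2,), k=2)
closer_counts_more = knn_regress(train, (2.2,), k=2, weighted=True)
print(plain, closer_counts_more)
```

With the query at 2.2, the two nearest neighbors are 2.0 and 3.0. The plain average treats them equally, while the weighted average is pulled toward the closer neighbor's target.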

Effect of K:

K value | Decision boundary | Risk
K = 1 | Very complex, jagged; follows every individual training example | Overfitting (sensitive to noise)
Larger K | Smoother; averages over many neighbors | More robust, but may oversmooth (underfit)
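
To see the effect of K on a concrete case, the sketch below plants one deliberately mislabeled training example inside the "blue" group and queries a point right next to it. The data, the noisy label, and the helper name `knn_classify` are all made up for this illustration.

```python
import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(train, query, k):
    scored = sorted((euclidean(point, query), label) for point, label in train)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

# Hypothetical data: (4.0, 4.0) sits among the blue points but is labeled red.
train = [
    ((1.0, 1.0), "red"), ((1.2, 0.8), "red"), ((0.9, 1.3), "red"),
    ((4.0, 4.0), "red"),  # noisy label
    ((4.2, 4.1), "blue"), ((3.9, 4.3), "blue"),
    ((4.3, 3.8), "blue"), ((3.7, 3.9), "blue"),
]

query = (4.05, 4.02)
predictions = {k: knn_classify(train, query, k) for k in (1, 3, 5)}
print(predictions)
```

With K = 1 the prediction copies the single noisy neighbor; with K = 3 or K = 5 the surrounding blue points outvote it.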

Key advantages of KNN:

Disadvantage: Prediction is slow — must compute distance to ALL training points for each new query. O(nd) per query, where n = training size, d = dimensions. Solutions: kd-trees, grid indices, locality-sensitive hashing.

Clustering (Unsupervised Learning)

Clustering = grouping data points by similarity, WITHOUT any labels. This is unsupervised — the algorithm discovers natural groupings on its own.

Why cluster?

K-Means Clustering EXAM HOT

One of the most widely used clustering algorithms. It partitions data into K groups by iteratively assigning points to the nearest centroid and updating the centroids.

Algorithm — ALL steps are required:

1. Determine the value of K (number of clusters you want)
2. Make an initial guess of K cluster centroids (usually random starting positions)
3. Determine the distance metric to use (e.g., Euclidean)
4. Assign each data point to its nearest centroid
5. Recalculate each centroid as the mean of all points assigned to it
6. Repeat steps 4-5 until no point changes its cluster assignment (convergence)

Exam answer: "Which step is NOT required by K-means?" → None of the above: ALL steps are required (Answer D)
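
The loop of steps 4-5 can be sketched as follows. In this illustration the caller supplies the first three choices explicitly: K is the number of initial centroids passed in, the initial guess is fixed (not random) so the run is reproducible, and the distance function is a parameter. The names `k_means` and `mean_point`, the `max_iter` cap, and the policy of leaving an empty cluster's centroid unchanged are assumptions of this sketch.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, initial_centroids, distance=euclidean, max_iter=100):
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iter):
        # Step 4: assign each point to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: distance(p, centroids[j]))
            for p in points
        ]
        # Step 6: stop once no point changes its cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 5: recompute each centroid as the mean of its points.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumed policy: an empty cluster keeps its old centroid
                centroids[j] = mean_point(members)
    return centroids, assignment

# Hypothetical data: two obvious groups; K = 2 with one initial guess in each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignment = k_means(points, [points[0], points[3]])
print(centroids, assignment)
```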

Choosing K — The Elbow Method

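The right value of K is usually unknown in advance. The elbow method plots the within-cluster MSE (mean squared distance from each point to its centroid) against K. MSE falls as K increases, and the curve flattens once K exceeds the number of natural clusters; the bend (the "elbow") in the curve suggests an appropriate K.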
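
A rough sketch of the elbow computation on made-up data with three obvious groups. To keep the run reproducible it uses a deterministic "farthest-first" initialization in place of random starting centroids; that choice, along with the helper names `farthest_first` and `k_means_mse`, is an assumption of this example.

```python
def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def farthest_first(points, k):
    """Deterministic initial guess: repeatedly add the point farthest from those chosen."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids

def k_means_mse(points, k, max_iter=100):
    """Run K-means and return the within-cluster MSE of the final clustering."""
    centroids = farthest_first(points, k)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment)) / len(points)

# Hypothetical data: three tight groups of three points each.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.2),
    (5.0, 5.0), (5.2, 4.8), (4.8, 5.2),
    (9.0, 1.0), (9.2, 0.8), (8.8, 1.2),
]

mse = {k: k_means_mse(points, k) for k in range(1, 5)}
for k, value in mse.items():
    print(k, round(value, 4))
```

On this data the MSE drops sharply up to K = 3 and barely changes afterwards, so the elbow sits at K = 3.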

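Limitations of K-Means

K-means assumes clusters are roughly round and struggles with elongated, nested, or non-convex cluster shapes. It may also converge to a local optimum, so different initial centroid guesses can produce different results.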
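
Because K-means can settle into different local optima depending on where the centroids start, a common remedy is to run it several times from different random starting points and keep the run with the lowest within-cluster MSE. The sketch below illustrates that idea; the function names, the fixed `seed`, and the sample data are assumptions made for reproducibility.

```python
import random

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def k_means(points, centroids, max_iter=100):
    """Run K-means from the given starting centroids; return (mse, centroids, assignment)."""
    centroids = list(centroids)
    k = len(centroids)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    mse = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment)) / len(points)
    return mse, centroids, assignment

def best_of_restarts(points, k, restarts=10, seed=0):
    """Try several random initial guesses and keep the lowest-MSE run."""
    rng = random.Random(seed)
    runs = [k_means(points, rng.sample(points, k)) for _ in range(restarts)]
    return min(runs, key=lambda run: run[0]), runs

# Hypothetical data: three tight groups of three points each.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.2),
    (5.0, 5.0), (5.2, 4.8), (4.8, 5.2),
    (9.0, 1.0), (9.2, 0.8), (8.8, 1.2),
]

best, runs = best_of_restarts(points, k=3)
print("MSE of each run:", [round(run[0], 4) for run in runs])
print("best MSE:", round(best[0], 4))
```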

Hierarchical Agglomerative Clustering

A bottom-up approach that does NOT require specifying K in advance:

1. Start: each data point is its own cluster (n clusters)
2. Find the two closest clusters and merge them into one
3. Repeat until only one cluster remains

Result: a dendrogram — a tree diagram showing the order and distance of merges. You can cut the dendrogram at any height to get any number of clusters.
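
A brute-force sketch of the merging loop, recording each merge and the distance at which it happened (the information a dendrogram draws). It assumes single-link distance between clusters and Euclidean distance between points; the names `agglomerate`, `single_link`, and `clusters_after_cut`, and the sample points, are made up for the example.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_link(a, b):
    """Cluster distance = distance between the closest pair of points."""
    return min(euclidean(p, q) for p in a for q in b)

def agglomerate(points, linkage=single_link):
    clusters = [[p] for p in points]  # start: every point is its own cluster
    merges = []                       # (merge distance, merged cluster), in order
    while len(clusters) > 1:
        # Find the two closest clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        dist = linkage(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
        merges.append((dist, merged))
    return merges

def clusters_after_cut(n_points, merges, height):
    """Number of clusters left if the dendrogram is cut at the given height."""
    return n_points - sum(1 for dist, _ in merges if dist <= height)

# Hypothetical data: two close pairs and one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 7.0), (20.0, 20.0)]
merges = agglomerate(points)
for dist, cluster in merges:
    print(round(dist, 3), cluster)
print(clusters_after_cut(len(points), merges, height=3.0))
```

Cutting at height 3 keeps only the two short merges, leaving three clusters: the two pairs and the isolated point.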

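Single-link clustering (merge the two clusters whose closest individual points are nearest) is equivalent to building a Minimum Spanning Tree: the merges follow the tree's edges in order of increasing length.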

Advantage over K-means: No need to specify K; produces a hierarchy of clusters at all scales.

EM Algorithm (Expectation-Maximization)

K-means is a special case of the broader EM algorithm:

EM is also used for semi-supervised learning: amplify a small amount of labeled data using many unlabeled examples.

KNN vs K-Means — Important Distinction!

KNN (K-Nearest Neighbors)

  • Type: Supervised
  • Needs: labeled training data
  • K = number of neighbors to vote
  • Output: class label or value
  • Task: classification or regression
  • No "training" — stores data and searches at prediction time

K-Means

  • Type: Unsupervised
  • Needs: no labels
  • K = number of clusters to find
  • Output: cluster assignment for each point
  • Task: clustering / finding structure
  • Has a training phase (iteratively updates centroids)

Lecture 10 Summary — 5 Minute Revision

  • 4 metric properties: positivity, identity, symmetry, triangle inequality. Correlation fails positivity and identity.
  • L1 = Manhattan (sum of absolute differences). L2 = Euclidean (straight-line, most popular).
  • Always normalize before computing distance.
  • KNN: supervised; k neighbors vote; larger k = smoother boundary.
  • K-means: unsupervised; all 3 steps required (K value, initial centroids, distance metric).
  • K-means limitation: only round clusters. Use elbow method for K.
  • Agglomerative: bottom-up merging, produces dendrogram.
  • Unsupervised = no labels, find structure.

Practice Questions

Q1. Which of the following is NOT a required property of a distance metric?

  • A. Positivity: d(x,y) ≥ 0
  • B. Linearity: d(x,y) must increase linearly with the number of dimensions
  • C. Symmetry: d(x,y) = d(y,x)
  • D. Triangle inequality: d(x,y) ≤ d(x,z) + d(z,y)

Answer: B

The four required properties of a distance metric are: Positivity (d ≥ 0), Identity (d=0 iff x=y), Symmetry (d(x,y)=d(y,x)), and Triangle Inequality (d(x,y) ≤ d(x,z)+d(z,y)). "Linearity" is not a required property of distance metrics.

Q2. The correlation coefficient is NOT a distance metric because:

  • A. It is too expensive to compute
  • B. It ranges from -1 to 1, violating positivity and identity properties
  • C. It is asymmetric: corr(x,y) ≠ corr(y,x)
  • D. It only works for discrete variables

Answer: B

The correlation coefficient ranges from -1 to +1. Negative values violate positivity (a metric must be ≥ 0). It also violates the identity property: a correlation of 0 does not mean the two variables are "the same point" (identical variables have a correlation of 1, not 0). Despite not being a metric, correlation is still a useful similarity measure.

Q3. The Manhattan distance (L1) between points (1,2) and (4,6) is:

  • A. 5 (Euclidean)
  • B. 7 (sum of absolute differences: |4-1| + |6-2| = 3+4)
  • C. 4 (maximum difference)
  • D. 25 (sum of squared differences)

Answer: B

Manhattan (L1) distance = sum of absolute differences across all dimensions. |4-1| + |6-2| = 3 + 4 = 7. Think of it as traveling on a grid (like Manhattan streets) — you must go 3 blocks right and 4 blocks up, total 7 blocks. Euclidean (L2) would be sqrt(3^2 + 4^2) = sqrt(9+16) = sqrt(25) = 5 (straight-line diagonal).

Q4. Why should features be normalized before computing distances?

  • A. Normalization always improves model accuracy
  • B. Without normalization, features with large numeric ranges dominate the distance calculation, making other features irrelevant
  • C. Most distance computation libraries require normalized inputs
  • D. Normalization converts qualitative data to quantitative data

Answer: B

Without normalization, a feature with values 0-1,000,000 (like income) will dominate the distance calculation over a feature with values 0-1 (like a percentage). The income difference between any two people would swamp all other feature differences. Normalizing to z-scores puts all features on the same scale, giving each dimension comparable influence.

Q5. In KNN classification, what is the effect of increasing K (the number of neighbors)?

  • A. The decision boundary becomes more complex and jagged
  • B. The decision boundary becomes smoother and more robust
  • C. Prediction becomes faster
  • D. The algorithm requires labeled training data for the first time

Answer: B

K=1: the boundary follows each individual training example exactly — very complex, jagged, prone to overfitting (sensitive to noise). Larger K: prediction is based on a vote of many neighbors, smoothing out noise. The boundary becomes smoother and more generalizable. Too large a K may oversmooth, but the main benefit is reduced sensitivity to individual noisy training examples.

Q6. Which step is NOT required by K-means clustering?

  • A. Determine the distance metric
  • B. Determine the value of K
  • C. Make an initial guess of the cluster centroids
  • D. None of the above — all three steps are required

Answer: D

ALL three steps are required for K-means: (1) You must determine K (how many clusters to find). (2) You must make initial guesses for the cluster centroids (starting positions). (3) You must determine the distance metric to measure how similar points are to centroids. Without any one of these, the algorithm cannot run.

Q7. K-means clustering is a type of:

  • A. Supervised learning — it requires labeled data to assign clusters
  • B. Semi-supervised learning — it uses a mix of labeled and unlabeled data
  • C. Unsupervised learning — it groups points without any labels
  • D. Reinforcement learning — it learns through rewards and penalties

Answer: C

K-means is unsupervised learning — it groups data points by similarity WITHOUT any class labels or target values. The algorithm discovers natural groupings in the data on its own. This is the defining characteristic of unsupervised learning: finding structure in unlabeled data.

Q8. The main limitation of K-means clustering is:

  • A. It requires labeled training data
  • B. It assumes clusters are round/circular and struggles with elongated, nested, or non-convex cluster shapes
  • C. It always finds the globally optimal solution
  • D. It cannot handle datasets with more than two features

Answer: B

K-means minimizes the sum of squared distances from points to their centroid, which naturally produces round/spherical clusters. It fails on: elongated clusters (banana-shaped), nested clusters (rings inside rings), and clusters of very different densities. Additionally, K-means may converge to local optima — different initial centroid guesses can produce different results.

Q9. The elbow method is used in K-means to:

  • A. Detect outliers in the dataset before clustering
  • B. Determine the appropriate number of clusters K by finding where adding more clusters gives diminishing improvement
  • C. Select the best distance metric for the clustering task
  • D. Initialize cluster centroids in an optimal way

Answer: B

The elbow method plots within-cluster MSE (how far points are from their centroid) against K. MSE decreases as K increases (more clusters = closer centroids). Once K exceeds the true number of natural clusters, additional clusters give little improvement — the MSE curve flattens out. The "elbow" (bend point) in the curve suggests the appropriate K.

Q10. An unsupervised learning problem is one where:

  • A. We predict a categorical variable as a function of input features
  • B. We seek to understand whether observations fit into distinct groups based on their similarities, with no predefined labels
  • C. We fit a regression model as a function of predictor variables
  • D. We import and clean data to make it tidy

Answer: B

Unsupervised learning tries to find structure in data without any predefined labels or target values. Clustering is the primary unsupervised task — grouping observations based on similarity. Options A and C describe supervised learning (predicting a labeled target from features). Option D describes data wrangling, not learning.

Q11. In hierarchical agglomerative clustering, the algorithm:

  • A. Starts with K pre-defined clusters and iteratively merges them
  • B. Starts with each point as its own cluster, then repeatedly merges the two nearest clusters
  • C. Starts with one large cluster and recursively splits it
  • D. Requires specifying K before running

Answer: B

Hierarchical agglomerative clustering is bottom-up: it starts with every data point as its own cluster (n clusters) and repeatedly merges the two nearest clusters until one cluster remains. The result is a dendrogram. Unlike K-means, you do NOT need to specify K in advance — you can cut the dendrogram at any level to get any number of clusters.

Q12. A customer reviews dataset needs to be categorized into "Positive," "Neutral," or "Negative" based on predefined sentiment rules. The appropriate approach is:

  • A. Classification — the categories are predefined and known in advance
  • B. Clustering — to discover natural sentiment groups
  • C. Regression — sentiment is a continuous variable
  • D. Agglomerative clustering — to merge similar reviews hierarchically

Answer: A

When the categories are PREDEFINED (Positive, Neutral, Negative) and known in advance, this is a classification problem. You train a model with labeled examples and predict the predefined category for new reviews. Clustering would be used if you wanted to DISCOVER natural groupings without knowing the categories beforehand.

Q13. The key difference between KNN classification and K-means clustering is:

  • A. KNN uses Euclidean distance; K-means uses Manhattan distance
  • B. KNN is supervised (uses labeled data to classify); K-means is unsupervised (finds clusters without labels)
  • C. KNN requires specifying K; K-means does not
  • D. K-means can only be used for 2D data; KNN handles any dimension

Answer: B

This is the fundamental distinction: KNN is supervised classification — it requires labeled training data and K refers to the number of neighbors to vote for a class label. K-means is unsupervised clustering — it requires no labels and K refers to the number of clusters to find. Both use distance, but for completely different purposes.

Q14. Why does K-means potentially converge to different solutions on different runs?

  • A. Because the distance metric changes each run
  • B. Because the algorithm is non-deterministic and changes K each run
  • C. Because different random initial centroid positions lead to different local optima
  • D. Because the training data is randomly shuffled each run

Answer: C

K-means is sensitive to initialization. The starting positions of centroids (usually random) affect which local optimum the algorithm converges to. Different random starts can produce very different final clusterings. Solution: run K-means multiple times with different random initializations and select the run with the lowest total within-cluster MSE (best solution found).

Q15. The Euclidean distance (L2) between points (0,0) and (3,4) is:

  • A. 7 (Manhattan: 3+4)
  • B. 5 (Euclidean: sqrt(3^2 + 4^2) = sqrt(25))
  • C. 25 (sum of squares without square root)
  • D. 4 (maximum dimension: max(3,4))

Answer: B

Euclidean distance (L2) = sqrt(sum of squared differences) = sqrt((3-0)^2 + (4-0)^2) = sqrt(9+16) = sqrt(25) = 5. This is the classic 3-4-5 right triangle. Manhattan (L1) would be 3+4=7. Euclidean gives the straight-line ("as the crow flies") distance between the points.