Lecture 6: Building & Validating Models

Modeling is the process of encapsulating information into a tool that can make forecasts and predictions. This lecture covers the principles behind building GOOD models — not just any model that fits training data, but one that generalizes to new data.

The Data Science Analysis Pipeline

Ask an interesting question: What is the scientific goal? What do you want to predict or estimate?

Get the data: How was it sampled? Which data is relevant? Privacy issues?

Explore the data: Plot the data. Look for anomalies and patterns. (This is EDA — Lecture 5)

Model the data: Build a model. Fit the model. Validate the model.

Communicate and visualize the results: What did we learn? Do results make sense? Can we tell a story?

The key steps in modeling are: building, fitting, and validating the model.

Philosophies of Modeling

1. Occam's Razor (EXAM HOT)

"The simplest explanation is best."

In model building, this means: prefer the simpler model when two models perform similarly. A simpler model with fewer parameters is:

More interpretable — you can explain why it makes predictions
More robust — less likely to overfit the training data
More generalizable — performs better on new, unseen data

Important: Accuracy on the training data is NOT the best measure of model quality. Improved performance on a specific training set is often due to overfitting (memorizing noise), not genuine insight. Regularized methods such as LASSO and Ridge regression apply Occam's Razor by adding a penalty on coefficient size, which automatically discourages model complexity.
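
To make the idea of a complexity penalty concrete, here is a minimal sketch under a strong simplifying assumption: a one-feature linear model with no intercept, y ≈ w·x. In that special case both penalized fits have closed forms. The function names and the toy data are illustrative and do not come from the lecture.

```python
import math

def _sums(xs, ys):
    """Return (sum of x*y, sum of x*x) for paired samples."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy, sxx

def ols_weight(xs, ys):
    """Unpenalized least squares: minimize sum((y - w*x)^2)."""
    sxy, sxx = _sums(xs, ys)
    return sxy / sxx

def ridge_weight(xs, ys, lam):
    """Ridge (L2): minimize sum((y - w*x)^2) + lam * w^2."""
    sxy, sxx = _sums(xs, ys)
    return sxy / (sxx + lam)

def lasso_weight(xs, ys, lam):
    """LASSO (L1): minimize sum((y - w*x)^2) + lam * |w|.

    The solution is the least-squares numerator soft-thresholded by lam / 2.
    """
    sxy, sxx = _sums(xs, ys)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    return math.copysign(shrunk, sxy) / sxx

# Toy data (made up for illustration), roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

for lam in (0.0, 10.0, 200.0):
    print(lam, ridge_weight(xs, ys, lam), lasso_weight(xs, ys, lam))
```

As the penalty grows, both weights shrink. With a large enough penalty the LASSO weight becomes exactly zero, while the ridge weight only moves toward zero without reaching it.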

2. Bias-Variance Tradeoff (EXAM HOT)

Every model makes errors from two sources. The goal is to find the balance between them.

Bias = Underfitting

Error from wrong assumptions in the model. Example: assuming the data is linear when it is actually curved. The model is too simple — it misses real patterns in BOTH training and test data.

Symptoms: High error on training data AND high error on test data.

Variance = Overfitting

Error from too much sensitivity to the training data. The model memorizes noise and specific training examples rather than learning the true pattern.

Symptoms: Very low error on training data BUT high error on new test data.

The tradeoff: Increasing model complexity typically reduces bias but increases variance, so lowering one tends to raise the other. The goal is to find the sweet spot where both are acceptably low and the error on new data is smallest.

Exam answer: "What is the Bias-Variance Tradeoff?" → The balance between underfitting and overfitting.
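
The symptoms described above can be reproduced on synthetic data. The sketch below is only an illustration: the sine-plus-noise data, the alternating train/test split, and the choice of a constant model and k-nearest-neighbor averaging as the "too simple" and "too flexible" models are all assumptions made for this example, not methods named in the lecture.

```python
import math
import random

def make_data(n=60, noise_sd=0.3, seed=0):
    """Noisy samples of y = sin(2*pi*x) on an evenly spaced grid in [0, 1)."""
    rng = random.Random(seed)
    return [(i / n, math.sin(2 * math.pi * i / n) + rng.gauss(0.0, noise_sd))
            for i in range(n)]

def mean_model(train):
    """High-bias model: ignore x and always predict the mean training y."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def knn_model(train, k):
    """Predict the average y of the k training points closest to x."""
    def predict(x):
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict

def mse(model, data):
    """Mean squared error of a model on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

data = make_data()
train, test = data[0::2], data[1::2]  # alternate points: 30 train, 30 test

models = {
    "mean (too simple)": mean_model(train),
    "5-NN": knn_model(train, k=5),
    "1-NN (memorizes)": knn_model(train, k=1),
}
for name, model in models.items():
    print(f"{name:20s} train MSE={mse(model, train):.3f}  "
          f"test MSE={mse(model, test):.3f}")
```

The 1-NN model reproduces every training point exactly, so its training error is zero by construction, yet it still makes errors on the held-out points. The constant model has a large error on both sets.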

3. Think Probabilistically (Nate Silver)

Good forecasting models produce a probability distribution over possible outcomes, not a single deterministic prediction. Properties of valid probabilities:

They sum to 1 across all possible events
They are never negative
Rare events get a small but non-zero probability

A good forecaster: thinks probabilistically, updates forecasts with new information, and looks for consensus across multiple models.
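
One way to turn observed counts into a forecast that satisfies the properties listed above is additive (add-alpha) smoothing. This is an assumption chosen for illustration, not a technique prescribed by the lecture, and the event names and counts are hypothetical.

```python
def smoothed_probabilities(counts, alpha=1.0):
    """Turn event counts into probabilities with additive smoothing.

    Adding alpha to every count keeps unseen events from getting probability 0.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive so no event gets probability 0")
    total = sum(counts.values()) + alpha * len(counts)
    return {event: (n + alpha) / total for event, n in counts.items()}

def is_valid_distribution(probs, tol=1e-9):
    """Check that no probability is negative and that they sum to 1."""
    values = list(probs.values())
    return all(p >= 0 for p in values) and abs(sum(values) - 1.0) <= tol

# Hypothetical counts of past outcomes; "snow" was never observed.
counts = {"sunny": 8, "rain": 2, "snow": 0}
forecast = smoothed_probabilities(counts)
print(forecast)
print(is_valid_distribution(forecast))
```

Here "snow" receives probability 1/13 rather than 0, so the forecast admits that an event never seen so far can still happen.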

Modeling Methodologies

Type	Description	Example
First Principle Models	Based on theoretical understanding of how the system works	Physics simulation, scientific formula
Data-Driven Models	Based on observed correlations between input and output from data	Linear regression trained on historical data

Good models are typically a mixture of both.

Baseline Models

Before declaring your sophisticated model is good, you must first compare it to the simplest reasonable alternatives — baselines. If your model only barely beats a baseline, it is not very impressive.

Common baselines:

Always predict the most common class (useful for imbalanced data)
Uniform random guessing
The best single-variable model
The previous time period's value (for time series)
A published/existing model

Rule: Only after decisively beating your baselines can your model be deemed genuinely effective.
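
Two of the baselines listed above are simple enough to sketch in a few lines: always predicting the most common class, and predicting the previous time period's value. The helper names and the small data sets are hypothetical and serve only as an example.

```python
from collections import Counter

def majority_class_baseline(train_labels):
    """Return the most common training label; the baseline always predicts it."""
    return Counter(train_labels).most_common(1)[0][0]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def persistence_mae(series):
    """Mean absolute error of predicting series[t-1] for series[t]."""
    errors = [abs(curr - prev) for prev, curr in zip(series, series[1:])]
    return sum(errors) / len(errors)

# Classification: the baseline ignores the inputs entirely.
train_labels = ["ham"] * 8 + ["spam"] * 2
test_labels = ["ham", "ham", "spam", "ham"]
guess = majority_class_baseline(train_labels)
baseline_acc = accuracy(test_labels, [guess] * len(test_labels))
print(guess, baseline_acc)

# Time series: tomorrow's forecast is simply today's value.
series = [10, 12, 11, 13, 14]
print(persistence_mae(series))
```

A candidate model would then be scored with the same metric on the same test data, and its result compared against these numbers.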

How to Evaluate Your Model

Always evaluate on out-of-sample (test) data that was NOT used for training. A model that performs well on its own training data may simply have memorized it (overfitting).
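
A minimal sketch of holding out test data, assuming the rows are independent so that a random shuffle is appropriate. The function name, the 25% test fraction, and the fixed seed are illustrative choices.

```python
import random

def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows and hold out a fraction of them as a test set."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)      # seeded for reproducibility
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20))
train, test = holdout_split(rows)
print(len(train), len(test))
```

The model is fit using only `train`; `test` is touched once, at the end, to estimate how well the model generalizes.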

Model type	Output	Key metrics
Classification	Discrete labels (spam/not spam, cat/dog)	Accuracy, Precision, Recall, F1-score, Confusion Matrix
Regression	Continuous numerical values (price, temperature)	MSE (mean squared error), RMSE, MAE, R-squared
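
The regression metrics in the table can be computed directly from their standard definitions. The sketch below uses plain Python and made-up values for illustration.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up true values and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 10.0]
print(mse(y_true, y_pred), rmse(y_true, y_pred),
      mae(y_true, y_pred), r_squared(y_true, y_pred))
```

For these values the squared errors sum to 1.5, giving MSE = 0.375, MAE = 0.5, and R-squared = 1 - 1.5/20 = 0.925.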

Note: Accuracy can be misleading for imbalanced datasets — a model that always predicts the majority class achieves high accuracy but is useless (see Lecture 4).
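
The following sketch shows the effect on a made-up imbalanced data set with 5 positives out of 100. The class names are hypothetical, and returning 0.0 when a metric's denominator is zero is a convention assumed here rather than something stated in the lecture.

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (tp, fp, fn, tn) for one class treated as the positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall, and F1 computed from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# 5 fraud cases among 100 transactions; the "model" always predicts "ok".
y_true = ["fraud"] * 5 + ["ok"] * 95
y_pred = ["ok"] * 100
metrics = classification_metrics(y_true, y_pred, positive="fraud")
print(metrics)
```

Accuracy comes out at 0.95 even though the model never finds a single fraud case, which is why recall and F1 are both 0.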

Underfitting vs Overfitting in Practice

Property	Underfitting (High Bias)	Good Fit	Overfitting (High Variance)
Training error	High	Low	Very low
Test error	High	Low	High
Model complexity	Too simple	Appropriate	Too complex
Problem	Misses real patterns	Generalizes well	Memorizes noise
Solution	More features, more complexity	Keep as-is	Regularization, simpler model, more data

Lecture 6 Summary — 5 Minute Revision

Modeling pipeline: Ask → Get → Explore → Model → Communicate.
Occam's Razor: the simplest model that fits is preferable.
Bias = underfitting (too simple, misses patterns; high error everywhere).
Variance = overfitting (too complex, memorizes noise; low training error, high test error).
Always build baseline models first.
Test on out-of-sample data.
Primary goal: generalize well to unseen data, not just minimize training error.
Accuracy is not the best metric for imbalanced data.

Practice Questions

Q1. The primary goal of model building is:

A. Maximizing the number of features used
B. Achieving the lowest possible error on training data
C. Developing a model that generalizes well to unseen data
D. Using the most complex available algorithm

Answer: C

The primary goal is generalization — the model must work well on NEW data it has never seen before. A model that perfectly fits training data but fails on new data (overfitting) is useless in practice. The whole point of modeling is to make predictions on future/unseen data.

Q2. What does Occam's Razor suggest in model building?

A. The simplest model that fits the data is preferable
B. More complex models always perform better
C. The model with the most parameters is the best
D. More data always leads to more accurate models

Answer: A

Occam's Razor states "the simplest explanation is best." In modeling: when two models perform similarly, choose the simpler one. Simpler models are more interpretable, more robust, and generalize better. More complex models often appear to perform better only because they overfit training data, not because they have genuine insight.

Q3. What is the Bias-Variance Tradeoff?

A. The balance between underfitting and overfitting
B. The tradeoff between accuracy and interpretability
C. The compromise between model speed and prediction quality
D. The choice between supervised and unsupervised learning

Answer: A

The Bias-Variance Tradeoff describes the tension between two sources of model error. Bias (underfitting) = error from wrong assumptions; model too simple. Variance (overfitting) = error from over-sensitivity to training data; model too complex. The goal is to find the balance where both are low — good fit on training data AND on new data.

Q4. A model has very low training error but very high error on test data. This is called:

A. Underfitting (high bias)
B. Overfitting (high variance)
C. A good fit
D. Baseline performance

Answer: B

Low training error + high test error = overfitting (high variance). The model has memorized the training data, including its noise, but fails to generalize. It has learned the specific quirks of the training set rather than the true underlying pattern. The solution: simpler model, regularization, or more training data.

Q5. Which of the following is NOT a common step in building a machine learning model?

A. Data preprocessing
B. Feature selection
C. Model interpretation
D. Implementing database indexing

Answer: D

Database indexing is a database engineering task, not a machine learning step. Common ML model building steps include: data preprocessing (cleaning, scaling), feature selection (choosing which variables to include), model selection, training, validation on test data, and model interpretation/evaluation.

Q6. Why must you evaluate a model on test data rather than training data?

A. Test data is always larger than training data
B. Training evaluation would always show 100% accuracy
C. A model evaluated only on training data may have memorized it (overfit) and fail on new data
D. Test data has no missing values, making evaluation more reliable

Answer: C

Evaluating on training data tells you how well the model fits the data it was built on — not how well it will perform on new data. An overfit model can achieve near-perfect training accuracy while being useless on new examples. Test data (held out from training) measures genuine generalization ability.

Q7. A baseline model is:

A. The most complex model available
B. A simple, reasonable reference model that your main model must clearly outperform
C. The model deployed in production
D. A model built without any features

Answer: B

A baseline is the simplest reasonable model you can build — like always predicting the most common class, or using only one variable. Your main model must decisively beat the baseline to be considered genuinely useful. If your complex model only barely beats a naive baseline, it is not adding real value.

Q8. High bias (underfitting) means:

A. The model is too complex and memorizes training noise
B. The model is too simple and makes wrong assumptions, causing high error on both training and test data
C. The training error is low but test error is high
D. The model uses too many features

Answer: B

High bias (underfitting) means the model is too simple or makes incorrect assumptions. Example: fitting a straight line to data that is actually curved. Result: high error on BOTH training and test data — the model misses the real pattern everywhere. Solution: increase model complexity, add better features.

Q9. The data science pipeline in correct order is:

A. Model, Explore, Get, Ask, Communicate
B. Ask, Get, Explore, Model, Communicate
C. Get, Ask, Model, Explore, Communicate
D. Ask, Model, Get, Explore, Communicate

Answer: B

The data science pipeline: (1) Ask an interesting question, (2) Get the data, (3) Explore the data (EDA — visualize, clean), (4) Model the data (build, fit, validate), (5) Communicate and visualize the results. Exploration (EDA) always comes BEFORE modeling — you must understand your data before modeling it.

Q10. LASSO and Ridge regression are examples of techniques that apply:

A. The Central Limit Theorem to model selection
B. Occam's Razor — they automatically penalize model complexity to prefer simpler models
C. Bayes' theorem to estimate parameters
D. Gradient descent with a very high learning rate

Answer: B

LASSO (L1 regularization) and Ridge (L2 regularization) regression add a penalty term to the cost function that discourages large coefficients. LASSO can shrink some coefficients exactly to zero, effectively removing those features from the model; Ridge shrinks coefficients toward zero without eliminating them. Both are a mathematical implementation of Occam's Razor: they push the model to rely on only the most important features and remain simple.

Q11. First principle models differ from data-driven models in that:

A. First principle models can only be built with R, not Python
B. First principle models are based on theoretical understanding of the system; data-driven models are based on observed data correlations
C. Data-driven models always outperform first principle models
D. First principle models require no data at all

Answer: B

First principle models are based on a theoretical explanation of how the system works (like physical simulations or scientific formulas). Data-driven models are based on observed correlations in data (like regression models trained on historical data). Good models are typically a mixture of both — you use domain theory to design the model structure and data to fit the parameters.

Q12. Good forecasting models should produce:

A. A single deterministic prediction (the most likely outcome)
B. A probability distribution over all possible outcomes
C. Only binary yes/no predictions
D. A prediction only when confidence is above 99%

Answer: B

Demanding a single deterministic prediction from a model is a "fool's errand." Good forecasting models produce a probability distribution over all possible events. Properties of valid probabilities: they sum to 1, are never negative, and rare events get small but non-zero probabilities. This captures uncertainty honestly rather than pretending certainty.