Lecture 5: Data Visualization

Visualization serves three purposes: (1) exploratory data analysis (EDA), understanding what your data looks like; (2) error detection, catching anomalies and cleaning artifacts; (3) communication, presenting findings to others. "Feeding unvisualized data to an analytical model is asking for trouble."

Why Visualization Is Essential: Anscombe's Quartet

Anscombe's Quartet consists of four datasets that have identical summary statistics:

Same mean (x=9, y=7.5)
Same variance (x=10, y≈3.75 when dividing by n; the sample variance, dividing by n-1, is x=11, y≈4.13)
Same correlation (r≈0.816)
Same regression line (y ≈ 3 + 0.5x)

Yet when plotted, they look completely different: one is a clean linear relationship, one is a perfect curve, one has a single outlier distorting an otherwise perfect line, and one has every x value identical except for a single outlier, which alone determines the regression line.

The lesson: Statistical summaries alone are NEVER enough to understand your data. Always visualize before modeling.
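The summary statistics listed above can be checked directly, since R ships the quartet as the built-in `anscombe` data frame (columns `x1`..`x4`, `y1`..`y4`). The following is a minimal sketch; the helper name `summarise_pair` is an illustrative choice, not part of any package. Note that R's `var()` divides by n-1, so it reports 11 and about 4.13 rather than the divide-by-n figures of 10 and 3.75.

```r
# `anscombe` is part of base R's datasets package.
data(anscombe)

# Hypothetical helper: summary statistics for the i-th x/y pair.
summarise_pair <- function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)  # least-squares regression line
  c(mean_x = mean(x), mean_y = mean(y),
    var_x = var(x), var_y = var(y),   # sample variance (n - 1)
    cor_xy = cor(x, y),
    intercept = unname(coef(fit)[1]),
    slope = unname(coef(fit)[2]))
}

# One row per dataset, one column per statistic.
stats <- t(sapply(1:4, summarise_pair))
print(round(stats, 2))
```

The four rows agree to about two decimal places. Plotting each pair, for example with `plot(anscombe$x4, anscombe$y4)`, is what reveals the four different shapes.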

Chart Selection Guide

Chart type	Best for	Key rule / note
Simple text	Communicating 1-2 key numbers	When a chart adds no value
Table	Mixed audience; multiple different units	Design should fade into background — data takes center stage; use minimal borders
Heatmap	Tabular data with color-coded magnitude	Shows frequency distribution of data across two dimensions
Scatter plot	Relationship between two numerical variables	Ideal for showing relationships; use bubble charts for 3 variables
Line chart	Continuous data and trends over time	Implies continuity between points; do NOT use for categorical x-axis
Bar chart (vertical)	Comparing categories	MUST start at zero — eyes compare bar heights
Bar chart (horizontal)	Categories with long names	Readers see labels before data — better for named categories
Stacked bar	Totals plus subcomponent breakdown	Easy to compare bottom series only; upper series lack consistent baseline
Histogram	Distribution of one continuous variable	Bin size matters — try several; better for peaks than CDFs
CDF plot	Cumulative distribution (focus on tails)	CDFs better than histograms for showing tails
Box plot	Distribution comparison across groups	Shows median, quartiles, IQR, and outliers

Key Charts Explained

Bar Charts: Zero Baseline Rule [EXAM HOT]

Bar charts work because our eyes compare the HEIGHTS (endpoints) of bars. If the y-axis does not start at zero, small differences between bars appear enormous.

Example: Values of 35% and 39.6% look fairly similar when the y-axis starts at 0 (the taller bar is only about 13% taller). But if the y-axis starts at 34%, the visible bars are 1 and 5.6 units tall, so the 39.6% bar appears 5.6 times taller than the 35% bar, a completely false visual impression.

Rule: Bar charts MUST always have a zero baseline. Non-zero baselines are a classic way to mislead viewers (intentionally or accidentally).
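The arithmetic behind the 35% versus 39.6% example can be checked in a few lines. This is a small sketch; `apparent_ratio` is a hypothetical helper written for this illustration, based on the observation that the drawn height of a bar is its value minus the axis baseline.

```r
# Hypothetical helper: ratio of the tallest to the shortest drawn bar
# when the y-axis starts at `baseline`.
apparent_ratio <- function(values, baseline = 0) {
  heights <- values - baseline  # visible bar height above the axis start
  max(heights) / min(heights)
}

values <- c(35, 39.6)

ratio_zero <- apparent_ratio(values, baseline = 0)        # honest baseline
ratio_truncated <- apparent_ratio(values, baseline = 34)  # truncated axis

print(round(c(zero = ratio_zero, truncated = ratio_truncated), 2))
# zero is about 1.13, truncated is 5.6
```

With a zero baseline the drawn ratio matches the true ratio of the values. Any other baseline changes it.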

Stacked Bar Charts

Show totals AND how they break down into subcomponents. The bottom series (sitting on the x-axis) is easy to compare because it has a consistent baseline (zero). Upper series are hard to compare because they have different starting heights.
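The baseline problem can be made concrete by computing where each segment starts. The sketch below uses made-up numbers (the `sales` matrix and its row and column names are purely illustrative). Each segment starts at the cumulative sum of the segments beneath it.

```r
# Hypothetical data: rows are series (bottom first), columns are categories.
sales <- rbind(bottom = c(10, 12, 11),
               middle = c(5, 9, 2),
               top    = c(7, 3, 8))
colnames(sales) <- c("North", "South", "East")

# Starting height of each segment = sum of all segments below it.
baseline <- apply(sales, 2, function(v) cumsum(v) - v)
totals <- colSums(sales)

print(baseline)
print(totals)
```

The bottom row of `baseline` is zero in every column, while the middle and top rows start at a different height in each column. That is why only the bottom series (and the totals) can be compared by eye.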

Histograms vs Bar Charts

Histogram

For continuous data divided into bins
Bars represent frequency of values in each range
Bin size changes the shape you see — always try multiple
Frequency histogram: shows counts. Density histogram: rescales bar heights so the total bar area equals 1, which makes distributions with different sample sizes directly comparable.

Bar Chart

For categorical data
Each bar represents one distinct category
Must start at zero
Gaps between bars (unlike histogram)
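The effect of bin size, and the difference between frequency and density histograms, can be seen numerically with base R's `hist()`. This sketch uses the built-in `faithful` dataset as an assumed example (it is not from the lecture); `bin_hist` and the break range 1.5 to 5.5 are illustrative choices that cover the data.

```r
# Built-in dataset: Old Faithful eruption durations in minutes.
x <- faithful$eruptions

# Hypothetical helper: histogram with exactly `n_bins` equal-width bins.
bin_hist <- function(x, n_bins) {
  breaks <- seq(1.5, 5.5, length.out = n_bins + 1)
  hist(x, breaks = breaks, plot = FALSE)  # compute only, do not draw
}

coarse <- bin_hist(x, 2)   # too few bins: the shape is lost
medium <- bin_hist(x, 8)   # two peaks become visible
fine   <- bin_hist(x, 40)  # many narrow bins: noisier outline

print(coarse$counts)
print(medium$counts)

# Frequency view: counts add up to the sample size.
print(sum(medium$counts))
# Density view: bar areas (height * width) add up to 1.
print(sum(medium$density * diff(medium$breaks)))
```

Passing `plot = TRUE` (the default) draws each histogram, which makes the comparison across bin counts visual rather than numeric.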

Box Plots (Box and Whisker)

Shows the full distribution of a variable in a compact way:

Center line = median (Q2)
Box = interquartile range (Q1 to Q3, middle 50% of data)
Whiskers = range of non-outlier values (by a common convention, the most extreme points within 1.5 × IQR of the box)
Points outside whiskers = outliers

Very useful for comparing distributions across multiple groups side by side (e.g., weight distribution for different height groups).
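The parts of a box plot can be computed by hand. The sketch below assumes the common 1.5 × IQR rule for deciding what counts as an outlier; that rule is a convention and is not stated in the lecture. The data vector is made up for illustration.

```r
# Hypothetical sample with one extreme value.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

q <- quantile(x, c(0.25, 0.5, 0.75))  # Q1, median (Q2), Q3
iqr <- q[[3]] - q[[1]]                # box height: middle 50% of the data

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[3]] + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

print(q)         # 3.25, 5.5, 7.75
print(whiskers)  # 1 and 9
print(outliers)  # 50
```

Plotting functions may compute the box edges slightly differently (for example with hinges rather than these quantiles), so drawn boxes can differ a little from the hand calculation.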

What To Avoid [EXAM HOT]

Pie Charts

Humans are bad at accurately reading angles and comparing arc lengths. 3D pie charts add perspective distortion, making slices at the front appear bigger and slices at the back appear smaller.

Always use bar charts instead for proportions. Bar charts allow direct length comparison which is much more accurate.

3D Charts

3D creates a perspective effect that distorts data. Bars at the front of a 3D chart appear taller than identical bars at the back. Values cannot be accurately read.

Never use 3D for data charts. 3D charts exist only to look impressive, at the cost of accuracy.

Non-Zero Baseline (Bar Charts)

If the y-axis starts above zero, the visual difference between bars is exaggerated. This is a common misleading technique in media and advertising.

Always ensure bar charts start at zero.

Secondary Y-axis

Having two different y-axes on the same chart is confusing — viewers don't know which scale to use for which series. Better alternatives: label data directly on the chart, or split into two separate charts stacked vertically.
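One way to sketch the "two separate charts stacked vertically" alternative in ggplot2 is with facets. The example assumes ggplot2 is installed; the `monthly` data frame and its column names are invented for illustration.

```r
library(ggplot2)

# Hypothetical data: two measures on very different scales, in long format.
monthly <- data.frame(
  month = rep(1:6, times = 2),
  measure = rep(c("Revenue (USD)", "Orders"), each = 6),
  value = c(12000, 13500, 12800, 15000, 16200, 15800,
            310, 342, 330, 371, 395, 388)
)

# One panel per measure, stacked in a single column.
# scales = "free_y" gives each panel its own clearly labelled y-axis.
p <- ggplot(monthly, aes(month, value)) +
  geom_line() +
  facet_wrap(~ measure, ncol = 1, scales = "free_y")
```

Printing `p` draws the two panels sharing the x-axis, so the trends can be compared without the reader having to work out which scale belongs to which line.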

ggplot2 in R [MAY BE TESTED]

ggplot2 is a widely used visualization package for R. The basic structure is:

ggplot(data, aes(x_variable, y_variable)) + geom_charttype()

Key examples:

ggplot(data, aes(factor(sex), age)) + geom_boxplot()

Boxplot of age grouped by sex. factor(sex) converts sex to a categorical variable for grouping on the x-axis.

ggplot(data, aes(height, weight)) + geom_point()

Scatter plot of height vs weight

Rule: The first argument in aes() is the x-axis variable (grouping or independent). The second is the y-axis variable (measured or dependent). factor() wraps a variable to treat it as categorical.
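The two key examples can be run end to end on a small made-up data frame. The sketch assumes ggplot2 is installed; `people` and its values are hypothetical stand-ins for the `data` object used in the notes. Here `sex` is stored as a 0/1 number, which is the situation where `factor()` matters.

```r
library(ggplot2)

# Hypothetical data; sex is a numeric 0/1 code, not yet categorical.
people <- data.frame(
  sex    = c(0, 0, 0, 0, 1, 1, 1, 1),
  age    = c(23, 31, 45, 52, 27, 35, 41, 60),
  height = c(160, 165, 158, 170, 175, 180, 172, 178),
  weight = c(55, 62, 58, 70, 72, 85, 74, 80)
)

# Boxplot of age grouped by sex: factor() turns the 0/1 code into two groups.
p_box <- ggplot(people, aes(factor(sex), age)) + geom_boxplot()

# Scatter plot: height on the x-axis, weight on the y-axis.
p_scatter <- ggplot(people, aes(height, weight)) + geom_point()
```

Printing `p_box` or `p_scatter` draws the chart. Without `factor()`, ggplot2 treats the numeric `sex` as a continuous x value and does not split the data into one box per category.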

Lecture 5 Summary — 5 Minute Revision

Always visualize before modeling (Anscombe's Quartet). Use scatter plots for relationships between two numeric variables. Line charts for time trends (continuous). Bar charts for categories — ALWAYS zero baseline. Histograms for distributions. Box plots for group comparisons. Avoid: pie charts (hard to read angles), 3D charts (distort data), non-zero baselines (mislead). ggplot2: ggplot(data, aes(factor(sex), age)) + geom_boxplot() creates a grouped boxplot.

Practice Questions

Q1. Anscombe's Quartet demonstrates that:

A. Summary statistics alone are sufficient to understand your data
B. Scatter plots are always the best chart type
C. Four datasets with identical statistical summaries can have completely different visual distributions
D. Normal distributions always look the same visually

Answer: C

Anscombe's Quartet shows four datasets with (nearly) identical means, variances, correlations, and regression lines that look completely different when plotted. One is linear, one is curved, one has a single influential outlier, and one has all its x values stacked at a single value apart from one outlier. The lesson: always visualize your data before analysis, because statistics alone can hide critical patterns.

Q2. Which R code generates a boxplot of age grouped by sex?

A. ggplot(data, aes(factor(sex), age)) + geom_boxplot()
B. ggplot(data, aes(sex, factor(age))) + geom_boxplot()
C. ggplot(data, aes(age, sex)) + geom_boxplot()
D. ggplot(data, aes(factor(sex))) + geom_boxplot(aes(age))

Answer: A

ggplot(data, aes(factor(sex), age)) + geom_boxplot(): The first aes() argument is the x-axis grouping variable. factor(sex) converts sex to a categorical variable. The second argument (age) is the y-axis continuous variable being measured. geom_boxplot() draws the box plots. The result: one box plot for each sex category.

Q3. What is the primary disadvantage of 3D charts?

A. They are harder to create than 2D charts
B. They distort data and mislead viewers due to perspective effects
C. They enhance clarity by adding visual depth
D. They are more engaging but contain the same information as 2D charts

Answer: B

3D charts create perspective distortion — bars at the front appear taller and bars at the back appear shorter than their actual values. This makes it impossible to accurately read data values and creates misleading visual comparisons. 3D is purely cosmetic and always harms accuracy.

Q4. Which chart type is most appropriate for showing how a value changes over time?

A. Bar chart
B. Pie chart
C. Line chart
D. Scatter plot

Answer: C

Line charts are most appropriate for continuous data trends over time. The physical connection between data points implies continuity — appropriate for time-series data. Don't use line charts for categorical x-axis data, as the implied connection between categories is misleading.

Q5. Why must bar charts always have a zero baseline?

A. To make the chart look visually balanced
B. Because R automatically enforces this rule
C. Our eyes compare bar heights from the bottom — a non-zero baseline exaggerates differences
D. To accurately display the sample mean

Answer: C

When we read bar charts, our eyes compare the total height of each bar from the bottom. If the y-axis starts above zero (say at 34%), bars with similar values (35% and 39.6%) appear drastically different in height: the visible bars are 1 and 5.6 units tall, so one looks 5.6x taller than the other when they actually differ by only 4.6 percentage points. Always start at zero for honest visual comparisons.

Q6. True or False: Bar charts are generally better than pie charts for showing proportions.

A. True
B. False

Answer: A — True

Bar charts are better because humans are much more accurate at comparing lengths (bar heights) than at comparing angles or areas (pie slices). When two pie slices are similar in size, it is very hard to tell which is larger. Two adjacent bars of similar height are immediately identifiable as very similar in value.

Q7. In a stacked bar chart, which part of the chart is easiest to compare across categories?

A. The uppermost series
B. The middle series
C. The bottom series (directly on the x-axis baseline)
D. All series are equally easy to compare

Answer: C

The bottom series sits on the x-axis (a consistent zero baseline) making it easy to compare across categories. Upper series start at different heights for each bar — they have no consistent baseline to compare from, making visual comparison unreliable. This is a key limitation of stacked bar charts.

Q8. A scatter plot is ideal for:

A. Showing how categories compare in frequency
B. Showing the relationship between two numerical variables
C. Showing the distribution of a single continuous variable
D. Showing changes over time for categorical data

Answer: B

Scatter plots are ideal for showing the relationship between two numerical variables. Each point represents one observation with its position determined by its values on both variables. Patterns in the scatter (upward trend, downward trend, cluster, no pattern) reveal the relationship. For frequency comparison: bar chart. For single variable distribution: histogram.

Q9. What does a box plot show?

A. Only the mean and standard deviation
B. The median, quartiles (Q1-Q3), range of non-outlier values, and outliers
C. The frequency of values in each bin
D. The correlation between two variables

Answer: B

A box plot shows: the center line = median, the box edges = Q1 (25th percentile) and Q3 (75th percentile), the whiskers = range of non-outlier values, and individual dots = outliers beyond the whiskers. Box plots are very useful for comparing distributions across multiple groups side by side.

Q10. The three main purposes of data visualization are:

A. Printing, publishing, and presenting
B. EDA (exploratory analysis), error detection, and communicating results
C. Model building, model validation, and model deployment
D. Data collection, data storage, and data retrieval

Answer: B

The three purposes are: (1) EDA — exploring what your data looks like, finding patterns and anomalies before modeling. (2) Error detection — identifying whether something went wrong in data collection or processing. (3) Communication — presenting what you learned to stakeholders, telling a clear data story.

Q11. In a histogram, why does bin size matter?

A. Larger bins always give more accurate representations
B. Bin size does not affect the interpretation of the data
C. Different bin sizes can reveal or hide different patterns — too few bins loses detail, too many is noisy
D. Bin size only matters for very large datasets

Answer: C

Bin size significantly changes what a histogram looks like. Too few bins (very wide): you lose detail and multiple distinct peaks may merge into one. Too many bins (very narrow): the chart looks noisy and random, making patterns hard to see. Always explore several bin sizes to understand your data's distribution properly.

Q12. Which of the following is the best alternative to a secondary y-axis on a chart?

A. Use a 3D chart to add the extra dimension
B. Split into two separate charts stacked vertically, or label data directly
C. Use a pie chart for one variable and a bar for the other
D. Use different colors but keep the secondary axis

Answer: B

Secondary y-axes are confusing because viewers don't know which scale applies to which series. Better alternatives: (1) Label data points directly on the chart with their values, (2) Split into two separate charts stacked vertically so each has its own clear axis. Both options are easier to read than a dual-axis chart.