Chart types, when to use each, best practices, what to avoid, ggplot2
Anscombe's Quartet consists of four datasets that have nearly identical summary statistics: means, variances, correlations, and fitted regression lines.
Yet when plotted, they look completely different: one is a clean linear relationship, one is a perfect curve, one has a single outlier distorting an otherwise perfect line, and one is a vertical cluster with one outlier pulling the regression line.
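A quick numerical check makes the point concrete. The sketch below uses the published Anscombe values and only Python's standard library; the helper name `pearson_r` and the `SUMMARY` dictionary are illustrative choices, not part of these notes.

```python
from statistics import mean

# Anscombe's quartet (Anscombe, 1973). Datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# (mean of x, mean of y, correlation) for each of the four datasets
SUMMARY = {
    name: (mean(xs), mean(ys), pearson_r(xs, ys))
    for name, (xs, ys) in QUARTET.items()
}

for name, (mx, my, r) in SUMMARY.items():
    print(f"{name:>3}: mean x = {mx:.2f}, mean y = {my:.2f}, r = {r:.3f}")
```

All four rows come out with a mean x of 9, a mean y of about 7.5, and a correlation of roughly 0.82, so nothing in the numbers hints at the four very different shapes.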
| Chart type | Best for | Key rule / note |
|---|---|---|
| Simple text | Communicating 1-2 key numbers | When a chart adds no value |
| Table | Mixed audience; multiple different units | Design should fade into background — data takes center stage; use minimal borders |
| Heatmap | Tabular data with color-coded magnitude | Shows frequency distribution of data across two dimensions |
| Scatter plot | Relationship between two numerical variables | Ideal for showing relationships; use bubble charts for 3 variables |
| Line chart | Continuous data and trends over time | Implies continuity between points; do NOT use for categorical x-axis |
| Bar chart (vertical) | Comparing categories | MUST start at zero — eyes compare bar heights |
| Bar chart (horizontal) | Categories with long names | Readers see labels before data — better for named categories |
| Stacked bar | Totals plus subcomponent breakdown | Easy to compare bottom series only; upper series lack consistent baseline |
| Histogram | Distribution of one continuous variable | Bin size matters — try several; better for peaks than CDFs |
| CDF plot | Cumulative distribution (focus on tails) | CDFs better than histograms for showing tails |
| Box plot | Distribution comparison across groups | Shows median, quartiles, IQR, and outliers |
Bar charts work because our eyes compare the HEIGHTS (endpoints) of bars. If the y-axis does not start at zero, small differences between bars appear enormous.
Example: Values of 35% and 39.6% look similar when the y-axis starts at 0 (the taller bar is only about 13% taller). But if the y-axis starts at 34%, the drawn bars are 1 and 5.6 points tall, so the 39.6% bar appears 5.6 times taller than the 35% bar. That is a completely false visual impression.
Rule: Bar charts MUST always have a zero baseline. Non-zero baselines are a classic way to mislead viewers (intentionally or accidentally).
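The arithmetic behind the 35% versus 39.6% example can be sketched in a few lines. The function name `apparent_ratio` is hypothetical; it simply divides the drawn bar heights, which is what a viewer's eye compares.

```python
def apparent_ratio(small, large, baseline=0.0):
    """Ratio of drawn bar heights when the y-axis starts at `baseline`."""
    if baseline >= small:
        raise ValueError("baseline must be below the smaller value")
    return (large - baseline) / (small - baseline)


HONEST = apparent_ratio(35.0, 39.6)           # y-axis starts at 0
TRUNCATED = apparent_ratio(35.0, 39.6, 34.0)  # y-axis starts at 34

print(f"zero baseline: {HONEST:.2f}x, baseline at 34: {TRUNCATED:.2f}x")
```

The true ratio of the two values is about 1.13, but truncating the axis at 34 inflates the visual ratio to 5.6.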
Stacked bar charts show totals AND how they break down into subcomponents. The bottom series (sitting on the x-axis) is easy to compare because it has a consistent baseline (zero). Upper series are hard to compare because each segment starts at a different height in every bar.
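To see why upper series are hard to compare, it helps to compute where each segment starts. The sketch below uses made-up sales figures (the `SALES` data and the name `segment_baselines` are hypothetical) and a running sum to find each segment's baseline.

```python
from itertools import accumulate

# Hypothetical data: each bar is stacked bottom to top in dictionary order.
SALES = {
    "Q1": {"online": 40, "retail": 30, "wholesale": 10},
    "Q2": {"online": 25, "retail": 30, "wholesale": 20},
}


def segment_baselines(stack):
    """Return the height at which each segment of one stacked bar starts."""
    values = list(stack.values())
    # The first segment starts at 0; each later one starts at the running total.
    starts = [0] + list(accumulate(values))[:-1]
    return dict(zip(stack.keys(), starts))


BASELINES = {label: segment_baselines(stack) for label, stack in SALES.items()}

for label, starts in BASELINES.items():
    print(label, starts)
```

In this made-up data the "retail" segment is 30 in both bars, yet it starts at 40 in one bar and at 25 in the other, so the two equal segments do not line up visually.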
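A box plot shows the full distribution of a variable in a compact way: the median (center line), the quartiles Q1 and Q3 (box edges), whiskers spanning the non-outlier values, and outliers drawn as individual points.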
Very useful for comparing distributions across multiple groups side by side (e.g., weight distribution for different height groups).
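The numbers a box plot draws can be computed directly. This is a minimal sketch assuming the common 1.5 × IQR convention for flagging outliers and the "inclusive" quartile method from Python's `statistics` module; plotting tools differ slightly in how they compute quartiles, so treat the exact values as one reasonable convention. The function name `box_plot_stats` is hypothetical.

```python
from statistics import quantiles


def box_plot_stats(values, k=1.5):
    """Median, quartiles, whisker ends and outliers for one group.

    Assumption: points beyond k * IQR from the box are treated as outliers.
    """
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],    # smallest non-outlier value
        "whisker_high": inside[-1],  # largest non-outlier value
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


STATS = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(STATS)
```

For this sample the value 30 lies above the upper fence (7.75 + 1.5 × 4.5 = 14.5), so it is reported as an outlier and the upper whisker stops at 9.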
Pie charts ask viewers to judge angles and arc lengths, and humans are bad at reading both accurately. 3D pie charts add perspective distortion, making slices at the front appear bigger and slices at the back appear smaller.
Use bar charts instead of pie charts for proportions. Bar charts allow direct length comparison, which is much more accurate.
3D creates a perspective effect that distorts data. Bars at the front of a 3D chart appear taller than identical bars at the back. Values cannot be accurately read.
Never use 3D for data charts. 3D charts exist only to look impressive, at the cost of accuracy.
If the y-axis starts above zero, the visual difference between bars is exaggerated. This is a common misleading technique in media and advertising.
Always ensure bar charts start at zero.
Having two different y-axes on the same chart is confusing — viewers don't know which scale to use for which series. Better alternatives: label data directly on the chart, or split into two separate charts stacked vertically.
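ggplot2 is R's widely used visualization package. The basic structure is a ggplot() call that names the data and maps variables with aes(), plus a geom layer that picks the chart type: ggplot(data, aes(x, y)) + geom_boxplot().
Key example: ggplot(data, aes(factor(sex), age)) + geom_boxplot() draws one box plot of age for each sex category.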
Rule: The first argument in aes() is the x-axis variable (grouping or independent). The second is the y-axis variable (measured or dependent). factor() wraps a variable to treat it as categorical.
Always visualize before modeling (Anscombe's Quartet). Use scatter plots for relationships between two numeric variables. Line charts for time trends (continuous). Bar charts for categories — ALWAYS zero baseline. Histograms for distributions. Box plots for group comparisons. Avoid: pie charts (hard to read angles), 3D charts (distort data), non-zero baselines (mislead). ggplot2: ggplot(data, aes(factor(sex), age)) + geom_boxplot() creates a grouped boxplot.
Q1. Anscombe's Quartet demonstrates that:
Answer: C
Anscombe's Quartet shows four datasets with nearly identical means, variances, correlations, and regression lines that look completely different when plotted. One is linear, one is curved, one has a single influential outlier, and one is a vertical cluster with a single outlying point. The lesson: always visualize your data before analysis, because summary statistics alone can hide critical patterns.
Q2. Which R code generates a boxplot of age grouped by sex?
Answer: A
ggplot(data, aes(factor(sex), age)) + geom_boxplot(): The first aes() argument is the x-axis grouping variable. factor(sex) converts sex to a categorical variable. The second argument (age) is the y-axis continuous variable being measured. geom_boxplot() draws the box plots. The result: one box plot for each sex category.
Q3. What is the primary disadvantage of 3D charts?
Answer: B
3D charts create perspective distortion — bars at the front appear taller and bars at the back appear shorter than their actual values. This makes it impossible to accurately read data values and creates misleading visual comparisons. 3D is purely cosmetic and always harms accuracy.
Q4. Which chart type is most appropriate for showing how a value changes over time?
Answer: C
Line charts are most appropriate for continuous data trends over time. The physical connection between data points implies continuity — appropriate for time-series data. Don't use line charts for categorical x-axis data, as the implied connection between categories is misleading.
Q5. Why must bar charts always have a zero baseline?
Answer: C
When we read bar charts, our eyes compare the total height of each bar from the bottom. If the y-axis starts above zero (say at 34%), bars with similar values (35% and 39.6%) appear drastically different in height: the drawn bars are 1 and 5.6 points tall, so one looks 5.6x taller than the other when the values actually differ by only 4.6 percentage points. Always start at zero for honest visual comparisons.
Q6. True or False: Bar charts are generally better than pie charts for showing proportions.
Answer: A — True
Bar charts are better because humans are much more accurate at comparing lengths (bar heights) than at comparing angles or areas (pie slices). When two pie slices are similar in size, it is very hard to tell which is larger. Two adjacent bars of similar height are immediately identifiable as very similar in value.
Q7. In a stacked bar chart, which part of the chart is easiest to compare across categories?
Answer: C
The bottom series sits on the x-axis (a consistent zero baseline) making it easy to compare across categories. Upper series start at different heights for each bar — they have no consistent baseline to compare from, making visual comparison unreliable. This is a key limitation of stacked bar charts.
Q8. A scatter plot is ideal for:
Answer: B
Scatter plots are ideal for showing the relationship between two numerical variables. Each point represents one observation with its position determined by its values on both variables. Patterns in the scatter (upward trend, downward trend, cluster, no pattern) reveal the relationship. For frequency comparison: bar chart. For single variable distribution: histogram.
Q9. What does a box plot show?
Answer: B
A box plot shows: the center line = median, the box edges = Q1 (25th percentile) and Q3 (75th percentile), the whiskers = range of non-outlier values, and individual dots = outliers beyond the whiskers. Box plots are very useful for comparing distributions across multiple groups side by side.
Q10. The three main purposes of data visualization are:
Answer: B
The three purposes are: (1) Exploratory data analysis (EDA): exploring what your data looks like and finding patterns and anomalies before modeling. (2) Error detection: identifying whether something went wrong in data collection or processing. (3) Communication: presenting what you learned to stakeholders and telling a clear data story.
Q11. In a histogram, why does bin size matter?
Answer: C
Bin size significantly changes what a histogram looks like. Too few bins (very wide): you lose detail and multiple distinct peaks may merge into one. Too many bins (very narrow): the chart looks noisy and random, making patterns hard to see. Always explore several bin sizes to understand your data's distribution properly.
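The effect is easy to reproduce without any plotting. The sketch below bins one small made-up sample (two clusters, around 2 and around 8) with equal-width bins; `bin_counts` and `SAMPLE` are hypothetical names used only for illustration.

```python
def bin_counts(values, bins):
    """Count values in `bins` equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [len(values)] + [0] * (bins - 1)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value would land one past the end, so clamp it into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


# Made-up sample with two clusters and a gap between them.
SAMPLE = [1, 1, 2, 2, 2, 3, 3, 7, 8, 8, 8, 9, 9]
COUNTS = {bins: bin_counts(SAMPLE, bins) for bins in (1, 3, 8)}

for bins, counts in COUNTS.items():
    print(f"{bins} bins: {counts}")
```

With one bin the two clusters are indistinguishable, with three bins the empty middle bin exposes the gap, and with eight bins the same data is spread across several small, uneven counts.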
Q12. Which of the following is the best alternative to a secondary y-axis on a chart?
Answer: B
Secondary y-axes are confusing because viewers don't know which scale applies to which series. Better alternatives: (1) Label data points directly on the chart with their values, (2) Split into two separate charts stacked vertically so each has its own clear axis. Both options are easier to read than a dual-axis chart.