Lecture 2: Types of Data

If you don't understand the type of data you are working with, you will waste time applying models that are ineffective for that type. Every analysis decision — which chart to use, which model to apply, what questions to ask — depends on knowing your data type.

The Big Picture: Two Classification Axes

Axis 1: Structure

Structured — organized in rows and columns
Unstructured — free-form, no standard format

Axis 2: Measurement

Quantitative — numbers where arithmetic is meaningful
Qualitative — categories where arithmetic is NOT meaningful

Structured vs Unstructured Data

Structured (Organized) Data — Data organized as rows and columns, where every row is one observation and every column is one characteristic. ML models are built primarily for this. Examples: spreadsheets, database tables, scientific measurement records.

Unstructured (Unorganized) Data — Data that exists in free form with no standard hierarchy. Must be pre-processed before most ML models can use it. Examples: tweets, emails, server logs, audio files, genetic sequences (ACGTATTGCA).

Why unstructured data matters: Most data in the real world IS unstructured — tweets, emails, literature, server logs. We must apply pre-processing techniques to extract structured features from it before using standard models.

Example: To analyze emails, you might count word frequencies and convert each email into a row with columns like "count_of_word_free", "count_of_word_money", etc.
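
The word-count idea can be sketched in a few lines. This is a minimal illustration, assuming plain-text emails and a hand-picked two-word vocabulary; the function name `email_to_row` and the sample emails are made up for this example.

```python
import re
from collections import Counter

def email_to_row(text, vocabulary):
    """Turn one email into a row of word counts, one column per vocabulary word."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)  # missing words count as 0
    return {f"count_of_word_{word}": counts[word] for word in vocabulary}

# Hypothetical emails, for illustration only.
emails = [
    "Free money! Claim your free prize now",
    "Meeting moved to 3pm, see you there",
]
vocabulary = ["free", "money"]

# Each unstructured email becomes one structured row.
rows = [email_to_row(email, vocabulary) for email in emails]
for row in rows:
    print(row)
```

Each email is now one observation (row) with one characteristic per column, which is the structured form most models expect.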

Quantitative vs Qualitative Data

Quantitative Data — Data described using numbers where basic mathematical operations (addition, averaging) produce meaningful results. The TEST: "Does it make sense to add these values together or compute an average?" If yes, it is quantitative.

Qualitative Data — Data described using categories and language where mathematical operations are NOT meaningful. You cannot compute "the average hair color" or "the sum of all country names." If arithmetic is nonsensical, it is qualitative.

Quantitative: Discrete vs Continuous

Discrete Data

Values are distinct and separate, typically counted. Always integers — cannot have fractions or decimals. You cannot have 2.5 customers or a dice roll of 3.7.

Examples: Number of customers in a shop, dice roll (1-6), number of children in a family, number of cars in a parking lot.

Continuous Data

Values are measured and can take any value within a range, including fractions and decimals. In theory the precision is unlimited; in practice it is bounded by the measuring instrument.

Examples: Height (170.2cm or 175.85cm), weight (68.5kg or 89.66kg), temperature, revenue ($12,345.67).
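
One rough way to tell counted values from measured ones in code is to check whether every value is a whole number. The helper `looks_discrete` is a hypothetical name, and the check is only a heuristic: measured values that happen to be rounded (heights recorded to the nearest cm) would also pass, so the final call still depends on how the data was collected.

```python
def looks_discrete(values):
    """Heuristic: True if every value is a whole number, as counted data would be."""
    return all(float(value).is_integer() for value in values)

dice_rolls = [1, 6, 3, 3]            # counted
heights_cm = [170.2, 175.85, 168.0]  # measured

print(looks_discrete(dice_rolls))  # True
print(looks_discrete(heights_cm))  # False
```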

Qualitative: Nominal vs Ordinal

Nominal Data

Categories with no natural order or ranking. You cannot say one category is "greater than" another. Comparison between categories is meaningless.

Examples: Hair color (red, brown, blonde), blood type (A, B, AB, O), country name, eye color, zip code.

Ordinal Data

Categories with a meaningful order or ranking, BUT the gaps between ranks are NOT equal or measurable. You know which is "more" but not by how much.

Examples: Competition placing (1st, 2nd, 3rd), satisfaction survey (Poor, Fair, Good, Excellent), star ratings (1-5 stars), Likert scales (Strongly Agree to Strongly Disagree).

Key distinction: Ordinal vs Quantitative. The gap between "Fair" and "Good" is NOT necessarily the same as the gap between "Good" and "Excellent." That unequal-gap property is what makes the data ordinal rather than quantitative.
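
For ordinal data, operations that rely only on order are safe. The sketch below uses the satisfaction scale from the examples and made-up survey responses; the names `SCALE` and `RANK` are illustrative, and the rank positions are used for ordering only, never for arithmetic.

```python
from statistics import median_low

# Satisfaction scale from the notes, listed from lowest to highest.
SCALE = ["Poor", "Fair", "Good", "Excellent"]
RANK = {label: position for position, label in enumerate(SCALE)}

# Made-up survey responses, for illustration only.
responses = ["Good", "Poor", "Excellent", "Good", "Fair"]

# Order is meaningful, so sorting and picking the best response are valid.
ordered = sorted(responses, key=RANK.get)
best = max(responses, key=RANK.get)

# The median depends only on order, not on gap sizes.
median_label = SCALE[median_low([RANK[r] for r in responses])]

print(ordered)       # ['Poor', 'Fair', 'Good', 'Good', 'Excellent']
print(best)          # Excellent
print(median_label)  # Good
```

Averaging the rank positions (0, 1, 2, 3) would silently assume equal gaps between levels, which is exactly what ordinal data does not guarantee.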

Complete Data Type Hierarchy

Category	Sub-type	Arithmetic OK?	Order?	Examples
Quantitative	Discrete	Yes (integers only)	Yes	Customer count, dice roll, no. of children
Quantitative	Continuous	Yes (any decimal)	Yes	Height, weight, temperature, revenue
Qualitative	Nominal	No	No	Hair color, country, blood type, zip code
Qualitative	Ordinal	No	Yes (unequal gaps)	Survey ratings, competition rank, star ratings

The Zip Code Rule (Important!)

A zip code like 6000 or 90210 looks like a number, but it is qualitative nominal. Why? Because:

You cannot compute "the average zip code" — that would be meaningless
Zip code 6000 is not "greater than" or "less than" zip code 90210
There is no ordering between zip codes

General rule: Whenever a number could be replaced by a word or label without losing meaning, the data is qualitative. The test is always: "Can I meaningfully add these? Can I meaningfully average these?"
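
In code, one common way to respect this rule is to store zip codes as strings and summarize them by counting. The sketch below is an assumption about good practice, not part of the lecture; the sample values are illustrative, with "02134" included to show that converting to a number also drops leading zeros.

```python
from collections import Counter

# Zip codes kept as text labels; the sample values are for illustration only.
zip_codes = ["90210", "6000", "02134", "90210"]

# Meaningful for nominal data: frequencies and unique values.
counts = Counter(zip_codes)
most_common_zip, occurrences = counts.most_common(1)[0]
unique_zips = sorted(set(zip_codes))

# Treating the label as a number changes it: the leading zero is lost.
round_trip = str(int("02134"))

print(most_common_zip, occurrences)  # 90210 2
print(unique_zips)
print(round_trip)                    # 2134
```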

Questions You Can Ask by Data Type

Quantitative questions	Qualitative questions
What is the average value?	Which value occurs most/least?
Does this increase or decrease over time?	How many unique values are there?
Is there a dangerous threshold?	What are all the unique values?
What is the standard deviation?	What proportion belongs to each category?
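
The questions in the table map onto different operations. Here is a minimal sketch using only the Python standard library; the revenue figures, the threshold of 140, and the hair-color list are made-up values for illustration.

```python
from collections import Counter
from statistics import mean, stdev

# Made-up sample data, for illustration only.
daily_revenue = [120.0, 135.5, 128.25, 150.0]      # quantitative
hair_colors = ["brown", "red", "brown", "blonde"]  # qualitative

# Quantitative questions: average, spread, threshold.
average_revenue = mean(daily_revenue)
revenue_spread = stdev(daily_revenue)
above_threshold = [value for value in daily_revenue if value > 140]

# Qualitative questions: most common value, unique values, proportions.
counts = Counter(hair_colors)
most_common_color = counts.most_common(1)[0][0]
unique_colors = sorted(counts)
proportions = {color: n / len(hair_colors) for color, n in counts.items()}

print(average_revenue)    # 133.4375
print(above_threshold)    # [150.0]
print(most_common_color)  # brown
print(proportions)
```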

Coffee Shop Classification Example

Field	Type	Reasoning
Name of coffee shop	Qualitative Nominal	No arithmetic meaning; no ordering between names
Revenue ($thousands)	Quantitative Continuous	Can add/average; decimals are valid ($12,345.67)
Zip code	Qualitative Nominal	Numbers but no meaningful arithmetic or ordering
Monthly customers	Quantitative Discrete	Counted (whole people only, no 2.5 customers)
Country of coffee origin	Qualitative Nominal	No order between countries (Ethiopia not > Colombia)
Star rating (1-5)	Qualitative Ordinal	Has order, but 5-4 gap may not equal 2-1 gap
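
The classification above can drive which summary is computed for each field. The following is a sketch under stated assumptions: the field names, the `FIELD_TYPES` mapping, the `summarize` helper, and the three shop records are all hypothetical, chosen only to mirror the table.

```python
from collections import Counter
from statistics import mean, median_low

# Data type of each field, following the classification table above.
FIELD_TYPES = {
    "name": "nominal",
    "revenue_thousands": "continuous",
    "zip_code": "nominal",
    "monthly_customers": "discrete",
    "origin_country": "nominal",
    "star_rating": "ordinal",
}

# Made-up records, for illustration only.
shops = [
    {"name": "Shop A", "revenue_thousands": 12.5, "zip_code": "90210",
     "monthly_customers": 900, "origin_country": "Ethiopia", "star_rating": 4},
    {"name": "Shop B", "revenue_thousands": 8.0, "zip_code": "6000",
     "monthly_customers": 600, "origin_country": "Colombia", "star_rating": 5},
    {"name": "Shop C", "revenue_thousands": 15.5, "zip_code": "90210",
     "monthly_customers": 1200, "origin_country": "Ethiopia", "star_rating": 4},
]

def summarize(field, records=shops):
    """Pick a summary that is meaningful for the field's data type."""
    values = [record[field] for record in records]
    kind = FIELD_TYPES[field]
    if kind in ("discrete", "continuous"):
        return mean(values)                      # arithmetic is meaningful
    if kind == "ordinal":
        return median_low(values)                # relies on order only
    return Counter(values).most_common(1)[0][0]  # most frequent label

for field in FIELD_TYPES:
    print(field, "->", summarize(field))
```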

Lecture 2 Summary — 5 Minute Revision

Structured = rows/columns (ML needs this). Unstructured = free-form (needs pre-processing). Quantitative = arithmetic is meaningful. Qualitative = categories, arithmetic is nonsensical. Quantitative splits into discrete (counted, integers only) and continuous (measured, decimals OK). Qualitative splits into nominal (no order, e.g. hair color) and ordinal (ordered but unequal gaps, e.g. rankings, Likert scales). Zip codes = qualitative nominal despite looking like numbers. Key test: can you meaningfully compute an average?

Practice Questions

Q1. Customer satisfaction measured as "Poor, Fair, Good, Excellent" is what type of data?

A. Quantitative continuous
B. Quantitative discrete
C. Qualitative ordinal
D. Qualitative nominal

Answer: C

There is a clear order (Poor < Fair < Good < Excellent) but the gaps between levels are not equal — the improvement from "Poor" to "Fair" may not be the same as from "Good" to "Excellent." Arithmetic is not meaningful (you can't compute "average satisfaction"), making it qualitative. Order present + unequal gaps = ordinal.

Q2. A zip code is classified as:

A. Quantitative discrete
B. Quantitative continuous
C. Qualitative nominal
D. Qualitative ordinal

Answer: C

Despite being represented as numbers, zip codes are qualitative nominal. Computing "the average zip code" or asking which zip code is "greater than" another is meaningless. No arithmetic and no natural ordering = nominal qualitative.

Q3. Which data type consists of distinct values that are always integers, typically obtained by counting?

A. Continuous
B. Nominal
C. Ordinal
D. Discrete

Answer: D

Discrete data consists of distinct, separate values that are counted (not measured). They are always integers — you cannot have 2.5 customers, a dice roll of 3.7, or 1.5 children. The key test: would a fractional value make sense? If no, it is discrete.

Q4. Revenue in thousands of dollars is:

A. Qualitative nominal
B. Qualitative ordinal
C. Quantitative discrete
D. Quantitative continuous

Answer: D

Revenue can take any value including decimals ($12,345.67) and arithmetic is fully meaningful (you can add revenues, find averages, compare differences). This makes it quantitative continuous. It is not discrete because it can have fractional values.

Q5. The key difference between nominal and ordinal data is:

A. Nominal data can be averaged; ordinal cannot
B. Ordinal data has a meaningful rank order; nominal data has no natural ordering
C. Nominal data is always numerical; ordinal is always text
D. Ordinal data allows arithmetic; nominal does not

Answer: B

Ordinal data has a meaningful rank order (1st > 2nd > 3rd, Excellent > Good > Poor) but the intervals between ranks are not equal. Nominal data has no ordering whatsoever — hair color "red" is not greater or less than "brown." Neither allows meaningful arithmetic.

Q6. Most statistical and ML models require which type of data?

A. Unstructured data
B. Only ordinal data
C. Structured (organized) data in row/column format
D. Nominal data

Answer: C

Most statistical and ML models were built with structured data in mind. They expect a row/column format where each row is one observation and each column is one feature. Unstructured data (text, audio, logs) must be pre-processed into structured form first.

Q7. A student rates a movie 3.5 out of 5 stars. This fractional rating suggests the rating scale is:

A. Qualitative nominal
B. Qualitative ordinal (if only whole stars) or quantitative if fractions are allowed
C. Always qualitative regardless of fraction
D. Quantitative discrete

Answer: B

Traditional star ratings (1-5 whole stars) are treated as qualitative ordinal — ordered but unequal gaps. However, if decimals/fractions are allowed (3.5 stars), it starts to behave more like quantitative data. In this course, star ratings are typically treated as ordinal. The key is context and whether arithmetic operations produce meaningful results.

Q8. Blood type (A, B, AB, O) is an example of:

A. Quantitative discrete
B. Quantitative continuous
C. Qualitative ordinal
D. Qualitative nominal

Answer: D

Blood type is qualitative nominal. There is no ordering between blood types (type A is not "more than" type O), and arithmetic is meaningless (the "average blood type" makes no sense). It is simply a categorical label with no mathematical properties.

Q9. Which of the following questions can ONLY be asked about quantitative data?

A. What are all the unique categories?
B. Which category appears most often?
C. What is the average value and standard deviation?
D. How many distinct values are there?

Answer: C

Average and standard deviation require meaningful arithmetic, which only applies to quantitative data. Questions A, B, and D can be asked about both quantitative and qualitative data. Asking for the "average hair color" or "standard deviation of zip codes" is nonsensical.

Q10. The number of customers visiting a coffee shop each day is:

A. Qualitative ordinal — customers can be ranked
B. Quantitative discrete — counted as whole people
C. Quantitative continuous — measured on a scale
D. Qualitative nominal — just labels for each day

Answer: B

Customer count is quantitative (arithmetic is meaningful: you can average, compare, sum) and discrete (you count whole people, so 2.5 customers is impossible). A daily count such as 237 customers is always a whole number.

Q11. Competition placing (1st, 2nd, 3rd) is what type of data?

A. Quantitative continuous
B. Quantitative discrete
C. Qualitative nominal
D. Qualitative ordinal

Answer: D

Competition placings are ordinal — 1st is better than 2nd is better than 3rd (there is a meaningful order). However, the gap between 1st and 2nd place may not be the same as between 2nd and 3rd (one winner might dominate while the others are close). Arithmetic is not meaningful (you cannot average 1st and 3rd place to get "2nd place equivalent performance").

Q12. An important reason why understanding data types matters for model selection is:

A. Some models run faster on certain data types
B. Using a model designed for quantitative data on nominal data produces nonsensical or wrong results
C. Data type determines how data is stored on disk
D. Only quantitative data can be used for any ML model

Answer: B

If you encode nominal categories as integers (blond=0, brown=1, red=2) and use a model that treats them as quantitative, the model will incorrectly assume that red (2) is twice brown (1), or that brown lies between blond and red. This creates false mathematical relationships. Wrong data type → wrong model → wrong results.
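
One widely used way to avoid that false ordering is one-hot encoding, which is not covered in the notes above and is shown here only as an illustration. The helper names `one_hot` and `differing_positions` are hypothetical; the category list follows the hair-color example.

```python
from itertools import combinations

# Category list follows the hair-color example in the answer above.
HAIR_COLORS = ["blond", "brown", "red"]

def one_hot(value, categories=HAIR_COLORS):
    """Return a 0/1 vector with a single 1 in the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]

def differing_positions(a, b):
    """Count the positions where two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))

# Integer codes imply blond-to-red (2) is a bigger step than blond-to-brown (1).
integer_codes = {color: code for code, color in enumerate(HAIR_COLORS)}

# One-hot vectors put every pair of distinct categories the same distance apart.
pair_distances = {
    (a, b): differing_positions(one_hot(a), one_hot(b))
    for a, b in combinations(HAIR_COLORS, 2)
}

print(one_hot("brown"))  # [0, 1, 0]
print(pair_distances)
```

With one column per category, no category is numerically "larger" than another, so a model has no built-in order to misread.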