Lecture 1: Introduction to Data Science

Data science is about using data to gain insights you would otherwise have missed — to make decisions, predict outcomes, understand patterns, and create new things. This lecture establishes all the foundational vocabulary you need for the rest of the course.

What is Data Science?

Data science combines three skills to extract knowledge from data:

Skill	What it means	Example
Hacking Skills	Programming ability to collect, process, and manipulate data	Writing R/Python code to clean a dataset
Math & Statistics	Mathematical understanding of models and probability	Knowing how linear regression works mathematically
Substantive Expertise	Domain knowledge of the field you are working in	Understanding what "blood pressure" means in a medical study

Data Science = center of all three. Missing any one makes you less effective or dangerous.

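The 7 Regions of the Venn Diagram

Three overlapping circles give seven regions: each skill on its own, three pairwise overlaps, and the center. The table lists the four labeled overlaps.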

Intersection	Label	Problem
Hacking + Math only	Machine Learning	Results can't be applied without domain knowledge
Math + Domain only	Traditional Research	Can't handle large-scale data without coding
Hacking + Domain only	Danger Zone	Builds things with no statistical rigor — dangerous conclusions
All three	Data Science	None — this is the goal

Why Data Science Now?

The volume of data being created is exploding (from 2 zettabytes in 2010 to a projected 2,142 zettabytes by 2035). Traditional methods fail because:

Volume is too large for humans to process manually in a reasonable time
Data comes in many formats and sources — often unorganized
Data can be missing, incomplete, or wrong
Data can be at very different scales, making comparison hard
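
The last point, data at very different scales, is commonly handled by rescaling each column before comparing them. A minimal sketch, assuming min-max scaling is an acceptable choice; the column names and values are made up for illustration and do not come from the lecture:

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: there is no spread to rescale, so map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical columns on very different scales.
ages = [25, 40, 55]
incomes = [30_000, 60_000, 90_000]

scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)
```

After scaling, both columns run from 0 to 1, so a difference in age and a difference in income can be compared on the same footing.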

Core Terminology

Data Science — Using data to gain new insights you would otherwise have missed. It involves collecting, cleaning, exploring, modeling, and communicating data to support decision-making and prediction.

Machine Learning (ML) — Giving computers the ability to learn from data without explicit rules being programmed. Instead of writing "if spam then..." rules, you show the model thousands of spam emails and it learns the pattern itself.
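
The contrast between a hand-written rule and a learned pattern can be sketched in a few lines. This is a toy word-count classifier, not a production spam filter; the training messages and the function names `rule_based`, `train`, and `predict` are hypothetical:

```python
from collections import Counter

# Hypothetical labeled examples (illustrative only).
training = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]

def rule_based(text):
    """Hand-written rule: the programmer decides what spam looks like."""
    return "spam" if "free money" in text else "ham"

def train(examples):
    """Count how often each word appears under each label."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in examples:
        counts[label].update(text.split())
    return counts

def predict(counts, text):
    """Label a message by the class in which its words were seen more often."""
    words = text.split()
    spam_score = sum(counts["spam"][w] for w in words)
    ham_score = sum(counts["ham"][w] for w in words)
    return "spam" if spam_score > ham_score else "ham"

model = train(training)
```

Here `rule_based("claim free prize")` returns "ham" because nobody wrote a rule for that phrasing, while `predict(model, "claim free prize")` returns "spam" based on word counts learned from the examples.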

Exploratory Data Analysis (EDA) — The process of exploring and cleaning data BEFORE building models. You visualize distributions, look for anomalies, handle missing values, and get a feel for the data. EDA always comes before modeling.
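
A first EDA pass over a single column might look like the sketch below. The data is made up, and the plausibility range of 0 to 120 for an age is an assumption chosen for illustration:

```python
import statistics

# Hypothetical raw column with a missing value (None) and a suspicious entry.
ages = [34, 29, None, 41, 38, 250, 31]

# 1. Count missing values.
missing = sum(1 for a in ages if a is None)

# 2. Summarize the observed values.
observed = [a for a in ages if a is not None]
summary = {
    "min": min(observed),
    "median": statistics.median(observed),
    "max": max(observed),
}

# 3. Flag anomalies using a plausibility range (an assumption, see above).
anomalies = [a for a in observed if not 0 <= a <= 120]

# 4. Keep only plausible values for later modeling.
cleaned = [a for a in observed if 0 <= a <= 120]
```

The maximum of 250 in `summary` is the kind of value that stands out immediately in a summary or a plot and would distort a model if left in.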

Data Mining — Finding relationships between elements of data. Example: discovering that customers who buy diapers also frequently buy beer (a famous retail data mining finding).
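
One simple way to surface such relationships is to count how often pairs of items appear together. The sketch below uses hypothetical shopping baskets; it illustrates the idea and is not the method behind the retail finding:

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "beer", "bread"},
]

pair_counts = Counter()
for basket in baskets:
    # sorted() gives every pair a consistent order, so each pair has one key.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The pair that co-occurs most often, and how many baskets contain it.
top_pair, top_count = pair_counts.most_common(1)[0]
```

With these baskets, `top_pair` is `("beer", "diapers")` and `top_count` is 3.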

Probabilistic Model — A model that uses probability to describe the relationship between variables, acknowledging uncertainty. Example: the probability that an email is spam given its words.
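
As a rough illustration, the probability that a message is spam given that it contains a particular word can be estimated from labeled counts. The messages and the helper `p_spam_given_word` are hypothetical, and a real spam model would combine evidence from many words:

```python
# Hypothetical labeled messages: (text, is_spam).
messages = [
    ("free money now", True),
    ("free tickets inside", True),
    ("free lunch on friday", False),
    ("project update", False),
]

def p_spam_given_word(word, data):
    """Estimate P(spam | word) as the share of messages containing `word` that are spam."""
    containing = [is_spam for text, is_spam in data if word in text.split()]
    if not containing:
        return None  # Word never seen: this data gives no estimate.
    return sum(containing) / len(containing)

p = p_spam_given_word("free", messages)
```

Three messages contain "free" and two of them are spam, so the estimate is 2/3: likely spam, but not certain, which is exactly the uncertainty a probabilistic model expresses.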

Statistical Model — A model that uses statistical theorems to formalize relationships in a mathematical formula. Example: Recruits = 0.5 × Spawners + 60 (the spawner-recruit model in biology).

Data Model — An organized, formal relationship between data elements meant to simulate a real-world phenomenon. Once built, you plug in known values to get predictions.
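
Plugging a known value into a model can be shown with the spawner-recruit formula from the Statistical Model entry. The function name and the input of 100 spawners are illustrative; the lecture does not state the units:

```python
def predict_recruits(spawners):
    """Spawner-recruit model from the text: Recruits = 0.5 * Spawners + 60."""
    return 0.5 * spawners + 60

# Hypothetical input: 100 spawners.
recruits = predict_recruits(100)
```

For 100 spawners the model predicts 0.5 × 100 + 60 = 110 recruits.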

Organized vs Unorganized Data

Organized (Structured) Data

Stored in rows and columns. Every row = one observation. Every column = one characteristic. ML models are built for this.

Examples: Spreadsheet of employee salaries, database of customer orders, scientific measurement tables.

Unorganized (Unstructured) Data

Free-form. No standard row/column hierarchy. Must be pre-processed before most ML models can use it.

Examples: Tweets, emails, server logs, audio files, genetic sequences (ACGTATTGCA).

Key fact: Most ML models need structured data. Unstructured data requires pre-processing to extract structured features before modeling.
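
One common form of this pre-processing is a bag-of-words table, where each text becomes a row and each word becomes a column. A minimal sketch with made-up tweets, assuming lowercasing and whitespace splitting are good enough as tokenization:

```python
from collections import Counter

# Hypothetical unstructured inputs.
tweets = [
    "Data science is fun",
    "Science needs data",
]

# Tokenize: lowercase and split on whitespace (a deliberately simple choice).
tokenized = [t.lower().split() for t in tweets]

# The vocabulary becomes the column names.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Each tweet becomes one row of word counts, one count per vocabulary word.
rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    rows.append([counts[word] for word in vocabulary])
```

The result has the row/column shape described above: `vocabulary` holds the column names and `rows` holds one observation per tweet.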

Programming Languages for Data Science

Language	Best for	Why
R	Statistics, visualization, data wrangling	Deepest statistical libraries; ggplot2 for visualization; preferred in academia
Python	General purpose, text processing	Regular expressions; wider range of applications; easier munging
MATLAB	Matrix operations	Fast and efficient for numerical computing
Java / C	Big Data systems	Speed and scalability at very large scale
Excel	Quick exploration	Bread-and-butter tool for initial data exploration

Why Domain Knowledge Is Essential

Without domain knowledge, ML algorithms remain just algorithms sitting on your computer. You need it to:

Interpret results correctly (does a 10% drug use rate make sense in this context?)
Generalize findings to the real world
Recognize when a result is nonsensical (a model predicting negative hospital stays)
Choose the right features and model type
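
Domain knowledge can often be written down as explicit sanity checks on model output. A small sketch for the hospital-stay example, with hypothetical predictions and a hypothetical helper name:

```python
def flag_implausible_stays(predicted_days):
    """Return the indexes of predictions that cannot be real hospital stays."""
    return [i for i, days in enumerate(predicted_days) if days < 0]

# Hypothetical model outputs, in days.
predictions = [3.2, -1.5, 0.0, 7.8]
bad = flag_implausible_stays(predictions)
```

The check itself is trivial; knowing that a stay cannot be negative, and that such a prediction signals a modeling problem, is the domain knowledge.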

Lecture 1 Summary — 5 Minute Revision

Data science = Hacking Skills + Math/Stats + Domain Expertise.
Goals: decisions, predictions, understanding, new products.
Machine learning = Hacking + Math only (needs domain to be applied).
Danger Zone = Hacking + Domain without Math (no rigor).
EDA always comes before modeling.
Structured data = rows/columns (ML needs this).
Unstructured = free-form (needs pre-processing).
R is preferred for stats/visualization.
Domain knowledge is essential: without it, ML results can't be applied.

Practice Questions

Q1. Data science is best described as the intersection of which three fields?

A. Programming, Biology, and Economics
B. Hacking Skills, Math & Statistics, and Substantive Expertise
C. Machine Learning, Deep Learning, and Neural Networks
D. Python, R, and SQL

Answer: B

The data science Venn diagram shows three circles: Hacking Skills, Math & Statistics, and Substantive Expertise. Data Science is at the center of all three. Without any one of them, you can't do proper data science — you end up in Machine Learning (no domain), Traditional Research (no coding), or the Danger Zone (no math).

Q2. Machine learning sits at the intersection of which two fields in the Venn diagram?

A. Hacking Skills and Substantive Expertise
B. Math & Statistics and Substantive Expertise
C. Hacking Skills and Math & Statistics
D. All three fields

Answer: C

Machine Learning is at the intersection of Hacking Skills and Math & Statistics — it requires coding ability and mathematical understanding. However, it lacks substantive (domain) expertise. Without domain knowledge, you cannot generalize ML results to real-world applications.

Q3. The "Danger Zone" in the data science Venn diagram refers to having:

A. Hacking Skills and Math/Statistics, but no domain expertise
B. Domain expertise and Math/Statistics, but no hacking skills
C. Hacking Skills and Domain Expertise, but no Math/Statistics
D. All three skills at a novice level

Answer: C

The Danger Zone is Hacking Skills + Domain Expertise without mathematical/statistical rigor. You can build things and understand the domain, but without statistical rigor, your conclusions may be fundamentally wrong and dangerously misleading.

Q4. Machine learning is best defined as:

A. Explicit programming of every decision rule
B. Giving computers the ability to learn from data without explicit rules
C. Manual statistical analysis of large datasets
D. Exclusively creating data visualizations

Answer: B

Machine learning gives computers the ability to learn from data without explicit rules being programmed. Instead of manually writing rules like "if the email contains 'free money' then it's spam," the ML algorithm learns these patterns automatically from thousands of examples.

Q5. Which of the following is an example of unstructured data?

A. A CSV spreadsheet of employee salaries
B. A database table of product orders
C. Twitter posts and email content
D. A scientific measurement table

Answer: C

Unstructured (unorganized) data is free-form and does not follow a standard row/column structure. Twitter posts and emails are free-text with no standard format. Spreadsheets, databases, and measurement tables are all structured (organized into rows and columns).

Q6. Exploratory Data Analysis (EDA) is best described as:

A. Building the final predictive model
B. Deploying models to production systems
C. Exploring, visualizing, and cleaning data to gain insights BEFORE modeling
D. Conducting hypothesis tests on already-cleaned data

Answer: C

EDA is the process of preparing data and gaining quick insights BEFORE building models. It involves data visualization, cleaning (handling missing values, fixing errors), and identifying key features. EDA always comes before modeling — feeding unvisualized, uncleaned data to a model is asking for trouble.

Q7. Data mining is best described as:

A. Extracting data from raw text files
B. The process of finding relationships between elements of data
C. Removing incorrect records from a dataset
D. Training neural networks on large datasets

Answer: B

Data mining is the process of finding relationships between elements of data, for example discovering that TV advertising spending strongly correlates with sales. It is the part of data science where you look for patterns and relationships among variables.

Q8. Why must unstructured data be pre-processed before using most ML models?

A. Unstructured data is always inaccurate
B. Most ML models require a row/column structure and cannot work on free-form data directly
C. Structured data always has more records
D. Unstructured data is only useful for deep learning

Answer: B

Most statistical and ML models were built with structured (organized) data in mind. They expect a row/column format. Unstructured data like emails or tweets must be pre-processed (tokenized, cleaned, converted to numerical features) before they can be fed into standard ML models.

Q9. A data model is best described as:

A. A spreadsheet containing cleaned data
B. An organized, formal relationship between data elements that simulates a real-world phenomenon
C. A type of database management system
D. A visualization of data distributions

Answer: B

A data model is an organized and formal relationship between elements of data, usually meant to simulate a real-world phenomenon. Example: the spawner-recruit model (Recruits = 0.5 × Spawners + 60) formalizes the relationship between parent fish and new fish. Once built, you plug in values to get predictions.

Q10. Why is R preferred for data science in this course over other languages?

A. R is the only language that can handle big data
B. R has the deepest statistical libraries, excellent data wrangling packages, and the famous ggplot2 visualization package
C. R runs faster than Python and Java on all tasks
D. R is exclusively used for database management

Answer: B

R is preferred because it has the deepest available statistical libraries, it provides excellent data wrangling packages, it has ggplot2 for aesthetically excellent visualizations, and it contains ML packages for boosting, random forests, regression, and classification. For big data systems, Java/C is preferred; for matrix operations, MATLAB. But for statistics and visualization, R leads.

Q11. The 4 main goals of data science are:

A. Collect, store, process, and delete data
B. Make decisions, predict the future, understand the past/present, and create new industries/products
C. Visualize, model, deploy, and monitor
D. Clean, transform, analyze, and report

Answer: B

According to the lecture, data science uses data to: (1) Make decisions, (2) Predict the future, (3) Understand the past/present, and (4) Create new industries and products. All four are distinct use cases. For example: predicting flu outbreaks (predict future), understanding what drives sales (understand present), deciding which advertisement to show (make decisions).

Q12. Which statement best explains why domain knowledge cannot be replaced by ML algorithms alone?

A. ML algorithms require domain knowledge to run
B. Without domain knowledge, ML results cannot be interpreted, generalized, or applied to the real world
C. Domain knowledge makes algorithms run faster
D. ML without domain knowledge always produces wrong results

Answer: B

Without domain knowledge, ML algorithms remain just algorithms on your computer. You need domain expertise to: interpret what results mean in context, determine whether conclusions make sense (e.g. a negative hospital stay prediction is nonsensical), choose appropriate features, and generalize findings to the real world. A financial analyst working on stock data needs financial knowledge; a journalist needs journalism context.