Data science is about using data to gain insights you would otherwise have missed — to make decisions, predict outcomes, understand patterns, and create new things. This lecture establishes all the foundational vocabulary you need for the rest of the course.

What is Data Science?

Data science combines three skills to extract knowledge from data:

Skill | What it means | Example
Hacking Skills | Programming ability to collect, process, and manipulate data | Writing R/Python code to clean a dataset
Math & Statistics | Mathematical understanding of models and probability | Knowing how linear regression works mathematically
Substantive Expertise | Domain knowledge of the field you are working in | Understanding what "blood pressure" means in a medical study

Data Science = center of all three. Missing any one makes you less effective or dangerous.

The 7 Regions of the Venn Diagram

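Three overlapping circles create seven regions: each skill on its own, the three pairwise overlaps, and the center where all three meet. The table below covers the overlaps.

Intersection | Label | Problem
Hacking + Math only | Machine Learning | Results can't be applied without domain knowledge
Math + Domain only | Traditional Research | Can't handle large-scale data without coding
Hacking + Domain only | Danger Zone | Builds things with no statistical rigor, which leads to dangerous conclusions
All three | Data Science | None; this is the goal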

Why Data Science Now?

The volume of data being created is exploding (from 2 zettabytes in 2010 to a projected 2,142 zettabytes by 2035). Traditional methods fail at this scale.

Core Terminology

Data Science — Using data to gain new insights you would otherwise have missed. It involves collecting, cleaning, exploring, modeling, and communicating data to support decision-making and prediction.
Machine Learning (ML) — Giving computers the ability to learn from data without explicit rules being programmed. Instead of writing "if spam then..." rules, you show the model thousands of spam emails and it learns the pattern itself.
Exploratory Data Analysis (EDA) — The process of exploring and cleaning data BEFORE building models. You visualize distributions, look for anomalies, handle missing values, and get a feel for the data. EDA always comes before modeling.
Data Mining — Finding relationships between elements of data. Example: discovering that customers who buy diapers also frequently buy beer (a famous retail data mining finding).
Probabilistic Model — A model that uses probability to describe the relationship between variables, acknowledging uncertainty. Example: the probability that an email is spam given its words.
Statistical Model — A model that uses statistical theorems to formalize relationships in a mathematical formula. Example: Recruits = 0.5 × Spawners + 60 (the spawner-recruit model in biology).
Data Model — An organized, formal relationship between data elements meant to simulate a real-world phenomenon. Once built, you plug in known values to get predictions.
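
As a minimal sketch of how a model is used once it is built, the spawner-recruit formula above can be written as a small Python function. The function name predict_recruits and the input value of 100 are illustrative assumptions; only the coefficients 0.5 and 60 come from the formula in the text.

```python
def predict_recruits(spawners: float) -> float:
    """Apply the formula from the text: Recruits = 0.5 * Spawners + 60."""
    return 0.5 * spawners + 60


# Plug in a known value (100 spawners is an arbitrary illustrative input).
print(predict_recruits(100))  # 110.0
```

The point is the workflow rather than the numbers: the relationship is fixed in a formula, and predictions come from plugging in known values.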

Organized vs Unorganized Data

Organized (Structured) Data

Stored in rows and columns. Every row = one observation. Every column = one characteristic. ML models are built for this.

Examples: Spreadsheet of employee salaries, database of customer orders, scientific measurement tables.
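
A rough sketch of what "rows and columns" looks like in code, together with the kind of quick check done during EDA. The employee records, the column names, and both helper functions are hypothetical and exist only for illustration.

```python
# Hypothetical structured data: each dict is one row (observation),
# each key is one column (characteristic). None marks a missing value.
employees = [
    {"name": "Ana", "department": "Sales", "salary": 52000},
    {"name": "Ben", "department": "Sales", "salary": None},
    {"name": "Chen", "department": "IT", "salary": 61000},
    {"name": "Dee", "department": "IT", "salary": 58000},
]


def missing_count(rows, column):
    """Count rows where the given column has no value."""
    return sum(1 for row in rows if row[column] is None)


def column_mean(rows, column):
    """Average a numeric column, skipping missing values."""
    values = [row[column] for row in rows if row[column] is not None]
    return sum(values) / len(values)


print(missing_count(employees, "salary"))  # 1
print(column_mean(employees, "salary"))    # 57000.0
```

Because every row has the same columns, checks like these can be written once and applied to any column by name.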

Unorganized (Unstructured) Data

Free-form. No standard row/column hierarchy. Must be pre-processed before most ML models can use it.

Examples: Tweets, emails, server logs, audio files, genetic sequences (ACGTATTGCA).

Key fact: Most ML models need structured data. Unstructured data requires pre-processing to extract structured features before modeling.
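
One common way to extract structured features from text is to count words. The sketch below assumes a tiny hand-picked vocabulary and three invented messages; the function names tokenize and to_feature_rows are hypothetical. It is one possible pre-processing approach, not the only one.

```python
import re
from collections import Counter

# Hypothetical free-form texts (unstructured data).
tweets = [
    "Free money!!! Click now",
    "Lunch at noon?",
    "free FREE tickets, click here",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def to_feature_rows(texts, vocabulary):
    """Turn each text into one row of word counts, one column per vocabulary word."""
    rows = []
    for text in texts:
        counts = Counter(tokenize(text))
        rows.append([counts[word] for word in vocabulary])
    return rows


vocabulary = ["free", "click", "lunch"]  # columns chosen by hand for illustration
for row in to_feature_rows(tweets, vocabulary):
    print(row)
# [1, 1, 0]
# [0, 0, 1]
# [2, 1, 0]
```

Each tweet becomes one row and each vocabulary word becomes one column, which is the row/column shape most ML models expect.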

Programming Languages for Data Science

Language | Best for | Why
R | Statistics, visualization, data wrangling | Deepest statistical libraries; ggplot2 for visualization; preferred in academia
Python | General purpose, text processing | Regular expressions; wider range of applications; easier munging
Matlab | Matrix operations | Fast and efficient for numerical computing
Java / C | Big Data systems | Speed and scalability at very large scale
Excel | Quick exploration | Bread-and-butter tool for initial data exploration

Why Domain Knowledge Is Essential

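Without domain knowledge, ML algorithms remain just algorithms sitting on your computer. You need it to interpret what results mean in context, check whether conclusions make sense (a negative hospital stay prediction, for example, is nonsensical), choose appropriate features, and generalize findings to the real world.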

Lecture 1 Summary — 5 Minute Revision

Data science = Hacking Skills + Math/Stats + Domain Expertise. Goals: decisions, predictions, understanding, new products. Machine learning = Hacking + Math only (needs domain to be applied). Danger Zone = Hacking + Domain without Math (no rigor). EDA always comes before modeling. Structured data = rows/columns (ML needs this). Unstructured = free-form (needs pre-processing). R is preferred for stats/visualization. Domain knowledge is essential — without it, ML results can't be applied.

Practice Questions

Q1. Data science is best described as the intersection of which three fields?

  • A. Programming, Biology, and Economics
  • B. Hacking Skills, Math & Statistics, and Substantive Expertise
  • C. Machine Learning, Deep Learning, and Neural Networks
  • D. Python, R, and SQL

Answer: B

The data science Venn diagram shows three circles: Hacking Skills, Math & Statistics, and Substantive Expertise. Data Science is at the center of all three. Without any one of them, you can't do proper data science — you end up in Machine Learning (no domain), Traditional Research (no coding), or the Danger Zone (no math).

Q2. Machine learning sits at the intersection of which two fields in the Venn diagram?

  • A. Hacking Skills and Substantive Expertise
  • B. Math & Statistics and Substantive Expertise
  • C. Hacking Skills and Math & Statistics
  • D. All three fields

Answer: C

Machine Learning is at the intersection of Hacking Skills and Math & Statistics — it requires coding ability and mathematical understanding. However, it lacks substantive (domain) expertise. Without domain knowledge, you cannot generalize ML results to real-world applications.

Q3. The "Danger Zone" in the data science Venn diagram refers to having:

  • A. Hacking Skills and Math/Statistics, but no domain expertise
  • B. Domain expertise and Math/Statistics, but no hacking skills
  • C. Hacking Skills and Domain Expertise, but no Math/Statistics
  • D. All three skills at a novice level

Answer: C

The Danger Zone is Hacking Skills + Domain Expertise without mathematical/statistical rigor. You can build things and understand the domain, but without statistical rigor, your conclusions may be fundamentally wrong and dangerously misleading.

Q4. Machine learning is best defined as:

  • A. Explicit programming of every decision rule
  • B. Giving computers the ability to learn from data without explicit rules
  • C. Manual statistical analysis of large datasets
  • D. Exclusively creating data visualizations

Answer: B

Machine learning gives computers the ability to learn from data without explicit rules being programmed. Instead of manually writing rules like "if the email contains 'free money' then it's spam," the ML algorithm learns these patterns automatically from thousands of examples.
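
The contrast between a hand-written rule and a learned pattern can be sketched as follows. This is only an illustration under stated assumptions: the labeled messages are invented, the function names are hypothetical, and estimating the spam rate of a word by simple counting is a deliberately crude stand-in for a real ML algorithm.

```python
# Hypothetical labeled examples: (message text, is_spam).
examples = [
    ("free money now", True),
    ("free tickets inside", True),
    ("meeting moved to noon", False),
    ("free lunch at the meeting", False),
]


def hand_written_rule(message):
    """Explicit rule: flag any message that contains 'free money'."""
    return "free money" in message.lower()


def learn_spam_rate(labeled, word):
    """Estimate P(spam | word) as the spam fraction among messages containing the word."""
    labels = [is_spam for text, is_spam in labeled if word in text.lower().split()]
    if not labels:
        return None  # the word never appeared, so nothing was learned about it
    return sum(labels) / len(labels)


print(hand_written_rule("Free tickets inside"))  # False: the rule misses this message
print(learn_spam_rate(examples, "free"))         # 2 of the 3 messages with "free" are spam
print(learn_spam_rate(examples, "meeting"))      # 0.0
```

The rule only catches the exact phrase its author thought of, while the counted estimate changes whenever the examples change, which is the sense in which the pattern is learned from data.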

Q5. Which of the following is an example of unstructured data?

  • A. A CSV spreadsheet of employee salaries
  • B. A database table of product orders
  • C. Twitter posts and email content
  • D. A scientific measurement table

Answer: C

Unstructured (unorganized) data is free-form and does not follow a standard row/column structure. Twitter posts and emails are free-text with no standard format. Spreadsheets, databases, and measurement tables are all structured (organized into rows and columns).

Q6. Exploratory Data Analysis (EDA) is best described as:

  • A. Building the final predictive model
  • B. Deploying models to production systems
  • C. Exploring, visualizing, and cleaning data to gain insights BEFORE modeling
  • D. Conducting hypothesis tests on already-cleaned data

Answer: C

EDA is the process of preparing data and gaining quick insights BEFORE building models. It involves data visualization, cleaning (handling missing values, fixing errors), and identifying key features. EDA always comes before modeling — feeding unvisualized, uncleaned data to a model is asking for trouble.

Q7. Data mining is best described as:

  • A. Extracting data from raw text files
  • B. The process of finding relationships between elements of data
  • C. Removing incorrect records from a dataset
  • D. Training neural networks on large datasets

Answer: B

Data mining is the process of finding relationships between elements of data. For example, discovering that TV advertising spending strongly correlates with sales. It is the part of data science where you try to find patterns and relationships in variables.

Q8. Why must unstructured data be pre-processed before using most ML models?

  • A. Unstructured data is always inaccurate
  • B. Most ML models require a row/column structure and cannot work on free-form data directly
  • C. Structured data always has more records
  • D. Unstructured data is only useful for deep learning

Answer: B

Most statistical and ML models were built with structured (organized) data in mind. They expect a row/column format. Unstructured data like emails or tweets must be pre-processed (tokenized, cleaned, converted to numerical features) before they can be fed into standard ML models.

Q9. A data model is best described as:

  • A. A spreadsheet containing cleaned data
  • B. An organized, formal relationship between data elements that simulates a real-world phenomenon
  • C. A type of database management system
  • D. A visualization of data distributions

Answer: B

A data model is an organized and formal relationship between elements of data, usually meant to simulate a real-world phenomenon. Example: the spawner-recruit model (Recruits = 0.5 × Spawners + 60) formalizes the relationship between parent fish and new fish. Once built, you plug in values to get predictions.

Q10. Why is R preferred for data science in this course over other languages?

  • A. R is the only language that can handle big data
  • B. R has the deepest statistical libraries, excellent data wrangling packages, and the famous ggplot2 visualization package
  • C. R runs faster than Python and Java on all tasks
  • D. R is exclusively used for database management

Answer: B

R is preferred because: it has the deepest available statistical libraries, it provides excellent data wrangling packages, it has ggplot2 for aesthetically excellent visualizations, and it contains ML packages for boosting, random forests, regression, and classification. For big data systems, Java/C is preferred. For matrix operations, Matlab. But for statistics and visualization, R leads.

Q11. The 4 main goals of data science are:

  • A. Collect, store, process, and delete data
  • B. Make decisions, predict the future, understand the past/present, and create new industries/products
  • C. Visualize, model, deploy, and monitor
  • D. Clean, transform, analyze, and report

Answer: B

According to the lecture, data science uses data to: (1) Make decisions, (2) Predict the future, (3) Understand the past/present, and (4) Create new industries and products. All four are distinct use cases. For example: predicting flu outbreaks (predict future), understanding what drives sales (understand present), deciding which advertisement to show (make decisions).

Q12. Which statement best explains why domain knowledge cannot be replaced by ML algorithms alone?

  • A. ML algorithms require domain knowledge to run
  • B. Without domain knowledge, ML results cannot be interpreted, generalized, or applied to the real world
  • C. Domain knowledge makes algorithms run faster
  • D. ML without domain knowledge always produces wrong results

Answer: B

Without domain knowledge, ML algorithms remain just algorithms on your computer. You need domain expertise to: interpret what results mean in context, determine whether conclusions make sense (e.g. a negative hospital stay prediction is nonsensical), choose appropriate features, and generalize findings to the real world. A financial analyst working on stock data needs financial knowledge; a journalist needs journalism context.