What data science is, why it matters, core terminology, data types, and tools
Data science combines three skills to extract knowledge from data:
| Skill | What it means | Example |
|---|---|---|
| Hacking Skills | Programming ability to collect, process, and manipulate data | Writing R/Python code to clean a dataset |
| Math & Statistics | Mathematical understanding of models and probability | Knowing how linear regression works mathematically |
| Substantive Expertise | Domain knowledge of the field you are working in | Understanding what "blood pressure" means in a medical study |
Data Science = the center, where all three overlap. Missing any one of them makes you less effective or, worse, dangerous.
| Intersection | Label | Problem |
|---|---|---|
| Hacking + Math only | Machine Learning | Results can't be applied without domain knowledge |
| Math + Domain only | Traditional Research | Can't handle large-scale data without coding |
| Hacking + Domain only | Danger Zone | Builds things with no statistical rigor — dangerous conclusions |
| All three | Data Science | None — this is the goal |
The volume of data being created is exploding (from 2 zettabytes in 2010 to a projected 2,142 zettabytes by 2035). Traditional methods fail because:
Structured (organized) data is stored in rows and columns. Every row = one observation. Every column = one characteristic. Most ML models are built for this format.
Examples: Spreadsheet of employee salaries, database of customer orders, scientific measurement tables.
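As a minimal sketch of the row/column idea, the snippet below reads a small salary table with Python's standard `csv` module. The names, departments, and salary figures are made up for illustration and do not come from the notes.

```python
import csv
import io

# Hypothetical salary table; all names and figures are invented for illustration.
raw = """name,department,salary
Ana,Engineering,95000
Ben,Marketing,70000
Chloe,Engineering,105000
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Each row is one observation (here, one employee).
first_observation = rows[0]

# Each column is one characteristic, read across all observations.
salaries = [int(row["salary"]) for row in rows]

print(first_observation)
print(salaries)
```

Because every observation has the same named characteristics, a whole column can be pulled out in one line, which is what most models rely on.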
Unstructured (unorganized) data is free-form, with no standard row/column hierarchy. It must be pre-processed before most ML models can use it.
Examples: Tweets, emails, server logs, audio files, genetic sequences (ACGTATTGCA).
| Language | Best for | Why |
|---|---|---|
| R | Statistics, visualization, data wrangling | Deepest statistical libraries; ggplot2 for visualization; preferred in academia |
| Python | General purpose, text processing | Regular expressions; wider range of applications; easier munging |
| MATLAB | Matrix operations | Fast and efficient for numerical computing |
| Java / C | Big Data systems | Speed and scalability at very large scale |
| Excel | Quick exploration | Bread-and-butter tool for initial data exploration |
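Without domain knowledge, ML algorithms remain just algorithms sitting on your computer. You need it to interpret what results mean in context, judge whether conclusions make sense, choose appropriate features, and generalize findings to the real world.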
Data science = Hacking Skills + Math/Stats + Domain Expertise. Goals: decisions, predictions, understanding, new products. Machine learning = Hacking + Math only (needs domain to be applied). Danger Zone = Hacking + Domain without Math (no rigor). EDA always comes before modeling. Structured data = rows/columns (ML needs this). Unstructured = free-form (needs pre-processing). R is preferred for stats/visualization. Domain knowledge is essential — without it, ML results can't be applied.
Q1. Data science is best described as the intersection of which three fields?
Answer: B
The data science Venn diagram shows three circles: Hacking Skills, Math & Statistics, and Substantive Expertise. Data Science is at the center of all three. Without any one of them, you can't do proper data science — you end up in Machine Learning (no domain), Traditional Research (no coding), or the Danger Zone (no math).
Q2. Machine learning sits at the intersection of which two fields in the Venn diagram?
Answer: C
Machine Learning is at the intersection of Hacking Skills and Math & Statistics — it requires coding ability and mathematical understanding. However, it lacks substantive (domain) expertise. Without domain knowledge, you cannot generalize ML results to real-world applications.
Q3. The "Danger Zone" in the data science Venn diagram refers to having:
Answer: C
The Danger Zone is Hacking Skills + Domain Expertise without mathematical/statistical rigor. You can build things and understand the domain, but without statistical rigor, your conclusions may be fundamentally wrong and dangerously misleading.
Q4. Machine learning is best defined as:
Answer: B
Machine learning gives computers the ability to learn from data without explicit rules being programmed. Instead of manually writing rules like "if the email contains 'free money' then it's spam," the ML algorithm learns these patterns automatically from thousands of examples.
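To make the contrast concrete, here is a toy sketch that puts a hand-written rule next to a very simple learned one. The four training messages are made up, and the "learning" step (keep words seen only in spam) is a deliberately crude stand-in for a real ML algorithm, chosen only to show that the patterns come from examples rather than from the programmer.

```python
from collections import Counter

def rule_based_is_spam(text: str) -> bool:
    """Hand-written rule: a person decided which phrase signals spam."""
    return "free money" in text.lower()

def learn_spam_words(examples: list[tuple[str, bool]]) -> set[str]:
    """Learn spam-indicating words from labeled examples.

    A word is kept if it appears in spam examples and never in non-spam ones.
    """
    spam_counts, ham_counts = Counter(), Counter()
    for text, is_spam in examples:
        target = spam_counts if is_spam else ham_counts
        target.update(set(text.lower().split()))
    return {w for w in spam_counts if ham_counts[w] == 0}

def learned_is_spam(text: str, spam_words: set[str]) -> bool:
    """Flag a message if it contains any learned spam word."""
    return any(w in spam_words for w in text.lower().split())

# Tiny made-up training set (real systems learn from thousands of examples).
training = [
    ("claim your free money now", True),
    ("win a prize claim it today", True),
    ("meeting moved to friday", False),
    ("lunch today at noon", False),
]
spam_words = learn_spam_words(training)

message = "you could win a prize"
print(rule_based_is_spam(message))           # the fixed rule misses this message
print(learned_is_spam(message, spam_words))  # the learned words catch it
```

The hand-written rule only knows the one phrase its author thought of, while the learned version picks up "win" and "prize" from the examples without anyone writing a rule for them.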
Q5. Which of the following is an example of unstructured data?
Answer: C
Unstructured (unorganized) data is free-form and does not follow a standard row/column structure. Twitter posts and emails are free-text with no standard format. Spreadsheets, databases, and measurement tables are all structured (organized into rows and columns).
Q6. Exploratory Data Analysis (EDA) is best described as:
Answer: C
EDA is the process of preparing data and gaining quick insights BEFORE building models. It involves data visualization, cleaning (handling missing values, fixing errors), and identifying key features. EDA always comes before modeling — feeding unvisualized, uncleaned data to a model is asking for trouble.
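A minimal sketch of a first EDA pass, using only the Python standard library. The `ages` list is hypothetical, and treating `-1` as a data-entry error is an assumption made for this example.

```python
from statistics import mean, median

# Hypothetical raw measurements; None marks a missing value, -1 a data-entry error.
ages = [34, 29, None, 41, -1, 38, None, 45]

# Step 1: count what is missing before changing anything.
missing = sum(1 for a in ages if a is None)

# Step 2: clean by dropping missing values and impossible (negative) ages.
clean = [a for a in ages if a is not None and a >= 0]

# Step 3: quick summary statistics for a first look at the data.
summary = {
    "n_raw": len(ages),
    "n_missing": missing,
    "n_clean": len(clean),
    "min": min(clean),
    "max": max(clean),
    "mean": mean(clean),
    "median": median(clean),
}
print(summary)
```

Dropping bad values is only one possible cleaning choice; whether to drop, impute, or investigate them depends on the dataset.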
Q7. Data mining is best described as:
Answer: B
Data mining is the process of finding relationships between elements of data. For example, discovering that TV advertising spending strongly correlates with sales. It is the part of data science where you try to find patterns and relationships in variables.
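One common way to quantify such a relationship is the Pearson correlation coefficient. The sketch below computes it by hand; the TV spending and sales figures are invented for illustration, and the notes do not say which correlation measure the lecture used.

```python
from math import sqrt

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical figures: TV advertising spend and sales for five periods.
tv_spend = [10, 20, 30, 40, 50]
sales = [12, 19, 33, 38, 52]

r = pearson_r(tv_spend, sales)
print(round(r, 3))  # a value close to 1 indicates a strong positive relationship
```

A high correlation shows that the two variables move together; it does not by itself show that advertising causes the sales.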
Q8. Why must unstructured data be pre-processed before using most ML models?
Answer: B
Most statistical and ML models were built with structured (organized) data in mind. They expect a row/column format. Unstructured data like emails or tweets must be pre-processed (tokenized, cleaned, converted to numerical features) before they can be fed into standard ML models.
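The three steps named above (tokenize, clean, convert to numerical features) can be sketched with a simple bag-of-words count. The two messages are made up, and bag-of-words is just one possible encoding, picked here because it is easy to follow.

```python
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Clean and tokenize: lowercase the text and keep only word-like tokens."""
    return re.findall(r"[a-z']+", text.lower())

# Hypothetical unstructured inputs.
docs = [
    "Free money!!! Claim your FREE prize now",
    "Meeting moved to Friday, see you then",
]

tokens = [tokenize(d) for d in docs]

# The vocabulary defines the columns of the structured table.
vocab = sorted({t for doc in tokens for t in doc})

def to_vector(doc_tokens: list[str], vocab: list[str]) -> list[int]:
    """Convert one document into a row of word counts, one column per vocab word."""
    counts = Counter(doc_tokens)
    return [counts[w] for w in vocab]

# One row per document, one column per word: now it is structured data.
matrix = [to_vector(doc, vocab) for doc in tokens]

print(vocab)
for row in matrix:
    print(row)
```

After this step each message is a fixed-length row of numbers, which is the row/column shape that standard models expect.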
Q9. A data model is best described as:
Answer: B
A data model is an organized and formal relationship between elements of data, usually meant to simulate a real-world phenomenon. Example: the spawner-recruit model (Recruits = 0.5 × Spawners + 60) formalizes the relationship between parent fish and new fish. Once built, you plug in values to get predictions.
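The "plug in values" step looks like this for the spawner-recruit formula quoted above. The function name and the input values are illustrative; the notes do not state the units.

```python
def predict_recruits(spawners: float) -> float:
    """Spawner-recruit model from the text: Recruits = 0.5 * Spawners + 60."""
    return 0.5 * spawners + 60

# Plug in a few hypothetical spawner counts to get predictions.
for spawners in (0, 100, 200):
    print(spawners, predict_recruits(spawners))
```

For example, 100 spawners gives 0.5 × 100 + 60 = 110 recruits.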
Q10. Why is R preferred for data science in this course over other languages?
Answer: B
R is preferred because it has the deepest available statistical libraries, provides excellent data wrangling packages, offers ggplot2 for polished visualizations, and includes ML packages for boosting, random forests, regression, and classification. For big data systems, Java/C is preferred; for matrix operations, MATLAB. But for statistics and visualization, R leads.
Q11. The 4 main goals of data science are:
Answer: B
According to the lecture, data science uses data to: (1) Make decisions, (2) Predict the future, (3) Understand the past/present, and (4) Create new industries and products. All four are distinct use cases. For example: predicting flu outbreaks (predict future), understanding what drives sales (understand present), deciding which advertisement to show (make decisions).
Q12. Which statement best explains why domain knowledge cannot be replaced by ML algorithms alone?
Answer: B
Without domain knowledge, ML algorithms remain just algorithms on your computer. You need domain expertise to: interpret what results mean in context, determine whether conclusions make sense (e.g. a negative hospital stay prediction is nonsensical), choose appropriate features, and generalize findings to the real world. A financial analyst working on stock data needs financial knowledge; a journalist needs journalism context.
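The hospital-stay example can be turned into a small sanity check. This sketch assumes the model outputs a predicted length of stay in days; the function name and the prediction values are hypothetical.

```python
def flag_implausible_stays(predicted_days: list[float]) -> list[int]:
    """Return indices of predictions that break a domain rule: a stay cannot be negative."""
    return [i for i, days in enumerate(predicted_days) if days < 0]

# Hypothetical model outputs (days in hospital).
predictions = [3.2, -1.5, 0.0, 7.8]

bad = flag_implausible_stays(predictions)
print(bad)  # positions of predictions a domain expert would reject
```

The algorithm itself has no way of knowing that a negative stay is impossible; the rule comes from someone who understands the domain.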