Data scientists spend the MAJORITY of their time on this step. Clean, well-prepared data is the foundation of any good analysis. "Garbage in, garbage out" — if you feed a model bad data, you will get bad results, no matter how sophisticated the model.

What is Data Wrangling?

Data Munging / Data Wrangling — The art of acquiring data and preparing it for analysis. Includes collecting data from sources, converting formats, cleaning errors, handling missing values, and making the data suitable for modeling.

Languages and Data Formats

Languages for Data Science

RStatistics, visualization, deepest libraries
PythonGeneral purpose, regex, easier text munging
MatlabFast matrix operations
Java / CBig Data systems requiring speed
ExcelQuick exploration and basic analysis

Standard Data Formats

CSVTables like spreadsheets
XMLStructured but non-tabular data
JSONAPIs (JavaScript Object Notation)
SQLMultiple related tables in a database

Sources of Data

SourceDescriptionKey note
ProprietaryInternal company data (Facebook, Google, Amazon)Outside access usually impossible; may release rate-limited APIs
GovernmentOpen data portals (data.gov.au)Privacy is the main barrier to release
AcademicPublished datasets, GitHub repositoriesData availability now required for many publications
Web scrapingStripping text/data from webpagesCheck terms of service first; legal limits apply
Sensor / IoTDevices like GPS, accelerometers, health trackersBuild logging systems early; storage is cheap
CrowdsourcingAmazon Mechanical Turk and similar platformsPay people to label/annotate data at scale
Sweat equityManual data entry from paper/PDF recordsOften unavoidable for historical data

Cleaning Data: Common Problems

Data Error — Information that is fundamentally LOST in data acquisition. The data simply does not exist or was never captured correctly. Cannot be recovered — only handled (e.g., treat as missing).
Artifact — A systematic problem caused by PROCESSING done to the data. The underlying real-world fact exists, but the data representation is wrong due to a processing step. CAN be fixed by cleaning.

Real Example: PubMed Artifact

PubMed (medical literature database) started storing author first names in 2002. Before 2002, "G. Wang" and "Guanjin Wang" were treated as different authors. When first names were added in 2002, thousands of previously-listed authors suddenly appeared to be "new" authors, creating a huge spike in that year. This is an artifact — a processing decision created a false pattern in the data. Data cleaning removes such artifacts.

Data Compatibility Issues

Dealing with Missing Data

Missing values appear as NaN (Not a Number), NULL, or blank spaces. Most ML tools cannot handle them and will crash or produce unreliable results.

Option 1: Discard

Remove rows (samples) or columns (features) with missing values.

Disadvantage: If too many records are removed, the remaining data may be too small for reliable analysis. If feature columns are removed, the classifier may lose the information it needs to distinguish between classes.

When OK: When you have plenty of data and missing values are rare.

Option 2: Impute

Fill in estimated values instead of discarding records.

  • Mean imputation — replace with column mean; keeps mean unchanged
  • Random imputation — replace with randomly selected values; permits statistical evaluation of imputation impact
  • Regression imputation — predict missing values from other features using a regression model
  • KNN imputation — find K nearest neighbors, impute from their non-missing values
Setting missing values to ZERO is generally WRONG. If someone's death year is missing (they're still alive), setting it to 0 means they died in year 0. Zero is only correct if zero is a meaningful value in context. Imputation is almost always better than zero-filling.

Outlier Detection

Outliers are data points that are unusually far from the rest of the data. They can represent genuine extreme values or measurement errors.

Do NOT blindly delete outliers

Sometimes outliers represent real, important data. Example: a bank detects a very large credit card transaction as an outlier. But the transaction is legitimate — it is a rare high-value purchase. By NOT removing the outlier, you gain insight into customer spending behavior that is valuable for marketing or risk assessment. Always investigate before deleting.

However, if a dinosaur bone measurement is 50% larger than all others and was likely a transcription error (two digits transposed), removing it improves model quality.

Imbalanced Data EXAM HOT

Imbalanced data occurs when one class has far more examples than another.

Example: 1,000,000 emails in a dataset; only 30 are spam. A classifier that ALWAYS predicts "not spam" achieves 99.997% accuracy — but has ZERO ability to detect actual spam. It is completely useless for its purpose.

Accuracy is NOT a good metric for imbalanced class problems. Better metrics: precision, recall, F1-score.

Ways to fix imbalanced data:

MethodHow it works
Find more minority examplesActively seek out more positive class data
Discard majority examplesRandomly remove records from the dominant class
Weight the minority classMake misclassifying minority examples more costly (but beware overfitting)
Replicate minority examplesDuplicate minority records, ideally with small random variations (SMOTE)

Lecture 4 Summary — 5 Minute Revision

Data wrangling = acquiring + preparing data. Errors = lost data (unrecoverable). Artifacts = fixable processing problems. Never set missing values to zero. Missing data: discard (lose records) or impute (estimate). Imputation methods: mean (keeps mean unchanged), random, regression, KNN. Outliers: investigate before deleting — they may be real. Imbalanced data: accuracy is misleading (99.997% accuracy can still be useless). Always fix imbalance before training. Main data formats: CSV (tables), JSON (APIs), XML (structured), SQL (relational databases).

Practice Questions

Q1. What is the primary problem with setting missing values to zero?

  • A. It increases computation time significantly
  • B. Zero is usually not a meaningful value for missing data and introduces false information
  • C. Most programming languages don't support zero values in datasets
  • D. It only works for quantitative data, not qualitative
Show Answer

Answer: B

Setting missing values to zero introduces false data. A living person's "year of death" being set to 0 implies they died in year 0. The number of sales for a product that wasn't yet released set to 0 implies zero sales rather than unknown. Zero is only appropriate when zero is actually a meaningful value in context. Imputation methods are almost always better.

Q2. Which imputation method replaces missing values with the average of existing values in that column?

  • A. Random imputation
  • B. KNN imputation
  • C. Mean imputation
  • D. Regression imputation
Show Answer

Answer: C

Mean imputation replaces each missing value with the mean of the entire feature column. Its key property is that it leaves the column mean UNCHANGED. Disadvantage: it reduces variance and may distort relationships. KNN imputation uses nearest neighbors to estimate missing values. Regression imputation predicts missing values from other features.

Q3. The key difference between a data error and a data artifact is:

  • A. Errors are larger; artifacts are smaller deviations
  • B. An error is information fundamentally lost; an artifact is a fixable systematic processing problem
  • C. Errors affect structured data; artifacts affect unstructured data
  • D. There is no meaningful difference between them
Show Answer

Answer: B

A data ERROR is information fundamentally lost in data acquisition — it cannot be recovered, only handled as missing. A data ARTIFACT is a systematic problem arising from processing done to the data — it can be identified and corrected by cleaning. The PubMed first-names example is an artifact: the real data exists, but the processing decision created a false spike that cleaning can remove.

Q4. A spam filter is trained on 1,000,000 emails (999,970 non-spam, 30 spam). A classifier that always predicts "not spam" achieves what accuracy, and is it useful?

  • A. 0.003% accuracy — completely useless
  • B. 50% accuracy — moderately useful
  • C. 99.997% accuracy — useless because it never detects spam
  • D. 99.997% accuracy — highly useful for spam detection
Show Answer

Answer: C

Always predicting "not spam" correctly labels 999,970 out of 1,000,000 emails = 99.997% accuracy. But it NEVER identifies a single spam email. Accuracy is completely misleading for imbalanced datasets. The correct metrics are precision (of emails flagged as spam, how many are actually spam?), recall (of all spam emails, how many did we catch?), and F1-score.

Q5. Which statement about outlier removal is most accurate?

  • A. All outliers should always be removed before model training
  • B. Outliers should never be removed under any circumstances
  • C. Outliers should be carefully examined — they may be errors or real important data points
  • D. Outliers are only relevant for regression models
Show Answer

Answer: C

Outliers require careful examination. Sometimes they are genuine errors (a dinosaur bone 50% larger than all others — likely a transcription error) and removing them improves the model. Other times they are legitimate extreme values (a large credit card transaction that reveals real customer behavior). Always investigate: is this an error or a real event?

Q6. KNN imputation works by:

  • A. Replacing missing values with the column mean
  • B. Finding the K nearest neighbors to the record with missing values, then using their non-missing values to estimate the missing ones
  • C. Randomly selecting values from the distribution of that feature
  • D. Building a separate regression model for each missing column
Show Answer

Answer: B

KNN imputation finds the K most similar records (nearest neighbors) to the record with a missing value, based on the non-missing features. It then imputes the missing value from the corresponding values in those K neighbors. This is more sophisticated than mean imputation because it considers the similarity between records, producing more realistic estimates.

Q7. Which data format is best suited for storing data from a web API?

  • A. CSV
  • B. XML
  • C. JSON
  • D. SQL
Show Answer

Answer: C

JSON (JavaScript Object Notation) is the standard format for web APIs. It is lightweight, human-readable, and natively supported by web browsers. CSV is for flat tabular data (spreadsheets). XML is for structured non-tabular data. SQL databases store multiple related tables with relationships between them.

Q8. NASA's Mars Climate Orbiter exploded in 1999 because of a data issue. What type of problem was it?

  • A. Missing data — critical navigation values were not collected
  • B. Data compatibility — one team used metric units, another used English units
  • C. Outlier contamination — extreme sensor readings corrupted the trajectory
  • D. Sampling bias — the navigation algorithm was tested on unrepresentative data
Show Answer

Answer: B

This is a data compatibility / unit conversion problem. One engineering team provided thruster data in metric units (Newton-seconds) while another team's software expected English units (pound-force seconds). This is the type of issue data cleaning must catch — always verify that data from different sources uses consistent units and representations. "Apple to apple" comparisons require careful unit unification.

Q9. Crowdsourcing for data collection means:

  • A. Scraping data from multiple competitor websites
  • B. Using government open data portals
  • C. Building a participatory dataset using a large group of people, often through platforms like Amazon Mechanical Turk
  • D. Extracting data from sensor networks
Show Answer

Answer: C

Crowdsourced data collection is a participatory method of building a dataset with the help of a large group of people. Platforms like Amazon Mechanical Turk (MTurk) allow researchers to cheaply outsource labeling, annotation, and simple tasks to many people. This is valuable for creating labeled training data for ML models.

Q10. Which is the best strategy when you have heavily imbalanced classes?

  • A. Use accuracy as the evaluation metric since it's the most intuitive
  • B. Do nothing — most models handle imbalance automatically
  • C. Address the imbalance (resampling, weighting) and use precision/recall/F1 as metrics
  • D. Delete all examples from the minority class to simplify the problem
Show Answer

Answer: C

For imbalanced data: (1) Fix the imbalance using techniques like resampling (oversample minority, undersample majority), weighting, or replication with perturbation. (2) Use metrics other than accuracy: precision (what fraction of predicted positives are correct?), recall (what fraction of actual positives did we catch?), F1-score (harmonic mean of precision and recall).

Q11. The main disadvantage of discarding records with missing values is:

  • A. It is computationally expensive
  • B. It may remove too many records, making reliable analysis impossible, or lose valuable feature information
  • C. It is not possible in R or Python
  • D. It only works for numeric features
Show Answer

Answer: B

If too many records have missing values and you discard all of them, you may not have enough data left for reliable analysis. If you discard entire feature columns because they have some missing values, you lose information that the model needs to distinguish between classes. This is why imputation is often preferred over simple discarding.

Q12. Accessing proprietary data from companies like Facebook or Google is:

  • A. Easy through standard APIs available to all researchers
  • B. Usually impossible for outsiders; companies may provide rate-limited APIs at best
  • C. Only possible for government researchers
  • D. Always freely available as long as you cite the source
Show Answer

Answer: B

Proprietary data sources like Facebook, Google, and Amazon contain exciting datasets, but getting outside access is usually impossible. At best, companies sometimes release rate-limited APIs (like the Twitter/X API) that allow limited programmatic access. Researchers typically must work with publicly available datasets, government data, or academic datasets instead.