Acquiring and preparing data, cleaning, missing values, outliers, imbalanced data
| Language | Best for |
|---|---|
| R | Statistics, visualization, deepest libraries |
| Python | General purpose, regex, easier text munging |
| Matlab | Fast matrix operations |
| Java / C | Big Data systems requiring speed |
| Excel | Quick exploration and basic analysis |

| Format | Best for |
|---|---|
| CSV | Tables like spreadsheets |
| XML | Structured but non-tabular data |
| JSON | APIs (JavaScript Object Notation) |
| SQL | Multiple related tables in a database |

| Source | Description | Key note |
|---|---|---|
| Proprietary | Internal company data (Facebook, Google, Amazon) | Outside access usually impossible; may release rate-limited APIs |
| Government | Open data portals (data.gov.au) | Privacy is the main barrier to release |
| Academic | Published datasets, GitHub repositories | Data availability now required for many publications |
| Web scraping | Stripping text/data from webpages | Check terms of service first; legal limits apply |
| Sensor / IoT | Devices like GPS, accelerometers, health trackers | Build logging systems early; storage is cheap |
| Crowdsourcing | Amazon Mechanical Turk and similar platforms | Pay people to label/annotate data at scale |
| Sweat equity | Manual data entry from paper/PDF records | Often unavoidable for historical data |
PubMed, the medical literature database, began storing authors' full first names in 2002; earlier records carry only initials. A naive count therefore treats "G. Wang" (pre-2002) and "Guanjin Wang" (2002 onward) as two different people, so thousands of established authors appear to be "new" in 2002 and that year shows a huge spike. The spike is an artifact: a processing decision, not a real change, created a false pattern in the data. Data cleaning removes such artifacts.
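The spike and its removal can be reproduced on a toy example. The sketch below assumes hypothetical records of (year, author name) and uses one possible cleaning choice, reducing every name to last name plus first initial. This is not PubMed's actual method, and in real data it can merge distinct people who share an initial.

```python
from collections import Counter

# Hypothetical toy records: (publication year, author name as stored).
records = [
    (2000, "G. Wang"),
    (2001, "G. Wang"),
    (2001, "A. Smith"),
    (2002, "Guanjin Wang"),
    (2002, "Alice Smith"),
    (2002, "Bo Chen"),
]

def normalize(name):
    """Reduce a name to 'Last, F' so both spellings map to one key."""
    parts = name.replace(".", "").split()
    first, last = parts[0], parts[-1]
    return f"{last}, {first[0].upper()}"

def new_authors_per_year(rows, key=lambda name: name):
    """Count authors whose key is seen for the first time in each year."""
    seen = set()
    counts = Counter()
    for year, name in sorted(rows):
        k = key(name)
        if k not in seen:
            seen.add(k)
            counts[year] += 1
    return dict(counts)

raw_counts = new_authors_per_year(records)                   # spike in 2002
clean_counts = new_authors_per_year(records, key=normalize)  # spike removed
```

On the raw names, 2002 shows three "new" authors; after normalization only one author in 2002 is genuinely new.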
Missing values appear as NaN (Not a Number), NULL, or blank spaces. Most ML tools cannot handle them and will crash or produce unreliable results.
Remove rows (samples) or columns (features) with missing values.
Disadvantage: If too many records are removed, the remaining data may be too small for reliable analysis. If feature columns are removed, the classifier may lose the information it needs to distinguish between classes.
When OK: When you have plenty of data and missing values are rare.
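A minimal sketch of both discard strategies, assuming a hypothetical table stored as a list of dicts with `None` marking a missing value. The column names are invented for illustration.

```python
# Hypothetical toy table; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "label": "yes"},
    {"age": None, "income": 48000, "label": "no"},
    {"age": 29, "income": None, "label": "no"},
    {"age": 41, "income": 61000, "label": "yes"},
]

def drop_incomplete_rows(table):
    """Keep only rows in which every field has a value."""
    return [r for r in table if all(v is not None for v in r.values())]

def drop_incomplete_columns(table):
    """Keep only columns that have a value in every row."""
    complete = [c for c in table[0] if all(r[c] is not None for r in table)]
    return [{c: r[c] for c in complete} for r in table]

kept_rows = drop_incomplete_rows(rows)
kept_cols = drop_incomplete_columns(rows)
fraction_lost = 1 - len(kept_rows) / len(rows)
```

Here dropping rows loses half the records, and dropping columns leaves only `label`, so neither feature survives. Checking `fraction_lost` before committing to deletion is a cheap way to see whether discarding is affordable.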
Fill in estimated values instead of discarding records.
Outliers are data points that are unusually far from the rest of the data. They can represent genuine extreme values or measurement errors.
Sometimes outliers are real, important data. Example: a bank flags a very large credit card transaction as an outlier, but the transaction is legitimate: a rare high-value purchase. Keeping the outlier gives insight into customer spending behavior that is valuable for marketing or risk assessment. Always investigate before deleting.
However, if a dinosaur bone measurement is 50% larger than all others and was likely a transcription error (two digits transposed), removing it improves model quality.
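One way to follow the "investigate before deleting" advice is to flag suspicious points for review rather than drop them automatically. The sketch below uses the modified z-score based on the median absolute deviation (MAD); the measurements are hypothetical, and the 0.6745 scaling constant and 3.5 cutoff are a common convention, not something the notes prescribe.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return (index, value) pairs whose modified z-score exceeds threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against; nothing can be flagged
    flagged = []
    for i, v in enumerate(values):
        score = 0.6745 * abs(v - med) / mad
        if score > threshold:
            flagged.append((i, v))
    return flagged

# Hypothetical bone lengths in cm; 150 is roughly 50% larger than the rest.
lengths = [102, 98, 105, 101, 99, 150, 103]
suspects = flag_outliers(lengths)
```

The function only reports candidates. Whether a flagged value is a transcription error or a real extreme is still a judgment made by looking at the record.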
Imbalanced data occurs when one class has far more examples than another.
Example: 1,000,000 emails in a dataset; only 30 are spam. A classifier that ALWAYS predicts "not spam" achieves 99.997% accuracy — but has ZERO ability to detect actual spam. It is completely useless for its purpose.
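The arithmetic behind that figure, worked through with the counts from the example:

```python
total_emails = 1_000_000
spam_emails = 30
ham_emails = total_emails - spam_emails

# The "always not spam" classifier is right on every non-spam email
# and wrong on every spam email.
correct = ham_emails
accuracy = correct / total_emails
spam_caught = 0

print(f"accuracy: {accuracy:.3%}")
print(f"spam caught: {spam_caught} of {spam_emails}")
```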
| Method | How it works |
|---|---|
| Find more minority examples | Actively seek out more positive class data |
| Discard majority examples | Randomly remove records from the dominant class |
| Weight the minority class | Make misclassifying minority examples more costly (but beware overfitting) |
| Replicate minority examples | Duplicate minority records, ideally with small random variations; SMOTE goes further and synthesizes new points by interpolating between neighboring minority examples |
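Two of the methods in the table can be sketched with the standard library alone. The data and function names below are hypothetical, and the second function is plain replication with Gaussian jitter, not an implementation of SMOTE.

```python
import random

def undersample_majority(majority, minority, seed=0):
    """Randomly keep only as many majority records as there are minority ones."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)

def oversample_with_jitter(minority, target_size, noise=0.01, seed=0):
    """Replicate minority records, adding small Gaussian noise to each copy."""
    rng = random.Random(seed)
    out = [list(x) for x in minority]  # originals are kept unchanged
    while len(out) < target_size:
        base = rng.choice(minority)
        out.append([v + rng.gauss(0, noise) for v in base])
    return out

# Hypothetical toy data: each record is a list of numeric features.
majority = [[float(i), float(i % 7)] for i in range(1000)]
minority = [[5.0, 1.0], [6.0, 2.0], [7.0, 1.5]]

small_majority, same_minority = undersample_majority(majority, minority)
big_minority = oversample_with_jitter(minority, target_size=300)
```

Resampling of this kind is normally applied to the training split only, so that evaluation still reflects the real class proportions.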
Data wrangling = acquiring + preparing data. Errors = lost data (unrecoverable). Artifacts = fixable processing problems. Do not set missing values to zero unless zero is a meaningful value in context. Missing data: discard (lose records) or impute (estimate). Imputation methods: mean (keeps the column mean unchanged), random, regression, KNN. Outliers: investigate before deleting, because they may be real. Imbalanced data: accuracy is misleading (99.997% accuracy can still be useless), so address the imbalance before training and report precision, recall, and F1. Main data formats: CSV (tables), JSON (APIs), XML (structured, non-tabular), SQL (relational databases).
Q1. What is the primary problem with setting missing values to zero?
Answer: B
Setting missing values to zero introduces false data. A living person's "year of death" being set to 0 implies they died in year 0. The number of sales for a product that wasn't yet released set to 0 implies zero sales rather than unknown. Zero is only appropriate when zero is actually a meaningful value in context. Imputation methods are almost always better.
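A small sketch of the distortion, using hypothetical monthly sales in which the first two months predate the product's release:

```python
from statistics import mean

# Hypothetical monthly sales; None means the product was not yet released.
sales = [None, None, 120, 135, 150]

zero_filled = [0 if s is None else s for s in sales]
observed = [s for s in sales if s is not None]

mean_with_zeros = mean(zero_filled)  # treats "unknown" as "sold nothing"
mean_observed = mean(observed)       # uses only months with real data
```

Zero-filling drags the average from 135 down to 81, even though no month with real data had sales below 120.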
Q2. Which imputation method replaces missing values with the average of existing values in that column?
Answer: C
Mean imputation replaces each missing value with the mean of the entire feature column. Its key property is that it leaves the column mean UNCHANGED. Disadvantage: it reduces variance and may distort relationships. KNN imputation uses nearest neighbors to estimate missing values. Regression imputation predicts missing values from other features.
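Both properties, the unchanged mean and the reduced variance, can be checked on a toy column. This is a minimal sketch with made-up values, using `None` for missing entries.

```python
from statistics import mean, pvariance

def impute_mean(column):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]

column = [10.0, None, 14.0, None, 18.0]
observed = [v for v in column if v is not None]
filled = impute_mean(column)

mean_before, mean_after = mean(observed), mean(filled)
var_before, var_after = pvariance(observed), pvariance(filled)
```

The mean stays at 14.0 before and after, while the population variance drops from about 10.67 to 6.4, because every imputed value sits exactly at the center of the distribution.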
Q3. The key difference between a data error and a data artifact is:
Answer: B
A data ERROR is information fundamentally lost in data acquisition — it cannot be recovered, only handled as missing. A data ARTIFACT is a systematic problem arising from processing done to the data — it can be identified and corrected by cleaning. The PubMed first-names example is an artifact: the real data exists, but the processing decision created a false spike that cleaning can remove.
Q4. A spam filter is trained on 1,000,000 emails (999,970 non-spam, 30 spam). A classifier that always predicts "not spam" achieves what accuracy, and is it useful?
Answer: C
Always predicting "not spam" correctly labels 999,970 out of 1,000,000 emails = 99.997% accuracy. But it NEVER identifies a single spam email. Accuracy alone is misleading for imbalanced datasets. More informative metrics are precision (of emails flagged as spam, how many are actually spam?), recall (of all spam emails, how many did we catch?), and F1-score (the harmonic mean of the two).
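The three metrics can be computed directly from confusion-matrix counts. In the sketch below, the second set of counts describes a hypothetical filter invented for illustration, and reporting precision as 0 when nothing is flagged is a convention chosen here (the ratio is strictly undefined).

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# "Always not spam" on the 1,000,000-email example: nothing is ever flagged,
# so all 30 spam emails are false negatives.
baseline = precision_recall_f1(tp=0, fp=0, fn=30)

# Hypothetical filter that flags 40 emails, 24 of them real spam.
filter_scores = precision_recall_f1(tp=24, fp=16, fn=6)
```

The baseline scores zero on all three metrics despite its 99.997% accuracy, while the hypothetical filter reaches a precision of 0.6 and a recall of 0.8.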
Q5. Which statement about outlier removal is most accurate?
Answer: C
Outliers require careful examination. Sometimes they are genuine errors (a dinosaur bone 50% larger than all others — likely a transcription error) and removing them improves the model. Other times they are legitimate extreme values (a large credit card transaction that reveals real customer behavior). Always investigate: is this an error or a real event?
Q6. KNN imputation works by:
Answer: B
KNN imputation finds the K most similar records (nearest neighbors) to the record with a missing value, based on the non-missing features. It then imputes the missing value from the corresponding values in those K neighbors. This is more sophisticated than mean imputation because it considers the similarity between records, producing more realistic estimates.
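The procedure can be sketched in plain Python. The version below makes several simplifying assumptions: features are numeric, neighbors are drawn only from rows with no missing values, distance is Euclidean over the features the target record actually has, and the imputed value is the neighbors' average. The data is hypothetical.

```python
import math

def knn_impute(rows, target, k=2):
    """Fill the None entries of rows[target] from its k nearest complete rows."""
    record = rows[target]
    known = [i for i, v in enumerate(record) if v is not None]
    missing = [i for i, v in enumerate(record) if v is None]

    # Candidate neighbors: other rows with no missing values at all.
    candidates = [r for j, r in enumerate(rows)
                  if j != target and all(v is not None for v in r)]

    def distance(other):
        # Euclidean distance over the features the target actually has.
        return math.sqrt(sum((record[i] - other[i]) ** 2 for i in known))

    neighbors = sorted(candidates, key=distance)[:k]
    filled = list(record)
    for i in missing:
        filled[i] = sum(n[i] for n in neighbors) / len(neighbors)
    return filled

# Hypothetical toy data: [height_cm, weight_kg, shoe_size]
rows = [
    [170.0, 65.0, 42.0],
    [172.0, 68.0, 43.0],
    [150.0, 45.0, 36.0],
    [171.0, 66.0, None],
]
imputed = knn_impute(rows, target=3, k=2)
```

The missing shoe size is filled with 42.5, the average of the two most similar people, whereas the column mean would have been about 40.3. In practice features are usually rescaled first so that no single column dominates the distance.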
Q7. Which data format is best suited for storing data from a web API?
Answer: C
JSON (JavaScript Object Notation) is the standard format for web APIs. It is lightweight, human-readable, and natively supported by web browsers. CSV is for flat tabular data (spreadsheets). XML is for structured non-tabular data. SQL databases store multiple related tables with relationships between them.
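A short sketch of why the two formats suit different shapes of data: a nested JSON document is parsed with Python's standard `json` module and then flattened into CSV-like rows. The payload and its field names are invented for illustration and do not come from any real API.

```python
import json

# Hypothetical API response body; the field names are invented.
payload = '''
{
  "user": "alice",
  "orders": [
    {"id": 1, "total": 19.99},
    {"id": 2, "total": 5.00}
  ]
}
'''

data = json.loads(payload)

# Flatten the nested orders into tabular rows, one per order.
rows = [
    {"user": data["user"], "order_id": o["id"], "total": o["total"]}
    for o in data["orders"]
]
```

The nesting (one user, many orders) is natural in JSON, while the flat version has to repeat the user on every row.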
Q8. NASA's Mars Climate Orbiter was lost in 1999 because of a data issue. What type of problem was it?
Answer: B
This is a data compatibility / unit conversion problem. One team's ground software reported thruster impulse in English (US customary) units (pound-force seconds), while the navigation software that consumed those values expected metric units (newton-seconds). This is the type of issue data cleaning must catch: always verify that data from different sources uses consistent units and representations. "Apples to apples" comparisons require careful unit unification.
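One defensive pattern is to carry the unit alongside every value and convert to a single unit at the point of ingestion. The sketch below is an illustration of that idea with hypothetical readings and unit labels, not a description of any mission software; the conversion factor follows from 1 lbf = 4.4482216152605 N.

```python
LBF_S_TO_N_S = 4.4482216152605  # newton-seconds per pound-force second

def to_newton_seconds(value, unit):
    """Convert an impulse value to newton-seconds, rejecting unknown units."""
    if unit == "N*s":
        return value
    if unit == "lbf*s":
        return value * LBF_S_TO_N_S
    raise ValueError(f"unknown impulse unit: {unit!r}")

# Hypothetical readings from two sources that report different units.
readings = [(10.0, "N*s"), (10.0, "lbf*s")]
in_si = [to_newton_seconds(v, u) for v, u in readings]
```

The same number, 10.0, means about 44.5 newton-seconds when it arrives in pound-force seconds, a factor of roughly 4.45 that stays invisible if the unit label is dropped. Raising on an unrecognized unit forces the mismatch to surface at load time.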
Q9. Crowdsourcing for data collection means:
Answer: C
Crowdsourced data collection is a participatory method of building a dataset with the help of a large group of people. Platforms like Amazon Mechanical Turk (MTurk) allow researchers to cheaply outsource labeling, annotation, and simple tasks to many people. This is valuable for creating labeled training data for ML models.
Q10. Which is the best strategy when you have heavily imbalanced classes?
Answer: C
For imbalanced data: (1) Fix the imbalance using techniques like resampling (oversample minority, undersample majority), weighting, or replication with perturbation. (2) Use metrics other than accuracy: precision (what fraction of predicted positives are correct?), recall (what fraction of actual positives did we catch?), F1-score (harmonic mean of precision and recall).
Q11. The main disadvantage of discarding records with missing values is:
Answer: B
If too many records have missing values and you discard all of them, you may not have enough data left for reliable analysis. If you discard entire feature columns because they have some missing values, you lose information that the model needs to distinguish between classes. This is why imputation is often preferred over simple discarding.
Q12. Accessing proprietary data from companies like Facebook or Google is:
Answer: B
Proprietary data sources like Facebook, Google, and Amazon contain exciting datasets, but getting outside access is usually impossible. At best, companies sometimes release rate-limited APIs (like the Twitter/X API) that allow limited programmatic access. Researchers typically must work with publicly available datasets, government data, or academic datasets instead.