Concept:
Data cleaning is a key preprocessing step in data science that makes a dataset accurate, consistent, and suitable for analysis. Real-world data often contains errors, duplicates, and missing values, all of which can distort analysis results and degrade model performance.
Step 1: {\color{red}What is Data Cleaning?}
Data cleaning involves:
- Removing duplicates
- Fixing formatting errors
- Handling missing or inconsistent data
Together, these steps make the data more reliable, so that later analysis reflects the underlying observations rather than recording errors.
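As a minimal sketch of the first two tasks, the example below normalizes text formatting and then removes duplicate records using only the Python standard library. The records, the field names, and the helpers `normalize` and `clean` are hypothetical and exist only for illustration. Formatting is fixed first because " Alice " and "alice" only become recognizable as duplicates once they are written the same way.

```python
# Hypothetical raw records: stray whitespace and inconsistent casing.
raw_rows = [
    {"name": " Alice ", "city": "paris"},
    {"name": "alice", "city": "Paris"},
    {"name": "Bob", "city": "BERLIN"},
]

def normalize(row):
    """Trim whitespace and apply consistent casing to every text field."""
    return {key: value.strip().title() for key, value in row.items()}

def clean(rows):
    """Normalize formatting first, then drop exact duplicates, keeping order."""
    seen = set()
    cleaned = []
    for row in map(normalize, rows):
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned

cleaned_rows = clean(raw_rows)
print(cleaned_rows)
```

With a data-frame library, the same idea is usually expressed with built-in operations (for example, pandas provides `DataFrame.drop_duplicates`).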
Step 2: {\color{red}Understanding Missing Values}
Missing data may occur due to:
- Data entry errors
- Sensor failures
- Incomplete surveys
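Handling them correctly is essential to avoid biased analysis. For example, if survey respondents with high incomes tend to skip the income question, dropping their rows understates the average income.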
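Before choosing a handling method, it usually helps to measure how much is missing and where. The sketch below counts missing values per column, assuming that missing entries are represented as `None`. The records and the helper `missing_report` are hypothetical.

```python
# Hypothetical survey records; None marks a missing value.
records = [
    {"age": 34, "income": 52000, "city": "Paris"},
    {"age": None, "income": 61000, "city": None},
    {"age": 29, "income": None, "city": "Berlin"},
    {"age": 41, "income": None, "city": "Madrid"},
]

def missing_report(rows):
    """Return {column: (missing_count, missing_fraction)} for each column."""
    report = {}
    for col in rows[0].keys():
        n_missing = sum(1 for row in rows if row[col] is None)
        report[col] = (n_missing, n_missing / len(rows))
    return report

report = missing_report(records)
for col, (count, frac) in report.items():
    print(f"{col}: {count} missing ({frac:.0%})")
```

In pandas, a comparable per-column count can be obtained with `df.isna().sum()`.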
Step 3: {\color{red}Method 1 — Deletion}
Remove rows or columns that contain missing values:
- Useful when only a small fraction of the data is missing
- Risky when it discards a large portion of the data, since it shrinks the sample and can bias the results
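The sketch below illustrates both forms of deletion and the risk noted above, again assuming `None` marks a missing value. The records, the helper names, and the 0.5 threshold are illustrative assumptions, not recommended settings.

```python
# Hypothetical records; the "notes" column is mostly empty.
records = [
    {"age": 34, "income": 52000, "notes": None},
    {"age": None, "income": 61000, "notes": None},
    {"age": 29, "income": 48000, "notes": "ok"},
    {"age": 41, "income": 58000, "notes": None},
]

def drop_sparse_columns(rows, max_missing_frac):
    """Drop columns whose share of missing values exceeds max_missing_frac."""
    keep = [
        col for col in rows[0]
        if sum(row[col] is None for row in rows) / len(rows) <= max_missing_frac
    ]
    return [{col: row[col] for col in keep} for row in rows]

def drop_incomplete_rows(rows):
    """Drop every row that contains at least one missing value."""
    return [row for row in rows if all(v is not None for v in row.values())]

# Deleting rows directly discards most of the data because of one sparse column.
naive = drop_incomplete_rows(records)

# Dropping the sparse column first preserves more rows.
# 0.5 is an illustrative threshold, not a recommended value.
without_sparse = drop_sparse_columns(records, max_missing_frac=0.5)
complete = drop_incomplete_rows(without_sparse)

print(f"rows only: kept {len(naive)} of {len(records)} rows")
print(f"columns then rows: kept {len(complete)} of {len(records)} rows")
```

In pandas, row and column deletion are both available through `DataFrame.dropna` (via its `axis` parameter).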
Step 4: {\color{red}Method 2 — Imputation}
Fill missing values using statistical measures:
- Mean (numerical data with a roughly symmetric distribution)
- Median (numerical data with outliers or skew, since it is robust to extreme values)
- Mode (categorical data)
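A minimal sketch of the three strategies using Python's `statistics` module is shown below. The helper `impute` and the sample values are hypothetical. The `incomes` list contains one extreme value so that the difference between mean and median imputation is visible.

```python
from statistics import mean, median, mode

def impute(values, strategy):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

ages = [30, 32, None, 34]                    # roughly symmetric
incomes = [40, 42, 44, 400, None]            # 400 is an outlier
cities = ["Paris", None, "Berlin", "Paris"]  # categorical

print(impute(ages, "mean"))
print(impute(incomes, "median"))  # fill value stays near the typical incomes
print(impute(incomes, "mean"))    # fill value is pulled upward by the outlier
print(impute(cities, "mode"))
```

The fill value is computed from the observed entries only. In pandas, the same pattern is commonly written with `Series.fillna`.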
Step 5: {\color{red}Method 3 — Advanced Techniques}
More sophisticated approaches include:
- Predictive modeling (e.g., regression on the other columns)
- Interpolation for time-series data
- K-nearest neighbors (KNN) or other machine-learning-based imputation
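Two of these techniques can be sketched in a few lines of plain Python. The version below is a simplified illustration, not a production implementation: `interpolate_linear` assumes evenly spaced observations, and `knn_impute` assumes the non-target features are complete and on comparable scales. Both helper names and all sample data are hypothetical.

```python
import math

def interpolate_linear(series):
    """Fill interior None gaps by linear interpolation between known neighbors.

    Assumes evenly spaced observations; leading and trailing gaps stay None.
    """
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled

def knn_impute(rows, target, k):
    """Fill missing row[target] with the mean target of the k nearest complete rows.

    Distance is Euclidean over the non-target features.
    """
    features = [c for c in rows[0] if c != target]
    donors = [r for r in rows if r[target] is not None]
    result = []
    for row in rows:
        if row[target] is not None:
            result.append(dict(row))
            continue
        point = [row[f] for f in features]
        nearest = sorted(
            donors, key=lambda d: math.dist(point, [d[f] for f in features])
        )[:k]
        estimate = sum(d[target] for d in nearest) / len(nearest)
        result.append({**row, target: estimate})
    return result

temperatures = [10.0, None, None, 16.0, None]
smoothed = interpolate_linear(temperatures)
print(smoothed)

people = [
    {"height": 150, "weight": 50},
    {"height": 160, "weight": 60},
    {"height": 180, "weight": 80},
    {"height": 162, "weight": None},
]
imputed = knn_impute(people, target="weight", k=2)
print(imputed[-1])
```

Library implementations handle many more cases. For example, pandas offers `Series.interpolate` and scikit-learn offers `KNNImputer`.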
Step 6: {\color{red}Choosing the Right Method}
The strategy depends on:
- Amount of missing data
- Data type and distribution
- Impact on analysis goals