🔹 Missing Values Handling in Machine Learning

Cleaning Incomplete Data for Reliable Models

Missing values occur when no data is stored for a variable in an observation. Handling them properly is important because many machine learning algorithms cannot accept missing values as input, and careless handling can distort the results.

🔎 Why Missing Values Occur

  • Data entry errors
  • Sensor failure
  • Survey non-response
  • Data corruption
  • Optional fields left blank

🔹 Methods to Handle Missing Values

1️⃣ Remove Missing Data (Deletion Method)

✔ A. Remove Rows (Listwise Deletion)

Use this when only a few rows contain missing values.

Age  Salary
25  50000
NaN  60000
30  55000

Remove the second row.

  • ✅ Simple
  • ❌ Loses data
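As a minimal sketch of listwise deletion, assuming the data sits in a pandas DataFrame, the table above can be rebuilt and cleaned with `dropna`. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Rebuild the small Age / Salary table from the example above
df = pd.DataFrame({
    "Age": [25, np.nan, 30],
    "Salary": [50000, 60000, 55000],
})

# Drop every row that contains at least one missing value
rows_dropped = df.dropna(axis=0)

print(rows_dropped)
```

The second row (index 1) is removed, and its Salary value of 60000 is lost along with it, which is the data-loss cost noted above.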

✔ B. Remove Columns

Use this when a column has too many missing values to be useful (e.g., 70% missing).
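A possible way to apply a cutoff like this in pandas is sketched below. The 70% threshold, the `Bonus` column, and its values are assumptions made up for illustration; the right cutoff depends on the dataset.

```python
import numpy as np
import pandas as pd

# Illustrative data: "Bonus" is missing in 8 of 10 rows (80%)
df = pd.DataFrame({
    "Age": [25, 30, 41, 38, 29, 52, 47, 33, 36, 44],
    "Bonus": [np.nan, np.nan, 1000, np.nan, np.nan,
              np.nan, 500, np.nan, np.nan, np.nan],
})

max_missing = 0.7  # assumed cutoff: drop columns with 70% or more missing

# Share of missing values per column (0.0 to 1.0)
missing_share = df.isna().mean()

# Keep only the columns below the cutoff
cols_to_keep = missing_share[missing_share < max_missing].index
reduced = df[cols_to_keep]

print(missing_share)
print(reduced.columns.tolist())
```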

2️⃣ Mean / Median / Mode Imputation

Replace missing values with statistical measures.

  • Mean – Numerical data with an approximately normal distribution
  • Median – Numerical data with outliers
  • Mode – Categorical data

Example (Mode):
Gender
Male
Female
Male
NaN

Mode = Male → Replace NaN with Male
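The three strategies can be sketched in pandas with `fillna`. The sample values below are made up for illustration; the Salary column includes one deliberately extreme value to show why the median is the safer choice when outliers are present.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [25, np.nan, 30, 35],
    "Salary": [50000, 60000, np.nan, 1000000],  # 1000000 is an outlier
    "Gender": ["Male", "Female", "Male", np.nan],
})

filled = df.copy()

# Mean: average of the observed ages (25, 30, 35)
filled["Age"] = df["Age"].fillna(df["Age"].mean())

# Median: middle of the observed salaries, not pulled up by the outlier
filled["Salary"] = df["Salary"].fillna(df["Salary"].median())

# Mode: most frequent category; mode() returns a Series, so take the first entry
filled["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])

print(filled)
```

Here the missing Salary is filled with 60000 (the median), whereas the mean of the observed salaries would have been 370000.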

3️⃣ Forward Fill / Backward Fill

Mostly used in time-series data.

  • Forward Fill: Use previous value
  • Backward Fill: Use next value
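A minimal sketch of both fills on a short daily series, assuming pandas; the dates and readings are illustrative.

```python
import numpy as np
import pandas as pd

# Four daily readings with two gaps in the middle
readings = pd.Series(
    [20.0, np.nan, np.nan, 24.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

# Forward fill: carry the last observed value forward
forward = readings.ffill()

# Backward fill: pull the next observed value backward
backward = readings.bfill()

print(forward.tolist())
print(backward.tolist())
```

Forward fill leaves a leading gap unfilled and backward fill leaves a trailing gap unfilled, since there is no previous or next value to copy in those positions.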

4️⃣ Interpolation

Estimate missing values based on trends (mainly for time-series).

Example: Temperature missing between 20°C and 24°C → estimate 22°C.
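The temperature example can be reproduced with linear interpolation in pandas. This is only a sketch: linear interpolation assumes evenly spaced observations and a straight-line trend between the known points.

```python
import numpy as np
import pandas as pd

# One missing reading between 20°C and 24°C
temperature = pd.Series([20.0, np.nan, 24.0])

# Linear interpolation fills the gap with the midpoint of its neighbours
interpolated = temperature.interpolate(method="linear")

print(interpolated.tolist())
```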

5️⃣ Predictive Imputation (Advanced)

Use machine learning models to predict missing values.

  • K-Nearest Neighbors
  • Random Forest
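As one way to try the K-Nearest Neighbors approach, scikit-learn provides `KNNImputer`. The sketch below uses made-up Age and Salary values and an assumed `n_neighbors=2`; it is not a tuned setup.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Columns: Age, Salary. The second row has a missing Age.
X = np.array([
    [25.0, 50000.0],
    [np.nan, 52000.0],
    [30.0, 55000.0],
    [45.0, 90000.0],
])

# Fill each missing value with the average from the 2 most similar rows
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled)
```

The rows closest in Salary to the incomplete row are the first and third, so the missing Age is filled with the average of 25 and 30. Because the method is distance-based, features on very different scales are usually standardized first.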

6️⃣ Using Constant Value

Replace missing values with:

  • 0
  • "Unknown"
  • -1

Useful when the fact that a value is missing carries meaning in itself.
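A short sketch of constant-value filling in pandas, with a different constant per column. The column names and values are illustrative, and the extra indicator column is an optional assumption added here to preserve the "was missing" signal explicitly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "City": ["Pune", np.nan, "Delhi"],
    "Children": [2, np.nan, 0],
})

# Optional: record which rows were missing before filling
city_was_missing = df["City"].isna()

# Fill each column with its own constant
filled = df.fillna({"City": "Unknown", "Children": -1})
filled["City_missing"] = city_was_missing

print(filled)
```

A numeric placeholder such as 0 or -1 should be a value that cannot occur naturally in the column; otherwise the model cannot tell real values from placeholders.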

🔹 When to Use What?

Situation                          Best Method
Few rows with missing values       Remove rows
Column with many missing values    Remove column
Normal distribution                Mean
Outliers present                   Median
Categorical data                   Mode
Time series                        Forward fill / Interpolation
Large dataset                      Predictive imputation

📘 Simple Python Example

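import pandas as pd
from sklearn.impute import SimpleImputer

# Small example with one missing value in each column
data = pd.DataFrame({
    'Age': [25, None, 30],
    'Salary': [50000, 60000, None]
})

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy='mean')

# fit_transform returns a NumPy array, so rebuild a DataFrame to keep the column names
data_imputed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)

print(data_imputed)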

🔹 Important Tip (Exam / Interview)

  • Check the percentage of missing data
  • Understand why the data is missing
  • Choose the method based on the data type
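The first and third checks can be sketched in pandas as follows, reusing the small Age / Salary example; `isna().mean()` gives the fraction of missing values per column.

```python
import pandas as pd

data = pd.DataFrame({
    'Age': [25, None, 30],
    'Salary': [50000, 60000, None]
})

# Percentage of missing values in each column
missing_pct = data.isna().mean() * 100

# Data type of each column, to help choose between mean/median and mode
column_types = data.dtypes

print(missing_pct.round(1))
print(column_types)
```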