🧹 Data Preprocessing in Machine Learning

Cleaning & Transforming Raw Data for Models

🔹 What is Data Preprocessing?

Data preprocessing is the process of cleaning and transforming raw data into a format suitable for machine learning models.

Good preprocessing often improves model accuracy more than changing the algorithm itself.

✅ Why Data Preprocessing is Important

  • Improves model accuracy
  • Reduces training time
  • Helps reduce overfitting
  • Makes data understandable to algorithms

📌 Main Steps in Data Preprocessing

  • Handling missing values
  • Handling categorical data
  • Feature scaling
  • Outlier removal
  • Feature selection
  • Train-test split (done before fitting scalers or imputers, so test data does not leak into training)

1️⃣ Handling Missing Values

Real-world datasets often contain missing values.

Example:
Name   Age   Salary
John   25    50000
Anna   NaN   60000
Mike   30    NaN

Common Methods:

  • Remove rows or columns
  • Replace with mean, median, or mode
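
The two methods above can be sketched with pandas. This is a minimal illustration that reuses the example table; the variable names (df, dropped, filled) are assumptions made for the sketch, and the mean is only one of the possible fill values.

```python
import numpy as np
import pandas as pd

# The example table from above, with NaN marking the missing entries
df = pd.DataFrame({
    "Name": ["John", "Anna", "Mike"],
    "Age": [25, np.nan, 30],
    "Salary": [50000, 60000, np.nan],
})

# Method 1: drop every row that contains at least one missing value
dropped = df.dropna()

# Method 2: fill each numeric column with its own mean
filled = df.copy()
for col in ["Age", "Salary"]:
    filled[col] = filled[col].fillna(filled[col].mean())

print(dropped)
print(filled)
```

Dropping rows is simple but discards data (here only John's row survives), while filling keeps every row at the cost of introducing estimated values. In a real project, the fill value would normally be computed from the training set only.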

2️⃣ Handling Categorical Data

Most machine learning models require numerical input, so text categories must be converted to numbers first.

Example:
Gender
Male
Female

Encoding Techniques:

  • Label Encoding (Male → 0, Female → 1)
  • One-Hot Encoding (Male → [1,0], Female → [0,1])
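
Both encodings can be reproduced with pandas. The sketch below is one possible way to do it, not the only one; the column names Gender_label, label_map, and one_hot are hypothetical, and the mapping follows the example above.

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["Male", "Female", "Male"]})

# Label encoding: map each category to an integer code
# (Male -> 0, Female -> 1, as in the example above)
label_map = {"Male": 0, "Female": 1}
df["Gender_label"] = df["Gender"].map(label_map)

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(df["Gender"], dtype=int)

# get_dummies sorts columns alphabetically; reorder so Male comes first,
# matching Male -> [1,0] and Female -> [0,1]
one_hot = one_hot[["Male", "Female"]]

print(df)
print(one_hot)
```

Label encoding produces a single column but suggests an order between categories (0 < 1), which some models may treat as meaningful. One-hot encoding avoids that at the cost of one column per category.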

3️⃣ Feature Scaling

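Some algorithms, such as KNN and SVM, are sensitive to feature scale because they rely on distances or dot products between data points. In the example below, Salary would dominate Age unless both are rescaled.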

Age  Salary
25  50000

Scaling Methods:

  • Normalization (Min-Max Scaling)
  • Standardization (Z-score)
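
A short sketch of both methods using scikit-learn's MinMaxScaler and StandardScaler. The three rows are illustrative values chosen for this example; the formulas in the comments are the standard definitions of each method.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative sample: columns are Age and Salary
X = np.array([[25, 50000], [30, 60000], [35, 70000]], dtype=float)

# Normalization: (x - min) / (max - min), so each column lies in [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: (x - mean) / std, so each column has mean 0 and std 1
X_standard = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_standard)
```

After either transformation, Age and Salary are on comparable scales, so neither feature dominates a distance calculation purely because of its units.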

4️⃣ Removing Outliers

Outliers are extreme values that can negatively impact model performance.

Example: Most salaries are between 40k–60k, but one value is 5,000,000.
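
One common way to flag such values is the interquartile range (IQR) rule. The article does not prescribe a specific method, so treat the sketch below as an assumption: the salary values are hypothetical and the 1.5 multiplier is a widely used convention, not a fixed requirement.

```python
import pandas as pd

# Hypothetical salaries: most fall in the 40k-60k range, one is 5,000,000
salaries = pd.Series([42000, 45000, 48000, 50000, 52000, 55000, 58000, 5000000])

# Interquartile range: spread of the middle 50% of the data
q1 = salaries.quantile(0.25)
q3 = salaries.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (salaries < lower) | (salaries > upper)

cleaned = salaries[~is_outlier]
print(cleaned.tolist())
```

Whether to drop, cap, or keep a flagged value depends on the problem: an extreme salary may be a data entry error, or it may be a genuine observation the model should learn from.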

5️⃣ Feature Selection

Selecting relevant features improves performance and reduces overfitting.

  • House size ✅
  • Location ✅
  • Owner's favorite color ❌
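
A simple filter based on correlation with the target illustrates the idea. Everything in this sketch is hypothetical: the data, the column names (LocationScore and FavoriteColor are assumed numeric encodings), and the 0.5 threshold. Correlation on an integer-coded category is a simplification used here only to keep the example short.

```python
import pandas as pd

# Hypothetical housing data; FavoriteColor is an integer-encoded category
houses = pd.DataFrame({
    "HouseSize": [50, 80, 120, 150, 200, 90],
    "LocationScore": [3, 5, 6, 8, 9, 4],
    "FavoriteColor": [0, 1, 2, 0, 1, 2],
    "Price": [100, 170, 240, 310, 400, 180],
})

# Absolute Pearson correlation of each feature with the target
features = houses.drop(columns="Price")
correlations = features.corrwith(houses["Price"]).abs()

# Keep features whose correlation exceeds an arbitrary threshold
threshold = 0.5
selected = correlations[correlations > threshold].index.tolist()

print(correlations)
print(selected)
```

In this toy data, house size and location score track the price closely while favorite color does not, so only the first two pass the filter. Correlation captures linear relationships only, so it is a starting point rather than a complete selection strategy.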

6️⃣ Splitting the Dataset

  • Training set (70–80%)
  • Testing set (20–30%)

Evaluating on data the model has never seen gives a fairer estimate of how well it will perform on new data.

📘 Simple Python Example

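import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy dataset with two numerical features
data = pd.DataFrame({
  'Age': [25, 30, 35],
  'Salary': [50000, 60000, 70000]
})

# Split first so the test set stays unseen during preprocessing
X_train, X_test = train_test_split(data, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then reuse it on the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)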

📚 FAQs on Data Preprocessing

Q1. Why is data preprocessing required?

Because raw data is often noisy, inconsistent, or incomplete, and most ML models cannot use it in that form.

Q2. Is preprocessing more important than model selection?

Often yes. A simple model trained on clean data can outperform a complex model trained on bad data.

Q3. Should scaling be done before train-test split?

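No. Split the data first, fit the scaler on the training set only, and then apply that same scaler to the test set. Fitting on the full dataset lets test-set statistics influence training, which is data leakage.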

🔑 Key Points to Remember

  • Preprocessing improves accuracy
  • Always avoid data leakage
  • Scale numerical features properly
  • Clean data beats complex models