🔹 Outlier Detection in Machine Learning
Identifying Abnormal Data Points
An outlier is a data point that differs markedly from the other observations. It can result from measurement or data-entry errors, rare events, or natural variation.
🔎 Example (Salary in $)
40,000
45,000
50,000
48,000
5,000,000 ❗ Outlier
That single extreme value can distort a model: it pulls the mean salary from $45,750 (first four values) to $1,036,600 (all five).
🔹 Why Outlier Detection is Important
- Prevents model distortion
- Improves accuracy
- Reduces overfitting
- Important for fraud & anomaly detection
🔹 Methods of Outlier Detection
1️⃣ Z-Score Method (Standard Deviation)
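Measures how many standard deviations a point lies from the mean.
Z = (X − μ) / σ
where μ is the mean and σ is the standard deviation of the data.
If |Z| > 3 → Outlier
Works best when data is approximately normally distributed. Extreme values inflate μ and σ, so the method can miss outliers in small samples.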
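The Z-score rule can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the helper name `zscore_outliers` and the larger sample `large` are assumptions made up for this example, and σ is taken as the population standard deviation (NumPy's default).

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return the values whose |Z| exceeds threshold (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    mu = values.mean()
    sigma = values.std()  # population standard deviation (ddof=0)
    if sigma == 0:
        return values[:0]  # all values identical: nothing to flag
    z = (values - mu) / sigma
    return values[np.abs(z) > threshold]

# The five salaries from the example above
small = [40000, 45000, 50000, 48000, 5000000]
# Assumed larger sample: the same typical salaries repeated, plus the extreme value
large = [40000, 45000, 50000, 48000] * 5 + [5000000]

small_flagged = zscore_outliers(small)
large_flagged = zscore_outliers(large)
print("Flagged in small sample:", small_flagged)
print("Flagged in large sample:", large_flagged)
```

With only five points nothing is flagged: the extreme salary inflates μ and σ so much that its own |Z| stays near 2, which is the largest value possible in a sample of five when σ is the population standard deviation. In the larger sample the same salary is flagged.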
2️⃣ IQR Method (Interquartile Range)
One of the most commonly used methods.
Steps:
- Find Q1 (25th percentile)
- Find Q3 (75th percentile)
- IQR = Q3 − Q1
A point X is an outlier if:
X < Q1 − 1.5 × IQR
OR
X > Q3 + 1.5 × IQR
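Quartiles are barely affected by extreme values and the method does not assume normality, so it suits skewed data better than the Z-score.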
3️⃣ Box Plot Method
Visual method based on IQR.
Outliers appear as points outside the whiskers.
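A box plot of the salary data could be drawn as follows. This sketch assumes Matplotlib is available; the log-scaled axis and the non-interactive `Agg` backend are choices made for this example only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Salary data from the example above
salaries = np.array([40000, 45000, 50000, 48000, 5000000])

fig, ax = plt.subplots()
# whis=1.5 (the default) sets the whisker limits at Q1 - 1.5*IQR and Q3 + 1.5*IQR
result = ax.boxplot(salaries, whis=1.5)
ax.set_yscale("log")  # assumed choice: keeps the box visible next to the extreme value
ax.set_ylabel("Salary ($)")

# Points drawn beyond the whiskers ("fliers") are the outliers
flier_values = result["fliers"][0].get_ydata()
print("Points outside the whiskers:", flier_values)
```

To view the plot, save it with `fig.savefig(...)`, or drop the `matplotlib.use("Agg")` line and call `plt.show()` in an interactive session.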
4️⃣ Using Machine Learning Algorithms
Advanced techniques for large or high-dimensional data:
- Isolation Forest
- Local Outlier Factor (LOF)
- One-Class SVM
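As one illustration of the algorithms listed above, here is a minimal Isolation Forest sketch using scikit-learn (an assumed dependency). The `contamination` value is a guess at the share of outliers, not something the data provides, and five points is far too few for real use. The salary data is reused only to keep the example small.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Salary data from the example above, reshaped to (n_samples, n_features)
salaries = np.array([40000, 45000, 50000, 48000, 5000000], dtype=float)
X = salaries.reshape(-1, 1)

# contamination is an assumed guess at the fraction of outliers (1 of 5 here);
# random_state is fixed only to make the run repeatable
model = IsolationForest(n_estimators=100, contamination=0.2, random_state=0)
labels = model.fit_predict(X)  # -1 = outlier, 1 = inlier

flagged = salaries[labels == -1]
print("Flagged by Isolation Forest:", flagged)
```

Local Outlier Factor and One-Class SVM are also available in scikit-learn (`sklearn.neighbors.LocalOutlierFactor` and `sklearn.svm.OneClassSVM`) and follow a similar fit-then-label pattern.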
📘 Simple Python Example (IQR Method)
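import numpy as np

# Salary data from the example above
data = np.array([40000, 45000, 50000, 48000, 5000000])

# Quartiles and interquartile range
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1

# Fences: any point more than 1.5 * IQR beyond a quartile is an outlier
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

outliers = data[(data < lower) | (data > upper)]
print("Outliers:", outliers)  # Outliers: [5000000]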
🔹 What to Do After Detecting Outliers?
- ✔ Remove them (if they are data errors)
- ✔ Cap them at a chosen limit (Winsorization)
- ✔ Transform the data (e.g. log transformation)
- ✔ Keep them (if meaningful, e.g. fraud)
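The first three options can be sketched on the salary data as follows. This is a rough illustration under stated assumptions: capping at the IQR fences is only one variant (classical Winsorization caps at chosen percentiles instead), and base-10 logs are an arbitrary choice.

```python
import numpy as np

# Salary data from the example above
salaries = np.array([40000, 45000, 50000, 48000, 5000000], dtype=float)

# IQR fences, as in the IQR method
q1, q3 = np.percentile(salaries, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Remove: keep only the points inside the fences
removed = salaries[(salaries >= lower) & (salaries <= upper)]

# Cap: clip every value to the fences (assumed variant of Winsorization)
capped = np.clip(salaries, lower, upper)

# Transform: a log scale compresses large values (requires positive data)
logged = np.log10(salaries)

print("Removed:", removed)
print("Capped: ", capped)
print("Log10:  ", np.round(logged, 2))
```

Capping replaces 5,000,000 with the upper fence of 57,500, while the log transform keeps the point but shrinks its distance from the rest.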
🔹 Real-Life Examples
- Fraud detection (unusual transactions)
- Network intrusion detection
- Medical abnormal readings
- Stock market crash data