Image Generated Using Canva
Written by Tanu Nanda Prabhu
Raw data is rarely clean or usable straight out of the box. Whether you're working on a machine learning project, a data analysis dashboard, or a backend system, how you preprocess your data determines the quality of your results.
In this post, we’ll walk through the must-know techniques in pandas for handling missing data, transforming features, and preparing datasets that are ready to power any model or insight.
- Garbage in = garbage out. Clean data ensures better predictions.
- Saves time in the long run by eliminating inconsistencies early.
- Helps models converge faster and perform better.
- Makes your data pipeline robust and production-ready.
Let’s break it down into real-world tasks you’ll face and how to solve them using Pandas.
```python
import pandas as pd
import numpy as np

# Sample dataset
df = pd.DataFrame({
    'Age': [25, 28, np.nan, 35],
    'Income': [50000, 60000, 65000, np.nan]
})

# Fill missing values (assign back instead of using inplace=True on a
# column slice, which is deprecated in recent pandas versions)
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Income'] = df['Income'].fillna(df['Income'].median())
```
Tip: Use `.mean()` for normally distributed values and `.median()` for skewed ones.
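To see why the tip matters, here's a small sketch with made-up income figures where a single outlier drags the mean far above the typical value, while the median stays robust:

```python
import pandas as pd

# Hypothetical skewed sample: one very high income distorts the mean
income = pd.Series([30000, 32000, 35000, 38000, 250000])

print(income.mean())    # 77000.0 — pulled up by the outlier
print(income.median())  # 35000.0 — robust to the outlier
```

Imputing with the mean here would fill gaps with a value larger than four of the five observations, which is exactly the situation where the median is the safer choice.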
```python
df = pd.DataFrame({
    'Department': ['HR', 'Finance', 'IT', 'HR']
})

# One-hot encoding
df_encoded = pd.get_dummies(df, columns=['Department'])
```
Use `get_dummies` for nominal data. For ordinal data, consider scikit-learn's `OrdinalEncoder` (`LabelEncoder` is intended for target labels, not input features).
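Here's a quick sketch of the ordinal case, using a hypothetical education-level column where the categories have a natural order we want the encoding to preserve:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature with a meaningful order
df = pd.DataFrame({
    'Education': ['High School', 'Bachelor', 'Master', 'Bachelor']
})

# Passing the categories explicitly fixes the order of the codes
encoder = OrdinalEncoder(categories=[['High School', 'Bachelor', 'Master']])
df['Education_encoded'] = encoder.fit_transform(df[['Education']])
# -> codes 0.0, 1.0, 2.0, 1.0
```

Unlike one-hot encoding, this keeps a single column and preserves the rank information, which is exactly what you want for ordered categories.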
```python
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'Height': [150, 160, 170],
    'Weight': [60, 70, 80]
})

scaler = StandardScaler()
scaled = scaler.fit_transform(df)
```
Standardizing helps distance- and gradient-based algorithms like SVM, KNN, and logistic regression. Tree-based models such as gradient boosting are largely insensitive to feature scale.
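You can verify what `StandardScaler` actually does by wrapping the result back into a DataFrame: every column ends up with mean 0 and unit (population) standard deviation. A minimal sketch using the same height/weight data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'Height': [150, 160, 170],
    'Weight': [60, 70, 80]
})

scaler = StandardScaler()
# fit_transform returns a NumPy array; restore column names for readability
scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Each column now has mean ~0 and population std ~1
# (StandardScaler divides by the population std, i.e. ddof=0)
```

Keeping the scaler object around also lets you call `scaler.inverse_transform` later to map predictions back to the original units.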
```python
# Using Z-score
from scipy import stats

z_scores = stats.zscore(df)
df_no_outliers = df[(np.abs(z_scores) < 3).all(axis=1)]
```
The z-score method is simple and effective for normally distributed data.
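For skewed data, where z-scores can mislead, a common alternative is the IQR rule: flag anything beyond 1.5 interquartile ranges from the quartiles. A minimal sketch with a hypothetical income column:

```python
import pandas as pd

# Hypothetical skewed data: one extreme income value
df = pd.DataFrame({
    'Income': [40000, 42000, 45000, 47000, 50000, 300000]
})

q1 = df['Income'].quantile(0.25)
q3 = df['Income'].quantile(0.75)
iqr = q3 - q1

# Keep rows within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
mask = df['Income'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_filtered = df[mask]  # the 300000 row is dropped
```

Because quartiles depend only on rank order, this filter is robust to the very outliers it is trying to detect.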
```python
df = pd.DataFrame({
    'Timestamp': pd.to_datetime([
        '2023-01-01', '2023-02-15', '2023-03-30'
    ])
})

df['Month'] = df['Timestamp'].dt.month
df['Weekday'] = df['Timestamp'].dt.weekday
```
Extracting features from dates can boost model performance in time-related problems.
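One refinement worth knowing: raw month numbers treat December (12) and January (1) as far apart, even though they are adjacent in time. A sin/cos encoding, sketched below on the same timestamps, maps the month onto a circle so that wrap-around neighbors stay close:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2023-01-01', '2023-02-15', '2023-03-30'])
})
df['Month'] = df['Timestamp'].dt.month

# Cyclical encoding: every month lands on the unit circle,
# so month 12 and month 1 end up next to each other
df['Month_sin'] = np.sin(2 * np.pi * df['Month'] / 12)
df['Month_cos'] = np.cos(2 * np.pi * df['Month'] / 12)
```

The same trick works for weekday, hour of day, or any other periodic feature.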
Use `sklearn.pipeline.Pipeline` or `make_column_transformer` to wrap all preprocessing steps into one clean object.
```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
```
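In practice, numeric and categorical columns usually need different treatment, which is where `make_column_transformer` comes in. Here's a sketch (using a small made-up frame combining the earlier Age and Department columns) that routes the numeric pipeline to one column and one-hot encoding to the other:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    'Age': [25, 28, np.nan, 35],
    'Department': ['HR', 'Finance', 'IT', 'HR'],
})

# Impute then scale the numeric column
numeric = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])

# Route each transformer to its own columns
preprocessor = make_column_transformer(
    (numeric, ['Age']),
    (OneHotEncoder(), ['Department']),
)

X = preprocessor.fit_transform(df)
# X has 4 rows and 4 columns: 1 scaled numeric + 3 one-hot columns
```

The whole object can then be dropped in front of any estimator inside a larger `Pipeline`, so imputation, scaling, and encoding are fit only on training folds during cross-validation.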
In any ML competition or project, preprocessing is where the top performers gain their edge: they know how to clean and transform messy real-world data into a form the model can understand. Even small improvements in data cleaning can lead to noticeable gains in model performance.
Data preprocessing isn't a "boring" step; it's the backbone of every successful project. Mastering tools like pandas and sklearn, along with some domain intuition, can take your skills from beginner to expert.
Loved this post? Smash that like, drop a comment, and hit follow for more in-depth Python explanations!