Mastering the Art of Data Preprocessing with Pandas

Build cleaner, faster, and smarter ML pipelines with real-world data techniques

[Image: space-1.jpg, generated using Canva]

Written by Tanu Nanda Prabhu

Introduction

Raw data is rarely clean or usable straight out of the box. Whether you're working on a machine learning project, a data analysis dashboard, or a backend system, how you preprocess your data determines the quality of your results.

In this post, we’ll walk through the must-know techniques in pandas for handling missing data, transforming features, and preparing datasets that are ready to power any model or insight.


Why Preprocessing is a Game-Changer

  • Garbage in = garbage out. Clean data ensures better predictions.
  • Saves time in the long run by eliminating inconsistencies early.
  • Helps models converge faster and perform better.
  • Makes your data pipeline robust and production-ready.

Common Data Preprocessing Tasks (with Code)

Let’s break it down into real-world tasks you’ll face and how to solve them using pandas.

1. Handling Missing Values

import pandas as pd
import numpy as np

# Sample dataset
df = pd.DataFrame({
    'Age': [25, 28, np.nan, 35],
    'Income': [50000, 60000, 65000, np.nan]
})

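# Fill missing values and assign the result back
# (in-place fills on a selected column are unreliable in recent pandas versions)
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Income'] = df['Income'].fillna(df['Income'].median())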

Tip: Use .mean() for roughly symmetric (for example, normally distributed) values, and .median() for skewed ones, since the median is less sensitive to extreme values.
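To make the tip concrete, here is a minimal sketch on a made-up income column with one extreme value. The numbers are illustrative assumptions, not real data. It counts the missing entries first, then compares the two candidate fill values.

```python
import numpy as np
import pandas as pd

# Hypothetical, deliberately skewed sample: one very large income
incomes = pd.DataFrame({
    'Income': [40000, 45000, 50000, 55000, 1000000, np.nan]
})

# Count missing values per column before deciding how to fill them
missing_counts = incomes.isna().sum()

# The mean is pulled toward the extreme value; the median is not
mean_fill = incomes['Income'].mean()      # 238000.0
median_fill = incomes['Income'].median()  # 50000.0

# Fill with the median and assign the result back to the column
incomes['Income'] = incomes['Income'].fillna(median_fill)
```

Here the mean would fill the gap with a value far above most of the observed incomes, while the median stays close to a typical row.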

2. Encoding Categorical Variables

df = pd.DataFrame({
    'Department': ['HR', 'Finance', 'IT', 'HR']
})

# One-hot encoding
df_encoded = pd.get_dummies(df, columns=['Department'])

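Use get_dummies for nominal data. For ordinal data, encode the categories in their natural order, for example with scikit-learn's OrdinalEncoder or an ordered pandas Categorical. LabelEncoder is intended for target labels and assigns codes in sorted (alphabetical) order, which rarely matches the real ranking.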
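For ordinal data, one option that keeps the category order under your control is an ordered pandas Categorical. The sketch below assumes a hypothetical Size column with the order Small < Medium < Large; both the column and the order are made up for illustration.

```python
import pandas as pd

# Hypothetical ordinal column; the level names and their order are assumed
sizes = pd.DataFrame({'Size': ['Small', 'Large', 'Medium', 'Small']})

# Declare the order explicitly so the integer codes follow it
size_order = ['Small', 'Medium', 'Large']
sizes['Size'] = pd.Categorical(
    sizes['Size'], categories=size_order, ordered=True
)

# .cat.codes maps Small -> 0, Medium -> 1, Large -> 2
sizes['Size_code'] = sizes['Size'].cat.codes
```

Because the order is spelled out, the codes do not depend on alphabetical sorting of the labels.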

3. Scaling Numerical Features

from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'Height': [150, 160, 170],
    'Weight': [60, 70, 80]
})

scaler = StandardScaler()
scaled = scaler.fit_transform(df)

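Standardizing helps scale-sensitive algorithms such as SVM, KNN, and models trained with gradient descent (for example, logistic regression or neural networks). Tree-based methods such as Gradient Boosting are largely insensitive to feature scaling.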
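One practical detail: fit_transform returns a NumPy array, so the column names are lost. A small sketch, reusing the same Height and Weight values, wraps the result back into a DataFrame and checks what standardizing did (the names scaled_df, col_means, and col_stds are just illustrative).

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'Height': [150, 160, 170],
    'Weight': [60, 70, 80]
})

scaler = StandardScaler()

# fit_transform returns a NumPy array, so rebuild a DataFrame to keep labels
scaled_df = pd.DataFrame(
    scaler.fit_transform(df),
    columns=df.columns,
    index=df.index
)

# Each column now has mean 0 and unit (population) standard deviation
col_means = scaled_df.mean()
col_stds = scaled_df.std(ddof=0)
```

StandardScaler divides by the population standard deviation, which is why ddof=0 is used for the check.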

4. Detecting and Removing Outliers

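# Using Z-score (df is the Height/Weight DataFrame from the previous section)
from scipy import stats

# Z-score of every value, computed column by column
z_scores = stats.zscore(df)
# Keep only rows where every column lies within 3 standard deviations
df_no_outliers = df[(np.abs(z_scores) < 3).all(axis=1)]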

The Z-score method is simple and effective for roughly normally distributed data.
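A three-row frame cannot show the filter doing anything: with SciPy's default population standard deviation, the largest possible |z| is sqrt(n - 1), so a threshold of 3 only starts to matter once there are more than about ten rows. The sketch below uses a hypothetical column of 20 readings with one obvious outlier; the values are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical readings: 19 typical values plus one obvious outlier (500)
readings = pd.DataFrame({
    'Value': [48, 49, 50, 51, 52, 50, 49, 51, 50, 48,
              52, 50, 49, 51, 50, 48, 52, 50, 49, 500]
})

# Absolute Z-score of every cell, computed per column
z = np.abs(stats.zscore(readings))

# Keep rows where every column is within 3 standard deviations
mask = (z < 3).all(axis=1)
readings_clean = readings[mask]
```

Only the row containing 500 is dropped; the other 19 rows stay.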

5. Feature Engineering, DateTime Example

df = pd.DataFrame({
    'Timestamp': pd.to_datetime([
        '2023-01-01', '2023-02-15', '2023-03-30'
    ])
})

# Month as an integer from 1 to 12
df['Month'] = df['Timestamp'].dt.month
# Day of week as an integer, Monday=0 through Sunday=6
df['Weekday'] = df['Timestamp'].dt.weekday

Extracting features from dates can boost model performance in time-related problems.
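The same .dt accessor exposes several other fields. As a sketch, the features below (IsWeekend, Quarter, DayName) are hypothetical additions, not part of the example above, and whether they help depends on the problem.

```python
import pandas as pd

df = pd.DataFrame({
    'Timestamp': pd.to_datetime([
        '2023-01-01', '2023-02-15', '2023-03-30'
    ])
})

# weekday is 0 for Monday through 6 for Sunday
df['Weekday'] = df['Timestamp'].dt.weekday

# Hypothetical extra features derived from the same column
df['IsWeekend'] = df['Weekday'] >= 5           # Saturday or Sunday
df['Quarter'] = df['Timestamp'].dt.quarter     # 1 through 4
df['DayName'] = df['Timestamp'].dt.day_name()  # e.g. 'Sunday'
```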

Bonus Tip: Pipeline It!

Use sklearn.pipeline.Pipeline or sklearn.compose.make_column_transformer to wrap all preprocessing steps into one clean object.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
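To show how such a pipeline might be applied, here is a sketch that combines it with make_column_transformer on a small, hypothetical dataset built from the columns used earlier in this post. The names data, numeric_pipeline, preprocessor, and features are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical dataset combining the columns used earlier in this post
data = pd.DataFrame({
    'Age': [25, 28, np.nan, 35],
    'Income': [50000, 60000, 65000, np.nan],
    'Department': ['HR', 'Finance', 'IT', 'HR']
})

# Impute missing numbers with the column mean, then standardize
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Numeric columns go through the pipeline; Department is one-hot encoded.
# sparse_threshold=0 asks for a dense NumPy array as output.
preprocessor = make_column_transformer(
    (numeric_pipeline, ['Age', 'Income']),
    (OneHotEncoder(handle_unknown='ignore'), ['Department']),
    sparse_threshold=0
)

# Result: 2 scaled numeric columns + 3 one-hot Department columns
features = preprocessor.fit_transform(data)
```

In a real project you would typically call fit_transform on the training split and transform on the test split, so the imputation and scaling statistics come from training data only.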

Real-World Use Case

In any ML competition or project, preprocessing is where the top performers gain the edge. They know how to clean and transform messy real-world data into a form the model can understand. Even small improvements in data quality can translate into noticeable gains in model performance.


Conclusion

Data preprocessing isn’t a “boring” step; it’s the backbone of all successful projects. Mastering tools like pandas and sklearn, combined with some domain intuition, can help you scale your skills from beginner to expert.


Stay Connected & Level Up!

Loved this post? Smash that like, drop a comment, and hit follow for daily mind-bending Python questions! Want more in-depth explanations?

Check out my GitHub for code and Medium for deep dives!