Data Cleaning with Pandas in Python

Data cleaning is a critical step in data preprocessing: it ensures the dataset is accurate, consistent, and usable for analysis or machine learning tasks.

Pandas, a Python library, offers a broad set of tools and functions for cleaning and transforming data. This guide walks through common data cleaning tasks, each supported by a coding example.

Pandas is an open-source data analysis library that provides data structures like DataFrame and Series for handling structured data. Its ease of use and flexibility make it a go-to library for data cleaning tasks.

Install Pandas using pip if not already installed:

pip install pandas

Loading Data

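Before cleaning, data needs to be loaded into a Pandas DataFrame. Real projects usually read it from a file or a database; the example here builds a small DataFrame from a dictionary so that every step is reproducible.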

import pandas as pd

# Example dataset
data = {
    "Name": ["Alice", "Bob", "Charlie", None, "Eve"],
    "Age": [25, 30, None, 35, 40],
    "City": ["New York", None, "Los Angeles", "Chicago", "New York"],
    "Income": ["50000", "60000", "unknown", "70000", "80000"],
}

df = pd.DataFrame(data)
print(df)
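In practice the data often comes from a CSV file. The sketch below shows the same dataset loaded with pd.read_csv. The CSV text is a hypothetical stand-in for a file on disk, wrapped in io.StringIO so the example runs without any external file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """Name,Age,City,Income
Alice,25,New York,50000
Bob,30,,60000
Charlie,,Los Angeles,unknown
,35,Chicago,70000
Eve,40,New York,80000
"""

# pd.read_csv accepts a file path or any file-like object.
# Empty fields are parsed as missing values by default.
df_from_csv = pd.read_csv(io.StringIO(csv_text))

print(df_from_csv.shape)         # (rows, columns)
print(df_from_csv.isna().sum())  # missing values per column
```

With a real file, the io.StringIO(...) argument would be replaced by the file's path.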


Handling Missing Values

Missing values are common in real-world datasets and need to be addressed.

Identifying Missing Values

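Pandas provides isnull() (an alias of isna()) to detect missing values. It flags only None and NaN; placeholder strings such as "unknown" in the Income column are not counted as missing.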

print(df.isnull())  # Check for null values
print(df.isnull().sum())  # Count missing values in each column
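Two follow-up checks are often useful here: the share of missing values per column, and placeholder strings that isnull() does not catch. The sketch below assumes that "unknown" in Income stands for a missing value; that reading is an assumption about this dataset, not something Pandas infers.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None, "Eve"],
    "Age": [25, 30, None, 35, 40],
    "City": ["New York", None, "Los Angeles", "Chicago", "New York"],
    "Income": ["50000", "60000", "unknown", "70000", "80000"],
})

# Fraction of missing values per column (the mean of a boolean mask).
missing_share = df.isna().mean()
print(missing_share)

# "unknown" is an ordinary string, so isna() does not flag it.
# Assumption: "unknown" is a placeholder for a missing income.
income_marked = df["Income"].mask(df["Income"] == "unknown")
print(income_marked.isna().sum())
```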


Dropping Missing Values

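dropna() removes rows (the default) or columns (axis=1) that contain at least one missing value. It returns a new DataFrame and leaves the original unchanged.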

import pandas as pd

# Example dataset
data = {
    "Name": ["Alice", "Bob", "Charlie", None, "Eve"],
    "Age": [25, 30, None, 35, 40],
    "City": ["New York", None, "Los Angeles", "Chicago", "New York"],
    "Income": ["50000", "60000", "unknown", "70000", "80000"],
}

df = pd.DataFrame(data)

# Drop rows with missing values
cleaned_df = df.dropna()
print(cleaned_df)

# Drop columns with missing values
cleaned_df = df.dropna(axis=1)
print(cleaned_df)
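Dropping every row with any missing value can discard a lot of data: on this dataset only two of five rows survive. As a minimal sketch, the subset and thresh parameters of dropna() allow a more selective rule.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None, "Eve"],
    "Age": [25, 30, None, 35, 40],
    "City": ["New York", None, "Los Angeles", "Chicago", "New York"],
    "Income": ["50000", "60000", "unknown", "70000", "80000"],
})

# Drop a row only when a specific column is missing.
has_age = df.dropna(subset=["Age"])

# Keep rows that have at least 3 non-missing values.
mostly_complete = df.dropna(thresh=3)

print(len(df), len(df.dropna()), len(has_age), len(mostly_complete))
```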


Filling Missing Values

Fill missing data with a statistic such as the mean, median, or mode of the column, or with a fixed value.

# Fill missing numeric values with mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Fill missing categorical values with mode
df['City'] = df['City'].fillna(df['City'].mode()[0])

print(df)
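Several columns can also be filled in one call by passing fillna() a mapping from column name to fill value. The sketch below uses the median for Age and a fixed label for the text columns; the "Unknown" label is a hypothetical choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None, "Eve"],
    "Age": [25, 30, None, 35, 40],
    "City": ["New York", None, "Los Angeles", "Chicago", "New York"],
    "Income": ["50000", "60000", "unknown", "70000", "80000"],
})

fill_values = {
    # The median is less sensitive to extreme values than the mean.
    "Age": df["Age"].median(),
    # Hypothetical fixed label for missing text values.
    "Name": "Unknown",
    "City": "Unknown",
}

# fillna returns a new DataFrame; df itself is left unchanged.
filled = df.fillna(value=fill_values)
print(filled)
```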


Handling Duplicates

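Duplicate rows can skew analysis results. duplicated() marks each row that repeats an earlier one, and drop_duplicates() removes those repeats. The sample dataset contains no duplicates, so every value printed below is False.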

# Identify duplicates
print(df.duplicated())

# Drop duplicates
df = df.drop_duplicates()
print(df)
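Because the sample dataset has no repeated rows, the sketch below uses a small hypothetical table to show what duplicated() and drop_duplicates() do, including the subset and keep parameters for deduplicating on selected columns.

```python
import pandas as pd

# Hypothetical records: row 1 repeats row 0 exactly,
# row 2 differs from them only in City.
people = pd.DataFrame({
    "Name": ["Alice", "Alice", "Alice", "Bob"],
    "Age": [25, 25, 25, 30],
    "City": ["New York", "New York", "Boston", "Chicago"],
})

# True for every row that repeats an earlier row in all columns.
exact_dupes = people.duplicated()
print(exact_dupes)

# Remove exact repeats, keeping the first occurrence.
deduped_all = people.drop_duplicates()

# Treat rows as duplicates when Name and Age match, keeping the last one.
deduped_by_key = people.drop_duplicates(subset=["Name", "Age"], keep="last")
print(deduped_by_key)
```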


Converting Data Types

Data may need to be converted to appropriate types for analysis.


# Convert Income to numeric.
# errors='coerce' turns values that cannot be parsed (here "unknown") into NaN
df['Income'] = pd.to_numeric(df['Income'], errors='coerce')
print(df.dtypes)
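Since errors='coerce' silently replaces unparseable values with NaN, it can help to check which values were lost. The following sketch lists them and then, as an optional step, casts the result to the nullable integer dtype "Int64" so whole numbers are not shown as floats.

```python
import pandas as pd

income = pd.Series(["50000", "60000", "unknown", "70000", "80000"], name="Income")

income_num = pd.to_numeric(income, errors="coerce")

# Values present before conversion but NaN afterwards failed to parse.
failed = income[income_num.isna() & income.notna()]
print(failed.tolist())

# Optional: the nullable integer dtype stores whole numbers and missing values together.
income_int = income_num.astype("Int64")
print(income_int.dtype)
```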


Standardizing Data

Ensuring consistent data formatting is vital.
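Consistent formatting applies to values as well as column names. As a minimal sketch, the string methods under .str can normalize whitespace and capitalization; the messy city values below are hypothetical and not part of the sample dataset.

```python
import pandas as pd

# Hypothetical inconsistent spellings of the same cities.
cities = pd.Series(["  new york", "NEW YORK ", "Los  Angeles", None])

clean = (
    cities.str.strip()                     # remove leading/trailing spaces
    .str.replace(r"\s+", " ", regex=True)  # collapse repeated whitespace
    .str.title()                           # capitalize each word
)

print(clean)
print(clean.nunique())  # number of distinct non-missing values
```

Missing values pass through these string methods unchanged, so they can still be handled separately.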

Renaming Columns

# Rename columns
df = df.rename(columns={"Name": "Full Name", "City": "Location"})
print(df)


Handling Outliers

Outliers can distort statistical analyses and models.

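import pandas as pd
from scipy.stats import zscore

# Example dataset
data = {
    "Name": ["Alice", "Bob", "Charlie", None, "Eve"],
    "Age": [25, 30, None, 35, 40],
    "City": ["New York", None, "Los Angeles", "Chicago", "New York"],
    "Income": ["50000", "60000", "unknown", "70000", "80000"],
}

df = pd.DataFrame(data)

# Detect outliers using z-score.
# nan_policy='omit' skips the missing Age; the default ('propagate')
# would turn every z-score into NaN and hide all outliers.
df['Age_zscore'] = zscore(df['Age'], nan_policy='omit')

# Flag rows more than 3 standard deviations from the mean.
# With only five rows no z-score can reach 3, so this prints an empty DataFrame.
outliers = df[df['Age_zscore'].abs() > 3]
print(outliers)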
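On small samples a z-score threshold of 3 is rarely reached. A commonly used alternative, shown here as a sketch and not as part of the z-score method above, is the interquartile range (IQR) rule: flag values more than 1.5 times the IQR below the first quartile or above the third. The ages below are hypothetical and include one implausible entry.

```python
import pandas as pd

# Hypothetical ages with one implausible value.
ages = pd.Series([25, 30, 35, 40, 120], name="Age")

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Conventional fences at 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = ages[(ages < lower) | (ages > upper)]
print(outliers)
```

Whether a flagged value is removed, capped, or kept depends on the dataset; the rule only identifies candidates for review.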


Combining and Splitting Data

Cleaning often involves merging datasets that share a key column, or splitting one dataset into parts.

Merging DataFrames

import pandas as pd


# Example dataset
data = {
    "Name": ["Alice", "Bob", "Charlie", None, "Eve"],
    "Age": [25, 30, None, 35, 40],
    "City": ["New York", None, "Los Angeles", "Chicago", "New York"],
    "Income": ["50000", "60000", "unknown", "70000", "80000"],
}

df = pd.DataFrame(data)


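# Scores are available for only some of the names
additional_data = pd.DataFrame({"Name": ["Alice", "Charlie"], "Score": [85, 90]})

# A left merge keeps every row of df; names without a match get NaN in Score
df = df.merge(additional_data, on="Name", how="left")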
print(df)
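Splitting works in the opposite direction. One simple approach, sketched below, is to split rows with a boolean mask and recombine them later with pd.concat. The choice of "New York" as the split condition is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None, "Eve"],
    "Age": [25, 30, None, 35, 40],
    "City": ["New York", None, "Los Angeles", "Chicago", "New York"],
    "Income": ["50000", "60000", "unknown", "70000", "80000"],
})

# Boolean mask: True where City equals "New York".
# A missing City compares as False, so that row lands in `elsewhere`.
is_new_york = df["City"] == "New York"

new_york = df[is_new_york]
elsewhere = df[~is_new_york]

# Recombine the parts and restore the original row order.
recombined = pd.concat([new_york, elsewhere]).sort_index()
print(recombined)
```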


Conclusion

Data cleaning is an iterative process that involves understanding the dataset, addressing inconsistencies, and ensuring the data is prepared for further analysis. Pandas provides robust functionalities to handle these tasks efficiently.

Key Points Covered:

Handling missing values
Removing duplicates
Converting data types
Standardizing and formatting data
Detecting and managing outliers
Merging and splitting data

By mastering these techniques, you can streamline your data cleaning workflow and prepare datasets for any analytical or machine-learning pipeline.
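As a closing sketch, the individual steps from this guide can be collected into one function. The function name clean and the order of the steps are assumptions made for illustration; a real pipeline would be adapted to the dataset at hand.

```python
import pandas as pd


def clean(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps from this guide to a copy of `raw`."""
    out = raw.copy()

    # Convert Income to numeric; unparseable values become NaN.
    out["Income"] = pd.to_numeric(out["Income"], errors="coerce")

    # Fill missing values: mean for Age, most frequent value for City.
    out["Age"] = out["Age"].fillna(out["Age"].mean())
    out["City"] = out["City"].fillna(out["City"].mode()[0])

    # Remove exact duplicate rows, then standardize column names.
    out = out.drop_duplicates()
    return out.rename(columns={"Name": "Full Name", "City": "Location"})


raw = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None, "Eve"],
    "Age": [25, 30, None, 35, 40],
    "City": ["New York", None, "Los Angeles", "Chicago", "New York"],
    "Income": ["50000", "60000", "unknown", "70000", "80000"],
})

cleaned = clean(raw)
print(cleaned)
print(cleaned.dtypes)
```

Working on a copy keeps the raw data available for comparison after each run.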
