Data Cleaning with Pandas in Python – A Complete Guide

Data cleaning is a critical step in data preprocessing: it ensures the dataset is accurate, consistent, and usable for analysis or machine learning tasks.

Pandas, a powerful Python library, offers a plethora of tools and functions for cleaning and transforming data. This guide provides a detailed explanation of common data cleaning tasks, supported by coding examples.

Pandas is an open-source data analysis library that provides data structures like DataFrame and Series for handling structured data. Its ease of use and flexibility make it a go-to library for data cleaning tasks.

Install Pandas using pip if not already installed:

pip install pandas

Loading Data

Before cleaning, data needs to be loaded into a Pandas DataFrame.

import pandas as pd

# Example dataset
data = {
    "Name": ["Alice", "Bob", "Charlie", None, "Eve"],
    "Age": [25, 30, None, 35, 40],
    "City": ["New York", None, "Los Angeles", "Chicago", "New York"],
    "Income": ["50000", "60000", "unknown", "70000", "80000"],
}

df = pd.DataFrame(data)
print(df)
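In practice, data usually arrives as a file rather than a Python dictionary. The following is a minimal sketch of loading CSV data with pd.read_csv. The CSV text and the name df_csv are hypothetical stand-ins that mirror the example dataset; an in-memory buffer is used so the snippet runs without a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = """Name,Age,City,Income
Alice,25,New York,50000
Bob,30,,60000
Charlie,,Los Angeles,unknown
,35,Chicago,70000
Eve,40,New York,80000
"""

# read_csv accepts a file path or any file-like object.
# Empty fields are read as missing values (NaN).
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv.shape)
print(df_csv.isna().sum())
```

With a real file, the call would take the path instead, for example pd.read_csv("people.csv"), where the file name is only an illustration.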

Handling Missing Values

Missing values are common in real-world datasets and need to be addressed.

Identifying Missing Values

Pandas provides functions to detect missing values.

print(df.isnull())  # Check for null values
print(df.isnull().sum())  # Count missing values in each column
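Counts alone can be hard to judge on larger datasets. As a rough sketch, the share of missing values per column and the affected rows can be listed as follows. The names df_demo, missing_pct, and rows_with_gaps are illustrative, and isna() is an alias of isnull().

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None, "Eve"],
    "Age": [25, 30, None, 35, 40],
    "City": ["New York", None, "Los Angeles", "Chicago", "New York"],
})

# The mean of a boolean mask is the fraction of True values,
# so this gives the percentage of missing entries per column.
missing_pct = df_demo.isna().mean() * 100

# Select the rows that contain at least one missing value.
rows_with_gaps = df_demo[df_demo.isna().any(axis=1)]

print(missing_pct)
print(rows_with_gaps)
```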

Dropping Missing Values

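You can remove rows or columns that contain missing values. By default, dropna() discards a row (or column) if even one value is missing, so it can remove a lot of data: in this example, dropping rows keeps only Alice and Eve, and dropping columns keeps only Income.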

import pandas as pd

# Example dataset
data = {
    "Name": ["Alice", "Bob", "Charlie", None, "Eve"],
    "Age": [25, 30, None, 35, 40],
    "City": ["New York", None, "Los Angeles", "Chicago", "New York"],
    "Income": ["50000", "60000", "unknown", "70000", "80000"],
}

df = pd.DataFrame(data)

# Drop rows with missing values
cleaned_df = df.dropna()
print(cleaned_df)

# Drop columns with missing values
cleaned_df = df.dropna(axis=1)
print(cleaned_df)
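When dropping everything with a gap is too aggressive, dropna() can be narrowed with its subset and thresh parameters. The sketch below assumes that only a missing Name makes a row unusable; that rule, and the names df_demo, named_only, and complete_rows, are illustrative choices.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None, "Eve"],
    "Age": [25, 30, None, 35, 40],
    "City": ["New York", None, "Los Angeles", "Chicago", "New York"],
})

# Drop a row only when "Name" is missing; gaps in other columns are kept.
named_only = df_demo.dropna(subset=["Name"])

# Keep rows with at least 3 non-missing values.
# With three columns, that means fully populated rows only.
complete_rows = df_demo.dropna(thresh=3)

print(named_only)
print(complete_rows)
```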

Filling Missing Values

Fill gaps with a statistic such as the mean, median, or mode of the column, or with a fixed value.

# Fill missing numeric values with mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Fill missing categorical values with mode
df['City'] = df['City'].fillna(df['City'].mode()[0])

print(df)
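The median and a fixed placeholder work the same way. Here is a small sketch that fills several columns in one call by passing a dictionary to fillna(). The placeholder "Unknown" and the names df_demo, age_median, and filled are arbitrary choices for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None, "Eve"],
    "Age": [25, 30, None, 35, 40],
    "City": ["New York", None, "Los Angeles", "Chicago", "New York"],
})

# The median is less sensitive to extreme values than the mean.
age_median = df_demo["Age"].median()

# A dict maps each column to its own fill value.
filled = df_demo.fillna({"Age": age_median, "Name": "Unknown", "City": "Unknown"})

print(filled)
```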

Handling Duplicates

Duplicate rows can skew analysis results.

# Identify duplicates
print(df.duplicated())

# Drop duplicates
df = df.drop_duplicates()
print(df)
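The example dataset has no repeated rows, so the calls above flag nothing. To see the behaviour, here is a sketch on a hypothetical DataFrame named people, using the subset and keep parameters to control which columns are compared and which occurrence survives.

```python
import pandas as pd

# Hypothetical data: "Alice" is repeated exactly,
# while "Bob" appears twice with different cities.
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Bob"],
    "City": ["New York", "Chicago", "New York", "Boston"],
})

# Full-row comparison: only the second "Alice" row is a duplicate.
full_dupes = people.duplicated()

# Compare on "Name" only and keep the last occurrence of each name.
last_per_name = people.drop_duplicates(subset=["Name"], keep="last")

print(full_dupes.tolist())
print(last_per_name)
```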

Converting Data Types

Data may need to be converted to appropriate types for analysis.


# Convert Income to numeric
df['Income'] = pd.to_numeric(df['Income'], errors='coerce')
print(df.dtypes)
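It helps to check what errors='coerce' did to values that could not be parsed. The sketch below works on a standalone Series (the names income, income_num, and income_int are illustrative) and then converts the result to the nullable Int64 dtype, which is one option for keeping whole numbers alongside missing entries.

```python
import pandas as pd

income = pd.Series(["50000", "60000", "unknown", "70000", "80000"])

# errors="coerce" turns unparseable entries such as "unknown" into NaN.
income_num = pd.to_numeric(income, errors="coerce")

# NaN forces a float dtype; the nullable "Int64" dtype stores
# whole numbers and marks the gap as <NA> instead.
income_int = income_num.astype("Int64")

print(income_num.dtype)
print(income_int.dtype)
print(income_int.isna().sum())
```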

Standardizing Data

Consistent column names and value formats make data easier to join, group, and compare.

Renaming Columns

# Rename columns
df = df.rename(columns={"Name": "Full Name", "City": "Location"})
print(df)
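Beyond column names, the values themselves often need consistent formatting. As a sketch, assuming the inconsistencies are stray whitespace and mixed capitalization, the .str accessor can normalize a text column. The sample values and the names cities and standardized are hypothetical.

```python
import pandas as pd

# Hypothetical messy values: stray whitespace and inconsistent capitalization.
cities = pd.Series(["  new york", "NEW YORK ", "Los  Angeles", None])

standardized = (
    cities
    .str.strip()                            # remove leading/trailing whitespace
    .str.replace(r"\s+", " ", regex=True)   # collapse repeated inner spaces
    .str.title()                            # consistent capitalization
)

print(standardized.tolist())
```

Missing entries stay missing through these string methods, so they can still be handled with the techniques from the missing-values section.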

Handling Outliers

Outliers can distort statistical analyses and models.

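import pandas as pd
from scipy.stats import zscore

# Example dataset
data = {
    "Name": ["Alice", "Bob", "Charlie", None, "Eve"],
    "Age": [25, 30, None, 35, 40],
    "City": ["New York", None, "Los Angeles", "Chicago", "New York"],
    "Income": ["50000", "60000", "unknown", "70000", "80000"],
}

df = pd.DataFrame(data)

# Detect outliers using the z-score.
# nan_policy='omit' ignores the missing age; with the default ('propagate'),
# a single NaN would turn every z-score into NaN.
df['Age_zscore'] = zscore(df['Age'], nan_policy='omit')

# Flag rows more than 3 standard deviations from the mean.
# No age in this small sample is that extreme, so the result is empty.
outliers = df[df['Age_zscore'].abs() > 3]
print(outliers)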
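On small samples a z-score threshold of 3 may never be reached, because a single extreme value also inflates the standard deviation. A common alternative, sketched here as a swapped-in technique rather than part of the example above, is the interquartile range (IQR) rule. The sample ages and the 1.5 multiplier are conventional but adjustable assumptions.

```python
import pandas as pd

# Hypothetical ages with one implausible entry (120).
ages = pd.Series([25, 30, 35, 40, 120], name="Age")

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = ages[(ages < lower) | (ages > upper)]
print(outliers)
```

Whether a flagged value should be removed, capped, or kept depends on the data; the rule only marks candidates for review.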

Combining and Splitting Data

Sometimes datasets need to be merged or split.

Merging DataFrames

import pandas as pd


# Example dataset
data = {
    "Name": ["Alice", "Bob", "Charlie", None, "Eve"],
    "Age": [25, 30, None, 35, 40],
    "City": ["New York", None, "Los Angeles", "Chicago", "New York"],
    "Income": ["50000", "60000", "unknown", "70000", "80000"],
}

df = pd.DataFrame(data)


additional_data = pd.DataFrame({"Name": ["Alice", "Charlie"], "Score": [85, 90]})
df = df.merge(additional_data, on="Name", how="left")
print(df)
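Splitting goes the other way. The sketch below shows two common cases on a hypothetical DataFrame named people: splitting one text column into two, and splitting rows into separate DataFrames with a boolean mask. The full names and the First and Last column names are made up for illustration.

```python
import pandas as pd

# Hypothetical full names for illustration.
people = pd.DataFrame({
    "Name": ["Alice Smith", "Bob Jones", "Eve Adams"],
    "City": ["New York", "Chicago", "New York"],
})

# Split one text column into two new columns on the first space.
people[["First", "Last"]] = people["Name"].str.split(" ", n=1, expand=True)

# Split rows into two DataFrames with a boolean mask.
in_new_york = people["City"] == "New York"
ny_df = people[in_new_york]
other_df = people[~in_new_york]

print(ny_df)
print(other_df)
```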

Conclusion

Data cleaning is an iterative process that involves understanding the dataset, addressing inconsistencies, and ensuring the data is prepared for further analysis. Pandas provides robust functionalities to handle these tasks efficiently.

Key Points Covered:

  1. Handling missing values
  2. Removing duplicates
  3. Converting data types
  4. Standardizing and formatting data
  5. Detecting and managing outliers
  6. Merging and splitting data

By mastering these techniques, you can streamline your data cleaning workflow and prepare datasets for any analytical or machine-learning pipeline.
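As a closing sketch, several of these steps can be chained in one function. The helper name clean and the specific choices inside it (median for Age, "Unknown" for City, dropping rows without a Name) are assumptions made for this example, not fixed rules.

```python
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper chaining steps covered in this guide."""
    out = df.copy()
    # Convert Income to numbers; unparseable values become NaN.
    out["Income"] = pd.to_numeric(out["Income"], errors="coerce")
    # Fill missing ages with the median and missing cities with a placeholder.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    out["City"] = out["City"].fillna("Unknown")
    # Rows without a name are treated as unusable in this sketch.
    out = out.dropna(subset=["Name"])
    out = out.drop_duplicates()
    return out.reset_index(drop=True)


raw = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None, "Eve"],
    "Age": [25, 30, None, 35, 40],
    "City": ["New York", None, "Los Angeles", "Chicago", "New York"],
    "Income": ["50000", "60000", "unknown", "70000", "80000"],
})

cleaned = clean(raw)
print(cleaned)
```

Working on a copy leaves the raw data untouched, which makes it easier to rerun the function as the cleaning rules evolve.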

Written by Sona
