In our data-focused world, analyzing data is key to unlocking its full potential and making informed decisions. For organizations, this analysis provides a competitive edge and allows for personalized approaches.
This article will delve into pandas, a powerful Python library crucial for data analysis. We'll cover its most important functions, making it accessible even to beginners. If you don't have Python installed, you can install it, or you can use Google Colaboratory for this tutorial.
1. Data Viewing
df.head():
If you have a large dataset and want to view only the first few rows, you can use df.head() after reading the data from a CSV file:
import pandas as pd
df = pd.read_csv("customers-100.csv", encoding="latin-1")  # Load the data
print(df.head())  # Show the first five rows
If you want, you can download the CSV file which I have used from here.
df.tail(): It displays the last five rows of the data.
import pandas as pd
df = pd.read_csv("customers-100.csv", encoding="latin-1")  # Load the data
print(df.tail())  # Show the last five rows

df.sample(n):
It displays n randomly selected rows from the data.
For example, let us take n = 6:
import pandas as pd
df = pd.read_csv("customers-100.csv", encoding="latin-1")  # Load the data
print(df.sample(6))  # Show six random rows

df.shape: It displays the data's dimensions as a (rows, columns) tuple. Note that shape is an attribute, not a method, so it is used without parentheses.
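A minimal sketch of df.shape, using a small inline DataFrame instead of the customers CSV so it runs anywhere:

```python
import pandas as pd

# Toy DataFrame with 3 rows and 2 columns
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

print(df.shape)  # (3, 2) -> 3 rows, 2 columns
```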
2. Statistics
This section contains the functions that help you perform statistics like average, min/max, and quartiles on your data.
df.describe(): Get the basic statistics (count, mean, min/max, quartiles) of each numeric column in the data.
import pandas as pd
df = pd.read_csv("customers-100.csv", encoding="latin-1")  # Load the data
print(df.describe())  # Show summary statistics

df.info():
Get information about the data types used and the non-null count of each column.
import pandas as pd
df = pd.read_csv("customers-100.csv", encoding="latin-1")  # Load the data
df.info()  # Prints its summary directly; wrapping it in print() would just print None

df.corr():
This gives you the correlation matrix between the numeric columns in the data frame. In recent pandas versions, pass numeric_only=True (or select the numeric columns first) if the frame also contains text columns.
import pandas as pd
df = pd.read_csv("customers-100.csv", encoding="latin-1")  # Load the data
print(df.corr(numeric_only=True))  # Correlation matrix of the numeric columns
df.memory_usage():
It tells you how much memory (in bytes) is being consumed by each column.
import pandas as pd
df = pd.read_csv("customers-100.csv", encoding="latin-1")  # Load the data
print(df.memory_usage())

3. Data Selection
These functions let you select the data of a specific row, a single column, or even multiple columns.
df.iloc[row_num]:
It selects a particular row based on its integer position.
For example, df.iloc[0] selects the first row:
import pandas as pd
df = pd.read_csv("customers-100.csv", encoding="latin-1")  # Load the data
print(df.iloc[0])  # First row, returned as a Series

df[col_name]: It selects the given column.
df[['col1', 'col2']]: It selects the given multiple columns.
import pandas as pd
df = pd.read_csv("customers-100.csv", encoding="latin-1")  # Load the data
print(df[["First Name", "Last Name"]])

4. Data Cleaning
The functions below are used to handle missing data. Some rows in the data contain null or garbage values, which can hamper the performance of a trained model, so it is always better to correct or remove these missing values.
df.isnull(): This will identify the missing values in your dataframe.
df.dropna(): This will remove the rows containing missing values in any column.
df.fillna(val): This will fill the missing values with the val given in the argument.
df['col'].astype(new_data_type): It can convert the data type of the selected column to a different data type.
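The cleaning functions above can be sketched on a small toy DataFrame (not the customers CSV) with a couple of deliberately missing values:

```python
import numpy as np
import pandas as pd

# Toy data with missing values (hypothetical, for illustration only)
df = pd.DataFrame({"name": ["Ann", "Bob", None],
                   "age": [25.0, np.nan, 30.0]})

print(df.isnull())   # True where a value is missing
print(df.dropna())   # Keeps only the fully populated row ("Ann")

# Fill missing values per column, then convert the age column to int
filled = df.fillna({"name": "unknown", "age": 0})
ages = filled["age"].astype(int)
print(ages.tolist())  # [25, 0, 30]
```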
5. Data Analysis
Data analysis is all about grouping, sorting and filtering. Let us take a look at some functions which are provided by pandas to achieve these.
Aggregation Functions
You can group a column by its name and then apply some aggregation functions like sum, min/max, mean, etc.
df.groupby("col_name_1").agg({"col_name_2": "sum"})
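The groupby/agg pattern above can be sketched with a small hypothetical sales table (the column names here are illustrative, not from the customers CSV):

```python
import pandas as pd

# Hypothetical sales data for illustration
df = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "SALES": [100, 200, 300, 400],
})

# Group rows by region, then sum SALES within each group
summary = df.groupby("region").agg({"SALES": "sum"})
print(summary)  # East -> 400, West -> 600
```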
Filtering Data:
We can filter rows based on a specific value or condition.
For example,
df[df["SALES"] > 5000]
This displays the rows where the value of SALES is greater than 5000.
You can also filter the dataframe using the query() function, which generates the same output as above. Note that the whole condition goes inside the string:
For example,
df.query("SALES > 5000")
Sorting Data
You can sort the data based on a specific column, in either ascending or descending order.
For example,
df.sort_values("SALES", ascending=False)  # Sorts the data in descending order
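Filtering and sorting can be combined; a minimal runnable sketch, assuming a hypothetical SALES column as in the examples above:

```python
import pandas as pd

# Hypothetical sales data for illustration
df = pd.DataFrame({"ORDER": [1, 2, 3, 4],
                   "SALES": [7000, 3000, 9000, 5500]})

big = df[df["SALES"] > 5000]          # boolean-mask filtering
same = df.query("SALES > 5000")       # equivalent query() form

# Sort the filtered rows, highest sales first
ranked = big.sort_values("SALES", ascending=False)
print(ranked["SALES"].tolist())  # [9000, 7000, 5500]
```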
For large datasets, the best pandas functions are those that allow for efficient data manipulation and analysis without causing performance issues. Some of the most useful functions for working with large datasets include:
read_csv() / read_excel(): These functions allow you to efficiently read large CSV or Excel files into a DataFrame. You can use parameters like chunksize to read the data in chunks, which is useful for processing large files without loading the entire dataset into memory.
to_csv() / to_excel(): These functions allow you to write a DataFrame to CSV or Excel files. You can also use the chunksize parameter to write data in chunks.
groupby(): This function is used for grouping data based on one or more columns and applying aggregate functions. It can be used to summarize large datasets efficiently.
apply(): This function applies a function along an axis of the DataFrame. It can be used to apply a custom function to each row or column of a large dataset.
concat(): This function is used to concatenate two or more DataFrames along a particular axis. It can be useful for combining large datasets efficiently.
merge() / join(): These functions are used to merge/join two DataFrames based on a common column. They can be useful for combining large datasets based on a key column.
pivot_table(): This function creates a spreadsheet-style pivot table as a DataFrame. It can be used to summarize and analyze large datasets in a tabular format.
astype(): This function is used to change the data type of a column in a DataFrame, which can be helpful for optimizing memory usage for large datasets.
sample(): This function is used to get a random sample of rows or columns from a DataFrame, which can be useful for quickly exploring large datasets.
cut() / qcut(): These functions are used for binning numerical data into discrete intervals, which can be helpful for analyzing large datasets with continuous variables.
drop_duplicates(): This function is used to remove duplicate rows from a DataFrame, which can be useful for cleaning up large datasets.
isna() / notna(): These functions are used to check for missing (NaN) values in a DataFrame, which can be useful for data cleaning and validation.
query(): This function allows you to filter a DataFrame using a boolean expression. It can be more efficient than using boolean indexing for large datasets.
memory_usage(): This function returns the memory usage of each column in a DataFrame. It can be useful for optimizing memory usage when working with large datasets.
nsmallest() / nlargest(): These functions are used to get the smallest or largest n rows based on a specified column or columns. They can be useful for quickly identifying outliers or top performers in a large dataset.
iterrows(): This function returns an iterator that iterates over the rows of a DataFrame as (index, Series) pairs. It can be useful for iterating over large datasets row by row.
to_pickle() / read_pickle(): These functions are used to serialize and deserialize a DataFrame to and from a pickle file. Pickle files can be more efficient for storing and reading large datasets compared to CSV or Excel files.
to_hdf() / read_hdf(): These functions are used to write and read a DataFrame to and from an HDF5 file, which is a high-performance file format for storing large datasets.
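The chunksize pattern mentioned above can be sketched as follows; an in-memory buffer stands in for a large file here, but with a real dataset you would pass the file path to pd.read_csv instead:

```python
import io
import pandas as pd

# Simulate a CSV file with one column "x" holding the numbers 0..9
csv = io.StringIO("x\n" + "\n".join(str(i) for i in range(10)))

# Read in chunks of up to 4 rows, aggregating as we go, so the
# whole dataset never has to sit in memory at once
total = 0
for chunk in pd.read_csv(csv, chunksize=4):
    total += chunk["x"].sum()

print(total)  # 0 + 1 + ... + 9 = 45
```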
The Python pandas library enables us to perform advanced data analysis and manipulation. These are only a few of its functions; there are many more. The important thing to remember is that the choice of techniques should suit your needs and the dataset you are working with.




