Polars is a high-performance DataFrame library for Python that offers fast, efficient data manipulation and analysis.
Unlike Pandas, Polars is built on Rust, making it exceptionally fast and memory-efficient, especially with large datasets. It is ideal for tasks where performance is critical, such as data preprocessing, transformation, and analysis in data science or machine learning workflows.
Key Features of Polars
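- Speed: Polars often outperforms Pandas on large datasets, thanks to its Rust core, columnar execution, and built-in parallelism.
- Lazy Execution: In lazy mode, Polars builds a query plan, optimizes it as a whole (for example, by applying filters and column selections as early as possible), and runs it only when a result is requested.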
- Memory Efficiency: Polars uses Apache Arrow as its memory format, allowing efficient memory usage.
- Multi-threading: It is multi-threaded, enabling parallel execution of operations.
- Expressive API: Polars provides an intuitive and user-friendly API for data manipulation.
Let’s dive into a detailed explanation with examples to understand how Polars works.
Installation of Polars
Before using Polars, you need to install it via pip:
pip install polars
Basic Usage of Polars
1. Creating a DataFrame
You can create a DataFrame in Polars in multiple ways, similar to how you would do it in Pandas. Here’s a basic example:
import polars as pl
# Creating a DataFrame from a dictionary
data = {
"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 32, 45],
"City": ["New York", "Los Angeles", "Chicago"]
}
df = pl.DataFrame(data)
print(df)
Output:
shape: (3, 3)
┌─────────┬─────┬─────────────┐
│ Name    │ Age │ City        │
│ ---     │ --- │ ---         │
│ str     │ i64 │ str         │
├─────────┼─────┼─────────────┤
│ Alice   │ 25  │ New York    │
│ Bob     │ 32  │ Los Angeles │
│ Charlie │ 45  │ Chicago     │
└─────────┴─────┴─────────────┘
In this example, we created a simple DataFrame with columns for names, ages, and cities. Polars automatically infers the data types of each column.
2. Selecting and Filtering Data
Polars offers a powerful query interface for selecting and filtering data. Let’s see how to select specific columns and filter rows based on conditions.
# Selecting specific columns
selected_df = df.select(["Name", "Age"])
print(selected_df)
# Filtering rows where age is greater than 30
filtered_df = df.filter(pl.col("Age") > 30)
print(filtered_df)
Output:
shape: (3, 2)
┌─────────┬─────┐
│ Name    │ Age │
│ ---     │ --- │
│ str     │ i64 │
├─────────┼─────┤
│ Alice   │ 25  │
│ Bob     │ 32  │
│ Charlie │ 45  │
└─────────┴─────┘
shape: (2, 3)
┌─────────┬─────┬─────────────┐
│ Name    │ Age │ City        │
│ ---     │ --- │ ---         │
│ str     │ i64 │ str         │
├─────────┼─────┼─────────────┤
│ Bob     │ 32  │ Los Angeles │
│ Charlie │ 45  │ Chicago     │
└─────────┴─────┴─────────────┘
The first operation selects the “Name” and “Age” columns, while the second keeps only the rows where the “Age” column is greater than 30.
3. Lazy Execution in Polars
Polars' lazy execution mode allows deferred evaluation of operations, which can optimize performance, especially when working with large datasets. In lazy mode, operations are not executed immediately but rather when the result is explicitly requested.
# Creating a lazy DataFrame
lazy_df = df.lazy()
# Defining operations
result = (
lazy_df
.filter(pl.col("Age") > 30)
.select(["Name", "Age"])
.collect() # Triggers execution
)
print(result)
Output:
shape: (2, 2)
┌─────────┬─────┐
│ Name    │ Age │
│ ---     │ --- │
│ str     │ i64 │
├─────────┼─────┤
│ Bob     │ 32  │
│ Charlie │ 45  │
└─────────┴─────┘
In this example, operations are defined but not executed until .collect() is called. This feature is helpful in optimizing performance when chaining multiple operations together.
4. GroupBy Operations
Polars also supports group-by operations for aggregating data. Let’s group by the “City” column and calculate the average age for each city.
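# Grouping by city and calculating average age
# (groupby was renamed to group_by in recent Polars releases;
# maintain_order=True keeps the groups in order of first appearance)
grouped_df = df.group_by("City", maintain_order=True).agg([
    pl.col("Age").mean().alias("Average_Age")
])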
print(grouped_df)
Output:
shape: (3, 2)
┌─────────────┬─────────────┐
│ City        │ Average_Age │
│ ---         │ ---         │
│ str         │ f64         │
├─────────────┼─────────────┤
│ New York    │ 25.0        │
│ Los Angeles │ 32.0        │
│ Chicago     │ 45.0        │
└─────────────┴─────────────┘
Here, we grouped the data by the “City” column and calculated the mean of the “Age” column for each group.
5. Joins in Polars
Polars supports various types of joins, just like Pandas. Here’s an example of how to perform a left join between two DataFrames:
# Creating another DataFrame
data2 = {
"City": ["New York", "Chicago", "San Francisco"],
"Population": [8500000, 2700000, 880000]
}
df2 = pl.DataFrame(data2)
# Performing a left join
joined_df = df.join(df2, on="City", how="left")
print(joined_df)
Output:
shape: (3, 4)
┌─────────┬─────┬─────────────┬────────────┐
│ Name    │ Age │ City        │ Population │
│ ---     │ --- │ ---         │ ---        │
│ str     │ i64 │ str         │ i64        │
├─────────┼─────┼─────────────┼────────────┤
│ Alice   │ 25  │ New York    │ 8500000    │
│ Bob     │ 32  │ Los Angeles │ null       │
│ Charlie │ 45  │ Chicago     │ 2700000    │
└─────────┴─────┴─────────────┴────────────┘
In this example, we joined two DataFrames on the “City” column using a left join. The Population column was added to the resulting DataFrame.
Performance Comparison with Pandas
One of the main reasons to choose Polars over Pandas is performance, especially with larger datasets. Let’s compare the performance of Polars and Pandas when performing the same operation on a large dataset.
import pandas as pd
import polars as pl
import numpy as np
import time
# Creating a large DataFrame with Pandas
large_data = {
    "A": np.random.randint(0, 100, size=10**6),
    "B": np.random.rand(10**6)
}
pandas_df = pd.DataFrame(large_data)
# Timing Pandas operation (perf_counter is better suited to short intervals than time.time)
start = time.perf_counter()
pandas_df["C"] = pandas_df["A"] * pandas_df["B"]
print(f"Pandas: {time.perf_counter() - start:.4f} seconds")
# Creating a large DataFrame with Polars
polars_df = pl.DataFrame(large_data)
# Timing Polars operation (with_columns replaces the removed with_column method)
start = time.perf_counter()
polars_df = polars_df.with_columns((pl.col("A") * pl.col("B")).alias("C"))
print(f"Polars: {time.perf_counter() - start:.4f} seconds")
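No sample output is shown here, because the timings depend on your hardware, library versions, and system load. Running the snippet gives a rough sense of the speed difference between Polars and Pandas. Keep in mind that a single timed run of a simple element-wise multiplication is a noisy measurement. Polars' advantage tends to show more clearly on larger datasets and on multi-step transformations such as group-bys, joins, and chained filters.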
Polars offers several advantages over Pandas, especially in terms of performance, scalability, and memory efficiency. Below are more examples of how to use Polars, showcasing its features, and explaining why it is better than Pandas for certain tasks.
Example 1: Lazy Execution for Optimized Querying
In Polars, lazy execution defers the execution of operations until a result is explicitly requested. This helps Polars optimize query execution by combining multiple operations and reducing unnecessary computations. This is particularly useful when dealing with large datasets.
import polars as pl
# Creating a DataFrame with Polars
data = {
"Name": ["Alice", "Bob", "Charlie", "David"],
"Age": [25, 32, 45, 28],
"Salary": [50000, 60000, 120000, 85000]
}
df = pl.DataFrame(data)
# Converting to lazy DataFrame and chaining multiple operations
lazy_df = df.lazy()
result = (
lazy_df
.filter(pl.col("Age") > 30)
.with_columns((pl.col("Salary") * 1.1).alias("New_Salary"))
.select(["Name", "New_Salary"])
.collect() # Executes the query
)
print(result)
Output:
shape: (2, 2)
┌─────────┬────────────┐
│ Name    │ New_Salary │
│ ---     │ ---        │
│ str     │ f64        │
├─────────┼────────────┤
│ Bob     │ 66000.0    │
│ Charlie │ 132000.0   │
└─────────┴────────────┘
Why It’s Better Than Pandas:
- Optimization: Since operations are deferred until .collect() is called, Polars can optimize the entire chain of operations, potentially reducing redundant calculations.
- Efficiency: Lazy execution reduces the overhead of large-scale computations by eliminating unnecessary data movement or transformations.
Example 2: Multi-threaded Performance
Polars supports multi-threading, meaning it can perform operations in parallel, speeding up computations on multi-core machines.
import polars as pl
import numpy as np
# Generating a large dataset
data = {
"A": np.random.randint(0, 100, size=10**6),
"B": np.random.rand(10**6)
}
df = pl.DataFrame(data)
# Applying a multi-threaded operation
result = df.with_columns((pl.col("A") * pl.col("B")).alias("C"))
print(result.head())
Output (the values differ from run to run, because the data is random):
shape: (5, 3)
┌─────┬──────────┬───────────┐
│ A   │ B        │ C         │
│ --- │ ---      │ ---       │
│ i64 │ f64      │ f64       │
├─────┼──────────┼───────────┤
│ 21  │ 0.317263 │ 6.662525  │
│ 22  │ 0.839108 │ 18.460375 │
│ 99  │ 0.690399 │ 68.349531 │
│ 14  │ 0.333076 │ 4.663064  │
│ 77  │ 0.170543 │ 13.131811 │
└─────┴──────────┴───────────┘
Why It’s Better Than Pandas:
- Multi-threading: Pandas operations are usually single-threaded, while Polars can run computations in parallel, taking advantage of multiple CPU cores to significantly speed up operations on large datasets.
Example 3: GroupBy Aggregations
Polars excels in handling group-by operations efficiently, particularly when working with large datasets.
Polars Example:
import polars as pl
# Creating a DataFrame
data = {
"City": ["New York", "Los Angeles", "New York", "Chicago", "Chicago", "Los Angeles"],
"Sales": [100, 200, 150, 300, 400, 100]
}
df = pl.DataFrame(data)
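# Performing a group-by operation and calculating total and average sales for each city
# (groupby was renamed to group_by in recent Polars releases;
# maintain_order=True keeps the groups in order of first appearance)
grouped_df = df.group_by("City", maintain_order=True).agg([
    pl.col("Sales").sum().alias("Total_Sales"),
    pl.col("Sales").mean().alias("Average_Sales")
])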
print(grouped_df)
Output:
shape: (3, 3)
┌─────────────┬─────────────┬───────────────┐
│ City        │ Total_Sales │ Average_Sales │
│ ---         │ ---         │ ---           │
│ str         │ i64         │ f64           │
├─────────────┼─────────────┼───────────────┤
│ New York    │ 250         │ 125.0         │
│ Los Angeles │ 300         │ 150.0         │
│ Chicago     │ 700         │ 350.0         │
└─────────────┴─────────────┴───────────────┘
Why It’s Better Than Pandas:
- Speed: Polars uses efficient Rust-based operations that are optimized for performance during group-by and aggregation tasks, making it faster than Pandas, especially with large datasets.
- Parallelization: Polars can perform group-by operations in parallel, further enhancing speed for larger datasets.
Example 4: Memory Efficiency with Arrow
Polars uses Apache Arrow for its memory model, which allows for efficient storage and transfer of data, reducing memory overhead.
Polars Example:
import polars as pl
import numpy as np
# Generating a large dataset
data = {
"A": np.random.randint(0, 100, size=10**6),
"B": np.random.rand(10**6)
}
df = pl.DataFrame(data)
# Checking memory usage
print(f"Memory usage of Polars DataFrame: {df.estimated_size()} bytes")
Why It’s Better Than Pandas:
- Memory Efficiency: Polars stores data in the Apache Arrow columnar memory format, which allows zero-copy data sharing with other Arrow-based tools and efficient memory usage. Pandas often consumes more memory for the same data.
- Columnar Storage: Arrow’s columnar format allows for faster access to columns, which is beneficial for analytical queries.
Example 5: Handling Missing Data
Polars provides efficient handling of missing data using its .fill_null() method, which can replace missing values with specific defaults, values from other columns, or computed results.
Polars Example:
import polars as pl
# Creating a DataFrame with missing data
data = {
"Name": ["Alice", "Bob", None, "David"],
"Age": [25, None, 45, 28],
"Salary": [50000, 60000, None, 85000]
}
df = pl.DataFrame(data)
# Filling null values
filled_df = df.with_columns([
pl.col("Name").fill_null("Unknown"),
pl.col("Age").fill_null(pl.col("Age").mean()),
pl.col("Salary").fill_null(50000)
])
print(filled_df)
Output:
shape: (4, 3)
┌─────────┬───────────┬────────┐
│ Name    │ Age       │ Salary │
│ ---     │ ---       │ ---    │
│ str     │ f64       │ i64    │
├─────────┼───────────┼────────┤
│ Alice   │ 25.0      │ 50000  │
│ Bob     │ 32.666667 │ 60000  │
│ Unknown │ 45.0      │ 50000  │
│ David   │ 28.0      │ 85000  │
└─────────┴───────────┴────────┘
Why It’s Better Than Pandas:
- Speed: Polars can handle missing data efficiently even in large datasets.
- Flexibility: You can fill null values based on computations (e.g., column means) or other flexible options.
Summary: How Polars is Better than Pandas
- Performance: Polars is designed for speed, leveraging Rust and multi-threading to handle large datasets efficiently.
- Memory Efficiency: Using Apache Arrow as its memory model, Polars consumes less memory and supports zero-copy data sharing, which is more efficient than Pandas.
- Lazy Execution: Polars optimizes chained operations by executing them only when needed, unlike Pandas, which executes each step immediately.
- Parallelization: Many operations in Polars are parallelized, which significantly improves performance on modern multi-core machines.
- Advanced Querying: Polars' expression API supports complex operations such as multi-key group-bys, joins, and window functions, and these can be combined within a single optimized query.
Polars is particularly beneficial for large-scale data processing, real-time data applications, and when working with high-performance computing environments. It shines in areas where Pandas can struggle due to memory constraints or speed limitations, making it an excellent alternative for performance-critical applications.




