Polars vs. pandas: What’s the Difference?

Jodie Burchell

Why use Polars over pandas?

In a word: performance. Polars was built from the ground up to be blazingly fast and can do common operations around 5–10 times faster than pandas. In addition, the memory requirement for Polars operations is significantly smaller than for pandas: pandas requires around 5 to 10 times as much RAM as the size of the dataset to carry out operations, compared to the 2 to 4 times needed for Polars.

You can get an idea of how Polars performs compared to other dataframe libraries here. As you can see, Polars is between 10 and 100 times as fast as pandas for common operations and is actually one of the fastest DataFrame libraries overall. Moreover, it can handle larger datasets than pandas can before running into out-of-memory errors.

Why is Polars so fast?

These results are extremely impressive, so you might be wondering: How can Polars get this sort of performance while still running on a single machine? The library was designed with performance in mind from the beginning, and this is achieved through a few different means.

Written in Rust

One of the most well-known facts about Polars is that it is written in Rust, a low-level language that is almost as fast as C and C++. In contrast, pandas is built on top of Python libraries, one of these being NumPy. While NumPy’s core is written in C, it is still hamstrung by inherent problems with the way Python handles certain types in memory, such as strings for categorical data, leading to poor performance when handling these types (see this fantastic blog post from Wes McKinney for more details).

One of the other advantages of using Rust is that it allows for safe concurrency; that is, it is designed to make parallelism as predictable as possible. This means that Polars can safely use all of your machine’s cores for even complex queries involving multiple columns, which led Ritchie Vink to describe Polars’ performance as “embarrassingly parallel”. This gives Polars a massive performance boost over pandas, which only uses one core to carry out operations. Check out this excellent talk by Nico Kreiling from PyCon DE this year, which goes into more detail about how Polars achieves this.

Based on Arrow

Another factor that contributes to Polars’ impressive performance is Apache Arrow, a language-independent columnar memory format. Arrow was actually co-created by Wes McKinney in response to many of the issues he saw with pandas as the size of data exploded. It is also available as an optional backend in pandas 2.0, a more performant version of pandas released in March of this year. The Arrow backends of the libraries do differ slightly, however: while pandas 2.0 relies on PyArrow, the Polars team built their own Arrow implementation.

One of the main advantages of building a data library on Arrow is interoperability. Arrow has been designed to standardize the in-memory data format used across libraries, and it is already used by a number of important libraries and databases.


Query optimization

Another key to Polars’ performance is how it evaluates code. Pandas, by default, uses eager execution, carrying out operations in the order you’ve written them. In contrast, Polars can do both eager and lazy execution; in lazy mode, a query optimizer evaluates all of the required operations and maps out the most efficient way of executing the code. This can include, among other things, rewriting the execution order of operations or dropping redundant calculations. Take, for example, the following expression to get the mean of column Number1 for each of the categories “A” and “B” in Category:

(
    df.group_by("Category")
    .agg(pl.col("Number1").mean())
    .filter(pl.col("Category").is_in(["A", "B"]))
)

If this expression is eagerly executed, the groupby operation will be unnecessarily performed for the whole DataFrame, and then filtered by Category. With lazy execution, the DataFrame can be filtered and groupby performed on only the required data.

Expressive API

Finally, Polars has an extremely expressive API, meaning that almost any operation you want to perform can be expressed as a Polars method. In contrast, more complex operations in pandas often need to be passed to the apply method as a lambda expression. The problem with the apply method is that it loops over the rows of the DataFrame, sequentially executing the operation on each one. Being able to use built-in methods allows you to work on a columnar level and take advantage of another form of parallelism called SIMD (single instruction, multiple data).

When should you stick with pandas?

All of this sounds so amazing that you’re probably wondering why you would even bother with pandas anymore. Not so fast! While Polars is superb for doing extremely efficient data transformations, it is currently not the optimal choice for data exploration or for use as part of machine learning pipelines. These are areas where pandas continues to shine.

One of the reasons for this is that while Polars has great interoperability with other packages using Arrow, it is not yet compatible with most of the Python data visualization packages or with machine learning libraries such as scikit-learn and PyTorch. The only exception is Plotly, which allows you to create charts directly from Polars DataFrames.

A solution that is being discussed is using the Python dataframe interchange protocol in these packages to allow them to support a range of dataframe libraries, which would mean that data science and machine learning workflows would no longer be bottlenecked by pandas. However, this is a relatively new idea, and it will take time for these projects to implement.

Tooling for Polars and pandas

After all of this, I am sure you are eager to try Polars yourself! PyCharm Professional for Data Science offers excellent tooling for working with both pandas and Polars in Jupyter notebooks. In particular, pandas and Polars DataFrames are displayed with interactive functionality, which makes exploring your data much quicker and more comfortable.

Some of my favorite features include the ability to scroll through all rows and columns of the DataFrame without truncation, get aggregations of DataFrame values in one click, and export the DataFrame in a huge range of formats (including Markdown!).

If you’re not yet using PyCharm, you can try it with a 30-day trial by following the link below.

Start your PyCharm Pro free trial

Polars vs. pandas: What’s the Difference? | The PyCharm Blog (2024)