Exploring the Exciting New Features in Pandas 2.0

Introduction to Pandas 2.0

We have some exciting news for data scientists and analysts: the pandas 2.0 release candidate is now available after three years of development. The new version comes with several enhancements and features that promise to make data manipulation tasks even more efficient. One of the significant highlights of pandas 2.0 is improved support for extension arrays, which gives users greater flexibility to work with custom data types.

Furthermore, pandas 2.0 introduces non-nanosecond datetime resolution and PyArrow-backed DataFrames. However, it’s essential to note that alongside these exciting updates come enforced deprecations and API changes that users need to be aware of before utilizing the new features. In this article, we will explore these changes before diving into how the new features can enhance your workflow.

What is Pandas?

Pandas is a popular open-source data manipulation library for Python. It provides high-performance, easy-to-use data structures and data analysis tools that simplify the handling of structured data. The library is built on top of NumPy, another popular Python library for scientific computing, and offers efficient data structures for handling time series data, tabular data, and matrices.

Pandas is particularly useful for data cleaning, exploration, and analysis. It offers tools for reading and writing data in various file formats, including CSV, Excel, and SQL databases, as well as tools for data manipulation, aggregation, and filtering. With its rich set of functions and methods, pandas allows users to perform complex operations on large datasets with ease.

The primary data structure in pandas is the DataFrame, a two-dimensional, table-like structure that stores data in rows and columns. The DataFrame provides:

  • A variety of methods for indexing, selecting, and manipulating data.
  • Tools for working with missing values.
  • Methods for merging and joining datasets.
  • Functions for performing statistical analysis.
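
To make this concrete, here is a minimal sketch of a DataFrame in action (the column names and values are purely illustrative):

import pandas as pd
import numpy as np

df = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "score": [90.0, np.nan, 75.0]})

print(df["score"].fillna(0).mean())
Out: 55.0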

There is a wide range of uses for Pandas in data science, machine learning, finance, and other fields where data analysis is important. Its ease of use, adaptability, and capacity for handling big datasets all contribute to its appeal. Additionally, pandas has a large and active community of users and contributors, which ensures that the library is continuously updated and improved.


API Changes in Pandas 2.0

The pandas 2.0 release is a significant milestone, as all the deprecations added in the 1.x series have been enforced. The 1.x series added approximately 150 deprecation warnings in total; if your code runs without warnings on the latest 1.5.3 release, you can confidently migrate to 2.0. However, it’s crucial to be aware of some subtle, as well as more noticeable, deprecations before utilizing the new features. As we move along, we’ll briefly examine some of these deprecations; refer to the official release notes for a complete list.

Improved Indexing in Pandas 2.0

One significant change introduced in pandas 2.0 is that the Index now supports arbitrary NumPy dtypes. Prior to this release, numeric indexes were limited to the int64, float64, and uint64 dtypes, represented by the dedicated Int64Index, Float64Index, and UInt64Index classes. These classes have been removed in pandas 2.0, and all numeric indexes are now represented as Index with an associated dtype. For example, we can create an Index with int64 dtype as follows:

import pandas as pd

idx = pd.Index([1, 2, 3], dtype="int64")

print(idx)
Out: Index([1, 2, 3], dtype='int64')

This behavior now mirrors that of extension array backed Indexes, which have been supported since pandas 1.4.0. This change is particularly noticeable when using an explicit Index subclass that no longer exists in pandas 2.0.
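
This added flexibility means an Index can now hold dtypes that previously had no dedicated class; for instance, creating an Index with int8 dtype now preserves that dtype:

import pandas as pd

idx = pd.Index([1, 2, 3], dtype="int8")

print(idx)
Out: Index([1, 2, 3], dtype='int8')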

Overall, this change provides greater flexibility and improved support for different data types, making indexing in pandas more efficient and versatile. 


Improved Numeric Aggregation in Pandas 2.0

In previous versions of pandas, calling aggregation functions on a DataFrame with mixed dtypes produced varying outcomes: sometimes the aggregation silently excluded non-numeric columns, while at other times it raised an error. In pandas 2.0, the behavior of the numeric_only argument has been made consistent: applying an aggregation that cannot handle non-numeric data to a mixed-dtype DataFrame now raises an error. To obtain the same behavior as before, set numeric_only to True or limit your DataFrame to numeric columns beforehand; making the choice explicit prevents you from accidentally dropping relevant columns from the result.

For instance, in previous versions of pandas, calculating the mean over a DataFrame with mixed dtypes dropped non-numeric columns:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

print(df.mean())
Out: 
a    2.0
dtype: float64

However, in pandas 2.0, this operation raises an error to prevent the loss of relevant columns in these aggregations:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

print(df.mean())
TypeError: Could not convert ['x', 'y', 'z'] to numeric

This change ensures that the numeric-only argument works consistently with mixed-dtype DataFrames, thereby providing better support for numeric aggregations.
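
To reproduce the old behavior explicitly, pass numeric_only=True so the aggregation is restricted to numeric columns:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

print(df.mean(numeric_only=True))
Out:
a    2.0
dtype: float64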

Improvements and New Features

pandas 2.0 introduces several exciting new features and enhancements that aim to improve performance, accuracy, and overall functionality. The most notable ones are listed below:

Enhancements to nullable dtypes and extension arrays

In pandas 2.0, support for nullable dtypes and extension arrays has improved significantly. When dealing with nullable dtypes such as Int64, boolean, or Float64, the new release uses nullable semantics instead of casting to object. This yields a number of performance improvements; for instance, drop_duplicates() is significantly faster in pandas 2.0 than in pandas 1.5.3 when operating on nullable columns.

The groupby algorithms now use nullable semantics as well, resulting in better accuracy and performance. To make opting into nullable dtypes easier, a new keyword, dtype_backend, was added to most I/O functions. When set to "numpy_nullable", it returns a DataFrame completely backed by nullable dtypes, which is faster than using NumPy arrays with dtype object. String columns use the pandas StringDtype instead of NumPy object arrays and are backed by either Python strings or PyArrow strings, depending on the storage option.
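
As a minimal sketch of the new keyword (using an in-memory CSV purely for illustration):

import io
import pandas as pd

data = io.StringIO("a,b\n1,x\n2,y\n,z")

df = pd.read_csv(data, dtype_backend="numpy_nullable")

print(df.dtypes)

The integer column comes back as Int64, with pd.NA marking the missing entry, rather than being cast to float64; the string column uses the pandas string dtype instead of object.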

There is now better integration between the Index and MultiIndex classes and extension arrays. Index support for extension arrays was introduced in pandas 1.4.0 and has been continuously improving. In addition, extension array semantics are used for index operations, efficient indexing operations are applied to nullable and PyArrow types, and MultiIndexes are no longer materialized, which improves performance and preserves dtypes. However, some areas are still under development, such as GroupBy aggregations for third-party extension arrays.

PyArrow-backed DataFrames in Pandas 2.0

Pandas 2.0 introduces PyArrow-backed DataFrames, which offer a significant improvement when operating on string columns compared to the NumPy object representation used previously. The PyArrow-specific extension array supports all PyArrow dtypes, allowing users to create columns with any PyArrow dtype and/or use PyArrow nullable semantics. Users can create a PyArrow-backed column by casting to, or specifying a column’s dtype as, "{dtype}[pyarrow]" (for example, "int64[pyarrow]"), or by constructing a pd.ArrowDtype from a PyArrow type.
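
Both spellings are illustrated in this minimal sketch:

import pandas as pd
import pyarrow as pa

ser = pd.Series([1, 2, None], dtype="int64[pyarrow]")

ser2 = pd.Series([1, 2, None], dtype=pd.ArrowDtype(pa.int64()))

print(ser.dtype)
Out: int64[pyarrow]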

In pandas 1.5.0, the API for PyArrow-backed DataFrames was experimental and frequently fell back to NumPy. With pandas 2.0 and a minimum PyArrow version of 7.0, the corresponding PyArrow compute functions are now used in many more methods, resulting in improved performance and eliminating the PerformanceWarnings that were previously raised when falling back to NumPy. Most I/O methods can now return PyArrow-backed DataFrames through the keyword dtype_backend="pyarrow".

Some I/O methods have specific PyArrow engines, such as read_csv and read_json, which offer a significant performance improvement when requesting PyArrow-backed DataFrames. However, these methods do not yet support all the options that the original implementations support. Future versions of pandas are expected to bring many more improvements in this area.
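
For instance, read_csv can combine the PyArrow parsing engine with the PyArrow dtype backend; a minimal sketch, again using an in-memory buffer for illustration:

import io
import pandas as pd

data = io.BytesIO(b"a,b\n1,x\n2,y\n")

df = pd.read_csv(data, engine="pyarrow", dtype_backend="pyarrow")

print(df.dtypes)

Every column is then backed by an ArrowDtype, for example int64[pyarrow] for the integer column.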

Non-nanosecond resolution in Timestamps in Pandas 2.0

Pandas has long been restricted to representing timestamps in nanosecond resolution, which meant that dates before September 21, 1677, and after April 11, 2262, could not be represented. This was particularly problematic for researchers who needed to analyze time series spanning millennia and beyond.

With the introduction of Pandas 2.0, timestamps now support other resolutions such as seconds, milliseconds, and microseconds. This allows for time ranges up to +/- 2.9e11 years, which should cover most common use cases.

In previous versions of Pandas, attempting to construct a timestamp outside the supported range would raise an error regardless of the specified unit. However, in Pandas 2.0, the unit is now honored, enabling the creation of arbitrary dates. For instance:

pd.Timestamp("1000-10-11", unit="s")

will return

Timestamp('1000-10-11 00:00:00')

Note that the timestamp is returned with second resolution; higher precision is not supported when specifying unit="s".
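
The resolution that a Timestamp carries can be inspected through its unit attribute, which is new in pandas 2.0:

import pandas as pd

ts = pd.Timestamp("1000-10-11", unit="s")

print(ts.unit)
Out: s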

Support for non-nanosecond timestamp resolutions is still under active development, and many methods were written under the assumption that timestamps always have nanosecond resolution, so some bugs may remain in various areas. The development team is working to resolve these issues and provide a seamless experience for users.


Copy-on-Write Improvements in Pandas 2.0

Copy-on-Write (CoW) was first introduced in pandas 1.5.0 to prevent a DataFrame or Series object that shares data with another object from being modified in place. The implementation in pandas 1.5 provided the general mechanism but not much else. Since then, a number of bugs that did not respect CoW have been fixed.

In pandas 2.0, CoW is used extensively to defer copying the underlying data until an object’s data are modified. This avoids unnecessary copying of data and results in significant performance improvements. Moreover, enabling CoW leads to a cleaner and easier-to-work-with API.

As a result, if your application does not depend on updating more than one object at once and does not use chained assignment, enabling CoW poses negligible risk. Moreover, developing new features with CoW enabled is recommended to avoid migration issues later on.

To enable CoW, set the copy_on_write option as shown below:

pd.options.mode.copy_on_write = True
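
With the option enabled, modifying an object that shares data with another one triggers a copy under the hood instead of mutating the parent; a minimal sketch:

import pandas as pd

pd.options.mode.copy_on_write = True

df = pd.DataFrame({"a": [1, 2, 3]})

subset = df["a"]

subset.iloc[0] = 100

print(df["a"].iloc[0])
Out: 1

Without CoW, the assignment to subset could also have updated df; with CoW enabled, df keeps its original values.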

A PDEP has been proposed to deprecate and remove the inplace and copy keywords in most methods, since they become obsolete when CoW is enabled. The proposal is still under discussion, but if accepted, these keywords will be removed once CoW becomes the default.

Conclusion

In conclusion, pandas 2.0 introduces exciting new features, such as non-nanosecond resolution in Timestamps, Copy-on-Write improvements, and more. These new features will make working with pandas even more efficient and user-friendly. We hope you found this overview helpful in understanding some of the critical updates in pandas 2.0. We encourage you to explore the new features and provide feedback on this release in the comments below. Thank you for reading!
