Python Pandas AI: The Next-Level Tool for Data Analysis and Manipulation

Picture a world where you can engage with your data in a friendly and intuitive manner. Enter Pandas AI, a remarkable Python library infused with the power of generative artificial intelligence. So say goodbye to those grueling hours spent scrutinizing rows and columns.

But fear not; Pandas AI isn’t out to replace your beloved Pandas. Instead, it aims to augment and elevate your data analysis and manipulation endeavors. Think of it as the ultimate superhero sidekick, ready to swoop in and save the day while making your life much smoother.

The potential of Pandas AI knows no bounds. Envision having a dataframe that compiles its reports and delves into intricate data sets, presenting you with easily digestible summaries. With Pandas AI by your side, the realm of possibilities expands exponentially.

What is Python Pandas AI?

Let’s delve into what exactly Pandas AI entails.

Pandas AI revolutionizes the way we interact with data frames. It allows you to engage in a literal conversation with your dataset. Yes, you heard it right! You can communicate with your data and receive prompt responses. As a data scientist or analyst, you no longer need to spend countless hours poring over rows and columns. Pandas AI doesn’t replace Pandas but propels it forward with significant advancements.

Data scientists and analysts traditionally devote substantial time to cleansing data during the analysis phase. However, with Pandas AI, they can take their data analysis to new heights. These data professionals now have access to various methods and processes that streamline data preparation, significantly minimizing the time spent on this tedious task.

It’s essential to note that Pandas AI is designed to complement and work hand-in-hand with Pandas, not as a substitute. So, rather than sifting through the dataset and seeking answers alone, you can pose questions to Pandas AI. In return, it will provide solutions in the form of Pandas DataFrames.

Now, you may wonder whether proficiency in Python and tools like the Pandas library is still necessary for data analysis. With the assistance of the OpenAI API, Pandas AI lets you ask a question in plain language instead of writing the code yourself. Behind the scenes, the model generates Pandas code for your question, and the result comes back to you, typically as a DataFrame. Knowing Pandas still helps you check that the answer is what you actually asked for.

Pandas AI Setup

Before diving into the world of Pandas AI, let’s go through the setup process.

To begin, you’ll need to install the Pandas AI Python library. You can do this effortlessly by using pip, a package installer for Python. Using a terminal or command prompt, enter the following command:

pip install pandasai

Once the installation is complete, we’re ready to move on to the next step. To use Pandas AI, you’ll need an OpenAI API key. If you don’t have one yet, don’t worry! You can generate one from your account on the OpenAI website.
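One common way to keep the key out of your source code is to read it from an environment variable. The sketch below assumes the key is stored in a variable named OPENAI_API_KEY; the helper load_openai_key is hypothetical and not part of Pandas AI.

```python
import os


def load_openai_key(env=None):
    """Return the OpenAI API key from the environment, or raise if it is missing.

    OPENAI_API_KEY is an assumed variable name, and load_openai_key is a
    hypothetical helper written for this example, not part of Pandas AI.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("Set the OPENAI_API_KEY environment variable first.")
    return key
```

You could then hand the returned value to the model wrapper when you create it, provided your installed version of Pandas AI accepts the key as an argument.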

Running Pandas AI Walkthrough

Let’s walk through the process of running the Pandas AI model on your DataFrame.

First, you need to initialize the Pandas AI model with your OpenAI model:

pandas_ai = PandasAI(openAImodel)

Once the model is set up, you can query your DataFrame with the run method. This method takes two parameters: the DataFrame you’re working with and the question you want to ask:

pandas_ai.run(df, prompt='the question you would like to ask?')

For instance, suppose you’re exploring a dataset of countries and want to find the five with the highest happiness index. You can achieve this using Pandas AI. Here’s an example code snippet:

import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI

# Sample DataFrame
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12]
})

# Instantiate an OpenAI model (requires your OpenAI API key)
llm = OpenAI()

# Wrap the model, then ask a question about the DataFrame
pandas_ai = PandasAI(llm)
pandas_ai.run(df, prompt='Which are the 5 happiest countries?')

The output will be a Pandas Series listing the countries with the highest happiness index, in descending order:

6            Canada
7         Australia
1    United Kingdom
3           Germany
0     United States
Name: country, dtype: object

By running the Pandas AI model on your DataFrame, you can effortlessly obtain insightful results based on the questions you ask, saving you valuable time and effort.
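For comparison, the same question can be answered in plain Pandas, which is a useful way to sanity-check what the model returns. This is a minimal sketch using only the two columns the question needs; it is not necessarily the code Pandas AI generates internally.

```python
import pandas as pd

# Same sample data as above, limited to the columns needed for the question
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy",
                "Spain", "Canada", "Australia", "Japan", "China"],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12],
})

# Take the 5 rows with the highest happiness_index, then keep the country names
top5 = df.nlargest(5, "happiness_index")["country"]
print(top5)
```

This lists Canada, Australia, the United Kingdom, Germany, and the United States, in that order, matching the output shown above.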

Visualizations using Pandas AI

Pandas AI goes beyond simple queries and empowers you to perform more complex operations, including mathematical calculations and data visualizations.

Let’s explore a data visualization example using Pandas AI:

pandas_ai.run(
    df,
    "Plot the histogram of countries showing for each the gdp, using different colors for each bar",
)

This command instructs Pandas AI to generate a chart from the DataFrame df. Although the prompt asks for a histogram, the result is effectively a bar chart: each bar represents a country, and the height of the bar corresponds to its GDP. In addition, each bar is drawn in a distinct color to enhance visual clarity.

The resulting chart shows how GDP compares across the countries. A visualization like this is a concise, visually appealing way to understand and communicate patterns in the data.

Source: PandasAI
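To make the description concrete, here is roughly what an equivalent chart looks like when written by hand with Matplotlib. This is only a sketch under the assumption that Matplotlib is installed; the function name plot_gdp_by_country is hypothetical, and the code Pandas AI actually generates may differ.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Same sample data as above, limited to the columns needed for the chart
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy",
                "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832,
            1745433788416, 1181205135360, 1607402389504, 1490967855104,
            4380756541440, 14631844184064],
})


def plot_gdp_by_country(data):
    """Draw one bar per country, each in a different color, and return the axes."""
    fig, ax = plt.subplots(figsize=(10, 5))
    # "C0", "C1", ... pick successive colors from Matplotlib's default color cycle
    colors = [f"C{i}" for i in range(len(data))]
    ax.bar(data["country"], data["gdp"], color=colors)
    ax.set_xlabel("Country")
    ax.set_ylabel("GDP")
    ax.set_title("GDP by country")
    ax.tick_params(axis="x", rotation=45)
    fig.tight_layout()
    return ax


ax = plot_gdp_by_country(df)
# Call plt.show() to display the chart when running this as a script.
```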

Future Enhancements 

Despite being a relatively new library, Pandas AI is continuously evolving, with the team actively working on enhancing its capabilities. As of May 10th, they have identified several tasks to improve the library further:

  1. Adding support for more large language models (LLMs): The team aims to expand the range of LLMs compatible with Pandas AI, providing users with a broader selection of AI models to utilize.
  2. Making Pandas AI accessible from a Command Line Interface (CLI): This upcoming feature will enable users to interact with Pandas AI directly from the command line, facilitating seamless integration into existing workflows.
  3. Developing a web interface for Pandas AI: The team envisions creating a user-friendly web interface, allowing users to interact with Pandas AI through a browser-based platform. This intuitive interface will enhance accessibility and ease of use.
  4. Adding comprehensive unit tests: To ensure the stability and reliability of the library, the team is committed to implementing thorough unit tests. This practice ensures that Pandas AI functions as intended across various scenarios and prevents potential issues.

The Pandas AI team welcomes suggestions and contributions from the community. If you’re interested in contributing to the growth and development of Pandas AI, you can refer to the project’s contributing guidelines.

Considerations for Working with PandasAI and OpenAI Pricing

When utilizing PandasAI, it’s essential to be mindful of the associated OpenAI pricing structure, as this can impact your usage costs. To access the most current pricing details, visit the OpenAI website, where you’ll find up-to-date information regarding their pricing plans.

As of May 2023, the approximate pricing for the GPT-3.5-Turbo model is around $0.002 per 1,000 tokens. Because prices can change, check for updates before budgeting or estimating costs for projects that involve PandasAI.
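A quick back-of-the-envelope calculation helps turn that rate into a budget. The sketch below uses the May 2023 figure quoted above as its default; the query count and tokens per query are made-up illustrative numbers, not measurements of PandasAI.

```python
def estimate_cost(total_tokens, price_per_1k_tokens=0.002):
    """Return the approximate cost in US dollars for a given number of tokens."""
    return total_tokens / 1000 * price_per_1k_tokens


# Illustrative assumptions: 50 questions at roughly 1,500 tokens each
queries = 50
tokens_per_query = 1500
total_tokens = queries * tokens_per_query

print(f"{total_tokens} tokens cost about ${estimate_cost(total_tokens):.2f}")
```

Under these assumptions, 75,000 tokens come to about $0.15. Swap in the current rate from the OpenAI website before relying on the result.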

When formulating questions for PandasAI, it’s important to remember that data from your DataFrame is sent along with each query. While this facilitates context-aware responses, it also means token usage, and therefore cost and latency, can grow with the size of the data. For extensive datasets, consider narrowing the DataFrame to the rows and columns relevant to your question before querying.
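One simple way to keep queries small is to trim the DataFrame before handing it over. The helper below, shrink_for_query, is a hypothetical example rather than a PandasAI feature, and taking the first rows is a deliberately naive choice; filtering or sampling may suit your data better.

```python
import pandas as pd


def shrink_for_query(data, columns, max_rows=1000):
    """Keep only the listed columns and at most max_rows rows (taken from the top)."""
    return data.loc[:, columns].head(max_rows)


df = pd.DataFrame({
    "country": ["Canada", "Australia", "Japan"],
    "gdp": [1607402389504, 1490967855104, 4380756541440],
    "happiness_index": [7.23, 7.22, 5.87],
})

# Only the columns relevant to a happiness question, capped at 2 rows
small = shrink_for_query(df, ["country", "happiness_index"], max_rows=2)
print(small)
```

The trimmed DataFrame can then be passed to Pandas AI in place of the full one.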
