pandas (software)

Name	pandas
Author	Wes McKinney
Developer	Community of contributors (fiscally sponsored by NumFOCUS)
Released	11 January 2008
Programming language	Python
Operating system	Cross-platform
Genre	Data analysis, Data manipulation
License	BSD 3-Clause
Website	https://pandas.pydata.org

Contents

Overview
Data structures
Functionality
Integration and ecosystem
History and development

pandas is an open-source software library for the Python programming language, designed for high-performance data manipulation and analysis. It provides fast, flexible, and expressive data structures for working with structured or tabular data. The library has become a foundational tool in data science, financial analysis, and statistics, and is widely used in academia and industry.

Overview

pandas is built on top of the NumPy library and is a core component of the SciPy ecosystem. Its primary purpose is to offer data structures and operations for manipulating numerical tables and time series data. The library is particularly well suited to data wrangling, cleaning, transformation, aggregation, and visualization. Many other data-centric Python libraries build on or interoperate with it: Statsmodels requires it as a dependency, and scikit-learn accepts pandas objects as input.

Data structures

The two primary data structures in pandas are the Series, a one-dimensional labeled array, and the DataFrame, a two-dimensional labeled data structure whose columns may have different types. The DataFrame is analogous to a spreadsheet, a SQL table, or the data.frame in R. Both structures carry an integrated index, which allows operations to align data by label and to handle missing data, represented by default as NaN values from NumPy. The library also supports hierarchical indexing through its MultiIndex object.
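The following is a minimal sketch of these structures, not an excerpt from the pandas documentation. The variable names and sample values (prices, discounts, the city table, the quarterly sales) are invented for illustration.

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional array whose values carry index labels.
prices = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "c"])
discounts = pd.Series([1.0, 2.0], index=["b", "c"])

# Arithmetic aligns on labels, not positions. Label "a" has no partner
# in `discounts`, so the result for "a" is NaN (missing).
net = prices - discounts

# A DataFrame holds columns of different types under one shared index.
df = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Pune"], "population_m": [0.7, 10.0, np.nan]},
    index=["x", "y", "z"],
)
missing = df["population_m"].isna()  # boolean mask of missing entries

# A MultiIndex provides hierarchical (multi-level) labels.
idx = pd.MultiIndex.from_tuples(
    [("2023", "Q1"), ("2023", "Q2"), ("2024", "Q1")],
    names=["year", "quarter"],
)
sales = pd.Series([5, 7, 9], index=idx)
sales_2023 = sales.loc["2023"]  # selects every row under the first level "2023"
```

Here the subtraction shows label alignment producing a missing value, and the last line shows partial selection on the outer level of a MultiIndex.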

Functionality

pandas provides a broad set of tools for data manipulation, including efficient methods for reading and writing data between in-memory structures and file formats such as CSV, Microsoft Excel, SQL databases, and HDF5. It supports indexing, selecting, filtering, grouping, and pivoting datasets. For time series analysis, it offers tools for date range generation, frequency conversion, and moving window statistics. Its integration with Matplotlib allows data to be plotted directly from DataFrames and Series.
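As an illustrative sketch of several of these operations, the example below reads CSV text from an in-memory buffer rather than a file so that it is self-contained. The column names and figures are made up for the example.

```python
import io

import pandas as pd

# Hypothetical sample data; a file path could be passed to read_csv instead.
csv_text = """region,product,units
north,apples,10
north,pears,4
south,apples,7
south,pears,12
"""
df = pd.read_csv(io.StringIO(csv_text))

# Filtering with a boolean mask.
big = df[df["units"] > 5]

# Grouping and aggregating.
units_by_region = df.groupby("region")["units"].sum()

# Pivoting from long to wide form: one row per region, one column per product.
wide = df.pivot(index="region", columns="product", values="units")

# Writing back out as CSV text (index=False omits the row labels).
out = df.to_csv(index=False)

# Time series: generate a daily date range, then convert frequency
# and compute a moving-window statistic.
days = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=days)
two_day_sums = ts.resample("2D").sum()      # downsample to 2-day bins
rolling_mean = ts.rolling(window=3).mean()  # 3-day moving average
```

The first two entries of the rolling mean are missing because a three-observation window is not yet full at those positions.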

Integration and ecosystem

The library integrates closely with the broader Python scientific stack. It works with NumPy for numerical operations, with SciPy for advanced statistical functions, and with scikit-learn for machine learning pipelines. For statistical modeling, it is often used alongside Statsmodels. Its data structures can be converted to and from Apache Arrow for efficient interchange with other systems, and they can be passed to Plotly and Bokeh for interactive visualization. Development is supported by the non-profit NumFOCUS.
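The NumPy side of this integration can be sketched as follows. The example is limited to NumPy so that it needs no optional packages, and the column names and values are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"x": [1.0, 4.0, 9.0], "y": [16.0, 25.0, 36.0]},
    index=["a", "b", "c"],
)

# Export the values as a plain NumPy array (labels are dropped).
arr = df.to_numpy()

# NumPy universal functions can be applied to a DataFrame directly;
# the result is again a DataFrame with the same row and column labels.
roots = np.sqrt(df)

# Ordinary NumPy operations work on the exported array.
col_means = arr.mean(axis=0)

# Wrap a NumPy result back into a labeled DataFrame.
back = pd.DataFrame(arr * 2, index=df.index, columns=df.columns)
```

Libraries such as scikit-learn generally accept either form, so the choice between keeping labels and working on the raw array depends on the downstream step.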

History and development

pandas was created by Wes McKinney in 2008 while he was working at AQR Capital Management, out of a need for a high-performance, flexible tool for quantitative financial analysis. The project was later open-sourced and has since grown through contributions from a large community. Since 2015, it has been a fiscally sponsored project of NumFOCUS, which provides financial and administrative support. Key milestones in its development have included the introduction of the DataFrame API and continuous performance improvements, often drawing on integration with libraries such as NumPy and Numba.
