LLMpedia: The first transparent, open encyclopedia generated by LLMs

Pangeo

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Name: Pangeo
Developer: Google, Microsoft, Amazon, National Center for Atmospheric Research, Princeton University, University of Washington
Released: 2016
Programming language: Python (built on libraries such as Dask and Xarray)
Operating system: Linux, macOS, Windows
License: BSD license

Pangeo is an open-source community and platform for large-scale geoscientific data analysis. It integrates tools from the Python ecosystem with cloud computing to enable reproducible research, emphasizing scalable computation, interactive visualization, and collaborative workflows across institutions such as the National Center for Atmospheric Research, Princeton University, the University of Washington, and the University of California, Berkeley, together with industry partners including Google, Microsoft, and Amazon. Pangeo fosters interoperability among projects such as Xarray, Dask, Zarr, Apache Arrow, and Jupyter Notebook.

Overview

Pangeo combines software, community practices, and cloud infrastructure to analyze multi-dimensional scientific arrays produced by observing systems and modeling efforts such as MODIS, Landsat, Sentinel-2, Argo, and CMIP6. It relies on data formats and services such as NetCDF, Zarr, OPeNDAP, THREDDS, and Amazon S3, and on compute and deployment technologies including Dask, Kubernetes, Docker, Google Cloud Platform, Microsoft Azure, and Amazon Web Services. Visualization and exploration are supported through Jupyter Notebook, JupyterLab, HoloViz, Bokeh, Matplotlib, and Cartopy. Community governance draws on contributing organizations such as the National Oceanic and Atmospheric Administration, the European Centre for Medium-Range Weather Forecasts, and Lawrence Berkeley National Laboratory.

History and development

Pangeo originated from collaborations between academic and national laboratory teams responding to the difficulty of handling large climate and ocean datasets produced by CMIP5, CMIP6, and observational programs such as Argo and GRACE. Early contributors included researchers from the University of Washington, Princeton University, Lawrence Berkeley National Laboratory, and the National Center for Atmospheric Research. The project evolved alongside NumPy, SciPy, Pandas, and Xarray, and incorporated distributed computing from Dask, container orchestration from Kubernetes, and object storage interfaces promoted by Amazon and Google. Community events and governance were influenced by practices from Software Carpentry, Mozilla Science Lab, and NumFOCUS, and by funding bodies such as the National Science Foundation and the European Research Council. Major milestones included adoption for analyses of CMIP6 model output, deployments at NOAA, and integration with data portals such as PO.DAAC and Copernicus.

Architecture and components

Core components of the Pangeo stack include Xarray for labeled N-dimensional arrays, Dask for parallel arrays and dataframes, Zarr for chunked, compressed array storage, and Jupyter Notebook and JupyterLab for interactive computation. Infrastructure commonly uses container images built with Docker and scheduled by Kubernetes, with data hosted on object stores such as Amazon S3, Google Cloud Storage, or Microsoft Azure Blob Storage. Metadata and discovery integrate with services such as THREDDS and OPeNDAP, while authentication and authorization rely on providers including Globus and ORCID. Visualization and dashboards rely on HoloViz, Bokeh, Matplotlib, Datashader, and Cartopy. Performance tuning involves libraries such as Numba and Blosc and storage formats such as HDF5 and NetCDF. Interoperability with machine learning and statistics is enabled through scikit-learn, TensorFlow, PyTorch, and XGBoost.

Use cases and applications

Researchers apply Pangeo to climate model analysis for CMIP6, oceanography using Argo floats, remote sensing with MODIS and Landsat imagery, and hydrology using GRACE data. Operational agencies such as NOAA, the European Centre for Medium-Range Weather Forecasts, and NASA use Pangeo-aligned workflows for studies based on reanalysis products such as ERA5 and for data assimilation pipelines. Applications extend to ecology studies supported by NEON, air quality investigations tied to the United States Environmental Protection Agency (EPA), and interdisciplinary projects funded by the National Science Foundation and Horizon 2020. Pangeo facilitates reproducible workflows cited in publications from institutions including the University of California, San Diego, Columbia University, the Massachusetts Institute of Technology, and Stanford University.
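A common step in the climate model analyses mentioned above is an area-weighted global mean, since grid cells on a regular latitude-longitude grid shrink toward the poles. The following is a minimal sketch using Xarray on a synthetic field; the field itself, the variable name `tas`, and the grid resolution are assumptions made for illustration and do not come from any particular dataset.

```python
import numpy as np
import xarray as xr

# Regular 5-degree grid (cell centers); an illustrative choice.
lat = np.linspace(-87.5, 87.5, 36)
lon = np.linspace(2.5, 357.5, 72)

# Synthetic temperature-like field: warmest at the equator, coldest at the poles.
field = 273.0 + 30.0 * np.cos(np.deg2rad(lat))[:, None] * np.ones((1, lon.size))
tas = xr.DataArray(
    field,
    coords={"lat": lat, "lon": lon},
    dims=("lat", "lon"),
    name="tas",
)

# Grid-cell area on a regular lat-lon grid is proportional to cos(latitude).
weights = np.cos(np.deg2rad(tas.lat))
weights.name = "weights"

# Area-weighted mean versus a naive mean that treats all cells as equal.
global_mean = tas.weighted(weights).mean(dim=("lat", "lon"))
naive_mean = tas.mean(dim=("lat", "lon"))

print(float(global_mean), float(naive_mean))
```

Because the synthetic field is warmest where grid cells are largest, the weighted mean comes out higher than the naive mean, which over-represents the small polar cells.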

Governance and community

The Pangeo community is organized around open governance practices influenced by organizations such as NumFOCUS, Software Carpentry, and The Carpentries. Contributors come from academic institutions such as Princeton University, the University of Washington, the University of Cambridge, and the University of Oxford; from national laboratories including Lawrence Berkeley National Laboratory and the National Center for Atmospheric Research; and from industry partners such as Google, Microsoft, and Amazon. The project hosts meetings, workshops, and code sprints modeled on events run by the American Geophysical Union (AGU), the European Geosciences Union (EGU), and the Ocean Sciences Meeting. Funding and partnerships have included grants from the National Science Foundation, collaborations with the European Centre for Medium-Range Weather Forecasts, and infrastructure support from Google Cloud Platform and Amazon Web Services.

Adoption and impact

Pangeo has influenced data analysis practices at institutions such as NOAA, NASA, the European Centre for Medium-Range Weather Forecasts, the Met Office, CNRS, Scripps Institution of Oceanography, and Woods Hole Oceanographic Institution, and at universities including Princeton University and the University of Washington. It has accelerated reproducible research in projects tied to CMIP6, Argo, MODIS, Landsat, ERA5, and Copernicus data streams, and has enabled integration with machine learning frameworks such as TensorFlow and PyTorch for climate informatics. The ecosystem's tools have been adopted by data portals and services including PO.DAAC and Copernicus, and by institutional data repositories at Lawrence Berkeley National Laboratory and the National Center for Atmospheric Research, shaping modern workflows in Earth system science.

Category:Scientific software