Scanpy — LLMpedia

Scanpy
Name	Scanpy
Programming language	Python
Operating system	Cross-platform

Contents

Overview
Features and Functionality
Data Structures and File Formats
Typical Analysis Workflow
Extensions and Ecosystem
Development, Licensing, and Community

Overview

Scanpy is a Python toolkit that provides scalable algorithms for single-cell transcriptomics, supporting datasets that range from hundreds of thousands to millions of cells. It interoperates with visualization, clustering, and trajectory tools from the wider single-cell community, including groups at the Broad Institute, the Wellcome Sanger Institute, and the European Bioinformatics Institute. It builds on the scientific Python stack, notably NumPy, SciPy, pandas, Matplotlib, Seaborn, and scikit-learn, and stores data in HDF5 or Zarr files. Releases are distributed through PyPI and conda-forge, development takes place on GitHub, and the project has relied on continuous integration services such as Travis CI and GitHub Actions. Scanpy has been used in analyses published in journals such as Nature, Cell, Science, PNAS, eLife, and Genome Research, and on datasets from initiatives such as the Human Cell Atlas, the ENCODE Project, the GTEx Project, the 1000 Genomes Project, The Cancer Genome Atlas, the NIH Roadmap Epigenomics Project, and the Allen Brain Atlas.

Features and Functionality

Scanpy implements preprocessing steps such as normalization, log-transformation, and highly variable gene selection, which are common in studies from the Broad Institute, Harvard Medical School, Stanford Medicine, the Dana-Farber Cancer Institute, and Memorial Sloan Kettering Cancer Center. It offers dimensionality reduction methods including principal component analysis, UMAP, and t-SNE (the method introduced by van der Maaten and Hinton), drawing on implementations from scikit-learn and the UMAP community. It integrates graph-based clustering algorithms: Louvain, introduced by researchers at the Université catholique de Louvain, and Leiden, developed at Leiden University. Visualization modules produce embedding plots, heatmaps, and dot plots. Scanpy also supports neighborhood graph construction, differential expression testing, and batch correction, with routines related to methods such as Seurat, Harmony, and mnnCorrect. For trajectory inference it includes PAGA and sits alongside approaches such as Monocle, Slingshot, and STREAM.
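The preprocessing steps named above can be illustrated with a minimal NumPy sketch. It shows only the arithmetic behind total-count normalization, log-transformation, and a simplified variable-gene ranking. It is not Scanpy's implementation: the helper names `normalize_total` and `top_variable_genes` here are local to this example, the count matrix is made up, and ranking by plain variance is a deliberate simplification of the dispersion-based selection Scanpy offers through `sc.pp.highly_variable_genes`.

```python
import numpy as np


def normalize_total(counts, target_sum=1e4):
    """Scale each cell (row) so that its counts sum to target_sum."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty cells
    return counts / totals * target_sum


def top_variable_genes(log_expr, n_top):
    """Return sorted column indices of the n_top genes with the largest variance.

    Simplification: plain variance ranking, not a dispersion-based method.
    """
    variances = log_expr.var(axis=0)
    return np.sort(np.argsort(variances)[::-1][:n_top])


# Toy count matrix, 4 cells x 5 genes (values invented for illustration).
# Cells 0 and 1 have the same profile at different sequencing depths,
# as do cells 2 and 3.
counts = np.array([
    [10, 0, 5, 1, 4],
    [20, 0, 10, 2, 8],
    [0, 6, 5, 1, 8],
    [0, 12, 10, 2, 16],
])

norm = normalize_total(counts, target_sum=100.0)  # removes depth differences
log_expr = np.log1p(norm)                         # log(1 + x) transform
hvg = top_variable_genes(log_expr, n_top=2)       # genes that differ most
```

After normalization, cells that differ only in sequencing depth become identical, and the two genes that separate the two cell profiles are the ones selected.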

Data Structures and File Formats

Scanpy centers on the AnnData data structure, which can be written to and read from HDF5 (the .h5ad format), Zarr, and Loom files (via loompy), and which can be exchanged with Bioconductor projects that follow SingleCellExperiment conventions. AnnData stores the expression matrix together with cell annotations, gene annotations, and reduced-dimension representations, and keeps them aligned along the cell and gene axes. It is designed to work with NumPy arrays, scipy.sparse matrices, and pandas DataFrames. File serialization follows community standards influenced by consortia such as the Human Cell Atlas and by infrastructures such as the European Genome-phenome Archive and NCBI, which facilitates exchange with platforms such as the Galaxy Project, Terra, and DNAnexus, and with cloud providers including Google Cloud Platform, Amazon Web Services, and Microsoft Azure.
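The layout described above can be sketched with a small stand-in class. `ToyAnnData` is hypothetical and exists only for this illustration; it is not the real `AnnData` class, which stores `obs` and `var` as pandas DataFrames and supports direct indexing. The attribute names `X`, `obs`, `var`, and `obsm` mirror those of AnnData, the `subset_cells` method is an assumption of this sketch, and all values are invented.

```python
from dataclasses import dataclass, field

import numpy as np


def _check_lengths(annotations, expected, axis_name):
    """Raise if any annotation does not match the matrix along its axis."""
    for name, values in annotations.items():
        if len(values) != expected:
            raise ValueError(f"{axis_name}[{name!r}] must have length {expected}")


@dataclass
class ToyAnnData:
    """Hypothetical stand-in that mimics AnnData's aligned layout."""
    X: np.ndarray                             # expression matrix, cells x genes
    obs: dict                                 # per-cell annotations
    var: dict                                 # per-gene annotations
    obsm: dict = field(default_factory=dict)  # per-cell arrays, e.g. embeddings

    def __post_init__(self):
        n_obs, n_vars = self.X.shape
        _check_lengths(self.obs, n_obs, "obs")
        _check_lengths(self.obsm, n_obs, "obsm")
        _check_lengths(self.var, n_vars, "var")

    def subset_cells(self, mask):
        """Select cells; every per-cell field is sliced with the same mask."""
        mask = np.asarray(mask, dtype=bool)
        return ToyAnnData(
            X=self.X[mask],
            obs={k: np.asarray(v)[mask] for k, v in self.obs.items()},
            var=dict(self.var),  # gene annotations are unaffected
            obsm={k: np.asarray(v)[mask] for k, v in self.obsm.items()},
        )


toy = ToyAnnData(
    X=np.array([[1, 0, 3], [0, 2, 0], [4, 0, 1], [0, 0, 5]]),
    obs={"cell_type": np.array(["T", "B", "T", "NK"])},
    var={"gene_name": np.array(["CD3E", "MS4A1", "NKG7"])},
    obsm={"X_pca": np.zeros((4, 2))},  # placeholder embedding
)
t_cells = toy.subset_cells(toy.obs["cell_type"] == "T")
```

The point of the design is that one selection keeps the matrix, the annotations, and the embeddings consistent, so downstream steps never see misaligned rows.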

Typical Analysis Workflow

A typical Scanpy pipeline mirrors workflows described in publications from the Human Cell Atlas, the Broad Institute, the Wellcome Sanger Institute, and academic groups at Stanford, Harvard, and MIT. The steps run in this order: quality control and filtering, normalization, highly variable gene selection, dimensionality reduction, neighborhood graph construction, clustering, marker gene identification, and visualization. Integrative analyses add a batch correction step, using methods such as those from Seurat (Satija Lab) or Harmony (Korsunsky and colleagues), along with alignment techniques used by researchers at the Sanger Institute. Downstream steps often include trajectory analysis with tools such as Monocle (Trapnell Lab) or PAGA (Wolf and colleagues), or integration into pipelines on the Galaxy Project and with Bioconductor packages.
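The order of these steps can be sketched end to end in NumPy on synthetic data. This is an illustration of the sequence under stated assumptions, not Scanpy code: the function names are local to the example, the data are invented, highly variable gene selection and plotting are omitted for brevity, connected components of the neighbor graph stand in for Leiden or Louvain clustering, and a difference of means stands in for a statistical marker test. In Scanpy the corresponding calls live in the `sc.pp` and `sc.tl` modules, for example `sc.pp.normalize_total`, `sc.pp.log1p`, `sc.pp.pca`, `sc.pp.neighbors`, `sc.tl.leiden`, and `sc.tl.rank_genes_groups`.

```python
import numpy as np


def connected_components(neighbors):
    """Label connected components of the symmetrized k-nearest-neighbor graph."""
    n = neighbors.shape[0]
    adjacency = [set() for _ in range(n)]
    for i, row in enumerate(neighbors):
        for j in row:
            adjacency[i].add(int(j))
            adjacency[int(j)].add(i)
    labels = np.full(n, -1)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            for nb in adjacency[node]:
                if labels[nb] == -1:
                    labels[nb] = current
                    stack.append(nb)
        current += 1
    return labels


def run_toy_pipeline(counts, min_counts=1, n_comps=2, k=3):
    counts = np.asarray(counts, dtype=float)
    # 1. Quality control: drop cells with too few total counts.
    keep = counts.sum(axis=1) >= max(min_counts, 1)
    counts = counts[keep]
    # 2. Normalization to 10,000 counts per cell, then log(1 + x).
    totals = counts.sum(axis=1, keepdims=True)
    log_expr = np.log1p(counts / totals * 1e4)
    # 3. Dimensionality reduction: PCA via SVD of the centered matrix.
    centered = log_expr - log_expr.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    pcs = u[:, :n_comps] * s[:n_comps]
    # 4. Neighborhood graph: k nearest neighbors in PC space.
    dist = np.linalg.norm(pcs[:, None, :] - pcs[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    # 5. Clustering stand-in: components of the graph (not Leiden/Louvain).
    labels = connected_components(neighbors)
    # 6. Marker scores: mean in cluster minus mean in all other cells.
    scores = np.array([
        log_expr[labels == c].mean(axis=0) - log_expr[labels != c].mean(axis=0)
        for c in range(labels.max() + 1)
    ])
    return keep, pcs, labels, scores


# Synthetic data: two groups of 6 cells with opposite gene programs,
# plus one empty cell that quality control should remove.
rng = np.random.default_rng(0)
high = rng.integers(40, 61, size=(6, 3))
low = rng.integers(1, 3, size=(6, 3))
counts = np.vstack([
    np.hstack([high, low]),        # group A: genes 0-2 high
    np.hstack([low, high]),        # group B: genes 3-5 high
    np.zeros((1, 6), dtype=int),   # empty cell
])

keep, pcs, labels, scores = run_toy_pipeline(counts)
```

On this toy input the empty cell is filtered out, the two groups end up in separate components, and the top-scoring gene for each cluster comes from that group's highly expressed genes.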

Extensions and Ecosystem

Scanpy is extended through interoperable packages that are developed on GitHub and distributed through PyPI and conda-forge. Contributors include research labs such as CZI-funded groups and teams at EMBL-EBI, with support from the Wellcome Trust, and the toolkit is used with data from vendors such as 10x Genomics, Illumina, Pacific Biosciences, and Oxford Nanopore Technologies. Notable companion tools and integrations include wrappers for algorithms from Seurat and Monocle, the built-in PAGA method, visualization tools such as Plotly and Bokeh, and single-cell atlasing efforts from the Human Cell Atlas, the Allen Institute for Brain Science, Cancer Research UK, and Stanford Medicine.

Development, Licensing, and Community

Scanpy development is coordinated on GitHub, with contributions from academic groups at the Max Planck Society, EMBL, the Broad Institute, the Wellcome Sanger Institute, Harvard University, MIT, Stanford University, and ETH Zurich. It is supported by funding bodies such as the European Research Council, the Wellcome Trust, NIH, and NSF, and by philanthropic organizations including the Chan Zuckerberg Initiative. Scanpy is open-source software released under the BSD 3-Clause license. Community engagement takes place through forums, through presentations at meetings such as ISMB, RECOMB, Gordon Research Conferences, and Cold Spring Harbor Laboratory meetings, and through workshops at EMBO and Keystone Symposia. Developers and users interact through mailing lists, issue trackers, and community channels, including those associated with Bioconductor, the Galaxy Project, and consortia such as the Human Cell Atlas.

Category:Bioinformatics software