NCBI Entrez — LLMpedia

NCBI Entrez
Name	Entrez
Developer	National Center for Biotechnology Information
Released	1991
Programming language	C, Perl, Python
Platform	Web, API
License	Public domain content, NIH policies

NCBI Entrez is an integrated, text-based search and retrieval system developed by the National Center for Biotechnology Information (NCBI) to provide unified access to a network of biomedical and molecular biology databases. It supports cross-database linking and navigation among sequence repositories, literature, taxonomy, protein structures, and clinical resources, enabling researchers to traverse resources curated by institutions such as the National Institutes of Health, National Library of Medicine, National Cancer Institute, Centers for Disease Control and Prevention, and international partners like the European Molecular Biology Laboratory, European Bioinformatics Institute, and Wellcome Trust Sanger Institute. Its records draw on data from initiatives such as the Human Genome Project, the International Nucleotide Sequence Database Collaboration, and the Protein Data Bank, which facilitates integration with projects led by organizations such as the Broad Institute, Cold Spring Harbor Laboratory, Rockefeller University, and the Howard Hughes Medical Institute.

Overview

Entrez functions as a federated search portal connecting distinct resources, including nucleotide and protein sequences, literature, taxonomy, variation, gene expression, and structure records held in resources such as PubMed Central, GenBank, RefSeq, UniProt, and the Protein Data Bank. It provides cross-references that link entries to related collections, both NCBI-hosted and external, such as the European Nucleotide Archive, Genome Aggregation Database, ClinVar, dbSNP, OMIM, KEGG, Reactome, and the Gene Ontology Consortium, enabling integration with community resources produced by teams at Stanford University, Massachusetts Institute of Technology, Harvard University, Yale University, and the University of California, Berkeley.
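As an illustration of the cross-references described above, the following minimal sketch builds a request URL for ELink, the E-utilities endpoint that returns links between records in Entrez databases. It assumes the public E-utilities base URL and the `dbfrom`, `db`, `id`, and `retmode` parameters; the helper name `elink_url` is hypothetical, and the sketch only constructs the URL without sending a request.

```python
from urllib.parse import urlencode

# Public E-utilities base URL (assumed here; check NCBI's documentation).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def elink_url(dbfrom: str, db: str, ids: list[str]) -> str:
    """Build an ELink URL asking for records in `db` linked to `ids` in `dbfrom`."""
    params = {
        "dbfrom": dbfrom,        # database the input IDs belong to
        "db": db,                # database to find linked records in
        "id": ",".join(ids),     # comma-separated list of input IDs
        "retmode": "json",       # ask for JSON rather than the default XML
    }
    return f"{EUTILS_BASE}/elink.fcgi?{urlencode(params)}"


# Gene ID 672 is the human BRCA1 record in the Entrez Gene database.
url = elink_url("gene", "pubmed", ["672"])
print(url)
```

Fetching this URL with any HTTP client would return the PubMed identifiers linked to the gene record; the response handling is left out because it depends on the client and output format chosen.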

Entrez was conceived at the National Center for Biotechnology Information during a period of rapid expansion in sequence and literature databases associated with projects like the Human Genome Project and collaborations among the International Nucleotide Sequence Database Collaboration partners: GenBank, European Nucleotide Archive, and DNA Data Bank of Japan. Key contributors and advisors included scientists affiliated with the National Institutes of Health, the National Library of Medicine, and research groups at University of California, San Diego, University of Cambridge, University of Oxford, and California Institute of Technology. Over time Entrez incorporated resources from initiatives such as PubMed, PubMed Central, ClinVar, and structural datasets from the Protein Data Bank, reflecting the influence of consortia like the Human Proteome Organization and funders such as the Wellcome Trust and Howard Hughes Medical Institute.

Entrez indexes a range of databases, including sequence repositories such as GenBank and RefSeq, literature repositories such as PubMed and PubMed Central, variation and phenotype resources such as dbSNP and ClinVar, records linked to model organism databases such as FlyBase and WormBase, and structure resources derived from the Protein Data Bank. It cross-links to curated resources and ontologies maintained by groups such as the Gene Ontology Consortium, KEGG, Reactome, UniProt Consortium, Ensembl, UCSC Genome Browser, COSMIC, ArrayExpress, Expression Atlas, GTEx Consortium, 1000 Genomes Project, Exome Aggregation Consortium, and Genome Reference Consortium, as well as to clinical knowledgebases influenced by National Cancer Institute panels and regulatory records connected to agencies like the Food and Drug Administration and the European Medicines Agency.

Search features rely on indexing strategies and relevance ranking developed with input from informatics groups at the National Library of Medicine, Cornell University, University of Washington, Johns Hopkins University, and Princeton University. Users can perform fielded queries, combine terms with Boolean operators, and use Medical Subject Headings (MeSH) curated by National Library of Medicine specialists to navigate literature and sequence results. Entrez integrates citation links that tie records to bibliographic metadata in PubMed and PubMed Central, structural cross-references to entries in the Protein Data Bank, taxonomic connections to the Tree of Life Web Project and the Integrated Taxonomic Information System, and clinical annotations that reference OMIM and ClinVar submissions from clinical centers such as Mayo Clinic and Johns Hopkins Hospital.
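A fielded Boolean query of the kind described above can be assembled as a plain string. The following is a minimal sketch, assuming the PubMed field tags `[MeSH Terms]` and `[Title]`; the helpers `field` and `combine` are hypothetical names introduced for illustration and are not part of any NCBI library.

```python
def field(term: str, tag: str) -> str:
    """Attach an Entrez field tag to a term, quoting multi-word phrases."""
    text = f'"{term}"' if " " in term else term
    return f"{text}[{tag}]"


def combine(op: str, *clauses: str) -> str:
    """Join clauses with a Boolean operator and wrap them in parentheses."""
    if op not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported Boolean operator: {op}")
    return "(" + f" {op} ".join(clauses) + ")"


# Articles indexed under the MeSH heading "breast neoplasms" with BRCA1 in the title.
query = combine(
    "AND",
    field("breast neoplasms", "MeSH Terms"),
    field("BRCA1", "Title"),
)
print(query)  # ("breast neoplasms"[MeSH Terms] AND BRCA1[Title])
```

The resulting string can be pasted into the Entrez search box or passed as the search term of a programmatic query.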

Entrez provides programmatic access via the Entrez Programming Utilities (E-utilities), which enable automated queries and retrieval from the languages and toolkits used at institutions such as NIH, EMBL-EBI, the Broad Institute, and academic groups at Carnegie Mellon University and the University of California, San Francisco. The API supports batch retrieval and XML and JSON output, and it can be integrated with workflow systems such as the Galaxy Project, Bioconductor, and CWL, and with cloud platforms such as Amazon Web Services and Google Cloud Platform, as used by teams at Microsoft Research and bioinformatics startups. Developers build client libraries in languages used by researchers at Stanford University, Massachusetts Institute of Technology, and Harvard Medical School, and link Entrez data to resources such as Ensembl, UCSC Genome Browser, KEGG, and Reactome.
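The search-then-fetch pattern behind batch retrieval might look like the sketch below. It assumes the ESearch and EFetch endpoints with the `db`, `term`, `retmax`, `id`, `rettype`, and `retmode` parameters, and an ESearch JSON response that lists identifiers under `esearchresult.idlist`. The function names, the batch size, and the `SAMPLE` payload are hypothetical; the payload and its IDs are hand-written placeholders, not real results. No network request is made.

```python
import json
from typing import Iterator
from urllib.parse import urlencode

# Public E-utilities base URL (assumed here; check NCBI's documentation).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL that returns up to `retmax` matching IDs as JSON."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(db: str, ids: list[str], rettype: str, retmode: str) -> str:
    """Build an EFetch URL that retrieves full records for a batch of IDs."""
    params = {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def parse_id_list(payload: str) -> list[str]:
    """Extract the ID list from an ESearch JSON payload."""
    return json.loads(payload)["esearchresult"]["idlist"]


def batches(ids: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `ids`, each holding at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Hand-written sample in the assumed response shape; the IDs are placeholders.
SAMPLE = '{"esearchresult": {"count": "3", "idlist": ["111", "222", "333"]}}'

ids = parse_id_list(SAMPLE)
fetch_urls = [efetch_url("pubmed", chunk, "abstract", "xml") for chunk in batches(ids, 2)]
print(esearch_url("pubmed", "BRCA1[Title]", retmax=3))
for url in fetch_urls:
    print(url)
```

Splitting the ID list into batches keeps each request URL short; a real client would also need to respect NCBI's usage guidelines on request rate, which this sketch does not handle.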

Entrez underpins literature discovery, sequence annotation, comparative genomics, clinical variant interpretation, and translational research workflows employed by researchers at the National Institutes of Health, Centers for Disease Control and Prevention, and World Health Organization, at pharmaceutical companies such as Pfizer, Roche, and Novartis, and at biotechnology firms including Genentech, Amgen, and Biogen. It supports public health surveillance efforts coordinated with World Health Organization initiatives and genomic consortia such as the Global Alliance for Genomics and Health, and informs educational resources developed at the University of Oxford, University of Cambridge, Yale University, and Columbia University. Entrez data and links are cited in thousands of publications from laboratories at institutions such as the Salk Institute, Broad Institute, MIT, Harvard Medical School, and Stanford School of Medicine, and from clinical centers including Mayo Clinic and Cleveland Clinic.