Entrez — LLMpedia

Entrez
Name	Entrez
Developer	National Center for Biotechnology Information
Released	1993
Latest release version	N/A
Programming language	C, Perl, Python (client libraries)
Operating system	Cross-platform
License	Public domain / government-produced
Website	N/A

Contents

Overview
History and Development
Architecture and Components
Search and Retrieval Features
Data Sources and Coverage
Integration and Interoperability
Usage, Impact, and Criticism

Entrez is a federated search and retrieval system developed to provide unified access to multiple life sciences and biomedical databases. It integrates sequence repositories, literature archives, structural databases, taxonomy resources, and clinical molecular resources to enable cross-database discovery. Initially designed for molecular biology and biomedical researchers, it has also been widely used by institutional libraries, bioinformatics groups, and public health organizations.

Overview

Entrez was created to link diverse databases such as sequence collections, bibliographic records, macromolecular structures, and taxonomy indices in a single searchable environment. The system connects entries in repositories such as GenBank, PubMed, the Protein Data Bank, OMIM, and RefSeq to facilitate navigation among related molecular sequences, published literature, genetic variation records, and organismal classifications. Entrez provides query translation, result filtering, and hyperlinked cross-references that help users move from a literature citation, such as an article in The New England Journal of Medicine, to genomic sequences submitted to GenBank or to structures archived in the Protein Data Bank.

History and Development

Development began in the early 1990s at the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine at the National Institutes of Health, as part of an effort to organize and disseminate biomedical information. Early milestones included integration with GenBank sequence records and MEDLINE citations, followed by linkage to the PubMed literature index after its launch in the mid-1990s. Subsequent expansions added cross-links to curated resources such as RefSeq and disease-focused catalogs such as OMIM, and incorporated structural data from the Protein Data Bank. Entrez evolved alongside web technologies and database indexing practices influenced by developments at institutions such as Stanford University and the European Bioinformatics Institute.

Architecture and Components

The Entrez architecture is a federated indexing and linking framework that coordinates multiple underlying databases hosted by the National Center for Biotechnology Information. Core components include an indexing engine, cross-reference tables, a query translation layer, and programmatic interfaces. Client access is supported through web portals and through APIs such as the E-utilities, which allow scripted access from languages commonly used in bioinformatics, including Perl and Python. The underlying data include archival sequence records in GenBank and curated reference sequences in RefSeq, with a taxonomic backbone derived from the NCBI Taxonomy project. Security and access control are governed by policies of the National Institutes of Health and related federal statutes.
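
The programmatic interface described above can be illustrated with a minimal Python sketch that only constructs E-utilities request URLs and does not contact any server. The helper name `eutils_url` is hypothetical and not part of any NCBI library; the base URL and the `db`, `term`, `id`, `rettype`, and `retmode` parameters follow the publicly documented E-utilities conventions, and the search term and accession are arbitrary examples.

```python
from urllib.parse import urlencode

# Public E-utilities base URL. The helper below is a hypothetical
# convenience function, not an official client.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def eutils_url(utility: str, **params: str) -> str:
    """Build a request URL for one E-utility such as 'esearch' or 'efetch'."""
    # Sort the parameters so the resulting URL is deterministic.
    query = urlencode(sorted(params.items()))
    return f"{EUTILS_BASE}/{utility}.fcgi?{query}"


# Search PubMed for a term, asking for a JSON response.
search_url = eutils_url("esearch", db="pubmed", term="BRCA1", retmode="json")

# Fetch one nucleotide record as FASTA text (example accession).
fetch_url = eutils_url(
    "efetch", db="nucleotide", id="NM_007294", rettype="fasta", retmode="text"
)

print(search_url)
print(fetch_url)
```

The resulting URLs can be passed to any HTTP client. Keeping URL construction separate from network access makes the request logic easy to inspect and test in isolation.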

Search and Retrieval Features

Entrez supports keyword searching, fielded queries, Boolean operators, and filtering by facets such as organism, publication date, and molecule type. Advanced retrieval includes cross-database linking, where a single search can return literature citations in PubMed, nucleotide entries in GenBank, protein records in UniProt (via cross-links), and three-dimensional structures from the Protein Data Bank. The E-utilities API enables programmatic fetch, search, and link operations for integration into pipelines used by groups at Cold Spring Harbor Laboratory, Harvard Medical School, and biotechnology companies. Results can be exported in multiple formats for downstream analysis with tools such as BLAST and with visualization systems employed at the Broad Institute.
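
As an illustration of fielded queries combined with Boolean operators, the following Python sketch assembles an Entrez-style search term as a plain string. The helper functions `fielded`, `all_of`, and `any_of` are hypothetical names introduced here for illustration; `[Gene Name]` and `[Organism]` are field tags of the kind used in the sequence databases, and the gene symbols are arbitrary examples.

```python
def fielded(value: str, field: str) -> str:
    """Quote a value and attach a field tag, e.g. "Homo sapiens"[Organism]."""
    return f'"{value}"[{field}]'


def all_of(*clauses: str) -> str:
    """Join clauses with AND and wrap the group in parentheses."""
    return "(" + " AND ".join(clauses) + ")"


def any_of(*clauses: str) -> str:
    """Join clauses with OR and wrap the group in parentheses."""
    return "(" + " OR ".join(clauses) + ")"


# Records for either gene, restricted to a single organism.
term = all_of(
    any_of(fielded("BRCA1", "Gene Name"), fielded("BRCA2", "Gene Name")),
    fielded("Homo sapiens", "Organism"),
)

print(term)
```

The explicit parentheses make operator grouping unambiguous, so the meaning of the query does not depend on how the server orders unparenthesized Boolean operators.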

Data Sources and Coverage

Entrez aggregates content from major life sciences repositories including nucleotide sequences from GenBank, curated reference sequences from RefSeq, protein information cross-referenced to Swiss-Prot and UniProtKB, literature indexed in PubMed, clinical descriptions from OMIM, and structural coordinates from the Protein Data Bank. Taxonomic classification is provided by the NCBI Taxonomy database, with organism names and lineage linking to sequence holdings. Coverage spans prokaryotic and eukaryotic organisms represented in submissions from research institutions such as Scripps Research and international genome consortia including the Human Genome Project and the 1000 Genomes Project.

Integration and Interoperability

Interoperability is achieved via standardized identifiers, cross-reference tables, and APIs that permit exchange with external resources like UniProt, Ensembl, and clinical variant repositories. The E-utilities permit integration into workflows used at universities and companies such as MIT, Google DeepMind, and Illumina. Entrez’s linking model supports citation chains from journals like Science and Nature to primary sequence deposits and structural models, enabling combined analysis with tools from EMBL-EBI and pipelines used by the National Human Genome Research Institute.
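
The idea of following a chain of cross-references from a citation to a sequence and on to a structure can be sketched with a toy in-memory link table. This is an illustrative model only and does not reflect how Entrez is implemented; the class `LinkTable`, its methods, and every identifier below are made-up placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# A record is identified by (database name, identifier within that database).
Record = Tuple[str, str]


class LinkTable:
    """Toy bidirectional cross-reference table between records."""

    def __init__(self) -> None:
        self._links: Dict[Record, Set[Record]] = defaultdict(set)

    def add(self, a: Record, b: Record) -> None:
        """Register a link that can be followed in both directions."""
        self._links[a].add(b)
        self._links[b].add(a)

    def follow(self, start: Record, path: Iterable[str]) -> List[Record]:
        """Follow links from start through each database named in path."""
        frontier = {start}
        for db in path:
            frontier = {
                dst
                for src in frontier
                for dst in self._links.get(src, ())
                if dst[0] == db
            }
        return sorted(frontier)


# Placeholder identifiers, not real accessions.
table = LinkTable()
table.add(("pubmed", "PMID-1"), ("nuccore", "SEQ-A"))
table.add(("nuccore", "SEQ-A"), ("protein", "PROT-A"))
table.add(("protein", "PROT-A"), ("structure", "STRUCT-X"))

chain = table.follow(("pubmed", "PMID-1"), ["nuccore", "protein", "structure"])
print(chain)
```

Pairing each identifier with its database name mirrors the role of standardized identifiers in the text: two records can be linked without either database needing to know the other's internal schema.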

Usage, Impact, and Criticism

Entrez has been instrumental for researchers in molecular biology, genetics, structural biology, and biomedical informatics by enabling cross-database discovery and reproducible retrieval for thousands of studies published in venues such as Cell and The Lancet. It underpins many bioinformatics tools and educational resources at institutions such as the University of California, San Francisco and Johns Hopkins University. Criticisms have focused on scalability in the face of rapidly increasing sequence data volumes from projects such as Genome Project-Write, and on the computational demands of real-time linking. These concerns have led to calls from scholars at the Max Planck Institute and the European Molecular Biology Laboratory for improved indexing performance, richer semantic metadata, and better integration with ontologies such as the Gene Ontology and with clinical vocabularies. Some users have also requested more transparent versioning and provenance, comparable to practices at Dryad and to the data citation standards advocated by the Research Data Alliance.

Category:Bioinformatics databases