PDB (file format)

Name	PDB
Extension	.pdb
Mime	chemical/x-pdb
Owner	Worldwide Protein Data Bank
Released	1976
Genre	Molecular structure file format

Contents

Introduction
File format specification
Coordinate and structural records
Secondary data and annotations
History and evolution
Tools and software support

PDB (file format) is a plain-text format for representing the three-dimensional structures of biological macromolecules. It was originally designed for proteins and was later extended to nucleic acids, complexes, and small-molecule ligands. It has long served as a common exchange format among structural biology resources such as the Protein Data Bank, computational packages such as Rosetta, and visualization tools including PyMOL and UCSF Chimera. The format underlies archival collections maintained by the Research Collaboratory for Structural Bioinformatics (RCSB) and the Worldwide Protein Data Bank (wwPDB) consortium.

Introduction

The PDB format encodes atomic coordinates, metadata, and experimental parameters for structures determined by methods such as X-ray crystallography, NMR spectroscopy, and electron microscopy. Such files have been distributed by organizations including Brookhaven National Laboratory, the European Bioinformatics Institute, and RCSB PDB. The format has influenced related data standards such as mmCIF, as well as initiatives backed by bodies such as the International Union of Crystallography and the National Institutes of Health. Structural biologists at universities such as Harvard University, the Massachusetts Institute of Technology, and the University of Cambridge routinely deposit models in this format or its successors. Archives that host PDB-format files support downstream analysis by groups at Stanford University, the Max Planck Society, and pharmaceutical companies such as Pfizer.

File format specification

The PDB format is a fixed-column, line-oriented record system in which each line begins with a record name such as ATOM, HETATM, HEADER, or END. The original specification emerged from workshops attended by participants from Brookhaven National Laboratory, Cold Spring Harbor Laboratory, and the European Molecular Biology Laboratory. A file comprises multiple record types covering provenance, experimental method, crystallographic parameters, and molecular composition. Each field occupies a defined range of character columns, which allows simple parsers to be written in the languages used at computational centers such as Los Alamos National Laboratory and Lawrence Berkeley National Laboratory. The fixed column widths also impose limits: atom serial numbers have five columns (at most 99,999 atoms), residue sequence numbers have four columns, and coordinates are stored to three decimal places. These constraints prompted the adoption of alternative formats in projects by teams at the European Bioinformatics Institute and the RCSB PDB.
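The fixed-column layout can be illustrated with a minimal Python sketch that slices one coordinate line by character position. The column ranges follow the commonly documented ATOM/HETATM layout; the names AtomRecord, parse_atom_line, and SAMPLE_LINE are hypothetical, the sample line is illustrative rather than taken from a specific entry, and the sketch assumes every numeric field is populated.

```python
from dataclasses import dataclass


@dataclass
class AtomRecord:
    record: str      # "ATOM" or "HETATM"
    serial: int
    name: str
    alt_loc: str
    res_name: str
    chain_id: str
    res_seq: int
    i_code: str
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float
    element: str


def parse_atom_line(line: str) -> AtomRecord:
    """Slice one ATOM/HETATM line by its fixed character columns."""
    line = line.rstrip("\r\n").ljust(80)  # pad so short lines slice safely
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        raise ValueError(f"not a coordinate record: {record!r}")
    return AtomRecord(
        record=record,
        serial=int(line[6:11]),          # columns 7-11
        name=line[12:16].strip(),        # columns 13-16
        alt_loc=line[16].strip(),        # column 17
        res_name=line[17:20].strip(),    # columns 18-20
        chain_id=line[21].strip(),       # column 22
        res_seq=int(line[22:26]),        # columns 23-26
        i_code=line[26].strip(),         # column 27
        x=float(line[30:38]),            # columns 31-38
        y=float(line[38:46]),            # columns 39-46
        z=float(line[46:54]),            # columns 47-54
        occupancy=float(line[54:60]),    # columns 55-60
        b_factor=float(line[60:66]),     # columns 61-66
        element=line[76:78].strip(),     # columns 77-78
    )


# Illustrative line, assembled field by field so the column widths are visible.
SAMPLE_LINE = (
    "ATOM  "      # record name
    "    1"       # atom serial number
    " "
    " N  "        # atom name
    " "           # alternate location indicator (blank)
    "MET"         # residue name
    " "
    "A"           # chain identifier
    "   1"        # residue sequence number
    " "           # insertion code (blank)
    "   "
    "  27.340"    # X
    "  24.430"    # Y
    "   2.614"    # Z
    "  1.00"      # occupancy
    "  9.67"      # B-factor
    "          "
    " N"          # element symbol
)

if __name__ == "__main__":
    print(parse_atom_line(SAMPLE_LINE))
```

Slicing by position rather than splitting on whitespace matters here because adjacent fields may touch, for example when a coordinate fills all eight of its columns.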

Coordinate and structural records

The core coordinate records are ATOM, for atoms in standard amino acid and nucleotide residues, and HETATM, for heterogens such as ligands, ions, and water. Each record carries X, Y, and Z coordinates in ångströms together with occupancy and B-factor (temperature factor) fields, which describe atomic position and thermal motion. Chain identifiers, residue sequence numbers, and insertion codes locate each residue within its polypeptide chain, as used in studies from Scripps Research, the Weizmann Institute of Science, and Johns Hopkins University. Connectivity is given in CONECT records, which are used in analyses performed by software developed at the University of California, San Francisco and the European Molecular Biology Laboratory. Biological assembly annotations, carried in REMARK 350, list transformation matrices that map the asymmetric unit onto the biologically relevant oligomer, a practice common in structural interpretations from centers such as Cold Spring Harbor Laboratory and the Deutsches Elektronen-Synchrotron (DESY).
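The assembly transformations described above can be sketched as follows. This is a minimal illustration that assumes the matrices appear as REMARK 350 BIOMT1, BIOMT2, and BIOMT3 rows with whitespace-separated values; the function names parse_biomt and apply_operator and the sample remark lines are hypothetical.

```python
from collections import defaultdict


def parse_biomt(lines):
    """Collect BIOMT rows into {operator number: (3x3 rotation, translation)}."""
    rows = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if (
            len(fields) == 8
            and fields[:2] == ["REMARK", "350"]
            and fields[2] in ("BIOMT1", "BIOMT2", "BIOMT3")
        ):
            row_index = int(fields[2][5]) - 1   # BIOMT1 -> row 0, and so on
            operator = int(fields[3])
            rows[operator][row_index] = [float(v) for v in fields[4:8]]

    operators = {}
    for operator, by_row in rows.items():
        if sorted(by_row) != [0, 1, 2]:
            raise ValueError(f"operator {operator} is missing a BIOMT row")
        rotation = [by_row[i][:3] for i in range(3)]
        translation = [by_row[i][3] for i in range(3)]
        operators[operator] = (rotation, translation)
    return operators


def apply_operator(rotation, translation, xyz):
    """Return rotation * xyz + translation for a single atom position."""
    return tuple(
        sum(rotation[i][j] * xyz[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Illustrative remarks: the identity plus a two-fold rotation about z,
# shifted by 50 angstroms along y.
SAMPLE_REMARKS = [
    "REMARK 350   BIOMT1   1  1.000000  0.000000  0.000000        0.00000",
    "REMARK 350   BIOMT2   1  0.000000  1.000000  0.000000        0.00000",
    "REMARK 350   BIOMT3   1  0.000000  0.000000  1.000000        0.00000",
    "REMARK 350   BIOMT1   2 -1.000000  0.000000  0.000000        0.00000",
    "REMARK 350   BIOMT2   2  0.000000 -1.000000  0.000000       50.00000",
    "REMARK 350   BIOMT3   2  0.000000  0.000000  1.000000        0.00000",
]

if __name__ == "__main__":
    for number, (rot, trans) in sorted(parse_biomt(SAMPLE_REMARKS).items()):
        print(number, apply_operator(rot, trans, (10.0, 20.0, 30.0)))
```

Applying every operator to every atom of the chains it covers yields the full assembly. A production reader would also track which chains each operator applies to, which this sketch omits.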

Secondary data and annotations

PDB files include annotation records such as HEADER, TITLE, REMARK, SEQRES, HELIX, and SHEET. Together with related records, these document the experimental method, the source organism, the polymer sequence, and secondary-structure assignments, and they support cross-referencing to sequence databases such as GenBank and UniProt and reuse by research groups such as those at the California Institute of Technology. Experimental statistics appear in REMARK sections, with resolution in REMARK 2 and R-factors in REMARK 3, alongside validation flags. They are cross-referenced with validation reports generated by consortia including the Worldwide Protein Data Bank and with assessment tools developed by the European Bioinformatics Institute. Ligand chemistry descriptions, polymer linkage annotations, and sequence cross-references facilitate mapping to resources such as DrugBank, ChEMBL, and PubChem for medicinal chemistry projects at organizations such as GlaxoSmithKline and Novartis.
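As a rough illustration of reading these annotation records, the sketch below pulls the resolution from a REMARK 2 line and rebuilds per-chain sequences from SEQRES lines. It assumes the common "REMARK   2 RESOLUTION. <value> ANGSTROMS." wording and the usual SEQRES columns (chain identifier in column 12, residue names from column 20); the function names and sample lines are hypothetical.

```python
def parse_resolution(lines):
    """Return the REMARK 2 resolution in angstroms, or None if not given."""
    for line in lines:
        fields = line.split()
        if fields[:3] == ["REMARK", "2", "RESOLUTION."]:
            try:
                return float(fields[3])
            except (IndexError, ValueError):
                # e.g. "NOT APPLICABLE." for methods without a resolution
                return None
    return None


def parse_seqres(lines):
    """Return {chain identifier: [three-letter residue names]} from SEQRES."""
    sequences = {}
    for line in lines:
        if line.startswith("SEQRES"):
            chain_id = line[11]                 # column 12
            residues = line[19:].split()        # columns 20 onward
            sequences.setdefault(chain_id, []).extend(residues)
    return sequences


# Illustrative header lines, not copied from a specific entry.
SAMPLE_HEADER = [
    "REMARK   2 RESOLUTION.    2.00 ANGSTROMS.",
    "SEQRES   1 A   21  GLY ILE VAL GLU GLN CYS CYS THR SER ILE CYS SER LEU",
    "SEQRES   2 A   21  TYR GLN LEU GLU ASN TYR CYS ASN",
]

if __name__ == "__main__":
    print(parse_resolution(SAMPLE_HEADER))
    print(parse_seqres(SAMPLE_HEADER))
```

The declared residue count in columns 14 to 17 (21 in the sample) can be compared with the length of the reconstructed list as a simple consistency check.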

History and evolution

Developed in the mid-1970s, the format grew out of structural efforts at Brookhaven National Laboratory, with contributors from institutions such as MIT and the University of Cambridge. It reflected an early need to disseminate macromolecular coordinates for structures such as those from the era of crystallography associated with Max Perutz and John Kendrew. Over the following decades, community-driven changes addressed the format's limitations through extensions and successor formats. Initiatives by the International Union of Crystallography and the Worldwide Protein Data Bank fostered a transition toward mmCIF and its dictionary to overcome numeric and semantic constraints. Milestones include the adoption of enhanced nomenclature standards influenced by committees at Protein Data Bank Japan and RCSB PDB, and policy shifts driven by funders such as the National Science Foundation and the National Institutes of Health.

Tools and software support

A wide range of tools supports reading, editing, validating, and visualizing PDB-format files. Molecular graphics packages including PyMOL, UCSF Chimera, VMD, and Coot read and render PDB files for users at the University of California, San Francisco and structural groups at Scripps Research. Computational packages such as Rosetta, GROMACS, AMBER, and CHARMM convert or import PDB coordinates for modeling and simulations performed at centers such as the European Bioinformatics Institute and Lawrence Livermore National Laboratory. Validation and deposition interfaces provided by the RCSB PDB and PDBe integrate with community resources such as UniProt and SIFTS to keep entries consistent across resources.
