Protein sequencing

Contents

History of protein sequencing
Methods of protein sequencing
Applications of protein sequencing
Challenges and limitations
Comparison to nucleic acid sequencing

Protein sequencing is the process of determining the precise order of amino acids within a protein molecule. This fundamental technique in biochemistry and molecular biology provides critical information about a protein's structure, function, and evolutionary relationships. The field has evolved from laborious manual methods to highly automated techniques integrated with mass spectrometry and bioinformatics.

History of protein sequencing

The foundational work in this field was done by Frederick Sanger, who determined the complete amino acid sequence of the hormone insulin in the 1950s, a feat for which he received the Nobel Prize in Chemistry in 1958. His approach labeled the N-terminal residue with dinitrofluorobenzene (now known as Sanger's reagent), broke the chain into small overlapping fragments by partial hydrolysis, and assembled the full sequence from those fragments, establishing the first practical methodology. A stepwise alternative followed with the Edman degradation, developed by Pehr Edman, which uses phenyl isothiocyanate to remove and identify one N-terminal residue per cycle. Edman and Geoffrey Begg automated the procedure in 1967 with the introduction of the sequenator, and the method dominated the field for decades. The work of Stanford Moore and William H. Stein on ribonuclease A further validated these chemical techniques. The modern era was ushered in by tandem mass spectrometry and by soft ionization techniques: electrospray ionization (ESI), developed by John B. Fenn, and laser desorption methods, including matrix-assisted laser desorption/ionization (MALDI), to which Koichi Tanaka, Franz Hillenkamp, and Michael Karas made key contributions.

Methods of protein sequencing

Traditional chemical sequencing is primarily accomplished via the Edman degradation, which removes and identifies amino acids one at a time from the N-terminus of a protein or peptide. For larger-scale and high-throughput analysis, mass spectrometry-based methods are now predominant. In a typical "bottom-up" proteomics workflow, proteins are first digested by enzymes like trypsin, and the resulting peptides are analyzed by liquid chromatography-tandem mass spectrometry (LC-MS/MS). The MS/MS spectra are usually matched against a database of known or predicted sequences; de novo sequencing algorithms instead interpret the spectra directly to deduce peptide sequences without relying on a genomic database. "Top-down" proteomics, conversely, involves analyzing intact proteins or large fragments directly by mass spectrometry, often using high-resolution instruments like Orbitrap or FT-ICR mass spectrometers. Other techniques include the use of carboxypeptidase enzymes for C-terminal analysis and, indirectly, cryo-electron microscopy, where side-chain density in a high-resolution map can sometimes be used to infer parts of a sequence.
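The digestion step of the bottom-up workflow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production tool: it assumes the commonly used rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P), it ignores missed cleavages and modifications, and it uses standard monoisotopic residue masses. The function names `tryptic_digest` and `peptide_mass` and the input sequence are hypothetical, chosen only for this example.

```python
# Monoisotopic residue masses in daltons (standard reference values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free N- and C-termini


def tryptic_digest(sequence):
    """Split a sequence after K or R, except when the next residue is P.

    Simplified rule: no missed cleavages are modeled.
    """
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified, neutral peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


if __name__ == "__main__":
    toy_protein = "AKRPGKLM"  # made-up sequence, not a real protein
    for peptide in tryptic_digest(toy_protein):
        print(peptide, round(peptide_mass(peptide), 4))
```

In a real experiment the instrument measures mass-to-charge ratios of such peptides and their fragments; calculated masses like these are what search software compares the measurements against.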

Applications of protein sequencing

This technology is indispensable in proteomics for identifying proteins in complex mixtures from sources like cell lysates or blood plasma. It is critical for characterizing post-translational modifications such as phosphorylation, glycosylation, and ubiquitination, which regulate protein function. In biotechnology and pharmaceutical development, it is used to confirm the structure of therapeutic proteins like monoclonal antibodies and recombinant insulin. Within clinical diagnostics, it aids in identifying biomarkers for diseases like cancer and Alzheimer's disease. Furthermore, it is essential in evolutionary biology for constructing phylogenetic trees by comparing homologous protein sequences across species, such as cytochrome c or hemoglobin.
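The comparison of homologous sequences mentioned above can be illustrated with the simplest possible similarity measure, percent identity. The sketch below assumes the sequences have already been aligned to equal length, with gaps written as "-"; real phylogenetic analyses use substitution models and dedicated alignment software instead. The function names and the three short sequences are hypothetical and are not real cytochrome c or hemoglobin data.

```python
from itertools import combinations


def percent_identity(seq_a, seq_b):
    """Percent of identical residues between two pre-aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    compared = matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        matches += a == b
    return 100.0 * matches / compared if compared else 0.0


def pairwise_identity(named_sequences):
    """Percent identity for every pair in a {name: aligned_sequence} dict."""
    return {
        (n1, n2): percent_identity(named_sequences[n1], named_sequences[n2])
        for n1, n2 in combinations(sorted(named_sequences), 2)
    }


if __name__ == "__main__":
    # Made-up aligned fragments for three hypothetical species.
    toy_alignment = {
        "species_a": "GDVEKGKK",
        "species_b": "GDVAKGKK",
        "species_c": "GDIAKG-K",
    }
    for pair, identity in pairwise_identity(toy_alignment).items():
        print(pair, round(identity, 1))
```

Pairs with higher identity are treated as more closely related; a tree-building method would take a full matrix of such values (or of distances derived from them) as input.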

Challenges and limitations

A primary challenge is the analysis of proteins that carry post-translational modifications or are membrane-bound, since these can be difficult to isolate and ionize. The "bottom-up" approach loses information about how peptides were connected in the intact protein and about which combinations of modifications occurred together on a single protein molecule. "Top-down" methods address this, but they are currently limited by instrument mass range and by the complexity of fragmentation spectra for very large proteins. Sample preparation remains a hurdle, because low-abundance proteins can be masked by highly abundant species, such as albumin in blood serum. Furthermore, de novo sequencing from mass spectrometry data alone is computationally intensive and cannot readily distinguish leucine from isoleucine, which are isomers with identical mass.
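The leucine/isoleucine problem follows directly from how de novo sequencing reads a spectrum: consecutive fragment ions differ by the mass of one residue, and the residue is inferred from that difference. The sketch below is an idealized illustration that assumes a complete, noise-free series of prefix masses and a fixed mass tolerance; the function names, the tolerance values, and the peptide "GLK" are hypothetical.

```python
# Monoisotopic residue masses in daltons (standard reference values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def prefix_ladder(peptide):
    """Cumulative residue masses: an idealized stand-in for a fragment series.

    Real fragment ions carry constant offsets (for example a proton), but
    the differences between consecutive ions are the same.
    """
    ladder, total = [0.0], 0.0
    for aa in peptide:
        total += RESIDUE_MASS[aa]
        ladder.append(total)
    return ladder


def residues_for_delta(delta, tol=0.01):
    """All residues whose mass matches a mass difference within tol daltons."""
    return sorted(aa for aa, m in RESIDUE_MASS.items() if abs(m - delta) <= tol)


def read_ladder(ion_masses, tol=0.01):
    """Candidate residues for each step between consecutive ion masses."""
    return [residues_for_delta(b - a, tol)
            for a, b in zip(ion_masses, ion_masses[1:])]


if __name__ == "__main__":
    ladder = prefix_ladder("GLK")  # made-up peptide
    print(read_ladder(ladder))            # the middle step lists both I and L
    print(read_ladder(ladder, tol=0.05))  # a looser tolerance adds more ambiguity
```

Because leucine and isoleucine have the same elemental composition, no mass tolerance can separate them in this scheme. Lysine and glutamine differ by about 0.036 Da, so they become ambiguous as well once the tolerance is looser than that difference.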

Comparison to nucleic acid sequencing

While DNA sequencing determines the order of nucleotides in a gene, protein sequencing directly analyzes the gene's functional product. DNA sequencing technologies, from the method developed by Sanger to later next-generation sequencing platforms from companies like Illumina and Oxford Nanopore Technologies, are typically higher-throughput and more cost-effective for decoding genomic information. However, DNA sequencing cannot detect post-translational modifications or confirm the final expressed protein sequence, which can differ from the sequence predicted from the gene because of RNA splicing or protein processing. The two fields are highly complementary: genomic data provides a reference database that greatly accelerates protein identification by mass spectrometry, an approach central to the field of proteogenomics.
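The way genomic data serves as a reference for protein identification can be sketched as follows: a coding sequence is translated with the standard genetic code, and an observed peptide is then looked up in the predicted protein. This is a simplified illustration that assumes a single forward reading frame starting at the first base and no introns; the function names and the DNA sequence are hypothetical. Treating isoleucine and leucine as equivalent during matching is a simplification adopted here because the two residues have identical mass.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
# (first base varies slowest, third base fastest); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence in frame 0 until the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def matches_reference(peptide, protein):
    """True if the peptide occurs in the protein, treating I and L as equal."""
    return peptide.replace("I", "L") in protein.replace("I", "L")


if __name__ == "__main__":
    toy_gene = "ATGGGCCTGAAATAA"  # made-up coding sequence
    predicted = translate(toy_gene)
    print(predicted)
    print(matches_reference("GIK", predicted))
```

The predicted sequence is only a starting point: it carries no information about post-translational modifications or proteolytic processing, which is why direct analysis of the protein remains necessary.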
