FASTA — LLMpedia

FASTA
Name	FASTA
Developer	William R. Pearson
Released	1985
Operating system	Unix, Linux, Microsoft Windows
Genre	Bioinformatics

FASTA is a suite of bioinformatics tools and a sequence file format widely used for comparing nucleotide and peptide sequences. It provides fast local alignment search, database comparison, and sequence analysis functions that influenced later systems and standards in computational biology. The software and format are integral to molecular biology pipelines for sequence similarity, annotation, and phylogenetic inference.

Introduction

FASTA originated as a set of programs for rapid sequence comparison, addressing needs that arose in molecular cloning and in searching large sequence repositories such as GenBank, EMBL, and Swiss-Prot, and later in work associated with the National Center for Biotechnology Information and the Human Genome Project. The suite includes algorithms for local alignment and database searching, together with sequence handling built around its simple text format, and has been used by researchers at institutions such as Harvard University, Stanford University, University of California, Berkeley, and Cold Spring Harbor Laboratory. By enabling comparisons across protein families like cytochrome c, hemoglobin, and ATP synthase, FASTA supported studies connected to projects like the Yeast Genome Project and consortia such as the 1000 Genomes Project.

History and development

Developed in the mid-1980s by William R. Pearson and collaborators, notably David J. Lipman, FASTA built on earlier alignment ideas exemplified by algorithms from Daniel S. Hirschberg and by the local alignment method of Temple Smith and Michael S. Waterman. The original protein-only program, FASTP, was published in 1985, and the FASTA program, which also handles nucleotide sequences, followed in 1988. FASTA preceded BLAST, which was released in 1990 by researchers at the National Institutes of Health, and was adopted by research groups at Massachusetts Institute of Technology, University of Cambridge, and European Bioinformatics Institute. Over time, FASTA was updated to support growing databases maintained by organizations like National Center for Biotechnology Information and UniProt, and by projects funded by the Wellcome Trust.

File format and specification

The FASTA format is a simple text-based sequence representation used by databases such as GenBank, RefSeq, and UniProtKB. Each record begins with a single-line description introduced by a greater-than character (">"); the content of that line follows metadata conventions that vary between projects such as Ensembl and the Drosophila Genome Project. One or more sequence lines follow and represent amino acids or nucleotides as single-letter codes, including data derived from sequencing platforms from Illumina or Pacific Biosciences. The format's simplicity allowed integration with tools from vendors like Applied Biosystems and with software and databases from groups including the Open Bioinformatics Foundation and the International Nucleotide Sequence Database Collaboration.
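The record structure described above can be read with a few lines of code. The following is a minimal sketch in Python, not the parser used by the FASTA suite or by any particular database; the function name `read_fasta` and the example records are hypothetical and chosen only for illustration.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated; line wrapping is not significant.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record input: one wrapped protein and one short DNA sequence.
example = io.StringIO(
    ">seq1 hypothetical example protein\n"
    "MKTAYIAKQR\n"
    "QISFVKSHFS\n"
    ">seq2 hypothetical example DNA\n"
    "ACGTACGT\n"
)
records = list(read_fasta(example))
```

In this sketch a sequence split across several lines is joined into a single string, and any text before the first header line is ignored.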

Algorithms and tools

FASTA implements heuristic local alignment strategies influenced by theoretical work from Stephen Altschul, David J. Lipman, and Temple F. Smith, and is often compared to BLAST and to exact dynamic-programming algorithms such as the Needleman–Wunsch algorithm (global alignment) and the Smith–Waterman algorithm (local alignment). Programs in the suite include fasta36 (heuristic search), ssearch (rigorous Smith–Waterman search), and glsearch (alignment that is global in the query and local in the library sequence); they support different scoring matrices and affine gap penalties and have been used in analyses at University of Washington, Sanger Institute, and Los Alamos National Laboratory. Integration with visualization tools from UCSC Genome Browser, conservation analyses from Conserved Domain Database, and pipelines using Galaxy demonstrates its interoperability. Implementations are available for command-line environments on Unix-like systems such as UNIX System V and FreeBSD, and can be run on cloud platforms operated by Amazon Web Services and Google Cloud Platform.
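The Smith–Waterman algorithm mentioned above can be illustrated with a short dynamic-programming sketch. This is a simplified illustration rather than the code used in the FASTA suite: it assumes a fixed match/mismatch score in place of a substitution matrix and a linear gap penalty in place of an affine one, and the function name and default scores are hypothetical.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b.

    Assumptions for illustration: identity scoring (match/mismatch) and a
    linear gap penalty; only the score is computed, not the alignment itself.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor of zero lets an alignment start anywhere (local alignment).
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


identical = smith_waterman_score("ACGT", "ACGT")
embedded = smith_waterman_score("TTACGTTT", "GGACGTGG")
unrelated = smith_waterman_score("AAAA", "TTTT")
```

Keeping only two rows of the matrix is enough when just the score is needed; recovering the aligned residues would require storing the full matrix or traceback pointers.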

Applications in bioinformatics

Researchers apply FASTA-based searches in comparative genomics projects like ENCODE and the 1000 Genomes Project, in annotation workflows at the European Molecular Biology Laboratory, and in targeted studies of gene families such as HOX genes and G protein-coupled receptors. Clinical and translational groups at institutions like Mayo Clinic and Johns Hopkins University use FASTA-format data for variant annotation alongside resources like ClinVar and dbSNP. Environmental and metagenomics studies from initiatives such as Tara Oceans and the Human Microbiome Project use FASTA-format sequence files for taxonomic assignment and functional profiling.

Performance and accuracy

FASTA prioritizes speed via heuristics while retaining sensitivity for biologically meaningful local alignments; benchmarks often compare fasta36 against BLAST+ and against rigorous Smith–Waterman implementations such as ssearch. Performance metrics evaluated by labs at European Bioinformatics Institute, Broad Institute, and Lawrence Berkeley National Laboratory consider database size, scoring matrices such as BLOSUM62 and PAM250, and computational resources from supercomputing centers like Oak Ridge National Laboratory. Accuracy depends on substitution matrices, gap models, and post-processing filters similar to those used in pipelines at the Joint Genome Institute (JGI) and NCBI.
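The speed of the heuristic comes largely from its first stage, which looks for short identical words (of length ktup) shared by the two sequences and groups them by diagonal, so that full dynamic programming is only needed around the most promising regions. The sketch below illustrates that word-lookup idea only; the later stages (rescoring with a substitution matrix, joining regions, banded alignment) are omitted, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def ktup_diagonal_hits(query: str, subject: str, ktup: int = 2) -> Dict[int, int]:
    """Count shared words of length ktup per diagonal.

    A diagonal is identified by (subject offset - query offset); many hits on
    one diagonal suggest an ungapped region of similarity.
    """
    # Index the start positions of every ktup-length word in the query.
    index: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)

    hits: Dict[int, int] = defaultdict(int)
    for j in range(len(subject) - ktup + 1):
        # .get avoids adding new keys to the index for unseen words.
        for i in index.get(subject[j:j + ktup], ()):
            hits[j - i] += 1
    return dict(hits)


def best_diagonal(query: str, subject: str, ktup: int = 2) -> Optional[Tuple[int, int]]:
    """Return (diagonal, hit count) for the diagonal with the most word hits."""
    hits = ktup_diagonal_hits(query, subject, ktup)
    if not hits:
        return None
    return max(hits.items(), key=lambda item: item[1])


# Hypothetical example: the query occurs in the subject at an offset of 2.
hits = ktup_diagonal_hits("MKTAYIAK", "GGMKTAYIAKGG", ktup=2)
best = best_diagonal("MKTAYIAK", "GGMKTAYIAKGG", ktup=2)
```

A larger ktup makes the lookup faster but can miss weaker similarities, which is one source of the speed and sensitivity trade-off described above.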

Limitations and alternatives

Limitations include the possibility that the heuristic misses the optimal alignment in extreme cases, and the format's limited ability to represent complex metadata compared with richer standards used by BioProject and BioSample. Alternatives and complementary methods include BLAST, exact dynamic-programming tools implementing Smith–Waterman, profile-based searches like HMMER and PSI-BLAST, and modern k-mer and indexing approaches used in Kraken, DIAMOND, and Minimap2. Workflows in large consortia such as ELIXIR and cloud-native bioinformatics platforms often combine FASTA tools with container technologies from Docker and orchestration by Kubernetes.
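The metadata constraint can be seen in how a header line is usually interpreted: by common convention the first whitespace-delimited token is treated as the identifier and everything after it is unstructured free text. The following sketch assumes that convention; the function name and the example headers are hypothetical.

```python
from typing import Tuple


def split_header(header_line: str) -> Tuple[str, str]:
    """Split a FASTA header into (identifier, free-text description).

    Assumes the common convention that the identifier is the first
    whitespace-delimited token after the leading '>' character.
    """
    text = header_line.lstrip(">").strip()
    parts = text.split(None, 1)  # split on the first run of whitespace
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description


with_description = split_header(">seq1 hypothetical protein [organism=Example]")
identifier_only = split_header(">seq2")
```

Any structure inside the description, such as the bracketed key-value pair in the example, exists only by project-specific convention, so each consumer has to parse it separately.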
