CATH — LLMpedia

CATH
Name	CATH
Type	Biological database
Focus	Protein structure classification
Country	United Kingdom
Institution	University College London
Established	1997

Contents

Overview
History and Development
Classification System and Hierarchy
Methodology and Algorithms
Database Content and Resources
Applications and Impact

CATH is a hierarchical protein domain classification resource that integrates structural and evolutionary information to organize protein domains from experimentally determined structures. It connects structural entries from databases such as the Protein Data Bank with sequence resources like UniProt and functional annotations from sources including Gene Ontology, enabling comparative analyses across families, superfamilies, and folds. CATH supports researchers in structural biology, bioinformatics, and drug discovery through curated assignments, automated pipelines, and downloadable datasets.

Overview

CATH provides a hierarchical scheme that groups protein domains by Class, Architecture, Topology, and Homologous superfamily, linking entries to major resources such as the Protein Data Bank, UniProt, Pfam, InterPro, and SCOP. The resource is maintained by research groups at institutions including University College London and collaborates with consortia like the Structural Genomics Consortium and projects funded by bodies such as the Wellcome Trust and the Biotechnology and Biological Sciences Research Council. CATH outputs are used in pharmaceutical industry pipelines, academic studies of molecular evolution, and annotation workflows in databases like Ensembl.

History and Development

CATH originated in the late 1990s as an effort to systematize domain-level classification of structures deposited in the Protein Data Bank and to complement parallel initiatives such as SCOP. Key contributors include researchers at University College London, where the resource was created, along with researchers affiliated with the MRC Laboratory of Molecular Biology, the European Bioinformatics Institute, and the University of Manchester. Over successive releases, CATH combined automated methods with manual curation, adopted interoperability with resources such as Pfam and PDBsum, and expanded its datasets to address challenges raised at meetings like the Gordon Research Conference on protein folding. Funding and collaborative frameworks involved agencies like the European Commission and institutions such as the Wellcome Sanger Institute.

Classification System and Hierarchy

The CATH hierarchy classifies domains into discrete levels: Class (C), Architecture (A), Topology (T), and Homologous superfamily (H). This scheme interoperates with structural and sequence-centric resources including SCOP, Pfam, UniProt, PROSITE, and InterProScan to map relationships among entries. At the Homologous superfamily level, evolutionary links connect proteins studied in contexts like the Human Genome Project, comparative analyses across model organisms such as Escherichia coli, Saccharomyces cerevisiae, and Arabidopsis thaliana, and clinically relevant families including enzymes characterized by groups at institutions like the Max Planck Institute and Institut Pasteur.
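The four levels are conventionally written as a dotted numeric identifier, one number per level (for example 3.40.50.300). The following is a minimal sketch, not CATH's own software, of how such an identifier could be parsed and compared at different depths of the hierarchy. The `CathId` class, its method names, and the `CLASS_NAMES` table are illustrative assumptions introduced here.

```python
from dataclasses import dataclass

# Names of the four main CATH classes. Any other class number is
# simply not listed in this illustrative table.
CLASS_NAMES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}


@dataclass(frozen=True)
class CathId:
    """Hypothetical container for a C.A.T.H identifier."""
    class_: int
    architecture: int
    topology: int
    superfamily: int

    @classmethod
    def parse(cls, text: str) -> "CathId":
        """Parse a string of four dot-separated integers, e.g. '3.40.50.300'."""
        parts = text.strip().split(".")
        if len(parts) != 4 or not all(p.isdigit() for p in parts):
            raise ValueError(f"expected four dot-separated integers, got {text!r}")
        return cls(*(int(p) for p in parts))

    def level(self, depth: int) -> str:
        """Return the identifier truncated to `depth` levels (1 = C, 4 = H)."""
        if not 1 <= depth <= 4:
            raise ValueError("depth must be between 1 and 4")
        numbers = (self.class_, self.architecture, self.topology, self.superfamily)
        return ".".join(str(n) for n in numbers[:depth])

    def shares_level(self, other: "CathId", depth: int) -> bool:
        """True if both identifiers agree down to the given depth."""
        return self.level(depth) == other.level(depth)


a = CathId.parse("3.40.50.300")
b = CathId.parse("3.40.50.720")
print(CLASS_NAMES.get(a.class_, "other"))  # class name for the first level
print(a.shares_level(b, 3))                # same Topology (fold)?
print(a.shares_level(b, 4))                # same Homologous superfamily?
```

Two domains that agree at depth 3 but differ at depth 4 share a fold without being assigned to the same homologous superfamily, which is the distinction the T and H levels are meant to capture.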

Methodology and Algorithms

CATH combines manual curation with automated algorithms for domain boundary detection, structural comparison, and homology inference, integrating tools inspired by methodologies from groups at the European Bioinformatics Institute and the University of California, San Francisco. Key algorithmic components include structural alignment methods, hidden Markov models similar to those used by HMMER, and clustering approaches that build on sequence-similarity searches comparable to those of BLAST and PSI-BLAST. Validation and benchmarking draw on reference datasets assembled from the Protein Data Bank, curated families from Pfam, and evolutionary information of the kind used in projects such as TreeFam and OrthoDB.
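To make the clustering step concrete, here is a minimal sketch of single-linkage clustering over pairwise similarity scores. It is a generic illustration, not the procedure CATH uses: the function name, the domain labels, the scores, and the threshold are all made-up assumptions.

```python
from typing import Dict, List, Tuple


def single_linkage_clusters(
    items: List[str],
    scored_pairs: List[Tuple[str, str, float]],
    threshold: float,
) -> List[List[str]]:
    """Group items so that any pair scoring >= threshold ends up together.

    Uses a union-find (disjoint-set) structure, so items are merged
    transitively: if A~B and B~C, then A, B, and C share one cluster.
    """
    parent: Dict[str, str] = {item: item for item in items}

    def find(x: str) -> str:
        # Walk to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters: Dict[str, List[str]] = {}
    for item in items:
        clusters.setdefault(find(item), []).append(item)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical domains and similarity scores, for illustration only.
domains = ["d1", "d2", "d3", "d4"]
pairs = [("d1", "d2", 0.9), ("d2", "d3", 0.8), ("d3", "d4", 0.2)]
print(single_linkage_clusters(domains, pairs, threshold=0.5))
```

Single linkage is the simplest choice and can chain loosely related items together. A curated resource would typically pair any such automated grouping with stricter criteria and manual review, as the paragraph above describes.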

Database Content and Resources

CATH hosts curated domain assignments, release notes, downloadable classifications, and derived resources such as sequence families, multiple sequence alignments, and functional annotations connected to Gene Ontology terms. The database cross-references entries to external identifiers from UniProt, structural accessions from the Protein Data Bank, domain families in Pfam, and pathway contexts in KEGG and Reactome. Users access CATH data through web interfaces, FTP mirrors, and APIs modeled after services from the European Nucleotide Archive and EMBL-EBI infrastructures. Community contributions and feedback have come from groups at Stanford University, Massachusetts Institute of Technology, and Cold Spring Harbor Laboratory.
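The cross-referencing described above amounts to a mapping from each domain to its identifiers in other databases. The sketch below shows one way such a lookup table could be built from flat (domain, database, accession) rows. The row layout, the function name, and every identifier in the example are hypothetical placeholders and do not reflect CATH's actual file formats or API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

XrefRow = Tuple[str, str, str]  # (domain_id, database, accession)


def build_xref_index(rows: Iterable[XrefRow]) -> Dict[str, Dict[str, List[str]]]:
    """Group accessions by domain and then by external database.

    Duplicate rows are ignored; accession order follows first appearance.
    """
    index: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for domain_id, database, accession in rows:
        if accession not in index[domain_id][database]:
            index[domain_id][database].append(accession)
    # Convert to plain dicts so missing keys raise KeyError instead of
    # silently creating empty entries.
    return {domain: dict(dbs) for domain, dbs in index.items()}


# Placeholder rows; none of these identifiers refer to real entries.
rows = [
    ("domA", "UniProt", "ACC0001"),
    ("domA", "Pfam", "FAM0001"),
    ("domA", "UniProt", "ACC0001"),  # duplicate, ignored
    ("domB", "UniProt", "ACC0002"),
]
index = build_xref_index(rows)
print(index["domA"]["UniProt"])
print(sorted(index["domA"]))
```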

Applications and Impact

CATH-derived classifications underpin research in structural genomics initiatives like the Structural Genomics Consortium, guide annotation in genome projects such as the Human Genome Project and 1000 Genomes Project, and inform drug-target studies at pharmaceutical companies and research centers including GlaxoSmithKline and Novartis Institutes for BioMedical Research. Applications include fold recognition in pipelines used by groups at the University of Cambridge and Harvard University, functional inference in projects at the European Bioinformatics Institute, and machine learning feature sets in studies at Google DeepMind and university labs. The resource influences education and community standards through workshops coordinated with organizations like the International Society for Computational Biology, and it contributes to reproducible science practices advocated by institutions including the Wellcome Trust.
