Cancer Genomics Hub

Name	Cancer Genomics Hub
Founded	2011
Location	University of California, Santa Cruz
Key people	David Haussler
Parent organization	National Cancer Institute
Website	https://cghub.ucsc.edu/

Contents

Overview
Data and Resources
Access and Usage
Scientific Impact
Technical Infrastructure
Related Projects

The Cancer Genomics Hub (CGHub) was a centralized bioinformatics repository and data coordination center established to support The Cancer Genome Atlas, a landmark project funded by the National Institutes of Health. Managed by the University of California, Santa Cruz Genomics Institute, the hub served as the primary archive for raw genomic sequence data from thousands of cancer patients, facilitating large-scale collaborative research. Its creation represented a major effort in big data management for biomedical research, aiming to accelerate discoveries in oncology by providing researchers worldwide with standardized, high-quality datasets.

Overview

The hub was launched in 2011 as a critical component of the National Cancer Institute's efforts to catalog genomic alterations in cancer. It operated under the guidance of principal investigator David Haussler and his team at the UC Santa Cruz Genomics Institute. Its core mission was to aggregate, validate, and distribute the vast amounts of DNA sequencing data generated by The Cancer Genome Atlas and other NCI-sponsored projects such as the Therapeutically Applicable Research to Generate Effective Treatments initiative. By creating a unified, cloud-accessible resource, it addressed the significant computational challenges of storing and sharing sequencing data at this scale, enabling integrative analyses across different cancer types and research institutions.

Data and Resources

The repository housed a comprehensive collection of multi-modal genomic data, including whole-genome sequences, exome captures, and RNA-Seq transcriptome profiles from tumor and matched normal tissues. All data were aligned to the standard GRCh37 human reference genome and followed strict quality control protocols established by the Broad Institute and other sequencing centers. In addition to raw sequence files, the hub provided essential metadata, clinical annotations, and validated mutation calls, which were crucial for studies published in journals such as Nature and Science. This resource empowered investigations into driver mutations, molecular subtypes, and the genomic basis of drug response across cancers such as glioblastoma, lung cancer, and breast cancer.
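
As an illustration of how tumor and matched normal data might be grouped from such metadata, the following Python sketch pairs files by participant. The record fields (participant_id, sample_type, library_strategy, filename) and the function name are hypothetical and chosen for this example only; they are not the hub's actual metadata schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SequenceRecord:
    """Hypothetical, simplified metadata record for one sequence file."""
    participant_id: str
    sample_type: str       # "tumor" or "normal" (assumed vocabulary)
    library_strategy: str  # e.g. "WGS", "WXS", "RNA-Seq" (assumed vocabulary)
    filename: str


def pair_tumor_normal(records, strategy):
    """Return {participant_id: (tumor_files, normal_files)} for participants
    that have at least one tumor and one normal file of the given strategy."""
    grouped = defaultdict(lambda: {"tumor": [], "normal": []})
    for rec in records:
        if rec.library_strategy != strategy:
            continue
        if rec.sample_type in ("tumor", "normal"):
            grouped[rec.participant_id][rec.sample_type].append(rec.filename)

    pairs = {}
    for participant, files in grouped.items():
        # Keep only participants with both sides of the comparison available.
        if files["tumor"] and files["normal"]:
            pairs[participant] = (sorted(files["tumor"]), sorted(files["normal"]))
    return pairs
```

Participants that lack either a tumor or a normal file for the requested strategy are dropped, since a matched comparison is not possible for them.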

Access and Usage

Access to the hub's data was governed by the National Institutes of Health's data sharing policies, requiring researchers to apply through the Database of Genotypes and Phenotypes system. Authorized users could search, browse, and download datasets via a web portal or using specialized command-line tools compatible with high-performance computing environments. The system supported the Cancer Genomics Cloud pilots and integrated with analysis platforms like the UCSC Genome Browser and the Integrative Genomics Viewer. This framework ensured compliance with patient privacy regulations such as the Health Insurance Portability and Accountability Act while maximizing the utility of the data for global research consortia.

Scientific Impact

The availability of data through the hub directly fueled a transformative period in cancer research, leading to seminal papers in Cell and the New England Journal of Medicine. It enabled the discovery of novel cancer genes, the characterization of mutational signatures associated with tobacco or UV light, and the development of prognostic biomarkers. Research facilitated by the hub contributed to the foundation of precision oncology initiatives at major centers like the Memorial Sloan Kettering Cancer Center and the Dana-Farber Cancer Institute, influencing clinical trial design for targeted therapies and immunotherapy.

Technical Infrastructure

The hub's infrastructure was built on a scalable, secure architecture designed to manage petabytes of data. It utilized distributed storage systems and was hosted at the San Diego Supercomputer Center. Data transfer was optimized using the Aspera high-speed protocol and the GeneTorrent system, which were essential for handling large BAM files and FASTQ files. The backend employed robust database systems to track complex metadata, ensuring data integrity and reproducibility for projects like the International Cancer Genome Consortium.
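
One common way to check the integrity of a large downloaded file is to compare its checksum against a value recorded in the accompanying metadata. The Python sketch below shows that general technique using MD5 and streamed reads; the function names are hypothetical, and the assumption that an MD5 value is available for each file is made for illustration rather than taken from the hub's documentation.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks so that
    multi-gigabyte BAM or FASTQ files never need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's checksum matches the expected value
    (hypothetically taken from the file's metadata record)."""
    return file_md5(path) == expected_md5.strip().lower()
```

A mismatch would typically mean the transfer was truncated or corrupted and the file should be fetched again before any analysis is run on it.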

Related Projects

The hub's operational model and success informed subsequent, larger-scale genomic data commons. It was a direct precursor to the NCI Genomic Data Commons, launched in 2016, which expanded its scope and adopted more advanced data harmonization practices. Other related endeavors include the European Genome-phenome Archive, the cBioPortal for Cancer Genomics, and the ICGC Data Portal. Together with national programs such as the All of Us Research Program, these projects represent the ongoing evolution of efforts to leverage genomic big data for understanding human disease.
