ChEMBL
Name	ChEMBL
Title	ChEMBL
Producer	European Bioinformatics Institute
Country	United Kingdom
History	Established 2009

Contents

Introduction
Data Content and Structure
Curation and Quality Control
Access and Tools
Applications and Impact
History and Development

ChEMBL is a large, curated chemical biology database widely used in drug discovery, cheminformatics, and bioinformatics. It aggregates bioactivity data from the medicinal chemistry literature, patents, and screening efforts, and supports research at pharmaceutical companies, academic institutions, and public agencies. The resource connects to major initiatives and platforms in biomedical research, enabling integration with structural biology, genomics, and pharmacology data.

Introduction

ChEMBL integrates curated bioactivity records that link small molecules, targets, and assays. Its partners and users span several groups: funders and host organizations (Wellcome Trust, European Molecular Biology Laboratory, European Bioinformatics Institute, European Commission, National Institutes of Health); related data resources (National Center for Biotechnology Information, PubChem, DrugBank, UniProt, Protein Data Bank, Ensembl, Gene Ontology Consortium, Reactome, KEGG, PharmGKB, ClinicalTrials.gov); regulators and health agencies (Food and Drug Administration, European Medicines Agency, World Health Organization); research consortia such as the Structural Genomics Consortium; and pharmaceutical companies including GlaxoSmithKline, Pfizer, AstraZeneca, Novartis, Roche, Sanofi, Merck & Co., Bayer, Eli Lilly and Company, Johnson & Johnson, Bristol-Myers Squibb, and Takeda Pharmaceutical Company. The database interoperates with cheminformatics toolkits such as OpenBabel, RDKit, and the Chemistry Development Kit (CDK), and with programming ecosystems such as Python, R, Jupyter Notebook, GitHub, and Docker.

Data Content and Structure

ChEMBL stores molecules, targets, assays, and activity measurements in a relational schema that can be used with database systems such as MySQL, PostgreSQL, SQLite, and Oracle. Molecule records include standardized structures, computed properties, and cross-references to external identifiers. Structures are expressed in notations such as SMILES, the International Chemical Identifier (InChI), and InChIKey; cross-references cover CAS Registry Numbers, PubMed literature records, patent documents from the Patent Cooperation Treaty system, the European Patent Office, and the United States Patent and Trademark Office, supplier catalogs such as Sigma-Aldrich, and structure databases such as ChemSpider. Target annotations map to protein entries in UniProtKB, gene models in Ensembl, orthology resources such as OrthoDB, and pathway context from KEGG Pathway, Reactome, and BioCyc. Assay ontologies and activity endpoints reference standards developed alongside OpenPHACTS and the BioAssay Ontology, as well as community efforts that promote the FAIR data principles, such as GO FAIR and ELIXIR.
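The molecule, target, assay, and activity relationship described above can be sketched with an in-memory SQLite database. This is only an illustration: the table names, column names, and every row below are hypothetical toy values chosen for the example, not the actual ChEMBL schema or data.

```python
import sqlite3

# Toy, heavily simplified schema. Table and column names are illustrative
# assumptions and do NOT reproduce the real ChEMBL schema.
SCHEMA = """
CREATE TABLE molecule (mol_id INTEGER PRIMARY KEY, label TEXT, smiles TEXT);
CREATE TABLE target   (target_id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE assay    (assay_id INTEGER PRIMARY KEY,
                       target_id INTEGER REFERENCES target(target_id),
                       description TEXT);
CREATE TABLE activity (activity_id INTEGER PRIMARY KEY,
                       assay_id INTEGER REFERENCES assay(assay_id),
                       mol_id INTEGER REFERENCES molecule(mol_id),
                       endpoint TEXT, value REAL, units TEXT);
"""


def build_toy_db():
    """Create an in-memory database populated with made-up records."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO molecule VALUES (?, ?, ?)",
                     [(1, "MOL_A", "CCO"), (2, "MOL_B", "c1ccccc1")])
    conn.executemany("INSERT INTO target VALUES (?, ?)",
                     [(1, "TARGET_X"), (2, "TARGET_Y")])
    conn.executemany("INSERT INTO assay VALUES (?, ?, ?)",
                     [(1, 1, "toy binding assay"), (2, 2, "toy cell assay")])
    conn.executemany("INSERT INTO activity VALUES (?, ?, ?, ?, ?, ?)",
                     [(1, 1, 1, "IC50", 250.0, "nM"),
                      (2, 1, 2, "IC50", 12.0, "nM"),
                      (3, 2, 1, "EC50", 900.0, "nM")])
    return conn


def activities_for_target(conn, target_label):
    """Join activity -> assay -> target and activity -> molecule."""
    query = """
        SELECT m.label, a.endpoint, a.value, a.units
        FROM activity AS a
        JOIN assay    AS s ON s.assay_id = a.assay_id
        JOIN target   AS t ON t.target_id = s.target_id
        JOIN molecule AS m ON m.mol_id = a.mol_id
        WHERE t.label = ?
        ORDER BY a.value
    """
    return conn.execute(query, (target_label,)).fetchall()


if __name__ == "__main__":
    db = build_toy_db()
    for row in activities_for_target(db, "TARGET_X"):
        print(row)
```

The point of the sketch is the shape of the join: an activity measurement belongs to an assay, the assay is assigned to a target, and the measurement refers to a molecule.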

Curation and Quality Control

Curation workflows combine expert curators with automated pipelines, drawing on literature from publishers such as Nature Publishing Group, Springer Nature, Elsevier, Wiley-Blackwell, and the American Chemical Society, and on documents from patent offices. Quality control includes checks on stereochemistry and nomenclature against International Union of Pure and Applied Chemistry (IUPAC) conventions, tautomer handling guided by the IUPAC InChI standard, and harmonization of activity units with reference to the International System of Units (SI). Provenance tracking links entries to sources such as PubMed Central, Europe PMC, and journal articles by authors affiliated with University of Oxford, University of Cambridge, Massachusetts Institute of Technology, Harvard University, Stanford University, University College London, Karolinska Institutet, and pharmaceutical research groups at Cambridge Biomedical Campus and Cambridge University Hospitals.
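Activity unit harmonization of the kind mentioned above can be illustrated with a small conversion helper. This is a minimal sketch, assuming that concentrations are normalized to nanomolar and that only the molar units listed in the table need handling; the function name and the choice of nanomolar are assumptions for the example, not a description of the actual curation pipeline.

```python
# Multiplicative factors that convert a molar concentration to nanomolar.
# The set of supported units is an assumption made for this sketch.
_TO_NANOMOLAR = {
    "M": 1e9,
    "mM": 1e6,
    "uM": 1e3,
    "nM": 1.0,
    "pM": 1e-3,
}


def to_nanomolar(value, units):
    """Return `value` expressed in nM; raise ValueError for unknown units."""
    # Accept both the micro sign and the Greek letter mu as spellings of "u".
    key = units.strip().replace("\u00b5", "u").replace("\u03bc", "u")
    if key not in _TO_NANOMOLAR:
        raise ValueError(f"unsupported concentration unit: {units!r}")
    return value * _TO_NANOMOLAR[key]


if __name__ == "__main__":
    for value, units in [(1.5, "uM"), (2.0, "mM"), (40.0, "nM")]:
        print(value, units, "->", to_nanomolar(value, units), "nM")
```

Rejecting unknown units instead of guessing keeps mass-based units such as mg/mL, which cannot be converted without a molecular weight, from silently entering a harmonized column.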

Access and Tools

Users access the resource through a web interface, a RESTful application programming interface (API) influenced by the OpenAPI Initiative, bulk downloads, and client libraries compatible with the Python Package Index, Bioconductor, and Conda. Visualization and analysis tools integrate with RDKit, KNIME, Cytoscape, PyMOL, UCSF Chimera, UCSF ChimeraX, the protein-ligand interaction profiler PLIP, and cheminformatics platforms such as ChemAxon. Data are distributed through EMBL-EBI infrastructure, including EBI Search, and through mirrors hosted by collaborators including the European Genome-Phenome Archive, Zenodo, and community repositories on GitHub. Authentication and attribution align with identity providers such as ELIXIR AAI and with licensing frameworks such as Creative Commons.
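As a rough illustration of the RESTful access route, the helper below only assembles a query URL and makes no network call. The base URL, the `.json` suffix, and the filter names `molecule_chembl_id` and `standard_type` reflect the ChEMBL web services as commonly documented, but they are assumptions here and should be checked against the current API documentation; `build_query_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed base URL of the ChEMBL web services; verify before relying on it.
BASE_URL = "https://www.ebi.ac.uk/chembl/api/data"


def build_query_url(resource, **filters):
    """Build a JSON query URL for a resource, with optional filter parameters."""
    url = f"{BASE_URL}/{resource}.json"
    query = urlencode(filters)  # keeps keyword order, escapes values
    return f"{url}?{query}" if query else url


if __name__ == "__main__":
    # CHEMBL25 is the ChEMBL identifier for aspirin.
    print(build_query_url("activity",
                          molecule_chembl_id="CHEMBL25",
                          standard_type="IC50",
                          limit=5))
```

The resulting URL could then be fetched with any HTTP client, for example `urllib.request.urlopen`, and the response parsed as JSON.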

Applications and Impact

The dataset underpins target identification, lead optimization, and drug repurposing studies cited in journals such as Nature, Science, Cell, The Lancet, Nature Medicine, Journal of Medicinal Chemistry, and Chemical Science, and presented at meetings including the American Chemical Society National Meeting, Gordon Research Conferences, EMBO events, ISMB, and BioIT World. Industry use cases include hit-to-lead projects at AstraZeneca R&D and GSK Discovery Partnerships; academic and translational users include the University of Dundee Drug Discovery Unit, European ScreeningPort, and programs tied to the National Center for Advancing Translational Sciences. Machine learning pipelines built on the data use frameworks such as TensorFlow, PyTorch, and scikit-learn, together with data standards promoted by Open Targets and ELIXIR-CONVERGE. These support computational chemistry work including quantitative structure-activity relationship (QSAR) modeling, molecular docking with AutoDock Vina, and virtual screening in collaboration with Drug Design Data Resource initiatives.
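A common first step when preparing bioactivity values for QSAR or machine learning is to put potencies on a logarithmic scale, using pIC50 = -log10(IC50 in mol/L), which equals 9 - log10(IC50 in nM). The sketch below assumes IC50 values already expressed in nanomolar; the 6.0 cutoff (1 µM) used to label a compound active is an arbitrary threshold chosen for illustration, and both function names are hypothetical.

```python
import math


def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 (negative log10 of molar IC50)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


def label_active(ic50_nm, threshold_pic50=6.0):
    """Binary label for classification; the default threshold is an assumption."""
    return pic50_from_nm(ic50_nm) >= threshold_pic50


if __name__ == "__main__":
    for ic50 in (10.0, 1000.0, 50000.0):
        print(ic50, "nM -> pIC50", round(pic50_from_nm(ic50), 2),
              "active" if label_active(ic50) else "inactive")
```

Working on the pIC50 scale compresses the several orders of magnitude that raw IC50 values span, which tends to suit regression models better than raw concentrations.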

History and Development

The initiative originated from projects within EMBL-EBI, the European Bioinformatics Institute of the European Molecular Biology Laboratory, and evolved through funding from and collaboration with the Wellcome Trust, the Innovative Medicines Initiative, the European Union's Horizon 2020 programme, and industry consortia involving GlaxoSmithKline, Pfizer, Novartis Institutes for BioMedical Research, and Bristol-Myers Squibb. Key milestones include public releases synchronized with standards efforts from OpenPHACTS, adoption by databases such as PubChem BioAssay, and incorporation into education and training through EMBL-EBI Training courses and workshops at Cold Spring Harbor Laboratory. Ongoing development is driven by community contributions hosted on GitHub and by collaborations with initiatives such as Open Targets, FAIRsharing, and GO FAIR, and with bioinformatics networks including ELIXIR and the Global Alliance for Genomics and Health.
