Chemical Markup Language

Name	Chemical Markup Language
Developer	Peter Murray-Rust, Henry Rzepa; Blue Obelisk contributors
Typing	Declarative
File extension	.cml
License	Various (open)

Contents

Overview
History and Development
Language Structure and Syntax
Applications and Use Cases
Implementations and Tools
Standards and Governance
Criticisms and Limitations

Chemical Markup Language (CML) is an XML-based format for representing chemical information in a structured, machine-readable way. It supports the exchange of molecular structures, reactions, spectra, and crystallographic data among software systems and databases. The format connects cheminformatics tools, computational chemistry packages, and publishing workflows, supporting reproducible research and data integration across institutions such as the European Bioinformatics Institute, the National Institutes of Health, Lawrence Berkeley National Laboratory, and the Massachusetts Institute of Technology.

Overview

CML encodes chemical entities, properties, and relationships as XML elements and attributes, promoting interoperability among tools and resources such as Open Babel, Avogadro, RDKit, Jmol, and the RCSB PDB. It can represent atoms, bonds, coordinates, spectra, and reaction mechanisms, which allows data to move between cheminformatics projects at IBM Research, Microsoft Research, Google Research, and Lawrence Livermore National Laboratory and the publishing platforms used by Springer Nature, Wiley, Elsevier, and the American Chemical Society. By building on World Wide Web Consortium standards and aligning with ontologies hosted on BioPortal and with the FAIR principles adopted by organizations such as the European Molecular Biology Laboratory, CML aims to be reusable across repositories managed by the Protein Data Bank and data initiatives at the Wellcome Trust.

History and Development

Development traces to the mid-1990s, when Peter Murray-Rust and Henry Rzepa began work on the format, and continued through the 2000s with contributors from the Royal Society of Chemistry, the Blue Obelisk community, and academic labs at the University of Cambridge, the University of Oxford, the University of California, Berkeley, and Caltech. Key milestones involved integration with initiatives at the National Center for Biotechnology Information and coordination with XML work at the World Wide Web Consortium. Open-source projects such as Jmol and Open Babel added CML support, and community stewardship included participants from the European Chemical Society, with funding from European Commission and National Science Foundation programs.

Language Structure and Syntax

CML builds on XML namespaces and schemas, following World Wide Web Consortium standards and the metadata schemes used by Dublin Core adopters in scholarly publishing. Documents use hierarchical elements for molecules, atoms, bonds, and properties, which map to concepts familiar to practitioners at the Max Planck Society, ETH Zurich, and Caltech. The syntax supports coordinate sets compatible with formats used by Gaussian, GAMESS, and NWChem, and crystallographic data consistent with International Union of Crystallography conventions. The schema also allows embedding of spectral data from instruments made by Bruker Corporation and Agilent Technologies, and annotation compatible with ontologies developed by the Gene Ontology and other biomedical ontology consortia.
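To make the hierarchy concrete, the following is a minimal sketch of reading a small CML fragment with Python's standard library. The water molecule, its coordinates, and the helper name `read_molecule` are illustrative assumptions rather than content from any particular CML corpus. The element and attribute names (`molecule`, `atomArray`, `atom`, `bondArray`, `bond`, `elementType`, `x3`/`y3`/`z3`, `atomRefs2`, `order`) and the namespace URI follow common CML usage.

```python
import xml.etree.ElementTree as ET

# Namespace URI commonly declared by CML documents.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Hypothetical example document; coordinates are illustrative only.
WATER_CML = """
<molecule id="water" xmlns="http://www.xml-cml.org/schema">
  <atomArray>
    <atom id="a1" elementType="O" x3="0.000" y3="0.000" z3="0.117"/>
    <atom id="a2" elementType="H" x3="0.000" y3="0.757" z3="-0.469"/>
    <atom id="a3" elementType="H" x3="0.000" y3="-0.757" z3="-0.469"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
  </bondArray>
</molecule>
"""


def read_molecule(cml_text):
    """Parse one <molecule> element into plain Python data structures."""
    root = ET.fromstring(cml_text.strip())

    atoms = []
    for atom in root.findall("cml:atomArray/cml:atom", CML_NS):
        # 3D Cartesian coordinates are stored as string attributes.
        xyz = tuple(float(atom.get(axis)) for axis in ("x3", "y3", "z3"))
        atoms.append((atom.get("id"), atom.get("elementType"), xyz))

    bonds = []
    for bond in root.findall("cml:bondArray/cml:bond", CML_NS):
        # atomRefs2 holds two whitespace-separated atom ids.
        first, second = bond.get("atomRefs2").split()
        bonds.append((first, second, bond.get("order")))

    return {"id": root.get("id"), "atoms": atoms, "bonds": bonds}


if __name__ == "__main__":
    molecule = read_molecule(WATER_CML)
    for atom_id, element, xyz in molecule["atoms"]:
        print(atom_id, element, xyz)
    for first, second, order in molecule["bonds"]:
        print(first, "-", second, "order", order)
```

Because atoms carry explicit `id` attributes and bonds refer to them by id, the connectivity survives reordering of the atom list, which positional formats do not guarantee.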

Applications and Use Cases

CML is used in cheminformatics workflows at pharmaceutical companies such as Pfizer, Novartis, GlaxoSmithKline, and Roche for compound registries, in reaction informatics initiatives at Merck & Co., and for exchanging the output of quantum chemistry calculations produced by ORCA, Psi4, and Q-Chem. Academic projects at Stanford University, Harvard University, Princeton University, and Yale University use CML for teaching, reproducible research, and linking data to publications in journals of the American Chemical Society and the Royal Society of Chemistry. Electronic lab notebooks developed by Benchling and LabArchives, and institutional repositories at CNRS and the University of Tokyo, use CML for machine-actionable records.

Implementations and Tools

Software libraries and converters include modules in Open Babel, language bindings used by research groups at Apple, and plugins for molecular editors such as Avogadro and viewers such as JSmol. Toolchains link to data analysis platforms developed at The Alan Turing Institute and to high-performance computing centers such as Oak Ridge National Laboratory and Argonne National Laboratory. Integration with publishing tools such as Overleaf and with manuscript submission systems at PubMed Central makes it easier to embed CML in articles. Community toolsets maintained by Blue Obelisk, along with projects hosted on SourceForge and GitHub, provide parsers and validators.
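As a rough illustration of what a lightweight validator might check, the sketch below verifies that atom ids are unique and that every bond refers to atoms that exist. It is not a schema validator and does not reproduce the behavior of any of the tools named above. The function name `check_bond_references` and both sample documents are hypothetical, and the check assumes a single-molecule document in which atom ids share one scope.

```python
import xml.etree.ElementTree as ET

CML_URI = "http://www.xml-cml.org/schema"
ATOM_TAG = "{" + CML_URI + "}atom"
BOND_TAG = "{" + CML_URI + "}bond"

# Hypothetical sample documents for illustration.
GOOD_CML = (
    '<molecule xmlns="http://www.xml-cml.org/schema">'
    '<atomArray><atom id="a1" elementType="C"/><atom id="a2" elementType="O"/></atomArray>'
    '<bondArray><bond atomRefs2="a1 a2" order="2"/></bondArray>'
    "</molecule>"
)
BAD_CML = GOOD_CML.replace('atomRefs2="a1 a2"', 'atomRefs2="a1 a9"')


def check_bond_references(cml_text):
    """Return a list of problems; an empty list means the checks passed."""
    root = ET.fromstring(cml_text)
    problems = []

    # Collect atom ids, flagging missing or repeated ones.
    atom_ids = set()
    for atom in root.iter(ATOM_TAG):
        atom_id = atom.get("id")
        if atom_id is None:
            problems.append("atom without id")
        elif atom_id in atom_ids:
            problems.append(f"duplicate atom id: {atom_id}")
        else:
            atom_ids.add(atom_id)

    # Each bond should name exactly two known atoms.
    for bond in root.iter(BOND_TAG):
        refs = (bond.get("atomRefs2") or "").split()
        if len(refs) != 2:
            problems.append("bond without exactly two atom references")
            continue
        for ref in refs:
            if ref not in atom_ids:
                problems.append(f"bond refers to unknown atom: {ref}")

    return problems


if __name__ == "__main__":
    print(check_bond_references(GOOD_CML))
    print(check_bond_references(BAD_CML))
```

A production validator would normally also validate against the published CML schema; this sketch covers only referential integrity.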

Standards and Governance

Governance has been largely community-driven, with oversight and contributions from the Royal Society of Chemistry and Blue Obelisk, collaboration with standards bodies such as the World Wide Web Consortium, and coordination with data stewardship initiatives at CODATA and the Research Data Alliance. Schema evolution and best practices have drawn on input from institutions such as the European Bioinformatics Institute and the National Institutes of Health, and from the editorial offices of Nature Research and Science. Licensing and adoption strategies draw on open data policies championed by the Wellcome Trust and on mandates from funders and funding programmes such as Horizon Europe and the National Science Foundation.

Criticisms and Limitations

Critics point to CML's complexity and verbosity compared with the compact binary representations favored in some high-throughput environments at Google and Amazon Web Services. Interoperability challenges persist between CML and proprietary formats such as those used by ChemDraw and by instrument suites from Thermo Fisher Scientific. Adoption barriers include fragmented tooling, inconsistent schema extensions by academic groups at MIT and Columbia University, and limited enforcement of metadata standards in repositories such as Zenodo and Figshare. Performance problems arise with very large datasets of the kind handled at the European Organization for Nuclear Research or national supercomputing centers, which has prompted alternative approaches in projects at Netflix Research and on cloud platforms.
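The verbosity concern can be illustrated with a small, informal measurement. The sketch below generates a synthetic linear chain of carbon atoms (a hypothetical structure, not real data) as CML and as plain XYZ-style text, then compares raw and gzip-compressed sizes. XYZ and gzip stand in here for "more compact" encodings and are not the binary formats the critics refer to. The comparison is also not like-for-like, since the XYZ text carries no bond information. The helper names `make_cml`, `make_xyz`, and `sizes` are assumptions introduced for this example.

```python
import gzip


def make_cml(n_atoms):
    """Build a synthetic CML molecule: a straight chain of carbon atoms."""
    atoms = "".join(
        f'<atom id="a{i}" elementType="C" x3="{i * 1.5:.3f}" y3="0.000" z3="0.000"/>'
        for i in range(1, n_atoms + 1)
    )
    bonds = "".join(
        f'<bond atomRefs2="a{i} a{i + 1}" order="1"/>' for i in range(1, n_atoms)
    )
    return (
        '<molecule xmlns="http://www.xml-cml.org/schema">'
        f"<atomArray>{atoms}</atomArray><bondArray>{bonds}</bondArray></molecule>"
    )


def make_xyz(n_atoms):
    """Build the same coordinates as XYZ-style text (no bonds are stored)."""
    lines = [str(n_atoms), "synthetic chain"]
    lines += [f"C {i * 1.5:.3f} 0.000 0.000" for i in range(1, n_atoms + 1)]
    return "\n".join(lines) + "\n"


def sizes(text):
    """Return (raw_bytes, gzip_bytes) for a text document."""
    raw = text.encode("utf-8")
    return len(raw), len(gzip.compress(raw))


if __name__ == "__main__":
    n = 200
    cml_raw, cml_gz = sizes(make_cml(n))
    xyz_raw, xyz_gz = sizes(make_xyz(n))
    print(f"CML: {cml_raw} bytes raw, {cml_gz} bytes gzipped")
    print(f"XYZ: {xyz_raw} bytes raw, {xyz_gz} bytes gzipped")
```

Repeated tag and attribute names account for much of the raw size, so general-purpose compression typically recovers a large share of it. Parsing cost on very large files is a separate issue that compression does not address.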
