InChI — LLMpedia

InChI
Name	International Chemical Identifier
Abbreviation	InChI
Developer	IUPAC; NIST
First release	2005
Stable release	1.06 (example)
License	open-source
Website	International Union of Pure and Applied Chemistry

Contents

Introduction
Format and Structure
Generation and Tools
Applications and Adoption
Limitations and Criticisms
Historical Development

InChI, the International Chemical Identifier, is a textual identifier for chemical substances designed to provide a standard, machine-readable representation that facilitates linking and searching across databases. It complements graphical depictions and registry numbers by encoding structural information in a layered ASCII string intended for interoperability among publishers, laboratories, and data repositories.

Introduction

InChI was developed to enable consistent indexing and retrieval across digital resources such as PubChem, ChEMBL, ChemSpider, Reaxys, and SciFinder, and to support workflows used by organisations including IUPAC, NIST, the Royal Society of Chemistry, the European Bioinformatics Institute, and the US National Library of Medicine. It is supported by tools and platforms such as OpenBabel, RDKit, CDK, Accelrys, and ChemOffice and underpins exchange between projects such as the Human Genome Project, the Protein Data Bank, the Cancer Genome Atlas, the European Molecular Biology Laboratory, and the Wellcome Trust. The identifier has been adopted by journals and databases maintained by the American Chemical Society, Elsevier, Springer Nature, Wiley, Nature Research, and PLOS as part of efforts to link publications with chemical data.

Format and Structure

An InChI string is composed of multiple layers that encode the molecular formula, atom connectivity, hydrogen placement, electronic charge, stereochemistry, and isotopic composition, designed to represent structures deterministically for systems used by IBM, Google, Microsoft Research, Amazon, and Oracle in cheminformatics pipelines. The string begins with the prefix "InChI=" and a version number, followed by "S" in a standard InChI. The layers that follow are separated by "/" delimiters and, apart from the chemical formula, each is introduced by a lowercase prefix letter. The result is a canonical representation that tools such as Chemaxon, Biovia, MolSoft, and PerkinElmer can parse, comparable to the canonical SMILES produced by toolkits from Daylight Chemical Information Systems and OpenEye. Major layers include the main layer (chemical formula, atom connectivity, and hydrogen atoms), the charge layer (charge and protonation), the stereochemical layer, the isotopic layer, and, in non-standard InChIs only, the fixed-hydrogen layer and the reconnected layer, which describes metal-containing structures with their bonds to metal atoms retained. These concepts are echoed in projects at Stanford University, MIT, Harvard University, Caltech, and the Max Planck Society, where consistent molecular identifiers are critical.
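The layered layout can be illustrated with a short Python sketch. It is not part of the official InChI software. The function name parse_inchi_layers and the shape of the returned dictionary are assumptions made for illustration, and the code relies only on the convention that the string starts with "InChI=", that segments are separated by "/", and that every layer except the formula begins with a lowercase prefix letter. Layers are returned as an ordered list rather than a mapping because the same prefix letter can occur more than once in a single identifier.

```python
def parse_inchi_layers(inchi: str) -> dict:
    """Split an InChI string into version, formula, and prefixed layers.

    Illustrative sketch only: it tokenises the string and does not
    validate the chemistry or the internal syntax of any layer.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string: missing 'InChI=' prefix")

    segments = inchi[len(prefix):].split("/")
    version_field = segments[0]  # e.g. "1S" for a standard InChI, version 1
    result = {
        "version": version_field.rstrip("S"),
        "standard": version_field.endswith("S"),
        "formula": None,
        "layers": [],  # ordered list of (prefix letter, layer body)
    }

    rest = segments[1:]
    # The formula carries no prefix letter: it starts with an element
    # symbol (uppercase) or a multiplier digit, never a lowercase letter.
    if rest and rest[0] and not rest[0][0].islower():
        result["formula"] = rest[0]
        rest = rest[1:]

    for segment in rest:
        if segment:
            result["layers"].append((segment[0], segment[1:]))
    return result


# Ethanol: formula, connectivity layer (c), and hydrogen layer (h).
parsed = parse_inchi_layers("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(parsed["formula"])  # C2H6O
print(parsed["layers"])   # [('c', '1-2-3'), ('h', '3H,2H2,1H3')]
```

For a stereochemically defined compound such as L-alanine, the same function would also return the t, m, and s stereochemical sublayers in the order in which they appear.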

Generation and Tools

Generation of an InChI string typically involves conversion software that reads formats like MOL, SDF, CML, PDB, and XYZ produced by instruments and applications from companies such as Thermo Fisher Scientific, Agilent Technologies, Bruker, Waters Corporation, and Shimadzu. Open-source generators include the official InChI software distributed by IUPAC and NIST, OpenBabel maintained by a community that includes contributors from the University of Cambridge and University of Oxford, and RDKit supported by AstraZeneca, Pfizer, GlaxoSmithKline, Novartis, and Merck for cheminformatics workflows. Integration with electronic laboratory notebooks used at institutions like Lawrence Berkeley National Laboratory, Los Alamos National Laboratory, and CERN enables automatic annotation in data repositories such as Zenodo, Dryad, Figshare, and the Internet Archive. Command-line tools, APIs, and plugins allow conversion in pipelines relying on Jenkins, GitHub Actions, Docker, and Kubernetes.
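As a minimal sketch of such a conversion, the following Python example uses RDKit to turn a SMILES string into an InChI and its InChIKey. It assumes an RDKit build with InChI support is installed. The wrapper name smiles_to_inchi is hypothetical, and returning None when RDKit is missing is a choice made here so that the example degrades gracefully, not a convention of the library.

```python
from typing import Optional, Tuple

try:
    from rdkit import Chem  # third-party; assumed to include InChI support
except ImportError:
    Chem = None


def smiles_to_inchi(smiles: str) -> Optional[Tuple[str, str]]:
    """Return (InChI, InChIKey) for a SMILES string.

    Returns None when RDKit is not installed. Raises ValueError when
    RDKit cannot parse the SMILES input.
    """
    if Chem is None:
        return None
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return Chem.MolToInchi(mol), Chem.MolToInchiKey(mol)


if __name__ == "__main__":
    result = smiles_to_inchi("CCO")  # ethanol
    if result is None:
        print("RDKit is not available in this environment")
    else:
        inchi, key = result
        print(inchi)
        print(key)
```

The same pattern applies to structure files: RDKit can read a MOL file with Chem.MolFromMolFile before the molecule is passed to Chem.MolToInchi.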

Applications and Adoption

InChI strings are used to link chemical records in large-scale projects including the United Nations' SDG-related chemistry initiatives, the World Health Organization's chemical safety work, the Food and Agriculture Organization, and the European Chemicals Agency. Pharmaceutical companies like Roche, Johnson & Johnson, Bayer, Eli Lilly, and Takeda use InChI for compound registration, while academic consortia at the Wellcome Sanger Institute, the Broad Institute, and EMBL-EBI employ it for data integration. Publishers including the Royal Society, Cell Press, The Lancet, and BMJ recommend or require machine-readable identifiers alongside figures and supplementary data to facilitate reproducibility with resources such as ORCID, CrossRef, PubMed, and Scopus. Environmental monitoring efforts by the EPA, NOAA, and NASA utilize InChI for pollutant tracking. Metabolomics databases like HMDB, MetaboLights, and GNPS incorporate InChI for spectral annotation and for linking to UniProt, KEGG, and Reactome.

Limitations and Criticisms

Critiques of the identifier have been raised in contexts involving complex coordination chemistry, polymers, ill-defined natural product mixtures, and macromolecules encountered at the Scripps Research Institute, ETH Zurich, and the University of Tokyo. Limitations include handling of tautomers, ambiguous stereochemistry, variable protonation states, and supramolecular assemblies, which challenge representation in repositories such as ChEMBL, DrugBank, and BindingDB. Some stakeholders at pharmaceutical firms, regulatory agencies like ECHA and FDA, and data standards bodies argue for complementary use of formats such as canonical SMILES, hashed keys like InChIKey, and structural ontology terms developed by RSC and IUPHAR to ensure unambiguous indexing. Discussions at conferences hosted by the ACS, ECCO, EuChemS, and Gordon Research Conferences examine trade-offs between canonicalization, human readability, and computability.
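The role of the InChIKey in indexing can be sketched in Python. An InChIKey is a fixed-length, 27-character hash of an InChI: a 14-letter first block derived from the core constitution (formula and connectivity), a second block of eight hash letters covering the remaining layers such as stereochemistry and isotopes, a flag for standard ("S") or non-standard ("N") InChI, a version character, and a final protonation character. The names parse_inchikey, InChIKeyParts, and same_skeleton below are hypothetical, and the example keys for L-alanine, D-alanine, and ethanol are quoted as they are commonly listed in public databases.

```python
import re
from typing import NamedTuple

# 14 letters, hyphen, 8 letters + standard flag + version letter, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


class InChIKeyParts(NamedTuple):
    skeleton: str      # hash of the core constitution
    details: str       # hash of stereo, isotopic, and other layers
    is_standard: bool  # True when the flag character is "S"
    version_char: str  # InChI version indicator
    protonation: str   # protonation indicator ("N" for neutral)


def parse_inchikey(key: str) -> InChIKeyParts:
    """Validate the overall shape of an InChIKey and split it into blocks."""
    match = INCHIKEY_RE.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, details, flag, version_char, protonation = match.groups()
    return InChIKeyParts(skeleton, details, flag == "S", version_char, protonation)


def same_skeleton(key_a: str, key_b: str) -> bool:
    """True when two keys share the first block, e.g. two stereoisomers."""
    return parse_inchikey(key_a).skeleton == parse_inchikey(key_b).skeleton


L_ALANINE = "QNAYBMKLOCPYGJ-REOHGSTESA-N"
D_ALANINE = "QNAYBMKLOCPYGJ-UWTATZPHSA-N"
ETHANOL = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"

print(same_skeleton(L_ALANINE, D_ALANINE))  # True: same connectivity
print(same_skeleton(L_ALANINE, ETHANOL))    # False: different compounds
```

Because the key is a hash, a matching first block strongly suggests, but does not strictly prove, identical connectivity. The full InChI remains the authoritative representation, which is one reason the two are usually stored together.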

Historical Development

The initiative to create a standardized identifier emerged through collaborations among IUPAC, NIST, academic groups at Cambridge, Oxford, and MIT, and industrial partners including DuPont and Shell. The project was formalized in the early 2000s, and the first version was released in 2005. Subsequent updates and maintenance have involved global contributions from laboratories and organizations such as the Royal Society of Chemistry, the European Bioinformatics Institute, the US National Institutes of Health, and national research councils in Japan, Germany, and Canada. Milestones include the establishment of the InChI Trust, broad adoption by publishers and databases, integration into cheminformatics libraries like OpenBabel and RDKit, and ongoing development addressing stereochemistry, polymers, and reaction variants of InChI (RInChI), discussed at workshops organized by IUPAC, the WHO, and the OECD.

Category:Chemical nomenclature