SDF (file format)

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Name: SDF
Extension: .sdf
Type: Chemical file format; structure-data file
Owner: Open standards / proprietary variants
Released: 1990s
Genre: Chemical information, cheminformatics

SDF (structure-data file) is a plain-text, line-oriented file format for chemical structures and their associated data fields. Each file holds one or more molecular connection tables, each paired with keyed metadata, which makes it a convenient exchange format among cheminformatics systems and databases. It has been used across academic, industrial, and governmental projects to transfer compound registries, screening libraries, and annotation tables.

Overview

The format originated at MDL Information Systems as part of its connection-table (CTfile) family of formats and was later maintained by successor companies such as Accelrys. It has been popularized further by institutions and resources such as the Cambridge Crystallographic Data Centre and PubChem, and by consumers of Protein Data Bank data. It is commonly encountered alongside other cheminformatics formats such as SMILES, InChI, MOL file, and CML, and alongside the database formats used by ChemSpider, ChEMBL, DrugBank, and ZINC. Researchers at organizations including Novartis, Pfizer, GlaxoSmithKline, and Merck & Co., and at national laboratories such as Los Alamos National Laboratory, use SDF for compound annotation and for interchange of high-throughput screening data. The format is supported by tools from vendors and projects such as Open Babel, RDKit, Bioclipse, ChemAxon, and Schrödinger.

File Structure and Syntax

An SDF file is a concatenation of one or more records. Each record contains a molfile (the MOL file connection table, terminated by an M  END line) followed by zero or more data items encoded as tagged fields, and ends with a delimiter line consisting of four dollar signs ($$$$). The delimiter lets databases and pipelines, such as those operated by EBI, NCBI, Harvard University, MIT, and Stanford University, split a file into entries without parsing the chemistry. The molfile portion carries atom and bond lists compatible with conventions used in software from ChemAxon, OpenEye Scientific, and Tripos; atom coordinates, stereochemistry flags, isotopes, and charges follow the conventions of the legacy formats developed by MDL and adopted by subsequent implementers. Each data item starts with a header line that begins with > and carries the field name in angle brackets (for example, > <ID>), followed by one or more value lines and a blank line. Data items are commonly used to store identifiers and metadata from projects such as KEGG and ChEMBL, UniProt cross-references, assay descriptors from PubChem BioAssay, and provenance labels assigned by corporate ELNs such as those used at Bayer or Eli Lilly.
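The record layout described above can be sketched in a few lines of Python. This is a minimal illustration that uses only the standard library, not a full parser: the function names split_records and parse_record and the SAMPLE record are hypothetical, and the sketch assumes a V2000-style molfile terminated by M  END and data items separated by blank lines.

```python
from typing import Dict, Iterator, List, Tuple

# Illustrative one-atom record (hypothetical data, not from a real database).
SAMPLE = """\
water
  sketch

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <ID>
CMPD-1

> <NAME>
water

$$$$
"""


def split_records(text: str) -> Iterator[str]:
    """Yield each record's text, using a line of '$$$$' as the delimiter."""
    current: List[str] = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            yield "\n".join(current)
            current = []
        else:
            current.append(line)
    # Tolerate a final record that lacks the trailing delimiter.
    if any(line.strip() for line in current):
        yield "\n".join(current)


def parse_record(record: str) -> Tuple[str, Dict[str, str]]:
    """Split one record into its molfile block and a dict of data items."""
    lines = record.split("\n")
    # The molfile block ends at the 'M  END' line.
    end = next((i for i, l in enumerate(lines) if l.startswith("M  END")), None)
    if end is None:
        raise ValueError("record has no 'M  END' line")
    molfile = "\n".join(lines[: end + 1])
    data: Dict[str, str] = {}
    i = end + 1
    while i < len(lines):
        line = lines[i]
        if line.startswith(">"):
            # The field name sits between '<' and '>' on the header line.
            start, stop = line.find("<"), line.rfind(">")
            name = line[start + 1 : stop] if 0 <= start < stop else ""
            values: List[str] = []
            i += 1
            # Value lines run until the next blank line.
            while i < len(lines) and lines[i].strip():
                values.append(lines[i])
                i += 1
            data[name] = "\n".join(values)
        else:
            i += 1
    return molfile, data


records = list(split_records(SAMPLE))
molfile, data = parse_record(records[0])
print(len(records))  # 1
print(data)          # {'ID': 'CMPD-1', 'NAME': 'water'}
```

Multi-line values are kept as a single newline-joined string here; how to treat them (list, joined text, or typed value) is a design choice that real toolkits make differently.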

Variants and Versions

SDF has accumulated informal extensions and version-like conventions rather than a single rigid standard. The embedded molfile carries a connection-table version marker (V2000 or the extended V3000), but the data-item portion has no comparable versioning, and implementations by MDL Information Systems, Elsevier, Accelrys, and open-source suites introduced variations in field encoding, quoting, and multi-record handling. Vendors such as ChemAxon and projects such as Open Babel and RDKit established de facto behaviors for metadata encoding, Unicode handling, and line endings. Variants include compressed or binary-wrapped forms used in enterprise settings at companies such as Johnson & Johnson and Roche, and JSON- or XML-serialized equivalents employed by platforms such as ELIXIR nodes and Galaxy Project workflows. Community interoperability efforts from groups associated with IUPAC, RDA, and informatics consortia addressed some ambiguities but did not produce a single authoritative revision of the format.
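As one example of handling a compressed variant, the sketch below accepts either plain or gzip-compressed SDF bytes and returns text. It assumes gzip is the compression in use (a common convention for distributed .sdf.gz files, though other wrappers exist) and detects it by the two-byte gzip signature; the function names are hypothetical.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def sdf_text_from_bytes(raw: bytes, encoding: str = "utf-8") -> str:
    """Return SDF text from raw bytes, decompressing first if gzip-wrapped."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding)


def count_records(text: str) -> int:
    """Count records by counting '$$$$' delimiter lines."""
    return sum(1 for line in text.splitlines() if line.strip() == "$$$$")


# Placeholder payload with two (chemically empty) records, for illustration only.
payload = b"a\nM  END\n$$$$\nb\nM  END\n$$$$\n"
print(count_records(sdf_text_from_bytes(payload)))                 # 2
print(count_records(sdf_text_from_bytes(gzip.compress(payload))))  # 2
```

Sniffing the signature rather than trusting the file extension keeps the reader tolerant of misnamed files; for very large libraries a streaming reader would be preferable to decompressing everything in memory.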

Applications and Tools

SDF is widely used for structure registration, data exchange, virtual screening, and data mining in environments maintained by the National Institutes of Health, the European Bioinformatics Institute, and the Wellcome Trust Sanger Institute, and by commercial research divisions at AbbVie and AstraZeneca. Tool support includes conversion and validation in Open Babel, cheminformatics operations in RDKit, visualization in Jmol and PyMOL, and integration with workflow platforms such as KNIME and Pipeline Pilot. Laboratory information management systems (LIMS) and ELNs from vendors such as PerkinElmer and Thermo Fisher Scientific often import and export SDF for compound catalog synchronization. Data curators at repositories such as PubChem, ChEMBL, and ChemSpider routinely ingest SDF submissions from collaborations with universities such as UC Berkeley and ETH Zurich.

Compatibility and Interoperability

SDF’s compatibility relies on common conventions shared by cheminformatics libraries; crosswalks to SMILES, InChIKey, CML, and the database schemas of Oracle- and PostgreSQL-backed chemical registries are routine. Interoperability challenges arise from character encoding (UTF-8 versus legacy encodings), from platform-specific line endings in archives obtained from GitHub or institutional FTP servers, and from vendor-specific extensions created by Accelrys or MDL. Integration into workflow systems at institutions such as the European Molecular Biology Laboratory and Cold Spring Harbor Laboratory typically requires normalization steps, such as canonicalization routines implemented in RDKit or conversion utilities in Open Babel, to harmonize stereochemistry, tautomeric representation, and property naming conventions.
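The text-level part of such normalization (encoding, line endings, and property names) can be sketched with the standard library alone; chemical normalization such as stereochemistry or tautomer handling needs a cheminformatics toolkit and is not shown. The fallback encoding and the TAG_ALIASES mapping below are assumptions chosen for illustration, not a standard.

```python
from typing import Dict

# Hypothetical alias table mapping source-specific tag names to one local name.
TAG_ALIASES: Dict[str, str] = {
    "pubchem_compound_cid": "cid",
    "compound_id": "cid",
}


def normalize_sdf_bytes(raw: bytes, fallback: str = "latin-1") -> str:
    """Decode as UTF-8 (falling back to a legacy encoding) and unify newlines."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: undecodable files use a single-byte legacy encoding.
        text = raw.decode(fallback)
    if text.startswith("\ufeff"):  # drop a leading byte-order mark
        text = text[1:]
    # Convert Windows (CRLF) and old Mac (CR) line endings to LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def canonical_tag(name: str, aliases: Dict[str, str] = TAG_ALIASES) -> str:
    """Lower-case a data-item name, replace spaces, and apply known aliases."""
    key = name.strip().lower().replace(" ", "_")
    return aliases.get(key, key)


print(repr(normalize_sdf_bytes(b"\xef\xbb\xbfa\r\nb\rc\n")))  # 'a\nb\nc\n'
print(canonical_tag(" PUBCHEM_COMPOUND_CID "))                # cid
print(canonical_tag("Mol Weight"))                            # mol_weight
```

Falling back to latin-1 never raises, which is convenient but can silently mis-decode files in other legacy encodings; a stricter pipeline might reject such files instead.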

Security and Limitations

Because SDF is plain text that is often exchanged between organizations, it carries the usual risks of untrusted file input in data pipelines: malformed entries can trigger parser bugs in libraries such as Open Babel or in proprietary readers from ChemAxon and Schrödinger. Limitations include the lack of a formal schema for complex metadata, ambiguous serialization of multi-valued fields, and inefficient representation of very large libraries compared with binary formats or the relational databases used by Oracle Life Sciences and by high-performance computing centers such as Argonne National Laboratory. Best practices adopted by repositories such as PubChem and by large pharmaceutical companies (validation, schema mapping, and use of checksums) mitigate these risks but do not eliminate semantic inconsistency across heterogeneous datasets.
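A minimal sketch of the validation and checksum practices mentioned above follows. It performs only cheap structural checks on a single record (presence of M  END, a numeric V2000 counts line, and enough lines for the declared atoms and bonds) and computes a SHA-256 digest. The function names, the max_line_len policy, and the sample record are hypothetical, and passing these checks does not imply that the chemistry is valid.

```python
import hashlib
from typing import List


def validate_record(record: str, max_line_len: int = 1000) -> List[str]:
    """Return a list of structural problems found in one SDF record."""
    problems: List[str] = []
    lines = record.splitlines()
    if len(lines) < 4:
        return ["record is shorter than the molfile header plus counts line"]
    if any(len(line) > max_line_len for line in lines):
        problems.append("line exceeds max_line_len")
    if not any(line.startswith("M  END") for line in lines):
        problems.append("missing 'M  END' terminator")
    counts = lines[3]  # fourth line of a molfile is the counts line
    if "V2000" in counts:
        try:
            # V2000 stores atom and bond counts in two 3-character fields.
            n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
        except ValueError:
            problems.append("counts line is not numeric")
        else:
            if len(lines) < 4 + n_atoms + n_bonds:
                problems.append("fewer atom/bond lines than the counts line declares")
    return problems


def record_checksum(record: str) -> str:
    """SHA-256 of the record text, for detecting changes in transit."""
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


# Illustrative one-atom record (hypothetical data).
good = (
    "water\n  sketch\n\n"
    "  1  0  0  0  0  0  0  0  0  0999 V2000\n"
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0\n"
    "M  END"
)
print(validate_record(good))        # []
print(validate_record("truncated"))
```

A checksum of the raw text detects corruption or tampering but changes with any whitespace or line-ending difference, so it is best computed after text normalization if records pass through several systems.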

Category:Chemical file formats