| Common Data Model | |
|---|---|
| Name | Common Data Model |
| Caption | Conceptual diagram of a canonical data schema and integration flows |
| Developer | Multiple organizations and consortia |
| Released | 2010s |
| Language | Schema definitions, XML, JSON, RDF, SQL |
The **Common Data Model** is a standardized, canonical schema and set of conventions for representing data so that disparate systems can interoperate. It provides shared entity definitions, attribute naming, and relationship semantics to ease integration among enterprise applications, cloud platforms, research repositories, and analytics pipelines. Organizations adopt the model to reduce schema-mapping effort, accelerate data sharing, and enable cross-system queries across heterogeneous sources.
The model defines canonical entities (for example, customer, product, transaction) and canonical relationships with explicit attribute types and reference semantics. It operates alongside serialization formats such as XML, JSON, and RDF and is consumed by extract-transform-load tools, middleware, and analytics platforms. Prominent adopters include vendors and institutions from the software industry, cloud computing providers, healthcare consortia, and standards bodies. Implementations map proprietary schemas from vendors like Oracle, SAP, Salesforce, Microsoft, and IBM into the canonical representation to enable analytics in systems such as Snowflake, Databricks, and Google Cloud.
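The mapping from a proprietary schema into the canonical representation can be sketched as a simple attribute projection. The field names below (`cust_no`, `FirstName`, and so on) are illustrative assumptions, not taken from any real vendor schema:

```python
# Hypothetical sketch: projecting a vendor-specific record onto a canonical
# "customer" entity. Mapping metadata pairs each canonical attribute with
# the corresponding attribute in the source system.
SALES_SYSTEM_TO_CANONICAL = {
    "customer_id": "cust_no",
    "given_name": "FirstName",
    "family_name": "LastName",
    "email": "EmailAddr",
}

def to_canonical(record: dict, mapping: dict) -> dict:
    """Rename a proprietary record's fields to the canonical attribute names."""
    return {canonical: record.get(source)
            for canonical, source in mapping.items()}

vendor_record = {"cust_no": "C-1001", "FirstName": "Ada",
                 "LastName": "Lovelace", "EmailAddr": "ada@example.com"}
canonical = to_canonical(vendor_record, SALES_SYSTEM_TO_CANONICAL)
```

In practice such mapping metadata is maintained per source system, so each new integration requires one mapping to the canonical model rather than point-to-point mappings to every other system.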
Early motivations trace to enterprise data integration efforts in the 1990s and 2000s when organizations struggled with point-to-point mappings between systems such as Oracle Database, SAP ERP, Microsoft Dynamics 365, Salesforce and bespoke applications. Influences include data modeling work from ISO, metadata efforts by W3C, and domain ontologies used in projects at NASA and CERN. In the 2010s cloud providers and analytics vendors codified canonical models for customers, while healthcare consortia drew on regulated terminologies like LOINC, SNOMED CT, and ICD-10 to build domain-specific variants. Academic centers at MIT, Stanford University, UC Berkeley, and Carnegie Mellon University contributed research on semantic interoperability and schema alignment.
A typical Common Data Model comprises entity definitions, attribute vocabularies, relationship graphs, and mapping metadata. Entities are modeled as tables or classes with typed attributes aligned to standards such as SQL, RDF, and JSON Schema. The vocabulary often reuses identifiers and code lists from standards organizations like HL7, ICD-10, and ISO to ensure semantic fidelity. Components include canonical keys, temporal attributes, audit fields, and extension mechanisms for vendor-specific fields. Tooling layers include extract-transform-load connectors from vendors such as Informatica and Talend, knowledge-graph adapters from projects at W3C and Stanford University, and SDKs provided by cloud platforms like Microsoft Azure, Amazon Web Services, and Google Cloud Platform.
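The components listed above (canonical keys, typed attributes, temporal and audit fields, and an extension mechanism) can be illustrated with a minimal entity definition. This is a hypothetical sketch, not any published vendor schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional

@dataclass
class CanonicalCustomer:
    # Canonical key: a stable, system-independent identifier
    customer_id: str
    given_name: str
    family_name: str
    # Temporal attribute bounding when the record became valid
    valid_from: datetime
    # Open-ended validity until superseded
    valid_to: Optional[datetime] = None
    # Audit fields recording provenance
    source_system: str = "unknown"
    last_updated: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    # Extension mechanism: namespaced vendor-specific fields
    extensions: dict[str, Any] = field(default_factory=dict)

c = CanonicalCustomer(
    customer_id="C-1001", given_name="Ada", family_name="Lovelace",
    valid_from=datetime(2024, 1, 1, tzinfo=timezone.utc),
    source_system="crm_a",
    extensions={"crm_a:loyalty_tier": "gold"})
```

Namespacing extension keys by source system (here `crm_a:`) is one common way to keep vendor-specific fields from colliding in the shared representation.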
Multiple organizations publish their own implementations and domain-specific variants. Major cloud vendors provide built-in schemas: Microsoft published a model for Dynamics and Power Platform integrations; other implementations exist in data lakes managed by Snowflake, Databricks, and Cloudera. Healthcare-focused variants align with initiatives led by HL7 and national health services, while financial services adapt models to match standards from SWIFT and ISO 20022. Academic and open-source implementations appear in repositories associated with Apache Software Foundation projects, research groups at Harvard University and ETH Zurich, and consortia like the Open Data Institute.
Use cases include master data management for enterprises such as Procter & Gamble and Walmart, patient record exchange in health systems like Mayo Clinic and NHS England, cross-platform analytics for retailers such as Target and Amazon, and regulatory reporting for banks complying with Basel Committee on Banking Supervision guidelines. It supports data warehousing with platforms like Teradata and SAP HANA, real-time streaming analytics with Apache Kafka and Confluent, and machine learning pipelines using frameworks from Google AI and OpenAI. Research data interoperability leverages the model in projects at NIH and European Commission research programs.
Governance models range from vendor-led stewardship to community-driven standards bodies. Stewardship may be exercised by corporations such as Microsoft Corporation or by consortia incorporating stakeholders from IBM, Oracle Corporation, Salesforce, and regulatory agencies. Standards alignment is critical: mappings reference codified vocabularies from ISO, terminologies from SNOMED International, and messaging standards from HL7. Change control mechanisms include versioning, compatibility matrices, and certification programs run by industry associations and standards organizations.
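A compatibility matrix of the kind used in change control can be sketched as a lookup of which reader versions may consume data published under each model version. The version numbers and rules here are illustrative assumptions:

```python
# Hypothetical sketch of a compatibility-matrix check for a versioned
# canonical model. Maps a model version to the set of reader versions
# certified to consume data in that version.
COMPATIBILITY = {
    "2.0": {"2.0", "2.1"},  # assume 2.1 readers accept 2.0 data (backward compatible)
    "2.1": {"2.1"},         # assume 2.1 added required fields, so 2.0 readers cannot read it
}

def can_consume(data_version: str, reader_version: str) -> bool:
    """Check whether a reader at reader_version can consume data at data_version."""
    return reader_version in COMPATIBILITY.get(data_version, set())
```

Certification programs then amount to verifying that an implementation actually honours the cells of this matrix for the versions it claims to support.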
Critics argue canonical models can be heavyweight and slow to evolve relative to agile development, pointing to governance challenges experienced by large vendors and consortia. Mapping complexity remains high when reconciling legacy schemas from systems like PeopleSoft or bespoke mainframe stores with modern cloud-native schemas. Domain-specific nuances can be lost when forcing heterogeneous data into a single canonical representation; this concern is often raised by practitioners in healthcare, finance, and scientific data management at institutions like CERN and national agencies. Additionally, proprietary extensions by vendors risk fragmentation, while licensing and intellectual-property issues can complicate community adoption.
Category:Data modeling