Comma-separated values

Name	Comma-separated values
Extension	.csv
Mime	text/csv
Genre	Data interchange
Owner	None

Contents

Overview
History and development
Format and syntax
Variants and dialects
Parsing and software support
Common issues and limitations
Use cases and interoperability

Comma-separated values (CSV) is a plain-text file format for exchanging tabular data among spreadsheet applications such as Microsoft Excel, LibreOffice, Google Sheets, Apache OpenOffice, and Apple Numbers. It is widely supported by database systems such as MySQL, PostgreSQL, and SQLite, and by ETL tools from IBM, Oracle Corporation, SAP SE, and Microsoft Corporation. The format's simplicity has made it a lingua franca for moving data between analysis environments including R, Python, MATLAB, SAS, and Stata.

Overview

CSV stores each record on its own line, with fields separated by commas; many implementations also accept other delimiters, as found in files published by European Union projects, the United Nations, and corporations. Consumer and enterprise tools, including Microsoft Access, Tableau, QlikTech, Salesforce, and Amazon Web Services offerings, import and export CSV for reporting, analytics, and backups. RFC 4180, an informational document published through the IETF, describes the behaviors observed across vendors such as IBM, Microsoft Corporation, and Google LLC rather than defining a binding standard. Libraries including pandas, OpenCSV, Apache Commons CSV, and Boost provide parsing utilities used in systems developed by Facebook, Twitter, LinkedIn, and Netflix.

History and development

The format evolved from export conventions in mainframe software and in early spreadsheets such as VisiCalc, Lotus 1-2-3, and early versions of Microsoft Excel. Early database and data exchange practices at institutions like Bell Labs and at universities using UNIX tools influenced CSV's line-oriented design. The Internet Engineering Task Force (IETF) published RFC 4180 in 2005 to document the format and register the text/csv media type, while vendors like Sun Microsystems, Oracle Corporation, and IBM documented their own proprietary behaviors. Open-source organizations, including the GNU Project, the Apache Software Foundation, and the Free Software Foundation, helped propagate parsers and writers shipped in Debian, Fedora Project, and Red Hat distributions.

Format and syntax

A CSV file typically begins with a header row naming the fields, followed by data rows; any field that contains the delimiter, a line break, or a quotation mark must be quoted. Common practice is to quote with double quotes and to escape an embedded double quote by doubling it, behavior mirrored by Excel, LibreOffice, RStudio, and Python's csv module. Regional variations, driven by locales such as France, Germany, Spain, and Brazil where the comma serves as the decimal separator, often use semicolons instead, and tabs or pipes also appear; software from Microsoft Corporation, Apple Inc., and Google LLC often auto-detects the separator. Metadata practices vary: some projects pair CSV files with JSON or XML manifests, while others rely on schemas from W3C initiatives or ISO standards to constrain types.
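The quoting rules described above can be illustrated with Python's standard csv module. This is a minimal sketch, not a reference implementation; the sample rows are invented for illustration.

```python
import csv
import io

# Sample data: one field contains a comma, one contains double quotes,
# and one contains a line break.
rows = [
    ["name", "note"],
    ["Smith, Jane", 'She said "hello"'],
    ["Lee", "line one\nline two"],
]

# newline="" stops the text layer from translating line endings,
# so the csv module controls them itself.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)  # defaults: comma delimiter, minimal quoting, CRLF
writer.writerows(rows)
text = buffer.getvalue()

# Reading the text back recovers the original fields, including the
# embedded comma, quotes, and line break.
parsed = list(csv.reader(io.StringIO(text, newline="")))

print(text)
print(parsed == rows)
```

Only the fields that need quoting are quoted, and the embedded double quotes appear doubled in the serialized text.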

Variants and dialects

Numerous dialects exist, differing in delimiter (comma, semicolon, tab, pipe), quoting style (minimal, all fields), and record terminator (CR, LF, CRLF) across platforms like Microsoft Windows, Apple macOS, Linux-based systems, and legacy CP/M tools. Tab-separated values (TSV) is common in academic tools including Bioconductor, NCBI, and European Bioinformatics Institute pipelines. RFC 7111 defines URI fragment identifiers for addressing rows, columns, and cells in text/csv documents, while communities around OpenRefine, Wikidata, Data.gov, and Kaggle datasets have developed conventions for dates, encodings (UTF-8, UTF-16), and escape sequences.
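The dialect parameters named above (delimiter, quoting style, record terminator) map directly onto options of Python's csv module. The following sketch assumes the dialect is known in advance rather than detected; the helper names render and parse are hypothetical.

```python
import csv
import io


def render(rows, delimiter=",", quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n"):
    """Serialize rows using an explicitly chosen dialect."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(
        buffer,
        delimiter=delimiter,
        quoting=quoting,
        lineterminator=lineterminator,
    )
    writer.writerows(rows)
    return buffer.getvalue()


def parse(text, delimiter=","):
    """Parse text with an explicitly chosen delimiter."""
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))


rows = [["id", "label"], ["1", "a,b"]]

# Comma-delimited, minimal quoting, CRLF terminators.
minimal = render(rows)
# Semicolon-delimited, every field quoted, LF terminators.
quote_all_lf = render(rows, delimiter=";", quoting=csv.QUOTE_ALL, lineterminator="\n")
# Tab-separated: the comma in "a,b" no longer needs quoting.
tsv = render(rows, delimiter="\t", lineterminator="\n")

for variant in (minimal, quote_all_lf, tsv):
    print(repr(variant))
```

All three strings encode the same table, which is why a reader needs to know (or guess) the dialect before parsing.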

Parsing and software support

Parsers are available for languages including Python, Java, C#, JavaScript, and Perl through libraries such as pandas, OpenCSV, and Apache Commons CSV, and through built-in parsers in the Microsoft .NET Framework. Database import utilities in PostgreSQL, MySQL, SQLite, Microsoft SQL Server, and Oracle Database offer bulk-load commands, such as PostgreSQL's COPY and MySQL's LOAD DATA, with delimiter options. Cloud platforms (Google Cloud Platform, Amazon Web Services, Microsoft Azure) and data warehouses such as Snowflake and Google BigQuery include CSV ingestion paths, while ETL suites from Talend, Informatica, and Pentaho handle dialect normalization.
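As a rough illustration of a CSV import path, the sketch below parses CSV text with Python's csv module and loads it into an in-memory SQLite database through the standard sqlite3 module. It stands in for, and does not reproduce, the native bulk loaders mentioned above; the table name results, its columns, and the sample data are assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical input; a real import would read from a file.
csv_text = "id,name,score\r\n1,Ada,9.5\r\n2,Grace,8.0\r\n"

reader = csv.DictReader(io.StringIO(csv_text, newline=""))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (id INTEGER, name TEXT, score REAL)")

# CSV carries no type information, so each field is converted explicitly.
typed_rows = ((int(r["id"]), r["name"], float(r["score"])) for r in reader)
conn.executemany("INSERT INTO results VALUES (?, ?, ?)", typed_rows)
conn.commit()

result = conn.execute("SELECT id, name, score FROM results ORDER BY id").fetchall()
print(result)
```

Converting values before insertion makes the intended column types explicit instead of leaving them to the database's own coercion rules.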

Common issues and limitations

CSV lacks standardized metadata, which leaves field types, encodings, and schema changes ambiguous; this complicates use with systems like SAP ERP and Salesforce and with statistical packages such as SPSS and SAS. Locale-driven delimiters and date formats cause interoperability problems between regions such as the United Kingdom and Germany. CSV cannot natively represent nested structures, and large-scale datasets such as those used by NASA, the European Space Agency, and CERN often exceed its capabilities, prompting adoption of formats like Parquet, Avro, and ORC. Security issues include CSV injection, in which a cell beginning with a character such as = is interpreted as a formula when the file is opened in a spreadsheet application; the vulnerability has been noted by projects and vendors including OWASP, Mozilla Foundation, and Google Security teams.
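One commonly described mitigation for CSV injection is to prefix risky cells with a single quote before export so that spreadsheet applications treat them as text. The sketch below follows that approach; the set of trigger characters and the helper names neutralize and write_safe are assumptions for illustration, and the technique is not a complete defense.

```python
import csv
import io

# Leading characters that spreadsheet applications may treat as the
# start of a formula (an assumed list, chosen for this example).
FORMULA_PREFIXES = ("=", "+", "-", "@", "\t", "\r")


def neutralize(value):
    """Prefix a cell with a single quote if it could be read as a formula."""
    text = str(value)
    if text.startswith(FORMULA_PREFIXES):
        return "'" + text
    return text


def write_safe(rows):
    """Serialize rows to CSV text after neutralizing every cell."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    for row in rows:
        writer.writerow([neutralize(cell) for cell in row])
    return buffer.getvalue()


print(write_safe([["name", "comment"], ["Mallory", "=1+1"]]))
```

A trade-off of this simple rule is that legitimate values such as negative numbers also receive the prefix, so an exporter may need to validate numeric fields separately.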

Use cases and interoperability

CSV is prevalent for data export and import in business intelligence workflows with SAP SE, Salesforce, and Microsoft Power BI; for scientific data exchange among researchers at NIH, Wellcome Trust, and Max Planck Society; and for open data portals run by the European Commission, the World Bank, and national agencies. Interoperability is aided by tooling from OpenRefine and Kettle (Pentaho Data Integration) and by libraries in R, Python, and Julia, which transform CSV into typed formats consumable by systems like Hadoop, Spark, and Kubernetes-orchestrated pipelines. Despite its limitations, CSV remains a ubiquitous, interoperable medium across enterprise, scientific, and civic ecosystems.
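Transforming CSV into a typed representation generally means applying a schema that the file itself does not carry. The following sketch uses only Python's standard library; the Measurement record, its field names, and the ISO 8601 date format are assumptions chosen for the example.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date


@dataclass
class Measurement:
    """Hypothetical typed record for one CSV row."""
    station: str
    day: date
    value: float


def load(text):
    """Parse CSV text and convert each row to a typed Measurement."""
    reader = csv.DictReader(io.StringIO(text, newline=""))
    return [
        Measurement(
            station=row["station"],
            day=date.fromisoformat(row["day"]),  # assumes YYYY-MM-DD dates
            value=float(row["value"]),           # assumes a decimal point
        )
        for row in reader
    ]


records = load("station,day,value\r\nA1,2024-03-01,4.5\r\n")
print(records)
```

Rows that do not match the assumed schema raise an error at load time, which surfaces locale or format mismatches early instead of passing them downstream as strings.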
