CSV (file format)

Name	CSV
Extension	.csv
Mime	text/csv
Owner	Informal
Released	1970s
Genre	Data interchange

Contents

Overview
Format and Syntax
Variants and Dialects
Handling and Parsing Issues
Use Cases and Applications
Software Support and Tools
Security and Privacy Considerations

CSV (comma-separated values) is a plain-text tabular data format widely used to exchange data between software such as Microsoft Excel, Google Sheets, LibreOffice Calc, R, and Python. It originated in early computing practices on systems such as UNIX, VMS, and CP/M, and is commonly encountered alongside formats such as JSON, XML, YAML, and Parquet.

Overview

CSV represents records as lines and fields as values separated by a delimiter, which enables interoperability among products from vendors such as IBM, Oracle Corporation, SAP SE, Salesforce, and Tableau. Its simplicity has made it popular for United Nations statistical reporting, World Bank datasets, International Monetary Fund publications, Eurostat releases, and the output of national agencies such as the U.S. Census Bureau. There is no single formal standard; in practice the format is usually described with reference to RFC 4180, often combined with ISO 8601 for date and time values and with the conventions of projects such as Apache Hadoop, PostgreSQL, MySQL, and SQLite.

Format and Syntax

A CSV file encodes rows as newline-terminated records and columns as fields separated by a delimiter character such as a comma, semicolon, tab, or pipe; implementations in Java, C#, JavaScript, Perl, and C++ vary in how they handle delimiters. Fields that contain delimiters, newlines, or quotes are usually enclosed in double quotes, with an embedded double quote written as two consecutive double quotes, following the conventions of RFC 4180 and of libraries such as OpenCSV, pandas, NumPy, Boost, and GLib. A header row may provide column names, which systems including SAP HANA, Microsoft SQL Server, Amazon Redshift, Google BigQuery, and Snowflake use for schema inference. Line endings differ across platforms such as Windows NT, Mac OS, and Linux distributions, and character encodings such as UTF-8, ISO/IEC 8859-1, UTF-16, ASCII, and EBCDIC affect portability.
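The quoting rules described above can be illustrated with a minimal sketch that uses Python's standard csv module. The sample rows and variable names are invented for illustration; the CRLF line terminator follows RFC 4180, and other tools may make different choices.

```python
import csv
import io

# Illustrative rows: one field contains a comma, one contains double
# quotes, and one contains an embedded newline.
rows = [
    ["id", "name", "comment"],
    ["1", "Smith, Jane", 'She said "hello"'],
    ["2", "Lee", "line one\nline two"],
]

# Write with minimal quoting: only fields that need quotes receive them.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
writer.writerows(rows)
encoded = buffer.getvalue()

# Read the text back; quoted fields are restored to their original values.
decoded = list(csv.reader(io.StringIO(encoded, newline="")))
```

Reading the encoded text back yields the original rows, which is a convenient round-trip check when comparing the behavior of two CSV implementations.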

Variants and Dialects

Numerous dialects arise from choices made by applications and vendors including Microsoft Excel, LibreOffice, Google Drive, Apple Numbers, and Zoho Corporation. These choices cover the delimiter (comma, semicolon, or tab), quoting rules, escape techniques, and whether a header row is present. Regional conventions in locales such as France, Germany, Brazil, Spain, and Italy often use the comma as the decimal separator and substitute the semicolon as the field delimiter, which affects interchange with organizations such as Eurocontrol, Deutsche Bahn, Banco Central do Brasil, Renfe, and ENI. Domain-specific variants include tab-separated values (TSV), used on Linux; delimited exports from SAS, SPSS, and Stata; and tabular bioinformatics formats used by NCBI, EMBL-EBI, GenBank, the UCSC Genome Browser, and Galaxy.
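As a rough illustration of a regional dialect, the sketch below reads a semicolon-delimited file whose numbers use a decimal comma. The sample data, the column names, and the helper parse_decimal_comma are hypothetical; the helper assumes that any period in a number is a thousands separator, which is not true of every locale.

```python
import csv
import io

# Hypothetical export in a semicolon-delimited, decimal-comma dialect.
sample = "produit;prix\r\nCafé;3,50\r\nMachine;1.234,56\r\n"


def parse_decimal_comma(text):
    """Convert '1.234,56' to 1234.56, assuming '.' groups thousands."""
    return float(text.replace(".", "").replace(",", "."))


# The delimiter must be stated explicitly; the csv module defaults to a comma.
reader = csv.DictReader(io.StringIO(sample, newline=""), delimiter=";")
prices = {row["produit"]: parse_decimal_comma(row["prix"]) for row in reader}
```

Stating the dialect explicitly, rather than relying on detection, tends to be the safer choice when the producer of the file is known.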

Handling and Parsing Issues

Parsing CSV reliably is complicated by malformed records, inconsistent quoting, mixed encodings, and embedded newlines, which cause interoperability problems for parsers such as Apache Commons CSV, the Python csv module, the R readr package, pandas, and Excel. These ambiguities can corrupt data in pipelines managed by Apache NiFi, Talend, Informatica, Pentaho, and Alteryx when schema inference from headers fails or when type casts conflict with the target system, such as Snowflake, BigQuery, Azure Data Factory, AWS Glue, or Databricks. Edge cases include byte order marks written by editors such as Microsoft Notepad, inconsistent escape sequences in exports from applications such as WordPress, and mismatches between expected and actual newline conventions in files hosted on platforms such as GitHub, Bitbucket, and GitLab.
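Two of the issues above, byte order marks and malformed records, can be sketched with Python's standard library. The sample bytes and the helper find_ragged_rows are invented for illustration, and the check assumes that every record should have as many fields as the header.

```python
import csv
import io

# Hypothetical file content that starts with a UTF-8 byte order mark (BOM).
raw = b"\xef\xbb\xbfname,city\r\nAda,London\r\n"

# Decoding as plain UTF-8 leaves the BOM attached to the first column name.
naive_header = next(csv.reader(io.StringIO(raw.decode("utf-8"), newline="")))

# The "utf-8-sig" codec strips a leading BOM when one is present.
clean_header = next(csv.reader(io.StringIO(raw.decode("utf-8-sig"), newline="")))


def find_ragged_rows(text):
    """Return source line numbers of records whose field count differs
    from the header's field count."""
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    return [reader.line_num for row in reader if len(row) != len(header)]


ragged = find_ragged_rows("name,city\r\nAda,London\r\nGrace\r\n")
```

Reporting line numbers instead of silently padding or truncating short records keeps the decision about how to repair the file with the person who knows the data.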

Use Cases and Applications

CSV is used to export and import data between databases such as MySQL, PostgreSQL, SQLite, MongoDB, and Microsoft SQL Server and analytics tools such as RStudio, Jupyter Notebook, MATLAB, SAS, and SPSS. Governments and organizations including the United Nations, the World Bank, the OECD, the European Commission, and the National Institutes of Health distribute statistics and research data as CSV for accessibility. CSV is also common in machine learning workflows built on TensorFlow, PyTorch, scikit-learn, Keras, and XGBoost, and in ETL processes orchestrated by Airflow, Luigi, and Prefect or run on Apache Spark.
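A database import and export of the kind described above might look like the following sketch, which uses Python's built-in sqlite3 and csv modules with an in-memory database. The table name, column names, and sample data are assumptions made for illustration only.

```python
import csv
import io
import sqlite3

# Hypothetical CSV input with a header row.
csv_text = "id,name,score\r\n1,Ada,9.5\r\n2,Linus,8.0\r\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (id INTEGER, name TEXT, score REAL)")

# Import: each CSV record becomes a dict keyed by the header names, which
# matches the named placeholders in the INSERT statement.
reader = csv.DictReader(io.StringIO(csv_text, newline=""))
conn.executemany(
    "INSERT INTO results (id, name, score) VALUES (:id, :name, :score)",
    reader,
)
conn.commit()

# Export: write the column names first, then every row of the query result.
out = io.StringIO(newline="")
writer = csv.writer(out, lineterminator="\r\n")
cursor = conn.execute("SELECT id, name, score FROM results ORDER BY id")
writer.writerow([column[0] for column in cursor.description])
writer.writerows(cursor)
exported = out.getvalue()

stored = conn.execute("SELECT id, name, score FROM results ORDER BY id").fetchall()
```

All CSV values arrive as text; here the declared column types let SQLite convert them to numbers, whereas other databases may require explicit casts during import.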

Software Support and Tools

Support is widespread across commercial and open-source ecosystems: spreadsheet editors such as Microsoft Excel, Google Sheets, LibreOffice, and Apple Numbers; programming libraries and tools including pandas, the Python csv module, OpenCSV, fast-csv, and csvkit; and database utilities such as MySQL Workbench, pgAdmin, MongoDB Compass, DBeaver, and HeidiSQL. Data validation and cleaning tools such as OpenRefine, Trifacta, DataWrangler, Talend, and Alteryx provide CSV transformations, while version control and collaboration platforms such as GitHub, GitLab, Bitbucket, Confluence, and Jira host and track CSV-based datasets.

Security and Privacy Considerations

CSV files may contain sensitive personal data regulated by laws, bodies, and standards such as the General Data Protection Regulation, the Health Insurance Portability and Accountability Act, the California Consumer Privacy Act, the European Data Protection Board, and International Organization for Standardization standards. Accidental disclosure can occur through misconfigured exports from services such as Salesforce, Workday, SAP, Oracle Corporation, and ServiceNow. CSV content can also be abused for injection attacks (for example, cells that a spreadsheet interprets as formulas) or for metadata leakage in systems such as Microsoft Excel, LibreOffice, Google Sheets, Tableau, and Power BI. Mitigations include sanitizing formulas, handling byte order marks, enforcing strict parsers, and applying access controls from AWS Identity and Access Management, Azure Active Directory, Google Cloud IAM, and Okta. Secure handling practices align with guidance from organizations and frameworks such as NIST, ENISA, OWASP, ISO/IEC JTC 1/SC 27, and ITIL.
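Formula sanitizing, one of the mitigations mentioned above, can be sketched as follows. The function names and the prefix list are assumptions based on commonly cited guidance for CSV injection, not a complete defense; note that prefixing also alters legitimate values such as negative numbers.

```python
import csv
import io

# Leading characters that spreadsheet applications may treat as the start
# of a formula (an assumed list; adjust it to the target application).
FORMULA_PREFIXES = ("=", "+", "-", "@", "\t", "\r")


def neutralize_cell(value):
    """Prefix risky cells with a single quote so they are shown as text."""
    if value.startswith(FORMULA_PREFIXES):
        return "'" + value
    return value


def write_safe_csv(rows):
    """Serialize rows to CSV text after neutralizing every cell."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, lineterminator="\r\n")
    for row in rows:
        writer.writerow([neutralize_cell(str(cell)) for cell in row])
    return buffer.getvalue()


safe_text = write_safe_csv([["name", "note"], ["x", "=1+1"]])
```

Applying the sanitizer at export time, just before the file leaves the system, keeps the stored data unchanged while protecting recipients who open the file in a spreadsheet.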
