CSV — LLMpedia

CSV
Name	Comma-separated values
Type	File format
Introduced	1970s
Extension	.csv
MIME type	text/csv

Contents

Overview
History
Format and Specifications
Use and Applications
Limitations and Ambiguities
Implementations and Library Support
Security and Privacy Considerations

CSV

CSV (comma-separated values) is a plain-text tabular data format for exchanging and storing structured records as delimited lines. It is supported across software ecosystems, from spreadsheet applications to command-line utilities and database import/export tools, which makes it a common interchange format among vendors, institutions, standards bodies, and research projects. Because implementations by major vendors and open-source communities developed independently, CSV is in practice a family of de facto variants rather than a single formal specification.

Overview

CSV encodes rows as lines and fields as delimited tokens. The delimiter is typically a comma, but semicolons, tabs, pipes, and other characters are also used to accommodate regional and application-specific constraints. Adoption spans desktop applications such as Microsoft Excel, database servers such as PostgreSQL, web applications served from stacks such as Apache HTTP Server, and scientific workflows in R, Python, and Julia. CSV is also used in data exchanges by the OECD, in Eurostat publications, and in proprietary data feeds from providers such as Bloomberg L.P. and Thomson Reuters.

History

Delimited plain-text tabular formats predate modern personal computing, with early use in mainframe and minicomputer operations at organizations such as IBM and Bell Labs. The comma-separated convention became prominent as microcomputers and spreadsheet programs such as VisiCalc and Lotus 1-2-3 popularized data import and export. Standardization came later and remains partial: the IETF published RFC 4180 in 2005 as an informational document that describes common practice and registers the text/csv media type, and a W3C working group subsequently published recommendations for describing tabular data on the web. Commercial vendors, including Google LLC and Apple Inc., have also shaped de facto behavior through products such as Google Sheets and Numbers.

Format and Specifications

The basic convention places one record per line, with fields separated by a delimiter and optional quoting for fields that contain delimiters, newlines, or quote characters. The most widely cited description is RFC 4180, which specifies comma-separated fields, records terminated by CRLF, an optional header line, fields optionally enclosed in double quotes, and embedded double quotes escaped by doubling them. Implementations such as LibreOffice and OpenOffice.org let users choose the delimiter, quoting, and encoding at import time, which reflects how much real files vary. Common points of divergence include newline sequences (CRLF on Microsoft Windows versus LF on Unix), character encoding (UTF-8 versus the legacy code pages used by Microsoft Windows locales), the escaping of embedded quotes, the presence and meaning of a header row, and the delimiter itself (for example, the semicolon in some European Union datasets).
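The quoting rules described above can be illustrated with a minimal sketch that uses the csv module from Python's standard library. The sample rows are invented for illustration, and the writer settings (minimal quoting, CRLF line endings) are chosen to resemble RFC 4180-style output rather than to represent any particular application's behavior.

```python
import csv
import io

# Invented sample data: one field has a comma and quotes, another a newline.
rows = [
    ["id", "comment"],
    ["1", 'She said "hi", then left'],
    ["2", "line one\nline two"],
]

# Write with minimal quoting and CRLF record terminators.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
writer.writerows(rows)
encoded = buffer.getvalue()

# Fields that need protection are wrapped in double quotes, and each
# embedded double quote is written as two double quotes.
print(encoded)

# Reading the text back recovers the original fields, including the
# embedded newline inside the quoted field.
decoded = list(csv.reader(io.StringIO(encoded, newline="")))
```

A reader that splits on line breaks before handling quotes would break the third record in two, which is why line-oriented tools often mishandle files of this kind.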

Use and Applications

Delimited text appears in scientific data exchanges among institutions such as NASA and CERN, in financial reporting by New York Stock Exchange participants, in census and statistical releases from agencies such as the United States Census Bureau and the Office for National Statistics (ONS), and in public datasets published by organizations such as the World Bank and UNESCO. CSV files are used to seed relational databases such as MySQL, to share contact lists exported from Salesforce, to exchange bibliographic records exported from Zotero, and to feed machine-learning pipelines built with TensorFlow and scikit-learn. Analysts commonly ingest and transform CSV data with tools such as pandas and dplyr.

Limitations and Ambiguities

Delimited formats are ambiguous about field types, character encoding, line termination, and delimiter choice, which causes interoperability problems between clients such as Microsoft Excel and server applications such as PostgreSQL. Header row semantics and schema discovery are under-specified, which complicates automated processing in systems built on Apache Hadoop or Apache Spark. Locale-specific conventions (for example, the decimal comma used in Germany versus the decimal point used in the United States) and the lack of standardized metadata impede reliable round-trip fidelity for datasets exchanged among organizations such as the OECD and Eurostat. Embedded newlines, binary data, and hierarchical data fit the format poorly: they depend on quoting or escaping conventions that line-oriented tools such as those in GNU Coreutils do not interpret, or on ad hoc encodings that each producer and consumer must agree on separately.
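The delimiter and locale ambiguities described above can be sketched as follows. The sample text imitates a semicolon-delimited export with decimal commas; the data and the helper parse_decimal_comma are hypothetical and written only for this illustration.

```python
import csv
import io

# Hypothetical export using ';' as the delimiter and ',' as the decimal mark.
german_export = "name;price\nWidget;1,50\nGadget;12,25\n"

def parse_decimal_comma(text: str) -> float:
    """Convert a number written with a decimal comma (and optional
    '.' thousands separators) into a float."""
    return float(text.replace(".", "").replace(",", "."))

# A reader left at its default comma delimiter splits on the decimal comma.
naive = list(csv.reader(io.StringIO(german_export)))
print(naive[1])  # ['Widget;1', '50']

# Declaring the delimiter explicitly recovers the intended columns, but the
# numeric interpretation still has to be supplied by the consumer.
explicit = list(csv.DictReader(io.StringIO(german_export), delimiter=";"))
prices = [parse_decimal_comma(row["price"]) for row in explicit]
print(prices)  # [1.5, 12.25]
```

Nothing in the file itself states which delimiter or decimal convention applies, so the consumer has to know or guess both.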

Implementations and Library Support

Support exists across languages and ecosystems: the csv module in Python's standard library and third-party packages such as pandas; the base utils package and the readr package in R; Java libraries such as Apache Commons CSV, along with flat-file support in the Spring ecosystem; and C#/.NET components used in Microsoft Visual Studio projects. Database engines such as SQLite and Microsoft SQL Server provide import/export utilities, while command-line tools such as awk, sed, and the utilities in GNU Coreutils enable ad hoc processing of simple files. Cloud platforms such as Amazon Web Services and Google Cloud Platform include connectors and dataflow components that handle delimited text as part of ETL pipelines.
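As one example of standard-library support, the following minimal sketch reads and writes header-based records with Python's csv.DictReader and csv.DictWriter. The Reading record type, the column names, and the sample values are hypothetical and chosen only for illustration.

```python
import csv
import io
from dataclasses import dataclass

@dataclass
class Reading:
    """Hypothetical record type used only for this example."""
    station: str
    temperature_c: float

FIELDNAMES = ["station", "temperature_c"]

def load_readings(handle):
    """Read rows keyed by the header line and convert the numeric column."""
    reader = csv.DictReader(handle)
    return [Reading(row["station"], float(row["temperature_c"])) for row in reader]

def dump_readings(readings, handle):
    """Write a header line followed by one record per reading."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    for reading in readings:
        writer.writerow(
            {"station": reading.station, "temperature_c": reading.temperature_c}
        )

source = io.StringIO("station,temperature_c\nOslo,-3.5\nCairo,24.0\n")
readings = load_readings(source)

out = io.StringIO(newline="")
dump_readings(readings, out)
round_trip = load_readings(io.StringIO(out.getvalue(), newline=""))
```

The type conversion is explicit because the csv module returns every field as a string; typing is left to the caller.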

Security and Privacy Considerations

CSV ingestion can introduce risks when an application evaluates cell content as formulas, macros, or other expressions. This problem, known as spreadsheet formula injection or CSV injection, has been reported in environments that use Microsoft Excel and in web-based viewers from vendors such as Google LLC. Malformed records can trigger parser vulnerabilities in libraries distributed through repositories such as PyPI and Maven Central, and very large files or malformed encodings can be used for denial-of-service attacks against services hosted on platforms such as AWS Lambda or Heroku. Privacy risks arise when sensitive fields from datasets maintained by institutions such as the U.S. Department of Health and Human Services (HHS) or the National Institutes of Health are exported without pseudonymization; secure handling practices mirror the controls required by European Commission data protection frameworks and by national data protection authorities.
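One commonly recommended mitigation for formula injection is to neutralize cells that begin with characters a spreadsheet may treat as the start of a formula before writing them out. The following is a minimal sketch of that idea, not a complete defense; the trigger list and the helper names neutralize and write_safe_rows are assumptions made for this illustration, and the prefixing also alters legitimate values such as negative numbers.

```python
import csv
import io

# Assumed set of leading characters that spreadsheets may interpret as
# the start of a formula or command.
FORMULA_TRIGGERS = ("=", "+", "-", "@", "\t", "\r")

def neutralize(value: str) -> str:
    """Prefix risky cells with a single quote so they are shown as text."""
    if value.startswith(FORMULA_TRIGGERS):
        return "'" + value
    return value

def write_safe_rows(rows, handle):
    """Write rows with every field quoted and every cell neutralized."""
    writer = csv.writer(handle, quoting=csv.QUOTE_ALL)
    for row in rows:
        writer.writerow([neutralize(str(cell)) for cell in row])

# Invented example: the second record carries a formula-like payload.
untrusted = [["name", "note"], ["Eve", "=1+1"]]
out = io.StringIO(newline="")
write_safe_rows(untrusted, out)
print(out.getvalue())
```

Because the transformation changes the stored value, it suits exports meant for viewing in a spreadsheet rather than files that other programs will parse as data.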

Category:File formats